
2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA)

Phishing website identification based on double weight random forest

978-1-6654-5911-2/22/$31.00 ©2022 IEEE | DOI: 10.1109/CVIDLICCEA56201.2022.9824544

Zhixin Zhou1*
Fuling Big Data Application Development Center, Chongqing Normal University
Chongqing, China
[email protected]

Chenghaoyue Zhang2
Chongqing Fuling No. 16 Middle School
Chongqing, China
[email protected]

Abstract—Aiming at the problems of insufficient detection accuracy and a high misjudgment rate caused by large amounts of redundant data, a random forest algorithm combining feature weight selection with decision tree weighting is proposed to construct a phishing website detection model. The feature data are grouped into clusters by a clustering algorithm, and the features inside and at the edge of each cluster are selected to train the decision trees; a test data set is then input to calculate the weight of each decision tree, and an improved Bayesian formula determines each tree's weight. The result is a double-weight random forest algorithm that improves the accuracy of phishing website detection.

Keywords—Phishing Detection; random forest; feature selection

I. INTRODUCTION

A phishing website is a fake website disguised as a legitimate one: scammers inject malicious code through vulnerabilities in legitimate websites and steal users' bank card numbers, credit card numbers, account passwords, and other private information entered on the site. Fraudsters exploit users' curiosity and lack of caution by making the interface of a phishing website look very similar to that of the legitimate site. A user who does not look carefully while browsing cannot tell whether the site is genuine, and failing to recognize a phishing website is likely to cause direct losses.

At present, phishing website detection methods mainly include black-and-white-list filtering, URL address analysis, and identification based on extracted website features [1]. Among these, identification based on website features achieves higher accuracy, but its efficiency is low and extracting page features is complicated.

This paper proposes a dual-weight random forest algorithm that combines feature weights and decision tree weights for phishing website detection. In the feature-weight stage, the K-means clustering algorithm processes the features to obtain clusters, and different weights are assigned to features at different positions within each cluster. A linear scanning method then randomly selects features to form each decision tree, each tree is tested on labeled data, and the resulting accuracy determines the weight of each tree.

II. BASIC THEORY

A. K-means clustering algorithm

The k-means clustering algorithm is an iterative cluster analysis algorithm. The data are to be divided into K groups: K objects are randomly selected as the initial cluster centers, the distance between each object and each cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center together with the objects assigned to it represents a cluster. After each assignment pass, the center of each cluster is recalculated from the objects currently in it. This process repeats until a termination condition is met: no (or only a minimum number of) objects are reassigned to a different cluster, no (or only a minimum number of) cluster centers change, or the sum of squared errors reaches a local minimum [2].

B. Random Forest Algorithm

Random forest is an ensemble learning algorithm composed of multiple decision trees; each tree is assigned an independent feature subspace and allowed to grow freely. A simple majority vote then designates the category with the most votes as the final classification result [3].

Three types of decision trees are used in random forest algorithms: ID3, C4.5, and CART. They suit different feature types and apply different corrections for the overfitting problem. In this paper, phishing website detection is a binary classification problem, for which classifiers based on classification and regression trees are more suitable. CART builds unpruned decision trees, which increases the diversity of the tree models when combined with bootstrap aggregation (Bagging) and random feature selection [4]. Bootstrap resampling is used to draw training sets from the original sample set; multiple features are extracted from each training set to train a decision tree model, and these independent decision trees form the forest. When a new sample arrives, every decision tree classifies it, and the final classification result is decided by an absolute majority vote.
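The k-means procedure described above can be sketched as follows; this is a minimal NumPy illustration (not the paper's code), using "no object reassigned" as the termination condition:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means: random initial centers, assign/update loop,
    stopping when no object changes cluster."""
    rng = np.random.default_rng(seed)
    # randomly select K objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assign each object to its nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no object was reassigned: converged
        labels = new_labels
        # recompute each cluster center from the objects currently in it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

On well-separated data this converges in a few iterations; like any k-means run, it can still end in a local minimum of the sum of squared errors, which is why the termination conditions above are stated as "local" criteria.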

III. DOUBLE WEIGHT RANDOM FOREST

A. Feature weight and selection

Each cluster produced by the clustering step contains multiple feature samples, from which a cluster center can be calculated. The feature samples within a cluster differ in value, and the cluster center best represents the cluster as a whole; similarly, features close to the cluster center are more representative of the entire cluster, while features at the edge of a cluster distinguish it well from other clusters. These two kinds of feature samples best represent the cluster and carry the most valuable information for classification, so both should be given higher weights.

The website feature weights are generated from distances to the cluster centers. Suppose the clustering result contains $M$ feature samples in total, forming $C$ clusters; the $i$-th cluster contains $M_i$ feature samples, and its center is denoted $C_i$. The average distance from the feature samples of each cluster to the cluster center is given by formula (1), where $i = 1, 2, \ldots, C$, $x_{i,k}$ is a sample point, and $D(i)$ is the average within-cluster distance.

$$D(i) = \frac{1}{M_i}\sum_{k=1}^{M_i}\lVert x_{i,k}-C_i\rVert \qquad (1)$$

For each feature sample $x_{i,k}$, compute its distance to the center $C_i$ of its own cluster and subtract the average distance $D(i)$; the absolute value $D(x_{i,k})$ of this deviation from the mean is given by formula (2), where $i = 1, 2, \ldots, C$ and $k = 1, 2, \ldots, M_i$. The smaller $D(x_{i,k})$ is, the more the feature lies midway between the cluster center and the edge; the larger $D(x_{i,k})$ is, the closer the feature sample is to the cluster center or to the cluster boundary, meaning the sample is more valuable and more effective for classification.

$$D(x_{i,k}) = \bigl|\,\lVert x_{i,k}-C_i\rVert - D(i)\,\bigr| \qquad (2)$$

The weight of each feature sample is given by formula (3), where $W_{i,k}$ is the weight of the $k$-th feature sample in the $i$-th cluster.

$$W_{i,k} = \frac{D(x_{i,k})}{\sum_{k=1}^{M_i} D(x_{i,k})} \qquad (3)$$

Based on these weights, a weight-based random selection algorithm, the linear scan method, is used when selecting the features that generate a decision tree. The linear scan works as follows: first compute the sum $W$ of the weights of all feature samples; call a random function to obtain a random value in the interval $[0, W]$; then scan the feature samples from front to back, repeatedly subtracting each sample's weight from the random value; when the remaining value falls below the weight of a feature sample, that sample is selected.

B. Weighted random forest design

Kuncheva [5] studied four combination methods for classifier ensembles (majority voting, weighted majority voting, the recall combiner, and naive Bayes) and tested the relationship between classifier weight and prediction accuracy for each. The results show that the Bayesian formula is best suited to handling imbalanced data in classification problems. The Bayesian formula is widely used in probabilistic forecasting; its characteristic is the combination of prior probability with observed results, estimating posterior probabilities as accurately as possible from the conditional probabilities on a given training dataset.

The Bayesian formula calculates the posterior probability from the prior probability and the conditional probability, as shown in formula (4), where $P(A)$ is the prior probability that event $A$ occurs, $P(B \mid A)$ is the conditional probability that event $B$ occurs given that $A$ occurs, $P(B)$ is the prior probability that event $B$ occurs, and $P(A \mid B)$ is the posterior probability.

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (4)$$

Kuncheva derives the relationship between a classifier's weight and its prediction accuracy as shown in formula (5), where $p$ is the prediction accuracy of the classifier and $\omega$ is the weight of the classifier.

$$\omega \propto \log\frac{p}{1-p},\quad 0 < p < 1 \qquad (5)$$

This result is introduced into the random forest algorithm, whose base classifier is the decision tree; Bayesian theory is used to evaluate the performance of each individual tree in the forest. First, following the traditional random forest process, the weight-based random selection algorithm chooses feature samples to generate $N$ decision trees. A set of labeled test samples is then input and each decision tree makes its predictions; the average accuracy over the test set is taken as the tree's prediction accuracy, as shown in formula (6), where $S$ is the number of samples in the test set, $h_t(x_s)$ is tree $t$'s prediction for sample $x_s$ with label $y_s$, $I(\cdot)$ is the indicator function, and $acc_t$ is the average accuracy of decision tree $t$.

$$acc_t = \frac{1}{S}\sum_{s=1}^{S} I\bigl(h_t(x_s) = y_s\bigr) \qquad (6)$$

The weight $\omega_t$ of each decision tree in the random forest then follows from formula (5) as formula (7).

$$\omega_t = \ln\frac{acc_t}{1-acc_t} \qquad (7)$$

In a trained traditional random forest with input sample set $X$ and $C$ sample categories, the final prediction output $H(X)$ is given by formula (8), where $h_t(X)$ is the prediction result of the $t$-th decision tree, $I(\cdot)$ is the indicator function (equal to 1 when its argument is true and 0 otherwise), and $N$ is the number of decision trees.
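The feature weighting of formulas (1)-(3) and the linear scan selection can be sketched as follows. This is an illustrative NumPy implementation, not the paper's code; the function names are ours, the cluster labels and centers are assumed to come from a k-means run, and the zero-deviation fallback is an added safeguard:

```python
import numpy as np

def feature_weights(X, labels, centers):
    """Formulas (1)-(3): each sample's weight is its absolute deviation
    from the cluster's mean center distance, normalized per cluster."""
    d = np.linalg.norm(X - centers[labels], axis=1)  # distance to own center
    weights = np.empty(len(X))
    for i in np.unique(labels):
        idx = labels == i
        dev = np.abs(d[idx] - d[idx].mean())         # formula (2): |dist - D(i)|
        total = dev.sum()
        # formula (3); fall back to uniform weights if all deviations are zero
        weights[idx] = dev / total if total > 0 else 1.0 / idx.sum()
    return weights

def linear_scan_select(weights, rng):
    """Linear scan (roulette-wheel) selection: draw r in [0, W], scan front
    to back subtracting each weight until r drops below zero."""
    r = rng.uniform(0, weights.sum())
    for k, w in enumerate(weights):
        r -= w
        if r < 0:
            return k
    return len(weights) - 1  # guard against floating-point round-off
```

Samples with larger weights (those nearest the center or the edge of a cluster) occupy longer segments of the $[0, W]$ interval and are therefore selected more often, which is exactly the bias toward informative features the section describes.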

$$H(X) = \arg\max_{y\in\{1,2,\cdots,C\}} \sum_{t=1}^{N} I\bigl(h_t(X)=y\bigr) \qquad (8)$$

Since decision tree weights are added, each tree's vote must be multiplied by the corresponding weight value. Rewriting formula (8) accordingly gives the prediction function in formula (9), where $w_t$ is the weight value of the $t$-th decision tree.

$$H(X) = \arg\max_{y\in\{1,2,\cdots,C\}} \sum_{t=1}^{N} I\bigl(h_t(X)=y\bigr)\cdot w_t \qquad (9)$$

IV. EXPERIMENTAL TEST

To verify the superiority of the random forest algorithm with decision tree weights over the traditional random forest algorithm, public data sets from the UCI Machine Learning Repository [6] were used for testing, and the classification accuracy of the two algorithms was compared on each. The six public datasets represent different classification problems in terms of the number of samples, the number of features, and the number of classes; their sample information is shown in TABLE I.

TABLE I. DATASET INFORMATION

     dataset name    samples   features   categories
  1  Breast            699        9           3
  2  Glass             214        7           9
  3  Sonar             208       60           2
  4  Heart-statlog     270       13           2
  5  Bupa              345        6           2
  6  Wpbc              198       32           2

In this experiment, 80% of each data set was used as input, and the classification accuracy and false positive rate of the algorithms were evaluated on the different public data sets. The random forests were built with 150 and with 300 decision trees, each configuration was repeated 15 times, the accuracy and false positive rate of each run were calculated for both algorithms, and the average of the runs was taken as the final result for each dataset. The average accuracy of the two random forest algorithms is shown in TABLE II.

TABLE II. CLASSIFICATION PERFORMANCE COMPARISON

     dataset name    decision   traditional     RF with decision
                     trees      random forest   tree weights
  1  Breast           150       0.92647253      0.94754117
                      300       0.91992749      0.93821487
  2  Glass            150       0.74273412      0.76965207
                      300       0.74638180      0.77234547
  3  Sonar            150       0.75274376      0.81021857
                      300       0.72532722      0.80491435
  4  Heart-statlog    150       0.73763923      0.75793213
                      300       0.73236864      0.75642639
  5  Bupa             150       0.76823943      0.78648245
                      300       0.78324274      0.80348285
  6  Wpbc             150       0.88345724      0.92872355
                      300       0.89982675      0.93024824

The experiments above only verify the effect of the random forest with decision tree weights. To verify the actual effect of combining website feature weights with decision tree weights, a website feature sample set is needed. Because public phishing website datasets contain few features, a self-built sample set was used: phishing website links obtained from the Phishtank website were used to generate a data set, and different numbers of website features were selected to form sample sets. The same feature sample sets were tested with different random forest algorithms: the traditional random forest algorithm, the random forest with decision tree weights, and the double weight random forest algorithm, compared at the same time with DRF (Dynamic Random Forest). For these four random forests, two groups of samples were used for testing, each group containing the same number of normal websites and phishing websites. The average correct rate over the two groups of samples is shown in Figure 1.

Figure 1. Average correct rate of the two groups of website samples for each random forest algorithm.

It can be seen from Figure 1 that the accuracy fluctuates slightly once there are more than 4000 website feature training samples, but it basically remains within a certain range; increasing the number of training samples brings no further improvement. The double-weight random forest is clearly superior in accuracy to the other random forests. Taking 4000 website feature training samples as an example, the accuracy rate, misjudgment rate, and missed judgment rate of the algorithms were compared on the same test set, with the average of 10 tests taken as the final result, as shown in TABLE III.

TABLE III. PERFORMANCE COMPARISON

                     RF       RFWDTW   DRF      DWRF
  Accuracy           87.85%   91.46%   92.71%   94.93%
  misjudgment        9.42%    7.85%    6.38%    4.72%
  missed judgment    2.73%    0.69%    0.91%    0.35%
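The per-tree weight of formula (7) and the weighted vote of formula (9) can be sketched as follows; a minimal illustration assuming each tree's test-set accuracy has already been computed as in formula (6) (function names are ours, and the clipping is an added safeguard, not part of the paper):

```python
import numpy as np

def tree_weight(acc, eps=1e-6):
    """Formula (7): omega_t = ln(acc_t / (1 - acc_t)). Clipping keeps the
    log finite when a tree is perfectly right or wrong on the test set."""
    p = min(max(acc, eps), 1 - eps)
    return np.log(p / (1 - p))

def weighted_vote(tree_predictions, tree_weights, n_classes):
    """Formula (9): weighted vote for one sample; tree_predictions[t] is
    the class label predicted by tree t, tree_weights[t] its omega_t."""
    scores = np.zeros(n_classes)
    for pred, w in zip(tree_predictions, tree_weights):
        scores[pred] += w  # I(h_t(X) = y) * w_t
    return int(scores.argmax())
```

Note that formula (7) gives a tree with accuracy 0.5 a weight of zero and a tree with accuracy below 0.5 a negative weight, so uninformative or misleading trees are silenced or penalized rather than counted equally, which is the intended effect of the decision tree weighting.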

Combining Figure 1 and TABLE III, the accuracy rate of the random forest with decision tree weights is 91.46%, while the average accuracy rate of the double-weight random forest algorithm, which additionally uses website feature weights, is about 94.93%. Compared with the traditional random forest, both algorithms greatly improve the accuracy of phishing website detection, and the double-weight random forest also improves considerably on the dynamic random forest. The double-weight random forest differs little from the random forest with decision tree weights in false positive rate; its gain in correct rate comes mainly from reducing the missed detection rate. The comparative experiments demonstrate that the dual-weight random forest algorithm combining website feature weights and decision tree weights improves on all measures, especially the missed detection rate, and that the random forest approach achieves relatively high accuracy for phishing website detection.

The evaluation standard [7] for a phishing website detection system is generally measured by three indicators: Accuracy, False Positive Rate (FPR), and False Negative Rate (FNR). With TP, TN, FP, and FN denoting the counts of true positives, true negatives, false positives, and false negatives, they are defined as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (10)$$

$$FPR = \frac{FP}{FP + TN} \qquad (11)$$

$$FNR = \frac{FN}{FN + TP} \qquad (12)$$

The test results for 5000 website samples are shown in TABLE IV. The average accuracy of system detection is 96.24%, the average false negative rate is 14.05%, and the average true negative rate is 85.95%.

TABLE IV. SAMPLE TEST RESULTS

  group     phishing   Accuracy   correct amount   FNR      TNR
  C1        255        0.9640     482              0.1667   0.8333
  C2        269        0.9620     481              0.2105   0.7895
  C3        301        0.9660     483              0.0588   0.9412
  C4        254        0.9700     485              0.2000   0.8000
  C5        274        0.9580     479              0.0476   0.9524
  C6        275        0.9640     482              0.2778   0.7222
  C7        263        0.9580     479              0.1429   0.8571
  C8        267        0.9620     481              0.1053   0.8947
  C9        266        0.9620     481              0.0526   0.9474
  C10       257        0.9580     479              0.1429   0.8571
  average   268.1      0.9624     481.2            0.1405   0.8595

As the experimental results in TABLE IV show, the double weight random forest obtains high accuracy in the detection of phishing websites, and the accuracy does not differ significantly across website categories, indicating a good overall effect.

V. CONCLUSIONS

Website detection based on blacklists or webpage characteristics cannot meet the timeliness requirements of batch phishing detection. To better support real-time detection of massive numbers of phishing websites, a dual-weight random forest algorithm for phishing website detection was designed and verified. The experimental results show that representative features can be screened out by the clustering algorithm; using these features to generate the decision trees improves the accuracy of the detection model and reduces the missed detection rate. Future work will optimize the complexity of the algorithm, improve its efficiency, and reduce the overall detection time.

REFERENCES

[1] Li, Y., Xiao, R., Feng, J., & Zhao, L. (2013). A semi-supervised learning approach for detection of phishing webpages. Optik, 124(23), 6027-6033.
[2] Sahu, K., & Shrivastava, S. K. (2015). Kernel K-means clustering for phishing website and malware categorization. International Journal of Computer Applications, 111(9).
[3] Qi, Y. (2012). Random forest for bioinformatics. In Ensemble machine learning (pp. 307-323). Springer, Boston, MA.
[4] Lee, T.-H., Ullah, A., & Wang, R. (2020). Bootstrap aggregating and random forest. In Macroeconomic Forecasting in the Era of Big Data (pp. 389-429). Springer, Cham.
[5] Kuncheva, L. I., & Rodríguez, J. J. (2014). A weighted voting framework for classifiers ensembles. Knowledge and Information Systems, 38(2), 259-275.
[6] Dua, D., & Graff, C. (2017). UCI machine learning repository.
[7] Fressin, F., et al. (2013). The false positive rate of Kepler and the occurrence of planets. The Astrophysical Journal, 766(2), 81.

