Phishing Website Identification Based On Double Weight Random Forest
Authorized licensed use limited to: Welcome Shri Guru Gobind Singhji Inst of Eng & Tech Nanded. Downloaded on February 16,2024 at 11:57:22 UTC from IEEE Xplore. Restrictions apply.
III. DOUBLE WEIGHT RANDOM FOREST

A. Feature weight and selection

Each cluster produced by the clustering step contains multiple feature samples, from which a cluster center can be calculated. Since the feature samples in a cluster differ in value, the cluster center point best represents the entire cluster, and features close to the cluster center are likewise highly representative of it. Features at the edge of a cluster, in turn, are the ones best differentiated from other clusters. Both kinds of feature samples represent the cluster well and carry more information that is valuable for classification, so both should be given higher weights.

The website feature weights are generated from the distance to the cluster center. The clustering result contains M feature samples in total, forming C clusters; the i-th cluster contains M_i feature samples, and its cluster center is denoted C_i. The average distance from each feature sample in a cluster to the cluster center is calculated as in formula (1), where i = 1, 2, …, C, x denotes a sample point, and D(i) denotes the average within-cluster distance of cluster i:

    D(i) = (1/M_i) Σ_{x ∈ cluster i} ‖x − C_i‖    (1)

B. Weighted random forest design

Kuncheva [5] studied four combination methods in classification algorithms (majority voting, weighted majority voting, the recall combiner and naive Bayes) and tested the relationship between classifier weight and prediction accuracy for each. The results show that the Bayesian formula is best suited for handling imbalanced data in classification problems. The Bayesian formula is widely used in probabilistic forecasting; its application is characterized by combining prior probabilities with observed results. For a given training dataset, the goal is to estimate the posterior probabilities as accurately as possible from the conditional probabilities.

The Bayesian formula computes the posterior probability from the prior probability and the conditional probability, as shown in formula (4), where P(A) is the prior probability that event A occurs, P(B|A) is the conditional probability that event B occurs given that event A occurs, P(B) is the prior probability that event B occurs, and P(A|B) is the posterior probability:

    P(A|B) = P(B|A) P(A) / P(B)    (4)
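As a concrete check of formula (4), the short sketch below applies Bayes' rule in the phishing setting. The numbers (10% phishing prevalence, a 90% detection rate and a 5% false alarm rate for a single classifier) are illustrative assumptions, not values from the paper.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' rule, formula (4): P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Illustrative numbers: 10% of sites are phishing (prior P(A)); a single
# classifier flags 90% of phishing sites and 5% of legitimate ones.
p_phish = 0.10
p_flag_given_phish = 0.90
# P(B) by total probability: P(B|A)P(A) + P(B|not A)P(not A)
p_flag = p_flag_given_phish * p_phish + 0.05 * (1 - p_phish)
p_phish_given_flag = posterior(p_flag_given_phish, p_phish, p_flag)
```

Even with a 90% detection rate, the low prior keeps the posterior at roughly two thirds, which is why combining the prior with observed results matters when weighting classifiers on imbalanced data.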
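Returning to the feature weighting of Section III-A: the excerpt does not spell out the exact weighting function, so the sketch below is one plausible reading, in which a sample's weight grows with the deviation of its center distance from the cluster average D(i) of formula (1), so that both near-center and edge samples score high. The function name and the max-normalization are assumptions for illustration.

```python
import numpy as np

def cluster_distance_weights(X, labels, centers):
    """Weight each feature sample by how far its distance to its cluster
    center deviates from the cluster's average distance D(i), so that
    both near-center and edge samples receive higher weights."""
    weights = np.zeros(len(X))
    for i, center in enumerate(centers):
        idx = np.where(labels == i)[0]
        dist = np.linalg.norm(X[idx] - center, axis=1)  # distance to center C_i
        D_i = dist.mean()                               # average distance, formula (1)
        dev = np.abs(dist - D_i)                        # deviation from the average
        weights[idx] = dev / dev.max() if dev.max() > 0 else 1.0
    return weights
```

Samples whose distance exactly equals the cluster average get weight 0 under this scheme; in practice a small floor value would keep every feature sample in play.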
For a random forest of T decision trees h_1, …, h_T, the combined prediction is obtained by majority voting, as shown in formula (8):

    H(X) = arg max_{y ∈ {y_1, y_2, ⋯, y_k}} Σ_{t=1}^{T} I(h_t(X) = y)    (8)

Since a weighted decision tree is added, each tree must be multiplied by its corresponding weight value. Rewriting formula (8) accordingly, the prediction function of the output result is shown in formula (9), where w_t is the weight value of the t-th decision tree:

    H(X) = arg max_{y ∈ {y_1, y_2, ⋯, y_k}} Σ_{t=1}^{T} I(h_t(X) = y) · w_t    (9)

IV. EXPERIMENTAL TEST

To verify the superiority of the random forest algorithm with decision tree weight over the traditional random forest algorithm, the UCI [6] public data sets were used for testing, and the experiment compared the classification accuracy of the two algorithms on the different public data sets. Six public datasets were collected from the UCI Machine Learning Repository; they represent different classification problems in terms of the number of samples, the number of features, and the number of classes. The sample information of these public datasets is shown in TABLE I.

TABLE I. DATASET INFORMATION

    #  dataset name    samples  features  categories
    1  Breast            699        9         3
    2  Glass             214        7         9
    3  Sonar             208       60         2
    4  Heart-statlog     270       13         2
    5  Bupa              345        6         2
    6  Wpbc              198       32         2

In this experiment, 80% of the data in each data set was used to evaluate the classification accuracy and false positive rate of the algorithms on the different public data sets. The random forests were built with 150 and 300 decision trees, each configuration was repeated 15 times on the input data set, the accuracy and false positive rate of each run of the two algorithms were calculated, and the average of the runs was taken as the final result for that dataset. The average accuracy of the two random forest algorithms is shown in TABLE II.

TABLE II. CLASSIFICATION PERFORMANCE COMPARISON

    dataset name    trees  traditional RF  RF with decision tree weights
    1 Breast         150    0.92647253      0.94754117
                     300    0.91992749      0.93821487
    2 Glass          150    0.74273412      0.76965207
                     300    0.7463818       0.77234547
    3 Sonar          150    0.75274376      0.81021857
                     300    0.72532722      0.80491435
    4 Heart-statlog  150    0.73763923      0.75793213
                     300    0.73236864      0.75642639
    5 Bupa           150    0.76823943      0.78648245
                     300    0.78324274      0.80348285
    6 Wpbc           150    0.88345724      0.92872355
                     300    0.89982675      0.93024824

The above experimental data only verify the effect of the random forest with decision tree weight. To verify the actual effect of combining website feature weight with decision tree weight in the double weight random forest, a website feature sample set is needed for testing. Public phishing website datasets have few features, so a self-built data sample set is used: phishing website links obtained from the PhishTank website are used to generate a data set, and different numbers of website features are selected to form sample sets. The same feature sample sets are then tested with different random forest algorithms: the traditional random forest algorithm, the random forest algorithm with decision tree weight, and the double weight random forest algorithm, which are also compared with DRF (Dynamic Random Forest). For the four random forests, two groups of samples are used for testing, and each group contains the same number of normal websites and phishing websites. The average correct rate over the two groups of samples is shown in Figure 1.

Figure 1. Average accuracy of the four random forest algorithms for different numbers of website feature training samples.

It can be seen from Figure 1 that the accuracy on website feature training samples fluctuates slightly once there are more than 4000 samples, but it basically remains within a certain range; there is no room for further improvement by increasing the number of training samples. The double-weight random forest is clearly superior in accuracy to the other random forests. Taking 4000 website feature training samples as an example, the accuracy rate, misjudgment (false positive) rate and missed judgment rate of the algorithms are compared on the same test set, and the average of 10 tests is taken as the final result, as shown in TABLE III.

TABLE III. PERFORMANCE COMPARISON

                      RF      RFWDTW  DRF     DWRF
    Accuracy          87.85%  91.46%  92.71%  94.93%
    misjudgment        9.42%   7.85%   6.38%   4.72%
    missed judgment    2.73%   0.69%   0.91%   0.35%

Combining Figure 1 and TABLE III, it can be seen that the accuracy rate of the random forest with decision tree weight is 91.46%, and the average accuracy rate of the double-weight random
forest algorithm combined with website feature weight is about 94.93%. Compared with the traditional random forest, both algorithms greatly improve the accuracy of detecting phishing websites, and the double-weight random forest also brings a large performance improvement over dynamic random forest.

The double-weight random forest algorithm differs little from the random forest with decision tree weight in terms of false positive rate; the improvement in the correct rate comes mainly from reducing the missed detection rate. The comparative experiments demonstrate that the double-weight random forest algorithm, which combines website feature weight and decision tree weight, is improved in all respects, especially in reducing the missed detection rate, and that the random forest algorithm achieves relatively high accuracy for phishing website detection.

The evaluation standard [7] of a phishing website detection system is generally measured by three indicators: Accuracy, False Positive Rate (FPR) and False Negative Rate (FNR). They are defined as follows:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)    (10)

    FPR = FP / (FP + TN)    (11)

    FNR = FN / (FN + TP)    (12)

The test data of 5000 website samples are shown in TABLE IV. The average accuracy of system detection is 96.24%, the average false negative rate is 14.05%, and the average true negative rate is 85.95%.

TABLE IV. SAMPLE TEST RESULTS

    group    phishing  Accuracy  correct amount  FNR     TNR
    C1         255      0.9640        482        0.1667  0.8333
    C2         269      0.9620        481        0.2105  0.7895
    C3         301      0.9660        483        0.0588  0.9412
    C4         254      0.9700        485        0.2000  0.8000
    C5         274      0.9580        479        0.0476  0.9524
    C6         275      0.9640        482        0.2778  0.7222
    C7         263      0.9580        479        0.1429  0.8571
    C8         267      0.9620        481        0.1053  0.8947
    C9         266      0.9620        481        0.0526  0.9474
    C10        257      0.9580        479        0.1429  0.8571
    average    268.1    0.9624        481.2      0.1405  0.8595

The experimental results are shown in TABLE IV. The double weight random forest obtains high accuracy in the detection of phishing websites, and the accuracy for different categories of websites does not differ significantly, indicating that the overall effect is good.

V. CONCLUSIONS

Website detection based on blacklists or webpage characteristics cannot meet the timeliness requirements of batch phishing detection. To better cope with the real-time detection of massive numbers of phishing websites, a double-weight random forest algorithm for phishing website detection is designed and verified. The experimental results show that representative features can be screened out by the clustering algorithm; using these features to generate the decision trees improves the accuracy of the detection model and reduces the missed detection rate. The next step will be to optimize the complexity of the algorithm, improve its efficiency, and reduce the overall time consumed by detection.

REFERENCES

[1] Li, Y., Xiao, R., Feng, J., & Zhao, L. (2013). A semi-supervised learning approach for detection of phishing webpages. Optik, 124(23), 6027-6033.
[2] Sahu, K., & Shrivastava, S. K. (2015). Kernel K-means clustering for phishing website and malware categorization. International Journal of Computer Applications, 111(9).
[3] Qi, Y. (2012). Random forest for bioinformatics. In Ensemble Machine Learning (pp. 307-323). Springer, Boston, MA.
[4] Lee, T.-H., Ullah, A., & Wang, R. (2020). Bootstrap aggregating and random forest. In Macroeconomic Forecasting in the Era of Big Data (pp. 389-429). Springer, Cham.
[5] Kuncheva, L. I., & Rodríguez, J. J. (2014). A weighted voting framework for classifiers ensembles. Knowledge and Information Systems, 38(2), 259-275.
[6] Dua, D., & Graff, C. (2017). UCI Machine Learning Repository.
[7] Fressin, F., et al. (2013). The false positive rate of Kepler and the occurrence of planets. The Astrophysical Journal, 766(2), 81.