Darknet Traffic Classification and Adversarial Attacks Using Machine Learning

Article history:
Received 12 June 2022
Revised 22 October 2022
Accepted 9 January 2023
Available online 14 January 2023

Keywords:
Darknet
Classification
Adversarial attacks
Convolutional neural network
Auxiliary-classifier generative adversarial network
Random forest

Abstract

The anonymous nature of darknets is commonly exploited for illegal activities. Previous research has employed machine learning and deep learning techniques to automate the detection of darknet traffic in an attempt to block these criminal activities. This research aims to improve darknet traffic detection by assessing a wide variety of machine learning and deep learning techniques for the classification of such traffic and for classification of the underlying application types. We find that a Random Forest model outperforms other state-of-the-art machine learning techniques used in prior work with the CIC-Darknet2020 dataset. To evaluate the robustness of our Random Forest classifier, we obfuscate select application type classes to simulate realistic adversarial attack scenarios. We demonstrate that our best-performing classifier can be degraded by such attacks, and we consider ways to effectively deal with such adversarial attacks.

© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
https://doi.org/10.1016/j.cose.2023.103098
2. Background
other deep-learning techniques: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). They addressed the issue of having an imbalanced dataset by performing the Synthetic Minority Oversampling Technique (SMOTE) on Tor, the minority traffic class. They used Principal Component Analysis (PCA), Decision Trees (DT), and Extreme Gradient Boosting (XGBoost) to extract 20 features before feeding the data into CNN-LSTM and CNN-GRU architectures. Their CNN layer was used to extract features from the input data, while LSTM and GRU did sequence prediction on these features. CNN-LSTM in combination with XGBoost as the feature selector produced the best F1-scores, achieving 96% classifying traffic type and 89% classifying application type.

The study (Iliadis and Kaifas, 2021) focused on just traffic type from the CIC-Darknet2020 dataset. They used k-Nearest Neighbors (k-NN), Multi-layer Perceptron (MLP), RF, DT, and Gradient-Boosting Decision Trees (GBDT) to do binary and multi-class classification. For binary classification, they grouped the data into two classes, namely, benign and darknet, similar to Lashkari et al. (2020). For the multi-class problem, they used the original four classes of traffic type (Tor, non-Tor, VPN, or non-VPN). They found that RF was the most effective classifier for traffic type, yielding F1-scores of 98.7% for binary classification and 98.61% for multi-class classification.

Using the same dataset, the authors of Demertzis et al. (2021) further broke down the application categories into 11 classes and used Weight Agnostic Neural Networks (WANN) to classify the data. Unlike regular ANNs, WANNs do not update neuron weights, but rather update their own network architecture piece-wise. WANNs rank different architectures by performance and complexity, forming new network layers from the highest ranked architecture. Their best WANN model achieved 92.68% accuracy on application layer classification.

The UNB-CIC Tor and non-Tor dataset, also known as ISCXTor2016 (Lashkari et al., 2017), was used by Sarkar et al. (2020) to classify Tor and non-Tor traffic using Deep Neural Networks (DNN). They built two models, DNN-A with 3 layers and DNN-B with 5 layers. DNN-A classified Tor from non-Tor samples with 98.81% accuracy, while DNN-B achieved 99.89% accuracy. For Tor samples, they built a 4-layer Deep Neural Network to classify eight application types. This model attained 95.6% accuracy.

In another study, Hu et al. (2020) generated their own dataset, capturing darknet traffic across eight application categories (browsing, chat, email, file transfer, P2P, audio, video, and VOIP) sourced from four different darknets (Tor, I2P, ZeroNet, and Freenet). They used a 3-layer hierarchical approach for classification. The first layer classified traffic as either darknet or normal. In the second layer, samples classified correctly as darknet were then classified by their darknet source. The third layer then classified application type for each of the darknet sources. The techniques Hu et al. (2020) used for classification include Logistic Regression (LR), RF, MLP, GBDT, Light Gradient Boosting (LightGB), XGBoost, LSTM, and DT. Their hierarchical method attained 99.42% accuracy in the first layer, 96.85% accuracy in the second layer, and 92.46% accuracy in the third layer.

Table 1 provides a summary of the prior work presented in this section. We note that the research in Iliadis and Kaifas (2021), Lashkari et al. (2020), and Sarwar et al. (2021) uses the same dataset that we consider in this paper.

3. Methodology

The primary goal of this research is to improve upon the state-of-the-art classification of darknet traffic by exploring the performance of Support Vector Machines (SVM), Random Forest (RF), Gradient-Boosting Decision Trees (GBDT), Extreme Gradient Boosting (XGBoost), k-Nearest Neighbors (k-NN), Multilayer Perceptron (MLP), Convolutional Neural Networks (CNN), and Auxiliary Classifier Generative Adversarial Networks (AC-GAN) as classifiers. We experiment with different levels of SMOTE during a preprocessing phase, oversampling the minority classes of the CIC-Darknet2020 dataset to assess the effects of data augmentation and class balance on classifier performance. We also consider using the AC-GAN generator for data augmentation, but we find that it is ineffective for this purpose. We experiment with representations of the darknet traffic features as 2-dimensional grayscale images for CNN and AC-GAN. Then we test the robustness of our best-performing classifier in obfuscation scenarios, which serve to simulate adversarial attacks, assuming both the perspectives of an attacker and a defender.

In our adversarial attacks, we apply statistical knowledge of the dataset to obfuscate specific data features, disguising one or more classes as others. We explore three scenarios whereby we obfuscate the training data, the validation data, or both. Obfuscating just the validation data simulates an attack scenario in which traffic data is disguised while our classifier is yet unaware of the attack, and thus we can only apply previously trained models without a chance to learn from the obfuscation. Obfuscating just the training data simulates a scenario in which an attacker has accessed our training data to poison it, such that we train our classifier with malformed assumptions or outright malicious supervision. A third scenario supposes we collect some of the obfuscated traffic data before training our classifier, and thus have a chance to update our classification models to detect obfuscated validation data.

3.1. Dataset

The CIC-Darknet2020 dataset (Lashkari et al., 2020) is an amalgamation of two public datasets from the University of New Brunswick. It combines the ISCXTor2016 and ISCXVPN2016 datasets, which capture real-time traffic using Wireshark and TCPdump (Gil et al., 2016; Lashkari et al., 2017). CICFlowMeter (Lashkari, 2018) is used to generate CIC-Darknet2020 dataset features from these traffic samples. Each CIC-Darknet2020 sample consists of traffic features extracted in this manner from raw traffic packet capture sessions. CIC-Darknet2020 consists of 158,659 hierarchically labeled samples. The top-level traffic category labels consist of Tor, non-Tor, VPN, and non-VPN. Within these top-level categories, samples are further categorized by the types of application used to generate the traffic. These type subcategories are audio-streaming, browsing, chat, email, file transfer, P2P, video-streaming, and VOIP. Table 2 details the applications that are used to generate each type of traffic at the application level.

3.2. Preprocessing

The CIC-Darknet2020 dataset has samples with missing data, more specifically, feature values of "NaN". We remove samples with these values in our data cleaning phase. As shown in Table 3, there are significantly fewer Tor samples compared to the other traffic categories. Prior work using this dataset eliminated the CICFlowMeter flow labels, namely, Flow ID, Timestamp, Source IP, and Destination IP. The Flow ID and Timestamp are also eliminated in our research. However, to obtain as much information as possible from the CIC-Darknet2020 dataset, we separate each octet of the source and destination IP addresses into their own feature columns. Preliminary tests run on the dataset with and without these IP octet features indicate an improvement in the performance of the classifiers when this IP information is retained. Thus our dataset contains 72 features in total after this preprocessing step.
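As a rough illustration of the cleaning and octet-splitting steps just described, the pandas sketch below shows one way they could be implemented. The column names (Flow ID, Timestamp, Src IP, Dst IP) are assumptions for illustration and may differ from the exact CIC-Darknet2020 CSV headers; IPv4 dotted-quad addresses are also assumed.

    import pandas as pd

    def preprocess(df: pd.DataFrame) -> pd.DataFrame:
        """Drop unusable rows/columns and expand IP addresses into octet features."""
        df = df.dropna().copy()                                  # remove samples with NaN feature values
        df = df.drop(columns=["Flow ID", "Timestamp"], errors="ignore")
        for col in ("Src IP", "Dst IP"):                         # hypothetical column names
            octets = df[col].str.split(".", expand=True).astype(int)
            octets.columns = [f"{col} Octet {i}" for i in range(1, 5)]
            df = pd.concat([df.drop(columns=[col]), octets], axis=1)
        return df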
Table 1
Summary of previous work.
Table 2
CIC-Darknet2020 application classes (Lashkari et al., 2020).

Table 3
Samples per traffic category.

Traffic Type    Samples
Non-Tor         93,357
Non-VPN         23,864
Tor              1,393
VPN             22,920

Table 4
Samples per application category.

The CIC-Darknet2020 dataset was scaled by min-max normalization, which applies the equation

    normalizedValue = (value − min) / (max − min)

to every value in each feature column. Note that this serves to scale the feature values between 0 and 1. We also apply min-max normalization to our IP octet feature columns.

3.2.1. Data balancing

The CIC-Darknet2020 dataset does not have balanced sample counts among traffic and application classes, as shown in Tables 3 and 4. To explore the effect of reducing this imbalance on the classification task, we oversample each minority class using SMOTE. SMOTE interpolates linearly between feature values to produce new samples (Bhagat and Patil, 2015). We experiment with the following levels of oversampling: 0% (no SMOTE), 20%, 40%, 60%, 80% (partial SMOTE), and 100% (full SMOTE). SMOTE is performed on all classes with less than the oversampling threshold as compared to the class with the largest sample count. Note that 100% SMOTE results in an equal number of samples for each class, while lower thresholds of SMOTE result in an equal number of samples among only those classes which are oversampled.
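The imblearn package is cited in the references; the sketch below is one way the "partial SMOTE" levels described above could be realized with imblearn's SMOTE, where the level sets each target count as a fraction of the largest class. The exact mapping of levels to oversampling targets is our interpretation, not the authors' code.

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    def partial_smote(X, y, level=1.0):
        """Oversample classes that fall below level * (majority-class count).

        level=1.0 corresponds to full SMOTE (all classes balanced); lower levels
        only raise the classes below the threshold, and level=0.0 leaves the
        data unchanged."""
        counts = Counter(y)
        target = int(level * max(counts.values()))
        strategy = {c: target for c, n in counts.items() if n < target}
        if not strategy:
            return X, y
        return SMOTE(sampling_strategy=strategy).fit_resample(X, y)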
3.2.2. Data representation

SVM and RF both use the dataset samples in their original format, which is a 1-dimensional array. However, we reshape each sample to be 2-dimensional for CNN and AC-GAN. Intuitively, the data is reshaped as 9 × 9 grayscale images, where each of our 72 features is represented as a single pixel, with the remaining pixels produced by zero padding. The pixels are ordered as their respective features appear in the CIC-Darknet2020 dataset, starting at the top left corner of the image, as shown in Fig. 3, where each row represents samples from an application class, color-coded for readability.

Both CNN and AC-GAN convolve local structures within the 2-D images, so adjacent pixels play an important role in classification. Therefore, we experiment with strategies to reorder the data to achieve better performance. We order the pixels by feature importance—as determined by our Random Forest classifier—starting at the top left corner of the image, and we also reorganize the pixels spiraling outward from the center of the image. This latter strategy tends to group pixels with larger values toward the center of each image, as shown in Fig. 4.
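A minimal numpy sketch of the zero-padded 9 × 9 grayscale representation described above follows; the optional reordering by Random Forest feature importance corresponds to the first of the two reordering strategies (the spiral layout is omitted).

    import numpy as np

    def to_grayscale_image(sample, size=9, importance=None):
        """Pad a 72-value feature vector to size*size pixels and reshape to 2-D.

        If an importance vector is supplied, features are first sorted by
        descending Random Forest feature importance."""
        sample = np.asarray(sample, dtype=np.float32)
        if importance is not None:
            sample = sample[np.argsort(importance)[::-1]]
        pixels = np.zeros(size * size, dtype=np.float32)
        pixels[: sample.size] = sample        # remaining pixels are zero padding
        return pixels.reshape(size, size, 1)  # channels-last input for CNN/AC-GAN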
3.2.3. Data augmentation experiment

We experimented with AC-GAN as an alternative to SMOTE, with the goal of generating realistic artificial samples that can be used to augment our dataset. Again, we use data augmentation to address the issue of class imbalance. However, we abandoned this approach, as we found that the fake images generated by AC-GAN are consistently detectable by a CNN model, with accuracy ranging from 99% to 100%. We believe that the failure of our AC-GAN to produce realistic fake images is due to the depth of the AC-GAN neural network architecture, which is constrained by the input image size. In any case, we were unsuccessful in our attempt to use AC-GAN to augment our data.

An example of four fake samples compared to real samples can be found in Fig. 5. The fake samples in this figure may appear to be useful but, again, a CNN can distinguish the fake from the real with essentially 100% accuracy. This clearly shows that, from a machine learning perspective, the fake samples are not sufficient for data augmentation.

3.3. Evaluation metrics

In our experiments, we use accuracy and F1-score to measure the performance of each classifier. Accuracy is computed as the total number of correct predictions over the number of samples tested. The F1-score is the harmonic mean of the precision and recall metrics, which is better suited to unbalanced datasets like CIC-Darknet2020. Similar to accuracy, F1-scores fall between 0 and 1, with 1 being the best possible. The F1-score is computed as

    F1 = 2 × (Precision × Recall) / (Precision + Recall)

Precision is the fraction of samples predicted as the positive class that are classified correctly, while recall is the fraction of positive samples that are classified correctly. Precision and recall are computed as

    Precision = TP / (TP + FP)    and    Recall = TP / (TP + FN)

where TP, FP, and FN denote true positive, false positive, and false negative counts, respectively.
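As a concrete illustration (not taken from the paper), these metrics can be computed with Scikit-learn; the weighted averaging shown here is an assumption about how per-class scores are aggregated on this multiclass problem.

    from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

    # Toy labels standing in for validation labels and classifier predictions
    y_true = [0, 0, 1, 1, 2, 2, 2, 3]
    y_pred = [0, 1, 1, 1, 2, 2, 3, 3]

    print("accuracy :", accuracy_score(y_true, y_pred))
    # average="weighted" weights each per-class score by that class's support,
    # which is appropriate for an imbalanced dataset such as CIC-Darknet2020
    print("precision:", precision_score(y_true, y_pred, average="weighted"))
    print("recall   :", recall_score(y_true, y_pred, average="weighted"))
    print("F1-score :", f1_score(y_true, y_pred, average="weighted"))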
classes (Stamp, 2022). For our research, we perform preliminary tests to determine the best kernel for our dataset, with the result being the Gaussian radial basis function (RBF).

4.1.5. Random forest

Random Forest (RF) is an ensemble method that generalizes Decision Trees (DT). While a DT is a simple and efficient classification algorithm, it is highly sensitive to variance in the training data and hence prone to overfitting. RF compensates for these deficiencies by generating many subsets of the dataset, randomly selecting features (with replacement), and training a DT for each subset. This process is called bootstrapping. To classify, RF takes the majority vote over all of the resulting DTs, in a process called aggregation. Together, bootstrapping and aggregation are referred to as bagging (Misra and Li, 2020; Stamp, 2022). RF also enables us to rank the importance of features, based on the mean entropy within the component DTs. Feature importance tells us how influential each feature is when classifying samples with the RF. Based on small-scale experiments, we found that the default hyperparameters in Scikit-learn yielded the best results; see (sklearn.ensemble.RandomForestClassifier) for the details.
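Given the reliance on default Scikit-learn hyperparameters and on feature importance, the sketch below shows the corresponding calls on synthetic stand-in data; it is illustrative only and not the authors' code.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the preprocessed 72-feature CIC-Darknet2020 data
    X, y = make_classification(n_samples=2000, n_features=72, n_informative=20,
                               n_classes=4, random_state=0)

    rf = RandomForestClassifier()        # default Scikit-learn hyperparameters
    rf.fit(X, y)

    # Rank features by importance, e.g., to reorder pixels in the 2-D images
    ranking = np.argsort(rf.feature_importances_)[::-1]
    print("top 10 features:", ranking[:10])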
4.1.6. Convolutional neural networks

Convolutional Neural Networks (CNN) are a unique type of neural network that focuses on local structure, making them ideal for image analysis. CNNs are composed of an image input layer, convolution and pooling layers, and a fully-connected output layer that produces a vector of class scores. Convolutional and pooling layers are the fundamental components of any CNN architecture. In convolutional layers, the output of the previous layer (or the raw image in the initial convolutional layer) is convolved with randomized filters to produce local structure maps that are joined to create the output of the layer. In the convolutional process, the filter windows slide across the input image, thus emphasizing local structure and providing a degree of translation invariance. The components of each filter are learned when training a CNN. Pooling layers decrease total training time by reducing the dimensionality of the resulting feature maps, concentrating effort on the most significant features (Convolutional Neural Networks for Visual Recognition; Lashkari et al., 2020). For this research, we use max pooling.

Our CNN architecture is based on that described in (Lashkari et al., 2020). We experiment with various hyperparameters, testing all combinations of the following in a grid search.

• Initial number of convolution filters (9, 32, 64, 81)
• Filter size (2 × 2, 3 × 3)
• Percentage dropout (0.2, 0.5)
• Number of nodes in the first dense layer (72, 256)

All of these architectures yield accuracies within the range of 86% to 88% when classifying application type. Therefore, we select the architecture that produces the highest accuracy. Our selected CNN architecture is illustrated in Fig. 6. Note that we use Adam for our optimizer and sparse categorical cross entropy for our loss function.
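The exact layer stack is given in Fig. 6, which is not reproduced here. The Keras sketch below is only one plausible instantiation of the ingredients named above (convolution, max pooling, dropout, a dense layer, Adam, sparse categorical cross entropy); the specific layer sizes and the use of TensorFlow/Keras are assumptions, not the authors' architecture.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_cnn(input_shape=(9, 9, 1), n_classes=8,
                  n_filters=64, kernel=3, dropout=0.2, dense_nodes=256):
        """A CNN sketch in the spirit of Section 4.1.6 (layer sizes assumed)."""
        model = keras.Sequential([
            keras.Input(shape=input_shape),
            layers.Conv2D(n_filters, kernel, padding="same", activation="relu"),
            layers.MaxPooling2D(pool_size=2),
            layers.Flatten(),
            layers.Dense(dense_nodes, activation="relu"),
            layers.Dropout(dropout),
            layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model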
Dropout is a common technique used to combat overfitting in neural networks with fully-connected layers. However, it is found to be not as effective with convolution layers. A better regularization technique for a CNN is to "cut out" sections of the input images. Such cutouts force the CNN to learn from the other parts of an image during training, which tends to activate filters that would otherwise atrophy. It is comparable in effect to dropout, except that it operates on the input stage rather than the intermediate layers (DeVries and Taylor; Li et al., 2021). We implement cutouts by creating feature masks of equivalent size to our input image. We experiment with different cutout sizes, including 2 × 2, 3 × 3, and 4 × 4, and randomize the position of the cutout within the mask. Refer to Fig. 7 for some examples of masks with 3 × 3 cutouts. Our cutout experiments are discussed in detail in Section 5.2, below.
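A minimal numpy sketch of such a cutout mask is shown below. It keeps the cutout entirely inside the 9 × 9 image; the paper only states that the cutout position is randomized within the mask, so this boundary handling is an assumption.

    import numpy as np

    def random_cutout_mask(img_size=9, cut=3, rng=None):
        """Binary mask that zeroes a randomly positioned cut x cut square."""
        if rng is None:
            rng = np.random.default_rng()
        mask = np.ones((img_size, img_size), dtype=np.float32)
        row = rng.integers(0, img_size - cut + 1)
        col = rng.integers(0, img_size - cut + 1)
        mask[row:row + cut, col:col + cut] = 0.0
        return mask

    # Applying a cutout to a 9 x 9 single-channel image during training
    image = np.random.rand(9, 9, 1).astype(np.float32)
    augmented = image * random_cutout_mask()[:, :, None]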
4.1.7. Auxiliary-classifier generative adversarial network

Generative Adversarial Networks (GAN) are comprised of two neural network architectures—a generator and a discriminator—that compete in a zero-sum game during training. The generator takes noise from a latent space as input and produces images that feed into the discriminator. The discriminator is given both real and generated images and is tasked to classify them as either real or fake. The discriminator error is then fed back into the generator to improve its image generation. AC-GAN is an extension of this base GAN architecture, taking a class label as additional input to the generator while predicting this label as part of the discriminator output. The objective of the AC-GAN generator is to minimize the ability of the discriminator to distinguish between real and fake images and also maximize the accuracy of the discriminator when predicting the class label (Mudavathu et al., 2018; Nagaraju and Stamp, 2021). Besides using the AC-GAN generator in data augmentation experiments, we also explore the secondary class prediction output of the discriminator as a classifier.

Our AC-GAN architecture is inspired by the ImageNet model described in (Odena et al., 2017). However, since that architecture was built for image sizes 32 × 32 or larger, we modify that architecture to accommodate our 9 × 9 image size by reducing the number of convolutional and transposed convolutional layers in the discriminator and generator, respectively.

We fine-tune our AC-GAN hyperparameters by experimenting with the following.

• Latent space size (81, 100)
• Initial number of convolution filters (15, 40, 64, 192, 202, 384, 500, 1500)
• Number of nodes in the first dense layer (31, 81, 128, 384, 405, 768, 1000, 3000)
• Filter size (3 × 3, 5 × 5)
• Stride size (2 × 2, 3 × 3)

We observe accuracies within the range of 70% to 73% when classifying application type with these hyperparameters. The best-performing architecture with the shortest runtime duration is used in this research; Tables 6 and 7 detail our generator and discriminator architecture, respectively.

We feed training data to our AC-GAN model in batches of 64 samples. Batch normalization (BatchNorm) layers are applied between convolutional layers to regularize the training gradient step size. BatchNorm is thought to smooth local optimization steps and stabilize training, thereby accelerating convergence of GAN models (Santurkar et al., 2018).
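The exact generator and discriminator architectures are given in Tables 6 and 7, which are not reproduced here. The sketch below only illustrates the two-headed discriminator idea—one real/fake output and one class-label output—in Keras; the layer sizes and the choice of TensorFlow/Keras are assumptions rather than the authors' implementation.

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_acgan_discriminator(img_shape=(9, 9, 1), n_classes=8):
        """Two-headed AC-GAN discriminator: real/fake score plus class prediction."""
        inputs = keras.Input(shape=img_shape)
        x = layers.Conv2D(64, 3, strides=2, padding="same")(inputs)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Conv2D(128, 3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Flatten()(x)
        real_fake = layers.Dense(1, activation="sigmoid", name="real_fake")(x)
        label = layers.Dense(n_classes, activation="softmax", name="label")(x)
        model = keras.Model(inputs, [real_fake, label])
        model.compile(optimizer=keras.optimizers.Adam(),
                      loss=["binary_crossentropy", "sparse_categorical_crossentropy"])
        return model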
4.2. Adversarial attacks

Our adversarial attacks rely on obfuscation, which serves to disguise application classes based on applied probability analysis. We select application classes to disguise as other classes based on minimum and maximum sum statistical distance between all class features, as specified in Algorithm 1.

We also select a third class transformation to perform based on maximal classifier confusion, whose sum statistical distance between class features is notably low, but not the minimum between classes. We ensure our class transformation can be decoded by encoding features with a deterministic algorithm, given here as Algorithm 2. We impose no additional restrictions on feature transformation.
Table 6
AC-GAN generator architecture.
Table 7
AC-GAN discriminator architecture.
We start by generating normalized histograms of feature values per class to assess the probability at which values occur within each class. To decide which classes to obfuscate, we examine the sums of the distances between feature probability distributions from each class to each other class. We use the cdist function of the scipy Python library to calculate the Euclidean distance between probability distributions. This provides an estimate of the overall difference between classes while considering all feature probability distributions. In the case of application type, this yields the 8 × 8 array in Table 8, where the class numbers correspond to those in Table 4, above.
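One way such a class-distance matrix could be computed is sketched below, assuming 100 bins over the min-max-normalized [0, 1] feature range (consistent with the 0.01-wide bins in the worked example later in this section). This is an interpretation of the description above, not the authors' Algorithm 1.

    import numpy as np
    from scipy.spatial.distance import cdist

    def class_distance_matrix(X, y, n_bins=100):
        """Sum of per-feature Euclidean distances between class-conditional
        feature-value histograms, for every pair of classes (X, y are numpy arrays)."""
        classes = np.unique(y)
        hists = []
        for c in classes:
            Xc = X[y == c]
            # one normalized histogram per feature, over the scaled [0, 1] range
            hists.append(np.stack([
                np.histogram(Xc[:, j], bins=n_bins, range=(0.0, 1.0))[0] / len(Xc)
                for j in range(X.shape[1])
            ]))
        dist = np.zeros((len(classes), len(classes)))
        for i in range(len(classes)):
            for j in range(len(classes)):
                # the diagonal of cdist pairs histogram k of one class with
                # histogram k of the other, i.e., matching features
                dist[i, j] = np.diag(cdist(hists[i], hists[j])).sum()
        return dist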
From Table 8, we observe that class 0 is most different from class 5 and class 3 is most similar to class 7. We pick the classes with the minimum and maximum sum of statistical distances between features, changing class 0 (audio-streaming) to class 5 (P2P) and class 3 (email) to class 7 (VOIP). We also examine the confusion matrix for our best-performing classifier, RF, which is shown in Fig. 8. RF is observed to be most confused between class 2 (chat) and class 3 (email), so we decide to additionally obfuscate class 2 with class 3. We arbitrarily choose to transform lower numbered classes to higher numbered classes, e.g., disguising class 2 as class 3 instead of class 3 as class 2.
Table 8
Statistical distances between pairs of application classes.
Scenario    What is obfuscated?               Scenario description
1           validation data                   Simulates a novel attack where we apply an outdated model for classification
2           training data                     Simulates an attack on our training data, poisoning the classifier
3           training and validation data      Simulates a novel defense where we train our model on some obfuscated data

bin b0 corresponds to values 0.00 to 0.01, and so on. Given the value of v, we find the bin that v falls into. The value v = 0.178142 is in bin b17, which contains values between 0.17 and 0.18. Bin b17 is indicated by the red arrows in Fig. 9. We then flip the sorted DCPD index at b17 to locate our target bin, indicated by the black arrows in Fig. 9. This target bin is b58, which contains values between 0.58 and 0.59. To obfuscate, we subtract the difference between b17 and b58 from v. In this example, our new transformed value is

    v = 0.178142 − (0.17 − 0.58) = 0.588142

which falls into the target bin b58. We repeat this for all of the features to transform the sample from class 2 to class 3.

Note that this obfuscation technique is designed to maximize the effectiveness of a simulated adversarial attack. Our approach ignores practical limitations on the ability of attackers to modify the statistics of the data. Hence these simulated attacks can be considered worst-case scenarios, from the perspective of detecting darknet traffic under adversarial attack.
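The arithmetic in the worked example above amounts to shifting a value from its source bin to a target bin while preserving its offset within the bin. A small sketch of just that step follows; how the target bin is chosen (flipping the sorted DCPD index) is specified by Algorithm 2, which is not reproduced here, so the target bin is simply taken as an input.

    def bin_index(v, bin_width=0.01):
        """Index of the histogram bin containing v (bins of width 0.01 over [0, 1))."""
        return int(v // bin_width)

    def shift_to_bin(v, target_bin, bin_width=0.01):
        """Shift v from its own bin to target_bin, preserving its offset in the bin.
        The transformation is deterministic and therefore reversible."""
        return v - (bin_index(v) * bin_width - target_bin * bin_width)

    # Reproduces the worked example: 0.178142 (bin 17) -> 0.588142 (bin 58)
    print(shift_to_bin(0.178142, target_bin=58))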
5. Results and discussion

In this section, we consider a wide range of experiments. First, we determine which of the three 2-D image representation techniques discussed in Section 5.1 is most effective. Then we consider the use of cutouts, which can serve to reduce overfitting and improve accuracy in CNNs. We then turn our attention to the imbalance problem, with a series of SMOTE experiments. We conclude this section with an extensive set of experiments involving various adversarial attack scenarios.

5.1. 2-D data representations

We evaluate CNN and the AC-GAN discriminator given different 2-D pixel representations of the data features. All of our 2-D representations of the data are of size 9 × 9, where each pixel is a feature. The pixels in the original representation follow the order that the features appear in the CIC-Darknet2020 dataset. We hypothesize that grouping the pixels together would have a positive effect on the performance of our classifiers, since convolutions operate on local structures. Our results show that CNN performs best when the pixels are sorted by RF feature importance and then grouped together at the center of the image. However, this is not true for the AC-GAN discriminator. AC-GAN does better using the original data representation, contrary to our hypothesis. Table 10 shows the results for these experiments.

5.2. Cutout experiments

Initially, our CNN model is able to achieve 88% accuracy classifying application type within 15 epochs. However, we notice that overfitting starts to occur the longer we run our model. To reduce overfitting, we apply cutouts to the training data. We experiment with different cutout sizes: 2 × 2, 3 × 3, and 4 × 4. We observe that cutouts allow our CNN to train for a longer period of time without overfitting. The loss graphs in Fig. 10 show how the CNN model overfits after 20 epochs in the original execution, but does not overfit with cutouts. There is little difference in the effects of applying 2 × 2 compared to 3 × 3 cutouts. Both delay overfitting at the same rate, and the accuracies for both linger at 88%. Notably, we witness a 1% decrease in accuracy with 4 × 4 cutouts. As our images are only 9 × 9 pixels, a 4 × 4 cutout likely deletes too much information from the image, negatively affecting the accuracy. While cutouts address the issue of overfitting, we find that more training does not significantly improve the performance of CNN on the dataset under consideration. Thus, we do not employ cutouts in the CNN results reported below.

5.3. SMOTE experiments

We compare the performance of our classifiers with various levels of SMOTE, performing SMOTE to oversample the training data before training each classifier for both cases, that is, traffic type and application type. The results from these experiments appear in Tables 11 and 12, respectively, where the best result for each SMOTE level is boxed.
Table 10
2-D data representation results.
Table 11
Traffic classification F1-scores at various SMOTE levels.

Learning     SMOTE percentage
technique    0%      20%     40%     60%     80%     100%
GBDT         0.961   0.961   0.960   0.960   0.958   0.958
XGBoost      0.983   0.983   0.982   0.980   0.977   0.975
k-NN         0.884   0.884   0.881   0.875   0.871   0.868
MLP          0.821   0.821   0.850   0.788   0.676   0.744
SVM          0.986   0.993   0.993   0.993   0.993   0.993
RF           0.998   0.998   0.998   0.998   0.998   0.998
CNN          0.998   0.995   0.995   0.995   0.996   0.995
AC-GAN       0.974   0.980   0.984   0.986   0.987   0.987

We observe that reducing class imbalance using SMOTE does not have a large effect on the performance of most of the classifiers. With the exception of the MLP traffic classification experiments, SMOTE only affects the F1-score by about 1% to 2% in each case. Note also that the MLP results are the poorest in every case. We conclude that for the problem under consideration, SMOTE is of some value for fine-tuning models.

Our RF model without SMOTE outperforms the state-of-the-art F1-scores for both traffic and application classification tasks. We observe a 1.1% improvement for traffic classification as compared to Iliadis and Kaifas (2021), where they also found RF to be their best classifier. The study (Iliadis and Kaifas, 2021) only classified traffic type, thus no application type performance is available for comparison. For application classification, our RF model achieved a 3.2% increase over (Sarwar et al., 2021). In addition, our CNN model outperformed the CNN results in Lashkari et al. (2020) by 2.8% and is within 0.2% of the more complex and costly CNN-LSTM results in Sarwar et al. (2021). We are only able to compare classification results for application type with (Lashkari et al., 2020) because they approach traffic type classification as a binary problem while we address it as a multiclass problem. Table 13 summarizes the best performance of our classifiers in comparison to relevant prior work, where the best results in the Traffic and Application columns are boxed. Overall, RF is our best-performing classifier, and MLP and k-NN perform the worst.
Table 12
Application classification F1-scores at various SMOTE levels.
Also of note is the fact that the AC-GAN classifier is one of the best-performing models in the traffic classification problem, but it performs relatively poorly in the application classification task.

5.4. Adversarial attack experiments

With improvement in the accuracy of darknet traffic detection by machine learning and deep learning techniques, it is realistic to anticipate that attackers will attempt to find ways to circumvent
Table 14
Class accuracies for attack scenarios.
Our research was limited by the availability of darknet traffic datasets. We selected the CIC-Darknet2020 dataset because it is frequently cited and publicly accessible; however, the dataset suffers from a substantial imbalance. We attempted to compensate for this class imbalance by generating artificial samples with AC-GAN and SMOTE. The artificial SMOTE samples marginally improved our classification results. Seeking to improve the quality of artificial samples, we assessed AC-GAN as a sample generator. However, our AC-GAN-generated samples were not useful for data augmentation purposes. An approach that future researchers might consider is to use clustering to group samples within a class, then train one GAN per cluster to generate samples. Other variations of GAN might also be better suited for multiclass sample generation and could conceivably generate more realistic samples.

We kept our obfuscations fairly basic, with the goal being to demonstrate that we could confuse our best classifier, with few restrictions imposed on the hypothetical attacker. Under more realistic attack scenarios, it may not be possible to so easily modify features which define darknets such as Tor and VPN, but it would be possible to obfuscate traffic features at the application layer, such as those produced by CICFlowMeter analysis. We introduced a loose correlation to one statistical metric, an independent sum of distances between DCPD across all sample features. We noted that 2 out of the 3 classes we chose to obfuscate were misclassified not as the intended classes, but with a majority of predictions distributed among other classes. This results from the fact that our obfuscation metric does not account for the statistical relationship between more than two classes, nor does it account for any dependency between the CIC-Darknet2020 feature values.

There is much more remaining work that could be done to extend the adversarial obfuscation analysis presented in this paper. Real traffic features could be modified on live network traffic (e.g., changing IP addresses, ports, packet lengths or intervals), or select features could be prohibited from modification during obfuscation, which is likely to be a realistic constraint. An even larger task is to explore the dependency between features in order to anticipate counterattacks. One possible avenue that future research could take with respect to the CIC-Darknet2020 dataset is to develop an obfuscation method to exploit Random Forest feature importance, or the weights of a linear SVM. This might better correlate the relationship between classifier response and dataset statistics. We only tested our obfuscation method using our best-performing classifier. It would also be interesting to explore how other classifiers respond to similar obfuscation techniques, so as to determine which classifiers are most robust to such attacks.

Author contribution

Mark Stamp proposed and guided the research, and edited the paper.

Nhien Rust-Nguyen performed the majority of the experiments, developed some of the key ideas used in this research, and wrote the first draft of the paper.

Shruti Sharma completed several of the experiments included in the paper.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Mark Stamp reports financial support was provided by San Jose State University. Nhien Rust-Nguyen reports financial support was provided by San Jose State University.

Data availability

Data will be made available on request.

References

Bhagat, R.C., Patil, S.S., 2015. Enhanced SMOTE algorithm for classification of imbalanced big-data using random forest. In: 2015 IEEE International Advance Computing Conference, pp. 403–408.
Branwen, G., Christin, N., Décary-Hétu, D., Andersen, R.M., StExo, Presidente, E., Anonymous, Lau, D., Sohhlz, Kratunov, D., Cakic, V., Buskirk, V., Whom, McKenna, M., Goode, S., 2015. Dark net market archives, 2011–2015. https://www.gwern.net/DNM-archives.
Convolutional Neural Networks for Visual Recognition, 2022. Convolutional neural networks for visual recognition. https://cs231n.github.io/convolutional-networks.
Demertzis, K., Tsiknas, K., Takezis, D., Skianis, C., Iliadis, L., 2021. Darknet traffic big-data analysis and network management for real-time automating of the malicious intent detection process by a weight agnostic neural networks framework. https://arxiv.org/abs/2102.08411.
DeVries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. https://arxiv.org/abs/1708.04552.
Dingledine, R., Mathewson, N., Syverson, P., 2004. Tor: the second-generation onion router. In: 13th USENIX Security Symposium (USENIX Security 04). https://www.usenix.org/conference/13th-usenix-security-symposium/tor-second-generation-onion-router.
Gil, G.D., Lashkari, A.H., Mamun, M., Ghorbani, A.A., 2016. Characterization of encrypted and VPN traffic using time-related features. In: 2nd International Conference on Information Systems Security and Privacy, pp. 407–414.
Hu, Y., Zou, F., Li, L., Yi, P., 2020. Traffic classification of user behaviors in Tor, I2P, ZeroNet, Freenet. In: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications, pp. 418–424.
Iliadis, L.A., Kaifas, T., 2021. Darknet traffic classification using machine learning techniques. In: 2021 10th International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1–4.
imblearn, 2022. imblearn 0.0. https://pypi.org/project/imblearn/.
Lashkari, A.H., 2018. CICFlowMeter-V4.0 (formerly known as ISCXFlowMeter) is a network traffic bi-flow generator and analyser for anomaly detection. https://github.com/ISCX/CICFlowMeter.
Lashkari, A.H., Draper-Gil, G., Mamun, M.S.I., Ghorbani, A.A., 2017. Characterization of Tor traffic using time based features. In: 3rd International Conference on Information System Security and Privacy, pp. 253–262.
Lashkari, A.H., Kaur, G., Rahali, A., 2020. DIDarknet: a contemporary approach to detect and characterize the darknet traffic using deep image learning. In: Proceedings of 10th International Conference on Communication and Network Security, pp. 1–13.
Li, J., Chang, H.-C., Stamp, M., 2021. Free-text keystroke dynamics for user authentication. https://arxiv.org/abs/2107.07009.
Misra, S., Li, H., 2020. Noninvasive fracture characterization based on the classification of sonic wave travel times. In: Misra, S., Li, H., He, J. (Eds.), Machine Learning for Subsurface Characterization. Elsevier, pp. 243–287.
Mudavathu, K.D.B., Rao, M.V.P.C.S., Ramana, K.V., 2018. Auxiliary conditional generative adversarial networks for image data set augmentation. In: 2018 3rd International Conference on Inventive Computation Technologies, pp. 263–269.
Nagaraju, R., Stamp, M., 2021. Auxiliary-classifier GAN for malware analysis.
Odena, A., Olah, C., Shlens, J., 2017. Conditional image synthesis with auxiliary classifier GANs. In: Proceedings of the 34th International Conference on Machine Learning, ICML, Vol. 70, pp. 2642–2651.
Santurkar, S., Tsipras, D., Ilyas, A., Madry, A., 2018. How does batch normalization help optimization? In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2488–2498.
Sarkar, D., Vinod, P., Yerima, S.Y., 2020. Detection of Tor traffic using deep learning. In: Proceedings of IEEE/ACS 17th International Conference on Computer Systems and Applications, pp. 1–8.
Sarwar, M.B., Hanif, M.K., Talib, R., Younas, M., Sarwar, M.U., 2021. DarkDetect: darknet traffic detection and categorization using modified convolution-long short-term memory. IEEE Access 9, 113705–113713.
Scikit-learn: Machine Learning in Python, 2022. https://scikit-learn.org/stable/index.html.
sklearn.ensemble.RandomForestClassifier, 2022. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
Stamp, M., 2022. Introduction to Machine Learning with Applications in Information Security, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL.
Synced, 2017. Tree boosting with XGBoost — why does XGBoost win "every" machine learning competition? https://syncedreview.com/2017/10/22/tree-boosting-with-xgboost-why-does-xgboost-win-every-machine-learning-competition/.
Tor Project History, 2006. Tor project history. https://www.torproject.org/about/history/.
Venkateswaran, R., 2001. Virtual private networks. IEEE Potentials 20 (1), 11–15.
Nhien Rust-Nguyen received her master's in computer science in May 2022. Her research interests are in applications of machine learning and deep learning.

Shruti Sharma will receive her master's in data science in December 2022. Her research interests are in applications of machine learning and deep learning.

Mark Stamp is a professor of computer science at San Jose State University. His primary research focus is on problems at the interface between information security and machine learning. He has published more than 150 research articles and textbooks in information security (Information Security: Principles and Practice, 3rd edition, Wiley, September 2021) and machine learning (Introduction to Machine Learning with Applications in Information Security, 2nd edition, Chapman and Hall/CRC, May 2022).