
Computers & Security 127 (2023) 103098

Contents lists available at ScienceDirect

Computers & Security


journal homepage: www.elsevier.com/locate/cose

Darknet traffic classification and adversarial attacks using machine learning
Nhien Rust-Nguyen, Shruti Sharma, Mark Stamp∗
Department of Computer Science, San Jose State University, United States
∗ Corresponding author. E-mail address: [email protected] (M. Stamp).

Article history: Received 12 June 2022; Revised 22 October 2022; Accepted 9 January 2023; Available online 14 January 2023

Keywords: Darknet; Classification; Adversarial attacks; Convolutional neural network; Auxiliary-Classifier generative adversarial network; Random forest

Abstract

The anonymous nature of darknets is commonly exploited for illegal activities. Previous research has employed machine learning and deep learning techniques to automate the detection of darknet traffic in an attempt to block these criminal activities. This research aims to improve darknet traffic detection by assessing a wide variety of machine learning and deep learning techniques for the classification of such traffic and for classification of the underlying application types. We find that a Random Forest model outperforms other state-of-the-art machine learning techniques used in prior work with the CIC-Darknet2020 dataset. To evaluate the robustness of our Random Forest classifier, we obfuscate select application type classes to simulate realistic adversarial attack scenarios. We demonstrate that our best-performing classifier can be degraded by such attacks, and we consider ways to effectively deal with such adversarial attacks.

© 2023 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
https://doi.org/10.1016/j.cose.2023.103098

1. Introduction

Most of us are familiar with the Internet and the World Wide Web (WWW, or web). We regularly access both using web browsers or other networked applications to share information publicly, guided by search engine indexing of the Domain Name System (DNS) over globally bridged Internet Protocol (IP) networks. This publicly accessible and indexed address space is known as the surface web or clearnet. In contrast, the WWW address space which is not indexed by search engines but still publicly accessible is known as the deep web. Private networks within the deep web or networks comprised of unallocated address space are known as darknets and collectively termed the dark web. Fig. 1 illustrates the relationship between these layers of the Internet.

Fig. 1. Layers of the Internet (Demertzis et al., 2021).

The dark web is reached by an overlay network requiring special software, user authorization, or non-standard communication protocols (Demertzis et al., 2021). Many darknets afford users anonymity during communication and thus facilitate criminal activities, including hacking, media piracy, terrorism, human trafficking, and child pornography (Branwen et al., 2015; Sarwar et al., 2021). Researchers have illuminated darknet traffic with machine learning and deep learning techniques, to better identify and inhibit these criminal activities. The research presented in this paper strives to contribute by promoting accurate classification of traffic features from the well-studied CIC-Darknet2020 (Lashkari et al., 2020) dataset, which is a collection of traffic features from two darknets, namely, The Onion Router (Tor) and a Virtual Private Network (VPN). This dataset also includes corresponding traffic generated over clearnet sessions using the same applications.

We consider a wide variety of classic machine learning techniques, as well as modern neural networking architectures. We also represent traffic features as grayscale images and apply image-based deep learning architectures as classifiers, namely, Convolutional Neural Networks (CNN) and Auxiliary-Classifier Generative Adversarial Networks (AC-GAN). To assess the issue of extreme class imbalance within the CIC-Darknet2020 dataset, we explore data augmentation—specifically, we consider both the generative network of our AC-GAN model, as well as the Synthetic Minority Oversampling Technique (SMOTE). Our results show that Random Forest is the most effective among the models tested, both for classifying traffic type and for classifying the underlying application types. We also find SMOTE is beneficial for fine tuning our models, the best of which is a Random Forest (RF).

Having established baseline classification performance, we consider the robustness of our RF classifier in some detail by approaching the problem of darknet traffic detection adversarially. From the perspective of an attacker, we obfuscate the application classes in an attempt to evade detection. As a proof-of-concept, we apply an encoding scheme to transform class features using probability analysis of the CIC-Darknet2020 dataset. We strongly correlate the resulting RF confusion with our obfuscation technique for three attack scenarios, assuming few limitations for traffic modification. We then assess the strength of our obfuscation technique with one defense scenario, by which we demonstrate that we can restore the performance of the RF classifier despite duress. We find that sufficient statistical knowledge of network traffic features can empower either the classification or obfuscation tasks.

A high-level overview of our experiments is provided in Fig. 2. After some limited initial data cleaning, for each experiment, we partition the dataset under consideration into training and validation sets. In the base case, a specific machine learning model is trained, based on the training set, with the validation set used to compute accuracy and F1-score statistics. As mentioned above, we consider data augmentation using SMOTE. We also conduct experiments using the generator module of AC-GAN to produce synthetic data, which can be viewed as another form of data augmentation.

Fig. 2. Overview of experiments.

The three adversarial attack scenarios mentioned above assume that the attacker can manipulate the training data, the validation data, or both. In Section 4.2, we discuss these attack scenarios in detail, and explain why they are realistic threats.

The remainder of this paper is structured as follows. Section 2 gives a brief background discussion of Tor and VPN, and considers related work on darknet traffic detection. Section 3 describes the dataset used in our experiments and outlines our experimental methodology. Section 4 provides background knowledge on the machine learning techniques used in our experiments and gives implementation details. Section 5 discusses the results of our experiments. Lastly, Section 6 summarizes our research and considers possible directions for future work.

2. Background

In this section, we first discuss the two broad categories of data in our dataset, namely, Tor and VPN traffic. Then we discuss the most relevant examples of related work.

2.1. The onion router

Initially, The Onion Router (Tor) was a project started by the United States Navy to secure government communication. Since 2006, Tor has become a nonprofit with thousands of servers (called relays or relay nodes) run by volunteers across the world (Tor Project History). Tor clients anonymize their TCP application IP addresses and session keys, sending encrypted application traffic through a network of relays (Sarkar et al., 2020). An example client application is the Tor Browser, which allows users to browse the web anonymously.
Tor generally selects a relay path of three or more nodes and
encrypts the data once for each node using temporary symmetric
keys. The encrypted data hops from relay to relay, where each relay
node only knows about the previous node and the next node along
the path. This design makes it difficult to trace the original identity
of Tor clients. Each relay removes a layer of encryption, so that by
the last relay, the original data is forwarded to the intended des-
tination as plaintext. Tor then deletes the temporary session keys
used for encryption at each node, so that any subsequently com-
promised nodes cannot decrypt old traffic (Dingledine et al., 2004).

2.2. Virtual private networks

Virtual Private Networks (VPN) are used to ensure communication privacy for individuals or enterprises, and can serve to separate private address spaces from the public Internet. VPN software disguises client IP addresses by tunneling encrypted communications through a trusted server, which acts as a gateway or proxy to route client traffic to the broader network space. Client data is anonymized behind VPN server credentials before being forwarded to an intended destination, which may be either public or private. Any response traffic is sent back through the VPN server over the encrypted connection for the client to decrypt, ensuring anonymity between the client and recipient. Third parties, such as Internet Service Providers (ISP), will only see the VPN server as the destination of client communications. There are many forms of VPN. Some operate at the network layer, others reside at the transport or application layer (Venkateswaran, 2001).

2.3. Related work

Several researchers have considered the problem of detecting darknet traffic. However, there are limited public darknet datasets available. The CIC-Darknet2020 dataset used in the experiments reported in this paper was generated by Lashkari et al. (2020). This dataset was also used in prior research, including (Demertzis et al., 2021; Iliadis and Kaifas 2021; Sarwar et al. 2021), and it has become a well-known darknet traffic dataset due to its accessibility. In their research, Lashkari et al. (2020) grouped Tor and VPN together as darknet traffic, while non-Tor and non-VPN were grouped as benign traffic (clearnet). They created 8 × 8 grayscale images from 61 select features and used Convolutional Neural Networks (CNN) to classify samples in the dataset. Their CNN model achieved an overall accuracy of 94% classifying traffic as darknet or benign and 86% accuracy classifying the application type used to generate the traffic. The application traffic was broadly labeled as browsing, chat, email, file transfer, P2P, audio streaming, video streaming, or VOIP.

The research reported in Sarwar et al. (2021) consisted of classifying traffic and application type by combining a CNN and two


other deep-learning techniques: Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU). They addressed the issue of having an imbalanced dataset by performing the Synthetic Minority Oversampling Technique (SMOTE) on Tor, the minority traffic class. They used Principal Component Analysis (PCA), Decision Trees (DT), and Extreme Gradient Boosting (XGBoost) to extract 20 features before feeding the data into CNN-LSTM and CNN-GRU architectures. Their CNN layer was used to extract features from the input data, while LSTM and GRU did sequence prediction on these features. CNN-LSTM in combination with XGBoost as the feature selector produced the best F1-scores, achieving 96% classifying traffic type and 89% classifying application type.

The study (Iliadis and Kaifas, 2021) focused on just traffic type from the CIC-Darknet2020 dataset. They used k-Nearest Neighbors (k-NN), Multi-layer Perceptron (MLP), RF, DT, and Gradient-Boosting Decision Trees (GBDT) to do binary and multi-class classification. For binary classification, they grouped the data into two classes, namely, benign and darknet, similar to Lashkari et al. (2020). For the multi-class problem, they used the original four classes of traffic type (Tor, non-Tor, VPN or non-VPN). They found that RF was the most effective classifier for traffic type, yielding F1-scores of 98.7% for binary classification and 98.61% for multi-class classification.

Using the same dataset, the authors of (Demertzis et al., 2021) further broke down the application categories into 11 classes and used Weighted Agnostic Neural Networks (WANN) to classify the data. Unlike regular ANNs, WANNs do not update neuron weights, but rather update their own network architecture piece-wise. WANNs rank different architectures by performance and complexity, forming new network layers from the highest ranked architecture. Their best WANN model achieved 92.68% accuracy on application layer classification.

The UNB-CIC Tor and non-Tor dataset, also known as ISCXTor2016 (Lashkari et al., 2017), was used by Sarkar et al. (2020) to classify Tor and non-Tor traffic using Deep Neural Networks (DNN). They built two models, DNN-A with 3 layers and DNN-B with 5 layers. DNN-A classified Tor from non-Tor samples with 98.81% accuracy, while DNN-B achieved 99.89% accuracy. For Tor samples, they built a 4-layer Deep Neural Network to classify eight application types. This model attained 95.6% accuracy.

In another study, Hu et al. (2020) generated their own dataset, capturing darknet traffic across eight application categories (browsing, chat, email, file transfer, P2P, audio, video and VOIP) sourced from four different darknets (Tor, I2P, ZeroNet, and Freenet). They used a 3-layer hierarchical approach for classification. The first layer classified traffic as either darknet or normal. In the second layer, samples classified correctly as darknet were then classified by their darknet source. The third layer then classified application type for each of the darknet sources. The techniques (Hu et al., 2020) used for classification include Logistic Regression (LR), RF, MLP, GBDT, Light Gradient Boosting (LightGB), XGBoost, LSTM, and DT. Their hierarchical method attained 99.42% accuracy in the first layer, 96.85% accuracy in the second layer and 92.46% accuracy in the third layer.

Table 1 provides a summary of the prior work presented in this section. We note that the research in Iliadis and Kaifas (2021), Lashkari et al. (2020), and Sarwar et al. (2021) uses the same dataset that we consider in this paper.

3. Methodology

The primary goal of this research is to improve upon the state-of-the-art classification of darknet traffic by exploring the performance of Support Vector Machines (SVM), Random Forest (RF), Gradient-Boosting Decision Trees (GBDT), Extreme Gradient Boosting (XGBoost), k-Nearest Neighbors (k-NN), Multilayer Perceptron (MLP), Convolutional Neural Networks (CNN), and Auxiliary-Classifier Generative Adversarial Networks (AC-GAN) as classifiers. We experiment with different levels of SMOTE during a preprocessing phase, oversampling the minority classes of the CIC-Darknet2020 dataset to assess the effects of data augmentation and class balance on classifier performance. We also consider using the AC-GAN generator for data augmentation, but we find that it is ineffective for this purpose. We experiment with representations of the darknet traffic features as 2-dimensional grayscale images for CNN and AC-GAN. Then we test the robustness of our best-performing classifier in obfuscation scenarios, which serve to simulate adversarial attacks, assuming both the perspectives of an attacker and defender.

In our adversarial attacks, we apply statistical knowledge of the dataset to obfuscate specific data features, disguising one or more classes as others. We explore three scenarios whereby we obfuscate either the training data, the validation data, or both. Obfuscating just the validation data simulates an attack scenario in which traffic data is disguised while our classifier is yet unaware of the attack, and thus we can only apply previously trained models without a chance to learn from the obfuscation. Obfuscating just the training data simulates a scenario in which an attacker has accessed our training data to poison it, such that we train our classifier with malformed assumptions or outright malicious supervision. A third scenario supposes we collect some of the obfuscated traffic data before training our classifier, and thus have a chance to update our classification models to detect obfuscated validation data.
3.1. Dataset

The CIC-Darknet2020 dataset (Lashkari et al., 2020) is an amalgamation of two public datasets from the University of New Brunswick. It combines the ISCXTor2016 and ISCXVPN2016 datasets, which capture real-time traffic using Wireshark and TCPdump (Gil et al., 2016; Lashkari et al., 2017). CICFlowMeter (Lashkari, 2018) is used to generate CIC-Darknet2020 dataset features from these traffic samples. Each CIC-Darknet2020 sample consists of traffic features extracted in this manner from raw traffic packet capture sessions. CIC-Darknet2020 consists of 158,659 hierarchically labeled samples. The top-level traffic category labels consist of Tor, non-Tor, VPN, and non-VPN. Within these top-level categories, samples are further categorized by the types of application used to generate the traffic. These type subcategories are audio-streaming, browsing, chat, email, file transfer, P2P, video-streaming, and VOIP. Table 2 details the applications that are used to generate each type of traffic at the application level.

3.2. Preprocessing

The CIC-Darknet2020 dataset has samples with missing data, more specifically, feature values of "NaN". We remove samples with these values in our data cleaning phase. As shown in Table 3, there are significantly fewer Tor samples compared to the other traffic categories. Prior work using this dataset eliminated the CICFlowMeter flow labels, namely, Flow Id, Timestamp, Source IP, and Destination IP. The Flow Id and Timestamp are also eliminated in our research. However, to obtain as much information as possible from the CIC-Darknet2020 dataset, we separate each octet of the source and destination IP addresses into their own feature columns. Preliminary tests run on the dataset with and without these IP octet features indicate an improvement in the performance of the classifiers when this IP information is retained. Thus our dataset contains 72 features in total after this preprocessing step.


Table 1
Summary of previous work.

Work | Dataset | Problem considered | Techniques | Results
Demertzis et al. (2021) | CIC-Darknet2020 | Only examines 11 application types | WANN | 92.68% accuracy
Hu et al. (2020) | Self-generated | Hierarchical approach. Layer 1: darknet vs clearnet; Layer 2: Tor, I2P, ZeroNET and FreeNET; Layer 3: 8 application types | LR, RF, MLP, GBDT, LightGB, XGB, LSTM, DT | Layer 1: 99.42% accuracy; Layer 2: 96.85% accuracy; Layer 3: 92.46% accuracy
Iliadis and Kaifas (2021) | CIC-Darknet2020 | Only examines traffic type. Binary: darknet vs clearnet; Multiclass: 4 traffic types | kNN, MLP, RF, DT, GB | Binary: 98.7% F1-score; Multiclass: 98.61% F1-score
Lashkari et al. (2020) | CIC-Darknet2020 | Binary: darknet vs clearnet; Multiclass: 8 application types | CNN | Binary: 94% accuracy; Multiclass: 86% accuracy
Sarkar et al. (2020) | ISCXTor2016 | Binary: Tor vs non-Tor; Multiclass: 8 application types within Tor | DNN | Binary: 99.89% accuracy; Multiclass: 95.6% accuracy
Sarwar et al. (2021) | CIC-Darknet2020 | 4 traffic types; 8 application types | CNN-LSTM, CNN-GRU | Traffic: 96% F1-score; Application: 89% F1-score

Table 2
CIC-Darknet2020 application classes (Lashkari et al., 2020).

Application class | Applications considered
Audio-Streaming | Vimeo and YouTube
Browsing | Firefox and Chrome
Chat | ICQ, AIM, Skype, Facebook and Hangouts
Email | SMTPS, POP3S and IMAPS
File Transfer | Skype and FileZilla
P2P | uTorrent and Transmission (BitTorrent)
Video-Streaming | Vimeo and YouTube
VOIP | Facebook, Skype and Hangouts

Table 3
Samples per traffic category.

Traffic Type | Samples
Non-Tor | 93,357
Non-VPN | 23,864
Tor | 1393
VPN | 22,920

Table 4
Samples per application category.

Class | Application Type | Samples
0 | Audio-Streaming | 18,065
1 | Browsing | 32,809
2 | Chat | 11,479
3 | Email | 6146
4 | File Transfer | 11,183
5 | P2P | 48,521
6 | Video-Streaming | 9768
7 | VOIP | 3567
The CIC-Darknet2020 dataset was scaled by min-max normalization, which applies the equation

normalizedValue = (value − min) / (max − min)

to every value in each feature column. Note that this serves to scale the feature values between 0 and 1. We also apply min-max normalization to our IP octet feature columns.
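A minimal sketch of this scaling, applied column-wise with NumPy (equivalent in effect to scikit-learn's MinMaxScaler):

    import numpy as np

    def min_max_normalize(X):
        # Scale every feature column of X into the range [0, 1]
        mins = X.min(axis=0)
        maxs = X.max(axis=0)
        return (X - mins) / (maxs - mins)

    X = np.array([[2.0, 10.0], [4.0, 20.0], [6.0, 40.0]])
    print(min_max_normalize(X))  # each column now spans 0 to 1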
3.2.1. Data balancing

The CIC-Darknet2020 dataset does not have balanced sample counts among traffic and application classes, as shown in Tables 3 and 4. To explore the effect of reducing this imbalance on the classification task, we oversample each minority class using SMOTE. SMOTE interpolates linearly between feature values to produce new samples (Bhagat and Patil, 2015). We experiment with the following levels of oversampling: 0% (no SMOTE), 20%, 40%, 60%, 80% (partial SMOTE), and 100% (full SMOTE). SMOTE is performed on all classes with less than the oversampling threshold as compared to the class with the largest sample count. Note that 100% SMOTE results in an equal number of samples for each class, while lower thresholds of SMOTE result in an equal number of samples among only those classes which are oversampled.
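One way to realize this partial-SMOTE scheme with the Imblearn library is sketched below; the sampling_strategy dictionary is our reading of the procedure, not the exact code used in our experiments.

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    def partial_smote(X, y, threshold):
        # Oversample every class below threshold * (largest class size)
        counts = Counter(y)
        target = int(threshold * max(counts.values()))
        strategy = {c: target for c, n in counts.items() if n < target}
        return SMOTE(sampling_strategy=strategy).fit_resample(X, y)

    # Example: 60% SMOTE
    # X_bal, y_bal = partial_smote(X_train, y_train, threshold=0.6)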
3.2.2. Data representation

SVM and RF both use the dataset samples in their original format, which is a 1-dimensional array. However, we reshape each sample to be 2-dimensional for CNN and AC-GAN. Intuitively, the data is reshaped as 9 × 9 grayscale images, where each of our 72 features is represented as a single pixel, with the remaining pixels produced by zero padding. The pixels are ordered as their respective features appeared in the CIC-Darknet2020 dataset, starting at the top left corner of the image as shown in Fig. 3, where each row represents samples from an application class, color-coded for readability.

Both CNN and AC-GAN convolve local structures within the 2-D images, so adjacent pixels play an important role in classification. Therefore, we experiment with strategies to reorder the data to achieve better performance. We order the pixels by feature importance—as determined by our Random Forest classifier—starting at the top left corner of the image, and also reorganize the pixels spiraling outward from the center of the image. This latter strategy tends to group pixels with larger values toward the center of each image, as shown in Fig. 4.

Fig. 3. Data as 2-D images in original order.

Fig. 4. Data in 2-D sorted by RF feature importance and centered.
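As a sketch, converting one 72-feature sample into a 9 × 9 grayscale image with zero padding might look as follows:

    import numpy as np

    def to_image(sample, side=9):
        # Zero-pad the 72 features to 81 pixels and reshape to 9 x 9
        padded = np.zeros(side * side, dtype=np.float32)
        padded[:sample.size] = sample
        return padded.reshape(side, side)

    img = to_image(np.random.rand(72))  # shape (9, 9), values in [0, 1]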
3.2.3. Data augmentation experiment

We experimented with AC-GAN as an alternative to SMOTE, with the aim of generating synthetic samples that can be used to augment our dataset.


Again, we use data augmentation to address the issue of class imbalance. However, we abandoned this approach as we found that the fake images generated by AC-GAN are consistently detectable by a CNN model with accuracy ranging from 99% to 100%. We believe that the failure of our AC-GAN to produce realistic fake images is due to the depth of the AC-GAN neural network architecture, which is constrained by the input image size. In any case, we were unsuccessful in our attempt to use AC-GAN to augment our data.

An example of four fake samples compared to real samples can be found in Fig. 5. The fake samples in this figure may appear to be useful but, again, a CNN can distinguish the fake from the real with essentially 100% accuracy. This clearly shows that from a machine learning perspective, the fake samples are not sufficient for data augmentation.

Fig. 5. Chat class (4 fake and 4 real examples).

3.3. Evaluation metrics

In our experiments, we use accuracy and F1-score to measure the performance of each classifier. Accuracy is computed as the total number of correct predictions over the number of samples tested. The F1-score is the weighted average of precision and recall metrics, which is better for unbalanced datasets like CIC-Darknet2020. Similar to accuracy, F1-scores fall between 0 and 1, with 1 being the best possible. The F1-score is computed as

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Precision calculates the ratio of samples classified correctly for the positive class, while recall measures the total number of positive samples that were classified correctly. Precision and recall are computed as



Precision = True Positives / (True Positives + False Positives)

and

Recall = True Positives / (True Positives + False Negatives)

respectively.
k-NN. However, the technique is sensitive to local structure and,
4. Implementation in particular, for small values of k, overfitting is common. Based
on small-scale experiments, we use k = 5 for all k-NN experiments
This section details the implementation of the experiments that reported in this paper.
we mentioned in Section 3. All experiments are coded in Python.
The Imblearn library (imblearn) is used to implement SMOTE 4.1.3. Multilayer perceptron
to balance the dataset, while the package Scikit-learn (Scikit- Multilayer Perceptrons (MLP) are feedforward networks that
learn: Machine Learning in Python) is employed to run most of generalize basic perceptrons to allow for nonlinear decision bound-
the experiments, with the exceptions being that the Tensorflow aries. This is somewhat analogous to the way that nonlinear SVM
and Keras libraries are utilized to implement CNN and AC- generalize linear SVMs. In a sense, MLPs are the simplest useful
GAN. From the Scikit-learn library, the metrics module neural networking architecture, and hence they are sometimes re-
is used to evaluate the F1-scores and accuracy of the classifiers ferred to simply as Artificial Neural Networks (ANN). In our MLP
and the StratifiedKFold function is applied to perform 5-fold experiments, we use an architecture with 100 hidden layers, rec-
cross validation. Graphs are generated with the Matplotlib and tified linear unit (ReLu) activation functions, the Adam optimizer,
Seaborn libraries, with the exception of the confusion matrices and a learning rate of α = 0.0 0 01.
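A sketch of these configurations follows. Settings not quoted above, such as the exact MLP layer shape (which we read as scikit-learn's hidden_layer_sizes=(100,)), are assumptions rather than the authors' code.

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from xgboost import XGBClassifier

    # Log loss is the default GBDT objective in scikit-learn
    gbdt = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100)
    xgb = XGBClassifier(learning_rate=0.3, max_depth=6,
                        sampling_method="uniform")
    knn = KNeighborsClassifier(n_neighbors=5)
    mlp = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                        solver="adam", learning_rate_init=0.0001)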
4.1.4. Support vector machines

Support Vector Machines (SVM) are supervised machine learning models frequently used for classification. An SVM attempts to find one or more hyperplanes to separate labeled training data while maximizing the margin of the decision boundaries between classes. The data must be vectorized into linear feature sets, but non-linear data can also be encoded with some success. Scaling the feature values across training samples allows coefficients of the hyperplanes (weights) to be ranked by relative importance. SVMs rely on the so-called kernel trick to map data into a higher dimensional space, which can yield nonlinear decision boundaries in the input space. The idea behind the kernel trick is that in higher dimensions, it is generally easier to find hyperplanes to separate

classes (Stamp, 2022). For our research, we perform preliminary tests to determine the best kernel for our dataset, with the result being the Gaussian radial basis function (RBF).

4.1.5. Random forest

Random Forest (RF) is an ensemble method that generalizes Decision Trees (DT). While a DT is a simple and efficient classification algorithm, it is highly sensitive to variance in the training data and hence prone to overfitting. RF compensates for these deficiencies by generating many subsets of the dataset, randomly selecting features (with replacement), and training a DT for each subset. This process is called bootstrapping. To classify, RF takes the majority vote from all resulting DTs in a process called aggregation. Together, bootstrapping and aggregation are referred to as bagging (Misra and Li, 2020; Stamp, 2022). RF also enables us to rank the importance of features based on the mean entropy within the component DTs. Feature importance tells us how influential each feature is when classifying samples with the RF. Based on small-scale experiments, we found that the default hyperparameters in Scikit-learn yielded the best results; see (sklearn.ensemble.RandomForestClassifier) for the details.
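A minimal sketch of this setup, including the feature-importance ranking later used to reorder pixels; synthetic data stands in here for the real features.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=72,
                               n_informative=10, n_classes=8)
    rf = RandomForestClassifier()  # default hyperparameters, per the text
    rf.fit(X, y)
    ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first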
4.1.6. Convolutional neural networks

Convolutional Neural Networks (CNN) are a unique type of neural network that focus on local structures, making them ideal for image analysis. CNNs are composed of an image input layer, convolution and pooling layers, and a fully-connected output layer that produces a vector of class scores. Convolutional and pooling layers are the fundamental components of any CNN architecture. In convolutional layers, the output of the previous layer (or the raw image in the initial convolutional layer) is convolved with randomized filters to produce local structure maps that are joined to create the output of the layer. In the convolutional process, the filter windows slide across the input image, thus emphasizing local structure and providing a degree of translation invariance. The components of each filter are learned when training a CNN. Pooling layers decrease total training time by reducing the dimensionality of the resulting feature maps, concentrating effort on the most significant features (Convolutional Neural Networks for Visual Recognition; Lashkari et al. 2020). For this research, we use max pooling.

Our CNN architecture is based on that described in (Lashkari et al., 2020). We experiment with various hyperparameters, testing all combinations of the following in a grid search.

• Initial number of convolution filters (9, 32, 64, 81)
• Filter size (2 × 2, 3 × 3)
• Percentage dropout (0.2, 0.5)
• Number of nodes in the first dense layer (72, 256)

All these architectures yield accuracies within the range of 86% to 88% when classifying application type. Therefore, we select the architecture that produces the highest accuracy. Our selected CNN architecture is illustrated in Fig. 6. Note that we use Adam for our optimizer and sparse categorical cross entropy for our loss function.
Dropout is a common technique used to combat overfitting in select application classes to disguise as other classes based on min-
neural networks with fully-connected layers. However, it is found imum and maximum sum statistical distance between all class fea-
to be not as effective with convolution layers. A better regular- tures, as specified in Algorithm 1.
ization technique for CNN is to “cut out” sections of the input We also select a third class transformation to perform based
images. Such cutouts force CNN to learn from the other parts on maximal classifier confusion, whose sum statistical distance be-
of an image during training, which tends to activate filters that tween class features is notably low, but not the minimum between
would otherwise atrophy. It is comparative in effect to dropouts classes. We ensure our class transformation can be decoded by
except that it operates on the input stage rather than the in- encoding features with a deterministic algorithm, given here as
termediate layers (DeVries and Taylor; Li et al. 2021). We im- Algorithm 2. We impose no additional restrictions on feature trans-
plement cutouts by creating feature masks of equivalent size to formation.


Fig. 6. CNN architecture.

4.1.7. Auxiliary-classifier generative adversarial network

Generative Adversarial Networks (GAN) are comprised of two neural network architectures—a generator and a discriminator—that compete in a zero-sum game during training. The generator takes noise from a latent space as input and produces images that feed into the discriminator. The discriminator is given both real and generated images and is tasked to classify them as either real or fake. The discriminator error is then fed back into the generator to improve its image generation. AC-GAN is an extension of this base GAN architecture, taking a class label as additional input to the generator while predicting this label as part of the discriminator output. The objective of the AC-GAN generator is to minimize the ability of the discriminator to distinguish between real and fake images and also maximize the accuracy of the discriminator when predicting the class label (Mudavathu et al., 2018; Nagaraju and Stamp, 2021). Besides using the AC-GAN generator in data augmentation experiments, we also explore the secondary class prediction output of the discriminator as a classifier.

Our AC-GAN architecture is inspired by the ImageNet model described in (Odena et al., 2017). However, since that architecture was built for image sizes 32 × 32 or larger, we modify that architecture to accommodate our 9 × 9 image size by reducing the number of convolutional and transposed convolutional layers in the discriminator and generator, respectively.

We fine-tune our AC-GAN hyperparameters by experimenting with the following.

• Latent space size (81, 100)
• Initial number of convolution filters (15, 40, 64, 192, 202, 384, 500, 1500)
• Number of nodes in the first dense layer (31, 81, 128, 384, 405, 768, 1000, 3000)
• Filter size (3 × 3, 5 × 5)
• Stride size (2 × 2, 3 × 3)

We observe accuracies within the range of 70% to 73% when classifying application type with these hyperparameters. The best-performing architecture with the shortest runtime duration is used in this research; Tables 6 and 7 detail our generator and discriminator architecture, respectively.

We feed training data to our AC-GAN model in batches of 64 samples. Batch normalization (BatchNorm) layers are applied between convolutional layers to regularize the training gradient step size. BatchNorm is thought to smooth local optimization steps and stabilize training, thereby accelerating convergence of GAN models (Santurkar et al., 2018).

Table 6
AC-GAN generator architecture.

Layer | Kernel | Strides | Depth | BN | Activation
Input A (Latent Space), 1 × 1 × 100 | | | | |
Dense A | | | 405 | | ReLU
Input B (Feature Noise), 1 × 1 × 1 | | | | |
Class Embedding for B (8 × 32) | | | 256 | |
Dense B | | | 1 | |
Merge A + B | | | 406 | |
Conv2DTranspose | 5 × 5 | 3 × 3 | 202 | ✓ | ReLU
Conv2DTranspose | 5 × 5 | 3 × 3 | 1 | | Tanh
Table 7
AC-GAN discriminator architecture.

Layer | Kernel | Strides | Depth | BN | Dropout | Activation
Input (Image), 9 × 9 × 1 | | | | | |
Conv2D | 3 × 3 | 2 × 2 | 32 | | 0.5 | Leaky ReLU
Conv2D | 3 × 3 | 1 × 1 | 64 | ✓ | 0.5 | Leaky ReLU
Conv2D | 3 × 3 | 2 × 2 | 128 | ✓ | 0.5 | Leaky ReLU
Conv2D | 3 × 3 | 1 × 1 | 256 | ✓ | 0.5 | Leaky ReLU
Flatten | | | | | |
Dense | | | 1 | | | Sigmoid
Dense | | | 8 | | | Softmax

Leaky ReLU slope: 0.2
Weight initialization: Gaussian (σ = 0.02)
Optimizer: Adam (α = 0.0002, β1 = 0.5)

4.2. Adversarial attacks

Our adversarial attacks rely on obfuscation, which serves to disguise application classes based on applied probability analysis. We select application classes to disguise as other classes based on minimum and maximum sum statistical distance between all class features, as specified in Algorithm 1. We also select a third class transformation to perform based on maximal classifier confusion, whose sum statistical distance between class features is notably low, but not the minimum between classes. We ensure our class transformation can be decoded by encoding features with a deterministic algorithm, given here as Algorithm 2. We impose no additional restrictions on feature transformation.

We start by generating normalized histograms of feature values per class to assess the probability at which values occur within each class. To decide which classes to obfuscate, we examine the sums of the distances between feature probability distributions from each class to each other class. We use the cdist function of the scipy Python library to calculate the Euclidean distance between probability distributions. This provides an estimate of the overall difference between classes while considering all feature probability distributions. In the case of application type, this yields the 8 × 8 array in Table 8, where the class numbers correspond to those in Table 4, above.

From Table 8, we observe that class 0 is most different from class 5 and class 3 is most similar to class 7. We pick the classes with the minimum and maximum sum of statistical distances between features, changing class 0 (audio-streaming) to class 5 (P2P) and class 3 (email) to class 7 (VOIP). We also examine the confusion matrix for our best-performing classifier, RF, which is shown in Fig. 8. RF is observed to be most confused between class 2 (chat) and class 3 (email), so we decide to additionally obfuscate class 2 with class 3. We arbitrarily choose to transform lower numbered classes to higher numbered classes, e.g., disguising class 2 as class 3 instead of class 3 as class 2.
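As an illustration, the per-class distance computation described above can be sketched as follows, assuming each class is summarized by an array of per-feature normalized histograms of shape (n_features, n_bins):

    import numpy as np
    from scipy.spatial.distance import cdist

    def class_distance(hists_a, hists_b):
        # Sum the Euclidean distances between corresponding per-feature
        # normalized histograms of two classes
        total = 0.0
        for ha, hb in zip(hists_a, hists_b):
            total += cdist(ha[None, :], hb[None, :], metric="euclidean")[0, 0]
        return total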


Table 8
Statistical distances between pairs of application classes.

Class | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
0 | 0 | 25.129 | 18.709 | 21.041 | 23.656 | 28.1959 | 18.371 | 21.903
1 | 25.129 | 0 | 23.518 | 21.958 | 12.884 | 12.098 | 16.728 | 23.623
2 | 18.709 | 23.518 | 0 | 9.84175 | 22.613 | 25.294 | 18.408 | 9.901
3 | 21.041 | 21.958 | 9.84175 | 0 | 21.51 | 23.021 | 18.031 | 6.8596
4 | 23.656 | 12.884 | 22.613 | 21.51 | 0 | 15.651 | 14.605 | 23.211
5 | 28.1959 | 12.098 | 25.294 | 23.021 | 15.651 | 0 | 21.085 | 24.451
6 | 18.371 | 16.728 | 18.408 | 18.031 | 14.605 | 21.085 | 0 | 20.089
7 | 21.903 | 23.623 | 9.901 | 6.8596 | 23.211 | 24.451 | 20.089 | 0

Algorithm 1 Class feature probability distributions

1: procedure compare(classA, classB)
2:   bins = some discrete bins partitioning values 0 to 1  ▷ We use 100 bins
3:   A = classA feature probability distributions
4:   B = classB feature probability distributions
5:   classDistance = 0
6:   for each distributionA, distributionB in A, B do
7:     featureDistance = cdist(distributionA, distributionB)  ▷ Euclidean
8:     classDistance += featureDistance  ▷ Manhattan sum
9:   end for
10: end procedure

Algorithm 2 Disguise one class sample as another class sample

1: procedure obfuscate(sample, classA, classB)  ▷ To decode, reverse A and B
2:   bins = some discrete bins partitioning values 0 to 1  ▷ We use 100 bins
3:   A = classA feature probability distributions
4:   B = classB feature probability distributions
5:   for each featureValue at featureIndex in the sample do
6:     featureBin = the bin which contains featureValue
7:     DCPD = A[featureIndex] − B[featureIndex]
8:     AtoB = sorted DCPD from maximum to minimum
9:     BtoA = sorted DCPD from minimum to maximum
10:    oldBin = index such that AtoB[oldBin] == featureBin  ▷ Red arrows in Fig. 9
11:    newBin = BtoA[oldBin]  ▷ Black arrows in Fig. 9
12:    newValue = featureValue − (bins[oldBin] − bins[newBin])
13:    sample[featureIndex] = newValue
14:  end for
15: end procedure
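In Python, our reading of Algorithm 2 can be sketched as follows; this is illustrative, not the implementation used in our experiments, and the histogram arrays and tie-breaking behavior are assumptions.

    import numpy as np

    def obfuscate(sample, hists_a, hists_b, n_bins=100):
        edges = np.linspace(0.0, 1.0, n_bins + 1)   # bin thresholds on [0, 1]
        out = sample.copy()
        for i, value in enumerate(sample):
            feature_bin = min(int(value * n_bins), n_bins - 1)
            dcpd = hists_a[i] - hists_b[i]          # difference in class prob. dists
            a_to_b = np.argsort(dcpd)[::-1]         # bins sorted max-to-min DCPD
            b_to_a = np.argsort(dcpd)               # bins sorted min-to-max DCPD
            pos = int(np.where(a_to_b == feature_bin)[0][0])
            new_bin = int(b_to_a[pos])              # flip to the mirrored rank
            out[i] = value - (edges[feature_bin] - edges[new_bin])
        return out

    # Decoding reverses the roles of the two classes:
    # original = obfuscate(disguised, hists_b, hists_a)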

Our obfuscation algorithm first calculates the difference in class probability distributions (DCPD) for each feature between the two classes under consideration, where the classes are denoted as A and B, and sorts each distribution from maximum to minimum. Intuitively, the index of each DCPD maximum corresponds to the feature value most probably belonging to the positive class A, while the minimum corresponds to the feature value most probably belonging to the negative class B. To obfuscate a sample, we then transform individual feature values by subtracting the difference in bin thresholds between the original feature bin and a target bin for obfuscation. To choose target bins for this transformation, we create a 1-to-1 map of the sorted indices of each DCPD with a reverse sort of the same DCPD. This ensures a transformed sample feature could be decoded later, given the feature DCPD for a class obfuscation vector. An example visualization of the DCPD bin mapping for the transformation of the most common feature 0 values from class 2 to class 3 is provided in Section 4.2.1, below.

Reversing the 1-to-1 bin map facilitates decoding of obfuscated class sample feature values back to their original values. To do this we add back the same difference in bin thresholds which we subtracted earlier, thus applying each feature DCPD between known classes as a decoder key to undo an expected class obfuscation for a particular feature. To test this method of class obfuscation, we performed the three adversarial attacks summarized in Table 9, with RF as the classifier.

Fig. 8. Best RF results for application classification.

Table 9
Attack scenarios.

Scenario | Training data obfuscated? | Validation data obfuscated? | Scenario description
1 | | ✓ | Simulates a novel attack where we apply an outdated model for classification
2 | ✓ | | Simulates an attack on our training data, poisoning the classifier
3 | ✓ | ✓ | Simulates a novel defense where we train our model on some obfuscated data

4.2.1. An obfuscation example

To illustrate Algorithm 2, we will walk through a simple example where we are given a sample from class 2 and we want to transform this sample to look more like class 3. Let us start with the first feature, feature 0. We note the value of this feature for class 2; call this value v. Suppose, for example, that v = 0.178142. We allocate 100 equal-width bins ranging from 0 to 1, so that bin b0 corresponds to values 0.00 to 0.01, and so on. Given the value of v, we find the bin that v falls into. The value v = 0.178142 is in bin b17, which contains values between 0.17 to 0.18. Bin b17 is indicated by the red arrows in Fig. 9. We then flip the sorted DCPD index at b17 to locate our target bin, indicated by the black arrows in Fig. 9. This target bin is b58, which contains values between 0.58 to 0.59. To obfuscate, we subtract the difference between b17 and b58 from v. In this example, our new transformed value is

v = 0.178142 − (0.17 − 0.58) = 0.588142

which falls into the target bin b58. We repeat this for all the features to transform the sample from class 2 to class 3.

Fig. 9. Visualization of obfuscation example.

Note that this obfuscation technique is designed to maximize the effectiveness of a simulated adversarial attack. Our approach ignores practical limitations on the ability of attackers to modify the statistics of the data. Hence these simulated attacks can be considered worst-case scenarios, from the perspective of detecting darknet traffic under adversarial attack.

5. Results and discussion

In this section, we consider a wide range of experiments. First, we determine which of the three 2-D image representation techniques discussed in Section 5.1 is most effective. Then we consider the use of cutouts, which can serve to reduce overfitting and improve accuracy in CNNs. We then turn our attention to the imbalance problem, with a series of SMOTE experiments. We conclude this section with an extensive set of experiments involving various adversarial attack scenarios.

5.1. Data representation experiments

We evaluate CNN and the AC-GAN discriminator given different 2-D pixel representations of the data features. All of our 2-D representations of the data are of size 9 × 9, where each pixel is a feature. The pixels in the original representation follow the order that the features appear in the CIC-Darknet2020 dataset. We hypothesize that grouping the pixels together would have a positive effect on the performance of our classifiers, since convolutions operate on local structures. Our results show that CNN performs best when the pixels are sorted by RF feature importance and then grouped together at the center of the image. However, this is not true for the AC-GAN discriminator. AC-GAN does better using the original data representation, contrary to our hypothesis. Table 10 shows the results for these experiments.

5.2. Cutout experiments

Initially, our CNN model is able to achieve 88% accuracy classifying application type within 15 epochs. However, we notice that overfitting starts to occur the longer we run our model. To reduce overfitting, we apply cutouts to the training data. We experiment with different cutout sizes: 2 × 2, 3 × 3, and 4 × 4. We observe that cutouts allow our CNN to train for a longer period of time without overfitting. The loss graphs in Fig. 10 show how the CNN model overfits after 20 epochs in the original execution but does not overfit with cutouts. There is little difference in the effects of applying 2 × 2 compared to 3 × 3 cutouts. Both delay overfitting at the same rate, and the accuracies for both linger at 88%. Notably, we witness a 1% decrease in accuracy with 4 × 4 cutouts. As our images are only 9 × 9 pixels, a 4 × 4 cutout likely deletes too much information from the image, negatively affecting the accuracy. While cutouts address the issue of overfitting, we find that more training does not significantly improve the performance of CNN on the dataset under consideration. Thus, we do not employ cutouts in the CNN results reported below.

5.3. SMOTE experiments

We compare the performance of our classifiers with various levels of SMOTE, performing SMOTE to oversample the training data before training each classifier for both cases, that is, traffic type and application type. The results from these experiments appear in Tables 11 and 12, respectively, where the best result for each


Table 10
2-D data representation results.

Data representation | CNN accuracy | CNN F1-score | AC-GAN accuracy | AC-GAN F1-score
Original | 0.889 | 0.887 | 0.753 | 0.738
Shaped with RF feature importance | 0.890 | 0.887 | 0.753 | 0.731
Shaped with RF feature importance and centered | 0.891 | 0.889 | 0.742 | 0.729

Fig. 10. The effects of cutouts on overfitting for CNN.

SMOTE level is boxed. We observe that reducing class imbalance using SMOTE does not have a large effect on the performance of most of the classifiers. With the exception of the MLP traffic classification experiments, SMOTE only affects the F1-score by about 1% to 2% in each case. Note also that the MLP results are the poorest in every case. We conclude that for the problem under consideration, SMOTE is of some value for fine tuning models.

Table 11
Traffic classification F1-scores at various SMOTE levels.

Learning technique | 0% SMOTE | 20% | 40% | 60% | 80% | 100%
GBDT | 0.961 | 0.961 | 0.960 | 0.960 | 0.958 | 0.958
XGBoost | 0.983 | 0.983 | 0.982 | 0.980 | 0.977 | 0.975
k-NN | 0.884 | 0.884 | 0.881 | 0.875 | 0.871 | 0.868
MLP | 0.821 | 0.821 | 0.850 | 0.788 | 0.676 | 0.744
SVM | 0.986 | 0.993 | 0.993 | 0.993 | 0.993 | 0.993
RF | 0.998 | 0.998 | 0.998 | 0.998 | 0.998 | 0.998
CNN | 0.998 | 0.995 | 0.995 | 0.995 | 0.996 | 0.995
AC-GAN | 0.974 | 0.980 | 0.984 | 0.986 | 0.987 | 0.987

Our RF model without SMOTE outperforms the state-of-the-art F1-scores for both traffic and application classification tasks. We observe a 1.1% improvement for traffic classification as compared to Iliadis and Kaifas (2021), where they also found RF to be their best classifier. The study (Iliadis and Kaifas, 2021) only classified traffic type, thus no application type performance is available for comparison. For application classification, our RF model achieved a 3.2% increase over (Sarwar et al., 2021). In addition, our CNN model outperformed the CNN results in Lashkari et al. (2020) by 2.8% and is within 0.2% of the more complex and costly CNN-LSTM results in Sarwar et al. (2021). We are only able to compare classification results for application type with (Lashkari et al., 2020) because they approach traffic type classification as a binary problem while we address it as a multiclass problem. Table 13 summarizes the best performance of our classifiers in comparison to relevant prior work, where the best results in the Traffic and Application columns are boxed. Overall, RF is our

Fig. 11. Confusion matrices for attack scenario 1.

Table 12
Application classification F1-scores at various SMOTE levels.

Learning technique | 0% SMOTE | 20% | 40% | 60% | 80% | 100%
GBDT | 0.840 | 0.840 | 0.840 | 0.838 | 0.837 | 0.835
XGBoost | 0.893 | 0.890 | 0.888 | 0.887 | 0.885 | 0.885
k-NN | 0.750 | 0.746 | 0.742 | 0.736 | 0.734 | 0.734
MLP | 0.591 | 0.587 | 0.596 | 0.558 | 0.547 | 0.536
SVM | 0.834 | 0.839 | 0.842 | 0.846 | 0.847 | 0.848
RF | 0.922 | 0.920 | 0.921 | 0.921 | 0.920 | 0.920
CNN | 0.887 | 0.883 | 0.883 | 0.887 | 0.888 | 0.885
AC-GAN | 0.738 | 0.750 | 0.762 | 0.768 | 0.767 | 0.759

best-performing classifier, while MLP and k-NN perform the worst. Also of note is the fact that the AC-GAN classifier is one of the best performing models in the traffic classification problem, but it performs relatively poorly in the application classification task.

5.4. Adversarial attack experiments

With improvement in the accuracy of darknet traffic detection by machine learning and deep learning techniques, it is realistic to anticipate that attackers will attempt to find ways to circumvent


Fig. 12. Confusion matrices for attack scenario 2.

detection by modifying the profile of their application traffic. For example, someone pirating copyrighted media with P2P applications might disguise their illegal activity as VOIP traffic to avoid prosecution. We show that obfuscation of traffic in this fashion can be accomplished by modifying traffic feature values, understanding that this process is most feasible and desirable at the application layer. Also, if an attacker were to discover the methods we use for classification and pollute our training data, then our classifiers could be compromised, allowing the attacker to avoid detection without modifying any of their traffic features.

Table 13
Best F1-scores compared to prior work.

Source | Learning technique | Traffic | Application
Previous work | CNN-LSTM (Sarwar et al., 2021) | 0.960 | 0.890
Previous work | RF (Iliadis and Kaifas, 2021) | 0.987 | —
Previous work | CNN (Lashkari et al., 2020) | — | 0.860
Our results | GBDT | 0.961 | 0.840
Our results | XGBoost | 0.983 | 0.893
Our results | k-NN | 0.875 | 0.750
Our results | MLP | 0.850 | 0.596
Our results | SVM | 0.993 | 0.848
Our results | RF | 0.998 | 0.922
Our results | CNN | 0.998 | 0.888
Our results | AC-GAN | 0.987 | 0.768

For this experiment we assume the role of an attacker on the network, with the goal of modifying traffic features such that classes are incorrectly classified or entirely undetected. This could represent covert illegal activity that an attacker wishes to hinder the detection of, with common examples being P2P or file-transfer applications. Realistically, traffic features common to one application class could be modified at the application layer to appear more similar to other application classes. An attacker could do this by writing a custom overlay application to change various features, such as the number of packets sent, their communication intervals, port assignment, etc. In our experiments, we disguise class 0 as class 5 (originally the most different), class 2 as class 3 (the classes which most confused our RF classifier) and class 3 as class 7 (originally the most similar).

Table 14
Class accuracies for attack scenarios.

Scenario | Obfuscation | Overall accuracy (0,5) | Overall accuracy (2,3) | Overall accuracy (3,7) | Class 0 accuracy | Class 2 accuracy | Class 3 accuracy
No attack | — | — | — | — | 0.907 | 0.846 | 0.821
1 | — | 0.808 | 0.854 | 0.887 | 0.100 | 0.006 | 0.000
2 | — | 0.820 | 0.865 | 0.894 | 0.000 | 0.000 | 0.000
3 | 20.0% | 0.947 | 0.946 | 0.939 | 0.998 | 0.993 | 0.997
3 | 2.0% | 0.935 | 0.939 | 0.891 | 0.958 | 0.921 | 0.120
3 | 0.2% | 0.820 | 0.859 | 0.887 | 0.503 | 0.247 | 0.000

In attack scenario 1, we train our RF classifier on the original application class data, then test the same model with an obfuscated class in the validation dataset. This represents a hypothetical scenario where an attacker modifies the traffic features of one class at the application layer, perhaps with an overlay application. We demonstrate that our method of obfuscation is able to defeat our best classifier in this scenario, significantly reducing detection of the obfuscated class, as well as overall classifier accuracy. Before obfuscation, RF classifies application classes with an accuracy of 92.3% without SMOTE. After obfuscation of the three class choices mentioned in the previous paragraph, the overall RF accuracy for application classification without SMOTE decreases to 80.8%, 85.4%, and 88.7%, respectively.

The confusion matrices in Fig. 11 show that RF consistently misclassifies each class we obfuscate, actually detecting no samples in the case of an obfuscated class 3, where the dashed circle indicates the class we are obfuscating and the solid circle indicates the class we intended it to appear as. However, RF did not misclassify classes 0 and 3 as the expected classes 5 and 7, respectively. Instead, the confusion matrices (a) and (c) in Fig. 11 reveal that RF mostly categorizes class 0 and 3 as class 6 and 2, respectively. It may be relevant that our obfuscation method does not account for any interdependence between traffic feature values, obfuscating each feature independently.

In attack scenario 2, we train our RF classifier with an obfuscated class in the training dataset, then test the model with the original application class data. We consider a hypothetical scenario where an attacker entirely poisons our training data, perhaps by injecting malware into our database or by intercepting our traffic capture data stream. We find the attacker could prevent an entire class from being predicted by our best classifier when the training data for a class is entirely obfuscated. We see this trend in all three confusion matrices in Fig. 12, where in each case, the dashed circle indicates the class we are obfuscating and the solid circle indicates the class we intended it to appear as. Notice that the entire row in the confusion matrix is zeroed out, indicating that the class was never predicted for classification by RF. Similar to attack scenario 1, the overall RF accuracy decreases to 82.0%, 86.5%, and 89.4% respectively, for application classification without SMOTE. As the obfuscated class is never considered for prediction by RF, in this scenario, we observe a lesser overall accuracy decrease as compared to attack scenario 1.

In attack scenario 3, we train our RF classifier with the same obfuscated class in both the training dataset and the validation dataset. We obfuscate only a small portion of the training data while still obfuscating all of the validation data for each of classes 0, 2, and 3. We experiment with the percentage of training data we obfuscate. This represents a hypothetical scenario where the obfuscation algorithm has been obfuscating network traffic long enough to pollute a small portion of a network traffic population. A defender then updates the classifier to include this small portion of obfuscated class data at training time, with increasing exposure to the obfuscated data over time. As our dataset is split into 80% training data and 20% validation data, we decide to limit the training dataset exposure of obfuscated class data to 20% of the total training dataset. We choose to decrement this value logarithmically with three total sub-scenarios representing 0.2%, 2%, and 20% obfuscation exposure, expecting that with more exposure to the obfuscated class data, our classifier will adapt and outperform the obfuscation algorithm to correctly classify the obfuscated class in our validation dataset.
for any interdependence between traffic feature values, obfuscating
each feature independently.
In attack scenario 2, we train our RF classifier with an obfuscated class in the training dataset, then test the model with the original application class data. We consider a hypothetical scenario in which an attacker entirely poisons our training data, perhaps by injecting malware into our database or by intercepting our traffic capture data stream. We find that the attacker can prevent an entire class from being predicted by our best classifier when the training data for that class is entirely obfuscated. We see this trend in all three confusion matrices in Fig. 12, where, in each case, the dashed circle indicates the class we are obfuscating and the solid circle indicates the class we intend it to appear as. Notice that the entire row of the confusion matrix is zeroed out, indicating that the class is never predicted by RF. Similar to attack scenario 1, the overall RF accuracy decreases to 82.0%, 86.5%, and 89.4%, respectively, for application classification without SMOTE. As the obfuscated class is never considered for prediction by RF in this scenario, we observe a smaller decrease in overall accuracy as compared to attack scenario 1.
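The evaluation loop for scenarios 1 and 2 can be sketched as follows, reusing the hypothetical obfuscate() helper from the previous sketch. The 80/20 split mirrors our setup, but the default RandomForestClassifier settings stand in for our tuned hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def run_scenario(X, y, src, tgt, poison_training=False, seed=0):
    """Scenario 1: obfuscate the source class in validation data only.
    Scenario 2 (poison_training=True): obfuscate it in training data only."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    tgt_pool = X_tr[y_tr == tgt]          # target-class samples to imitate
    if poison_training:
        X_tr[y_tr == src] = obfuscate(X_tr[y_tr == src], tgt_pool)
    else:
        X_va[y_va == src] = obfuscate(X_va[y_va == src], tgt_pool)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    pred = clf.predict(X_va)
    overall = (pred == y_va).mean()
    src_recall = (pred[y_va == src] == src).mean()  # accuracy on attacked class
    return overall, src_recall
```

In scenario 2, the classifier never sees a genuine sample of the attacked class at training time, which is why the corresponding row of the confusion matrix is zeroed out.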
In attack scenario 3, we train our RF classifier with the same obfuscated class in both the training dataset and the validation dataset. We obfuscate only a small portion of the training data while still obfuscating all of the validation data for each of classes 0, 2, and 3, and we experiment with the percentage of training data we obfuscate. This represents a hypothetical scenario where the obfuscation algorithm has been obfuscating network traffic long enough to pollute a small portion of a network traffic population. A defender then updates the classifier to include this small portion of obfuscated class data at training time, with increasing exposure to the obfuscated data over time. As our dataset is split into 80% training data and 20% validation data, we limit the training dataset exposure to obfuscated class data to 20% of the total training dataset. We decrement this value logarithmically, with three sub-scenarios representing 0.2%, 2%, and 20% obfuscation exposure, expecting that with more exposure to the obfuscated class data, our classifier will adapt and outperform the obfuscation algorithm, correctly classifying the obfuscated class in our validation dataset.

We find that 20% exposure of our obfuscation algorithm to the RF training data is sufficient for RF to predict the disguised classes with high accuracy, defeating our obfuscation technique, as shown in Table 14. Note that the overall accuracies reported for attack scenario 3 are higher than our RF benchmark score of 92.2%. However, we modify the validation dataset in both attack scenarios 1 and 3, so the resulting accuracies of those scenarios cannot be directly compared to the results of prior work. Our results with the lower exposure levels of 2% and 0.2% reveal a trend: of the classes tested, class 0 appears to be the most difficult for our algorithm to obfuscate, while class 3 appears to be the easiest, with class 2 falling somewhere in between. This provides a loose correlation between our metric of statistical distance between classes and the performance of our obfuscation algorithm. We observe this trend in Table 14 under Class Accuracy for attack scenario 3.
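A compact sketch of this exposure sweep is shown below, again assuming the hypothetical obfuscate() helper. Whether each exposure percentage is measured against the whole training set or against the attacked class alone is a detail we gloss over here; the sketch measures it against the whole training set, capped by the number of available source-class rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def exposure_sweep(X_tr, y_tr, X_va, y_va, src, tgt,
                   fractions=(0.002, 0.02, 0.2), seed=0):
    """Scenario 3: a fraction of source-class training rows is obfuscated,
    while every source-class validation row is obfuscated. Returns the
    accuracy on the attacked class at each exposure level."""
    rng = np.random.default_rng(seed)
    tgt_pool = X_tr[y_tr == tgt]
    X_va = X_va.astype(float).copy()
    X_va[y_va == src] = obfuscate(X_va[y_va == src], tgt_pool)
    src_idx = np.where(y_tr == src)[0]
    results = {}
    for p in fractions:
        Xp = X_tr.astype(float).copy()
        k = min(int(p * len(X_tr)), len(src_idx))
        chosen = rng.choice(src_idx, size=k, replace=False)
        Xp[chosen] = obfuscate(Xp[chosen], tgt_pool)
        clf = RandomForestClassifier(random_state=seed).fit(Xp, y_tr)
        pred = clf.predict(X_va)
        results[p] = (pred[y_va == src] == src).mean()
    return results
```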

6. Conclusion and future work

In this research, we classified the CIC-Darknet2020 network traffic samples using a wide variety of classifiers. We classified based on four traffic classes and eight application classes, while fine-tuning the classifier hyperparameters. We experimented with different levels of SMOTE to assess class imbalance in the dataset and explored 2-D representations of the traffic features for CNN and AC-GAN. We also approached the issue of darknet detection adversarially, from the perspective of an attacker hoping to confuse our best classifier. We demonstrated that we could effectively obfuscate application class traffic features. We then correlated the underlying statistics of the CIC-Darknet2020 dataset with the performance of this algorithm under specific hypothetical attack scenarios, for added realism.

Among the tested machine learning classifiers, Random Forest was found to be the most proficient at classifying darknet traffic for both traffic and application types. It yielded a 99.8% F1-score for traffic classification and a 92.2% F1-score for application classification, outperforming the state-of-the-art studies on CIC-Darknet2020 (Iliadis and Kaifas, 2021; Sarwar et al., 2021). Figure 13 provides a visual comparison of our best results with those of prior work.

Fig. 13. Classification results compared to previous work.

Our research was limited by the availability of darknet traffic datasets. We selected the CIC-Darknet2020 dataset because it is frequently cited and publicly accessible; however, the dataset suffers from a substantial class imbalance. We attempted to compensate for this imbalance by generating artificial samples with AC-GAN and SMOTE. The artificial SMOTE samples marginally improved our classification results. Seeking to improve the quality of artificial samples, we assessed AC-GAN as a sample generator; however, our AC-GAN-generated samples were not useful for data augmentation purposes. An approach that future researchers might consider is to use clustering to group samples within a class, then train one GAN per cluster to generate samples. Other variations of GAN might also be better suited for multiclass sample generation and could conceivably generate more realistic samples.
We kept our obfuscations fairly basic, the goal being to demonstrate that we could confuse our best classifier, with few restrictions imposed on the hypothetical attacker. Under more realistic attack scenarios, it may not be possible to so easily modify the features that define darknets such as Tor and VPN, but it would be possible to obfuscate traffic features at the application layer, such as those produced by CICFlowMeter analysis. We introduced a loose correlation to one statistical metric, an independent sum of distances between DCPD across all sample features. We noted that 2 out of the 3 classes we chose to obfuscate were misclassified, not as the intended classes, but with a majority of predictions distributed among other classes. This results from the fact that our obfuscation metric does not account for the statistical relationship between more than two classes, nor does it account for any dependency between the CIC-Darknet2020 feature values.

There is much more work that could be done to extend the adversarial obfuscation analysis presented in this paper. Real traffic features could be modified on live network traffic (e.g., changing IP addresses, ports, packet lengths, or intervals), or select features could be prohibited from modification during obfuscation, which is likely to be a realistic constraint. An even larger task is to explore the dependency between features in order to anticipate counterattacks. One possible avenue that future research could take with respect to the CIC-Darknet2020 dataset is to develop an obfuscation method that exploits Random Forest feature importance or the weights of a linear SVM. This might better correlate classifier response with dataset statistics. We only tested our obfuscation method using our best-performing classifier. It would also be interesting to explore how other classifiers respond to similar obfuscation techniques, so as to determine which classifiers are most robust to such attacks.

Author contribution

Mark Stamp proposed and guided the research, and edited the paper.

Nhien Rust-Nguyen performed the majority of the experiments, developed some of the key ideas used in this research, and wrote the first draft of the paper.

Shruti Sharma completed several of the experiments included in the paper.

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Mark Stamp reports financial support was provided by San Jose State University. Nhien Rust-Nguyen reports financial support was provided by San Jose State University.

Data availability

Data will be made available on request.

References

Bhagat, R.C., Patil, S.S., 2015. Enhanced SMOTE algorithm for classification of imbalanced big-data using random forest. In: 2015 IEEE International Advance Computing Conference, pp. 403–408.

Branwen, G., Christin, N., Décary-Hétu, D., Andersen, R.M., StExo, Presidente, E., Anonymous, Lau, D., Sohhlz, Kratunov, D., Cakic, V., Buskirk, V., Whom, McKenna, M., Goode, S., 2015. Dark net market archives, 2011–2015. https://www.gwern.net/DNM-archives.

Convolutional Neural Networks for Visual Recognition, 2022. https://cs231n.github.io/convolutional-networks.

Demertzis, K., Tsiknas, K., Takezis, D., Skianis, C., Iliadis, L., 2021. Darknet traffic big-data analysis and network management for real-time automating of the malicious intent detection process by a weight agnostic neural networks framework. https://arxiv.org/abs/2102.08411.

DeVries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. https://arxiv.org/abs/1708.04552.

Dingledine, R., Mathewson, N., Syverson, P., 2004. Tor: the second-generation onion router. In: 13th USENIX Security Symposium (USENIX Security 04). https://www.usenix.org/conference/13th-usenix-security-symposium/tor-second-generation-onion-router.

Gil, G.D., Lashkari, A.H., Mamun, M., Ghorbani, A.A., 2016. Characterization of encrypted and VPN traffic using time-related features. In: 2nd International Conference on Information Systems Security and Privacy, pp. 407–414.

Hu, Y., Zou, F., Li, L., Yi, P., 2020. Traffic classification of user behaviors in Tor, I2P, ZeroNet, Freenet. In: 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications, pp. 418–424.

Iliadis, L.A., Kaifas, T., 2021. Darknet traffic classification using machine learning techniques. In: 2021 10th International Conference on Modern Circuits and Systems Technologies (MOCAST), pp. 1–4.

imblearn, 2022. imblearn 0.0. https://pypi.org/project/imblearn/.

Lashkari, A.H., 2018. CICFlowMeter-V4.0 (formerly known as ISCXFlowMeter): a network traffic bi-flow generator and analyser for anomaly detection. https://github.com/ISCX/CICFlowMeter.

Lashkari, A.H., Draper-Gil, G., Mamun, M.S.I., Ghorbani, A.A., 2017. Characterization of Tor traffic using time based features. In: 3rd International Conference on Information System Security and Privacy, pp. 253–262.

Lashkari, A.H., Kaur, G., Rahali, A., 2020. DIDarknet: a contemporary approach to detect and characterize the darknet traffic using deep image learning. In: Proceedings of 10th International Conference on Communication and Network Security, pp. 1–13.

Li, J., Chang, H.-C., Stamp, M., 2021. Free-text keystroke dynamics for user authentication. https://arxiv.org/abs/2107.07009.

Misra, S., Li, H., 2020. Noninvasive fracture characterization based on the classification of sonic wave travel times. In: Misra, S., Li, H., He, J. (Eds.), Machine Learning for Subsurface Characterization. Elsevier, pp. 243–287.

Mudavathu, K.D.B., Rao, M.V.P.C.S., Ramana, K.V., 2018. Auxiliary conditional generative adversarial networks for image data set augmentation. In: 2018 3rd International Conference on Inventive Computation Technologies, pp. 263–269.

Nagaraju, R., Stamp, M., 2021. Auxiliary-classifier GAN for malware analysis.

Odena, A., Olah, C., Shlens, J., 2017. Conditional image synthesis with auxiliary classifier GANs. In: Proceedings of the 34th International Conference on Machine Learning, ICML, Vol. 70, pp. 2642–2651.

Santurkar, S., Tsipras, D., Ilyas, A., Madry, A., 2018. How does batch normalization help optimization? In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2488–2498.

Sarkar, D., Vinod, P., Yerima, S.Y., 2020. Detection of Tor traffic using deep learning. In: Proceedings of IEEE/ACS 17th International Conference on Computer Systems and Applications, pp. 1–8.

Sarwar, M.B., Hanif, M.K., Talib, R., Younas, M., Sarwar, M.U., 2021. DarkDetect: darknet traffic detection and categorization using modified convolution-long short-term memory. IEEE Access 9, 113705–113713.

Scikit-learn: Machine Learning in Python, 2022. https://scikit-learn.org/stable/index.html.

sklearn.ensemble.RandomForestClassifier, 2022. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

Stamp, M., 2022. Introduction to Machine Learning with Applications in Information Security, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL.

Synced, 2017. Tree boosting with XGBoost — why does XGBoost win "every" machine learning competition? https://syncedreview.com/2017/10/22/tree-boosting-with-xgboost-why-does-xgboost-win-every-machine-learning-competition/.

Tor Project History, 2006. https://www.torproject.org/about/history/.

Venkateswaran, R., 2001. Virtual private networks. IEEE Potentials 20 (1), 11–15.


Nhien Rust-Nguyen received her master's in computer science in May 2022. Her research interests are in applications of machine learning and deep learning.

Shruti Sharma received her master's in data science in December 2022. Her research interests are in applications of machine learning and deep learning.

Mark Stamp is a professor of computer science at San Jose State University. His primary research focus is on problems at the interface between information security and machine learning. He has published more than 150 research articles and textbooks in information security (Information Security: Principles and Practice, 3rd edition, Wiley, September 2021) and machine learning (Introduction to Machine Learning with Applications in Information Security, 2nd edition, Chapman and Hall/CRC, May 2022).
