PREUNN
Keywords: protocol reverse engineering, artificial intelligence, machine learning, neural networks, fuzzing
Abstract: The ability of neural networks to universally approximate any function enables them to learn relationships between arbitrary kinds of data. This offers great potential in information security topics such as protocol reverse engineering (PRE), which has seen little usage of neural networks (NNs) so far. In this paper, we provide a novel approach for implementing PRE solely with NNs, demonstrating a simple yet effective reverse engineering of text-based protocols. This approach is modular by design and allows for the exchange of the neural network model at any step with a better performing one. The architectures used include a convolutional neural network (CNN), an autoencoder (AE), a generative adversarial net (GAN), a long short-term memory (LSTM), and a self-organizing map (SOM). All of these models combine into a new protocol reverse engineering approach. The results show that the widespread application layer protocols HTTP and FTP can successfully be mimicked by artificial intelligence, thereby paving the way for use cases such as fuzzing. A direct comparison to other PRE approaches is not possible due to the black-box nature of neural networks and represents the main limitation of our work. Our experiments showed that this multi-model approach yields up to 19% better message clustering and improved context distribution, and proves the LSTM to be the best candidate for generating new messages, with up to 67.6% valid HTTP packages and 100% valid FTP packages.
2 BACKGROUND

As we attempt to combine two areas of modern research in this paper, a brief overview of both areas' theoretical backgrounds is given in this section.

2.1 Protocol Reverse Engineering (PRE)

Communication over the internet and other digital networks requires standardized protocols to ensure uniform behavior. These protocols are generally used in a stack with multiple layers, where each layer has its specific tasks and serves as the operational basis for the layers above. The widely known ISO/OSI 7-layer model works as the standard guideline for such a protocol stack (TC97, 1984).

Protocols can be divided into four types: text-based or binary and stateful or stateless. PRE aims to learn as much as possible about the specifications of an unknown protocol by analyzing the artifacts that the communication creates. This includes, but is not limited to, message formats, syntax, grammar, constants and keywords, message types, implicit or explicit state machines, and more. The taxonomy shown in Table 1 lists the two possible kinds of artifacts: network traffic messages or system calls inside the binary of an application that communicates using the target protocol. The two most commonly inferred properties are the general message format/syntax and the underlying state automaton of a stateful protocol. A successfully reverse-engineered protocol allows for further analysis of the communication such as deep packet inspection (Brook, 2018) and fuzzing (Besic, 2019). Both can work more effectively if they have the detailed specification of the protocol.

Two prominent PRE examples are Prospex (Comparetti et al., 2009) and Discoverer (Cui et al., 2007). They were published in the middle of the major period of PRE research between 2004 and 2013 (Duchene et al., 2018; Narayan et al., 2015). From our initial literature research, we judge both to be good representations of their respective PRE classes, as mentioned in Table 1.

Table 1: Taxonomy for classifying PRE approaches by requirements (columns) and by results (rows).

                                     Only network messages            Messages and executable binary at runtime
Inferring message format/grammar     Discoverer (Cui et al., 2007)    Wondracek (Wondracek et al., 2008)
Inferring state automaton            PREUNN                           Prospex (Comparetti et al., 2009)

Protocol Specification Extraction (Prospex)  Prospex is a two-part PRE approach basing its analysis on both network messages and execution traces of a binary at runtime to infer the message format (Wondracek et al., 2008; Caballero et al., 2007) and the state automaton in a second part (Comparetti et al., 2009). The highly distinctive features selected for the clustering step are among the core reasons for its success. They include file system reactions and memory analysis by tainting bytes. Therefore, the similarity between messages is not only computed by their format, but also by the reaction and response they create. Next, an acceptor automaton was designed to identify valid sequences of messages. This automaton was later reduced using the exbar algorithm (Lang, 1999). The final tool was extended to be compatible with the fuzzing tool Peach Fuzzer (http://peachfuzzer.com). As a direct result, fuzzing can be improved on stateful protocols. We learn from Prospex how a separation of tasks such as the extraction of features, clustering using an off-the-shelf method, and finally reverse engineering is key in PRE.

Discoverer  Discoverer identifies small clusters of tokens within a message, which are then merged and refined recursively. For this, it relies on the assumption that protocols use common delimiter symbols to separate parts of their message format, such as commas, whitespaces, or line breaks. The interdependence of various fields (e.g., for addresses/variables) in the message format and their data type (text or binary) is then learned by heuristics. We can see the importance of identifying key features and their influence on clustering in the Discoverer approach.

2.2 Neural Network Architectures

All the models we use in this paper are types of neural networks. They consist of layers of interconnected neurons, where each connection has a weight that adjusts the importance of the connection for the next layer. These weights are adjusted to minimize an error function during training, but remain fixed for testing and thereafter. This concept allows for complicated mappings of input data distributions to output results according to the universal approximation theorem (Hornik et al., 1989). As such, neural network architectures are suitable tools for different data operations such as classification, regression, clustering, and more. In the following paragraphs, we describe the different architectures used in the experimentation of Section 4, what problems they are good at solving, and why we considered them for PRE.

Convolutional Neural Network (CNN)  CNNs became popular for their performance in image recognition tasks (Krizhevsky et al., 2012). The layers in this architecture have a particular property: they are used in a sliding window method over the input data. Their weights for each of the sliding window steps remain the same and can therefore be used to detect small, fixed patterns.
Many such "kernels" exist in parallel to each other in separate "channels" of one layer. This structure allows individual kernels to approximate highly descriptive feature filters, thereby removing the traditional step of manual feature extraction from the classification task. This architecture is beneficial in any task involving images/pixel maps with individual features occupying many adjacent pixels. However, in theory, this concept can be applied to any form of input data with patterns.

Autoencoder (AE)  Autoencoders are neural networks designed to achieve dimensionality reduction for a given input while retaining as much information as possible. In training, the loss function demands that the output equals the input as closely as possible. The architecture itself has a bottleneck in a middle layer to force the network to compress data but keep reconstructable information (Hinton and Salakhutdinov, 2006). The middle layer's size has to balance the degree of compression against keeping relevant information in some unknown encoding. Naturally, this divides the neural network into two components, namely the encoder and the decoder part. After the AE has been trained, the decoder element is removed so that any input data will be returned in its encoded form only.
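As an illustration of this encoder/decoder split, the following is a minimal PyTorch sketch of a bottleneck autoencoder for fixed-length messages; the fully connected design and the layer sizes are illustrative assumptions rather than the exact architecture used in our experiments.

```python
import torch.nn as nn

class MessageAutoencoder(nn.Module):
    """Minimal bottleneck autoencoder sketch; sizes are illustrative assumptions."""

    def __init__(self, msg_len=1024, bottleneck=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(msg_len, 512), nn.ReLU(),
            nn.Linear(512, bottleneck),        # compressed middle-layer representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 512), nn.ReLU(),
            nn.Linear(512, msg_len),           # reconstruction of the original input
        )

    def forward(self, x):
        # Training minimizes a reconstruction loss (e.g., MSE) between x and the output.
        return self.decoder(self.encoder(x))
```

After training, only the encoder part would be kept, so that every message is mapped to its compressed encoding.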
Generative Adversarial Network (GAN)  GANs were developed to create a generative model for a learned data distribution. The peculiarity of this architecture comes from the use of two competing networks (Goodfellow et al., 2014). The first, known as the generator, uses random noise as input and tries to create an image similar to those found in the training dataset. The second network, known as the discriminator, is given either a true or a fake image randomly and has to learn the distinction between them by merely classifying true from fake. The error correction for the generator is based on the classification result of the discriminator. This combination of two networks causes a setup of competitive learning where each NN tries to outdo the opposing one.

Long Short-Term Memory (LSTM)  Recurrent neural networks (RNNs) describe a type of architecture where some part of the internal hidden or input state is recursively fed into the network again for the next time step, thereby allowing the network to find time dependencies inside the data. LSTMs are a kind of RNN with a strong ability to find sequential dependencies over time while avoiding some common problems with recurrent error correction (Hochreiter and Schmidhuber, 1997). For working with text data (sequential letters), it is common to use a dictionary or alphabet along with a suitable embedding to condense the information into a medium-sized vector for each word or letter. The text is transformed into a matrix of size length_text × length_embedding.

Self-Organizing Map (SOM)  SOMs are architectures with one-dimensional or two-dimensional output maps of neurons that capture the topology of the data (Kohonen, 1982). This means that data with similar properties and distribution of features will be found in the same general area of the output map. Thereby, clustering is implied. SOMs give each output neuron a score related to the data. The winning node can be returned for indexing.

3 MAIN APPROACH

This section presents a novel way of looking at the underlying task structure of reverse engineering a protocol. It is divided into multiple steps that are handled mostly sequentially. The approach is designed to fit the capabilities of various neural network architectures and provides modularity. This allows for the replacement of any model in the system by a better performing one, thereby enabling novel AI designs to be inserted directly into the process.

3.1 Data Gathering

To train a neural network, a representative dataset is required. We chose a set of text-based application layer protocol artifacts as a basis to allow for an easier result evaluation, as we do not have a direct comparison with other PRE approaches. The chosen protocols are HTTP v1.1 and FTP because they are commonly used, abundantly available, and lack encryption. We use several sources of datasets to cover a broader mix of implementations and message type distributions.
3.2 Feature Extraction

In this first part, we want to extract highly distinguishable features. Both in Prospex and in Discoverer, the feature selection was an essential part of the work, but the features were chosen by the researchers. We intend to automate this process with neural networks. Of particular interest are keywords, punctuation, syntactic characters, and other patterns that distinguish between different messages. Pseudo-random strings such as tags and cookies are avoided, as they are generally unrelated to the protocol specifications.

3.3 Feature Reverse Engineering

In traditional protocol reverse engineering, the analysis infers rules, lists of variables or constants, and grammar from the communication artifacts. With a neural network, the learned knowledge is intrinsically non-representative, meaning it is challenging for humans to interpret. We use a generative evaluation approach to judge the quality of the features learned (and recreated) by the respective architectures. Such a method will create new samples from the training distribution and provide insights into what the neural networks learned.

3.4 Clustering

Protocol messages can usually be grouped into types, whether or not the protocol explicitly specifies these groups. We can cluster these messages using information like sequential order, functionality, and general format. The clustering would imply various types of messages, and we consider this task to be well-suited for neural networks (Bação et al., 2005).

3.5 State Recognition

A typical communication session usually involves multiple messages being sent or received. In stateful protocols, particular sequences of messages achieve a more complex state between the communication parties. These sequential patterns imply a representative state automaton for the inner state of the protocol. We think that a neural network with the capability of understanding sequential dependencies should be able to learn the order of different message types and their likelihood. This interpretation allows the usage of time series prediction to imply the next state of the protocol.

3.6 Sequence Generation

As a last step, we combine all trained models into one generative PREUNN AI. Context information such as cluster index and sequential dependencies can be included in the feature reverse engineering to achieve more accurate results. Ideally, this AI is capable of producing valid messages which are not part of the training set but have comparable statistical properties.

4 EXPERIMENTS

This section lists all implementations for the main approach and the experiments that were used to test various neural network architectures. The hyperparameters of all neural networks are set to well-performing numbers after some semi-extensive manual testing. The scope of the project did not include major optimizations, as the hardware was unable to handle large search spaces for automated hyperparameter tuning. Our code is available on GitHub (https://github.com/PREUNN/preunn). All experiments were written in Python 3 using an object-oriented programming style to ensure easy modification and extension of the experiments to new protocols. We selected PyTorch (https://pytorch.org/) as the deep learning framework, and all experiments were conducted on an NVIDIA GTX 970.

4.1 Data Preprocessing

We used two datasets in our tests. The first one consists of the combination of multiple HTTP sources (Garfinkel, 2008; Shiravi et al., 2012; Goo et al., 2019; Sharafaldin et al., 2019) that were selected to cover different implementations and scenarios. The second dataset consists of FTP messages (Pang and Paxson, 2003). Before any experiments were conducted, we examined the data for outliers and irregularities. Data engineering usually takes up a significant amount of time in any machine learning project; however, with network traces in pcap files, we were able to shorten this process. The network analysis tool Wireshark (https://www.wireshark.org/) provides a widely used parser for protocol package analysis. We filtered for valid packages of HTTP and FTP respectively and discarded the rest.

The length of application layer datagrams is limited by the underlying TCP protocol. An analysis of the new pcap file showed that lengthy messages only occur at image transfers (HTTP) or custom messages
(FTP) and can subsequently be cut without significant loss of information. We chose a length limit of 1024 bytes for all packages to make the neural network inputs uniform. Only 0.33% of the HTTP and 0.18% of the FTP data that we use exceed this length limit.

As a further step for the HTTP experiments, all content after the message header was removed to avoid XML and other non-HTTP data. We achieved this by splitting each statement at every "\r\n\r\n" occurrence and only using the first element of that split. This double line break is the HTTP marker that only payload follows, and thus represents a convenient way to clean the data. No filter was applied to the FTP data.
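A minimal sketch of this cleaning step is shown below; the function name and the zero-byte padding are illustrative assumptions, while the split at the first "\r\n\r\n" and the 1024-byte limit follow the description above.

```python
def clean_http_statement(raw: bytes, msg_len: int = 1024) -> bytes:
    """Keep only the HTTP header block and enforce the fixed message length."""
    header = raw.split(b"\r\n\r\n")[0]               # drop everything after the first double line break
    return header[:msg_len].ljust(msg_len, b"\x00")  # cap and pad to 1024 bytes
```

FTP statements would skip the split and only be capped and padded.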
Dataset bias is a common pitfall when developing machine learning solutions. It describes an uneven distribution of classes within the data, which causes suboptimal feature learning in the neural networks. The protocols themselves do not specify classes of messages directly; however, the manually created classifications for both protocols are our baselines for orienting the class balance. It is desirable that the neural networks still learn which types are more common than others, but the imbalance in our dataset is overwhelmingly in favor of two or three common message types. To balance the number of messages per class, we came up with this improvised formula:

    No_samples_per_class = sqrt(No_occurrences) * 100    (1)

and visualized the distributions to see the effect in figures 1 and 2.

Figure 1: HTTP dataset distributions in original (blue) and balanced (green). The overwhelming bias in favor of GET messages has been mitigated while preserving the notion of this type being the most common.

Figure 2: FTP dataset distributions in original (blue) and balanced (green). With three dominant types and several underrepresented types, the rebalancing smoothed the distribution significantly while preserving tendencies.

The packages for each class were selected randomly until the limit was reached. This includes multiplications of rare messages to get a significant sample size in every class. Tests without this dataset balancing have shown strong signs of overfitting in most experiments. We did not simply set all classes to the same limit, as the notion of common and uncommon should not be lost.
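A sketch of this rebalancing step is given below, assuming messages and their (manually assigned) class labels are available as parallel lists; drawing with replacement reproduces the multiplication of rare messages described above.

```python
import math
import random
from collections import defaultdict

def balance_dataset(messages, labels):
    """Rebalance classes according to Equation (1): sqrt(occurrences) * 100 samples per class."""
    by_class = defaultdict(list)
    for msg, lab in zip(messages, labels):
        by_class[lab].append(msg)

    balanced = []
    for lab, msgs in by_class.items():
        target = int(math.sqrt(len(msgs)) * 100)             # Equation (1)
        balanced += [(random.choice(msgs), lab) for _ in range(target)]
    random.shuffle(balanced)
    return balanced
```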
4.2 Feature Extraction

…sults from clustering to evaluate both architectures. We only defined auxiliary metrics for each experiment.
…rate = 0.0005, the loss function is Mean Square Error (MSE), and the batch size = 128 for 10 epochs.

Convolutional Neural Network  CNNs are commonly used for supervised image classification. Our training data does not contain any labels that can be used for supervised learning. We therefore came up with our own unsupervised learning technique using data augmentation. Messages are replicated and modified into several known classes of augmentation types to extract information about the syntactic context. Our idea is that the ability of a CNN to differentiate between these augmentation classes will teach it to become fine-tuned to common patterns in the syntax. HTTP or FTP statements are again padded/capped to a fixed length of 1024, then divided into segments of various lengths (1024, 32, 16, 8, 4), which are then put into random order within the same statement. This approach results in 5 classes (unchanged and scrambled into blocks of length 32, 16, 8, and 4, respectively). A simplified example is illustrated in figure 3.
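The augmentation itself can be sketched as follows; the zero-byte padding and the class numbering are assumptions, while the block lengths follow the description above.

```python
import random

BLOCK_SIZES = [32, 16, 8, 4]                   # classes 1-4; class 0 keeps the statement unchanged

def augment(statement: bytes, label: int, msg_len: int = 1024) -> bytes:
    """Create one training sample of the given augmentation class."""
    data = statement[:msg_len].ljust(msg_len, b"\x00")
    if label == 0:
        return data                            # unchanged (the "1024" case)
    block = BLOCK_SIZES[label - 1]
    chunks = [data[i:i + block] for i in range(0, msg_len, block)]
    random.shuffle(chunks)                     # blocks reordered within the same statement
    return b"".join(chunks)
```

Training the CNN to predict the class label of such samples forces it to pick up the syntactic regularities that the scrambling destroys.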
We chose an architecture that uses blocks of 1D convolution, 1D batch normalization, softplus activation, and 20%-dropout in a total of 5 layers, following the channel sizes 1 → 128 → 64 → 32 → 16 → 8. The classification task was performed by two fully connected layers, which were removed after the training to get a feature map of 8 × 30 = 240 neurons, which is larger than that of the autoencoder. We used a visualization of the learned features as an additional analytical method to evaluate the performance. A variety of methods for model explanation have been proposed for image classification tasks. We use the visualization of pixel importance known as Grad-CAM (Selvaraju et al., 2017) as our quality measurement for the syntactical features found by the model. Our experiments showed mixed results. Regarding HTTP, we see that various parts of a protocol message are highlighted differently, see figure 4. The highlighted parts often match syntactically relevant pieces, while pseudo-random strings are ignored. This is precisely the kind of syntactical feature extraction that is desired for this experiment.

Figure 4: Example of HTTP statement classification score visualized by Grad-CAM. The padding has been cut off. The semantically significant and consistent parts are highlighted, while pseudo-randomness in the ETag is ignored. The text highlighting in this figure is an approximation.

For FTP, the results are less visible in Grad-CAM, as the average message length is much smaller. The overall convergence of the experiment was also much flatter than that of the HTTP version.
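A sketch of a feature-extraction CNN matching that description is given below; the kernel size and stride (4 and 2) and the hidden size of the classifier are assumptions, chosen so that five blocks reduce a 1024-byte input to the 8 × 30 = 240 feature map mentioned above.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # kernel_size=4 / stride=2 is an assumption that reproduces the 8 x 30 feature map
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=4, stride=2),
        nn.BatchNorm1d(c_out),
        nn.Softplus(),
        nn.Dropout(0.2),
    )

class FeatureCNN(nn.Module):
    """Five convolution blocks (1 -> 128 -> 64 -> 32 -> 16 -> 8) plus two fully
    connected layers for the five augmentation classes; the classifier is removed
    after training so the 240-neuron feature map can be reused."""

    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 128), conv_block(128, 64), conv_block(64, 32),
            conv_block(32, 16), conv_block(16, 8),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(240, 64), nn.ReLU(), nn.Linear(64, num_classes),
        )

    def forward(self, x):              # x: (batch, 1, 1024) byte values
        return self.classifier(self.features(x))
```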
4.3 Feature Reverse Engineering

A common result of protocol reverse engineering is the representation of the target protocol in the form of rules or clusters. Neural networks, however, do not allow us to directly visualize their internal representation of the features they extracted. To have a tangible result outside clustering and sequence recognition, we want to be able to generate new messages as proof of the correctness of the feature learning ability. We achieve this by using generative neural network models and examining their outputs. There are two possible choices of how a text message can be interpreted: an image-like byte interpretation (as in the feature extraction before) or a sequential interpretation as a sequence of ASCII alphabet symbols. We investigated both alternatives, using a default GAN architecture for the image-like byte interpretation, roughly following the suggestions given by the original authors (Goodfellow et al., 2014; Radford et al., 2015), and an LSTM model for the interpretation as a sequence of ASCII symbols, with a modified embedding to account for the randomness in some parts of protocol messages (cookies, addresses, etc.).

Generative Adversarial Net  The two networks that a GAN consists of, the generator and the discriminator, are trained in parallel. The generator uses four 1D transposed convolution layers with (kernel size, stride) parameter tuples of (2, 2), (4, 4), (8, 8), and (16, 16) in ascending order. The first three layers are each followed by a 1D batch normalization and a ReLU activation, while the last one ends with a sigmoid function. The number of channels is as follows: 1024 → 1024 → 128 → 32 → 1. The discriminator uses four 1D convolution layers with 1D batch normalization (except for the first layer), LeakyReLU with a 0.2 slope, and 20%-dropout. The network is capped with a 360-neuron fully connected layer. The channels are as follows: 1 → 10 → 20 → 60 → 90 → 1. We use a simple rule to keep either network from overtaking the other in training: if one network's error rises above a threshold, or the other's falls below a particular bound, the overperforming network is removed from training until the other catches up. Both use an Adam optimizer with a learning rate = 0.0005 and betas = (0.5, 0.99). As a loss function, we used Binary Cross-Entropy (BCE).
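The following PyTorch sketch is one way to realize the two networks described above; the discriminator's kernel size and stride are not specified in the text and are assumed to be 4 here, which makes 90 channels × 4 positions line up with the 360-neuron fully connected layer.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Maps a (batch, 1024, 1) noise tensor to a (batch, 1, 1024) byte image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(1024, 1024, kernel_size=2, stride=2), nn.BatchNorm1d(1024), nn.ReLU(),
            nn.ConvTranspose1d(1024, 128, kernel_size=4, stride=4), nn.BatchNorm1d(128), nn.ReLU(),
            nn.ConvTranspose1d(128, 32, kernel_size=8, stride=8), nn.BatchNorm1d(32), nn.ReLU(),
            nn.ConvTranspose1d(32, 1, kernel_size=16, stride=16), nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    """Classifies (batch, 1, 1024) inputs as real or fake; kernel/stride 4 is an assumption."""
    def __init__(self):
        super().__init__()
        def block(c_in, c_out, norm=True):
            layers = [nn.Conv1d(c_in, c_out, kernel_size=4, stride=4)]
            if norm:
                layers.append(nn.BatchNorm1d(c_out))
            return layers + [nn.LeakyReLU(0.2), nn.Dropout(0.2)]
        self.net = nn.Sequential(
            *block(1, 10, norm=False), *block(10, 20), *block(20, 60), *block(60, 90),
            nn.Flatten(), nn.Linear(360, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)
```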
The GAN model's training did not indicate a significant convergence towards a stable state. As the pixel interpretation's inaccuracy limits the GAN from the start, combined with this architecture's generally unstable nature, we are not surprised by this disappointing result. Creating text from a continuous data interpretation only sounded promising to us when we considered the static structure used by a protocol. However, not even the padding symbols have successfully been replicated with any significant accuracy (about ±3 ASCII values) by the GAN model. Further experiments using this architecture are omitted.

Long Short-Term Memory  LSTMs are advanced recurrent networks, trained on sequences of fixed lengths. To create such conditions, padding is out of the question, as it would insert undesirable sequential dependencies. We instead use a script which concatenates statements to achieve more than four times the required length, then chooses a substring of the correct length starting from a random position. This script is only used during training to teach the sequential dependencies from letter to letter. Before and after each statement, a unique symbol for start-of-package (SOP) and end-of-package (EOP) is inserted so that the network can learn to distinguish between different messages. The data is represented as a one-hot encoding of the ASCII alphabet. The architecture uses an embedding layer followed by a convolution layer (kernel size = 4, stride = 4) for embedding adjustment. This trick is introduced here as convolutional embedding, designed to balance between character-based and word-based embedding in data formats with a lot of random noise on the character level. The hidden size of the embedding is flipped with the feature-length dimension of the tensor; the convolution layer interprets the hidden width as channels and the feature-length dimension as image dimensions. Only the batch size remains the same. This dimension transposition is reversed after the convolution, resulting in an embedded tensor with a quarter of the length. This is given as input to a single-layered LSTM. The output goes through the reversed procedure of the convolutional embedding in a 1D transposed convolution and a fully connected layer with the same hyperparameters and dimension transpositions. Figure 5 shows an overview of how this embedding works. The way to interpret this kind of embedding is as a learnable, weighted 4-gram of characters, where the whole architecture only has to learn the next character (from 0–1023 to 1–1024 by index). This local context inside the 4-gram and the long-term dependencies, which have been shortened by a factor of 4, are easier for the model to learn and more stable to sample. For training, an Adam optimizer with a learning rate = 0.005 and standard betas is used. Negative Log-Likelihood (NLL) is used as the loss function since the error is measured on the character level. This, of course, requires the input data mentioned above to consist of message strings whose length is a multiple of 4.
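The convolutional embedding can be sketched as follows; the embedding width and hidden size are illustrative assumptions, while the 4:1 convolution, the dimension flips, the single-layer LSTM, and the reversing transposed convolution follow the description above.

```python
import torch.nn as nn

class ConvEmbeddingLSTM(nn.Module):
    """Learnable, weighted 4-gram embedding feeding a single-layer LSTM."""

    def __init__(self, vocab_size=128, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 4:1 convolution over the character axis; the embedding width acts as channels.
        self.conv = nn.Conv1d(embed_dim, embed_dim, kernel_size=4, stride=4)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        # Reverse procedure: back to character resolution, then one ASCII class per position.
        self.deconv = nn.ConvTranspose1d(hidden_dim, hidden_dim, kernel_size=4, stride=4)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):                        # x: (batch, 1024) character indices
        e = self.embed(x)                        # (batch, 1024, embed_dim)
        e = self.conv(e.transpose(1, 2))         # flip dims, 4:1 convolution -> (batch, embed_dim, 256)
        h, _ = self.lstm(e.transpose(1, 2))      # (batch, 256, hidden_dim)
        h = self.deconv(h.transpose(1, 2))       # back to length 1024
        return self.out(h.transpose(1, 2))       # (batch, 1024, vocab_size) next-character logits
```

The logits at position i are trained against the character at position i+1 with an NLL/cross-entropy loss, matching the 0–1023 to 1–1024 shift described above.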
This sequence-based attempt at recreating HTTP and FTP statements shows good results. The LSTM architecture converges towards a minimal loss after …

…sis tool Wireshark showed 67.6% of the HTTP messages as valid, with the remaining ones classified as TCP with a random payload. For FTP, a 100% quota was reached; however, FTP messages can be rather simple and still be valid. A few messages that are repeated often in the training data also appear in the output of this experiment. Some generated examples can be seen in figure 6. We see these results as a sound basis for further experimentation.

Figure 5: Overview of the convolutional embedding (a batch of character sequences is passed through 4:1 convolutions into a batch of 1D images with flipped tensor dimensions).

Figure 6 (generated example):
HTTP/1.1 200 OK
Date: Mon, 14 Jun 2010 13:20:25 GMT
Server: Apache
Last-Modified: Mon, 21 Jun 2010 14:18:09 GMT
ETag: "2de9573-2b-486717fb77ac0"
Accepted-Ranges: bytes
Content-Length: 43
Connection: close
Content-Type: text/html; charset=iso-8859-1

4.4 Clustering

For clustering, the feature extraction results are relevant to encode the data samples into a smaller format. We initialize three SOMs and compare their results: a baseline model with messages capped and padded to a fixed length of 1024, a second SOM model using the AE encoding, and a third model using the CNN feature map. These three models differ in their input size but have an identical output map dimension of 16 × 1 for HTTP and 64 × 1 for FTP. The different numbers for the output dimensions of the two protocols are based on experimentation and roughly match the variety of different message types for each protocol. This can be adjusted by a parameter for any new protocol or for optimization purposes. Training is performed with a learning rate = 0.005 and sigma = 1.5 for HTTP and sigma = 3 for FTP. For details on the parameters, please see the "MiniSom" library documentation (https://github.com/JustGlowing/minisom).

For this evaluation to work, we have to know the true classes (a synonym for types) of both protocols in advance. This is not straightforward, as neither protocol specifies explicit types/groups of messages apart from requests and responses. For HTTP, the responses especially have a wide range of meanings, which we grouped by their respective code's first digit for this analysis, as it represents the basic meaning of the code without going into too much detail. This means that all messages from 200–299, all 300–399, and all 400–499 messages are considered to be of the same type, respectively. Along with all the valid keywords that an HTTP statement can start with (GET, POST, HEAD, DELETE, OPTIONS, PUT, TRACE, CONNECT), this gives us a total of 11 clusters with an additional miscellaneous one (MISC). For FTP, we manually defined groupings of keywords and key-codes with similar meaning or purpose, as shown in Table 2. This was a manual process, and we do not claim to have achieved perfection with this grouping.

Table 2: Manually defined groupings of FTP keywords and key-codes with similar meaning or purpose.
1  ACCT, ADAT, AUTH, CONF, ENC, MIC, PASS, PBSZ, PROT, QUIT, USER
2  230, 331, 332, 530, 532
3  PASV, EPSV, LPSV
4  227, 228, 229
5  ABOR, EPRT, LPRT, MODE, PORT, REST, RETR, TYPE, XSEM, XSEN
6  125, 150, 221, 225, 226, 421, 425, 426
7  ALLO, APPE, CDUP, CWD, DELE, LIST, MKD, MDTM, PWD, RMD, RNFR, RNTO, STOR, STRU, SYST, XCUP, XMKD, XPWD, XRMD
8  212, 213, 215, 250, 257, 350, 532
9  120, 200, 202, 211, 214, 220, 450, 451, 452, 500, 501, 502, 503, 504, 550, 551, 552, 553, 554, 555

For the experimentation on clustering, we use the first multi-model approach. We compare three different configurations of self-organizing maps in terms …
Table 3: Results of the clustering experiments for comparison. One can see an improvement over the baseline model when using the Autoencoder. (a) HTTP clustering.

[Accompanying figure: messages of 1024 chars/bytes are either padded/capped or encoded by the Autoencoder (AE) down to 128 chars/bytes before clustering.]
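For reference, a minimal sketch of the SOM clustering setup from Section 4.4 follows, assuming the MiniSom library with the HTTP settings given there; the encoded input array and the iteration count are illustrative.

```python
import numpy as np
from minisom import MiniSom

# Stand-in for real message encodings: padding/capping (1024), AE encoding, or CNN feature map (240).
encoded = np.random.rand(1000, 128)

som = MiniSom(16, 1, encoded.shape[1],        # 16 x 1 output map for HTTP (64 x 1 for FTP)
              sigma=1.5, learning_rate=0.005)
som.train_random(encoded, 10000)              # iteration count is an assumption

# The index of the winning node serves as the cluster label of each message.
labels = [som.winner(vec)[0] for vec in encoded]
```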
…applying the same efforts to a stateless protocol shows the weakness and dangerous pitfalls of interpreting the metric. For both protocols, the metric is misleading at face value. Only an analytic look at the actual prediction results corrected the error.
…score among the output to ensure syntactically correct objects. The second strategy takes the network's softmax output as a probability distribution for picking the next letter. This increases the fuzzing coverage but may result in many ill-formed objects. Lastly, the SampleSpace strategy combines the two previous ones. No-Sample is used until a whitespace is generated. The following character is then sampled by probability, only to switch back to the No-Sample strategy again. This creates more syntactically correct objects while also creating a more comprehensive mix of different objects. These strategies were used in the LSTM sampling. Our sampling, however, has to cover message context as well.

Machine Learning for Black-Box Fuzzing of Network Protocols  The authors (Fan and Chang, 2017) see their work as the first attempt to combine an in-depth learning approach with black-box fuzzing. They deploy a sequence-to-sequence model, a two-part LSTM model invented for machine translation (Sutskever et al., 2014), to learn the semantics of network protocols and create new output for fuzzing purposes. Their work differs from this paper in the number of steps taken and models employed to learn details about the protocol. Here, a whole model is presented, which was derived from previous work on PRE. Additional information such as clustering or context-based generation is missing as well.

GANFuzz: A GAN-based Industrial Network Protocol Fuzzing Framework  The GANFuzz framework (Hu et al., 2018) represents an alternative approach to protocol fuzzing using deep learning but falls into the same category as the previous paper. Instead of a sequence-to-sequence model, GANFuzz uses a model known as a SequenceGAN (SeqGAN) (Yu et al., 2017). This is a version of the common GAN model adjusted for text-based data using reinforcement learning. The authors of the paper train one SeqGAN per message type, which they deduce from a variety of clustering heuristics. This makes their approach more similar to PREUNN, but it is still missing several work steps, a concise work step model, and the generation of data by context symbols instead of entirely separate models.

…sifying the message based on these features. Based on the approach, they report a classification accuracy of about 99% on unknown datasets. Their method differs from PREUNN in several ways. They semi-automatically extract features using statistical methods, while we fully automated this process using an autoencoder. Furthermore, their work's goal is to classify unknown protocol messages, whereas we aim to generate new valid messages based on a novel protocol. Classification in PREUNN happens implicitly.

Network Traffic Classification (NTC)  The topic of NTC is related to PRE regarding feature extraction. Several papers combine NTC with ML and NNs (Lopez-Martin et al., 2017; Michael et al., 2017; Li et al., 2018). However, the more complex steps of PRE, like generating new packages for an unknown protocol, are not addressed in those papers.

6 CONCLUSION

PREUNN represents a novel approach for the separation of traditional protocol reverse engineering tasks that can be implemented using only neural networks. In this paper, the widespread application layer protocols HTTP v1.1 and FTP were successfully reverse-engineered using our approach. This highlights the potential of several deep learning architectures, such as AEs for feature extraction, LSTMs for feature reverse engineering, SOMs for clustering, LSTMs for state recognition, and finally a combination of the above for sequence generation. The results achieved include a decent, intuitively agreeable clustering and a context-capable message generation model. We omitted optimizations and testing the use of PREUNN for fuzzing entirely due to time constraints and leave them for future work. Our modular approach allows the use of newer architectures from the field of deep learning, such as natural language processing models (e.g., BERT), for improvements. Further future work could also include the use of reinforcement learning utilizing these pre-trained models to create an automated fuzzer that is capable of reverse engineering any similarly structured message-based language.
REFERENCES

Besic (2019). … mean. https://www.neuralegion.com/fuzzing-what-is-fuzzer/.

Brook, C. (2018). What is deep packet inspection? How it works, use cases for DPI, and more. https://digitalguardian.com/blog/what-deep-packet-inspection-how-it-works-use-cases-dpi-and-more.

Caballero, J., Yin, H., Liang, Z., and Song, D. (2007). Polyglot: Automatic extraction of protocol message format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security, pages 317–329.

Comparetti, P. M., Wondracek, G., Kruegel, C., and Kirda, E. (2009). Prospex: Protocol specification extraction. In 2009 30th IEEE Symposium on Security and Privacy, pages 110–125. IEEE.

Cui, W., Kannan, J., and Wang, H. J. (2007). Discoverer: Automatic protocol reverse engineering from network traces. In USENIX Security Symposium, pages 1–14.

Duchene, J., Le Guernic, C., Alata, E., Nicomette, V., and Kaâniche, M. (2018). State of the art of network protocol reverse engineering tools. Journal of Computer Virology and Hacking Techniques, 14(1):53–68.

Fan, R. and Chang, Y. (2017). Machine learning for black-box fuzzing of network protocols. In International Conference on Information and Communications Security, pages 621–632. Springer.

Garfinkel, S. (2008). Nitroba university harassment scenario. Dataset: https://digitalcorpora.org/corpora/scenarios/nitroba-university-harassment-scenario.

Godefroid, P., Peleg, H., and Singh, R. (2017). Learn&Fuzz: Machine learning for input fuzzing. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 50–59. IEEE.

Goo, Y.-H., Shim, K.-S., Lee, M.-S., and Kim, M.-S. (2019). HTTP and DNS traffic traces for experimenting of protocol reverse engineering methods. http://dx.doi.org/10.21227/tpqf-fe98.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313:504–507.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Hornik, K., Stinchcombe, M., White, H., et al. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366.

Hu, Z., Shi, J., Huang, Y., Xiong, J., and Bu, X. (2018). GANFuzz: A GAN-based industrial network protocol fuzzing framework. In Proceedings of the 15th ACM International Conference on Computing Frontiers, pages 138–145.

Jung, Y. and Jeong, C.-M. (2020). Deep neural network-based automatic unknown protocol classification system using histogram feature. The Journal of Supercomputing, 76(7):5425–5441.

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.

Lang, K. J. (1999). Faster algorithms for finding minimal consistent DFAs. NEC Research Institute, Tech. Rep.

Li, R., Xiao, X., Ni, S., Zheng, H., and Xia, S. (2018). Byte segment neural network for network traffic classification. In 2018 IEEE/ACM 26th International Symposium on Quality of Service (IWQoS), pages 1–10.

Lopez-Martin, M., Carro, B., Sanchez-Esguevillas, A., and Lloret, J. (2017). Network traffic classifier with convolutional and recurrent neural networks for internet of things. IEEE Access, 5:18042–18050.

Michael, A., Valla, E., Neggatu, N. S., and Moore, A. (2017). Network traffic classification via neural networks. Technical Report UCAM-CL-TR-912, University of Cambridge, Computer Laboratory.

Narayan, J., Shukla, S. K., and Clancy, T. C. (2015). A survey of automatic protocol reverse engineering tools. ACM Computing Surveys (CSUR), 48(3):1–26.

Pang, R. and Paxson, V. (2003). Lawrence Berkeley National Laboratory - FTP - packet trace. Dataset: https://ee.lbl.gov/anonymized-traces.html.

Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626.

Sharafaldin, I., Lashkari, A. H., Hakak, S., and Ghorbani, A. A. (2019). Developing realistic distributed denial of service (DDoS) attack dataset and taxonomy. In 2019 International Carnahan Conference on Security Technology (ICCST), pages 1–8. IEEE.

Shiravi, A., Shiravi, H., Tavallaee, M., and Ghorbani, A. A. (2012). Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security, 31(3):357–374.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

TC97, I. (1984). Basic reference model. International Standard, ISO/IS, 7498.

Wondracek, G., Comparetti, P. M., Kruegel, C., and Kirda, E. (2008). Automatic network protocol analysis. In NDSS, volume 8, pages 1–14.

Yu, L., Zhang, W., Wang, J., and Yu, Y. (2017). SeqGAN: Sequence generative adversarial nets with policy gradient. In Thirty-First AAAI Conference on Artificial Intelligence.