Payload-Byte: A Tool for Extracting and Labeling Packet Capture Files of Modern Network Intrusion Detection Datasets


This paper was downloaded from TechRxiv (https://www.techrxiv.org).

LICENSE

CC BY 4.0

SUBMISSION DATE / POSTED DATE

29-08-2022 / 01-09-2022

CITATION

Farrukh, Yasir Ali; Khan, Irfan; Wali, Syed; Bierbrauer, David; Bastian, Nathaniel (2022): Payload-Byte: A Tool
for Extracting and Labeling Packet Capture Files of Modern Network Intrusion Detection Datasets. TechRxiv.
Preprint. https://doi.org/10.36227/techrxiv.20714221.v1

DOI

10.36227/techrxiv.20714221.v1
Payload-Byte: A Tool for Extracting and Labeling
Packet Capture Files of Modern Network Intrusion
Detection Datasets
Yasir Ali Farrukh, Clean and Resilient Energy System Lab, Texas A&M University, College Station, TX, USA
Irfan Khan, Clean and Resilient Energy System Lab, Texas A&M University, Galveston, TX, USA
Syed Wali, Clean and Resilient Energy System Lab, Texas A&M University, College Station, TX, USA
David Bierbrauer, Army Cyber Institute, United States Military Academy, West Point, NY, USA
Nathaniel D. Bastian, Army Cyber Institute, United States Military Academy, West Point, NY, USA

Abstract: Adapting modern approaches for network intrusion detection is becoming critical, given the rapid technological advancement and adversarial attack rates. Therefore, packet-based methods utilizing payload data are gaining popularity due to their effectiveness in detecting certain attacks. However, packet-based approaches suffer from a lack of standardization, resulting in comparability and reproducibility issues. Unlike flow-based datasets, no standard labeled dataset exists, forcing researchers to follow bespoke labeling pipelines for individual approaches. Without a standardized baseline, proposed approaches cannot be compared and evaluated against each other. One cannot gauge whether a proposed approach is a methodological advancement or merely benefits from a proprietary interpretation of the dataset. To address these comparability and reproducibility issues, we introduce Payload-Byte, an open-source tool for extracting and labeling network packets. Payload-Byte utilizes metadata information and labels raw traffic captures of modern intrusion detection datasets in a generalized manner. Moreover, we transform the labeled data into a byte-wise feature vector that can be utilized for training machine learning models. The whole cycle of processing and labeling is explicitly stated in this work. Furthermore, the source code and processed data are made publicly available so that they may act as a standardized baseline for future research work. Lastly, we present a brief comparative analysis of machine learning models trained on packet-based and flow-based data.

Index Terms: Network intrusion detection, Traffic classification, Packet capture, Cyber attack datasets, Payload extraction

I. INTRODUCTION

Advancement in Information and Communication Technologies (ICT) brings exceptional convenience to our daily life. However, the world is becoming more dependent on these technological adaptations, as they impact every aspect of society and people's lives in one way or another. The world we used to know is no longer the same, as it has transformed into a collaborative network in which everything is interconnected [1]. This inter-connectivity has led to an increase in cyber-attacks on ICT systems [2]. Cyber-attacks on the network of ICT systems often lead to data breaches and operational halts for the company [3], [4]. Therefore, ICT systems require effective and robust security solutions.

A Network Intrusion Detection System (NIDS) is often considered a feasible option to protect against network-based attacks [5], as it identifies attack behavior by analyzing the network traffic of vital nodes in a network. NIDS utilizes various approaches for the detection of malicious attack instances. The most prominent of these approaches are rule-based, flow-based, and packet-based methods [6]. Rule-based methods are typically based on feature selection to construct domain-specific rules. Anomalies are detected by comparing the extracted signature of network flow with predefined rules. Rule-based methods are effective for detecting known attacks, but they rely heavily on in-depth domain knowledge.

Recently, much attention has been diverted toward applying machine learning (ML) approaches to NIDS, as ML algorithms are achieving striking results in domains where it is hard to specify a set of rules for their procedures [7]. One of the reasons for this shift from human-dependent approaches to ML is that humans cannot incorporate every possible scenario, and there will always be an unanticipated condition that might be devastating for the system. ML approaches operate differently: they utilize the data they receive and attempt to learn the patterns between occurrences [8]. Thus, ML approaches become dynamic and adaptive to various scenarios without human intervention [9]. Lately, many ML approaches have
been proposed in the domain of NIDS; however, most of these approaches utilize flow-based information. In flow-based methods, network traffic is analyzed over a period of time to extract flow-based behavior features [10]. This approach can be implemented in a centralized server to monitor massive traffic. Anomalies are detected utilizing the correlation between the traffic behavior and the corresponding characteristics. These approaches require subject matter experts to select useful features from data to detect malicious network traffic. Moreover, the extracted features require a pre-processing application, often requiring mathematical techniques to prepare the data for the ML model [11], [12]. Another significant issue with flow-based approaches is that they might end up learning which IP addresses send malicious traffic or which ports are frequently attacked. Flow-based approaches basically monitor threats at the lower level of the TCP/IP protocol stack, thereby diminishing the chance of detecting higher-level threats [13]. For such approaches, the methods used to extract and represent the information in a packet are crucial and can affect the output of ML models.

On the other hand, packet-based approaches can unveil malicious network flow by inspecting the packet payload, which refers to the network packet's user data. The packet-based approach tries to learn characteristics of normal as well as possible attack instances that have potentially abnormal characteristics in the packet payload. The anomalies in attack instances might appear as a number of specific strings. For example, an SQL injection attack injects anomalous code such as “ or 1 = 1 -- ” into SQL queries to make them always true [6]. Moreover, many attacks such as Worms, Ransomware, and Trojans are based on payload delivery, and these types of attacks might be hidden from flow-based approaches. In short, models that utilize flow-based approaches rely upon the correctness of the tool that extracts information from packets, so an inadequacy in that tool can propagate to a complete failure, whereas rule-based approaches depend completely on in-depth domain knowledge and cannot cater to every possible scenario. Therefore, packet-based approaches seem like a viable option for NIDS.

Network packet payload analysis might be an effective solution for detecting network attacks since application attacks are embedded in the payload rather than the header portion of the Internet Protocol (IP) packet [14]. However, the ability to detect payload-embedded attacks remains a challenge due to the absence of properly labeled packet data, a standardized dataset baseline, and the dynamic structuring of protocols. Unlike flow-based NIDS datasets, which are easily available and can be utilized as a baseline for developing and comparing any new proposition, packet-based datasets have no standard labeling and no dedicated dataset publicly available for everyone. This lack of standardization leads to incomparable and irreproducible research. In addition, there is a lot of ambiguity in the process of labeling the raw packet capture (PCAP) file, as every paper adopts its own processing method based on its own set of rules. Therefore, addressing the issue of standardization, we developed a tool (Payload-Byte) capable of extracting and labeling raw packet data from PCAP files of already available datasets like UNSW-NB15 and CIC-IDS2017. This tool, which is based upon initial work by Bierbrauer et al. [15], is publicly available, and researchers can utilize it to generate data that can be treated as a baseline for future exploration in packet-based approaches. Our goal is to provide a standardization solution to future researchers so that reproducibility and comparability can be enhanced. This will also enable researchers to directly compare new approaches with previous ones. The main contributions of this paper are as follows:
• We developed a tool (Payload-Byte) for extracting and labeling raw packet data of NIDS datasets. The complete cycle of processing and labeling is explicitly stated in this paper for ease of usage. This tool addresses the major issue of comparability and reproducibility in packet-based NIDS.
• Transformation of payload data into byte-wise data utilizing a generalized feature vector has been presented. This feature vector is independent of any protocol structure.
• A processed packet-based dataset along with the tool is made publicly available to facilitate future researchers (GitHub: https://github.com/Yasir-ali-farrukh/Payload-Byte.git).
• A brief comparative analysis has been performed utilizing the extracted payload data and flow-based data for network intrusion detection.

The rest of this paper is organized as follows: Section II covers the background knowledge related to the packet structure. Section III gives an overview of related work and highlights its gaps. The working and methodology are presented in Section IV. Furthermore, results are shown in Section V. Finally, Section VI concludes the paper.

II. BACKGROUND

The Libpcap (PCAP) format is considered the de facto standard for network packet capture and is widely utilized in packet sniffers and analyzers [16]. It is a binary format that supports nanosecond-precision timestamps, although this may vary from implementation to implementation. A general structure for the PCAP format is shown in Fig. 1.

Fig. 1. The general structure of PCAP format

For understanding the information extraction from a raw packet file, it is essential to develop an understanding of
the format in which packets are stored. The PCAP format contains a global header followed by multiple packets, each containing a packet header and packet data. The global header identifies the generic PCAP format and byte order using the “Magic Number”, validates the time information stored for each capture, and permits length checks to accommodate the maximum length of captured packets (in octets). The individual packet header, in contrast, comprises packet information such as its origin and destination IP addresses, protocol, total length, and other similar features. The packet data is the actual data, often referred to as the payload, containing the raw data.

In practice, packets have more than one header, and each header is utilized by a different part of the networking process. Simply put, packet headers are attached by certain types of networking protocols. Dividing a packet into header and data is a high-level representation of packets; if we dive deeper, there are several layers of additional information present within the raw network data. Following the TCP/IP model, each packet can be divided into four individual layers: Data Link Layer, Network Layer, Transport Layer, and Application Layer. By referring to the packet header, we are actually extracting information from the Network Layer and Transport Layer headers in our approach, which is covered in detail in Section IV. The PCAP format packets are layered following the TCP/IP model.

III. RELATED WORK AND GAPS

Much work has been done in the domain of NIDS utilizing ML approaches. However, most proposed methods use packet header information to extract features for training the model, also known as flow-based approaches. The authors in [17] present a comprehensive overview of the flow-based techniques. However, in our work, we have only focused on packet-based approaches.

A. Approaches Based on Outdated Data

Prior research has been performed utilizing several methods for detecting anomalies in a network payload. As per [14], packet-based processing was first performed by [18], in which the authors utilized the Self Organizing Map (SOM) to distinguish between normal and abnormal characteristics of the network employing payload data. Building upon it, many payload-based approaches have been presented based on a Natural Language Processing (NLP) concept called n-gram. PAYL, an anomaly detector based on payload data, is presented in [19]. This approach utilizes the byte frequency distribution of normal packets to form a centroid model. However, the authors use a knowledge-based structure to store the probability range occurrences of the n-gram technique to extract sequences from payload data. McPAD, proposed in [20], uses a modified n-gram method to extract the features from the payload. Incorporating neural networks with an n-gram approach, the authors in [21] proposed the Packet2Vec approach. In this approach, the authors utilize Word2Vec to develop a vector representation for the individual most frequent n-grams. Another approach based on raw packet data using Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) deep learning architectures is proposed as HAST-IDS in [22]. This approach captures low-level spatial and high-level temporal features through CNN and LSTM. Utilizing the same concept, the authors in [23] proposed AEIDS, based on an Autoencoder (AE), in which a reconstruction error and modified z-score are employed for classifying the incoming traffic instead of CNN and LSTM. Other than these approaches, some modified approaches also curtail the total length of payload on some basis. PCNAD [24], a modified version of PAYL, utilizes Content-based Payload Partitioning (CPP) to determine the length of payloads for different profiles. Similarly, the authors in [25] proposed a payload-based attribution scheme named Compressed Bitmap Index and Traffic Down-sampling (CBID). CBID extracts features utilizing the combination of a bitmap index table and bloom filters from down-sampled traffic. Although all these approaches use payload data to detect anomalies, there is still a huge gap in reproducibility and comparative analysis. Most methods in the literature utilize proprietary data or datasets that are outdated and already labeled. However, the obsolete dataset is not the issue here. The main problem is comparability and reproducibility. Since every method has utilized a different approach for extracting and labeling raw data without explicitly mentioning the whole process and assumptions, this has led to branching of the same tasks in several different ways [26].

B. Approaches Based on Modern Data

Since future research will utilize modern datasets containing updated attacks and data instances, our goal is to provide a standard baseline for researchers, just as exists for flow-based approaches. Few works on packet-based approaches have utilized updated datasets such as the UNSW-NB15 and CICIDS datasets. ATPAD, a method based on a Recurrent Neural Network (RNN) with the attention mechanism, is proposed in [27]. This method employs word embedding and an RNN to extract features used to capture the correlation between detection results and the potential bytes of the payload. This approach makes use of the CIC-IDS2017 dataset and utilizes binary classification. Moreover, no information is given regarding the extraction and labeling of raw packet files. Similarly, [6] also utilizes the CIC-IDS2017 dataset. In this approach, the authors employ the payload data to construct a block sequence that contains two kinds of information that retain short-term and long-term dependency relationships among the malicious bytes in payload data. The authors state the number of instances utilized for model training and testing. However, the amount of information given regarding the payload extraction is minimal. In terms of the UNSW-NB15 dataset, an approach utilizing the header and payload data has been proposed in [8]. The authors used raw bytes in conjunction with a specified feature vector comprising only TCP/UDP protocols. Since individual protocols have different header byte numbers, the authors fixed a feature vector to avoid ambiguity. Labeling information has been stated in this paper. However, the authors utilized only eight files of their own choice for their model training and validation. Moreover,
Fig. 2. Workflow representation of developed tool (Payload-Byte). Raw PCAP files are passed with available metadata for labeling and transformation of raw
PCAP data into ML model readable form.

there are more than 130 protocols in the UNSW-NB15 dataset that the authors neglect. Another approach, additionally covering the ICMP protocol, has been presented in [28]. The authors presented a unified packet representation using raw packet information in this approach and conducted their evaluation over ten different datasets. However, the proposed method utilizes every byte of the raw packet file, including headers containing information about IP addresses and ports. Using every byte of the packet may simply train the model to learn which IP addresses send malicious traffic or which ports are attacked frequently, which is not a robust approach.

In short, every approach proposed in the literature utilizes its own methodology for extracting and labeling raw packet files for available data or its own proprietary dataset. In addition, most of the approaches used for labeling do not seem adequate and have a complication, as they use the 5-tuple approach. There is no doubt that packet-based NIDS has the potential to detect attacks in a network. Still, to gauge the true applicability, researchers need a standardized baseline that can be utilized for the proposed scheme. In this way, the problem of reproducibility and comparability will be resolved. In this work, we developed a tool after undergoing a complex process of in-depth analysis of the datasets and methodology to provide a baseline solution for future researchers. The objective of the developed tool is to provide a generalized packet-based dataset that can be utilized by anyone according to their model. The detail of our adopted methodology is presented in Section IV. Moreover, labeled raw packet data has also been made available for researchers' ease.

IV. METHODOLOGY

A detailed overview of the adopted procedure for extracting and labeling raw packet data is provided in this section. Since our goal is to provide ease and a frame of reference to future researchers, we only considered the modern and preferred network intrusion detection datasets to explain our tool. In addition, a concise summary of available network intrusion detection datasets and their characteristics is also presented. Lastly, a brief overview of the selected approach for evaluating packet-based approaches is also presented in this section.

A. Network Intrusion Detection Datasets

There are many publicly available datasets for researchers in the domain of cybersecurity. However, most network intrusion detection datasets only contain header information of network packets. Several datasets comprising real network traffic either do not have payload data or have had it removed due to privacy concerns [5]. The unavailability of labeled packet data is a significant issue in packet-based NIDS. As a result, packet-based approaches are evaluated on proprietary or self-labeled data, resulting in reproducibility and comparability problems. Table I provides a comprehensive summary of the available datasets based on nine features. The features comprise the year of publication, accessibility of the dataset (whether the dataset is publicly available or not), format of the dataset (either flow or packet), size of the dataset, availability of labeled packet data, kind of traffic (real or emulated), whether the data is balanced or not, availability of modern network attacks, and whether the data contain metadata or not.

Although many datasets contain raw packet data, not every dataset is utilized in ongoing research. The reason is that every dataset has limitations and challenges due to the methods and environment used for creating it. Moreover, many of these datasets are outdated due to technological advancement accompanied by new and more complex software and network structures. Still, the most widely used and up-to-date datasets available are CICIDS 2017 and UNSW-NB15 [29]. Therefore, we selected these two datasets for the explanation of our tool.

B. Workflow Overview

The developed tool, Payload-Byte, consists of three main components: a Python-based parser, a labeling module, and a payload transformation module. The Python-based parser and the payload transformation module are generalized components, and their methodology and approach are similar for every dataset. However, the labeling module is dataset specific, and its approach differs from dataset to dataset, which is explained later. An
Fig. 3. Feature Vector representation of extracted data and data utilized for training the ML model. T-delta in employed feature vector is the time difference
between packets.
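As a minimal sketch of the byte-wise payload transformation described in this section (hex payload to integers in the range 0-255, zero-padded to 1500 features), the following may help. The function name `payload_to_features` and the defensive truncation of over-long input are assumptions made for illustration; this is not the tool's actual code, and the header-derived features of Fig. 3 are deliberately left out.

```python
from typing import List

MAX_PAYLOAD_BYTES = 1500  # payload range stated in the text


def payload_to_features(payload_hex: str, length: int = MAX_PAYLOAD_BYTES) -> List[int]:
    """Convert a hex-encoded payload into a fixed-length list of byte values."""
    raw = bytes.fromhex(payload_hex)
    # Each byte becomes one integer feature in the range 0-255.
    # Slicing is a defensive assumption for inputs longer than `length`.
    values = list(raw[:length])
    # Zero padding keeps the feature vector at a constant size.
    values.extend([0] * (length - len(values)))
    return values


features = payload_to_features("474554202f")  # hex for the ASCII bytes "GET /"
print(len(features), features[:6])  # prints: 1500 [71, 69, 84, 32, 47, 0]
```

Keeping the full 1500-byte range leaves it to the caller to shorten the vector later, for example by slicing the first N features.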

TABLE I
OVERVIEW OF THE WIDELY UTILIZED INTRUSION DETECTION DATASETS IN THE LITERATURE

Dataset           Year of Publication  Accessibility  Format              Size (count unit)  Traffic   Modern Attacks  Balanced  Metadata  Labeled Packets
NSL-KDD [30]      1998                 Public         Other               150k points        Emulated  No              No        No        -
DARPA [31], [32]  1998                 Public         Packets, logs       4.9M points        Emulated  No              No        Yes       Yes
KYOTO 2006+ [33]  2006                 Public         Other               93M points         Real      No              No        No        -
UNIBS [34]        2009                 On Request     Flow                79k flows          Real      No              No        No        -
Botnet [35]       2010                 Public         Packet              14GB packets       Emulated  No              No        Yes       Yes
ISCX 2012 [36]    2012                 Public         Packet, Flow        2M flows           Emulated  No              No        Yes       No
CIC DoS [37]      2012                 Public         Packet              4.6GB packets      Emulated  No              No        No        Yes
CTU-13 [38]       2013                 Public         Packet, Flow        81M flows          Real      No              No        Yes       No
UNSW-NB15 [39]    2015                 Public         Packet, Flow        2M points          Emulated  Yes             No        Yes       No
CIC-IDS2017 [40]  2017                 Public         Packet, Flow        4.6GB packets      Emulated  Yes             No        Yes       No
CIC-IDS2018 [41]  2018                 Public         Packet, Flow, logs  450GB logs         Emulated  Yes             No        Yes       No
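The labeling module is described in this paper as an inner merge between parsed packets and the ground-truth CSV that preserves packet order. The sketch below illustrates that idea with a plain dictionary lookup rather than the tool's own merge routine. The field names in `KEY_FIELDS`, the `label` column, and the example addresses are all hypothetical, and the real matching fields are dataset specific.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical column names; the real matching fields are dataset specific.
KEY_FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port", "protocol")


def make_key(record: dict, fields: Sequence[str] = KEY_FIELDS) -> Tuple:
    """Build the tuple of values on which packets and flow rows are matched."""
    return tuple(record[name] for name in fields)


def label_packets(
    packets: Iterable[dict],
    ground_truth: Iterable[dict],
    fields: Sequence[str] = KEY_FIELDS,
) -> List[dict]:
    """Inner join: keep only packets with a ground-truth match, in packet order."""
    # If several ground-truth rows share a key, the last one wins (a simplification).
    lookup: Dict[Tuple, str] = {make_key(row, fields): row["label"] for row in ground_truth}
    labeled = []
    for pkt in packets:
        label = lookup.get(make_key(pkt, fields))
        if label is not None:  # unmatched packets are dropped, as in an inner merge
            labeled.append({**pkt, "label": label})
    return labeled


packets = [
    {"src_ip": "192.0.2.1", "dst_ip": "192.0.2.9", "src_port": 1234, "dst_port": 80, "protocol": "tcp", "payload": "aa"},
    {"src_ip": "192.0.2.2", "dst_ip": "192.0.2.9", "src_port": 5555, "dst_port": 53, "protocol": "udp", "payload": "bb"},
    {"src_ip": "192.0.2.1", "dst_ip": "192.0.2.9", "src_port": 1234, "dst_port": 80, "protocol": "tcp", "payload": "cc"},
]
ground_truth = [
    {"src_ip": "192.0.2.1", "dst_ip": "192.0.2.9", "src_port": 1234, "dst_port": 80, "protocol": "tcp", "label": "attack"},
]

labeled = label_packets(packets, ground_truth)
print([(p["payload"], p["label"]) for p in labeled])
```

In this toy input, two packets match a single ground-truth row, which mirrors why the number of labeled packets can exceed the number of flow records. Passing a longer `fields` tuple (for example with timestamps or TTL) tightens the match without changing the function.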

overview of the workflow is illustrated in Fig. 2. Raw PCAP files and metadata are fed into Payload-Byte as input, and processed and labeled payload data is received at the output. Payload-Byte can also output the parsed PCAP files in CSV form and the labeled PCAP data without transformation. Provision of these files enables researchers to apply their own interpretation to the extracted payload data while still having a standard baseline.

Several tools are available for analyzing and extracting information from packet capture files, such as Wireshark, Chaosreader, Tcpflow, NetworkMiner, and many others. However, most of these tools are Graphical User Interface (GUI) based and require a lot of computation and processing power, which is unsuitable for any tactical environment. A programming-based parser, which can be operated on any resource-constrained device, is preferred for such an environment. Keeping this need in view, we developed our parser based on the Scapy Python module [42]. Since the first step in labeling any packet capture file is extracting information, we developed a generalized PCAP file parser that can be utilized for parsing PCAP format files.

There are many approaches for extracting information, as mentioned in Section III. Since our focus is on a payload-based intrusion detection system, we laid out our feature vector for packet-based approaches in such a way that raw bytes are captured from packet data, and features are extracted from the packet header. We have not utilized the raw bytes of the header due to its dynamic structure. As there are many protocols, each protocol's header size and order are distinct. Therefore, utilizing raw bytes for the header would lead to learning complications for the model. Moreover, IP addresses in the network layer and port fields in the transport layer can cause the ML model to form an unreliable bias towards these bytes. Since these features are commonly associated with preferences and specific services, they can create a communication pattern. But they can change at any time; therefore, these features are unreliable for training a model. Unlike [11], in which the authors limited their information extraction to the transport layer and only involved two protocols, TCP and UDP, the developed parser can extract information from the application layer too. A pictorial representation of the extracted features and the feature vector utilized in training the model is presented in Fig. 3. As shown in Fig. 3, IP addresses and ports are only extracted for labeling the raw PCAP file data with reference to available metadata. Since the length of the payload changes with each packet, the maximum length that a packet can attain is considered to avoid any overflow or truncation of the payload bytes. As per the de facto packet size limit of 1500 bytes [43], [44], we set a 1500-byte range for the payload to incorporate every byte. Our goal is to extract the data in such a way that it is complete, and researchers can reduce this range as per their need. Furthermore, the payload was divided byte by byte, transforming it into 1500 features. This transformation is necessary for the training of the ML model. NLP techniques are not adopted for payload data; instead, the payload is transformed from hex values to integers in the range 0-255. Zero padding was employed where the number of payload bytes is less than 1500 to maintain the standard structure of the feature vector. Further detail on parsing concerning individual datasets is provided in
their subsequent heading.

The next step after extraction is labeling, which is performed by comparing the extracted features from the PCAP file and features from the ground truth table. However, a generalized labeling approach is not possible for both datasets due to dataset-specific complications explained in their respective subheadings. Inner merge utilizing the divide and conquer algorithm is adopted for comparing and labeling the PCAP file with the CSV while preserving the order of PCAP files. Since packets arrive at microsecond intervals, the number of labeled packets is higher than the number of data instances in the CSV file, as shown in Section V. Further, to facilitate the public availability of data and ease the data usage process, data reduction is carried out. All the data instances whose packet data has no payload are removed. Moreover, normal data instances are under-sampled to mitigate the imbalance of the dataset. The processed payload data for both datasets are made available along with the source code of the developed tool so that researchers can cross-check the procedure or utilize it to generate the complete data without data reduction. Further detail of labeling and parsing for the individual datasets is explained below.

1) UNSW-NB15: The UNSW-NB15 intrusion detection dataset encompasses nine modern attacks and network traffic emulated in a small environment. The network traffic is captured for more than 31 hours and is spread out over 79 different PCAP files, comprising more than 99GB of data. The dataset comprises raw network traffic in the PCAP file format and labeled flow-based data with additional attributes. We have utilized the PCAP and CSV files having labeled data instances for our tool. Moreover, an in-depth dataset analysis was performed by deploying various approaches for accurate and effective comparison and labeling. However, several ambiguities were found in the dataset, which are discussed next.

First of all, the CSV data requires a lot of preprocessing. There are many missing and null values in the dataset. Furthermore, there is inconsistency in the labeling of data instances: similar attack classes are labeled differently. Around 480,630 data instances are duplicated, and more than 130 protocols are present in the dataset. The protocols are from different layers: the transport layer and the application layer. Secondly, some corrupted data are present in the dataset; for example, the source and destination ports of ICMP protocols are in hex values. Thirdly, the time stamping in the UNSW-NB15 dataset is in Unix epoch format and has a starting and ending time for the packets. But the epoch time is rounded up to integers, losing its microseconds, which could be useful for accurately labeling the packets. Also, the dataset has a feature duration which was used to generate the ending time for packets, but

not included as there are more than 130 different protocols. However, the obtained results were not satisfactory. Therefore, different protocols having exact naming to the ones present in the CSV file were incorporated into the Python-based parser. Since these protocols are from different layers, the Python-based parser is programmed to extract the application layer protocols. The top 45 protocols concerning attack label counts are hard coded in the program, and the rest of the protocols are mapped under others. As the number of data instances after the top 20 protocols is insignificant, they are mapped under the other category. Similar mapping is performed for the CSV file too. In addition to protocol, the dur feature from the CSV and the t-delta feature from the PCAP files are also utilized for mapping. Even so, the results obtained were not up to par. Manual data exploration was performed, and it was deduced that the dur feature also has ambiguity. Therefore, the t-delta feature is added to the starting time of the PCAP file to attain the ending time. Both Unix time stamps are rounded off to transform them into integers. Here type casting is not performed directly, as type casting would truncate the fractional part, which is not the case in the CSV file. While exploring the data, it was also inferred that the Time to Live (TTL) feature of the PCAP file maps to the source TTL feature of the CSV. Therefore, eight different features are utilized for comparing and labeling PCAP files. These features are: Source IP, Destination IP, Source Port, Destination Port, starting time, ending time, protocol, and time to live.

Since the ICMP protocol has corrupted destination and source ports, it was labeled without including these two features. Similarly, ARP protocols also do not have a destination or source port. Moreover, there are some protocols whose IP addresses are not available in the PCAP file format; therefore, they were not labeled automatically. After labeling, duplicate data instances are removed, and benign data is under-sampled to 1.5 times the number of instances of the second-largest attack class. After that, data is transformed into 1504 features, converting the payload hex string into 1500 byte-wise values represented as integers. The remaining 4 features are from the packet header, as shown in Fig. 3.

2) CIC-IDS2017: The CIC-IDS2017 intrusion detection dataset is specifically developed to represent more modern network flows and attacks than the preceding datasets mentioned in Table I [40]. The dataset consists of 48.8GB of network traffic captured in five separate files over five days. The dataset is released in two different formats: raw network traffic in the PCAP file format and extracted flow-based data having a set of different features in CSV format. The authors have additionally provided metadata about IP addresses and attack duration. We utilized the PCAP and CSV files having labeled data instances for our tool. However, some ambiguity in the CSV data files
for some data instances, this feature does not add up correctly. and PCAP files led to labeling based on flow-ID (Source IP,
Fourthly, several protocols, such as udt and any do not exist. Destination IP, Source Port, Destination Port, and Protocol).
Several approaches were tried to achieve the most optimal Some of the prominent ambiguities are highlighted below.
approach for comparing and labeling the PCAP data. The First of all, CSV data requires a lot of prepossessing, there
first attempt was made by labeling the PCAP packets by are several missing values in the dataset, and four columns
utilizing five features: Source IP, Destination IP, Source Port, are duplicated. Secondly, only three protocols are available
Destination Port, and Starting time. Protocol feature was in the CSV file: TCP, UDP, and others. Whereas, in PCAP
files, we found that there are Address Resolution Protocol (ARP), Link Layer Discovery Protocol (LLDP), and Cisco Discovery Protocol (CDP) packets, which are not part of IPv4 or IPv6. Moreover, the PCAP file also contains Internet Control Message Protocol (ICMP), Internet Group Management Protocol (IGMP), and Stream Control Transmission Protocol (SCTP) packets. These protocols are neglected as they are not included in the CSV file. Thirdly, time-stamping in the CSV file is in the general format of “dd/MM/YYYY HH:mm:ss” rather than Unix epoch time stamping as in the UNSW-NB15 dataset. 529,450 data instances (around 19%) in the CSV file are missing seconds in the time format. This leads to inaccurate time calculation for comparison and labeling of the PCAP file. Moreover, the time format in the CSV file follows the 12-hour clock format, but every data instance is marked as AM, which is not the case with the PCAP file. Just for experimentation, we labeled the data with the inclusion of time-stamping and found that only 80 data instances were matched and labeled for one PCAP file. That is why we have not utilized time stamping in our labeling approach for CIC-IDS2017. Another ambiguity in time stamping is that the Wednesday (July 5, 2017) and Friday (July 7, 2017) PCAP files have time-stamping till 15:10 and 15:02, respectively. As per the metadata and CSV file, the Heartbleed Port attack starts at 15:12 on Wednesday, and the DDoS attack starts at 15:56 on Friday; both are completely missing in the PCAP file as per time stamping.

Keeping these issues in mind, we dropped the time stamp feature as our tool’s comparison base for CIC-IDS2017. We utilized the Source IP, Destination IP, Source Port, Destination Port, and Protocol features for matching the packets with labeled CSV data. After data instances are matched, we utilized the time duration of each attack as specified in the metadata to cross-validate the data instances. Here we used the time-stamp of the CSV file rather than the PCAP file, since the metadata is based on CSV file time-stamps. For the attack time frames given in the metadata, we removed benign instances to eliminate complications, even though the CSV file also contains benign data instances in those time frames. After cross-validation of the labeled PCAP data, duplicated values are removed and benign data is under-sampled to 1.5 times the number of second-highest attack instances. After that, the data was transformed into 1504 features, converting the payload hex string into byte-wise data represented as integers.

The adopted approach for both datasets was deduced after extensive exploration of the datasets and methods. Therefore, the extracted data can be utilized as a standard baseline for future and current work. The findings on the processed data are presented in the next Section.

C. Incorporated Models

We performed a comparative analysis using several ML approaches between packet-based and extracted flow-based data. The objective of this comparative analysis is to provide a comparison between both types of data. Therefore, no hyper-parameter tuning has been performed for the approaches, and default parameters are used. The goal here is not to achieve the best results but to provide a brief comparison. Moreover, the results are not compared with recently published payload-based approaches, as they have not explicitly stated their data extraction approach and assumptions, which makes such a comparison infeasible. Furthermore, the available approaches have only utilized binary classification and presented it as an anomaly detector. In contrast, we have performed a multi-class classification, and the results are provided in the Results Section.

The approaches that are adopted are: Random Forest, Logistic Regression, K-Nearest Neighbour, AdaboostClassifier, Multilayer Perceptron, Deep Neural Network (DNN), and a simple combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). The architecture utilized for the DNN and CNN-LSTM is shown in Table II. A similar architecture is utilized for the CSV and labeled PCAP files. However, the input of the model is different for CSV and PCAP files.

TABLE II
MODEL ARCHITECTURE FOR THE CNN-LSTM AND DNN

Parameter             Description
Activation Function   Softmax
Loss Function         Categorical cross entropy
Optimizer             Adam (default learning rate)
Epochs                30

                      DNN                            CNN-LSTM
Number of Layers      3                              5
Layer 1               Fully Connected (None, 1024)   Conv1D (None, 1504, 64)
Layer 2               Fully Connected (None, 512)    Maxpooling1D (None, 752, 64)
Layer 3               Fully Connected (None, 10)     BatchNormalization (None, 752, 64)
Layer 4               -                              LSTM (None, 64)
Layer 5               -                              Fully Connected (None, 10)

Fig. 4. An overview of the data processing and achieved outcome of the Payload-Byte. Both of the datasets are plotted side by side for easier comparison.

V. RESULTS

A comprehensive overview in the form of quantitative data is presented in this Section, along with the comparative analysis of results obtained with packet-based and flow-based data. For the implementation of our developed tool, the CSV files of both datasets are pre-processed before passing them into the Payload-Byte, whereas the PCAP files are fed into it directly. The output of the Payload-Byte is a transformed and labeled PCAP file, having 1504 features as shown in Fig. 3. However, initial stage files can also be extracted from the tool, such as parsed
Fig. 5. A comparison of data instances for individual attack classes in the Labeled PCAP and Processed CSV files. Numbers on the bar graph represent the number of data instances in the labeled PCAP file. Normal instances are under-sampled for ease of data representation and usage.
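The under-sampling of normal instances mentioned in the caption can be sketched as follows. This is only an illustrative sketch, not the Payload-Byte implementation: it assumes that "1.5 times the second-highest attack instances" means 1.5 times the size of the second-largest attack class, and the function name `undersample_benign`, the label string `"normal"`, and the fixed random seed are hypothetical choices made for the example.

```python
import random
from collections import Counter


def undersample_benign(labels, benign="normal", ratio=1.5, seed=0):
    """Return sorted row indices to keep after under-sampling the benign class.

    Assumption (not confirmed by the paper): the benign class is cut down to
    ratio * (size of the second-largest attack class). Attack rows are never
    dropped.
    """
    attack_counts = Counter(label for label in labels if label != benign)
    ranked = sorted(attack_counts.values(), reverse=True)
    if len(ranked) < 2:
        # Fewer than two attack classes: nothing to rank against, keep all rows.
        return list(range(len(labels)))

    target = int(ratio * ranked[1])
    benign_idx = [i for i, label in enumerate(labels) if label == benign]
    attack_idx = [i for i, label in enumerate(labels) if label != benign]

    if len(benign_idx) > target:
        # Draw a reproducible random subset of the benign rows.
        benign_idx = random.Random(seed).sample(benign_idx, target)

    return sorted(attack_idx + benign_idx)
```

Returning indices rather than rows keeps the sketch independent of how the labeled packets are stored, so the same selection could be applied to a list, an array, or a data frame.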

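The transformation of a hex-valued payload into the 1504-feature vector described in the text can be sketched as follows. This is a minimal sketch under stated assumptions, not the tool's actual code: zero-padding of short payloads, truncation of long ones, and placing the four header values after the payload bytes are assumptions, and the function name `payload_to_features` is hypothetical. The four header values are passed in by the caller because this section does not name them.

```python
PAYLOAD_LEN = 1500  # number of byte-wise payload features stated in the text


def payload_to_features(payload_hex, header_features):
    """Convert a payload hex string plus 4 header values into 1504 integers.

    Assumptions (illustrative only): payloads longer than 1500 bytes are
    truncated, shorter ones are right-padded with zeros, and the header
    values are appended after the payload bytes.
    """
    if len(header_features) != 4:
        raise ValueError("expected exactly 4 header features")

    raw = bytes.fromhex(payload_hex)[:PAYLOAD_LEN]
    # Iterating over a bytes object yields integers in the range 0..255.
    byte_values = list(raw) + [0] * (PAYLOAD_LEN - len(raw))
    return byte_values + [int(value) for value in header_features]
```

For example, `payload_to_features("474554", (64, 6, 43, 0))` yields a vector whose first three entries are the byte values of the ASCII string "GET", followed by zero padding and the four header values supplied by the caller.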
PCAP files and labeled PCAP files having hex-valued payload. An overview of the processed data is presented under the next heading.

A. Data Processing

The available CSV data for both datasets are distributed among several files. Therefore, they are combined into a single file per dataset before any processing. Data preprocessing is also performed, cleaning and removing erroneous data instances. The crux of the whole processing is illustrated in Fig. 4, where data in the CSV file represents the data instances in the available CSV file before preprocessing it. The total number of available packets in all PCAP files is shown as data in the PCAP file. Moreover, labeled data represents data instances obtained after labeling and removing non-payload data instances.

It can be deduced from the statistical data provided in Fig. 4 that the amount of labeled data depends on the number of data instances present in the CSV file rather than in the PCAP file. Moreover, the number of data instances in the PCAP files of UNSW-NB15 is exceptionally high due to the number of available PCAP files. There are 79 PCAP files for the UNSW-NB15 dataset, whereas there are only 5 PCAP files in CIC-IDS2017. Another reason for more labeled data instances in the CIC-IDS2017 dataset is that we only utilize five features for comparing and labeling, whereas eight features are utilized in UNSW-NB15.

TABLE III
SUMMARY OF UNSW-NB15 PCAP FILES

                            22-01-2015    17-02-2015
Data Instances in CSV       1,082,221     1,036,218
Data Instances in PCAP      93,384,964    92,559,915
Unprocessed Labeled PCAP    11,084,503    9,745,079
Processed Labeled PCAP      6,691,105 (one value for both days)

A detailed summary of data processing based on individual PCAP files for the respective datasets is presented in Table III and Table IV. Since there are 79 PCAP files in the UNSW-NB15 dataset, collective results are shown per capture day. On the other hand, CIC-IDS2017 only contains 5 PCAP files spanning 5 days; therefore, they are shown individually. Data Instances in CSV in Table III and Table IV represents the number of data points that the CSV file contains in the time span of that individual PCAP file. Unprocessed Labeled PCAP depicts the number of packet instances labeled, and Processed Labeled PCAP shows the number of remaining packets after processing and removal of non-payload data from Unprocessed Labeled PCAP.

TABLE IV
SUMMARY OF CIC-IDS2017 PCAP FILES

                            Monday        Tuesday       Wednesday     Thursday      Friday
Data Instances in CSV       306,794       335,415       590,692       285,825       675,965
Data Instances in PCAP      11,626,492    11,469,736    13,705,555    9,240,723     9,915,680
Unprocessed Labeled PCAP    17,096,760    20,692,983    48,931,426    15,727,986    19,740,075
Processed Labeled PCAP      5,633,567     3,516,569     16,436,931    4,071,767     5,848,193

Data instances obtained after labeling involve several duplicated values and instances where packets have no payload. Since the number of labeled packets is very high, they are processed by removing duplicates and instances having no payload. Afterward, the data is under-sampled to reduce its size by decreasing normal instances. The data obtained after labeling has a high number of normal instances, since normal instances in the CSV files are also present in significant quantity. Attack instances, in contrast, are not under-sampled, to avoid any information loss. The total number of data instances for each attack class in the CSV file and the labeled PCAP file is illustrated in Fig. 5. In the figure, data labels on the bar graph show the number of data instances in the labeled PCAP files. The number of data instances shown in Fig. 5 is that of the finalized file, processed and under-sampled. The final version of both datasets is publicly available and can be utilized by future research work as a standard baseline. Furthermore, the complete data can also be generated by using the Payload-Byte.

B. Comparative Analysis

Labeled PCAP files and processed CSV files of the UNSW-NB15 dataset are utilized to compare both approaches briefly. Features Source IP, Destination IP, Start time, and Last time
are removed from the CSV files, as they can force the ML model to learn the relation between these features and attacks. As there are many missing values for the feature ct_ftp_cmd, it is also omitted from the CSV file. Moreover, normal data instances of the CSV files are under-sampled just as for the PCAP files to maintain coherence. The data for both approaches is scaled utilizing Standard-Scaler before being passed to the ML models. The resulting outcome of the comparison is shown in Fig. 6. However, the overall performance of individual models can be improved by tuning hyper-parameters, which will be carried out in our future research work. The accuracy illustrated in Fig. 6 is macro average accuracy, as it represents performance better in terms of multi-class classification. Through Fig. 6 it can be deduced that PCAP data performs better than the available CSV data for almost every model. A detailed result comprising Recall, F1-score, and Accuracy for each model is presented in Table V.

Fig. 6. A comparison of macro accuracy for several ML models, utilizing packet and flow based data. All models utilized have default parameters and no hyper-parameter tuning is performed.

TABLE V
DETAILED RESULTS FOR PERFORMANCE COMPARISON

                         CSV Data                        PCAP Data
Models                   Accuracy   Recall   F1-Score    Accuracy   Recall   F1-Score
Random Forest            62%        56%      58%         66%        67%      67%
Logistic Regression      48%        42%      42%         57%        55%      55%
K-Nearest Neighbour      54%        44%      47%         63%        62%      62%
AdaboostClassifier       37%        54%      40%         34%        34%      31%
Multilayer Perceptron    54%        51%      52%         72%        68%      67%
DNN                      56%        51%      52%         74%        66%      66%
CNN LSTM                 61%        56%      56%         75%        69%      69%

VI. CONCLUSION

In this work, we highlighted the issue of comparability and reproducibility for packet-based NIDS. Since payload-based approaches can be an effective solution for detecting different network attacks, several works have been proposed in the literature. However, the data utilized for such work is either proprietary or processed data based on undefined procedures and assumptions. Moreover, there is no properly labeled packet-based dataset available that can be utilized as a standard baseline. Therefore, every new approach follows its own method for extracting and labeling packet data without explicitly stating the whole process and notion, which leads to the same tasks being carried out in many different ways. Addressing the issue, we developed an open-source tool (Payload-Byte) for parsing and labeling modern raw network traffic datasets. Payload-Byte standardizes dataset curation and provides a standardized baseline for future researchers to reproduce and compare other proposed packet-based approaches. Payload-Byte eliminates the engineering and language errors that stem from current datasets. Our tool can also parse high-level layers, extracting information from application layers. Since the UNSW-NB15 dataset involves application layer protocols, Payload-Byte can extract and label every available protocol, which is being done for the first time. Moreover, any future datasets can also be parsed utilizing Payload-Byte, as we developed a generalized parsing approach that is applicable to every packet capture file.

Furthermore, we stated the complete cycle of parsing and labeling of the CIC-IDS2017 and UNSW-NB15 datasets so that it can be adopted for future research work. Payload-Byte can be utilized to reproduce the entire data and transform it accordingly. However, for ease of usage, we transform the payload data into a feature vector comprising features from the packet header and the payload data, transformed into byte-wise (1500) features. Lastly, a brief comparison of flow-based and transformed packet-based data has been presented to show the effectiveness of packet-based approaches.

For our future work, we will carry out an in-depth comparative analysis of available proposed approaches on the extracted data. Moreover, we will look into additional transformation methods that can result in effective solutions for packet-based approaches.

ACKNOWLEDGMENT

This work was supported in part by the U.S. Military Academy (USMA) under Cooperative Agreement No. W911NF-22-2-0081, the U.S. Army Combat Capabilities Development Command (DEVCOM) C5ISR Center under Support Agreement No. USMA21056, and the National Security Agency Laboratory for Advanced Cybersecurity Research under Interagency Agreement No. USMA21035. The views and conclusions expressed in this paper are those of the authors and do not reflect the official policy or position of the U.S. Military Academy, U.S. Army, U.S. Department of Defense, or U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. The U.S. Government reserves a royalty-free, nonexclusive and irrevocable right to reproduce, publish, or otherwise use this data for Federal purposes, and to authorize others to do so in accordance with 2 CFR 200.315(b).

REFERENCES

[1] M. A. Khan, M. R. Karim, and Y. Kim, “A Scalable and Hybrid Intrusion Detection System Based on the Convolutional-LSTM Network,” Symmetry, vol. 11, no. 4, p. 583, Apr. 2019. [Online]. Available: https://www.mdpi.com/2073-8994/11/4/583
[2] S. Wali and I. Khan, “Explainable ai and random forest based reliable intrusion detection system,” 2021.
[3] L. Yu, J. Dong, L. Chen, M. Li, B. Xu, Z. Li, L. Qiao, L. Liu, B. Zhao, and C. Zhang, “PBCNN: Packet Bytes-based Convolutional Neural Network for Network Intrusion Detection,” Computer Networks, vol. 194, p. 108117, Jul. 2021.
[4] Y. A. Farrukh, Z. Ahmad, I. Khan, and R. M. Elavarasan, “A sequential supervised machine learning approach for cyber attack detection in a smart grid system,” in 2021 North American Power Symposium (NAPS), 2021, pp. 1–6.
[5] M. Hassan, M. E. Haque, M. E. Tozal, V. Raghavan, and R. Agrawal, “Intrusion Detection Using Payload Embeddings,” IEEE Access, vol. 10, pp. 4015–4030, 2022.
[6] J. Liu, X. Song, Y. Zhou, X. Peng, Y. Zhang, P. Liu, D. Wu, and C. Zhu, “Deep anomaly detection in packet payload,” Neurocomputing, vol. 485, pp. 205–218, May 2022.
[7] E. Alhajjar, P. Maxwell, and N. Bastian, “Adversarial machine learning in Network Intrusion Detection Systems,” Expert Systems with Applications, vol. 186, p. 115782, Dec. 2021.
[8] F. Dimitrios Tsokos Supervisor and P. Georgios, “Network Dataset Generation and Implementation of a Network Intrusion Detection System using Neural Networks,” 2021.
[9] R. R. Mehdi, E. A. Mendiola, A. Sears, J. Ohayon, G. Choudhary, R. Pettigrew, and R. Avazmohammadi, “Comparison of three machine learning methods to estimate myocardial stiffness.”
[10] J. T. Zhou, H. Zhang, D. Jin, and X. Peng, “Dual Adversarial Transfer for Sequence Labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 434–446, Feb. 2021.
[11] M. De Lucia, P. E. Maxwell, N. D. Bastian, A. Swami, B. Jalaian, and N. Leslie, “Machine learning raw network traffic detection,” p. 24, Apr. 2021.
[12] P. Maxwell, E. Alhajjar, and N. D. Bastian, “Intelligent feature engineering for cybersecurity,” in 2019 IEEE International Conference on Big Data (Big Data), 2019, pp. 5005–5011.
[13] W. Lee, S. J. Stolfo, P. K. Chan, E. Eskin, W. Fan, M. Miller, S. Hershkop, and J. Zhang, “Real time data mining-based intrusion detection,” Proceedings - DARPA Information Survivability Conference and Exposition II, DISCEX 2001, vol. 1, pp. 89–100, 2001.
[14] A. F. Mercurio, “A Critical Analysis of Payload Anomaly-Based Intrusion Detection Systems,” p. 363, 2010. [Online]. Available: https://epublications.regis.edu/theses/363
[15] R. K. P. M. . N. D. B. David A Bierbrauer, Michael J De Lucia, “Transfer learning for raw network traffic detection,” Expert Systems with Applications, unpublished.
[16] L. F. Sikos, “Packet analysis for network forensics: A comprehensive survey,” Forensic Science International: Digital Investigation, vol. 32, Mar. 2020.
[17] F. Pacheco, E. Expósito, M. Gineste, C. Baudoin, and J. Aguilar, “Towards the Deployment of Machine Learning Solutions in Network Traffic Classification: A Systematic Survey,” Communications Surveys and Tutorials, vol. 21, no. 2, pp. 1988–2014, 2018. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02423375
[18] “A neural network approach towards intrusion detection | CiNii Research.” [Online]. Available: https://cir.nii.ac.jp/crid/1572543024790112512
[19] K. Wang and S. J. Stolfo, “Anomalous payload-based network intrusion detection,” in Recent Advances in Intrusion Detection, E. Jonsson, A. Valdes, and M. Almgren, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 203–222.
[20] R. Perdisci, D. Ariu, P. Fogla, G. Giacinto, and W. Lee, “McPAD: A multiple classifier system for accurate payload-based anomaly detection,” Computer Networks, vol. 53, no. 6, pp. 864–881, Apr. 2009.
[21] E. L. Goodman, C. Zimmerman, and C. Hudson, “Packet2Vec: Utilizing Word2Vec for Feature Extraction in Packet Data,” Apr. 2020. [Online]. Available: http://arxiv.org/abs/2004.14477
[22] W. Wang, Y. Sheng, J. Wang, X. Zeng, X. Ye, Y. Huang, and M. Zhu, “HAST-IDS: Learning Hierarchical Spatial-Temporal Features Using Deep Neural Networks to Improve Intrusion Detection,” IEEE Access, vol. 6, pp. 1792–1806, Dec. 2017.
[23] B. A. Pratomo, P. Burnap, and G. Theodorakopoulos, “Unsupervised Approach for Detecting Low Rate Attacks on Network Traffic with Autoencoder,” 2018 International Conference on Cyber Security and Protection of Digital Services, Cyber Security 2018, Dec. 2018.
[24] S. A. Thorat, A. K. Khandelwal, B. Bruhadeshwar, and K. Kishore, “Payload content based network anomaly detection,” 1st International Conference on the Applications of Digital Information and Web Technologies, ICADIWT 2008, pp. 127–132, 2008.
[25] S. M. Hosseini and A. H. Jahangir, “An Effective Payload Attribution Scheme for Cybercriminal Detection Using Compressed Bitmap Index Tables and Traffic Downsampling,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 4, pp. 850–860, Apr. 2018.
[26] J. Holland, P. Schmitt, P. Mittal, and N. Feamster, “Towards Reproducible Network Traffic Analysis,” Mar. 2022.
[27] Z.-Q. Qin, X.-K. Ma, and Y.-J. Wang, “Attentional payload anomaly detector for web applications,” in Neural Information Processing, L. Cheng, A. C. S. Leung, and S. Ozawa, Eds. Cham: Springer International Publishing, 2018, pp. 588–599.
[28] J. Holland, P. Schmitt, N. Feamster, and P. Mittal, “New Directions in Automated Traffic Analysis,” Proceedings of the ACM Conference on Computer and Communications Security, pp. 3366–3383, Nov. 2021.
[29] M. Ring, S. Wunderlich, D. Scheuring, D. Landes, and A. Hotho, “A survey of network-based intrusion detection data sets,” Computers & Security, vol. 86, pp. 147–167, Sep. 2019.
[30] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A Detailed Analysis of the KDD CUP 99 Data Set.” [Online]. Available: http://nsl.cs.unb.ca/NSL-KDD/
[31] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, and M. A. Zissman, “Evaluating intrusion detection systems: The 1998 DARPA off-line intrusion detection evaluation,” Proceedings - DARPA Information Survivability Conference and Exposition, DISCEX 2000, vol. 2, pp. 12–26, 2000.
[32] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das, “The 1999 DARPA off-line intrusion detection evaluation,” Computer Networks: The International Journal of Computer and Telecommunications Networking, vol. 34, no. 4, pp. 579–595, Oct. 2000.
[33] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, and K. Nakao, “Statistical Analysis of Honeypot Data and Building of Kyoto 2006+ Dataset for NIDS Evaluation.”
[34] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, and K. C. Claffy, “GT: Picking up the Truth from the Ground for Internet Traffic.”
[35] E. B. Beigi, H. H. Jazi, N. Stakhanova, and A. A. Ghorbani, “Towards effective feature selection in machine learning-based botnet detection approaches,” 2014 IEEE Conference on Communications and Network Security, CNS 2014, pp. 247–255, Dec. 2014.
[36] A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers and Security, vol. 31, no. 3, pp. 357–374, May 2012.
[37] H. H. Jazi, H. Gonzalez, N. Stakhanova, and A. A. Ghorbani, “Detecting HTTP-based application layer DoS attacks on web servers in the presence of sampling,” Computer Networks, vol. 121, pp. 25–36, Jul. 2017.
[38] S. García, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison of botnet detection methods,” Computers and Security, vol. 45, pp. 100–123, 2014.
[39] N. Moustafa and J. Slay, “UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set),” 2015 Military Communications and Information Systems Conference, MilCIS 2015 - Proceedings, Dec. 2015.
[40] I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization,” 2018.
[41] “IDS 2018 | Datasets | Research | Canadian Institute for Cybersecurity | UNB.” [Online]. Available: https://www.unb.ca/cic/datasets/ids-2018.html
[42] “Scapy.” [Online]. Available: https://scapy.net/
[43] N. M. Garcia, M. M. Freire, and P. P. Monteiro, “The ethernet frame payload size and its effect on IPv4 and IPv6 traffic,” 2008 International Conference on Information Networking, ICOIN, 2008.
[44] R. J. Anthony, “The Resource View,” Systems Programming, pp. 203–276, Jan. 2016.
