Conference Paper NIDS PCAP Transformation
LICENSE
CC BY 4.0
29-08-2022 / 01-09-2022
CITATION
Farrukh, Yasir Ali; Khan, Irfan; Wali, Syed; Bierbrauer, David; Bastian, Nathaniel (2022): Payload-Byte: A Tool
for Extracting and Labeling Packet Capture Files of Modern Network Intrusion Detection Datasets. TechRxiv.
Preprint. https://doi.org/10.36227/techrxiv.20714221.v1
DOI
10.36227/techrxiv.20714221.v1
Payload-Byte: A Tool for Extracting and Labeling
Packet Capture Files of Modern Network Intrusion
Detection Datasets
Yasir Ali Farrukh, Clean and Resilient Energy System Lab, Texas A&M University, College Station, TX, USA
Irfan Khan, Clean and Resilient Energy System Lab, Texas A&M University, Galveston, TX, USA
Syed Wali, Clean and Resilient Energy System Lab, Texas A&M University, College Station, TX, USA
Abstract—Adapting modern approaches for network intrusion detection is becoming critical, given the rapid technological advancement and adversarial attack rates. Therefore, packet-based methods utilizing payload data are gaining popularity due to their effectiveness in detecting certain attacks. However, packet-based approaches suffer from a lack of standardization, resulting in comparability and reproducibility issues. Unlike flow-based datasets, no standard labeled dataset exists, forcing researchers to follow bespoke labeling pipelines for individual approaches. Without a standardized baseline, proposed approaches cannot be compared and evaluated against each other. One cannot gauge whether a proposed approach is a methodological advancement or merely benefits from a proprietary interpretation of the dataset. To address these comparability and reproducibility issues, we introduce Payload-Byte, an open-source tool for extracting and labeling network packets. Payload-Byte utilizes metadata information and labels raw traffic captures of modern intrusion detection datasets in a generalized manner. Moreover, we transform the labeled data into a byte-wise feature vector that can be utilized for training machine learning models. The whole cycle of processing and labeling is explicitly stated in this work. Furthermore, the source code and processed data are made publicly available so that they may act as a standardized baseline for future research work. Lastly, we present a brief comparative analysis of machine learning models trained on packet-based and flow-based data.

Index Terms—Network intrusion detection, Traffic classification, Packet capture, Cyber attack datasets, Payload extraction

I. INTRODUCTION

Advancement in Information and Communication Technologies (ICT) brings exceptional convenience to our daily life. However, the world is becoming more dependent on these technological adaptations, as they impact every aspect of society and people’s lives in one way or another. The world we used to know is no longer the same, as it has transformed into a collaborative network in which everything is interconnected [1]. This inter-connectivity has led to an increase in cyber-attacks on ICT systems [2]. Cyber-attacks on the network of ICT systems often lead to data breaches and operational halts for the company [3], [4]. Therefore, ICT systems require effective and robust security solutions.

A Network Intrusion Detection System (NIDS) is often considered a feasible option to protect against network-based attacks [5], as it identifies attack behavior by analyzing the network traffic of vital nodes in a network. NIDS utilizes various approaches for the detection of malicious attack instances. The most prominent of these approaches are rule-based, flow-based, and packet-based methods [6]. Rule-based methods are typically based on feature selection to construct domain-specific rules. Anomalies are detected by comparing the extracted signature of network flow with predefined rules. Rule-based methods are effective for detecting known attacks, but they rely heavily on in-depth domain knowledge.

Recently, much attention has been diverted toward applying machine learning (ML) approaches to NIDS, as ML algorithms are achieving striking results in domains where it is hard to specify a set of rules for their procedures [7]. One of the reasons for this shift from human-dependent approaches to ML is that humans cannot incorporate every possible scenario, and there will always be an unanticipated condition that might be devastating for the system. ML approaches operate differently: they utilize the data they get and attempt to learn the patterns between occurrences [8]. Thus, ML approaches become dynamic and adaptive to various scenarios without human intervention [9]. Lately, many ML approaches have
been proposed in the domain of NIDS; however, most of these approaches utilize flow-based information. In flow-based methods, network traffic is analyzed over a period to extract flow-based behavior features [10]. This approach can be implemented in a centralized server to monitor massive traffic. The anomalies are detected utilizing the correlation between the traffic behavior and the corresponding characteristics. These approaches require subject matter experts to select useful features from data to detect malicious network traffic. Moreover, the extracted features require pre-processing, often involving mathematical techniques, to prepare the data for the ML model [11], [12]. Another significant issue with flow-based approaches is that they might end up learning which IP addresses send malicious traffic or which ports are frequently attacked. Flow-based approaches basically monitor threats at the lower level of the TCP/IP protocol stack, thereby diminishing the chance of detecting higher-level threats [13]. For such approaches, the methods used to extract and represent the information in a packet are crucial and can affect the output of ML models.

On the other hand, packet-based approaches can unveil malicious network flow by inspecting the packet payload, which refers to the network packet’s user data. The packet-based approach tries to learn the characteristics of normal instances as well as of possible attack instances that have potentially abnormal characteristics in the packet payload. The anomalies in attack instances might appear as a number of specific strings. For example, an SQL injection attack injects anomalous code such as “or 1=1 --” into SQL queries to make them always true [6]. Moreover, many attacks such as Worms, Ransomware, and Trojans are based on payload delivery, and these types of attacks might be hidden from flow-based approaches. In short, models that utilize flow-based approaches rely upon the correctness of the tool that extracts information from packets. Thus, an inadequacy in that tool propagates to a complete failure, whereas rule-based approaches are completely dependent on in-depth domain knowledge and cannot cater to every possible scenario. Therefore, packet-based approaches seem like a viable option for NIDS.

Network packet payload analysis might be an effective solution for detecting network attacks since application attacks are embedded in the payload rather than the header portion of the Internet Protocol (IP) packet [14]. However, the ability to detect payload-embedded attacks remains a challenge due to the absence of properly labeled packet data, a standardized dataset baseline, and the dynamic structuring of protocols. Unlike flow-based NIDS datasets, which are easily available and can be utilized as a baseline for developing and comparing any new proposition, packet-based datasets do not have standard labeling or a dedicated dataset publicly available for everyone. This lack of standardization leads to incomparable and irreproducible research. In addition, there is a lot of ambiguity in the process of labeling the raw packet capture (PCAP) file, as every paper adopts its own processing method based on its own set of rules. Therefore, addressing the issue of standardization, we developed a tool (Payload-Byte) capable of extracting and labeling raw packet data from PCAP files of already available datasets like UNSW and CICIDS. This tool, which is based upon initial work by Bierbrauer et al. [15], is publicly available, and researchers can utilize the tool to generate the data, which can be treated as a baseline for future exploration in packet-based approaches. Our goal is to provide a standardization solution to future researchers so that reproducibility and comparability can be enhanced. This will also enable researchers to directly compare new approaches with the previous ones. The main contributions of this paper are as follows:
• We developed a tool (Payload-Byte) for extracting and labeling raw packet data of NIDS datasets. The complete cycle of processing and labeling is explicitly stated in this paper for ease of usage. This tool addresses the major issue of comparability and reproducibility in packet-based NIDS.
• Transformation of payload data into byte-wise data utilizing a generalized feature vector has been presented. This feature vector is independent of any protocol structure.
• A processed packet-based dataset along with the tool is made publicly available to facilitate future research.¹
• A brief comparative analysis has been performed utilizing the extracted payload data and flow-based data for network intrusion detection.
¹ GitHub: https://github.com/Yasir-ali-farrukh/Payload-Byte.git

The rest of this paper is organized as follows: Section II covers the background knowledge related to the packet structure. In Section III, an overview of related work and gaps is highlighted. The working and methodology are presented in Section IV. Furthermore, results are shown in Section V. Finally, Section VI concludes the paper.

II. BACKGROUND

The Libpcap (PCAP) format is considered the de facto standard for network packet capture and is widely utilized in packet sniffers and analyzers [16]. The format is binary and supports nanosecond-precision timestamps, although this may vary from implementation to implementation. A general structure for the PCAP format is shown in Fig. 1.

Fig. 1. The general structure of PCAP format

For understanding the information extraction from a raw packet file, it is essential to develop an understanding of
the format in which packets are stored. The PCAP format contains a global header followed by multiple packets, each containing a packet header and packet data. The global header identifies the generic PCAP format and byte order using the “Magic Number”, validates the time information stored for each capture, and permits length checks to accommodate the maximum length of captured packets (in octets). The individual packet header, in turn, comprises packet information such as its origin and destination IP addresses, protocol, total length, and other similar features. The packet data is the actual data, often referred to as the payload, containing the raw data.

In practice, packets have more than one header, and each header is utilized by a different part of the networking process. Simply put, packet headers are attached by certain types of networking protocols. Dividing a packet into header and data is a high-level representation of packets; if we dive deeper, there are several layers of additional information present within the raw network data. Following the TCP/IP model, each packet can be divided into four individual layers: Data Link Layer, Network Layer, Transport Layer, and Application Layer. When referring to the packet header in our approach, we are actually extracting information from the Network Layer and Transport Layer headers, which is covered in detail in Section IV. The PCAP format packets are layered following the TCP/IP model.

III. RELATED WORK AND GAPS

Much work has been done in the domain of NIDS utilizing ML approaches. However, most proposed methods use packet header information to extract features for training the model, also known as flow-based approaches. The authors in [17] present a comprehensive overview of the flow-based techniques. However, in our work, we have only focused on packet-based approaches.

A. Approaches Based on Outdated Data

Prior research has been performed utilizing several methods for detecting anomalies in a network payload. As per [14], packet-based processing was first performed by [18], in which the authors utilized the Self Organizing Map (SOM) to distinguish between normal and abnormal characteristics of the network employing payload data. Building upon it, many payload-based approaches have been presented based on the Natural Language Processing (NLP) concept called n-gram. PAYL, an anomaly detector based on payload data, is presented in [19]. This approach utilizes the byte frequency distribution of normal packets to form a centroid model. However, the authors use a knowledge-based structure to store the probability range occurrences of the n-gram technique to extract sequences from payload data. McPAD, proposed in [20], uses a modified n-gram method to extract the features from the payload. Incorporating neural networks with an n-gram approach, the authors in [21] proposed the Packet2Vec approach. In this approach, the authors utilize Word2Vec to develop a vector representation for the individual most frequent n-grams. Another approach based on raw packet data using Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) deep learning architectures is proposed as HAST-IDS in [22]. This approach captures low-level spatial and high-level temporal features through CNN and LSTM. Utilizing the same concept, the authors in [23] proposed AEIDS, based on an Autoencoder (AE), in which a reconstruction error and modified z-score are employed for classifying the incoming traffic instead of CNN and LSTM. Other than these approaches, some modified approaches also curtail the total length of payload on some basis. PCNAD [24], a modified version of PAYL, utilizes Content-based portioning (CPP) to determine the length of payloads for different profiles. Similarly, the authors in [25] proposed a payload-based attribution scheme named Compressed Bitmap Index and Traffic Down-sampling (CBID). CBID extracts features utilizing the combination of a bitmap index table and bloom filters from down-sampled traffic. Although all these approaches use payload data to detect anomalies, there is still a huge reproducibility and comparison analysis gap. Most methods in the literature utilize proprietary data or datasets that are outdated and already labeled. However, the obsolete dataset is not the issue here. The main problem is comparability and reproducibility. Since every method has utilized a different approach for extracting and labeling raw data without explicitly mentioning the whole process and assumptions, this has led to branching of the same tasks in several different ways [26].

B. Approaches Based on Modern Data

Since future research utilizes modern datasets containing updated attacks and data instances, our goal is to provide a standard baseline for researchers, just like flow-based approaches. Few works based on packet-based approaches have utilized updated datasets such as the UNSW-NB15 and CICIDS datasets. A method based on a Recurrent Neural Network (RNN) with the attention mechanism, ATPAD, is proposed in [27]. This method employs word embedding and an RNN to extract features used to capture the correlation between detection results and the potential byte of the payload. This approach makes use of the CIC-IDS2017 dataset and utilizes binary classification. Moreover, no information is given regarding the extraction and labeling of raw packet files. Similarly, [6] also utilizes the CIC-IDS2017 dataset. In this approach, the authors employ the payload data to construct a block sequence that contains two kinds of information that retain short-term and long-term dependency relationships among the malicious bytes in payload data. The authors state the number of instances utilized for model training and testing. However, the amount of information given regarding the payload extraction is minimal. In terms of the UNSW-NB15 dataset, an approach utilizing the header and payload data has been proposed in [8]. The authors have used raw bytes in conjunction with a specified feature vector comprising only TCP/UDP protocols. Since individual protocols have different header byte numbers, the authors have fixed a feature vector to avoid ambiguity. Labeling information has been stated in this paper. However, the authors utilized only eight files of their own choice for their model training and validation. Moreover,
Fig. 2. Workflow representation of developed tool (Payload-Byte). Raw PCAP files are passed with available metadata for labeling and transformation of raw
PCAP data into ML model readable form.
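The first stage of the workflow in Fig. 2 reads raw PCAP files. As a minimal sketch of what that involves at the file level, the snippet below walks the classic libpcap layout (a 24-byte global header whose magic number fixes byte order and timestamp resolution, followed by 16-byte record headers and packet bytes) using only the Python standard library. It is an illustration, not the Payload-Byte parser, which is built on Scapy; the function name `read_pcap` and the dictionary keys are assumptions made for this example.

```python
import struct

PCAP_MAGIC_USEC = 0xA1B2C3D4  # classic PCAP, microsecond timestamps
PCAP_MAGIC_NSEC = 0xA1B23C4D  # classic PCAP, nanosecond timestamps


def read_pcap(data):
    """Parse a classic PCAP byte string into (global_header, packets).

    Hypothetical helper for illustration only.
    """
    if len(data) < 24:
        raise ValueError("too short for a PCAP global header")
    # The magic number reveals byte order and timestamp resolution.
    for endian in ("<", ">"):
        (magic,) = struct.unpack(endian + "I", data[:4])
        if magic in (PCAP_MAGIC_USEC, PCAP_MAGIC_NSEC):
            break
    else:
        raise ValueError("not a classic PCAP file")
    major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", data[4:24]
    )
    header = {"version": (major, minor), "snaplen": snaplen, "linktype": linktype}
    divisor = 1e6 if magic == PCAP_MAGIC_USEC else 1e9

    packets = []
    offset = 24
    while offset + 16 <= len(data):
        # Record header: seconds, fraction, captured length, original length.
        ts_sec, ts_frac, incl_len, orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        offset += 16
        raw = data[offset:offset + incl_len]
        if len(raw) < incl_len:
            break  # truncated capture: stop at the last complete packet
        offset += incl_len
        packets.append((ts_sec + ts_frac / divisor, orig_len, raw))
    return header, packets


# Build a tiny in-memory capture with one 4-byte packet and parse it back.
demo = struct.pack("<IHHiIII", PCAP_MAGIC_USEC, 2, 4, 0, 0, 65535, 1)
demo += struct.pack("<IIII", 1, 500000, 4, 4) + b"\xde\xad\xbe\xef"
hdr, pkts = read_pcap(demo)
```

The sketch stops at the raw packet bytes; splitting them into link, network, transport, and application layers is the part a library such as Scapy handles.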
there are more than 130 protocols in the UNSW-NB15 dataset that the authors neglect. Another approach covering additional ICMP protocols has been presented in [28]. The authors presented a unified packet representation using raw packet information in this approach. The authors conducted their evaluation over ten different datasets. However, the proposed method utilizes every byte of the raw packet file, including headers containing information about IP addresses and ports. Using every byte of the packet may simply train the model to learn which IP addresses send malicious traffic or which ports are attacked frequently, which is not a robust approach.

In short, every approach proposed in the literature utilizes its own methodology for extracting and labeling raw packet files for available data or its own proprietary dataset. In addition, most of the approaches used for labeling do not seem adequate and have a complication, as they use the 5-tuple approach. There is no doubt that packet-based NIDS has the potential to detect attacks in a network. Still, to gauge the true applicability, researchers need a standardized baseline that can be utilized for the proposed scheme. In this way, the problem of reproducibility and comparability will be resolved. In this work, we developed a tool after undergoing a complex process of in-depth analysis of the datasets and methodology to provide a baseline solution for future researchers. The objective of the developed tool is to provide a generalized packet-based dataset that can be utilized by anyone according to their model. The detail of our adopted methodology is presented in Section IV. Moreover, labeled raw packet data has also been made available for researchers’ ease.

IV. METHODOLOGY

A detailed overview of the adopted procedure for extracting and labeling raw packet data is provided in this section. Since our goal is to provide ease and a frame of reference to future researchers, we only considered the modern and preferred network intrusion detection datasets to explain our tool. In addition, a concise summary of available network intrusion detection datasets and their characteristics is also presented. Lastly, a brief overview of the selected approach for evaluating packet-based approaches is also presented in this section.

A. Network Intrusion Detection Datasets

There are many publicly available datasets for researchers in the domain of cybersecurity. However, most network intrusion detection datasets only contain header information of network packets. Several datasets comprising real network traffic either do not have payload data or have had it removed due to privacy concerns [5]. The unavailability of labeled packet data is a significant issue in packet-based NIDS. As a result, packet-based approaches are evaluated on proprietary or self-labeled data, resulting in reproducibility and comparability problems. Table I provides a comprehensive summary of the available datasets based on nine features. The features comprise the year of publication, accessibility of the dataset (whether the dataset is publicly available or not), format of the dataset (either flow or packet), size of the dataset, availability of labeled packet data, kind of traffic (real or emulated), whether the data is balanced or not, availability of modern network attacks, and whether the data contain metadata or not.

Although many datasets contain raw packet data, not every dataset is utilized in ongoing research. The reason is that every dataset has limitations and challenges due to the methods and environment used for creating it. Moreover, many of these datasets are outdated due to technological advancement accompanied by new and more complex software and network structures. Still, the most widely used and up-to-date datasets available are CICIDS 2017 and UNSW-NB15 [29]. Therefore, we selected these two datasets for the explanation of our tool.

B. Workflow Overview

The developed tool, Payload-Byte, consists of three main components: a Python-based parser, a labeling module, and a payload transformation module. The Python-based parser and payload transformation module are generalized components, and their methodology and approach are similar for every dataset. However, the labeling module is dataset specific, and its approach differs from dataset to dataset, which is explained later. An
Fig. 3. Feature Vector representation of extracted data and data utilized for training the ML model. T-delta in employed feature vector is the time difference
between packets.
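The payload part of the feature vector in Fig. 3 can be illustrated with a short sketch. Assuming the payload is available as a hex string, the function below converts each byte into an integer in the range 0-255 and zero-pads the result to 1500 values, as described in the text. The function name `payload_to_features`, the choice to pad on the right, and the slice that guards against longer inputs are assumptions of this example, not details confirmed by the paper; the header-derived features are left out.

```python
PAYLOAD_LEN = 1500  # fixed payload width used in the paper


def payload_to_features(payload_hex, length=PAYLOAD_LEN):
    """Turn a payload hex string into `length` integers in the range 0-255.

    Hypothetical helper for illustration only.
    """
    raw = bytes.fromhex(payload_hex)   # every hex pair becomes one byte
    values = list(raw[:length])        # assumption: cut anything beyond `length`
    values.extend([0] * (length - len(values)))  # assumption: zero-pad on the right
    return values


# "474554202f" is the hex encoding of the ASCII bytes "GET /".
features = payload_to_features("474554202f")
```

Here `features` starts with 71, 69, 84, 32, 47 and is followed by 1495 zeros, so every packet yields a vector of the same width regardless of payload size.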
TABLE I
OVERVIEW OF THE WIDELY UTILIZED INTRUSION DETECTION DATASET IN THE LITERATURE
Dataset           | Year of Publication | Accessibility | Format             | Size  | Count   | Traffic  | Modern Attacks | Balanced | Metadata | Labeled Packets
NSL-KDD [30]      | 1998                | Public        | Other              | 150k  | points  | Emulated | No             | No       | No       | -
DARPA [31], [32]  | 1998                | Public        | Packets, logs      | 4.9M  | points  | Emulated | No             | No       | Yes      | Yes
KYOTO 2006+ [33]  | 2006                | Public        | Other              | 93M   | points  | Real     | No             | No       | No       | -
UNIBS [34]        | 2009                | On Request    | Flow               | 79k   | flows   | Real     | No             | No       | No       | -
Botnet [35]       | 2010                | Public        | Packet             | 14GB  | packets | Emulated | No             | No       | Yes      | Yes
ISCX 2012 [36]    | 2012                | Public        | Packet, Flow       | 2M    | flows   | Emulated | No             | No       | Yes      | No
CIC DoS [37]      | 2012                | Public        | Packet             | 4.6GB | packets | Emulated | No             | No       | No       | Yes
CTU-13 [38]       | 2013                | Public        | Packet, Flow       | 81M   | flows   | Real     | No             | No       | Yes      | No
UNSW-NB15 [39]    | 2015                | Public        | Packet, Flow       | 2M    | points  | Emulated | Yes            | No       | Yes      | No
CIC-IDS2017 [40]  | 2017                | Public        | Packet, Flow       | 4.6GB | packets | Emulated | Yes            | No       | Yes      | No
CIC-IDS2018 [41]  | 2018                | Public        | Packet, Flow, logs | 450GB | logs    | Emulated | Yes            | No       | Yes      | No
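Table I lists UNSW-NB15 and CIC-IDS2017 as shipping flow-level labels but no labeled packets, so labeling packets amounts to matching each packet against a labeled flow record on shared fields. The sketch below shows that idea as an order-preserving inner join on a flow ID (source IP, destination IP, source port, destination port, protocol). It uses a plain dictionary lookup rather than the merge routine of the tool, and the field names and the `label_packets` helper are hypothetical.

```python
# Hypothetical column names standing in for the flow-ID fields.
FLOW_KEYS = ("src_ip", "dst_ip", "src_port", "dst_port", "proto")


def label_packets(packets, flows, keys=FLOW_KEYS):
    """Inner-join packets with labeled flow records, keeping packet order."""
    # Index flow records by key tuple; if several records share a key,
    # the last one wins in this sketch.
    label_by_key = {tuple(f[k] for k in keys): f["label"] for f in flows}
    labeled = []
    for pkt in packets:  # iterate in capture order
        key = tuple(pkt[k] for k in keys)
        if key in label_by_key:  # inner join: unmatched packets are dropped
            labeled.append({**pkt, "label": label_by_key[key]})
    return labeled


flows = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 4444,
     "dst_port": 80, "proto": "tcp", "label": "attack"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "src_port": 5555,
     "dst_port": 53, "proto": "udp", "label": "benign"},
]
packets = [
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "src_port": 5555,
     "dst_port": 53, "proto": "udp", "payload": "00"},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.9", "src_port": 6666,
     "dst_port": 22, "proto": "tcp", "payload": "01"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "src_port": 4444,
     "dst_port": 80, "proto": "tcp", "payload": "02"},
]
labeled = label_packets(packets, flows)
```

Because one flow record can match many packets, a join of this kind naturally produces more labeled packets than there are flow records. Adding time or TTL fields to `keys` would narrow the match in the same way.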
overview of the workflow is illustrated in Fig. 2. Raw PCAP files and metadata are fed into Payload-Byte as input, and processed and labeled payload data is received at the output. Payload-Byte can also output parsed PCAP files in CSV form and labeled PCAP data without transformation. Provision of these files enables researchers to apply their own interpretation to the extracted payload data while still sharing the standard baseline.

Several tools are available for analyzing and extracting information from packet capture files, such as Wireshark, Chaos Reader, Tcpflow, Network miner, and many others. However, most of these tools are Graphical User Interface (GUI) based and require a lot of computation and processing power, which is unsuitable for any tactical environment. A programming-based parser that can be operated on any resource-constrained device is preferred for such an environment. Keeping this need in view, we developed our parser based on the Scapy Python module [42]. Since the first step in labeling any packet capture file is extracting information, we developed a generalized PCAP file parser that can be utilized for parsing PCAP format files.

There are many approaches for extracting information, as mentioned in Section III. Since our focus is on a payload-based intrusion detection system, we laid out our feature vector for packet-based approaches in such a way that raw bytes are captured from packet data, and features are extracted from the packet header. We have not utilized the raw bytes of the header due to its dynamic structure. As there are many protocols, each protocol’s header size and order are distinct. Therefore, utilizing raw bytes for the header would lead to learning complications for the model. Moreover, IP addresses in the network layer and port fields in the transport layer can cause the ML model to form an unreliable bias towards these bytes. Since these features are commonly associated with preferences and specific services, they can create a communication pattern. However, they can change at any time; therefore, these features are unreliable for training a model. Unlike [11], in which the authors limited their information extraction to the transport layer and only involved two protocols, TCP and UDP, the developed parser can extract information from the application layer too. A pictorial representation of the extracted features and the feature vector utilized in training the model is presented in Fig. 3. As shown in Fig. 3, IP addresses and ports are only extracted for labeling the raw PCAP file data with reference to available metadata. Since the length of the payload changes with each packet, the maximum length that a packet can attain is considered to avoid any overflowing or truncation of the payload bytes. As per the de facto packet size limit of 1500 bytes [43], [44], we set a 1500-byte range for the payload to incorporate every byte. Our goal is to extract the data in such a way that it is complete, and researchers can reduce this range as per their need. Furthermore, the payload was divided with respect to bytes, transforming it into 1500 features. This transformation is necessary for the training of the ML model. NLP techniques are not applied to the payload data; instead, the payload is transformed from hex values to integers in the range 0-255. Zero padding was employed where the number of payload bytes is less than 1500 to maintain the standard structure of the feature vector. Further detail on parsing concerning individual datasets is provided in
their subsequent heading.

The next step after extraction is labeling, which is performed by comparing the extracted features from the PCAP file and features from the ground truth table. However, a generalized labeling approach is not possible for both datasets due to dataset-specific complications explained in their respective subheadings. An inner merge utilizing a divide-and-conquer algorithm is adopted for comparing and labeling the PCAP file with the CSV while preserving the order of the PCAP files. Since packets are captured at microsecond granularity, the number of labeled packet instances is higher than the number of data instances in the CSV file, as shown in Section V. Further, to facilitate the public availability of the data and ease the data usage process, data reduction is carried out. All the data instances whose packet data has no payload are removed. Moreover, normal data instances are under-sampled to mitigate the imbalance of the dataset. The processed payload data for both datasets are made available along with the source code of the developed tool so that researchers can cross-check the procedure or utilize it to generate the complete data without data reduction. Further detail of labeling and parsing for the individual datasets is explained below.

1) UNSW-NB15: The UNSW-NB15 intrusion detection dataset encompasses nine modern attacks and network traffic emulated in a small environment. The network traffic is captured for more than 31 hours and is spread out over 79 different PCAP files, amounting to more than 99GB of data. The dataset comprises raw network traffic in the PCAP file format and labeled flow-based data with additional attributes. We have utilized the PCAP and CSV files having labeled data instances for our tool. Moreover, an in-depth dataset analysis was performed by deploying various approaches for accurate and effective comparison and labeling. However, several ambiguities were found in the dataset, which are discussed next.

First of all, the CSV data requires a lot of preprocessing. There are many missing and null values in the dataset. Furthermore, there is inconsistency in the labeling of data instances. Similar attack classes are labeled differently. Around 480,630 data instances are duplicated, and more than 130 protocols are present in the dataset. The protocols are from different layers: the transport layer and the application layer. Secondly, some corrupted data are present in the dataset; for example, the source and destination ports of ICMP protocols are in hex values. Thirdly, the time stamping in the UNSW-NB15 dataset is in Unix epoch format and has a starting and ending time for the packets. However, the epoch time is rounded up to integers, losing its microseconds, which could be useful for accurately labeling the packets. Also, the dataset has a duration feature which was used to generate the ending time for packets, but for some data instances, this feature does not add up correctly. Fourthly, several protocols, such as udt and any, do not exist.

Several approaches were tried to find the best approach for comparing and labeling the PCAP data. The first attempt was made by labeling the PCAP packets utilizing five features: Source IP, Destination IP, Source Port, Destination Port, and Starting time. The protocol feature was not included as there are more than 130 different protocols. However, the obtained results were not satisfactory. Therefore, different protocols having exact naming to the ones present in the CSV file were incorporated into the Python-based parser. Since these protocols are from different layers, the Python-based parser is programmed to extract the application layer protocols. The top 45 protocols concerning attack label counts are hard coded in the program, and the rest of the protocols are mapped under others. As the number of data instances after the top 20 protocols is insignificant, they are mapped under the other category. Similar mapping is performed for the CSV file too. In addition to protocol, the dur feature from the CSV and the t-delta feature from the PCAP files are also utilized for mapping. Even so, the results obtained were not up to par. Manual data exploration was performed, and it was deduced that the dur feature also has ambiguity. Therefore, the t-delta feature is added to the starting time of the PCAP file to attain the ending time. Both of the Unix time stamps are rounded off to transform them into integers. Here type casting is not performed directly, as type casting would truncate the floating points, which is not the case in the CSV file. While exploring the data, it was also inferred that the Time to Live (TTL) feature of the PCAP file maps to the source TTL feature of the CSV. Therefore, eight different features are utilized for comparing and labeling PCAP files. These features are: Source IP, Destination IP, Source Port, Destination Port, starting time, ending time, protocol, and time to live.

Since the ICMP protocol has corrupted destination and source ports, it was labeled without including these two features. Similarly, ARP protocols also do not have a destination or source port. Moreover, there are some protocols whose IP addresses are not available in the PCAP file format; therefore, they were not labeled automatically. After labeling, duplicate data instances are removed, and benign data is under-sampled to 1.5 times the number of instances of the second-largest attack class. After that, the data is transformed into 1504 features, converting the payload hex string into 1500 byte-wise values represented as integers. The remaining 4 features are from the packet header, as shown in Fig. 3.

2) CIC-IDS2017: The CIC-IDS2017 intrusion detection dataset is specifically developed to represent more modern network flows and attacks than the preceding datasets mentioned in Table I [40]. The dataset consists of 48.8GB of network traffic captured in five separate files over five days. The dataset is released in two different formats: raw network traffic in the PCAP file format and extracted flow-based data having a set of different features in CSV format. The authors have additionally provided metadata about IP addresses and attack duration. We utilized the PCAP and CSV files having labeled data instances for our tool. However, some ambiguity in the CSV data files and PCAP files led to labeling based on flow-ID (Source IP, Destination IP, Source Port, Destination Port, and Protocol). Some of the prominent ambiguities are highlighted below.

First of all, the CSV data requires a lot of preprocessing: there are several missing values in the dataset, and four columns are duplicated. Secondly, only three protocols are available in the CSV file: TCP, UDP, and others. Whereas, in PCAP
files, we found that there are Address Resolution Protocol (ARP), Link Layer Discovery Protocol
(LLDP), and Cisco Discovery Protocol (CDP) packets, which are not part of IPv4 or IPv6.
Moreover, the PCAP file also contains Internet Control Message Protocol (ICMP), Internet Group
Management Protocol (IGMP), and Stream Control Transmission Protocol (SCTP) packets. These
protocols are neglected as they are not included in the CSV file. Thirdly, time-stamping in the
CSV file is in the general format of “dd/MM/YYYY HH:mm:ss” rather than Unix epoch time stamping
as in the UNSW-NB15 dataset. 529,450 data instances (around 19%) in the CSV file are missing
seconds in the time format. This leads to inaccurate time calculation for comparison and
labeling of the PCAP file. Moreover, the time format in the CSV file follows the 12-hour clock
format, but it is found that every data instance is in AM, which is not the case with the PCAP
file. Just for experimentation, we labeled the data with the inclusion of time-stamping, and
found that only 80 data instances were matched and labeled for one PCAP file. That is why we
have not utilized time stamping in our labeling approach for CIC-IDS2017. Another ambiguity in
time stamping is that the Wednesday (July 5, 2017) and Friday (July 7, 2017) PCAP files have
time-stamping till 15:10 and 15:02, respectively. As per the metadata and CSV file, the
Heartbleed Port attack starts at 15:12 on Wednesday, and the DDoS attack starts at 15:56 on
Friday, which are completely missing in the PCAP file as per time stamping.
Keeping these issues in mind, we dropped the time stamp feature as our tool’s comparison base
for CIC-IDS2017. We utilized the Source IP, Destination IP, Source Port, Destination Port, and
Protocol features for matching the packets with labeled CSV data. After data instances are
matched, we utilized the time duration of each attack as specified in the metadata to
cross-validate the data instances. Here we used the time-stamp of the CSV file rather than the
PCAP file since the metadata is based on CSV file time-stamps. For the given time of attacks as
per metadata, we removed benign instances from them to eliminate complications. However, the
CSV file also contains benign data instances in the time frame of attacks as specified by
metadata. After cross-validation of labeled PCAP data, duplicated values are removed and benign
data is under-sampled to 1.5 times the instance count of the second-largest attack class. After
that, data was transformed into 1504 features, converting payload hex string into byte-wise
data represented in integers.
The adopted approach for both datasets is deduced after extensive exploration of the dataset
and methods. Therefore, the extracted data can be utilized as a standard baseline for future
and current work. The findings from the processed data are presented in the next Section.

C. Incorporated Models
We performed a comparative analysis using several ML approaches between packet-based and
extracted flow-based data. The objective of this comparative analysis is to provide a
comparison between both types of data. Therefore, no hyperparameter tuning has been performed
for the approaches, and default parameters are used. The goal here is not to achieve the best
results but to provide a brief comparison. Moreover, the results are not compared with recent
available payload-based approaches as they have not explicitly stated their data extraction
approach and assumptions. Therefore, it is not a feasible option. Furthermore, available
approaches have only utilized binary class classification and presented it as an anomaly
detector. However, we have performed a multi-class classification, and results are provided in
the Results Section.
The approaches that are adopted are: Random Forest, Logistic Regression, K-Nearest Neighbour,
AdaboostClassifier, Multilayer Perceptron, Deep Neural Network (DNN) and a simple combination
of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). The architecture
utilized for DNN and CNN-LSTM is shown in Table II. Similar architecture is utilized for the
CSV and labeled PCAP files. However, the input of the model is different for CSV and PCAP files.

TABLE II
MODEL ARCHITECTURE FOR THE CNN-LSTM AND DNN
Parameter            Description
Activation Function  Softmax
Loss Function        Categorical cross entropy
Optimizer            Adam & lr=default
Epochs               30
                     DNN                          CNN-LSTM
Number of Layers     3                            5
Layer 1              Fully Connected (None,1024)  Conv1D (None, 1504, 64)
Layer 2              Fully Connected (None,512)   Maxpooling1D (None, 752, 64)
Layer 3              Fully Connected (None,10)    BatchNormalization (None, 752, 64)
Layer 4              -                            LSTM (None, 64)
Layer 5              -                            Fully Connected (None,10)

Fig. 4. An overview of the data processing and achieved outcome of the Payload-Byte. Both of
the datasets are plotted side by side for better inferring.

V. RESULTS
A comprehensive overview in the form of quantitative data is presented in this Section, along
with the comparative analysis of results obtained by packet and flow-based data. For the
implementation of our developed tool, CSV files of both datasets are pre-processed before
passing them into the Payload-Byte, whereas PCAP files are directly fed into it. The output of
the Payload-Byte is a transformed and labeled pcap file, having 1504 features as shown in
Fig. 3. However, initial stage files can also be extracted from the tool, such as parsed
PCAP files and labeled PCAP files having hex-valued payload. An overview of the processed data
is presented under the next heading.

Fig. 5. A comparison of data instances for individual attack classes in Labeled PCAP and
Processed CSV file. Numbers on the bar graph represent the data instances in the labeled PCAP
file. Normal instances are under-sampled for ease of data representation and usage.

A. Data Processing
The available CSV files for both datasets are distributed among several files. Therefore, they
are combined into a single file for each dataset before any processing. Data preprocessing is
also performed, cleaning and removing erroneous data instances. The crux of the whole
processing is illustrated in Fig. 4, where data in CSV file represents the data instances in
the available CSV file before preprocessing it. The total number of available packets in all
PCAP files is shown as data in the PCAP file. Moreover, labeled data represents data

spanning over 5 days; therefore, they are shown individually. Data Instances in CSV in
Table III and Table IV represents the number of data points that the CSV file contains in the
time span of that individual PCAP file. Unprocessed Labeled PCAP depicts the number of packet
instances labeled, and Processed Labeled PCAP shows the number of remaining packets after
processing and removal of non-payload data from Unprocessed Labeled PCAP.

TABLE IV
SUMMARY OF CIC-IDS2017 PCAP FILES
                          Monday      Tuesday     Wednesday   Thursday    Friday
Data Instances in CSV     306,794     335,415     590,692     285,825     675,965
Data Instances in PCAP    11,626,492  11,469,736  13,705,555  9,240,723   9,915,680
Unprocessed Labeled PCAP  17,096,760  20,692,983  48,931,426  15,727,986  19,740,075
Processed Labeled PCAP    5,633,567   3,516,569   16,436,931  4,071,767   5,848,193
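
To relate the row names in Table IV to the processing steps described for CIC-IDS2017, the following is a minimal Python sketch; it is not the Payload-Byte implementation. It labels parsed packets by the flow-ID (Source IP, Destination IP, Source Port, Destination Port, Protocol), counts the labeled packets, drops packets without payload, and converts the payload hex string into 1500 integer bytes. The field names (`src_ip`, `payload_hex`, `label`, and so on), the exact directional match, the first-label-wins rule, and the zero-padding of short payloads are all assumptions made for illustration. The four header features are omitted because they are not specified in this section.

```python
from typing import Dict, List, Tuple

# Flow ID: (source IP, destination IP, source port, destination port, protocol)
FlowKey = Tuple[str, str, int, int, int]
PAYLOAD_LEN = 1500  # number of payload bytes kept per packet, as stated in the text


def flow_key(rec: dict) -> FlowKey:
    """Build the 5-tuple flow ID from a record (field names are hypothetical)."""
    return (rec["src_ip"], rec["dst_ip"], int(rec["src_port"]),
            int(rec["dst_port"]), int(rec["protocol"]))


def build_label_index(csv_rows: List[dict]) -> Dict[FlowKey, str]:
    """Map each flow ID found in the labeled CSV rows to its label."""
    index: Dict[FlowKey, str] = {}
    for row in csv_rows:
        # Assumption: the first label seen for a flow ID is kept.
        index.setdefault(flow_key(row), row["label"])
    return index


def payload_to_bytes(hex_payload: str, length: int = PAYLOAD_LEN) -> List[int]:
    """Convert a hex string into `length` integers in the range 0..255.

    Assumption: longer payloads are truncated, shorter ones are zero-padded.
    """
    values = list(bytes.fromhex(hex_payload))[:length]
    return values + [0] * (length - len(values))


def label_packets(packets: List[dict], label_index: Dict[FlowKey, str]
                  ) -> Tuple[int, List[Tuple[List[int], str]]]:
    """Return (number of labeled packets, processed rows with payload only)."""
    labeled = 0
    processed: List[Tuple[List[int], str]] = []
    for pkt in packets:
        label = label_index.get(flow_key(pkt))
        if label is None:
            continue  # no CSV flow with this flow ID; packet stays unlabeled
        labeled += 1
        if not pkt["payload_hex"]:
            continue  # non-payload packet, removed during processing
        processed.append((payload_to_bytes(pkt["payload_hex"]), label))
    return labeled, processed
```

In this sketch the first return value plays the role of an Unprocessed Labeled PCAP count and the length of the second plays the role of a Processed Labeled PCAP count, before any duplicate removal or under-sampling of benign instances.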