
A Survey of Network-based Intrusion Detection Data Sets
Markus Ring, Sarah Wunderlich, Deniz Scheuring, Dieter Landes and Andreas Hotho

arXiv:1903.02460v2 [cs.CR] 6 Jul 2019

Abstract—Labeled data sets are necessary to train and evaluate anomaly-based network intrusion detection systems. This work provides a focused literature survey of data sets for network-based intrusion detection and describes the underlying packet- and flow-based network data in detail. The paper identifies 15 different properties to assess the suitability of individual data sets for specific evaluation scenarios. These properties cover a wide range of criteria and are grouped into five categories such as data volume or recording environment for offering a structured search. Based on these properties, a comprehensive overview of existing data sets is given. This overview also highlights the peculiarities of each data set. Furthermore, this work briefly touches upon other sources for network-based data such as traffic generators and data repositories. Finally, we discuss our observations and provide some recommendations for the use and the creation of network-based data sets.

Index Terms—Intrusion Detection, IDS, NIDS, Data Sets, Evaluation, Data Mining

I. INTRODUCTION

IT security is an important issue and much effort has been spent in the research of intrusion and insider threat detection. Many contributions have been published for processing security-related data [1]–[4], detecting botnets [5]–[8], port scans [9]–[12], brute force attacks [13]–[16] and so on. All these works have in common that they require representative network-based data sets. Furthermore, benchmark data sets are a good basis to evaluate and compare the quality of different network intrusion detection systems (NIDS). Given a labeled data set in which each data point is assigned to the class normal or attack, the number of detected attacks or the number of false alarms may be used as evaluation criteria. Unfortunately, there are not too many representative data sets around. According to Sommer and Paxson [17] (2010), the lack of representative publicly available data sets constitutes one of the biggest challenges for anomaly-based intrusion detection. Similar statements are made by Malowidzki et al. [18] (2015) and Haider et al. [19] (2017). However, the community is working on this problem as several intrusion detection data sets have been published over the last years. In particular, the Australian Centre for Cyber Security published the UNSW-NB15 [20] data set, the University of Coburg published the CIDDS-001 [21] data set, and the University of New Brunswick published the CICIDS 2017 [22] data set. More data sets can be expected in the future. However, there is no overall index of existing data sets and it is hard to keep track of the latest developments.

This work provides a literature survey of existing network-based intrusion detection data sets. At first, the underlying data are investigated in more detail. Network-based data appear in packet-based or flow-based format. While flow-based data contain only meta information about network connections, packet-based data also contain payload. Then, this paper analyzes and groups different data set properties which are often used in the literature to evaluate the quality of network-based data sets. The main contribution of this survey is an exhaustive literature overview of network-based data sets and an analysis as to which data set fulfills which data set properties. The paper focuses on attack scenarios within data sets and highlights relations between the data sets. Furthermore, we briefly touch upon traffic generators and data repositories as further sources for network traffic besides typical data sets and provide some observations and recommendations. As a primary benefit, this survey establishes a collection of data set properties as a basis for comparing available data sets and for identifying suitable data sets, given specific evaluation scenarios. Further, we created a website1 which references all mentioned data sets and data repositories, and we intend to update this website.

The rest of the paper is organized as follows. The next section discusses related work. Section III analyzes packet- and flow-based network data in more detail. Section IV discusses typical data set properties which are often used in the literature to evaluate the quality of intrusion detection data sets. Section V gives an overview of existing data sets and checks each data set against the identified properties of Section IV. Section VI briefly touches upon further sources for network-based data. Observations and recommendations are discussed in Section VII before the paper concludes with a summary.

Markus Ring, Sarah Wunderlich, Deniz Scheuring and Dieter Landes were with the Department of Electrical Engineering and Computer Science, Coburg University of Applied Sciences, 96450 Coburg, Germany (e-mail: [email protected], [email protected], [email protected], [email protected]). Andreas Hotho was with the Data Mining and Information Retrieval Group, University of Würzburg, 97074 Würzburg, Germany (e-mail: [email protected]).
1 http://www.dmir.uni-wuerzburg.de/datasets/nids-ds

II. RELATED WORK

This section reviews related work on network-based data sets for intrusion detection. It should be noted that host-based intrusion detection data sets like ADFA [23] are not considered in this paper. Interested readers may find details on host-based intrusion detection data in Glass-Vanderlan et al. [24].

Malowidzki et al. [18] discuss missing data sets as a significant problem for intrusion detection, set up requirements for good data sets, and list available data sets. Koch et al. [25]
provide another overview of intrusion detection data sets,


analyze 13 data sources, and evaluate them with respect to 8
data set properties. Nehinbe [26] provides a critical evaluation
of data sets for IDS and intrusion prevention systems (IPS).
The author examines seven data sets from different sources
(e.g. DARPA data sets and DEFCON data sets), highlights
their limitations, and suggests methods for creating more
realistic data sets. Since many data sets have been published in the
last four years, we continue previous work [18], [25], [26]
from 2011 to 2015, but offer a more up-to-date and more
detailed overview than our predecessors.
While many data set papers (e.g., CIDDS-002 [27],
ISCX [28] or UGR’16 [29]) give just a brief overview of
some intrusion detection data sets, Sharafaldin et al. [30]
provide a more exhaustive review. Their main contribution
is a new framework for generating intrusion detection data
sets. Sharafaldin et al. also analyze 11 available intrusion
detection data sets and evaluate them with respect to 11 data
set properties. In contrast to earlier data set papers, our work
focuses on providing a neutral overview of existing network-
based data sets rather than contributing an additional data set.
Other recent papers also touch upon network-based data
sets, yet with a different primary focus. Bhuyan et al. [31]
present a comprehensive review of network anomaly detection.
The authors describe nine existing data sets and analyze data
sets which are used by existing anomaly detection methods.
Similarly, Nisioti et al. [32] focus on unsupervised methods
for intrusion detection and briefly refer to 12 existing network-
based data sets. Yavanoglu and Aydos [33] analyze and com-
pare the most commonly used data sets for intrusion detection.
However, their review contains only seven data sets including
other data sets like HTTP CSIC 2010 [34]. All in all, these
works tend to have different research objectives and only touch
upon network-based data sets marginally.

III. DATA

Normally, network traffic is captured either in packet-based or flow-based format. Capturing network traffic on packet level is usually done by mirroring ports on network devices. Packet-based data encompass complete payload information. Flow-based data are more aggregated and usually contain only metadata from network connections. Wheelus et al. highlight the distinction through an illustrative comparison: "A good example of the difference between captured packet inspection and NetFlow would be viewing a forest by hiking through the forest as opposed to flying over the forest in a hot air balloon" [35]. In this work, a third category (other data) is introduced. The category other has no standard format and varies for each data set.

Fig. 1. IP, TCP, UDP and ICMP header after [36].

A. Packet-based data

Packet-based data is commonly captured in pcap format and contains payload. Available metadata depends on the network and transport protocols used. There are many different protocols; the most important ones are TCP, UDP, ICMP and IP. Figure 1 illustrates the different headers. TCP is a reliable transport protocol and encompasses metadata like sequence number, acknowledgment number, TCP flags, or checksum values. UDP is a connection-less transport protocol and has a smaller header than TCP which contains only four fields, namely source port, destination port, length and checksum. In contrast to TCP and UDP, ICMP is a supporting protocol containing status messages and is thus even smaller. Normally, there is also an IP header available beside the header of the transport protocol. The IP header provides information such as source and destination IP addresses and is also shown in Figure 1.
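A minimal Python sketch (not part of the surveyed paper) can make these header fields concrete: it reads a pcap trace with the Scapy library and prints a few of the fields named above. The file name capture.pcap and the selection of printed fields are purely illustrative.

# Sketch: inspect IP/TCP/UDP/ICMP header fields of a pcap trace (assumes Scapy is installed).
from scapy.all import rdpcap, IP, TCP, UDP, ICMP

packets = rdpcap("capture.pcap")  # placeholder file name
for pkt in packets:
    if IP not in pkt:
        continue  # skip non-IP frames
    ip = pkt[IP]
    info = {"src": ip.src, "dst": ip.dst, "proto": ip.proto}
    if TCP in pkt:
        tcp = pkt[TCP]
        info.update(sport=tcp.sport, dport=tcp.dport,
                    seq=tcp.seq, ack=tcp.ack, flags=str(tcp.flags))
    elif UDP in pkt:
        udp = pkt[UDP]
        info.update(sport=udp.sport, dport=udp.dport, length=udp.len)
    elif ICMP in pkt:
        info.update(icmp_type=pkt[ICMP].type)
    print(info)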
B. Flow-based data

Flow-based network data is a more condensed format which contains mainly meta information about network connections. Flow-based data aggregate all packets which share some properties within a time window into one flow and usually do not include any payload. The default five-tuple definition, i.e., source IP address, source port, destination IP address, destination port and transport protocol [37], is a widely used standard for matching properties in flow-based data. Flows can appear in unidirectional or bidirectional format. The unidirectional format aggregates all packets from host A to host B which share the above mentioned properties into one flow. All packets from host B to host A are aggregated into another unidirectional flow. In contrast, a bidirectional flow summarizes all packets between hosts A and B, regardless of direction.

Typical flow-based formats are NetFlow [38], IPFIX [37], sFlow [39] and OpenFlow [40]. Table I gives an overview of typical attributes within flow-based network traffic. Depending on the specific flow format and flow exporter, additional attributes like bytes per second, bytes per packet, TCP flags of the first packet, or even the calculated entropy of the payload can be extracted.

TABLE I
ATTRIBUTES IN FLOW-BASED NETWORK TRAFFIC.

# Attribute
1 Date first seen
2 Duration
3 Transport protocol
4 Source IP address
5 Source port
6 Destination IP address
7 Destination port
8 Number of transmitted bytes
9 Number of transmitted packets
10 TCP flags

Furthermore, it is possible to convert packet-based data to flow-based data (but not vice versa) with tools like nfdump2 or YAF3. Readers interested in the differences between flow exporters may find additional details in [41], together with an analysis of how different flow exporters affect botnet classification.

2 https://github.com/phaag/nfdump
3 https://tools.netsa.cert.org/yaf/
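The aggregation described above can be illustrated with a short sketch. The following Python fragment (not taken from the paper) groups already parsed packet records into unidirectional flows keyed by the default five-tuple. The record schema (ts, src_ip, src_port, dst_ip, dst_port, proto, bytes) is an assumption made for illustration, and real flow exporters additionally apply active and idle timeouts, which are omitted here.

# Sketch: aggregate packet records into unidirectional flows keyed by the five-tuple.
from collections import defaultdict

def to_unidirectional_flows(packet_records):
    # packet_records: iterable of dicts with keys ts, src_ip, src_port,
    # dst_ip, dst_port, proto and bytes (illustrative schema).
    flows = defaultdict(lambda: {"first_seen": None, "last_seen": None,
                                 "packets": 0, "bytes": 0})
    for p in packet_records:
        key = (p["src_ip"], p["src_port"], p["dst_ip"], p["dst_port"], p["proto"])
        f = flows[key]
        f["first_seen"] = p["ts"] if f["first_seen"] is None else min(f["first_seen"], p["ts"])
        f["last_seen"] = p["ts"] if f["last_seen"] is None else max(f["last_seen"], p["ts"])
        f["packets"] += 1
        f["bytes"] += p["bytes"]
    return dict(flows)

# A bidirectional variant would merge the key (A, B) with its reverse (B, A),
# e.g. by sorting the two endpoints before building the key.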
C. Other data

This category includes all data sets that are neither purely packet-based nor flow-based. An example of this category might be flow-based data sets which have been enriched with additional information from packet-based data or host-based log files. The KDD CUP 1999 [42] data set is a well-known representative of this category. Each data point has network-based attributes like the number of transmitted source bytes or TCP flags, but also has host-based attributes like the number of failed logins. As a consequence, each data set of this category has its own set of attributes. Since each data set must be analyzed individually, we do not make any general statements about available attributes.
IV. DATA SET PROPERTIES

To be able to compare different intrusion detection data sets side by side and to help researchers find appropriate data sets for their specific evaluation scenario, it is necessary to define common properties as an evaluation basis. Therefore, we explore typical data set properties that are used in the literature to assess intrusion detection data sets. The general concept FAIR [43] defines four principles that scholarly data should fulfill, namely Findability, Accessibility, Interoperability and Reusability. While concurring with this general concept, this work uses more detailed data set properties to provide a focused comparison of network-based intrusion detection data sets. Generally, different data sets emphasize different data set properties. For instance, the UGR'16 data set [29] emphasizes a long recording time to capture periodic effects while the ISCX data set [28] focuses on accurate labeling. Since we aim at investigating more general properties for network-based intrusion detection data sets, we try to unify and generalize properties used in the literature rather than adopting all of them. For example, some approaches evaluate the presence of specific kinds of attacks like DoS (Denial of Service) or browser injections. The presence of certain attack types may be a relevant property for evaluating detection approaches for those specific attack types, but is meaningless for other approaches. Hence, we use the general property attacks which describes the presence of malicious network traffic (see Table III). Section V provides more details on the different attack types in the data sets together with a discussion of other particular properties.

We do not develop an evaluation score like Haider et al. [19] or Sharafaldin et al. [30] since we do not want to judge the importance of different data set properties. In our opinion, the importance of certain properties depends on the specific evaluation scenario and should not be generally judged in a survey. Rather, readers should be put in a position to find suitable data sets for their needs. Therefore, we group the data set properties discussed below into five categories to support a systematic search. Figure 2 summarizes all data set properties and their value ranges.

Fig. 2. Data set properties and their value ranges.

A. General Information

The following four properties reflect general information about the data set, namely the year of creation, availability, and the presence of normal and malicious network traffic.

1) Year of Creation: Since network traffic is subject to concept drift and new attack scenarios appear daily, the age of an intrusion detection data set plays an important role. This property describes the year of creation. The year in which the underlying network traffic of a data set was captured is more relevant for up-to-dateness than the year of its publication.

2) Public Availability: Intrusion detection data sets should be publicly available to serve as a basis for comparing different intrusion detection methods. Furthermore, the quality of data sets can only be checked by third parties if they are publicly available. Table III encompasses three different characteristics for this property: yes, o.r. (on request), and no. On request means that access will be granted after sending a message to the authors or the responsible person.

3) Normal User Behavior: This property indicates the availability of normal user behavior within a data set and takes the values yes or no. The value yes indicates that there is normal user behavior within the data set, but it does not make any statement about the presence of attacks. In general, the quality of an IDS is primarily determined by its attack detection rate and false alarm rate. Therefore, the presence of normal user behavior is indispensable for evaluating an IDS. The absence of normal user behavior, however, does not make a data set unusable, but rather indicates that it has to be merged with other data sets or with real world network traffic. Such a merging step is often called overlaying or salting [44], [45].

4) Attack Traffic: IDS data sets should include various attack scenarios. This property indicates the presence of malicious network traffic within a data set and has the value yes if the data set contains at least one attack. Table IV provides additional information about the specific attack types.

B. Nature of Data

Properties of this category describe the format of the data sets and the presence of meta information.

1) Metadata: Content-related interpretation of packet-based and flow-based network traffic is difficult for third parties. Therefore, data sets should come along with metadata to provide additional information about the network structure, IP addresses, attack scenarios and so on. This property indicates the presence of additional metadata.

2) Format: Network intrusion detection data sets appear in different formats. We roughly divide them into three formats (see Section III). (1) Packet-based network traffic (e.g. pcap) contains network traffic with payload. (2) Flow-based network traffic (e.g. NetFlow) contains only meta information about network connections. (3) Other types of data sets may contain, e.g., flow-based traces with additional attributes from packet-based data or even from host-based log files.

3) Anonymity: Frequently, intrusion detection data sets may not be published due to privacy reasons or are only available in anonymized form. This property indicates if data is anonymized and which attributes are affected. The value none in Table III indicates that no anonymization has been performed. The value yes (IPs) means that IP addresses are either anonymized or removed from the data set. Similarly, yes (payload) indicates that payload information is anonymized or removed from packet-based network traffic.
C. Data Volume

Properties in this category characterize data sets in terms of volume and duration.

1) Count: The property count describes a data set's size as either the number of contained packets/flows/points or the physical size in Gigabyte (GB).

2) Duration: Data sets should cover network traffic over a long time for capturing periodical effects (e.g., daytime vs. night or weekday vs. weekend) [29]. The property duration provides the recording time of each data set.

D. Recording Environment

Properties in this category delineate the network environment and conditions in which the data sets are captured.

1) Kind of Traffic: The property Kind of Traffic describes three possible origins of network traffic: real, emulated, or synthetic. Real means that real network traffic was captured within a productive network environment. Emulated means that real network traffic was captured within a test bed or emulated network environment. Synthetic means that the network traffic was created synthetically (e.g., through a traffic generator) and not captured by a real (or virtual) network device.

2) Type of Network: Network environments in small and medium-sized companies are fundamentally different from those of internet service providers (ISP). As a consequence, different environments require different security systems, and evaluation data sets should be adapted to the specific environment. This property describes the underlying network environment in which the respective data set was created.

3) Complete Network: This property is adopted from Sharafaldin et al. [30] and indicates if the data set contains the complete network traffic from a network environment with several hosts, routers and so on. If the data set contains only network traffic from a single host (e.g., honeypot) or only some protocols from the network traffic (e.g., exclusively SSH traffic), the value is set to no.

E. Evaluation

The following properties are related to the evaluation of intrusion detection methods using network-based data sets. More precisely, the properties denote the availability of predefined subsets, the data set's balance, and the presence of labels.

1) Predefined Splits: Sometimes it is difficult to compare the quality of different IDS, even if they are evaluated on the same data set. In that case, it must be clarified whether the same subsets are used for training and evaluation. This property provides the information if a data set comes along with predefined subsets for training and evaluation.

2) Balanced: Often, machine learning and data mining methods are used for anomaly-based intrusion detection. In the training phase of such methods (e.g., decision tree classifiers), data sets should be balanced with respect to their class labels. Consequently, data sets should contain the same number of data points from each class (normal and attack). Real-world network traffic, however, is not balanced and contains more normal user behavior than attack traffic. This property indicates if data sets are balanced with respect to their class labels. Imbalanced data sets should be balanced by appropriate preprocessing before data mining algorithms are used. He and Garcia [46] provide a good overview of learning from imbalanced data.

3) Labeled: Labeled data sets are necessary for training supervised methods and for evaluating supervised as well as unsupervised intrusion detection methods. This property denotes if data sets are labeled or not. If there are at least the two classes normal and attack, this property is set to yes. Possible values in this property are: yes, yes with BG. (yes with background), yes (IDS), indirect, and no. Yes with background means that there is a third class background. Packets, flows, or data points which belong to the class background could be normal or attack. Yes (IDS) means that some kind of intrusion detection system was used to create the data set's labels. Some labels of the data set might be wrong since an IDS might be imperfect. Indirect means that the data set has no explicit labels, but labels can be created on one's own from additional log files.
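Given labels of the two classes normal and attack, the attack detection rate and false alarm rate referred to above can be computed directly from the counts of correctly and incorrectly classified data points. The following Python sketch is generic and purely illustrative; it is not tied to any particular data set or IDS.

# Sketch: detection rate (recall on the attack class) and false alarm rate from labels.
def detection_and_false_alarm_rate(y_true, y_pred):
    # y_true, y_pred: sequences whose entries are either "normal" or "attack".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == "attack" and p == "attack")
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == "attack" and p == "normal")
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == "normal" and p == "attack")
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == "normal" and p == "normal")
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0    # detected attacks / all attacks
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0  # false alarms / all normal points
    return detection_rate, false_alarm_rate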
V. DATA SETS

In our opinion, the data set properties Labeled and Format are the most decisive properties when searching for adequate network-based data sets. The intrusion detection method (supervised or unsupervised) determines if labels are necessary or not and which kind of data is required (packet, flow or other). Therefore, Table II provides a classification of all investigated network-based data sets with respect to these two properties. A more detailed overview of network-based intrusion detection data sets with respect to the data set properties of Section IV is given in Table III. The presence of specific attack scenarios is an important aspect when searching for a network-based data set. Therefore, Table III indicates the presence of attack traffic while Table IV provides details on specific attacks within a data set. Papers on data sets describe attacks on different abstraction levels. Vasudevan et al. [47], for instance, characterized attack traffic within their data set (SSENET-2011) as follows: "Nmap, Nessus, Angry IP scanner, Port Scanner, Metaploit, Backtrack OS, LOIC, etc., were some of the attack tools used by the participants to launch the attacks.". In contrast, Ring et al. specify the number and different types of executed port scans in their CIDDS-002 data set [27]. Consequently, the abstraction level of attack descriptions may vary in Table IV. A detailed description of all attack types is beyond the scope of this work. Rather, we refer interested readers to the open access paper "From Intrusion Detection to an Intrusion Response System: Fundamentals, Requirements, and Future Directions" by Anwar et al. [48].

Further, some data sets are modifications or combinations of others. Figure 3 shows the interrelationships among several well-known data sets.

Fig. 3. Relationships between the data sets in Table III.

TABLE II
DECISION SUPPORT TABLE FOR FINDING APPROPRIATE NETWORK-BASED DATA SETS. SOME DATA SETS LIKE CTU-13 PROVIDE SEVERAL DATA FORMATS AND APPEAR IN SEVERAL COLUMNS. (+) INDICATES THAT THE DATA SET IS PUBLICLY AVAILABLE. (?) INDICATES THAT WE WERE NOT ABLE TO FIND THE DATA SET. (-) INDICATES THAT THE DATA SET IS NOT PUBLICLY AVAILABLE.

Labeled: yes
  packet: (+) Botnet, (+) CIC DoS, (+) CICIDS 2017, (+) DARPA, (+) DDoS 2016, (+) ISCX 2012, (+) ISOT, (+) NDSec-1, (+) NGIDS-DS, (+) TRAbID, (+) TUIDS, (+) UNSW-NB15, (-) IRSC
  flow: (+) CICIDS 2017, (+) CIDDS-001, (+) CIDDS-002, (+) ISCX 2012, (+) TUIDS, (+) Twente, (-) IRSC
  other: (+) AWID, (+) KDD CUP 99, (+) Kyoto 2006+, (+) NSL-KDD, (+) UNSW-NB 15, (?) PU-IDS, (?) SSENET-2011, (?) SSENET-2014, (-) SANTA

Labeled: yes with BG
  packet: (+) CTU-13
  flow: (+) CTU-13, (+) UGR'16

Labeled: yes (IDS)
  flow: (?) PUF

Labeled: indirect
  flow: (+) SSHCure

Labeled: no
  packet: (+) Booters, (+) CDX, (+) LBNL
  flow: (+) Kent 2016, (+) UNIBS, (+) Unified Host and Network

Network-based data sets in alphabetical order

AWID [49]. AWID is a publicly available data set4 which is focused on 802.11 networks. Its creators used a small network environment (11 clients) and captured WLAN traffic in packet-based format. In one hour, 37 million packets were captured. 156 attributes are extracted from each packet. Malicious network traffic was generated by executing 16 specific attacks against the 802.11 network. AWID is labeled and split into a training and a test subset.

Booters [50]. Booters are Distributed Denial of Service (DDoS) attacks offered as a service by criminals. Santanna et al. [50] published a data set which includes traces of nine different booter attacks which were executed against a null-routed IP address within their network environment. The resulting data set is recorded in packet-based format and consists of more than 250GB of network traffic. Individual packets are not labeled, but the different booter attacks are split into different files. The data set is publicly available5, but the names of the booters are anonymized for privacy reasons.

Botnet [5]. The Botnet data set is a combination of existing data sets and is publicly available6. The creators of Botnet used the overlay methodology of [44] to combine (parts of) the ISOT [57], ISCX 2012 [28] and CTU-13 [3] data sets. The resulting data set contains various botnets and normal user behavior. The Botnet data set is divided into a 5.3 GB training subset and an 8.5 GB test subset, both in packet-based format.

CIC DoS [51]. CIC DoS is a data set from the Canadian Institute for Cybersecurity and is publicly available7. The authors' intention was to create an intrusion detection data set with application layer DoS attacks. Therefore, the authors executed eight different DoS attacks on the application layer. Normal user behavior was generated by combining the resulting traces with attack-free traffic from the ISCX 2012 [28] data set. The resulting data set is available in packet-based format and contains 24 hours of network traffic.

CICIDS 2017 [22]. CICIDS 2017 was created within an emulated environment over a period of 5 days and contains network traffic in packet-based and bidirectional flow-based format. For each flow, the authors extracted more than 80 attributes and provide additional metadata about IP addresses and attacks. Normal user behavior is executed through scripts. The data set contains a wide range of attack types like SSH brute force, heartbleed, botnet, DoS, DDoS, web and infiltration attacks. CICIDS 2017 is publicly available8.

4 http://icsdweb.aegean.gr/awid/index.html
5 https://www.simpleweb.org/wiki/index.php
6 http://www.unb.ca/cic/datasets/botnet.html
7 http://www.unb.ca/cic/datasets/dos-dataset.html
8 http://www.unb.ca/cic/datasets/ids-2017.html
TABLE III
OVERVIEW OF NETWORK-BASED DATA SETS.

Columns (grouped as General Information | Nature of the Data | Data Volume | Recording Environment | Evaluation):
Data Set | Year of Traffic Creation | Public Avail. | Normal Traffic | Attack Traffic | Metadata | Format | Anonymity | Count | Duration | Kind of Traffic | Type of Network | Compl. Network | Predef. Splits | Balanced | Labeled
AWID [49] 2015 o.r. yes yes yes other none 37M packets 1 hour emulated small network yes yes no yes
Booters [50] 2013 yes no yes no packet yes 250GB packets 2 days real small network no no no no
Botnet [5] 2010/2014 yes yes yes yes packet none 14GB packets n.s. emulated diverse networks yes yes no yes
CIC DoS [51] 2012/2017 yes yes yes no packet none 4.6GB packets 24 hours emulated small network yes no no yes
CICIDS 2017 [22] 2017 yes yes yes yes packet, bi. flow none 3.1M flows 5 days emulated small network yes no no yes
CIDDS-001 [21] 2017 yes yes yes yes uni. flow yes (IPs) 32M flows 28 days emulated and real small network yes no no yes
CIDDS-002 [27] 2017 yes yes yes yes uni. flow yes (IPs) 15M flows 14 days emulated small network yes no no yes
CDX [52] 2009 yes yes yes yes packet none 14GB packets 4 days real small network yes no no no
CTU-13 [3] 2013 yes yes yes yes uni. and bi. flow, packet yes (payload) 81M flows 125 hours real university network yes no no yes with BG.
DARPA [53], [54] 1998/99 yes yes yes yes packet, logs none n.s. 7/5 weeks emulated small network yes yes no yes
DDoS 2016 [55] 2016 yes yes yes no packet yes (IPs) 2.1M packets n.s. synthetic n.s. n.s. no no yes
IRSC [56] 2015 no yes yes no packet, flow n.s. n.s. n.s. real production network yes n.s. n.s. yes
ISCX 2012 [28] 2012 yes yes yes yes packet, bi. flow none 2M flows 7 days emulated small network yes no no yes
ISOT [57] 2010 yes yes yes yes packet none 11GB packets n.s. emulated small network yes no no yes
KDD CUP 99 [42] 1998 yes yes yes no other none 5M points n.s. emulated small network yes yes no yes
Kent 2016 [58], [59] 2016 yes yes n.s. no uni. flow, logs yes (IPs, Ports, date) 130M flows 58 days real enterprise network yes no no no
Kyoto 2006+ [60] 2006 to 2009 yes yes yes no other yes (IPs) 93M points 3 years real honeypots no no no yes
LBNL [61] 2004 / 2005 yes yes yes no packet yes 160M packets 5 hours real enterprise network yes no no no
NDSec-1 [62] 2016 o.r. no yes no packet, logs none 3.5M packets n.s. emulated small network yes no no yes
NGIDS-DS [19] 2016 yes yes yes no packet, logs none 1M packets 5 days emulated small network yes no no yes
NSL-KDD [63] 1998 yes yes yes no other none 150k points n.s. emulated small network yes yes no yes
PU-IDS [64] 1998 n.i.f. yes yes no other none 200k points n.s. synthetic small network yes no no yes
PUF [65] 2018 n.i.f. yes yes no uni. flow yes (IPs) 300k flows 3 days real university network no no no yes (IDS)
SANTA [35] 2014 no yes yes no other yes (payload) n.s. n.s. real ISP yes n.s. no yes
SSENET-2011 [47] 2011 n.i.f. yes yes no other none n.s. 4 hours emulated small network yes no no yes
SSENET-2014 [66] 2011 n.i.f. yes yes no other none 200k points 4 hours emulated small network yes yes yes yes
SSHCure [67] 2013 / 2014 yes yes yes no uni. and bi. flow, logs yes (IPs) 2.4GB flows (compressed) 2 months real university network yes no no indirect
TRAbID [68] 2017 yes yes yes no packet yes (IPs) 460M packets 8 hours emulated small network yes yes no yes
TUIDS [69], [70] 2011 / 2012 o.r. yes yes no packet, bi. flow none 250k flows 21 days emulated medium network yes yes no yes
Twente [71] 2008 yes no yes yes uni. flow yes (IPs) 14M flows 6 days real honeypot no no no yes
UGR'16 [29] 2016 yes yes yes some uni. flows yes (IPs) 16900M flows 4 months real ISP yes yes no yes with BG.
UNIBS [72] 2009 o.r. yes no no flow yes (IPs) 79k flows 3 days real university network yes no no no
Unified Host and Network [73] 2017 yes yes n.s. no bi. flows, logs yes (IPs and date) 150GB flows (compressed) 90 days real enterprise network yes no no no
UNSW-NB15 [20] 2015 yes yes yes yes packet, other none 2M points 31 hours emulated small network yes yes no yes

n.s. = not specified, n.i.f. = no information found, uni. flow = unidirectional flow, bi. flow = bidirectional flow, yes with BG. = yes with background labels
TABLE IV
ATTACKS WITHIN THE NETWORK-BASED DATA SETS OF TABLE III. SPECIFIC ATTACK INFORMATION (E.G. NAME OF THE EXECUTED BOTNET) AND USED TOOLS ARE PROVIDED IN ROUND BRACKETS IF AVAILABLE.

Data Set: Attacks
AWID [49]: Popular attacks on 802.11 (e.g. authentication request, ARP flooding, injection, probe request)
Booters [50]: 9 different DDoS attacks
Botnet [5]: botnets (Menti, Murlo, Neris, NSIS, Rbot, Sogou, Strom, Virut, Zeus)
CIC DoS [51]: Application layer DoS attacks (executed through ddossim, Goldeneye, hulk, RUDY, Slowhttptest, Slowloris)
CICIDS 2017 [22]: botnet (Ares), cross-site-scripting, DoS (executed through Hulk, GoldenEye, Slowloris, and Slowhttptest), DDoS (executed through LOIC), heartbleed, infiltration, SSH brute force, SQL injection
CIDDS-001 [21]: DoS, port scans (ping-scan, SYN-Scan), SSH brute force
CIDDS-002 [27]: port scans (ACK-Scan, FIN-Scan, ping-Scan, UDP-Scan, SYN-Scan)
CDX [52]: not specified
CTU-13 [3]: botnets (Menti, Murlo, Neris, NSIS, Rbot, Sogou, Virut)
DARPA [53], [54]: DoS, privilege escalation (remote-to-local and user-to-root), probing
DDoS 2016 [55]: DDoS (HTTP flood, SIDDOS, smurf ICMP flood, UDP flood)
IRSC [56]: n.s.
ISCX 2012 [28]: Four attack scenarios (1: Infiltrating the network from the inside; 2: HTTP DoS; 3: DDoS using an IRC botnet; 4: SSH brute force)
ISOT [57]: botnet (Storm, Waledac)
KDD CUP 99 [42]: DoS, privilege escalation (remote-to-local and user-to-root), probing
Kent 2016 [58], [59]: not specified
Kyoto 2006+ [60]: Various attacks against honeypots (e.g. backscatter, DoS, exploits, malware, port scans, shellcode)
LBNL [61]: port scans
NDSec-1 [62]: botnet (Citadel), brute force (against FTP, HTTP and SSH), DDoS (HTTP floods, SYN flooding and UDP floods), exploits, probe, spoofing, SSL proxy, XSS/SQL injection
NGIDS-DS [19]: backdoors, DoS, exploits, generic, reconnaissance, shellcode, worms
NSL-KDD [63]: DoS, privilege escalation (remote-to-local and user-to-root), probing
PU-IDS [64]: DoS, privilege escalation (remote-to-local and user-to-root), probing
PUF [65]: DNS attacks
SANTA [35]: (D)DoS (ICMP flood, RUDY, SYN flood), DNS amplification, heartbleed, port scans
SSENET-2011 [47]: DoS (executed through LOIC), port scans (executed through Angry IP Scanner, Nessus, Nmap), various attack tools (e.g. metasploit)
SSENET-2014 [66]: botnet, flooding, privilege escalation, port scans
SSHCure [67]: SSH attacks
TRAbID [68]: DoS (HTTP flood, ICMP flood, SMTP flood, SYN flood, TCP keepalive), port scans (ACK-Scan, FIN-Scan, NULL-Scan, OS Fingerprinting, Service Fingerprinting, UDP-Scan, XMAS-Scan)
TUIDS [69], [70]: botnet (IRC), DDoS (Fraggle flood, Ping flood, RST flood, smurf ICMP flood, SYN flood, UDP flood), port scans (e.g. FIN-Scan, NULL-Scan, UDP-Scan, XMAS-Scan), coordinated port scan, SSH brute force
Twente [71]: Attacks against a honeypot with three open services (FTP, HTTP, SSH)
UGR'16 [29]: botnet (Neris), DoS, port scans, SSH brute force, spam
UNIBS [72]: none
Unified Host and Network [73]: not specified
UNSW-NB15 [20]: backdoors, DoS, exploits, fuzzers, generic, port scans, reconnaissance, shellcode, spam, worms

CIDDS-001 [21]. The CIDDS-001 data set was captured within an emulated small business environment in 2017, contains four weeks of unidirectional flow-based network traffic, and comes along with a detailed technical report with additional information. As a special feature, the data set encompasses an external server which was attacked in the internet. In contrast to honeypots, this server was also regularly used by the clients from the emulated environment. Normal and malicious user behavior was executed through python scripts which are publicly available on GitHub9. These scripts allow an ongoing generation of new data sets and can be used by other researchers. The CIDDS-001 data set is publicly available10 and contains SSH brute force, DoS and port scan attacks as well as several attacks captured from the wild.

CIDDS-002 [27]. CIDDS-002 is a port scan data set which was created based on the scripts of CIDDS-001. The data set contains two weeks of unidirectional flow-based network traffic within an emulated small business environment. CIDDS-002 contains normal user behavior as well as a wide range of different port scan attacks. A technical report provides additional meta information about the data set, where external IP addresses are anonymized. The data set is publicly available11.

CDX [52]. Sangster et al. [52] propose a concept to create network-based data sets from network warfare competitions and discuss the advantages and disadvantages of such an approach comprehensively. The CDX data set contains network traffic of a four day network warfare competition in 2009. The traffic is recorded in packet-based format and is publicly available12. CDX contains normal user behaviour as well as several types of attacks. An additional plan describes metadata about the network structure and IP addresses, but the individual packets are not labeled. Further, host-based log files and warnings from an IDS are available.

CTU-13 [3]. The CTU-13 data set was captured in the year 2013 and is available in three formats: packet, unidirectional flow, and bidirectional flow13. It was captured in a university network and distinguishes 13 scenarios containing different botnet attacks. Additional information about infected hosts is provided at the website. Traffic was labeled using a three stage approach. In the first stage, all traffic to and from infected hosts is labeled as botnet. In the second stage, traffic which matches specific filters is labeled as normal. Remaining traffic is labeled as background. Consequently, background traffic could be normal or malicious. The authors recommend a split of their data set into training and test subsets [3].
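The three-stage labeling policy described above can be summarized by the following decision sketch. It merely restates the published labeling idea in code form; infected_hosts and normal_filters are hypothetical placeholders, and the actual CTU-13 labeling tooling is not reproduced here.

# Sketch of the three-stage labeling idea described above (illustrative only).
def label_flow(flow, infected_hosts, normal_filters):
    # infected_hosts: set of IP addresses known to run a botnet.
    # normal_filters: list of predicates identifying known-normal traffic.
    # Stage 1: traffic to or from an infected host is labeled as botnet.
    if flow["src_ip"] in infected_hosts or flow["dst_ip"] in infected_hosts:
        return "botnet"
    # Stage 2: traffic matching specific filters is labeled as normal.
    if any(matches(flow) for matches in normal_filters):
        return "normal"
    # Stage 3: everything else remains background (could be normal or malicious).
    return "background"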
various attack tools (e.g. metasploit) DARPA [53], [54], [74]. The DARPA 1998/99 data sets are
SSENET-2014 [66] botnet, flooding, privilege escalation, port scans
SSHCure [67] SSH attacks the most popular data sets for intrusion detection and were
TRAbID [68] DoS (HTTP flood, ICMP flood, SMTP flood, created at the MIT Lincoln Lab within an emulated network
SYN flood, TCP keepalive), port scans (ACK- environment. The DARPA 1998 and DARPA 1999 data sets
Scan, FIN-Scan, NULL-Scan, OS Fingerprinting,
Service Fingerprinting, UDP-Scan, XMAS-Scan) contain seven and, respectively, five weeks of network traffic
TUIDS [69], [70] botnet (IRC), DDoS (Fraggle flood, Ping flood, in packet-based format, including various kinds of attacks
RST flood, smurf ICMP flood, SYN flood, UDP like DoS, buffer overflow, port scans, or rootkits. Additional
flood), port scans (e.g. FIN-Scan, NULL-Scan,
UDP-Scan, XMAS-Scan), coordinated port scan, information as well as download links can be found at the
SSH brute force website14 . In spite (or because) of their wide distribution, the
Twente [71] Attacks against a honeypot with three open ser-
vices (FTP, HTTP, SSH) 9 https://github.com/markusring/CIDDS
UGR’16 [29] botnet (Neris), DoS, port scans, SSH brute force, 10 http://www.hs-coburg.de/cidds
spam
11 http://www.hs-coburg.de/cidds
UNIBS [72] none
12 https://www.usma.edu/crc/sitepages/datasets.aspx
Unified Host and not specified
Network [73] 13 https://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-b
UNSW-NB15 [20] backdoors, DoS, exploits, fuzzers, generic, port html
scans, reconnaissance, shellcode, spam, worms 14 https://www.ll.mit.edu/ideval/docs/index.html
data sets are often criticized for artificial attack injections or the large amount of redundancy [63], [75].

DDoS 2016 [55]. Alkasassbeh et al. [55] published a packet-based data set which was created using the network simulator NS2 in 2016. Detailed information about the simulated network environment is not available. The DDoS 2016 data set focuses on different types of DDoS attacks. In addition to normal network traffic, the data set contains four different types of DDoS attacks: UDP flood, smurf, HTTP flood, and SIDDOS. The data set contains 2.1 million packets and can be downloaded at researchgate15.

IRSC [56]. The IRSC data set was recorded in 2015, using an inventive approach. Real network traffic with normal user behavior and attacks from the internet were captured. In addition to that, additional attacks were run manually. The IDS SNORT16 and manual inspection were used for labeling. Since the data set is not publicly available due to privacy concerns, we are not able to fill in all properties in Table III.

ISCX 2012 [28]. The ISCX data set was created in 2012 by capturing traffic in an emulated network environment over one week. The authors used a dynamic approach to generate an intrusion detection data set with normal as well as malicious network behavior. So-called α profiles define attack scenarios while β profiles characterize normal user behavior like writing e-mails or browsing the web. These profiles are used to create a new data set in packet-based and bidirectional flow-based format. The dynamic approach allows an ongoing generation of new data sets. ISCX can be downloaded at the website17 and contains various types of attacks like SSH brute force, DoS or DDoS.

ISOT [57]. The ISOT data set was created in 2010 by combining normal network traffic from the Traffic Lab at Ericsson Research in Hungary [76] and the Lawrence Berkeley National Lab (LBNL) [61] with malicious network traffic from the French chapter of the honeynet project18. ISOT was used for detecting P2P botnets [57]. The resulting data set is publicly available19 and contains 11 GB of packet-based data in pcap format.

KDD CUP 99 [42]. KDD CUP 99 is based on the DARPA data set and is among the most widespread data sets for intrusion detection. Since it is neither in standard packet- nor in flow-based format, it belongs to the category other. The data set contains basic attributes about TCP connections and high-level attributes like the number of failed logins, but no IP addresses. KDD CUP 99 encompasses more than 20 different types of attacks (e.g. DoS or buffer overflow) and comes along with an explicit test subset. The data set includes 5 million data points and can be downloaded freely20.

Kent 2016 [58], [59]. This data set was captured over 58 days at the Los Alamos National Laboratory network. It contains around 130 million flows of unidirectional flow-based network traffic as well as several host-based log files. Network traffic is heavily anonymized for privacy reasons. The data set is not labeled and can be downloaded at the website21.

Kyoto 2006+ [60]. Kyoto 2006+ is a publicly available honeypot data set22 which contains real network traffic, but includes only a small amount and a small range of realistic normal user behavior. Kyoto 2006+ is categorized as other since the IDS Bro23 was used to convert packet-based traffic into a new format called sessions. Each session comprises 24 attributes, 14 out of which characterize statistical information inspired by the KDD CUP 99 data set. The remaining 10 attributes are typical flow-based attributes like IP addresses (in anonymized form), ports, or duration. A label attribute indicates the presence of attacks. Data were captured over three years. As a consequence of that unusually long recording period, the data set contains about 93 million sessions.

LBNL [61]. Research on intrusion detection data sets often refers to the LBNL data set. Thus, for the sake of completeness, this data set is also added to the list. The creation of the LBNL data set was mainly motivated by analyzing characteristics of network traffic within enterprise networks, rather than publishing intrusion detection data. According to its creators, the data set might still be used as background traffic for security researchers as it contains almost exclusively normal user behavior. The data set is not labeled, but anonymized for privacy reasons, and contains more than 100 hours of network traffic in packet-based format. The data set can be downloaded at the website24.

NDSec-1 [62]. The NDSec-1 data set is remarkable since it is designed as an attack composition for network security. According to the authors, this data set can be reused to salt existing network traffic with attacks using overlay methodologies like [44]. NDSec-1 is publicly available on request25 and was captured in packet-based format in 2016. It contains additional syslog and windows event log information. The attack composition of NDSec-1 encompasses botnet, brute force attacks (against FTP, HTTP and SSH), DoS (HTTP flooding, SYN flooding, UDP flooding), exploits, port scans, spoofing, and XSS/SQL injection.

NGIDS-DS [19]. The NGIDS-DS data set contains network traffic in packet-based format as well as host-based log files. It was generated in an emulated environment, using the IXIA Perfect Storm tool to generate normal user behavior as well as attacks from seven different attack families (e.g. DoS or worms). Consequently, the quality of the generated data depends primarily on the IXIA Perfect Storm hardware26. The labeled data set contains approximately 1 million packets and is publicly available27.

15 https://www.researchgate.net/publication/292967044_Dataset-_Detecting_Distributed_Denial_of_Service_Attacks_Using_Data_Mining_Techniques
16 https://www.snort.org/
17 http://www.unb.ca/cic/datasets/ids.html
18 http://honeynet.org/chapters/france
19 https://www.uvic.ca/engineering/ece/isot/datasets/
20 http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
21 https://csr.lanl.gov/data/cyber1/
22 http://www.takakura.com/Kyoto_data/
23 https://www.bro.org/
24 http://icir.org/enterprise-tracing/
25 http://www2.hs-fulda.de/NDSec/NDSec-1/
26 https://www.ixiacom.com/products/perfectstorm
27 https://research.unsw.edu.au/people/professor-jiankun-hu

NSL-KDD [63]. NSL-KDD enhances the KDD CUP 99. A major criticism against the KDD CUP 99 data set is the
large amount of redundancy [63]. Therefore, the authors of NSL-KDD removed duplicates from the KDD CUP 99 data set and created more sophisticated subsets. The resulting data set contains about 150,000 data points and is divided into predefined training and test subsets for intrusion detection methods. NSL-KDD uses the same attributes as KDD CUP 99 and belongs to the category other. Yet, it should be noted that the underlying network traffic of NSL-KDD dates back to the year 1998. The data set is publicly available28.

PU-IDS [64]. The PU-IDS data set is a derivative of the NSL-KDD data set. The authors developed a generator which extracts statistics of an input data set and uses these statistics to generate new synthetic instances. As a consequence, the work of Singh et al. [64] could be seen as a traffic generator to create PU-IDS, which contains about 200,000 data points and has the same attributes and format as the NSL-KDD data set. As NSL-KDD is based on KDD CUP 1999 which in turn is extracted from DARPA 1998, the year of creation is set to 1998 since the input for the traffic generator was captured back then.

PUF [65]. Recently, Sharma et al. [65] published the flow-based PUF data set which was captured over three days within a campus network and contains exclusively DNS connections. 38,120 out of a total of 298,463 unidirectional flows are malicious while the remaining ones reflect normal user activity. All flows are labeled using logs of an intrusion prevention system. For privacy reasons, IP addresses are removed from the data set. The authors intend to make PUF publicly available.

SANTA [35]. The SANTA data set was captured within an ISP environment and contains real network traffic. The network traffic is labeled through an exhaustive manual procedure and stored in a so-called session-based format. This data format is similar to NetFlow but enriched with additional attributes which are calculated by using information from packet-based data. The authors spent much effort on the generation of additional attributes which should enhance intrusion detection methods. SANTA is not publicly available.

SSENet-2011 [47]. SSENet-2011 was captured within an emulated environment over four hours. It contains several attacks like DoS or port scans. Browsing activities of participants generated normal user behavior. Each data point is characterized by 24 attributes. The data set belongs to the category other since the tool Tstat was used to extract adjusted data points from packet-based traffic. We found no information about public availability.

SSENet-2014 [66]. SSENet-2014 was created by extracting attributes from the packet-based files of SSENet-2011 [47]. Thus, like SSENet-2011, the data set is categorized as other. The authors extracted 28 attributes for each data point which describe host-based and network-based attributes. The created attributes are in line with KDD CUP 1999. SSENet-2014 contains 200,000 labeled data points and is divided into a training and a test subset. SSENet-2014 is the only known data set with a balanced training subset. Again, no information on public availability could be found.

SSHCure [67]. Hofstede et al. [67] propose SSHCure, a tool for SSH attack detection. To evaluate their work, the authors captured two data sets (each with a period of one month) within a university network. The resulting data sets are publicly available29 and contain exclusively SSH network traffic. The flow-based network traffic is not directly labeled. Instead, the authors provide additional host-based log files which may be used to check if SSH login attempts were successful or not.

TRAbID [68]. Viegas et al. proposed the TRAbID database [68] in 2017. This database contains 16 different scenarios for evaluating IDS. Each scenario was captured within an emulated environment (1 honeypot server and 100 clients). In each scenario, the traffic was captured for a period of 30 minutes and some attacks were executed. To label the network traffic, the authors used the IP addresses of the clients. All clients were Linux machines. Some clients exclusively performed attacks while most of the clients exclusively handled normal user requests to the honeypot server. Normal user behavior includes HTTP, SMTP, SSH and SNMP traffic while malicious network traffic encompasses port scans and DoS attacks. TRAbID is publicly available30.

TUIDS [69], [70]. The labeled TUIDS data set can be divided into three parts: TUIDS Intrusion data set, TUIDS coordinated scan data set and TUIDS DDoS data set. As the names already indicate, the data sets contain normal user behavior and primarily attacks like port scans or DDoS. Data were generated within an emulated environment which contains around 250 clients. Traffic was captured in packet-based and bidirectional flow-based format. Each subset spans a period of seven days and all three subsets contain around 250,000 flows. Unfortunately, the link31 to the data set in the original publication seems to be outdated. However, the authors respond to e-mail requests.

Twente [71]. Sperotto et al. [71] published one of the first flow-based intrusion detection data sets in 2008. This data set spans six days of traffic involving a honeypot server which offers web, FTP, and SSH services. Due to this approach, the data set contains only network traffic from the honeypot and nearly all flows are malicious, without normal user behavior. The authors analyzed log files and traffic in packet-based format for labeling the flows of this data set. The data set is publicly available32 and IP addresses were removed due to privacy concerns.

28 http://www.unb.ca/cic/datasets/nsl.html
29 https://www.simpleweb.org/wiki/index.php
30 https://secplab.ppgia.pucpr.br/trabid
31 http://agnigarh.tezu.ernet.in/~dkb/resources.html
32 https://www.simpleweb.org/wiki/index.php

UGR'16 [29]. UGR'16 is a unidirectional flow-based data set. Its focus lies on capturing periodic effects in an ISP environment. Thus, it spans a period of four months and contains 16,900 million unidirectional flows. IP addresses are anonymized and the flows are labeled as normal, background, or attack. The authors explicitly executed several attacks (botnet, DoS, and port scans) within that data set. The corresponding flows are labeled as attacks and some other attacks were identified and manually labeled as attack. Injected normal user behavior and traffic which matches certain patterns are
labeled as normal. However, most of the traffic is labeled as background, which could be normal or an attack. The data set is publicly available33.

UNIBS 2009 [72]. Like LBNL [61], the UNIBS 2009 data set was not created for intrusion detection. Since UNIBS 2009 is referenced in other work, it is still added to the list. Gringoli et al. [72] used the data set to identify applications (e.g. web browsers, Skype or mail clients) based on their flow-based network traffic. UNIBS 2009 contains around 79,000 flows without malicious behavior. Since the labels just describe the application protocols of the flows, network traffic is not categorized as normal or attack. Consequently, the property label in the categorization scheme is set to no. The data set is publicly available34.

Unified Host and Network Data Set [73]. This data set contains host and network-based data which were captured within a real environment, the LANL (Los Alamos National Laboratory) enterprise network. For privacy reasons, attributes like IP addresses and timestamps were anonymized in the bidirectional flow-based network traffic files. The network traffic was collected for a period of 90 days and has no labels. The data set is publicly available35.

UNSW-NB15 [20]. The UNSW-NB15 data set encompasses normal and malicious network traffic in packet-based format which was created using the IXIA Perfect Storm tool in a small emulated environment over 31 hours. It contains nine different families of attacks like backdoors, DoS, exploits, fuzzers, or worms. The data set is also available in flow-based format with additional attributes. UNSW-NB15 comes along with predefined splits for training and test. The data set includes 45 distinct IP addresses and is publicly available36.

VI. OTHER DATA SOURCES

Besides network-based data sets, there are some other data sources for packet-based and flow-based network traffic. In the following, we shortly discuss data repositories and traffic generators.

A. Data Repositories

Besides traditional data sets, several data repositories can be found on the internet. Since the type and structure of those repositories differ greatly, we abstain from a tabular comparison. Instead, we give a brief textual overview in alphabetical order. Repositories have been checked on 26 February 2019 with respect to actuality.

AZSecure37. AZSecure is a repository of network data at the University of Arizona for use by the research community. It includes various data sets in pcap, arff and other formats, some of which are labeled while others are not. AZSecure encompasses, among others, the CTU-13 data set [3] or the Unified Host and Network Data Set [73]. The repository is managed and contains some recent data sets.

CAIDA38. CAIDA collects different types of data sets, with varying degree of availability (public access or on request), and provides a search engine. Generally, a form needs to be filled out to gain access to some of the public data sets. Additionally, most network-based data sets can exclusively be requested through an IMPACT (see below) login since CAIDA supports IMPACT as Data Provider. The repository is managed and updated with new data.

Contagiodump39. Contagiodump is a blog about malware dumps. There are several posts each year and the last post was on 20th March 2018. The website contains, among other things, a collection of pcap files from malware analysis.

covert.io40. Covert.io is a blog about security and machine learning by Jason Trost. The blog maintains different lists of tutorials, GitHub repositories, research papers and other blogs concerning security, big data, and machine learning, but also a collection of various security-based data resources41. The latest entry was posted on August 14, 2017 by Jason Trost.

DEF CON CTF Archive42. DEF CON is a popular annual hacker convention. The event includes a capture the flag (CTF) competition where every team has to defend their own network against the other teams whilst simultaneously hacking the opponents' networks. The competition is typically recorded and available in packet-based format on the website. Given the nature of the competition, the recorded data almost exclusively contain attack traffic and little normal user behavior. The website is current and updated annually with new data from the CTF competitions.

IMPACT43. IMPACT Cyber Trust, formerly known as PREDICT, is a community of data providers, cyber security researchers as well as coordinators. IMPACT is administrated and up-to-date. A data catalog is given on the site to browse the data sets provided by the community. The data providers are (among others) DARPA, the MIT Lincoln Laboratory, or the UCSD - Center for Applied Internet Data Analysis (CAIDA). However, the data sets can only be downloaded with an account that may be requested exclusively by researchers from eight selected countries approved by the US Department of Homeland Security. As Germany is not among the approved locations, no further statements about the data sets can be made.

Internet Traffic Archive44. The Internet Traffic Archive is a repository of internet traffic traces sponsored by ACM SIGCOMM. The list includes four extensively anonymized packet-based traces. In particular, the payload has been removed, all timestamps are relative to the first packet, and IP addresses have been changed to numerical representations. The packet-based data sets were captured more than 20 years ago and can be downloaded without restriction.

33 https://nesg.ugr.es/nesg-ugr16/index.php
34 http://netweb.ing.unibs.it/~ntw/tools/traces/
35 https://csr.lanl.gov/data/2017.html
36 https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/
37 https://www.azsecure-data.org/other-data.html
38 http://www.caida.org/data/overview/
39 http://contagiodump.blogspot.com/
40 http://www.covert.io
41 http://www.covert.io/data-links/
42 https://www.defcon.org/html/links/dc-ctf.html
43 https://www.impactcybertrust.org/
44 http://ita.ee.lbl.gov/html/traces.html
45 https://www.kaggle.com/

Kaggle45. Kaggle is an online platform for sharing and
publishing data sets. The platform contains security-based data sets like KDD CUP 99 and has a search function. It also allows registered users to upload data and explore data analysis models.

Malware Traffic Analysis46. Malware Traffic Analysis is a repository which contains blog posts and exercises related to network traffic analysis, e.g. identifying malicious activities. Exercises come along with packet-based network traffic which is indirectly labeled through the provided answers to the exercises. Downloadable files are secured with a password which can be obtained from the website. The repository is recent and new blog posts are issued almost daily.

Mid-Atlantic CCDC47. Similar to DEFCON CTF, MACCDC is an annual competition hosted by the US National CyberWatch Center where the captured packet-based traffic of the competitions is made available. Teams have to ensure that services provided by their network are not interrupted in any way. Similar to the DEFCON CTF archives, MACCDC data contain almost exclusively attack traffic and little normal user behavior. The latest competition took place in 2018.

MAWILab48. The MAWILab repository contains a huge amount of network traffic over a long time which is captured at a link between the USA and Japan. For each day since 2007, the repository contains a 15 minute trace in packet-based format. For privacy reasons, IP addresses are anonymized and packet payloads are omitted. The captured network traffic is labeled using different anomaly detection methods [77].

MWS49. The anti malware engineering workshop (MWS) is an annual workshop about malware in Japan. The workshop comes along with several MWS data sets which contain packet-based network data as well as host-based log files. However, the data sets are only shared within the MWS community which consists of researchers in industry and academia in Japan [78]. The latest workshop took place in 2018.

NETRECSEC50. NETRECSEC maintains a comprehensive list of publicly available pcap files on the internet. Similar to SecRepo, NETRECSEC refers to many repositories mentioned in this work, but also incorporates additional sources like honeypot dumps or CTF events. Its up-to-dateness can only be judged indirectly as NETRECSEC also refers to data traces from the year 2018.

OpenML51. OpenML is an up-to-date platform for sharing machine learning data sets. It also contains security-based data sets like KDD CUP 99. The platform has a search function and comes along with other possibilities like creating scientific tasks.

RIPE Data Repository52. The RIPE data repository hosts a number of data sets. Yet, no new data sets have been included for several years. To obtain access, users need to create an account and accept the terms and conditions of the data sets. The repository also mirrors some data available from the Waikato Internet Traffic Storage (see below).

SecRepo53. SecRepo lists different samples of security related data and is maintained by Mike Sconzo. The list is divided into the following categories: Network, Malware, System, File, Password, Threat Feeds and Other. The very detailed list contains references to typical data sets like DARPA, but also to many repositories (e.g. NETRECSEC). The website was last updated on November 20, 2018.

Simple Web54. Simple Web provides a database collection and information on network management tutorials and software. The repository includes traces in different formats like packet- or flow-based network traffic. It is hosted by the University of Twente, maintained by members of the DACS (Design and Analysis of Communication Systems) group, and updated with new results from this group.

UMassTraceRepository55. UMassTraceRepository provides the research community with several traces of network traffic. Some of these traces have been collected by the suppliers of the archive themselves while others have been donated. The archive includes 19 packet-based data sets from different sources. The most recent data sets were captured in 2018.

VAST Challenge56. The IEEE Visual Analytics Science and Technology (VAST) challenge is an annual contest with the goal of advancing the field of visual analytics through competition. In some challenges, network traffic data were provided for contest tasks. For instance, the second mini challenge of the VAST 2011 competition involved an IDS log consisting of packet-based network traffic in pcap format. A similar setup was used in a follow-up VAST challenge in 2012. Furthermore, a VAST challenge in 2013 dealt with flow-based network traffic.

WITS: Waikato Internet Traffic Storage57. This website aims to list all internet traces possessed by the WAND research group. The data sets are typically available in packet-based format and free to download from the Waikato servers. However, the repository has not been updated for a long time.

46 http://malware-traffic-analysis.net/
47 http://maccdc.org/
48 http://www.fukuda-lab.org/mawilab/
49 https://www.iwsec.org/mws/2018/en.html
50 http://www.netresec.com/?page=PcapFiles
51 https://www.openml.org/home
52 https://labs.ripe.net/datarepository
53 http://www.secrepo.com/
54 https://www.simpleweb.org/wiki/index.php/
55 http://traces.cs.umass.edu/
56 http://vacommunity.org/tiki-index.php
57 https://wand.net.nz/wits/catalogue.php

B. Traffic Generators

Traffic generators are another source of network traffic for intrusion detection research. Traffic generators are models which create synthetic network traffic. In most cases, traffic generators use user-defined parameters or extract basic properties of real network traffic to create new synthetic network traffic. While data sets and data repositories provide fixed data, traffic generators allow the generation of network traffic which can be adapted to certain network structures.

For instance, the traffic generators FLAME [79] and ID2T [80] use real network traffic as input. This input traffic should serve as a baseline for normal user behavior. Then, FLAME and ID2T add malicious network traffic by editing
values of input traffic or by injecting synthetic flows under consideration of typical attack patterns. Siska et al. [81] present a graph-based flow generator which extracts traffic templates from real network traffic. Then, their generator uses these traffic templates in order to create new synthetic flow-based network traffic. Ring et al. [82] adapted GANs for generating synthetic network traffic. The authors use Improved Wasserstein Generative Adversarial Networks (WGAN-GP) to create flow-based network traffic. The WGAN-GP is trained with real network traffic and learns traffic characteristics. After training, the WGAN-GP is able to create new synthetic flow-based network traffic with similar characteristics. Erlacher and Dressler's traffic generator GENESIDS [83] generates HTTP attack traffic based on user-defined attack descriptions. There are many additional traffic generators which are not discussed here for the sake of brevity; instead, we refer to Molnár et al. [84] for an overview of traffic generators.
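To make the flow-injection idea more concrete, the following sketch mixes synthetic attack flows into a baseline of real flows. It is only a toy illustration, not the actual FLAME or ID2T implementation; the input file baseline_flows.csv and its column names are assumptions made for this example.

import csv
import random
from datetime import datetime, timedelta

# Columns assumed for the hypothetical baseline file "baseline_flows.csv".
COLUMNS = ["timestamp", "duration", "src_ip", "src_port",
           "dst_ip", "dst_port", "protocol", "packets", "bytes", "label"]

def synthetic_port_scan(attacker, victim, start, n_flows=200):
    """Create flow records that mimic a simple sequential TCP port scan."""
    flows = []
    for i in range(n_flows):
        ts = start + timedelta(milliseconds=50 * i)
        flows.append({
            "timestamp": ts.isoformat(),
            "duration": 0.001,
            "src_ip": attacker,
            "src_port": random.randint(1024, 65535),
            "dst_ip": victim,
            "dst_port": i + 1,        # one flow per scanned destination port
            "protocol": "TCP",
            "packets": 1,
            "bytes": 40,              # roughly a single SYN packet
            "label": "portscan",
        })
    return flows

# Read the baseline of real flows (assumed to use exactly the columns above).
with open("baseline_flows.csv", newline="") as f:
    baseline = list(csv.DictReader(f))
for row in baseline:
    row.setdefault("label", "normal")

# Inject the synthetic attack and sort by (ISO formatted) timestamp.
mixed = baseline + synthetic_port_scan(
    attacker="192.168.220.99", victim="192.168.100.5",
    start=datetime(2019, 3, 1, 12, 0, 0))
mixed.sort(key=lambda r: r["timestamp"])

with open("mixed_flows.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(mixed)

Real injection tools additionally adapt packet- or flow-level details (e.g. TTLs, inter-arrival times) to the baseline traffic; the sketch only shows the basic mixing and labeling step.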
Brogi et al. [85] come up with another idea that in some sense resembles traffic generators. Starting out from the problem of sharing data sets due to privacy concerns, they present Moirai, a framework which allows users to share complete scenarios instead of data sets. The idea behind Moirai is to replay attack scenarios in virtual machines such that users can generate data on the fly.
A third approach, which is also categorized into the larger context of traffic generators, is the use of frameworks which support users in labeling real network traffic. Rajasinghe et al. present such a framework called INSecS-DCS [86] which captures network traffic at network devices or uses already captured network traffic in pcap files as input. Then, INSecS-DCS divides the data stream into time windows, extracts data points with appropriate attributes, and labels the network traffic based on a user-defined attacker IP address list. Consequently, the focus of INSecS-DCS is on labeling network traffic and on extracting meaningful attributes.
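The core labeling step of such a framework can be sketched in a few lines. This is not the original INSecS-DCS code; the attacker addresses, attack time windows and field names are hypothetical and only illustrate how flow records can be labeled against a user-defined attacker IP address list.

from datetime import datetime

# Hypothetical attacker list and attack time windows (illustration only).
ATTACKER_IPS = {"192.168.220.99", "10.0.0.66"}
ATTACK_WINDOWS = [
    (datetime(2019, 3, 1, 12, 0), datetime(2019, 3, 1, 12, 30)),
]

def in_attack_window(ts):
    return any(start <= ts <= end for start, end in ATTACK_WINDOWS)

def label_flow(flow):
    """Label a single flow record as 'attack' or 'normal'."""
    ts = datetime.fromisoformat(flow["timestamp"])
    involves_attacker = (flow["src_ip"] in ATTACKER_IPS
                         or flow["dst_ip"] in ATTACKER_IPS)
    return "attack" if involves_attacker and in_attack_window(ts) else "normal"

flows = [
    {"timestamp": "2019-03-01T12:05:00",
     "src_ip": "192.168.220.99", "dst_ip": "192.168.100.5"},
    {"timestamp": "2019-03-01T14:00:00",
     "src_ip": "192.168.100.7", "dst_ip": "192.168.100.5"},
]
for flow in flows:
    flow["label"] = label_flow(flow)
    print(flow["src_ip"], "->", flow["dst_ip"], flow["label"])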
Aparicio-Navarro et al. [87] present an automatic data set labeling approach using an unsupervised anomaly-based IDS. Since no IDS is able to classify each data point to the correct class, the authors take some middle ground to reduce the number of false positives and false negatives. The IDS assigns belief values to each data point for the classes normal and attack. If the difference between the belief values for these two classes is smaller than a predefined threshold, the data point is removed from the data set. This approach increases the quality of the labels, but may discard the most interesting data points of the data set.
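The described filtering can be sketched as follows; the threshold and belief values are made up for illustration and do not stem from [87].

THRESHOLD = 0.2  # assumed minimum difference between the two belief values

def filter_and_label(points):
    """Discard ambiguous points, label the remaining ones."""
    labeled = []
    for p in points:
        diff = abs(p["belief_normal"] - p["belief_attack"])
        if diff < THRESHOLD:
            continue  # belief values too close: point is removed
        p["label"] = ("normal" if p["belief_normal"] > p["belief_attack"]
                      else "attack")
        labeled.append(p)
    return labeled

points = [
    {"id": 1, "belief_normal": 0.9, "belief_attack": 0.1},
    {"id": 2, "belief_normal": 0.55, "belief_attack": 0.45},  # discarded
    {"id": 3, "belief_normal": 0.2, "belief_attack": 0.8},
]
print(filter_and_label(points))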
VII. OBSERVATIONS AND RECOMMENDATIONS

Labeled data sets are inevitable for training supervised data mining methods like classification algorithms and helpful for the evaluation of supervised as well as unsupervised data mining methods. Consequently, labeled network-based data sets can be used to compare the quality of different NIDS with each other. In any case, however, the data sets must be representative to be suitable for those tasks. The community is aware of the importance of realistic network-based data, and this survey shows that there are many sources for such data (data sets, data repositories, and traffic generators). Furthermore, this work establishes a collection of data set properties as a basis for comparing available data sets and for identifying suitable data sets, given specific evaluation scenarios.

In the following, we discuss some aspects concerning the use of available data sets and the creation of new data sets.

Perfect data set: The ever-increasing number of attack scenarios, accompanied by new and more complex software and network structures, leads to the requirement that data sets should contain up-to-date and real network traffic. Since there is no perfect IDS, labeling of data points should be checked manually rather than being done exclusively by an IDS. Consequently, the perfect network-based data set is up-to-date, correctly labeled, publicly available, contains real network traffic with all kinds of attacks and normal user behavior as well as payload, and spans a long time. Such a data set, however, does not exist and will (probably) never be created. Even if privacy concerns could be satisfied and real-world network traffic (in packet-based format) with all kinds of attacks could be recorded over a sufficiently long time, accurate labeling of such traffic would be very time-consuming. As a consequence, the labeling process would take so much time that the data set would already be slightly outdated, since new attack scenarios appear continuously. However, several available data sets satisfy some properties of a perfect data set. Besides, most applications do not require a perfect data set; a data set which satisfies certain properties is often sufficient. For instance, a data set does not need to contain all types of attacks when evaluating a new port scan detection algorithm, and there is no need for a complete network configuration when evaluating the security of a specific server. Therefore, we hope that this work supports researchers in finding the appropriate data set for their specific evaluation scenario.

Use of several data sets: As mentioned above, no perfect network-based data set exists. However, this survey shows that there are several data sets (and other data sources) available for packet- and flow-based network traffic. Therefore, we recommend that users evaluate their intrusion detection methods with more than one data set in order to avoid over-fitting to a certain data set, reduce the influence of artificial artifacts of a certain data set, and evaluate their methods in a more general context. In addition, Hofstede et al. [88] show that flow-based network traffic differs between lab environments and production networks. Therefore, another approach could be to use both emulated or synthetic data sets and real-world network traffic to account for these differences.

In order to ensure reproducibility for third parties, we recommend evaluating intrusion detection methods with at least one publicly available data set.
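As a simple illustration of these recommendations, the following sketch evaluates one and the same classifier on several data sets rather than on a single one. The file names and the assumption of a numeric feature matrix with a binary label column are placeholders; any of the publicly available data sets from Section V could be used after suitable preprocessing. For brevity, a random split is used here; more appropriate splitting strategies are discussed below under Predefined Subsets.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder file names; each CSV is assumed to contain numeric flow
# features plus a binary "label" column (0 = normal, 1 = attack).
DATA_SETS = ["cidds001_flows.csv", "ugr16_flows.csv", "unsw_nb15_flows.csv"]

for path in DATA_SETS:
    df = pd.read_csv(path)
    X = df.drop(columns=["label"])
    y = df["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    print(path, "F1 =", round(f1_score(y_test, clf.predict(X_test)), 3))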
Further, we would like to give a general recommendation for the use of the CICIDS 2017, CIDDS-001, UGR'16 and UNSW-NB15 data sets. These data sets may be suitable for general evaluation settings. CICIDS 2017 and UNSW-NB15 contain a wide range of attack scenarios. CIDDS-001 contains detailed metadata for deeper investigations. UGR'16 stands out due to its huge number of flows. However, it should be considered that this recommendation reflects our personal views. The recommendation does not imply that other data sets are inappropriate. For instance, we refrain from including the more widespread CTU-13 and ISCX 2012 data sets in our recommendation only due to their increasing age. Further, other data sets like AWID or Botnet are better suited for certain evaluation scenarios.

Predefined Subsets: Furthermore, we want to make a note on the evaluation of anomaly-based NIDS. Machine learning and data mining methods often use so-called 10-fold cross-validation [89]. This method divides the data set into ten equal-sized subsets. One subset is used for testing and the other nine subsets are used for training. This procedure is repeated ten times, such that every subset has been used once for testing. However, this straightforward splitting of data sets makes only limited sense for intrusion detection. For instance, the port scan data set CIDDS-002 [27] contains two weeks of network traffic in flow-based format. Each port scan within this data set may cause thousands of flows. Using 10-fold cross-validation would lead to the situation that probably some flows of each attack appear in the training data set. Thus, attack detection in test data is facilitated and generalization is not properly evaluated.

In that scenario, it would be better to train on week1 and test on week2 (and vice versa) for the CIDDS-002 data set. Defining subsets in this way may also account for the impact of concept drift in network traffic over time. Another approach for creating suitable subsets might be to split the whole data set based on traffic characteristics like source IP addresses. However, such subsets must be well designed to preserve the basic network structures of the data set. For instance, a training data set with exclusively source IP addresses which represent clients and no servers would be inappropriate.

Based on these observations, we recommend creating meaningful training and test splits with respect to the application domain of IT security. Therefore, benchmark data sets should be published with predefined splits for training and test to facilitate comparisons of different approaches evaluated on the same data.
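A minimal sketch of such a predefined, time-based split is given below. The file name and column names are assumptions for illustration; the idea is simply to train on the first week and test on the second week (or vice versa) instead of drawing random folds.

import pandas as pd

# Assumed flow file with "timestamp" and "src_ip" columns (illustrative only).
df = pd.read_csv("cidds002_flows.csv", parse_dates=["timestamp"])

# Split by time instead of randomly: week 1 for training, week 2 for testing,
# so that the flows of a single attack cannot leak into both subsets.
split_point = df["timestamp"].min() + pd.Timedelta(days=7)
train = df[df["timestamp"] < split_point]
test = df[df["timestamp"] >= split_point]

# Alternative: split by traffic characteristics such as source IP addresses,
# taking care that clients and servers are represented in both subsets, e.g.:
# train = df[df["src_ip"].isin(train_ips)]
# test = df[~df["src_ip"].isin(train_ips)]
print(len(train), "training flows,", len(test), "test flows")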
Closer collaboration: This study shows (see Section V) that many data sets have been published in the last few years and the community works on creating new intrusion detection data sets continuously. Further, the community could benefit from closer collaboration and a single generally accepted platform for sharing intrusion detection data sets without any access restrictions. For instance, Cermak et al. [90] work on establishing such a platform for sharing intrusion detection data sets. Likewise, Ring et al. [21] published their scripts for emulating normal user behavior and attacks such that they can be used and improved by third parties. A short summary of all mentioned data sets and data repositories can be found at our website58 and we intend to update this website with upcoming network-based data sets.
Standard formats: Most network-based intrusion detection approaches require standard input data formats and cannot handle preprocessed data. Further, it is questionable if data sets from category other (Section III-C) can be calculated in real time, which may affect their usefulness in NIDS. Therefore, we suggest providing network-based data sets in standard packet-based or flow-based formats as they are captured in real network environments. Simultaneously, many anomaly-based approaches (e.g., [91] or [92]) achieve high detection rates in data sets from the category other, which is an indicator that the calculated attributes are promising for intrusion detection. Therefore, we recommend publishing both the network-based data sets in a standard format and the scripts for transforming the data sets to other formats. Such an approach would have two advantages. First, users may decide if they want to transfer data sets to other formats and a larger number of researchers could use the corresponding data sets. Second, the scripts could also be applied to future data sets.

Anonymization: Anonymization is another important issue since it may complicate the analysis of network-based data sets. Therefore, it should be carefully evaluated which attributes have to be discarded and which attributes may be published in anonymized form. Various authors demonstrate the effectiveness of using only small parts of payload. For example, Mahoney [93] proposes an intrusion detection method which uses the first 48 bytes of each packet starting with the IP header. The flow exporter YAF [94] allows the creation of such attributes by extracting the first n bytes of payload or by calculating the entropy of the payload. Generally, there are several methods for anonymization. For example, Xu et al. [95] propose a prefix-preserving IP address anonymization technique. Tcpmkpub [96] is an anonymization tool for packet-based network traffic which allows the anonymization of some attributes like IP addresses and also computes new values for header checksums. We refer to Kelly et al. [97] for a more comprehensive review of anonymization techniques for network-based data.

Publication: We recommend the publication of network-based data sets. Only publicly available data sets can be used by third parties and thus serve as a basis for evaluating NIDS. Likewise, the quality of data sets can only be checked by third parties if they are publicly available. Last but not least, we recommend the publication of additional metadata such that third parties are able to analyze the data and their results in more detail.

58 http://www.dmir.uni-wuerzburg.de/datasets/nids-ds

VIII. SUMMARY

Labeled network-based data sets are necessary for training and evaluating NIDS. This paper provides a literature survey of available network-based intrusion detection data sets. To this end, standard network-based data formats are analyzed in more detail. Further, 15 properties are identified that may be used to assess the suitability of data sets. These properties are grouped into five categories: General Information, Nature of the Data, Data Volume, Recording Environment and Evaluation.

The paper's main contribution is a comprehensive overview of 34 data sets which points out the peculiarities of each data set. Thereby, a particular focus was placed on attack scenarios within the data sets and their interrelationships. In addition, each data set is assessed with respect to the properties of the categorization scheme developed in the first step.
This detailed investigation aims to support readers in identifying data sets for their purposes. The review of data sets shows that the research community has noticed a lack of publicly available network-based data sets and tries to overcome this shortage by publishing a considerable number of data sets over the last few years. Since several research groups are active in this area, additional intrusion detection data sets and improvements can be expected soon.

As further sources for network traffic, traffic generators and data repositories are discussed in Section VI. Traffic generators create synthetic network traffic and can be used to create adapted network traffic for specific scenarios. Data repositories are collections of different network traces on the internet. Compared to the data sets in Section V, data repositories often provide limited documentation, non-labeled data sets or network traffic of specific scenarios (e.g., exclusively FTP connections). However, these data sources should be taken into account when searching for suitable data, especially for specialized scenarios. Finally, we discussed some observations and recommendations for the use and generation of network-based intrusion detection data sets. We encourage users to evaluate their methods on several data sets to avoid over-fitting to a certain data set and to reduce the influence of its artificial artifacts. Further, we advocate data sets in standard formats including predefined training and test subsets. Overall, there will probably never be a perfect data set, but there are many very good data sets available and the community could benefit from closer collaboration.

ACKNOWLEDGMENT

M.R. and S.W. were supported by the BayWISS Consortium Digitisation. S.W. is additionally funded by the Bavarian State Ministry of Science and Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).

REFERENCES

[1] V. Chandola, E. Eilertson, L. Ertoz, G. Simon, V. Kumar, Data Mining for Cyber Security, in: A. Singhal (Ed.), Data Warehousing and Data Mining Techniques for Computer Security, 1st Edition, Springer, 2006, pp. 83–107.
[2] M. Rehák, M. Pechoucek, K. Bartos, M. Grill, P. Celeda, V. Krmicek, CAMNEP: An intrusion detection system for high-speed networks, Progress in Informatics 5 (5) (2008) 65–74. doi:10.2201/NiiPi.2008.5.7.
[3] S. Garcia, M. Grill, J. Stiborek, A. Zunino, An empirical comparison of botnet detection methods, Computers & Security 45 (2014) 100–123. doi:10.1016/j.cose.2014.05.011.
[4] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, A Toolset for Intrusion and Insider Threat Detection, in: I. Palomares, H. Kalutarage, Y. Huang (Eds.), Data Analytics and Decision Support for Cybersecurity: Trends, Methodologies and Applications, Springer, 2017, pp. 3–31. doi:10.1007/978-3-319-59439-2_1.
[5] E. B. Beigi, H. H. Jazi, N. Stakhanova, A. A. Ghorbani, Towards Effective Feature Selection in Machine Learning-Based Botnet Detection Approaches, in: IEEE Conference on Communications and Network Security, IEEE, 2014, pp. 247–255. doi:10.1109/CNS.2014.6997492.
[6] M. Stevanovic, J. M. Pedersen, An analysis of network traffic classification for botnet detection, in: IEEE International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), IEEE, 2015, pp. 1–8. doi:10.1109/CyberSA.2015.7361120.
[7] J. Wang, I. C. Paschalidis, Botnet Detection Based on Anomaly and Community Detection, IEEE Transactions on Control of Network Systems 4 (2) (2017) 392–404. doi:10.1109/TCNS.2016.2532804.
[8] C. Yin, Y. Zhu, S. Liu, J. Fei, H. Zhang, An Enhancing Framework for Botnet Detection Using Generative Adversarial Networks, in: International Conference on Artificial Intelligence and Big Data (ICAIBD), 2018, pp. 228–234. doi:10.1109/ICAIBD.2018.8396200.
[9] S. Staniford, J. A. Hoagland, J. M. McAlerney, Practical automated detection of stealthy portscans, Journal of Computer Security 10 (1-2) (2002) 105–136.
[10] J. Jung, V. Paxson, A. W. Berger, H. Balakrishnan, Fast Portscan Detection Using Sequential Hypothesis Testing, in: IEEE Symposium on Security & Privacy, IEEE, 2004, pp. 211–225. doi:10.1109/SECPRI.2004.1301325.
[11] A. Sridharan, T. Ye, S. Bhattacharyya, Connectionless Port Scan Detection on the Backbone, in: IEEE International Performance Computing and Communications Conference, IEEE, 2006, pp. 10–19. doi:10.1109/.2006.1629454.
[12] M. Ring, D. Landes, A. Hotho, Detection of slow port scans in flow-based network traffic, PLOS ONE 13 (9) (2018) 1–18. doi:10.1371/journal.pone.0204507.
[13] A. Sperotto, R. Sadre, P.-T. de Boer, A. Pras, Hidden Markov Model modeling of SSH brute-force attacks, in: International Workshop on Distributed Systems: Operations and Management, Springer, 2009, pp. 164–176. doi:10.1007/978-3-642-04989-7_13.
[14] L. Hellemons, L. Hendriks, R. Hofstede, A. Sperotto, R. Sadre, A. Pras, SSHCure: A Flow-Based SSH Intrusion Detection System, in: International Conference on Autonomous Infrastructure, Management and Security (IFIP), Springer, 2012, pp. 86–97. doi:10.1007/978-3-642-30633-4_11.
[15] M. Javed, V. Paxson, Detecting Stealthy, Distributed SSH Brute-Forcing, in: ACM SIGSAC Conference on Computer & Communications Security, ACM, 2013, pp. 85–96. doi:10.1145/2508859.2516719.
[16] M. M. Najafabadi, T. M. Khoshgoftaar, C. Kemp, N. Seliya, R. Zuech, Machine Learning for Detecting Brute Force Attacks at the Network Level, in: International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2014, pp. 379–385. doi:10.1109/BIBE.2014.73.
[17] R. Sommer, V. Paxson, Outside the Closed World: On Using Machine Learning For Network Intrusion Detection, in: IEEE Symposium on Security and Privacy, IEEE, 2010, pp. 305–316. doi:10.1109/SP.2010.25.
[18] M. Małowidzki, P. Berezinski, M. Mazur, Network Intrusion Detection: Half a Kingdom for a Good Dataset, in: NATO STO SAS-139 Workshop, Portugal, 2015.
[19] W. Haider, J. Hu, J. Slay, B. Turnbull, Y. Xie, Generating realistic intrusion detection system dataset based on fuzzy qualitative modeling, Journal of Network and Computer Applications 87 (2017) 185–192. doi:10.1016/j.jnca.2017.03.018.
[20] N. Moustafa, J. Slay, UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems, in: Military Communications and Information Systems Conference (MilCIS), IEEE, 2015, pp. 1–6. doi:10.1109/MilCIS.2015.7348942.
[21] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, Flow-based benchmark data sets for intrusion detection, in: European Conference on Cyber Warfare and Security (ECCWS), ACPI, 2017, pp. 361–369.
[22] I. Sharafaldin, A. H. Lashkari, A. A. Ghorbani, Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization, in: International Conference on Information Systems Security and Privacy (ICISSP), 2018, pp. 108–116. doi:10.5220/0006639801080116.
[23] G. Creech, J. Hu, Generation of a New IDS Test Dataset: Time to Retire the KDD Collection, in: IEEE Wireless Communications and Networking Conference (WCNC), IEEE, 2013, pp. 4487–4492. doi:10.1109/WCNC.2013.6555301.
[24] T. R. Glass-Vanderlan, M. D. Iannacone, M. S. Vincent, R. A. Bridges, et al., A Survey of Intrusion Detection Systems Leveraging Host Data, arXiv preprint arXiv:1805.06070.
[25] R. Koch, M. Golling, G. D. Rodosek, Towards Comparability of Intrusion Detection Systems: New Data Sets, in: TERENA Networking Conference, Vol. 7, 2014.
[26] J. O. Nehinbe, A critical evaluation of datasets for investigating IDSs and IPSs Researches, in: IEEE International Conference on Cybernetic Intelligent Systems (CIS), IEEE, 2011, pp. 92–97. doi:10.1109/CIS.2011.6169141.
[27] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, Creation of Flow-Based Data Sets for Intrusion Detection, Journal of Information Warfare 16 (2017) 40–53.
[28] A. Shiravi, H. Shiravi, M. Tavallaee, A. A. Ghorbani, Toward developing a systematic approach to generate benchmark datasets for
intrusion detection, Computers & Security 31 (3) (2012) 357–374. doi:10.1016/j.cose.2011.12.012.
[29] G. Maciá-Fernández, J. Camacho, R. Magán-Carrión, P. García-Teodoro, R. Therón, UGR'16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs, Computers & Security 73 (2018) 411–424. doi:10.1016/j.cose.2017.11.004.
[30] I. Sharafaldin, A. Gharib, A. H. Lashkari, A. A. Ghorbani, Towards a Reliable Intrusion Detection Benchmark Dataset, Software Networking 2018 (1) (2018) 177–200. doi:10.13052/jsn2445-9739.2017.009.
[31] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Network Anomaly Detection: Methods, Systems and Tools, IEEE Communications Surveys & Tutorials 16 (1) (2014) 303–336. doi:10.1109/SURV.2013.052213.00046.
[32] A. Nisioti, A. Mylonas, P. D. Yoo, V. Katos, From Intrusion Detection to Attacker Attribution: A Comprehensive Survey of Unsupervised Methods, IEEE Communications Surveys Tutorials 20 (4) (2018) 3369–3388. doi:10.1109/COMST.2018.2854724.
[33] O. Yavanoglu, M. Aydos, A Review on Cyber Security Datasets for Machine Learning Algorithms, in: IEEE International Conference on Big Data, IEEE, 2017, pp. 2186–2193. doi:10.1109/BigData.2017.8258167.
[34] C. T. Giménez, A. P. Villegas, G. Á. Marañón, HTTP data set CSIC 2010, (Date last accessed 22-June-2018) (2010).
[35] C. Wheelus, T. M. Khoshgoftaar, R. Zuech, M. M. Najafabadi, A Session Based Approach for Aggregating Network Traffic Data - The SANTA Dataset, in: IEEE International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2014, pp. 369–378. doi:10.1109/BIBE.2014.72.
[36] A. S. Tanenbaum, D. Wetherall, Computer Networks, 5th Edition, Pearson, 2011.
[37] B. Claise, Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information, RFC 5101 (2008).
[38] B. Claise, Cisco Systems NetFlow Services Export Version 9, RFC 3954 (2004).
[39] P. Phaal, sFlow Specification Version 5 (2004). URL https://sflow.org/sflow_version_5.txt
[40] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, J. Turner, OpenFlow: Enabling Innovation in Campus Networks, ACM SIGCOMM Computer Communication Review 38 (2) (2008) 69–74. doi:10.1145/1355734.1355746.
[41] F. Haddadi, A. N. Zincir-Heywood, Benchmarking the Effect of Flow Exporters and Protocol Filters on Botnet Traffic Classification, IEEE Systems Journal 10 (4) (2016) 1390–1401. doi:10.1109/JSYST.2014.2364743.
[42] S. Stolfo, (Date last accessed 22-June-2018). URL http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[43] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, et al., The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3. doi:10.1038/sdata.2016.18.
[44] A. J. Aviv, A. Haeberlen, Challenges in Experimenting with Botnet Detection Systems, in: Conference on Cyber Security Experimentation and Test (CEST), USENIX Association, Berkeley, CA, USA, 2011.
[45] Z. B. Celik, J. Raghuram, G. Kesidis, D. J. Miller, Salting Public Traces with Attack Traffic to Test Flow Classifiers, in: Workshop on Cyber Security Experimentation and Test (CSET), 2011.
[46] H. He, E. A. Garcia, Learning from Imbalanced Data, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1263–1284. doi:10.1109/TKDE.2008.239.
[47] A. R. Vasudevan, E. Harshini, S. Selvakumar, SSENet-2011: A Network Intrusion Detection System dataset and its comparison with KDD CUP 99 dataset, in: Second Asian Himalayas International Conference on Internet (AH-ICI), 2011, pp. 1–5. doi:10.1109/AHICI.2011.6113948.
[48] S. Anwar, J. Mohamad Zain, M. F. Zolkipli, Z. Inayat, S. Khan, B. Anthony, V. Chang, From Intrusion Detection to an Intrusion Response System: Fundamentals, Requirements, and Future Directions, Algorithms 10 (2) (2017) 39.
[49] C. Kolias, G. Kambourakis, A. Stavrou, S. Gritzalis, Intrusion Detection in 802.11 Networks: Empirical Evaluation of Threats and a Public Dataset, IEEE Communications Surveys Tutorials 18 (1) (2016) 184–208. doi:10.1109/COMST.2015.2402161.
[50] J. J. Santanna, R. van Rijswijk-Deij, R. Hofstede, A. Sperotto, M. Wierbosch, L. Z. Granville, A. Pras, Booters - An analysis of DDoS-as-a-service attacks, in: IFIP/IEEE International Symposium on Integrated Network Management (IM), 2015, pp. 243–251. doi:10.1109/INM.2015.7140298.
[51] H. H. Jazi, H. Gonzalez, N. Stakhanova, A. A. Ghorbani, Detecting HTTP-based application layer DoS attacks on web servers in the presence of sampling, Computer Networks 121 (2017) 25–36. doi:10.1016/j.comnet.2017.03.018.
[52] B. Sangster, T. O'Connor, T. Cook, R. Fanelli, E. Dean, C. Morrell, G. J. Conti, Toward Instrumenting Network Warfare Competitions to Generate Labeled Datasets, in: Workshop on Cyber Security Experimentation and Test (CSET), 2009.
[53] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, et al., Evaluating Intrusion Detection Systems: The 1998 DARPA Off-line Intrusion Detection Evaluation, in: DARPA Information Survivability Conference and Exposition (DISCEX), Vol. 2, IEEE, 2000, pp. 12–26. doi:10.1109/DISCEX.2000.821506.
[54] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, K. Das, The 1999 DARPA Off-Line Intrusion Detection Evaluation, Computer Networks 34 (4) (2000) 579–595. doi:10.1016/S1389-1286(00)00139-0.
[55] M. Alkasassbeh, G. Al-Naymat, A. Hassanat, M. Almseidin, Detecting Distributed Denial of Service Attacks Using Data Mining Techniques, International Journal of Advanced Computer Science and Applications (IJACSA) 7 (1) (2016) 436–445.
[56] R. Zuech, T. M. Khoshgoftaar, N. Seliya, M. M. Najafabadi, C. Kemp, A New Intrusion Detection Benchmarking System, in: International Florida Artificial Intelligence Research Society Conference (FLAIRS), AAAI Press, 2015, pp. 252–256.
[57] S. Saad, I. Traore, A. Ghorbani, B. Sayed, D. Zhao, W. Lu, J. Felix, P. Hakimian, Detecting P2P Botnets through Network Behavior Analysis and Machine Learning, in: International Conference on Privacy, Security and Trust (PST), IEEE, 2011, pp. 174–180. doi:10.1109/PST.2011.5971980.
[58] A. D. Kent, Comprehensive, Multi-Source Cyber-Security Events, Los Alamos National Laboratory (2015). doi:10.17021/1179829.
[59] A. D. Kent, Cybersecurity Data Sources for Dynamic Network Research, in: Dynamic Networks in Cybersecurity, Imperial College Press, 2015, pp. 37–65. doi:10.1142/9781786340757_0002.
[60] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, K. Nakao, Statistical Analysis of Honeypot Data and Building of Kyoto 2006+ Dataset for NIDS Evaluation, in: Workshop on Building Analysis Datasets and Gathering Experience Returns for Security, ACM, 2011, pp. 29–36. doi:10.1145/1978672.1978676.
[61] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, B. Tierney, A First Look at Modern Enterprise Traffic, in: ACM SIGCOMM Conference on Internet Measurement (IMC), USENIX Association, Berkeley, CA, USA, 2005, pp. 15–28.
[62] F. Beer, T. Hofer, D. Karimi, U. Bühler, A new Attack Composition for Network Security, in: 10. DFN-Forum Kommunikationstechnologien, Gesellschaft für Informatik eV, 2017, pp. 11–20.
[63] M. Tavallaee, E. Bagheri, W. Lu, A. A. Ghorbani, A detailed analysis of the KDD CUP 99 data set, in: IEEE Symposium on Computational Intelligence for Security and Defense Applications, 2009, pp. 1–6. doi:10.1109/CISDA.2009.5356528.
[64] R. Singh, H. Kumar, R. Singla, A Reference Dataset for Network Traffic Activity Based Intrusion Detection System, International Journal of Computers Communications & Control 10 (3) (2015) 390–402. doi:10.15837/ijccc.2015.3.1924.
[65] R. Sharma, R. Singla, A. Guleria, A New Labeled Flow-based DNS Dataset for Anomaly Detection: PUF Dataset, Procedia Computer Science 132 (2018) 1458–1466, International Conference on Computational Intelligence and Data Science. doi:10.1016/j.procs.2018.05.079.
[66] S. Bhattacharya, S. Selvakumar, SSENet-2014 Dataset: A Dataset for Detection of Multiconnection Attacks, in: International Conference on Eco-friendly Computing and Communication Systems (ICECCS), IEEE, 2014, pp. 121–126. doi:10.1109/Eco-friendly.2014.100.
[67] R. Hofstede, L. Hendriks, A. Sperotto, A. Pras, SSH compromise detection using NetFlow/IPFIX, ACM SIGCOMM Computer Communication Review 44 (5) (2014) 20–26. doi:10.1145/2677046.2677050.
[68] E. K. Viegas, A. O. Santin, L. S. Oliveira, Toward a reliable anomaly-based intrusion detection in real-world environments, Computer Networks 127 (2017) 200–216. doi:10.1016/j.comnet.2017.08.013.
[69] P. Gogoi, M. H. Bhuyan, D. Bhattacharyya, J. K. Kalita, Packet and Flow Based Network Intrusion Dataset, in: International Conference on Contemporary Computing, Springer, 2012, pp. 322–334. doi:10.1007/978-3-642-32129-0_34.
[70] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Towards Generating Real-life Datasets for Network Intrusion Detection, International Journal of Network Security (IJNS) 17 (6) (2015) 683–701.
[71] A. Sperotto, R. Sadre, F. Van Vliet, A. Pras, A Labeled Data Set for Flow-Based Intrusion Detection, in: International Workshop on IP Operations and Management, Springer, 2009, pp. 39–50. doi:10.1007/978-3-642-04968-2_4.
[72] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, et al., GT: Picking up the Truth from the Ground for Internet Traffic, ACM SIGCOMM Computer Communication Review 39 (5) (2009) 12–18. doi:10.1145/1629607.1629610.
[73] M. J. Turcotte, A. D. Kent, C. Hash, Unified Host and Network Data Set, arXiv preprint arXiv:1708.07518.
[74] CYBER Systems and Technology, (Date last accessed 22-June-2018). URL https://www.ll.mit.edu/ideval/docs/index.html
[75] J. McHugh, Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations As Performed by Lincoln Laboratory, ACM Transactions on Information and System Security (TISSEC) 3 (4) (2000) 262–294. doi:10.1145/382912.382923.
[76] G. Szabó, D. Orincsay, S. Malomsoky, I. Szabó, On the Validation of Traffic Classification Algorithms, in: International Conference on Passive and Active Network Measurement, Springer, 2008, pp. 72–81. doi:10.1007/978-3-540-79232-1_8.
[77] R. Fontugne, P. Borgnat, P. Abry, K. Fukuda, MAWILab: Combining Diverse Anomaly Detectors for Automated Anomaly Labeling and Performance Benchmarking, in: International Conference on emerging Networking EXperiments and Technologies (CoNEXT), ACM, New York, NY, USA, 2010, pp. 8:1–8:12. doi:10.1145/1921168.1921179.
[78] M. Hatada, M. Akiyama, T. Matsuki, T. Kasama, Empowering Anti-malware Research in Japan by Sharing the MWS Datasets, Journal of Information Processing 23 (5) (2015) 579–588. doi:10.2197/ipsjjip.23.579.
[79] D. Brauckhoff, A. Wagner, M. May, FLAME: A Flow-Level Anomaly Modeling Engine, in: Workshop on Cyber Security Experimentation and Test (CSET), USENIX Association, 2008, pp. 1:1–1:6.
[80] E. Vasilomanolakis, C. G. Cordero, N. Milanov, M. Mühlhäuser, Towards the creation of synthetic, yet realistic, intrusion detection datasets, in: IEEE Network Operations and Management Symposium (NOMS), IEEE, 2016, pp. 1209–1214. doi:10.1109/NOMS.2016.7502989.
[81] P. Siska, M. P. Stoecklin, A. Kind, T. Braun, A Flow Trace Generator using Graph-based Traffic Classification Techniques, in: International Wireless Communications and Mobile Computing Conference (IWCMC), ACM, 2010, pp. 457–462. doi:10.1145/1815396.1815503.
[82] M. Ring, D. Schlör, D. Landes, A. Hotho, Flow-based Network Traffic Generation using Generative Adversarial Networks, Computers & Security 82 (2019) 156–172. doi:10.1016/j.cose.2018.12.012.
[83] F. Erlacher, F. Dressler, How to Test an IDS?: GENESIDS: An Automated System for Generating Attack Traffic, in: Workshop on Traffic Measurements for Cybersecurity (WTMC), ACM, New York, NY, USA, 2018, pp. 46–51. doi:10.1145/3229598.3229601.
[84] S. Molnár, P. Megyesi, G. Szabo, How to Validate Traffic Generators?, in: IEEE International Conference on Communications Workshops (ICC), IEEE, 2013, pp. 1340–1344. doi:10.1109/ICCW.2013.6649445.
[85] G. Brogi, V. V. T. Tong, Sharing and replaying attack scenarios with Moirai, in: RESSI 2017: Rendez-vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2017.
[86] N. Rajasinghe, J. Samarabandu, X. Wang, INSecS-DCS: A Highly Customizable Network Intrusion Dataset Creation Framework, in: Canadian Conference on Electrical & Computer Engineering (CCECE), IEEE, 2018, pp. 1–4. doi:10.1109/CCECE.2018.8447661.
[87] F. J. Aparicio-Navarro, K. G. Kyriakopoulos, D. J. Parish, Automatic Dataset Labelling and Feature Selection for Intrusion Detection Systems, in: IEEE Military Communications Conference (MILCOM), IEEE, 2014, pp. 46–51. doi:10.1109/MILCOM.2014.17.
[88] R. Hofstede, A. Pras, A. Sperotto, G. D. Rodosek, Flow-Based Compromise Detection: Lessons Learned, IEEE Security Privacy 16 (1) (2018) 82–89. doi:10.1109/MSP.2018.1331021.
[89] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2011.
[90] M. Cermak, T. Jirsik, P. Velan, J. Komarkova, S. Spacek, M. Drasar, T. Plesnik, Towards Provable Network Traffic Measurement and Analysis via Semi-Labeled Trace Datasets, in: Network Traffic Measurement and Analysis Conference (TMA), IEEE, 2018, pp. 1–8. doi:10.23919/TMA.2018.8506498.
[91] G. Wang, J. Hao, J. Ma, L. Huang, A new approach to intrusion detection using Artificial Neural Networks and fuzzy clustering, Expert Systems with Applications 37 (9) (2010) 6225–6232. doi:10.1016/j.eswa.2010.02.102.
[92] J. Zhang, M. Zulkernine, A. Haque, Random-Forests-Based Network Intrusion Detection Systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (5) (2008) 649–659. doi:10.1109/TSMCC.2008.923876.
[93] M. V. Mahoney, Network Traffic Anomaly Detection based on Packet Bytes, in: ACM Symposium on Applied Computing, ACM, 2003, pp. 346–350. doi:10.1145/952532.952601.
[94] C. M. Inacio, B. Trammell, YAF: Yet Another Flowmeter, in: Large Installation System Administration Conference, 2010, pp. 107–118.
[95] J. Xu, J. Fan, M. H. Ammar, S. B. Moon, Prefix-Preserving IP Address Anonymization: Measurement-based security evaluation and a new cryptography-based Scheme, in: IEEE International Conference on Network Protocols, IEEE, 2002, pp. 280–289. doi:10.1109/ICNP.2002.1181415.
[96] R. Pang, M. Allman, V. Paxson, J. Lee, The Devil and Packet Trace Anonymization, ACM SIGCOMM Computer Communication Review 36 (1) (2006) 29–38. doi:10.1145/1111322.1111330.
[97] D. J. Kelly, R. A. Raines, M. R. Grimaila, R. O. Baldwin, B. E. Mullins, A Survey of State-of-the-Art in Anonymity Metrics, in: ACM Workshop on Network Data Anonymization, ACM, 2008, pp. 31–40. doi:10.1145/1456441.1456453.