A Survey of Network-Based Intrusion Detection Data Sets

arXiv:1903.02460v2 [cs.CR] 6 Jul 2019

Abstract—Labeled data sets are necessary to train and evaluate anomaly-based network intrusion detection systems. This work provides a focused literature survey of data sets for network-based intrusion detection and describes the underlying packet- and flow-based network data in detail. The paper identifies 15 different properties to assess the suitability of individual data sets for specific evaluation scenarios. These properties cover a wide range of criteria and are grouped into five categories such as data volume or recording environment to offer a structured search. Based on these properties, a comprehensive overview of existing data sets is given. This overview also highlights the peculiarities of each data set. Furthermore, this work briefly touches upon other sources for network-based data such as traffic generators and data repositories. Finally, we discuss our observations and provide some recommendations for the use and the creation of network-based data sets.

Index Terms—Intrusion Detection, IDS, NIDS, Data Sets, Evaluation, Data Mining

I. INTRODUCTION

IT security is an important issue and much effort has been spent on research into intrusion and insider threat detection. Many contributions have been published for processing security-related data [1]–[4] and for detecting botnets [5]–[8], port scans [9]–[12], brute force attacks [13]–[16], and so on. All these works have in common that they require representative network-based data sets. Furthermore, benchmark data sets are a good basis for evaluating and comparing the quality of different network intrusion detection systems (NIDS). Given a labeled data set in which each data point is assigned to the class normal or attack, the number of detected attacks or the number of false alarms may be used as evaluation criteria. Unfortunately, not many representative data sets are available. According to Sommer and Paxson [17] (2010), the lack of representative publicly available data sets constitutes one of the biggest challenges for anomaly-based intrusion detection. Similar statements are made by Malowidzki et al. [18] (2015) and Haider et al. [19] (2017). However, the community is working on this problem, as several intrusion detection data sets have been published over the last years. In particular, the Australian Centre for Cyber Security published the UNSW-NB15 [20] data set, the University of Coburg published the CIDDS-001 [21] data set, and the University of New Brunswick published the CICIDS 2017 [22] data set. More data sets can be expected in the future. However, there is no overall index of existing data sets and it is hard to keep track of the latest developments.

This work provides a literature survey of existing network-based intrusion detection data sets. At first, the underlying data are investigated in more detail. Network-based data appear in packet-based or flow-based format. While flow-based data contain only meta information about network connections, packet-based data also contain payload. Then, this paper analyzes and groups different data set properties which are often used in the literature to evaluate the quality of network-based data sets. The main contribution of this survey is an exhaustive literature overview of network-based data sets and an analysis of which data set fulfills which data set properties. The paper focuses on attack scenarios within data sets and highlights relations between the data sets. Furthermore, we briefly touch upon traffic generators and data repositories as further sources for network traffic besides typical data sets, and provide some observations and recommendations. As a primary benefit, this survey establishes a collection of data set properties as a basis for comparing available data sets and for identifying suitable data sets, given specific evaluation scenarios. Further, we created a website1 which references all mentioned data sets and data repositories, and we intend to keep this website up to date.

The rest of the paper is organized as follows. The next section discusses related work. Section III analyzes packet- and flow-based network data in more detail. Section IV discusses typical data set properties which are often used in the literature to evaluate the quality of intrusion detection data sets. Section V gives an overview of existing data sets and checks each data set against the identified properties of Section IV. Section VI briefly touches upon further sources for network-based data. Observations and recommendations are discussed in Section VII before the paper concludes with a summary.

II. RELATED WORK

This section reviews related work on network-based data sets for intrusion detection. It should be noted that host-based intrusion detection data sets like ADFA [23] are not considered in this paper. Interested readers may find details on host-based intrusion detection data in Glass-Vanderlan et al. [24].

Malowidzki et al. [18] discuss missing data sets as a significant problem for intrusion detection, set up requirements for good data sets, and list available data sets. Koch et al. [25]

Markus Ring, Sarah Wunderlich, Deniz Scheuring and Dieter Landes were with the Department of Electrical Engineering and Computer Science, Coburg University of Applied Sciences, 96450 Coburg, Germany (e-mail: [email protected], [email protected], [email protected], [email protected]).
Andreas Hotho was with the Data Mining and Information Retrieval Group, University of Würzburg, 97074 Würzburg, Germany (e-mail: [email protected]).
1 http://www.dmir.uni-wuerzburg.de/datasets/nids-ds
III. DATA

Normally, network traffic is captured either in packet-based or flow-based format. Capturing network traffic at packet level is usually done by mirroring ports on network devices. Packet-based data encompass complete payload information. Flow-based data are more aggregated and usually contain only metadata about network connections. Wheelus et al. highlight the distinction with an illustrative comparison: "A good example of the difference between captured packet inspection and NetFlow would be viewing a forest by hiking through the forest as opposed to flying over the forest in a hot air balloon" [35]. In this work, a third category (other data) is introduced. The category other has no standard format and varies for each data set.

A. Packet-based data

Packet-based data are commonly captured in pcap format and contain payload. The available metadata depend on the network and transport protocols used. There are many different protocols; the most important ones are TCP, UDP, ICMP and IP. Figure 1 illustrates the different headers. TCP is a reliable transport protocol and encompasses metadata like sequence number, acknowledgment number, TCP flags, or checksum values. UDP is a connection-less transport protocol with a smaller header than TCP, containing only four fields, namely source port, destination port, length and checksum. In contrast to TCP and UDP, ICMP is a supporting protocol containing status messages and is thus even smaller. Normally, there is also an IP header available beside the header of the transport protocol. The IP header provides information such as source and destination IP addresses and is also shown in Figure 1.

Fig. 1. IP, TCP, UDP and ICMP header after [36].

B. Flow-based data

Flow-based network data are a more condensed format which contains mainly meta information about network connections. Flow-based data aggregate all packets which share some properties within a time window into one flow and usually do not include any payload. The default five-tuple definition, i.e., source IP address, source port, destination IP address, destination port and transport protocol [37], is a widely used
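The header fields described in Section III-A can be read directly from raw capture bytes. The following sketch is our own illustration (not tooling from the paper or any surveyed data set): it extracts the IPv4 and TCP fields named above with Python's struct module. Field offsets follow the standard header layouts; IP options, TCP options and IPv6 are ignored for brevity.

```python
import struct

def parse_ipv4_tcp(frame: bytes) -> dict:
    """Extract the metadata fields of an IPv4 header followed by a TCP header."""
    # IPv4 header: byte 0 holds version and header length (IHL, in 32-bit words),
    # byte 9 the transport protocol (6 = TCP, 17 = UDP, 1 = ICMP),
    # bytes 12-15 / 16-19 the source / destination IP address.
    ihl = (frame[0] & 0x0F) * 4
    proto = frame[9]
    src_ip = ".".join(str(b) for b in frame[12:16])
    dst_ip = ".".join(str(b) for b in frame[16:20])
    # TCP header: source port, destination port, sequence number and
    # acknowledgment number in the first 12 bytes; byte 13 carries the TCP flags.
    tcp = frame[ihl:]
    src_port, dst_port, seq, ack = struct.unpack("!HHII", tcp[:12])
    flags = tcp[13]  # e.g. 0x02 = SYN, 0x10 = ACK
    return {"src_ip": src_ip, "dst_ip": dst_ip, "proto": proto,
            "src_port": src_port, "dst_port": dst_port,
            "seq": seq, "ack": ack, "flags": flags}
```

In practice, pcap files additionally contain a per-file and per-packet header plus a link-layer (e.g. Ethernet) header in front of the IP header, which a real parser would strip first.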
addresses, attack scenarios and so on. This property indicates the presence of additional metadata.

2) Format: Network intrusion detection data sets appear in different formats. We roughly divide them into three formats (see Section III). (1) Packet-based network traffic (e.g. pcap) contains network traffic with payload. (2) Flow-based network traffic (e.g. NetFlow) contains only meta information about network connections. (3) Other types of data sets may contain, e.g., flow-based traces with additional attributes from packet-based data or even from host-based log files.

detection rate and false alarm rate. Therefore, the presence of normal user behavior is indispensable for evaluating an IDS. The absence of normal user behavior, however, does not make a data set unusable, but rather indicates that it has to be merged with other data sets or with real-world network traffic. Such a merging step is often called overlaying or salting [44], [45].

4) Attack Traffic: IDS data sets should include various attack scenarios. This property indicates the presence of malicious network traffic within a data set and has the value yes if the data set contains at least one attack. Table IV provides additional information about the specific attack types.
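The difference between formats (1) and (2) can be made concrete with a small sketch: a NetFlow-style aggregation that collapses packets sharing the default five-tuple of Section III into one unidirectional flow record. The packet dictionary keys and flow attributes are illustrative assumptions on our part, and flow timeouts/time windows are omitted for brevity.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets by the default five-tuple into unidirectional flow records.

    Each packet is a dict with src_ip, src_port, dst_ip, dst_port, proto,
    bytes and ts (timestamp). The payload is deliberately dropped: the flow
    keeps only meta information (packet count, byte count, start/end time).
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0,
                                 "start": None, "end": None})
    for p in packets:
        key = (p["src_ip"], p["src_port"], p["dst_ip"], p["dst_port"], p["proto"])
        f = flows[key]
        f["packets"] += 1
        f["bytes"] += p["bytes"]
        f["start"] = p["ts"] if f["start"] is None else min(f["start"], p["ts"])
        f["end"] = p["ts"] if f["end"] is None else max(f["end"], p["ts"])
    return dict(flows)
```

Real flow exporters (e.g. NetFlow/IPFIX) additionally apply active and idle timeouts, so a long connection may be split into several flow records.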
TABLE II
Decision support table for finding appropriate network-based data sets. Some data sets like CTU-13 provide several data formats and appear in several columns. (+) indicates that the data set is publicly available. (?) indicates that we were not able to find the data set. (-) indicates that the data set is not publicly available.

Labeled = yes:
  packet: (+) Botnet; (+) CIC DoS; (+) CICIDS 2017; (+) DARPA; (+) DDoS 2016; (+) ISCX 2012; (+) ISOT; (+) NDSec-1; (+) NGIDS-DS; (+) TRAbID; (+) TUIDS; (+) UNSW-NB15; (?) PU-IDS; (?) SSENET-2011; (?) SSENET-2014
  flow: (+) CICIDS 2017; (+) CIDDS-001; (+) CIDDS-002; (+) ISCX 2012; (+) TUIDS; (+) Twente
  other: (+) AWID; (+) KDD CUP 99; (+) Kyoto 2006+; (+) NSL-KDD; (+) UNSW-NB15
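A structured search like the one supported by Table II can also be done programmatically. The sketch below is our own illustration: it encodes a small excerpt of the table (not the complete catalog) as tuples of data set name, format, labeling, and availability marker, and filters them for a given evaluation scenario.

```python
# Illustrative excerpt of the decision support table:
# (data set, format, labeled, availability marker)
# "+" = publicly available, "?" = not found, "-" = not publicly available.
CATALOG = [
    ("CICIDS 2017", "packet", True, "+"),
    ("CICIDS 2017", "flow",   True, "+"),
    ("CIDDS-001",   "flow",   True, "+"),
    ("KDD CUP 99",  "other",  True, "+"),
    ("PU-IDS",      "packet", True, "?"),
]

def find_data_sets(fmt, labeled=True, public_only=True):
    """Return the names of data sets matching format, labeling and availability."""
    return sorted({name for name, f, lab, avail in CATALOG
                   if f == fmt and lab == labeled
                   and (not public_only or avail == "+")})
```

For example, querying for labeled, publicly available flow-based data would return CICIDS 2017 and CIDDS-001 from this excerpt.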
Property categories: General Information (Year of Traffic Creation, Public Avail.) | Nature of the Data (Normal Traffic, Attack Traffic, Metadata, Format, Anonymity) | Data Volume (Count, Duration) | Recording Environment (Kind of Traffic, Type of Network, Compl. Network) | Evaluation (Predef. Splits, Balanced, Labeled)

Data Set | Year of Traffic Creation | Public Avail. | Normal Traffic | Attack Traffic | Metadata | Format | Anonymity | Count | Duration | Kind of Traffic | Type of Network | Compl. Network | Predef. Splits | Balanced | Labeled
AWID [49] | 2015 | o.r. | yes | yes | yes | other | none | 37M packets | 1 hour | emulated | small network | yes | yes | no | yes
Booters [50] | 2013 | yes | no | yes | no | packet | yes | 250GB packets | 2 days | real | small network | no | no | no | no
Botnet [5] | 2010/2014 | yes | yes | yes | yes | packet | none | 14GB packets | n.s. | emulated | diverse networks | yes | yes | no | yes
CIC DoS [51] | 2012/2017 | yes | yes | yes | no | packet | none | 4.6GB packets | 24 hours | emulated | small network | yes | no | no | yes
CICIDS 2017 [22] | 2017 | yes | yes | yes | yes | packet, bi. flow | none | 3.1M flows | 5 days | emulated | small network | yes | no | no | yes
CIDDS-001 [21] | 2017 | yes | yes | yes | yes | uni. flow | yes (IPs) | 32M flows | 28 days | emulated and real | small network | yes | no | no | yes
CIDDS-002 [27] | 2017 | yes | yes | yes | yes | uni. flow | yes (IPs) | 15M flows | 14 days | emulated | small network | yes | no | no | yes
CDX [52] | 2009 | yes | yes | yes | yes | packet | none | 14GB packets | 4 days | real | small network | yes | no | no | no
CTU-13 [3] | 2013 | yes | yes | yes | yes | uni. and bi. flow, packet | yes (payload) | 81M flows | 125 hours | real | university network | yes | no | no | yes with BG.
DARPA [53], [54] | 1998/99 | yes | yes | yes | yes | packet, logs | none | n.s. | 7/5 weeks | emulated | small network | yes | yes | no | yes
DDoS 2016 [55] | 2016 | yes | yes | yes | no | packet | yes (IPs) | 2.1M packets | n.s. | synthetic | n.s. | n.s. | no | no | yes
IRSC [56] | 2015 | no | yes | yes | no | packet, flow | n.s. | n.s. | n.s. | real | production network | yes | n.s. | n.s. | yes
ISCX 2012 [28] | 2012 | yes | yes | yes | yes | packet, bi. flow | none | 2M flows | 7 days | emulated | small network | yes | no | no | yes
ISOT [57] | 2010 | yes | yes | yes | yes | packet | none | 11GB packets | n.s. | emulated | small network | yes | no | no | yes
KDD CUP 99 [42] | 1998 | yes | yes | yes | no | other | none | 5M points | n.s. | emulated | small network | yes | yes | no | yes
Kent 2016 [58], [59] | 2016 | yes | yes | n.s. | no | uni. flow, logs | yes (IPs, ports, date) | 130M flows | 58 days | real | enterprise network | yes | no | no | no
Kyoto 2006+ [60] | 2006 to 2009 | yes | yes | yes | no | other | yes (IPs) | 93M points | 3 years | real | honeypots | no | no | no | yes
LBNL [61] | 2004/2005 | yes | yes | yes | no | packet | yes | 160M packets | 5 hours | real | enterprise network | yes | no | no | no
NDSec-1 [62] | 2016 | o.r. | no | yes | no | packet, logs | none | 3.5M packets | n.s. | emulated | small network | yes | no | no | yes
NGIDS-DS [19] | 2016 | yes | yes | yes | no | packet, logs | none | 1M packets | 5 days | emulated | small network | yes | no | no | yes
NSL-KDD [63] | 1998 | yes | yes | yes | no | other | none | 150k points | n.s. | emulated | small network | yes | yes | no | yes
PU-IDS [64] | 1998 | n.i.f. | yes | yes | no | other | none | 200k points | n.s. | synthetic | small network | yes | no | no | yes
PUF [65] | 2018 | n.i.f. | yes | yes | no | uni. flow | yes (IPs) | 300k flows | 3 days | real | university network | no | no | no | yes (IDS)
SANTA [35] | 2014 | no | yes | yes | no | other | yes (payload) | n.s. | n.s. | real | ISP | yes | n.s. | no | yes
SSENET-2011 [47] | 2011 | n.i.f. | yes | yes | no | other | none | n.s. | 4 hours | emulated | small network | yes | no | no | yes
SSENET-2014 [66] | 2011 | n.i.f. | yes | yes | no | other | none | 200k points | 4 hours | emulated | small network | yes | yes | yes | yes
SSHCure [67] | 2013/2014 | yes | yes | yes | no | uni. and bi. flow, logs | yes (IPs) | 2.4GB flows (compressed) | 2 months | real | university network | yes | no | no | indirect
TRAbID [68] | 2017 | yes | yes | yes | no | packet | yes (IPs) | 460M packets | 8 hours | emulated | small network | yes | yes | no | yes
TUIDS [69], [70] | 2011/2012 | o.r. | yes | yes | no | packet, bi. flow | none | 250k flows | 21 days | emulated | medium network | yes | yes | no | yes
Twente [71] | 2008 | yes | no | yes | yes | uni. flow | yes (IPs) | 14M flows | 6 days | real | honeypot | no | no | no | yes
UGR'16 [29] | 2016 | yes | yes | yes | some | uni. flow | yes (IPs) | 16900M flows | 4 months | real | ISP | yes | yes | no | yes with BG.
UNIBS [72] | 2009 | o.r. | yes | no | no | flow | yes (IPs) | 79k flows | 3 days | real | university network | yes | no | no | no
Unified Host and Network [73] | 2017 | yes | yes | n.s. | no | bi. flow, logs | yes (IPs and date) | 150GB flows (compressed) | 90 days | real | enterprise network | yes | no | no | no
UNSW-NB15 [20] | 2015 | yes | yes | yes | yes | packet, other | none | 2M points | 31 hours | emulated | small network | yes | yes | no | yes

n.s. = not specified, n.i.f. = no information found, uni. flow = unidirectional flow, bi. flow = bidirectional flow, yes with BG. = yes with background labels
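For the labeled data sets in the table above, the evaluation criteria mentioned in the introduction (number of detected attacks and number of false alarms) can be turned into a detection rate and a false alarm rate. A minimal sketch of our own, assuming binary normal/attack labels:

```python
def evaluate(labels, predictions):
    """Compute detection rate and false alarm rate from binary IDS output.

    labels and predictions are parallel sequences of "normal" or "attack".
    Detection rate = detected attacks / all attacks;
    false alarm rate = false alarms / all normal data points.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == "attack" and p == "attack")
    fp = sum(1 for l, p in pairs if l == "normal" and p == "attack")
    attacks = sum(1 for l in labels if l == "attack")
    normals = len(labels) - attacks
    detection_rate = tp / attacks if attacks else 0.0
    false_alarm_rate = fp / normals if normals else 0.0
    return detection_rate, false_alarm_rate
```

Note that such rates only make sense when the data set contains both normal and attack traffic, which is exactly why the Normal Traffic and Attack Traffic properties matter.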
data sets are often criticized for artificial attack injections or the large amount of redundancy [63], [75].

DDoS 2016 [55]. Alkasassbeh et al. [55] published a packet-based data set which was created using the network simulator NS2 in 2016. Detailed information about the simulated network environment is not available. The DDoS 2016 data set focuses on different types of DDoS attacks. In addition to normal network traffic, the data set contains four different types of DDoS attacks: UDP flood, smurf, HTTP flood, and SIDDOS. The data set contains 2.1 million packets and can be downloaded at researchgate15.

IRSC [56]. The IRSC data set was recorded in 2015 using an inventive approach. Real network traffic with normal user behavior and attacks from the internet was captured. In addition, further attacks were run manually. The IDS SNORT16 and manual inspection were used for labeling. Since the data set is not publicly available due to privacy concerns, we are not able to fill in all properties in Table III.

ISCX 2012 [28]. The ISCX data set was created in 2012 by capturing traffic in an emulated network environment over one week. The authors used a dynamic approach to generate an intrusion detection data set with normal as well as malicious network behavior. So-called α profiles define attack scenarios while β profiles characterize normal user behavior like writing e-mails or browsing the web. These profiles are used to create a new data set in packet-based and bidirectional flow-based format. The dynamic approach allows an ongoing generation of new data sets. ISCX can be downloaded at the website17 and contains various types of attacks like SSH brute force, DoS or DDoS.

ISOT [57]. The ISOT data set was created in 2010 by combining normal network traffic from the Traffic Lab at Ericsson Research in Hungary [76] and the Lawrence Berkeley National Lab (LBNL) [61] with malicious network traffic from the French chapter of the honeynet project18. ISOT was used for detecting P2P botnets [57]. The resulting data set is publicly available19 and contains 11 GB of packet-based data in pcap format.

KDD CUP 99 [42]. KDD CUP 99 is based on the DARPA data set and is among the most widespread data sets for intrusion detection. Since it is neither in standard packet- nor in flow-based format, it belongs to the category other. The data set contains basic attributes about TCP connections and high-level attributes like the number of failed logins, but no IP addresses. KDD CUP 99 encompasses more than 20 different types of attacks (e.g. DoS or buffer overflow) and comes along with an explicit test subset. The data set includes 5 million data points and can be downloaded freely20.

Kent 2016 [58], [59]. This data set was captured over 58 days at the Los Alamos National Laboratory network. It contains around 130 million flows of unidirectional flow-based network traffic as well as several host-based log files. Network traffic is heavily anonymized for privacy reasons. The data set is not labeled and can be downloaded at the website21.

Kyoto 2006+ [60]. Kyoto 2006+ is a publicly available honeypot data set22 which contains real network traffic, but includes only a small amount and a small range of realistic normal user behavior. Kyoto 2006+ is categorized as other since the IDS Bro23 was used to convert packet-based traffic into a new format called sessions. Each session comprises 24 attributes, 14 of which characterize statistical information inspired by the KDD CUP 99 data set. The remaining 10 attributes are typical flow-based attributes like IP addresses (in anonymized form), ports, or duration. A label attribute indicates the presence of attacks. Data were captured over three years. As a consequence of that unusually long recording period, the data set contains about 93 million sessions.

LBNL [61]. Research on intrusion detection data sets often refers to the LBNL data set. Thus, for the sake of completeness, this data set is also added to the list. The creation of the LBNL data set was mainly motivated by analyzing characteristics of network traffic within enterprise networks, rather than publishing intrusion detection data. According to its creators, the data set might still be used as background traffic by security researchers as it contains almost exclusively normal user behavior. The data set is not labeled, but anonymized for privacy reasons, and contains more than 100 hours of network traffic in packet-based format. The data set can be downloaded at the website24.

NDSec-1 [62]. The NDSec-1 data set is remarkable since it is designed as an attack composition for network security. According to the authors, this data set can be reused to salt existing network traffic with attacks using overlay methodologies like [44]. NDSec-1 is publicly available on request25 and was captured in packet-based format in 2016. It contains additional syslog and Windows event log information. The attack composition of NDSec-1 encompasses botnet, brute force attacks (against FTP, HTTP and SSH), DoS (HTTP flooding, SYN flooding, UDP flooding), exploits, port scans, spoofing, and XSS/SQL injection.

NGIDS-DS [19]. The NGIDS-DS data set contains network traffic in packet-based format as well as host-based log files. It was generated in an emulated environment, using the IXIA Perfect Storm tool to generate normal user behavior as well as attacks from seven different attack families (e.g. DoS or worms). Consequently, the quality of the generated data depends primarily on the IXIA Perfect Storm hardware26. The labeled data set contains approximately 1 million packets and is publicly available27.

NSL-KDD [63]. NSL-KDD enhances the KDD CUP 99. A major criticism against the KDD CUP 99 data set is the

15 https://www.researchgate.net/publication/292967044_Dataset-_Detecting_Distributed_Denial_of_Service_Attacks_Using_Data_Mining_Techniques
16 https://www.snort.org/
17 http://www.unb.ca/cic/datasets/ids.html
18 http://honeynet.org/chapters/france
19 https://www.uvic.ca/engineering/ece/isot/datasets/
20 http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
21 https://csr.lanl.gov/data/cyber1/
22 http://www.takakura.com/Kyoto_data/
23 https://www.bro.org/
24 http://icir.org/enterprise-tracing/
25 http://www2.hs-fulda.de/NDSec/NDSec-1/
26 https://www.ixiacom.com/products/perfectstorm
27 https://research.unsw.edu.au/people/professor-jiankun-hu
large amount of redundancy [63]. Therefore, the authors of NSL-KDD removed duplicates from the KDD CUP 99 data set and created more sophisticated subsets. The resulting data set contains about 150,000 data points and is divided into predefined training and test subsets for intrusion detection methods. NSL-KDD uses the same attributes as KDD CUP 99 and belongs to the category other. Yet, it should be noted that the underlying network traffic of NSL-KDD dates back to the year 1998. The data set is publicly available28.

PU-IDS [64]. The PU-IDS data set is a derivative of the NSL-KDD data set. The authors developed a generator which extracts statistics from an input data set and uses these statistics to generate new synthetic instances. As a consequence, the work of Singh et al. [64] can be seen as a traffic generator used to create PU-IDS, which contains about 200,000 data points and has the same attributes and format as the NSL-KDD data set. As NSL-KDD is based on KDD CUP 1999, which in turn is extracted from DARPA 1998, the year of creation is set to 1998 since the input for the traffic generator was captured back then.

PUF [65]. Recently, Sharma et al. [65] published the flow-based PUF data set which was captured over three days within a campus network and contains exclusively DNS connections. 38,120 out of a total of 298,463 unidirectional flows are malicious while the remaining ones reflect normal user activity. All flows are labeled using logs of an intrusion prevention system. For privacy reasons, IP addresses are removed from the data set. The authors intend to make PUF publicly available.

SANTA [35]. The SANTA data set was captured within an ISP environment and contains real network traffic. The network traffic is labeled through an exhaustive manual procedure and stored in a so-called session-based format. This data format is similar to NetFlow but enriched with additional attributes which are calculated using information from packet-based data. The authors spent much effort on the generation of additional attributes which should enhance intrusion detection methods. SANTA is not publicly available.

SSENet-2011 [47]. SSENet-2011 was captured within an emulated environment over four hours. It contains several attacks like DoS or port scans. Browsing activities of participants generated normal user behavior. Each data point is characterized by 24 attributes. The data set belongs to the category other since the tool Tstat was used to extract adjusted data points from packet-based traffic. We found no information about public availability.

SSENet-2014 [66]. SSENet-2014 was created by extracting attributes from the packet-based files of SSENet-2011 [47]. Thus, like SSENet-2011, the data set is categorized as other. The authors extracted 28 attributes for each data point which describe host-based and network-based properties. The created attributes are in line with KDD CUP 1999. SSENet-2014 contains 200,000 labeled data points and is divided into a training and a test subset. SSENet-2014 is the only known data set with a balanced training subset. Again, no information on public availability could be found.

SSHCure [67]. Hofstede et al. [67] propose SSHCure, a tool for SSH attack detection. To evaluate their work, the authors captured two data sets (each over a period of one month) within a university network. The resulting data sets are publicly available29 and contain exclusively SSH network traffic. The flow-based network traffic is not directly labeled. Instead, the authors provide additional host-based log files which may be used to check whether SSH login attempts were successful or not.

TRAbID [68]. Viegas et al. proposed the TRAbID database [68] in 2017. This database contains 16 different scenarios for evaluating IDS. Each scenario was captured within an emulated environment (1 honeypot server and 100 clients). In each scenario, the traffic was captured for a period of 30 minutes and some attacks were executed. To label the network traffic, the authors used the IP addresses of the clients. All clients were Linux machines. Some clients exclusively performed attacks while most of the clients exclusively issued normal user requests to the honeypot server. Normal user behavior includes HTTP, SMTP, SSH and SNMP traffic while malicious network traffic encompasses port scans and DoS attacks. TRAbID is publicly available30.

TUIDS [69], [70]. The labeled TUIDS data set can be divided into three parts: the TUIDS intrusion data set, the TUIDS coordinated scan data set and the TUIDS DDoS data set. As the names already indicate, the data sets contain normal user behavior and primarily attacks like port scans or DDoS. Data were generated within an emulated environment which contains around 250 clients. Traffic was captured in packet-based and bidirectional flow-based format. Each subset spans a period of seven days and all three subsets contain around 250,000 flows. Unfortunately, the link31 to the data set in the original publication seems to be outdated. However, the authors respond to e-mail requests.

Twente [71]. Sperotto et al. [71] published one of the first flow-based intrusion detection data sets in 2008. This data set spans six days of traffic involving a honeypot server which offers web, FTP, and SSH services. Due to this approach, the data set contains only network traffic from the honeypot and nearly all flows are malicious, without normal user behavior. The authors analyzed log files and traffic in packet-based format to label the flows of this data set. The data set is publicly available32 and IP addresses were removed due to privacy concerns.

UGR'16 [29]. UGR'16 is a unidirectional flow-based data set. Its focus lies on capturing periodic effects in an ISP environment. Thus, it spans a period of four months and contains 16,900 million unidirectional flows. IP addresses are anonymized and the flows are labeled as normal, background, or attack. The authors explicitly executed several attacks (botnet, DoS, and port scans) within that data set. The corresponding flows are labeled as attacks, and some other attacks were identified and manually labeled as attack. Injected normal user behavior and traffic which matches certain patterns are

28 http://www.unb.ca/cic/datasets/nsl.html
29 https://www.simpleweb.org/wiki/index.php
30 https://secplab.ppgia.pucpr.br/trabid
31 http://agnigarh.tezu.ernet.in/~dkb/resources.html
32 https://www.simpleweb.org/wiki/index.php
11
labeled as normal. However, most of the traffic is labeled as CAIDA38 . CAIDA collects different types of data sets, with
background which could be normal or an attack. The data set varying degree of availability (public access or on request), and
is publicly available33 . provides a search engine. Generally, a form needs to be filled
UNIBS 2009 [72]. Like LBNL [61], the UNIBS 2009 data out to gain access to some of the public data sets. Additionally,
set was not created for intrusion detection. Since UNIBS 2009 most network-based data sets can exclusively be requested
is referenced in other work, it is still added to the list. Gringoli through an IMPACT (see below) login since CAIDA supports
et al. [72] used the data set to identify applications (e.g. IMPACT as Data Provider. The repository is managed and
web browsers, Skype or mail clients) based on their flow- updated with new data.
based network traffic. UNIBS 2009 contains around 79,000 Contagiodump39 . Contagiodump is a blog about malware
flows without malicious behavior. Since the labels just describe dumps. There are several posts each year and the last post
the application protocols of the flows, network traffic is not was on 20th March 2018. The website contains, among other
categorized as normal or attack. Consequently, the property things, a collection of pcap files from malware analysis.
label in the categorization scheme is set to no. The data set is covert.io40 . Covert.io is a blog about security and machine
publicly available34 . learning by Jason Trost. The blog maintains different lists of
Unified Host and Network Data Set [73]. This data set tutorials, GitHub repositories, research papers and other blogs
contains host and network-based data which were captured concerning security, big data, and machine learning, but also
within a real environment, the LANL (Los Alamos National a collection of various security-based data resources41 . The
Laboratory) enterprise network. For privacy reasons, attributes latest entry was posted on August 14, 2017 by Jason Trost.
like IP addresses and timestamps were anoynmized in bidirec- DEF CON CTF Archive42 . DEF CON is a popular annual
tional flow-based network traffic files. The network traffic was hacker convention. The event includes a capture the flag (CTF)
collected for a period of 90 days and has no labels. The data competition where every team has to defend their own network
set is publicly available35 . against the other teams whilst simultaneously hacking the
UNSW-NB15 [20]. The UNSW-NB15 data set encompasses opponents’ networks. The competition is typically recorded
normal and malicious network traffic in packet-based format which was created using the IXIA Perfect Storm tool in a small emulated environment over 31 hours. It contains nine different families of attacks like backdoors, DoS, exploits, fuzzers, or worms. The data set is also available in flow-based format with additional attributes. UNSW-NB15 comes along with predefined splits for training and test. The data set includes 45 distinct IP addresses and is publicly available36.

VI. OTHER DATA SOURCES

Besides network-based data sets, there are some other data sources for packet-based and flow-based network traffic. In the following, we briefly discuss data repositories and traffic generators.

A. Data Repositories

Besides traditional data sets, several data repositories can be found on the internet. Since the type and structure of those repositories differ greatly, we abstain from a tabular comparison. Instead, we give a brief textual overview in alphabetical order. All repositories were checked on 26 February 2019 with respect to their up-to-dateness.

AZSecure37. AZSecure is a repository of network data at the University of Arizona for use by the research community. It includes various data sets in pcap, arff and other formats, some of which are labeled while others are not. AZSecure encompasses, among others, the CTU-13 data set [3] and the Unified Host and Network Data Set [73]. The repository is managed and contains some recent data sets.

and available in packet-based format on the website. Given the nature of the competition, the recorded data almost exclusively contain attack traffic and little normal user behavior. The website is current and updated annually with new data from the CTF competitions.

IMPACT43. IMPACT Cyber Trust, formerly known as PREDICT, is a community of data providers, cyber security researchers as well as coordinators. IMPACT is actively administered and up-to-date. A data catalog is given on the site to browse the data sets provided by the community. The data providers are (among others) DARPA, the MIT Lincoln Laboratory, and the UCSD Center for Applied Internet Data Analysis (CAIDA). However, the data sets can only be downloaded with an account that may be requested exclusively by researchers from eight selected countries approved by the US Department of Homeland Security. As Germany is not among the approved locations, no further statements about the data sets can be made.

Internet Traffic Archive44. The Internet Traffic Archive is a repository of internet traffic traces sponsored by ACM SIGCOMM. The list includes four extensively anonymized packet-based traces. In particular, the payload has been removed, all timestamps are relative to the first packet, and IP addresses have been changed to numerical representations. The packet-based data sets were captured more than 20 years ago and can be downloaded without restriction.

33 https://nesg.ugr.es/nesg-ugr16/index.php
34 http://netweb.ing.unibs.it/~ntw/tools/traces/
35 https://csr.lanl.gov/data/2017.html
36 https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/
37 https://www.azsecure-data.org/other-data.html
38 http://www.caida.org/data/overview/
39 http://contagiodump.blogspot.com/
40 http://www.covert.io
41 http://www.covert.io/data-links/
42 https://www.defcon.org/html/links/dc-ctf.html
43 https://www.impactcybertrust.org/
44 http://ita.ee.lbl.gov/html/traces.html
45 https://www.kaggle.com/
Kaggle45. Kaggle is an online platform for sharing and publishing data sets. The platform contains security-related data sets like KDD CUP 99 and has a search function. It also allows registered users to upload and explore data analysis models.

Malware Traffic Analysis46. Malware Traffic Analysis is a repository which contains blog posts and exercises related to network traffic analysis, e.g. identifying malicious activities. Exercises come along with packet-based network traffic which is indirectly labeled through the provided answers to the exercises. Downloadable files are secured with a password which can be obtained from the website. The repository is recent and new blog posts are issued almost daily.

Mid-Atlantic CCDC47. Similar to DEFCON CTF, MACCDC is an annual competition hosted by the US National CyberWatch Center where the captured packet-based traffic of the competitions is made available. Teams have to ensure that services provided by their network are not interrupted in any way. Similar to the DEFCON CTF archives, MACCDC data contain almost exclusively attack traffic and little normal user behavior. The latest competition took place in 2018.

MAWILab48. The MAWILab repository contains a huge amount of network traffic captured over a long time at a link between the USA and Japan. For each day since 2007, the repository contains a 15 minute trace in packet-based format. For privacy reasons, IP addresses are anonymized and packet payloads are omitted. The captured network traffic is labeled using different anomaly detection methods [77].

MWS49. The anti-malware engineering workshop (MWS) is an annual workshop about malware in Japan. The workshop comes along with several MWS data sets which contain packet-based network data as well as host-based log files. However, the data sets are only shared within the MWS community, which consists of researchers in industry and academia in Japan [78]. The latest workshop took place in 2018.

NETRESEC50. NETRESEC maintains a comprehensive list of publicly available pcap files on the internet. Similar to SecRepo, NETRESEC refers to many repositories mentioned in this work, but also incorporates additional sources like honeypot dumps or CTF events. Its up-to-dateness can only be judged indirectly, as NETRESEC also refers to data traces from the year 2018.

OpenML51. OpenML is an up-to-date platform for sharing machine learning data sets. It also contains security-related data sets like KDD CUP 99. The platform has a search function and comes along with other possibilities like creating scientific tasks.

RIPE Data Repository52. The RIPE data repository hosts a number of data sets. Yet, no new data sets have been included for several years. To obtain access, users need to create an account and accept the terms and conditions of the data sets. The repository also mirrors some data available from the Waikato Internet Traffic Storage (see below).

SecRepo53. SecRepo lists different samples of security-related data and is maintained by Mike Sconzo. The list is divided into the following categories: Network, Malware, System, File, Password, Threat Feeds and Other. The very detailed list contains references to typical data sets like DARPA, but also to many repositories (e.g. NETRESEC). The website was last updated on November 20, 2018.

Simple Web54. Simple Web provides a database collection and information on network management tutorials and software. The repository includes traces in different formats like packet- or flow-based network traffic. It is hosted by the University of Twente, maintained by members of the DACS (Design and Analysis of Communication Systems) group, and updated with new results from this group.

UMassTraceRepository55. UMassTraceRepository provides the research community with several traces of network traffic. Some of these traces have been collected by the suppliers of the archive themselves while others have been donated. The archive includes 19 packet-based data sets from different sources. The most recent data sets were captured in 2018.

VAST Challenge56. The IEEE Visual Analytics Science and Technology (VAST) challenge is an annual contest with the goal of advancing the field of visual analytics through competition. In some challenges, network traffic data were provided for contest tasks. For instance, the second mini challenge of the VAST 2011 competition involved an IDS log consisting of packet-based network traffic in pcap format. A similar setup was used in a follow-up VAST challenge in 2012. Furthermore, a VAST challenge in 2013 dealt with flow-based network traffic.

WITS: Waikato Internet Traffic Storage57. This website aims to list all internet traces possessed by the WAND research group. The data sets are typically available in packet-based format and free to download from the Waikato servers. However, the repository has not been updated for a long time.

B. Traffic Generators

Another source of network traffic for intrusion detection research are traffic generators. Traffic generators are models which create synthetic network traffic. In most cases, traffic generators use user-defined parameters or extract basic properties of real network traffic to create new synthetic network traffic. While data sets and data repositories provide fixed data, traffic generators allow the generation of network traffic which can be adapted to certain network structures.

For instance, the traffic generators FLAME [79] and ID2T [80] use real network traffic as input. This input traffic serves as a baseline for normal user behavior.

46 http://malware-traffic-analysis.net/
47 http://maccdc.org/
48 http://www.fukuda-lab.org/mawilab/
49 https://www.iwsec.org/mws/2018/en.html
50 http://www.netresec.com/?page=PcapFiles
51 https://www.openml.org/home
52 https://labs.ripe.net/datarepository
53 http://www.secrepo.com/
54 https://www.simpleweb.org/wiki/index.php/
55 http://traces.cs.umass.edu/
56 http://vacommunity.org/tiki-index.php
57 https://wand.net.nz/wits/catalogue.php
Then, FLAME and ID2T add malicious network traffic by editing values of the input traffic or by injecting synthetic flows under consideration of typical attack patterns. Siska et al. [81] present a graph-based flow generator which extracts traffic templates from real network traffic. Then, their generator uses these traffic templates in order to create new synthetic flow-based network traffic. Ring et al. [82] adapted GANs for generating synthetic network traffic. The authors use Improved Wasserstein Generative Adversarial Networks (WGAN-GP) to create flow-based network traffic. The WGAN-GP is trained with real network traffic and learns traffic characteristics. After training, the WGAN-GP is able to create new synthetic flow-based network traffic with similar characteristics. Erlacher and Dressler's traffic generator GENESIDS [83] generates HTTP attack traffic based on user-defined attack descriptions. There are many additional traffic generators which are not discussed here for the sake of brevity; instead, we refer to Molnár et al. [84] for an overview of traffic generators.

Brogi et al. [85] come up with another idea that in some sense resembles traffic generators. Starting out from the problem of sharing data sets due to privacy concerns, they present Moirai, a framework which allows users to share complete scenarios instead of data sets. The idea behind Moirai is to replay attack scenarios in virtual machines such that users can generate data on the fly.

A third approach, which can also be categorized into the larger context of traffic generators, are frameworks which support users in labeling real network traffic. Rajasinghe et al. present such a framework called INSecS-DCS [86] which captures network traffic at network devices or uses already captured network traffic in pcap files as input. Then, INSecS-DCS divides the data stream into time windows, extracts data points with appropriate attributes, and labels the network traffic based on a user-defined attacker IP address list. Consequently, the focus of INSecS-DCS is on labeling network traffic and on extracting meaningful attributes. Aparicio-Navarro et al. [87] present an automatic data set labeling approach using an unsupervised anomaly-based IDS. Since no IDS is able to classify each data point to the correct class, the authors take some middle ground to reduce the number of false positives and false negatives. The IDS assigns belief values to each data point for the classes normal and attack. If the difference between the belief values for these two classes is smaller than a predefined threshold, the data point is removed from the data set. This approach increases the quality of the labels, but may discard the most interesting data points of the data set.
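The IP-list-based labeling strategy can be sketched in a few lines of Python. This is an illustration of the general idea, not code from INSecS-DCS; the flow fields (src_ip, dst_ip) and the attacker addresses are hypothetical:

```python
# Sketch of labeling flow records via a user-defined attacker IP list
# (illustrative only; field names and addresses are hypothetical).
ATTACKER_IPS = {"192.168.220.15", "192.168.220.16"}

def label_flow(flow: dict) -> str:
    """Mark a flow as 'attack' if either endpoint is a known attacker."""
    if flow["src_ip"] in ATTACKER_IPS or flow["dst_ip"] in ATTACKER_IPS:
        return "attack"
    return "normal"

flows = [
    {"src_ip": "192.168.220.15", "dst_ip": "10.0.0.5"},  # attacker flow
    {"src_ip": "10.0.0.7", "dst_ip": "10.0.0.5"},        # benign flow
]
labels = [label_flow(f) for f in flows]  # ['attack', 'normal']
```

The obvious limitation is that every flow touching an attacker host is labeled as attack, including benign background traffic from that host, which is one reason why manual verification of labels remains advisable.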
VII. OBSERVATIONS AND RECOMMENDATIONS

Labeled data sets are inevitable for training supervised data mining methods like classification algorithms and helpful for the evaluation of supervised as well as unsupervised data mining methods. Consequently, labeled network-based data sets can be used to compare the quality of different NIDS with each other. In any case, however, the data sets must be representative to be suitable for those tasks. The community is aware of the importance of realistic network-based data, and this survey shows that there are many sources for such data (data sets, data repositories, and traffic generators). Furthermore, this work establishes a collection of data set properties as a basis for comparing available data sets and for identifying suitable data sets, given specific evaluation scenarios.

In the following, we discuss some aspects concerning the use of available data sets and the creation of new data sets.

Perfect data set: The ever-increasing number of attack scenarios, accompanied by new and more complex software and network structures, leads to the requirement that data sets should contain up-to-date and real network traffic. Since there is no perfect IDS, labeling of data points should be checked manually rather than being done exclusively by an IDS. Consequently, the perfect network-based data set is up-to-date, correctly labeled, publicly available, contains real network traffic with all kinds of attacks and normal user behavior as well as payload, and spans a long time. Such a data set, however, does not exist and will (probably) never be created. If privacy concerns could be satisfied and real-world network traffic (in packet-based format) with all kinds of attacks could be recorded over a sufficiently long time, accurate labeling of such traffic would be very time-consuming. As a consequence, the labeling process would take so much time that the data set would already be slightly outdated, since new attack scenarios appear continuously. However, several available data sets satisfy some properties of a perfect data set. Besides, most applications do not require a perfect data set; a data set which satisfies certain properties is often sufficient. For instance, a data set does not need to contain all types of attacks when evaluating a new port scan detection algorithm, and there is no need for a complete network configuration when evaluating the security of a specific server. Therefore, we hope that this work supports researchers in finding the appropriate data set for their specific evaluation scenario.

Use of several data sets: As mentioned above, no perfect network-based data set exists. However, this survey shows that several data sets (and other data sources) are available for packet- and flow-based network traffic. Therefore, we recommend that users evaluate their intrusion detection methods with more than one data set in order to avoid over-fitting to a certain data set, reduce the influence of artificial artifacts of a certain data set, and evaluate their methods in a more general context. In addition, Hofstede et al. [88] show that flow-based network traffic differs between lab environments and production networks. Therefore, another approach could be to use both emulated or synthetic data sets and real-world network traffic to emphasize these points.

In order to ensure reproducibility for third parties, we recommend evaluating intrusion detection methods with at least one publicly available data set.

Further, we would like to give a general recommendation for the use of the CICIDS 2017, CIDDS-001, UGR'16 and UNSW-NB15 data sets. These data sets may be suitable for general evaluation settings. CICIDS 2017 and UNSW-NB15 contain a wide range of attack scenarios. CIDDS-001 contains detailed metadata for deeper investigations. UGR'16 stands out by the huge amount of flows.
However, it should be considered that this recommendation reflects our personal views. The recommendation does not imply that other data sets are inappropriate. For instance, we only refrain from including the more widespread CTU-13 and ISCX 2012 data sets in our recommendation due to their increasing age. Further, other data sets like AWID or Botnet are better suited for certain evaluation scenarios.

Predefined Subsets: Furthermore, we want to make a note on the evaluation of anomaly-based NIDS. Machine learning and data mining methods often use so-called 10-fold cross-validation [89]. This method divides the data set into ten equal-sized subsets. One subset is used for testing and the other nine subsets are used for training. This procedure is repeated ten times, such that every subset has been used once for testing. However, this straightforward splitting of data sets makes only limited sense for intrusion detection. For instance, the port scan data set CIDDS-002 [27] contains two weeks of network traffic in flow-based format. Each port scan within this data set may cause thousands of flows. Using 10-fold cross-validation would lead to the situation that some flows of each attack probably appear in the training data set. Thus, attack detection in the test data is facilitated and generalization is not properly evaluated.

In that scenario, it would be better to train on week1 and test on week2 (and vice versa) for the CIDDS-002 data set. Defining subsets in this way may also account for the impact of concept drift in network traffic over time. Another approach for creating suitable subsets might be to split the whole data set based on traffic characteristics like source IP addresses. However, such subsets must be well designed to preserve the basic network structures of the data set. For instance, a training data set with exclusively source IP addresses which represent clients and no servers would be inappropriate.

Based on these observations, we recommend creating meaningful training and test splits with respect to the application domain IT security. Therefore, benchmark data sets should be published with predefined splits for training and test to facilitate comparisons of different approaches evaluated on the same data.
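The pitfall described above can be made concrete with a small sketch. The records below are hypothetical timestamped flows, not data from CIDDS-002; the point is that a time-based split keeps whole attacks out of the training data, while shuffled 10-fold cross-validation would mix flows of the same attack into both subsets:

```python
from datetime import datetime, timedelta

# Hypothetical flow records over two weeks: (timestamp, label)
start = datetime(2017, 3, 1)
flows = [(start + timedelta(hours=6 * i),
          "attack" if i % 5 == 0 else "normal")
         for i in range(56)]  # one flow every 6 hours for 14 days

# Time-based split: train on week 1, test on week 2 (and vice versa)
split = start + timedelta(days=7)
train = [f for f in flows if f[0] < split]
test = [f for f in flows if f[0] >= split]
assert len(train) == 28 and len(test) == 28

# A shuffled 10-fold split, in contrast, would draw both subsets from
# the full two weeks, so flows belonging to one attack would almost
# certainly end up in training and test at the same time.
```

Because the time-based split never trains on flows recorded after the split point, it forces the detector to generalize to attacks it has not partially seen during training.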
Closer collaboration: This study shows (see Section V) that many data sets have been published in the last few years and that the community continuously works on creating new intrusion detection data sets. Further, the community could benefit from closer collaboration and a single generally accepted platform for sharing intrusion detection data sets without any access restrictions. For instance, Cermak et al. [90] work on establishing such a platform for sharing intrusion detection data sets. Likewise, Ring et al. [21] published their scripts for emulating normal user behavior and attacks such that they can be used and improved by third parties. A short summary of all mentioned data sets and data repositories can be found at our website58 and we intend to update this website with upcoming network-based data sets.

Standard formats: Most network-based intrusion detection approaches require standard input data formats and cannot handle preprocessed data. Further, it is questionable if data sets from the category other (Section III-C) can be calculated in real time, which may affect their usefulness in NIDS. Therefore, we suggest providing network-based data sets in standard packet-based or flow-based formats as they are captured in real network environments. Simultaneously, many anomaly-based approaches (e.g., [91] or [92]) achieve high detection rates on data sets from the category other, which is an indicator that the calculated attributes are promising for intrusion detection. Therefore, we recommend publishing both the network-based data sets in a standard format and the scripts for transforming the data sets into other formats. Such an approach would have two advantages. First, users may decide if they want to transfer data sets into other formats, and a larger number of researchers could use the corresponding data sets. Second, the scripts could also be applied to future data sets.

Anonymization: Anonymization is another important issue since it may complicate the analysis of network-based data sets. Therefore, it should be carefully evaluated which attributes have to be discarded and which attributes may be published in anonymized form. Various authors demonstrate the effectiveness of using only small parts of the payload. For example, Mahoney [93] proposes an intrusion detection method which uses the first 48 bytes of each packet starting with the IP header. The flow exporter YAF [94] allows the creation of such attributes by extracting the first n bytes of payload or by calculating the entropy of the payload. Generally, there are several methods for anonymization. For example, Xu et al. [95] propose a prefix-preserving IP address anonymization technique. Tcpmkpub [96] is an anonymization tool for packet-based network traffic which allows the anonymization of some attributes like IP addresses and also computes new values for header checksums. We refer to Kelly et al. [97] for a more comprehensive review of anonymization techniques for network-based data.
Publication: We recommend the publication of network-based data sets. Only publicly available data sets can be used by third parties and thus serve as a basis for evaluating NIDS. Likewise, the quality of data sets can only be checked by third parties if they are publicly available. Last but not least, we recommend the publication of additional metadata such that third parties are able to analyze the data and their results in more detail.

58 http://www.dmir.uni-wuerzburg.de/datasets/nids-ds

VIII. SUMMARY

Labeled network-based data sets are necessary for training and evaluating NIDS. This paper provides a literature survey of available network-based intrusion detection data sets. To this end, standard network-based data formats are analyzed in more detail. Further, 15 properties are identified that may be used to assess the suitability of data sets. These properties are grouped into five categories: General Information, Nature of the Data, Data Volume, Recording Environment and Evaluation.

The paper's main contribution is a comprehensive overview of 34 data sets which points out the peculiarities of each data set. Thereby, a particular focus was placed on attack scenarios within the data sets and their interrelationships. In addition, each data set was assessed with respect to the properties of the categorization scheme developed in the first step.
This detailed investigation aims to support readers in identifying data sets for their purposes. The review of data sets shows that the research community has noticed a lack of publicly available network-based data sets and tries to overcome this shortage by publishing a considerable number of data sets over the last few years. Since several research groups are active in this area, additional intrusion detection data sets and improvements can be expected soon.

As further sources for network traffic, traffic generators and data repositories are discussed in Section VI. Traffic generators create synthetic network traffic and can be used to create adapted network traffic for specific scenarios. Data repositories are collections of different network traces on the internet. Compared to the data sets in Section V, data repositories often provide limited documentation, non-labeled data sets or network traffic of specific scenarios (e.g., exclusively FTP connections). However, these data sources should be taken into account when searching for suitable data, especially for specialized scenarios. Finally, we discussed some observations and recommendations for the use and generation of network-based intrusion detection data sets. We encourage users to evaluate their methods on several data sets to avoid over-fitting to a certain data set and to reduce the influence of artificial artifacts of a certain data set. Further, we advocate data sets in standard formats including predefined training and test subsets. Overall, there probably won't be a perfect data set, but there are many very good data sets available, and the community could benefit from closer collaboration.

ACKNOWLEDGMENT

M.R. and S.W. were supported by the BayWISS Consortium Digitisation. S.W. is additionally funded by the Bavarian State Ministry of Science and Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).

REFERENCES

[1] V. Chandola, E. Eilertson, L. Ertoz, G. Simon, V. Kumar, Data Mining for Cyber Security, in: A. Singhal (Ed.), Data Warehousing and Data Mining Techniques for Computer Security, 1st Edition, Springer, 2006, pp. 83–107.
[2] M. Rehák, M. Pechoucek, K. Bartos, M. Grill, P. Celeda, V. Krmicek, CAMNEP: An intrusion detection system for high-speed networks, Progress in Informatics 5 (5) (2008) 65–74. doi:10.2201/NiiPi.2008.5.7.
[3] S. Garcia, M. Grill, J. Stiborek, A. Zunino, An empirical comparison of botnet detection methods, Computers & Security 45 (2014) 100–123. doi:10.1016/j.cose.2014.05.011.
[4] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, A Toolset for Intrusion and Insider Threat Detection, in: I. Palomares, H. Kalutarage, Y. Huang (Eds.), Data Analytics and Decision Support for Cybersecurity: Trends, Methodologies and Applications, Springer, 2017, pp. 3–31. doi:10.1007/978-3-319-59439-2_1.
[5] E. B. Beigi, H. H. Jazi, N. Stakhanova, A. A. Ghorbani, Towards Effective Feature Selection in Machine Learning-Based Botnet Detection Approaches, in: IEEE Conference on Communications and Network Security, IEEE, 2014, pp. 247–255. doi:10.1109/CNS.2014.6997492.
[6] M. Stevanovic, J. M. Pedersen, An analysis of network traffic classification for botnet detection, in: IEEE International Conference on Cyber Situational Awareness, Data Analytics and Assessment (CyberSA), IEEE, 2015, pp. 1–8. doi:10.1109/CyberSA.2015.7361120.
[7] J. Wang, I. C. Paschalidis, Botnet Detection Based on Anomaly and Community Detection, IEEE Transactions on Control of Network Systems 4 (2) (2017) 392–404. doi:10.1109/TCNS.2016.2532804.
[8] C. Yin, Y. Zhu, S. Liu, J. Fei, H. Zhang, An Enhancing Framework for Botnet Detection Using Generative Adversarial Networks, in: International Conference on Artificial Intelligence and Big Data (ICAIBD), 2018, pp. 228–234. doi:10.1109/ICAIBD.2018.8396200.
[9] S. Staniford, J. A. Hoagland, J. M. McAlerney, Practical automated detection of stealthy portscans, Journal of Computer Security 10 (1-2) (2002) 105–136.
[10] J. Jung, V. Paxson, A. W. Berger, H. Balakrishnan, Fast Portscan Detection Using Sequential Hypothesis Testing, in: IEEE Symposium on Security & Privacy, IEEE, 2004, pp. 211–225. doi:10.1109/SECPRI.2004.1301325.
[11] A. Sridharan, T. Ye, S. Bhattacharyya, Connectionless Port Scan Detection on the Backbone, in: IEEE International Performance Computing and Communications Conference, IEEE, 2006, pp. 10–19. doi:10.1109/.2006.1629454.
[12] M. Ring, D. Landes, A. Hotho, Detection of slow port scans in flow-based network traffic, PLOS ONE 13 (9) (2018) 1–18. doi:10.1371/journal.pone.0204507.
[13] A. Sperotto, R. Sadre, P.-T. de Boer, A. Pras, Hidden Markov Model modeling of SSH brute-force attacks, in: International Workshop on Distributed Systems: Operations and Management, Springer, 2009, pp. 164–176. doi:10.1007/978-3-642-04989-7_13.
[14] L. Hellemons, L. Hendriks, R. Hofstede, A. Sperotto, R. Sadre, A. Pras, SSHCure: A Flow-Based SSH Intrusion Detection System, in: International Conference on Autonomous Infrastructure, Management and Security (IFIP), Springer, 2012, pp. 86–97. doi:10.1007/978-3-642-30633-4_11.
[15] M. Javed, V. Paxson, Detecting Stealthy, Distributed SSH Brute-Forcing, in: ACM SIGSAC Conference on Computer & Communications Security, ACM, 2013, pp. 85–96. doi:10.1145/2508859.2516719.
[16] M. M. Najafabadi, T. M. Khoshgoftaar, C. Kemp, N. Seliya, R. Zuech, Machine Learning for Detecting Brute Force Attacks at the Network Level, in: International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2014, pp. 379–385. doi:10.1109/BIBE.2014.73.
[17] R. Sommer, V. Paxson, Outside the Closed World: On Using Machine Learning For Network Intrusion Detection, in: IEEE Symposium on Security and Privacy, IEEE, 2010, pp. 305–316. doi:10.1109/SP.2010.25.
[18] M. Małowidzki, P. Berezinski, M. Mazur, Network Intrusion Detection: Half a Kingdom for a Good Dataset, in: NATO STO SAS-139 Workshop, Portugal, 2015.
[19] W. Haider, J. Hu, J. Slay, B. Turnbull, Y. Xie, Generating realistic intrusion detection system dataset based on fuzzy qualitative modeling, Journal of Network and Computer Applications 87 (2017) 185–192. doi:10.1016/j.jnca.2017.03.018.
[20] N. Moustafa, J. Slay, UNSW-NB15: A Comprehensive Data Set for Network Intrusion Detection Systems, in: Military Communications and Information Systems Conference (MilCIS), IEEE, 2015, pp. 1–6. doi:10.1109/MilCIS.2015.7348942.
[21] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, Flow-based benchmark data sets for intrusion detection, in: European Conference on Cyber Warfare and Security (ECCWS), ACPI, 2017, pp. 361–369.
[22] I. Sharafaldin, A. H. Lashkari, A. A. Ghorbani, Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization, in: International Conference on Information Systems Security and Privacy (ICISSP), 2018, pp. 108–116. doi:10.5220/0006639801080116.
[23] G. Creech, J. Hu, Generation of a New IDS Test Dataset: Time to Retire the KDD Collection, in: IEEE Wireless Communications and Networking Conference (WCNC), IEEE, 2013, pp. 4487–4492. doi:10.1109/WCNC.2013.6555301.
[24] T. R. Glass-Vanderlan, M. D. Iannacone, M. S. Vincent, R. A. Bridges, et al., A Survey of Intrusion Detection Systems Leveraging Host Data, arXiv preprint arXiv:1805.06070.
[25] R. Koch, M. Golling, G. D. Rodosek, Towards Comparability of Intrusion Detection Systems: New Data Sets, in: TERENA Networking Conference, Vol. 7, 2014.
[26] J. O. Nehinbe, A critical evaluation of datasets for investigating IDSs and IPSs researches, in: IEEE International Conference on Cybernetic Intelligent Systems (CIS), IEEE, 2011, pp. 92–97. doi:10.1109/CIS.2011.6169141.
[27] M. Ring, S. Wunderlich, D. Grüdl, D. Landes, A. Hotho, Creation of Flow-Based Data Sets for Intrusion Detection, Journal of Information Warfare 16 (2017) 40–53.
[28] A. Shiravi, H. Shiravi, M. Tavallaee, A. A. Ghorbani, Toward developing a systematic approach to generate benchmark datasets for intrusion detection, Computers & Security 31 (3) (2012) 357–374. doi:10.1016/j.cose.2011.12.012.
[29] G. Maciá-Fernández, J. Camacho, R. Magán-Carrión, P. García-Teodoro, R. Therón, UGR'16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs, Computers & Security 73 (2018) 411–424. doi:10.1016/j.cose.2017.11.004.
[30] I. Sharafaldin, A. Gharib, A. H. Lashkari, A. A. Ghorbani, Towards a Reliable Intrusion Detection Benchmark Dataset, Software Networking 2018 (1) (2018) 177–200. doi:10.13052/jsn2445-9739.2017.009.
[31] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Network Anomaly Detection: Methods, Systems and Tools, IEEE Communications Surveys & Tutorials 16 (1) (2014) 303–336. doi:10.1109/SURV.2013.052213.00046.
[32] A. Nisioti, A. Mylonas, P. D. Yoo, V. Katos, From Intrusion Detection to Attacker Attribution: A Comprehensive Survey of Unsupervised Methods, IEEE Communications Surveys & Tutorials 20 (4) (2018) 3369–3388. doi:10.1109/COMST.2018.2854724.
[33] O. Yavanoglu, M. Aydos, A Review on Cyber Security Datasets for Machine Learning Algorithms, in: IEEE International Conference on Big Data, IEEE, 2017, pp. 2186–2193. doi:10.1109/BigData.2017.8258167.
[34] C. T. Giménez, A. P. Villegas, G. Á. Marañón, HTTP data set CSIC 2010, (Date last accessed 22-June-2018) (2010).
[35] C. Wheelus, T. M. Khoshgoftaar, R. Zuech, M. M. Najafabadi, A Session Based Approach for Aggregating Network Traffic Data - The SANTA Dataset, in: IEEE International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, 2014, pp. 369–378. doi:10.1109/BIBE.2014.72.
[36] A. S. Tanenbaum, D. Wetherall, Computer Networks, 5th Edition, Pearson, 2011.
[37] B. Claise, Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of IP Traffic Flow Information, RFC 5101 (2008).
[38] B. Claise, Cisco Systems NetFlow Services Export Version 9, RFC 3954 (2004).
[39] P. Phaal, sFlow Specification Version 5 (2004). URL https://sflow.org/sflow_version_5.txt
[40] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, J. Turner, OpenFlow: Enabling Innovation in Campus Networks, ACM SIGCOMM Computer Communication
[50] J. J. Santanna, R. van Rijswijk-Deij, R. Hofstede, A. Sperotto, M. Wierbosch, L. Z. Granville, A. Pras, Booters - An analysis of DDoS-as-a-service attacks, in: IFIP/IEEE International Symposium on Integrated Network Management (IM), 2015, pp. 243–251. doi:10.1109/INM.2015.7140298.
[51] H. H. Jazi, H. Gonzalez, N. Stakhanova, A. A. Ghorbani, Detecting HTTP-based application layer DoS attacks on web servers in the presence of sampling, Computer Networks 121 (2017) 25–36. doi:10.1016/j.comnet.2017.03.018.
[52] B. Sangster, T. O'Connor, T. Cook, R. Fanelli, E. Dean, C. Morrell, G. J. Conti, Toward Instrumenting Network Warfare Competitions to Generate Labeled Datasets, in: Workshop on Cyber Security Experimentation and Test (CSET), 2009.
[53] R. P. Lippmann, D. J. Fried, I. Graf, J. W. Haines, K. R. Kendall, D. McClung, D. Weber, S. E. Webster, D. Wyschogrod, R. K. Cunningham, et al., Evaluating Intrusion Detection Systems: The 1998 DARPA Off-line Intrusion Detection Evaluation, in: DARPA Information Survivability Conference and Exposition (DISCEX), Vol. 2, IEEE, 2000, pp. 12–26. doi:10.1109/DISCEX.2000.821506.
[54] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, K. Das, The 1999 DARPA Off-Line Intrusion Detection Evaluation, Computer Networks 34 (4) (2000) 579–595. doi:10.1016/S1389-1286(00)00139-0.
[55] M. Alkasassbeh, G. Al-Naymat, A. Hassanat, M. Almseidin, Detecting Distributed Denial of Service Attacks Using Data Mining Techniques, International Journal of Advanced Computer Science and Applications (IJACSA) 7 (1) (2016) 436–445.
[56] R. Zuech, T. M. Khoshgoftaar, N. Seliya, M. M. Najafabadi, C. Kemp, A New Intrusion Detection Benchmarking System, in: International Florida Artificial Intelligence Research Society Conference (FLAIRS), AAAI Press, 2015, pp. 252–256.
[57] S. Saad, I. Traore, A. Ghorbani, B. Sayed, D. Zhao, W. Lu, J. Felix, P. Hakimian, Detecting P2P Botnets through Network Behavior Analysis and Machine Learning, in: International Conference on Privacy, Security and Trust (PST), IEEE, 2011, pp. 174–180. doi:10.1109/PST.2011.5971980.
[58] A. D. Kent, Comprehensive, Multi-Source Cyber-Security Events, Los Alamos National Laboratory (2015). doi:10.17021/1179829.
[59] A. D. Kent, Cybersecurity Data Sources for Dynamic Network Research, in: Dynamic Networks in Cybersecurity, Imperial College Press, 2015, pp. 37–65. doi:10.1142/9781786340757_0002.
[60] J. Song, H. Takakura, Y. Okabe, M. Eto, D. Inoue, K. Nakao, Statistical
Review 38 (2) (2008) 69–74. doi:10.1145/1355734.1355746. Analysis of Honeypot Data and Building of Kyoto 2006+ Dataset for
[41] F. Haddadi, A. N. Zincir-Heywood, Benchmarking the Effect of Flow NIDS Evaluation, in: Workshop on Building Analysis Datasets and
Exporters and Protocol Filters on Botnet Traffic Classification, IEEE Gathering Experience Returns for Security, ACM, 2011, pp. 29–36.
Systems Journal 10 (4) (2016) 1390–1401. doi:10.1109/JSYST. doi:10.1145/1978672.1978676.
2014.2364743. [61] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, B. Tierney, A First
[42] S. Stolfo, (Date last accessed 22-June-2018). [link]. Look at Modern Enterprise Traffic, in: ACM SIGCOMM Conference
URL http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html on Internet Measurement (IMC), USENIX Association, Berkeley, CA,
[43] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, USA, 2005, pp. 15–28.
M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, [62] F. Beer, T. Hofer, D. Karimi, U. Bühler, A new Attack Composition
P. E. Bourne, et al., The FAIR Guiding Principles for scientific data for Network Security, in: 10. DFN-Forum Kommunikationstechnologien,
management and stewardship, Scientific Data 3. doi:10.1038/ Gesellschaft für Informatik eV, 2017, pp. 11–20.
sdata.2016.18. [63] M. Tavallaee, E. Bagheri, W. Lu, A. A. Ghorbani, A detailed analysis
of the KDD CUP 99 data set, in: IEEE Symposium on Computational
[44] A. J. Aviv, A. Haeberlen, Challenges in Experimenting with Botnet
Intelligence for Security and Defense Applications, 2009, pp. 1–6. doi:
Detection Systems, in: Conference on Cyber Security Experimentation
10.1109/CISDA.2009.5356528.
and Test (CEST), USENIX Association, Berkeley, CA, USA, 2011.
[64] R. Singh, H. Kumar, R. Singla, A Reference Dataset for Network
[45] Z. B. Celik, J. Raghuram, G. Kesidis, D. J. Miller, Salting Public Traces Traffic Activity Based Intrusion Detection System, International Journal
with Attack Traffic to Test Flow Classifiers, in: Workshop on Cyber of Computers Communications & Control 10 (3) (2015) 390–402.
Security Experimentation and Test (CSET), 2011. doi:10.15837/ijccc.2015.3.1924.
[46] H. He, E. A. Garcia, Learning from Imbalanced Data, IEEE Transactions [65] R. Sharma, R. Singla, A. Guleria, A New Labeled Flow-based DNS
on Knowledge and Data Engineering 21 (9) (2009) 1263–1284. doi: Dataset for Anomaly Detection: PUF Dataset, Procedia Computer Sci-
10.1109/TKDE.2008.239. ence 132 (2018) 1458–1466, international Conference on Computational
[47] A. R. Vasudevan, E. Harshini, S. Selvakumar, SSENet-2011: A Network Intelligence and Data Science. doi:10.1016/j.procs.2018.05.
Intrusion Detection System dataset and its comparison with KDD CUP 079.
99 dataset, in: Second Asian Himalayas International Conference on [66] S. Bhattacharya, S. Selvakumar, SSENet-2014 Dataset: A Dataset for
Internet (AH-ICI), 2011, pp. 1–5. doi:10.1109/AHICI.2011. Detection of Multiconnection Attacks, in: International Conference on
6113948. Eco-friendly Computing and Communication Systems (ICECCS), IEEE,
[48] S. Anwar, J. Mohamad Zain, M. F. Zolkipli, Z. Inayat, S. Khan, 2014, pp. 121–126. doi:10.1109/Eco-friendly.2014.100.
B. Anthony, V. Chang, From Intrusion Detection to an Intrusion Re- [67] R. Hofstede, L. Hendriks, A. Sperotto, A. Pras, SSH compromise detec-
sponse System: Fundamentals, Requirements, and Future Directions, tion using NetFlow/IPFIX, ACM SIGCOMM Computer Communication
Algorithms 10 (2) (2017) 39. Review 44 (5) (2014) 20–26. doi:10.1145/2677046.2677050.
[49] C. Kolias, G. Kambourakis, A. Stavrou, S. Gritzalis, Intrusion Detection [68] E. K. Viegas, A. O. Santin, L. S. Oliveira, Toward a reliable anomaly-
in 802.11 Networks: Empirical Evaluation of Threats and a Public based intrusion detection in real-world environments, Computer Net-
Dataset, IEEE Communications Surveys Tutorials 18 (1) (2016) 184– works 127 (2017) 200–216. doi:10.1016/j.comnet.2017.08.
208. doi:10.1109/COMST.2015.2402161. 013.
[69] P. Gogoi, M. H. Bhuyan, D. Bhattacharyya, J. K. Kalita, Packet and Flow Based Network Intrusion Dataset, in: International Conference on Contemporary Computing, Springer, 2012, pp. 322–334. doi:10.1007/978-3-642-32129-0_34.
[70] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Towards Generating Real-life Datasets for Network Intrusion Detection, International Journal of Network Security (IJNS) 17 (6) (2015) 683–701.
[71] A. Sperotto, R. Sadre, F. Van Vliet, A. Pras, A Labeled Data Set for Flow-Based Intrusion Detection, in: International Workshop on IP Operations and Management, Springer, 2009, pp. 39–50. doi:10.1007/978-3-642-04968-2_4.
[72] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, et al., GT: Picking up the Truth from the Ground for Internet Traffic, ACM SIGCOMM Computer Communication Review 39 (5) (2009) 12–18. doi:10.1145/1629607.1629610.
[73] M. J. Turcotte, A. D. Kent, C. Hash, Unified Host and Network Data Set, arXiv preprint arXiv:1708.07518.
[74] CYBER Systems and Technology, (Date last accessed 22-June-2018). URL https://www.ll.mit.edu/ideval/docs/index.html
[75] J. McHugh, Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations As Performed by Lincoln Laboratory, ACM Transactions on Information and System Security (TISSEC) 3 (4) (2000) 262–294. doi:10.1145/382912.382923.
[76] G. Szabó, D. Orincsay, S. Malomsoky, I. Szabó, On the Validation of Traffic Classification Algorithms, in: International Conference on Passive and Active Network Measurement, Springer, 2008, pp. 72–81. doi:10.1007/978-3-540-79232-1_8.
[77] R. Fontugne, P. Borgnat, P. Abry, K. Fukuda, MAWILab: Combining Diverse Anomaly Detectors for Automated Anomaly Labeling and Performance Benchmarking, in: International Conference on emerging Networking EXperiments and Technologies (CoNEXT), ACM, New York, NY, USA, 2010, pp. 8:1–8:12. doi:10.1145/1921168.1921179.
[78] M. Hatada, M. Akiyama, T. Matsuki, T. Kasama, Empowering Anti-malware Research in Japan by Sharing the MWS Datasets, Journal of Information Processing 23 (5) (2015) 579–588. doi:10.2197/ipsjjip.23.579.
[79] D. Brauckhoff, A. Wagner, M. May, FLAME: A Flow-Level Anomaly Modeling Engine, in: Workshop on Cyber Security Experimentation and Test (CSET), USENIX Association, 2008, pp. 1:1–1:6.
[80] E. Vasilomanolakis, C. G. Cordero, N. Milanov, M. Mühlhäuser, Towards the creation of synthetic, yet realistic, intrusion detection datasets, in: IEEE Network Operations and Management Symposium (NOMS), IEEE, 2016, pp. 1209–1214. doi:10.1109/NOMS.2016.7502989.
[81] P. Siska, M. P. Stoecklin, A. Kind, T. Braun, A Flow Trace Generator using Graph-based Traffic Classification Techniques, in: International Wireless Communications and Mobile Computing Conference (IWCMC), ACM, 2010, pp. 457–462. doi:10.1145/1815396.1815503.
[82] M. Ring, D. Schlör, D. Landes, A. Hotho, Flow-based Network Traffic Generation using Generative Adversarial Networks, Computers & Security 82 (2019) 156–172. doi:10.1016/j.cose.2018.12.012.
[83] F. Erlacher, F. Dressler, How to Test an IDS?: GENESIDS: An Automated System for Generating Attack Traffic, in: Workshop on Traffic Measurements for Cybersecurity (WTMC), ACM, New York, NY, USA, 2018, pp. 46–51. doi:10.1145/3229598.3229601.
[84] S. Molnár, P. Megyesi, G. Szabo, How to Validate Traffic Generators?, in: IEEE International Conference on Communications Workshops (ICC), IEEE, 2013, pp. 1340–1344. doi:10.1109/ICCW.2013.6649445.
[85] G. Brogi, V. V. T. Tong, Sharing and replaying attack scenarios with Moirai, in: RESSI 2017: Rendez-vous de la Recherche et de l'Enseignement de la Sécurité des Systèmes d'Information, 2017.
[86] N. Rajasinghe, J. Samarabandu, X. Wang, INSecS-DCS: A Highly Customizable Network Intrusion Dataset Creation Framework, in: Canadian Conference on Electrical & Computer Engineering (CCECE), IEEE, 2018, pp. 1–4. doi:10.1109/CCECE.2018.8447661.
[87] F. J. Aparicio-Navarro, K. G. Kyriakopoulos, D. J. Parish, Automatic Dataset Labelling and Feature Selection for Intrusion Detection Systems, in: IEEE Military Communications Conference (MILCOM), IEEE, 2014, pp. 46–51. doi:10.1109/MILCOM.2014.17.
[88] R. Hofstede, A. Pras, A. Sperotto, G. D. Rodosek, Flow-Based Compromise Detection: Lessons Learned, IEEE Security & Privacy 16 (1) (2018) 82–89. doi:10.1109/MSP.2018.1331021.
[89] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2011.
[90] M. Cermak, T. Jirsik, P. Velan, J. Komarkova, S. Spacek, M. Drasar, T. Plesnik, Towards Provable Network Traffic Measurement and Analysis via Semi-Labeled Trace Datasets, in: Network Traffic Measurement and Analysis Conference (TMA), IEEE, 2018, pp. 1–8. doi:10.23919/TMA.2018.8506498.
[91] G. Wang, J. Hao, J. Ma, L. Huang, A new approach to intrusion detection using Artificial Neural Networks and fuzzy clustering, Expert Systems with Applications 37 (9) (2010) 6225–6232. doi:10.1016/j.eswa.2010.02.102.
[92] J. Zhang, M. Zulkernine, A. Haque, Random-Forests-Based Network Intrusion Detection Systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (5) (2008) 649–659. doi:10.1109/TSMCC.2008.923876.
[93] M. V. Mahoney, Network Traffic Anomaly Detection based on Packet Bytes, in: ACM Symposium on Applied Computing, ACM, 2003, pp. 346–350. doi:10.1145/952532.952601.
[94] C. M. Inacio, B. Trammell, YAF: Yet Another Flowmeter, in: Large Installation System Administration Conference, 2010, pp. 107–118.
[95] J. Xu, J. Fan, M. H. Ammar, S. B. Moon, Prefix-Preserving IP Address Anonymization: Measurement-based security evaluation and a new cryptography-based Scheme, in: IEEE International Conference on Network Protocols, IEEE, 2002, pp. 280–289. doi:10.1109/ICNP.2002.1181415.
[96] R. Pang, M. Allman, V. Paxson, J. Lee, The Devil and Packet Trace Anonymization, ACM SIGCOMM Computer Communication Review 36 (1) (2006) 29–38. doi:10.1145/1111322.1111330.
[97] D. J. Kelly, R. A. Raines, M. R. Grimaila, R. O. Baldwin, B. E. Mullins, A Survey of State-of-the-Art in Anonymity Metrics, in: ACM Workshop on Network Data Anonymization, ACM, 2008, pp. 31–40. doi:10.1145/1456441.1456453.