

Data-Driven Cybersecurity Incident Prediction: A Survey

Nan Sun, Jun Zhang, Senior Member, IEEE, Paul Rimba, Shang Gao, Leo Yu Zhang, Member, IEEE, and Yang Xiang, Senior Member, IEEE

Abstract—Driven by the increasing scale of public data related to high-profile cybersecurity incidents, recent years have witnessed a paradigm shift in understanding and defending against evolving cyber threats, from primarily reactive detection toward proactive prediction. Meanwhile, governments, businesses, and individual Internet users show a growing appetite to improve cyber resilience, which refers to their ability to prepare for, combat, and recover from cyber threats and incidents. Undoubtedly, predicting cybersecurity incidents is deemed to have excellent potential for proactively advancing cyber resilience. Research communities and industries have begun proposing cybersecurity incident prediction schemes that utilize different types of data sources, including organizations' reports and datasets, network data, synthetic data, data crawled from webpages, and data retrieved from social media. With a focus on the datasets, this survey paper investigates the emerging research by reviewing recent representative works. We also extract and summarize the data-driven research methodology commonly adopted in this fast-growing area. In consonance with the phases of the methodology, each work that predicts cybersecurity incidents is comprehensively studied. Challenges and future directions in this field are also discussed.

Index Terms—Cybersecurity incidents, data mining, data-driven, discovery, machine learning, prediction.

Manuscript received May 20, 2018; revised September 29, 2018; accepted December 2, 2018. Date of publication December 7, 2018; date of current version May 31, 2019. This paper was supported in part by the Australian Research Council Discovery Project under Grant DP150103732 and in part by the National Natural Science Foundation of China under Grant 61772405. (Corresponding author: Jun Zhang.)
N. Sun, S. Gao, and L. Y. Zhang are with the School of Information Technology, Deakin University (Waurn Ponds Campus), Geelong, VIC 3216, Australia.
J. Zhang is with the School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia (e-mail: [email protected]).
P. Rimba is with Data61, CSIRO, Sydney, NSW 2015, Australia.
Y. Xiang is with the School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia, and also with the State Key Laboratory of Integrated Service Networks, Xidian University, Xi'an, 710071 China.
Digital Object Identifier 10.1109/COMST.2018.2885561

I. INTRODUCTION

CYBERSECURITY incidents, as an ever-present threat to organizations, governments, and enterprises, are increasing in frequency, scale, sophistication and severity [1]. Like natural disasters (e.g., hurricanes, floods, and earthquakes) and human-made disasters (e.g., military nuclear accidents and financial crashes), cybersecurity incidents that involve extreme events can lead to unintended consequences or even catastrophic damage [2]. A cybersecurity incident can be defined as an event whereby an intruder employs a tool to implement an action that exploits a vulnerability on an objective, producing an unauthorized outcome that satisfies the attacker's intentions [3]. According to the latest Australian Cyber Security Centre (ACSC) survey [4], 90% of organizations suffered some form of attempted or successful cybersecurity compromise during the 2015-2016 financial year. Furthermore, over half of the organizations (58%) experienced at least one cybersecurity incident that successfully compromised their data or system [4].

Under the urgent threat of cybersecurity incidents, the good news is that research communities and industries are well aware of these threats and show great foresight in building cyber resilience. The 2018 Annual Cyber Security report [5] published by Cisco indicates that organizations and enterprises have implemented cyber-awareness programmes, including outsourcing services to strengthen defenses against cybersecurity incidents [6]. For instance, 49% of global respondents outsourced monitoring services as part of their cyber preparation strategy in 2017, compared with 44% in 2015. If incidents could be predicted in advance, governments, businesses, Information and Communication Technology (ICT) providers, and even individual users could be protected from the resulting damage. Hence, proactively predicting cybersecurity incidents is an important and immediate problem that demands to be solved. That is to say, cybersecurity incident prediction is an area of research in its exciting early development.

At present, no system can be considered invulnerable. The security community has begun to realize that the current evolution is a never-ending competition between attackers and defenders. Therefore, it is significant to foreshadow threats in advance, which prioritizes recommendations to organizations and as such reduces damage from various kinds of cyber attacks. To date, numerous works have extensively studied various aspects of cybersecurity incidents and threats. They focus on analysis, detection, and prevention. However, few have exhibited prediction schemes that can provide proactive measures to avoid the damage. These works utilized website features [7], [8], incident reports [9], log files [10] or other security postures to predict cybersecurity incidents. The corresponding techniques are focused on machine learning (ML), data mining (DM), deep learning, and graph mining, which are specifically illustrated in Sections II and III. A representative result is that the RiskTeller
system proposed in [10] can achieve a 96% true positive rate (TPR, the proportion of actual positives that are correctly identified as such) for predictions at a machine-level granularity. The results of these studies are promising and suggest the possibility of accurately anticipating cybersecurity incidents, which is crucial to strengthening cyber resilience.

Furthermore, faced with the constant stream of news on cybersecurity incidents, industry also actively seeks proactive defenses. Usually, businesses adopt multiple layers of security protection to prioritize the entities at high risk of being attacked and to minimize the damage caused by incidents. As we observed, there are two kinds of businesses dedicated to cybersecurity through deploying prediction models: one helps companies find and stop cyber attacks before they cause harm, and the other offers cyber insurance after risk assessment. Both have been developing steadily in recent years. For instance, Chronicle [11] is a new business devoted to cybersecurity that utilizes machine learning and cloud computing techniques. BizCover [12] is an insurance company that will cover expenses arising from cybersecurity incidents.

There are several survey papers devoted to investigating different kinds of security threats and attempting to mitigate damage caused by security incidents reactively. Jang-Jaccard and Nepal [13] studied existing security vulnerabilities and critically analyzed the mitigation techniques in their survey paper. Liu et al.'s survey [14] was the latest one that identified cyber insider threats and looked into a large number of systems and schemes against insider threats. Moreover, Buczak and Guven [15] focused on data mining and machine learning methods for cybersecurity intrusion detection. They provided thorough descriptions of the ML/DM methods and their applications in cybersecurity. All of these works contribute to the understanding of emerging cyber threats and supply comprehensive reactive detection guides. However, when a cybersecurity threat is detected, there is a high possibility that severe damage has already been caused, such as data leakage, financial losses, and even reputation damage [4]. Proactively predicting cyber incidents based on observed indicators of cybersecurity threats can fill this gap, which motivates us to perform a literature review of existing cybersecurity incident prediction work.

Different from existing works that focus on detection, the scope of this survey paper mainly emphasizes cybersecurity incident prediction. To make the distinction clear, one may think of detection versus prediction as analogous to diagnosing the sickness of a patient (e.g., by applying a biopsy) versus predicting whether a currently healthy person will suffer from a specific disease in the future (e.g., by using genetic testing) [9]. In more detail, detection usually leverages identified features of a target to be detected, while prediction leans on factors believed to connect with the prediction objective [9]. From the consequences and applications point of view, detection enables threats to be detected and mitigated, while prediction identifies the riskiest parts of a given system, acts proactively, and provides the administrator with a vulnerability index used for defending and hardening their network. Furthermore, discovering the unknown is also defined as a kind of prediction in this survey. Compared to time-based prediction, discovery has a more general goal, which utilizes large amounts of data to extract previously unknown or potentially useful new knowledge [16].

In a nutshell, this survey is based on 19 core papers which utilize data from different domains to forecast and discover cyber incidents. Paper selection focuses on English publications that meet specified inclusion criteria. Queries on four top cybersecurity conferences were performed using "predict" and "incident", "forecast" and "incident", and "discover" and "incident". The four conferences, namely ACM Computer and Communications Security (CCS), the IEEE Symposium on Security and Privacy (S&P), the Network and Distributed System Security Symposium (NDSS) and the USENIX Security Symposium, are deemed the top four cybersecurity conferences by security research communities. However, it was recognized that this emphasis might overlook papers that appeared in other conferences or journals. Based on the collected papers, we also distilled the references of these papers and those that referenced the collected papers. To catch up with the trend of cyber incident prediction, as well as to cover new and emerging datasets, we try our best to include representative works published in recent years in this survey. In short, this survey is intended to introduce this newly rising topic to readers and to attract more readers to begin research in this field. For this aim, great emphasis is placed on a thorough description of existing work, and references to the primary datasets for each work are provided.

When trying to predict cybersecurity incidents, it should be understood that data plays the crucial role in the process of analyzing cyber threats, modeling prediction problems and discovering security incidents. Driven by more and more publicly available data, predicting security trends and discovering indicators of cyber incidents seem much more feasible than ever before. From the collected literature, the data sources can be categorized as follows: (1) Organization reports and datasets: some organizations regularly update their datasets or reports to publish security-related information. For example, the VERIS Community Database (VCDB) [17] records security incidents in a common format. Important examples that leverage organization reports or datasets as their data source include [9], [18], [19]. (2) Executables datasets: executable code, files or programs that can be run by a computer serve as the dataset in [20]–[22]. (3) Network datasets: network datasets typically record the structure, properties, traffic or symptoms of a network. Specifically, there are four kinds of network datasets in this survey: log files [23], network mismanagement symptoms [24], temporal networks from different domains [25] and network traffic [26]. (4) Synthetic datasets: synthetic data is generated according to specific needs and under certain conditions. In [25] and [27], the prediction model was established and evaluated using synthetic data. (5) Webpage data: Web contents crawled from webpages can also be used as a data source, as shown in [7], [8], and [28]. (6) Social media data: social media is a platform that covers up-to-date insights and information from users around the world. Existing studies in [29]–[33] collected data from tweets, articles and reviews. (7) Mixed-type datasets: some examples in [25], [32], and [33] made use of two or

more of the above data sources as mixed-type datasets to collect groundtruth, set up models and conduct validation. Corresponding to the above seven types of data sources, we emphasize a thorough description of the datasets and provide references to seminal work for each dataset.

Our contributions:
• The principal contribution is the collection and investigation of state-of-the-art cybersecurity incident prediction schemes, methods and datasets, which highlights the existing work in this field.
• The collected works are newly classified into seven categories according to the utilized datasets: organization reports and datasets, executables datasets, network datasets, synthetic datasets, webpage data, social media data, and mixed-type data. The datasets used in each work are identified and referenced in minute detail and depicted in the form of tables for clarification.
• The modeling methodology commonly adopted for cybersecurity incident prediction is summarized. For each phase of the methodology, challenges and future directions are discussed.

This survey is organized as follows. Firstly, Section II summarizes the overview of cybersecurity incident prediction, including the definition of cybersecurity incident and the research methodology. Section III presents a detailed view of existing work in the area of cybersecurity incident prediction according to our categorized datasets. In line with the research methodology, Section IV discusses the challenges and future directions in this area. Finally, Section V concludes the paper.

II. OVERVIEW OF CYBERSECURITY INCIDENT PREDICTION

In this section, an overview of cybersecurity incident prediction is given in two parts: the definition of cybersecurity incident and the research methodology for prediction. The definition aims to clarify the meaning of cybersecurity incident prediction as well as to determine the scope of this survey. Moreover, the definition contributes to the explanation of the methodology deployed in this field, providing a roadmap for researchers who want to engage in related fields.

A. Cybersecurity Incident Definition

There are many definitions of the term "cybersecurity incident" in the literature. This raises the challenge of giving a standard definition and taxonomy for describing incidents and limiting the scope of the survey paper. Particularly, the definition of "incident" varies from team to team and from project to project. Sample definitions in the literature are as follows: (1) a general definition for a cybersecurity incident might be "any real or suspected adverse event in relation to the security of computer systems or computer networks" [34]; (2) the Australian Computer Emergency Response Team (AusCERT) [35] defined an incident as "any type of computer network attack, computer-related crime, and the misuse or abuse of network resources or access"; (3) the SANS Institute [36] and the Department of the Navy [37] described an incident as "an adverse event in an information system and/or network, or the threat of the occurrence of such an event" in their incident response guidebooks; (4) the Computer Security Incident Response Team (CSIRT) [38] defined an incident as "unauthorized activity against a computer or network that results in a violation of a security policy". Based on the above definitions, one can conclude that although there are many definitions of "incidents", all of them show substantial similarities. Hence, in this survey, we define a cybersecurity incident as an action/event/situation/collection of data with bad intentions which leads to threats or damage to cybersecurity.

Furthermore, the term "cybersecurity incident" has been defined with appropriate taxonomies in [39]–[41]. A list of single, defined terms is a popularly adopted taxonomy, being a simple and clear method. David and Karl [39] used a list that includes 24 terms to define a cybersecurity incident, such as "Viruses and Worms", "Unauthorized Data Copying", "Logic Bombs" and "Denial-of-Service". Although this classification is easy to implement, to cover all types of cyber incidents the list needs to contain encyclopedic volumes of defined terms. Besides, the definitions of some specific terms are hard to accept. Consequently, some literature employed a list of categories to define cyber incidents [40], [41]. According to the Computer Security Incident Handling Guide published by the U.S. National Institute of Standards and Technology (NIST) [40], incidents are tagged into four types: Denial of Service (DoS), malicious code, unauthorized access, or inappropriate usage. Furthermore, some taxonomies focus on the action of the cybersecurity incident [42] or the result triggered by the incidents [43].

Inspired by the above cyber incident taxonomies, we organize the incidents from the collected literature into our proposed categorization method, as shown in Figure 1, to better analyze and define cybersecurity incidents, as well as to clarify the scope of incidents. The first hierarchy of indexes in our categorization adopts the NIST cyber incident taxonomy that tags each incident as either inappropriate usage, DoS, malicious code or unauthorized access. Furthermore, we categorize the incident of each reviewed work in the manner of the "list of terms" by David and Karl [39], as well as providing references to the relevant paper.

B. Research Methodology

The methodology shown in Figure 2 illustrates the common steps to predict and discover cybersecurity incidents. The model is composed of six steps, which constitute a circle as a continuous and incremental process: (1) cybersecurity incident analysis; (2) security problem modeling; (3) data collection and processing; (4) feature engineering/representation learning; (5) model customization; (6) evaluation.

1) Cybersecurity Incident Analysis: The first step is to "know the enemy". In the face of overwhelming cybersecurity incidents, the first thing we have to do is to target one or more specific cybersecurity incident types. In order to achieve the goal of predicting and discovering cybersecurity incidents, many efforts have been devoted to finding more indicators that may relate to incidents by comprehensively analyzing an incident. Moreover, fully understanding a cybersecurity incident is helpful to see some aspect of the problem at which to chip away.

Fig. 1. Reviewed cybersecurity incidents categories.

Fig. 2. Methodology of data-driven cybersecurity incident prediction.

As referred to in Section II-A, a cybersecurity incident is, in the general definition, an activity with evil intentions which results in threats or damage to cybersecurity. Depending on their perspectives, researchers discerned and analyzed cybersecurity incidents in distinct ways, as shown in Figure 1. In this step, we provide ways to analyze cybersecurity incidents. Generally, a type of incident can first be compared and contrasted with similar events. From the technical level, the techniques, tools, and methods utilized to achieve the attack are considered nontrivial for analyzing a cybersecurity incident. Besides, the infrastructure, environment and occurrence time of the incidents are also worth analyzing. Last but not least, the actors involved in the incident may affect the development of incidents, which can be potential factors for the prediction and discovery objective [34], [38]. During the process of analysis, a small number of samples may be required to extrapolate the entire incident and reflect the details of the case.

2) Security Problem Modeling: After profoundly analyzing the target cybersecurity incidents, the research problem shaped by the project requirements should be determined in this step. Roughly speaking, there are two ways of thinking about how to define a cybersecurity incident prediction or discovery problem: one way is to design a method that ultimately aims to apply to all kinds of security incidents to predict/discover security incidents and related information. Such a method can demonstrate its feasibility by choosing one or more specific incident types to conduct a proof-of-concept. The other way is to solve a particular prediction or discovery problem investigated in the process of analyzing a kind of cybersecurity incident, such as predicting whether a currently benign website will become malicious [7] or discovering black keywords used by the online underground economy [28].

On the basis of the research problem, a model to solve the problem will be set up. Several samples can be applied to validate the feasibility and practicality of the implementation alternatives, to help determine whether the idea is worth taking on to the next stage.
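As a concrete (and deliberately simple) illustration of such a framing, the following is a minimal, hypothetical sketch that casts the question "will this entity experience an incident?" as a binary classification task in the spirit of the proof-of-concept studies discussed above. The file name, column names and features are placeholders, not the inputs of any reviewed system.

```python
# Hypothetical illustration: casting incident prediction as binary classification.
# The CSV file, column names, and features are placeholders, not taken from
# any of the reviewed systems.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Each row describes one organization; "victim" is 1 if it later
# experienced a reported incident, 0 otherwise.
data = pd.read_csv("org_security_posture.csv")
X = data.drop(columns=["org_id", "victim"])   # externally observable posture features
y = data["victim"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```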


Hereon, we briefly introduce the modeling approaches and techniques related to cybersecurity incident prediction. Typically, techniques in support of establishing security models focus on ML and DM. As ML and DM often employ the same methods, there is considerable overlap between the two terms [15]. Regarding cybersecurity incident prediction and discovery, ML concentrates on prediction based on the knowledge learned from the training data. The task of predicting to which of two groups an element in a given dataset belongs can be cast as a binary classification problem, using existing techniques such as those proposed in [9] and [18]. Similarly, where more than two labels can be assigned to each observation, multi-label classification is applied to solve a specific problem, as in [30]. Furthermore, when the prediction problem is designed to explore the relationship between variables, regression is an elective solution, as proposed in [27]. On the other side, DM focuses on discovering previously unknown knowledge in the datasets. To put it in practical terms, the task of grouping a batch of objects such that objects in the same group are more similar to each other than to those in other groups is typically considered a clustering problem [8], [31].

Besides ML/DM, other techniques should be considered when setting up a model and dealing with specific issues. For instance, as security incidents are normally recorded in natural language, Natural Language Processing (NLP) is extensively used to process data, as in [24], [28], and [31]–[33]. In addition, some work utilizes statistical and graph mining [25] approaches to achieve its goals. Note that we delay the detailed analysis of how the security problems of the selected works are modeled to Section III. As an introduction, Table II summarizes the research problems and the techniques applied to establish the models in the reviewed works.

3) Data Collection and Processing: After the above two steps, we need to obtain sufficient data. Predicting cybersecurity incidents is firmly bound up with data. Collecting data is a critical step, which forms a connecting link between the preceding and following steps in this methodology. The quality and quantity of data decide the feasibility of solving the research problem proposed in the last step. Also, data can serve as the source for setting up groundtruth and affect the performance of the prediction model. We have witnessed an enormous increase in both the scope and variety of security-related datasets in recent years, which provides us with valuable resources to predict and discover cybersecurity incidents.

So how can one collect valuable data that meets specific needs from the large volumes of different kinds of data sources? The general steps for managing big data begin with gathering data from diverse data sources according to the research problem and project purpose. After collecting the vast amount of raw data, we should consider how to store these data to perform further processing. Physical infrastructure, as well as cloud storage services, is usually required to put the data into appropriate databases or storage services [44]. The third step is to organize and process data before knowledge discovery from databases. That includes cleaning up noisy data, mapping data sources to each other, merging data and converting data to structured formats. High-quality labels for data are necessary if the prediction model is built by supervised learning, which is a process of learning a function that maps an input to an output based on example input-output pairs [45]. Some data may need to be labeled by specialists and experts.

Last but not least, by investigating previous work on data collection, we have some insights on data collection for cybersecurity incident prediction. We find that certain kinds of data are easy to access, while a few others are difficult for researchers to obtain. To be specific, data published on the Web are usually available for Web crawling. For instance, we can readily obtain vulnerability records from the National Vulnerability Database (NVD) [46] and find historical incident reports in the Verizon annual Data Breach Investigations Reports (DBIR) [47]. Also, by leveraging APIs provided by various kinds of social media, such as Twitter, we can mine social media data. However, some data are relatively difficult to acquire, especially when privacy information is involved. A case example is that of the log files utilized in the work of [10].

4) Feature Engineering/Representation Learning: The fourth crucial step of the methodology is to extract features from the collected data, which is not only critical to achieving ideal prediction results but also fundamental to the application of machine learning. The performance of ML methods is heavily contingent on the selection of features or the representation of the data to which they are applied [48]. Feature engineering refers to a process of relating domain knowledge of data to manually create features, and thus relies heavily on specific domain knowledge. Feature engineering typically begins with brainstorming by investigating the data as well as considering the research problem. Sometimes, we can also use the experience of other literature or projects. In general, both automatic feature extraction algorithms and manual construction methods serve for devising features.

However, sometimes coming up with appropriate features is challenging, time-consuming, and dependent on expert knowledge [49]. On the other hand, real-world data, such as image, video and other sensory data, are hard to turn into specific features by traditional feature engineering methods. Representation learning, as an alternative to feature engineering, attempts to recognize and disentangle the potential explanatory factors buried in the data [48]. In other words, learning representations of the data can aid in extracting valuable information when developing classifiers or further predictors. Practically, capturing the posterior distribution of the hidden factors for the observed data is a good representation for a probabilistic model. Additionally, an efficient representation is beneficial as input to supervised, unsupervised and deep learning predictors, with suitable representation learning algorithms such as supervised dictionary learning [50] and principal component analysis [51].
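As one concrete, hedged example of the representation learning idea above, the sketch below projects a high-dimensional feature matrix onto its principal components before classification. The matrix is random placeholder data; 258 columns are used only to echo the feature count reported later for [9].

```python
# A minimal sketch of principal component analysis as a representation
# learning step; X is placeholder data standing in for a raw feature matrix.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 258))                # e.g., 258 raw posture features per entity

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=0.95)                   # keep components explaining 95% of variance
Z = pca.fit_transform(X_scaled)

print(Z.shape, pca.explained_variance_ratio_[:5])
# Z can then be fed to any downstream classifier in place of the raw features.
```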


It is well known that feature engineering or representation learning is a typical process in machine learning and deep learning. What should be especially considered in cybersecurity incident prediction? It is vital that the features reflect the conditions before the incidents. In other words, the features may not necessarily be directly related to the cybersecurity incident, but they must be indicators of the threat. To give an example, when Liu et al. [9] forecasted cybersecurity incidents at an organization-level granularity, they made use of features that revealed mismanagement of the network and infrastructure instead of known characteristics of the incident. Based on these special attributes of prediction, representation learning has the chance to enhance the potential for identifying indicators of cybersecurity incidents and therefore to make a successful prediction.

5) Model Customization: Generally, the prediction model is built by applying data mining and machine learning algorithms and optimizing parameters to fit the best model. Traditional ML/DM methods can achieve acceptable performance in certain domains. However, the same performance cannot be guaranteed when it comes to a specific research problem related to cybersecurity incident prediction. To solve this problem, if the traditional ML/DM algorithms are customized according to the specific research problem instead of directly using packaged tools, the model will be effectively implemented to achieve maximum data efficiency. Thus, the efficiency (e.g., running speed) and efficacy (e.g., accuracy and other performance measurements) of the predictive model can also improve significantly.

Deep learning (DL), a member of a broader family of machine learning techniques that hinges on learning data representations, is moving beyond its early breakthroughs in pattern recognition and towards innovative applications in distinct domains and industries [52]. There is a promising trend of applying DL to analyze tremendous volumes of data with high computational efficiency [53]. For instance, in the field of cybersecurity, researchers recently leveraged DL to detect spam [54], to discover vulnerabilities [22], and have introduced DL for IoT-based systems [55], [56]. Therefore, there is an opportunity to solve problems in cybersecurity incident prediction if we rationally utilize and customize DL in this field.

The ways and insights of customizing a model can be explored based on a thorough understanding of traditional ML/DM algorithms and of data structures from classical computer science. Furthermore, combined with the specific research problem, the improvements tend to be more problem specialized.

6) Evaluation: The last step is the evaluation of the model to determine whether the results meet our objective. That is, evaluating the model with appropriate metrics to verify that the research goals are reached.

In Section III, the evaluation method and metrics of each work will be described in detail. For easier understanding, we list here the definitions of the evaluation metrics used throughout the paper. The evaluation metrics are calculated from the confusion matrix, which reports False Positives (FP), False Negatives (FN), True Positives (TP) and True Negatives (TN), as shown in Table I.

TABLE I. CONFUSION MATRIX

The evaluation metrics frequently used in the reviewed work are:
• Accuracy: the percentage of correctly predicted items among the total number of items, calculated as (TP + TN)/(TP + TN + FP + FN).
• Recall: also called sensitivity or TPR. It refers to the percentage of items of class X that are correctly predicted as belonging to class X, calculated as TP/(TP + FN).
• Precision: the percentage of items correctly predicted as class X among all items classified as X, calculated as TP/(TP + FP).
• False Positive Rate (FPR): the percentage of items incorrectly classified as class X among all items that belong to a class other than X, calculated as FP/(TN + FP).
• F-measure: another measurement of accuracy combining precision and recall, calculated as 2 · Precision · Recall/(Precision + Recall).

In addition, some work utilizes a graphical plot called the Receiver Operating Characteristic (ROC) curve to illustrate the prediction ability of a binary predictor. Specifically, a ROC space is created by plotting the TPR on the y-axis against the FPR on the x-axis, which demonstrates the relative trade-offs between benefits (TP) and costs (FP).

Regularly, the FPR is a significant challenge for cybersecurity, as it hampers the effectiveness of security tools [57], [58]. For the detection problem, FPs always result in massive cost. For instance, work that relies on one specific piece of software must be interrupted if that software is detected as malware. Therefore, the goal of the detection problem is usually to maximize the TPR while keeping the FPR to a minimum. Regarding the prediction problem, the cost of FPs is more complicated to avoid, but it is lower compared to the detection problem. This also agrees with the goal of prediction: to discover all possible incidents for the implementation of proactive measures, prioritizing alerts and supplying security training in advance. For an insurance company, 20% false positives can be entirely accepted, according to the recent work [23].

By leveraging comprehensive evaluation methods and metrics, we can check whether the results are satisfactory. If the goal fails to be achieved, the circle starting from analyzing the cyber incident should restart incrementally to find a better solution.
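To make the metrics defined in this subsection concrete, the following minimal sketch computes them, together with the ROC/AUC, using scikit-learn on synthetic labels and scores; the numbers are fabricated for illustration and are not drawn from any reviewed work.

```python
# A small, self-contained illustration of the metrics defined above,
# using synthetic labels and scores (not data from any reviewed system).
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, accuracy_score, roc_curve, roc_auc_score)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])       # 1 = incident, 0 = no incident
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.35, 0.8, 0.6, 0.05, 0.65])
y_pred  = (y_score >= 0.5).astype(int)                    # threshold the predictor's score

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", accuracy_score(y_true, y_pred))       # (TP+TN)/(TP+TN+FP+FN)
print("recall   :", recall_score(y_true, y_pred))         # TP/(TP+FN)
print("precision:", precision_score(y_true, y_pred))      # TP/(TP+FP)
print("FPR      :", fp / (fp + tn))                       # FP/(FP+TN)
print("F-measure:", f1_score(y_true, y_pred))

# The ROC curve sweeps the decision threshold instead of fixing it at 0.5.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC      :", roc_auc_score(y_true, y_score))
```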


III. DATA-DRIVEN PREDICTION AND DISCOVERY

In this section, we review the significant cybersecurity incident prediction and discovery works by the category of data sources, as follows: (1) organization reports and datasets; (2) executables datasets; (3) network datasets; (4) synthetic datasets; (5) webpage data; (6) social media data; (7) mixed-type datasets.

TABLE II. OVERVIEW OF REVIEWED PAPERS ON CYBERSECURITY INCIDENT PREDICTION

Using the methodology of data-driven cyber incident prediction proposed in Section II as the main line, we review the critical points and details of each work in the following subsections. Table II introduces the overall reviewed papers concerning publication year, goals, incident types and datasets addressed in each work. Furthermore, Table IX summarizes, contrasts and compares the reviewed works following the proposed methodology step by step.

Besides the text description of the datasets employed in each work, we also produce Table III to Table VIII to summarize these datasets. This information can help research practice with understanding, repeating and improving works in cybersecurity incident prediction. Also, we provide more details about the relevant datasets in our GitHub repository,1 such as release information and some sample data, which may enlighten new researchers to find new solutions.

1 https://nansunsun.github.io/Cybersecurity-incident-prediction-and-discovery-data/

A. Organization Reports and Datasets

1) Predicting Data Breach Incidents: A data breach refers to "a security incident in which sensitive, protected or confidential data is copied, transmitted, viewed, stolen or used by an individual unauthorized to do so" [59]. It has been a severe problem in the security field for a long time, and many researchers and communities are devoted to detecting it. The fact is, however, that it is already too late when the data breach is detected, because severe damage may have already occurred. If a data breach incident can be forecasted in advance, the organization may survive, instead of suffering from, a cybersecurity incident.
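The works in this category build their groundtruth from public incident databases such as the VCDB. The following hedged sketch shows how VERIS-style records might be explored; it assumes the records are JSON files whose top-level "action" object lists the threat action categories, which should be checked against the actual VERIS schema before being relied upon.

```python
# A hedged sketch of exploring VERIS-style incident records such as those
# in the VCDB. Field names and the directory layout are assumptions, not a
# documented API; verify them against the actual VERIS schema.
import json
from collections import Counter
from pathlib import Path

action_counts = Counter()
for path in Path("vcdb/data/json").glob("*.json"):      # hypothetical local checkout
    with open(path, encoding="utf-8") as fh:
        incident = json.load(fh)
    # Each record is assumed to carry an "action" object whose keys name the
    # threat action categories (hacking, malware, error, misuse, ...).
    action_counts.update(incident.get("action", {}).keys())

for action, count in action_counts.most_common():
    print(f"{action:>15}: {count}")
```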


Liu et al. [9] presented a method to proactively predict an organization's breach incidents based on externally observed data about the organization's network symptoms. Firstly, the authors analyzed cybersecurity incidents referenced by the Verizon annual DBIR [17] and characterized the extent to which cybersecurity incidents could be predicted. Standing on these observations on cyber incidents, the authors framed the research problem as a binary prediction problem of identifying whether an organization will encounter a data breach incident in the near future, based upon externally observed Internet data about the organization instead of data from the internal workings of the organization's network.

The authors adopted machine learning methods to train and test the classifiers by utilizing organization reports and datasets, including security incident data and security posture data. On the one hand, the security incident data come from the VERIS Community Database [17], Hackmageddon [60] and the Web Hacking Incidents Database [61], and serve as groundtruth. These three datasets cover cyber incident events ranging from mid-2013 to 2014. On the other hand, the security posture is quantitatively measured in terms of the level of malicious activities from an organization and five mismanagement symptoms. The malicious activities are measured not only in their amounts but also in their dynamic behaviors, by applying various reputation blacklists, including CBL [62], SBL [63], SpamCop [64], WPBL [65], UCEPROTECT [66], SURBL [67], PhishTank [68], hpHosts [69], Darknet scanner lists, Dshield [70] and OpenBL [71]. Furthermore, the mismanagement symptoms are obtained from observations of open Recursive Resolvers, DNS Source Port Randomization, BGP misconfiguration, Untrusted HTTPS Certificates, and Open SMTP Mail Relays [24], by using databases that record and assess the organization's network. After pre-processing and mapping, 258 externally measurable features are extracted from the security posture data. Also, each organization is labeled as "victim" or "non-victim" according to the security incident reports.

There are two prediction scenarios proposed in [9], namely short-term prediction and long-term prediction. Experiments were conducted using the random forest algorithm. In addition, the training datasets are composed of a random subset of victim organizations and non-victim organizations. In the short-term prediction scenario, features are extracted from the most up-to-date period ahead of an incident. In the long-term prediction scenario, features are extracted from the periods prior to the first incident that happened in the testing dataset.

As to the evaluation of the prediction performance, besides traditional evaluation metrics (including accuracy, true positive rate, false positive rate and the ROC curve as defined in Section II), the authors analyzed the top data breach incidents in 2014 to illustrate the power of the prediction model. Their prediction model can reach a combination of 90% TPR and 10% FPR. Moreover, according to [72], the top five data breach incidents in 2014 happened at JP Morgan Chase, Sony Pictures, eBay, Home Depot and Target. The proposed prediction model accurately forecasted most of these data breach incidents, except the Target one.

Sometimes, attackers may establish a fake network with clean data (no malicious activities) but with counterfeit reported incidents, or the opposite way, to mislead the predictor. This is referred to as adversarial machine learning. In [9], Liu et al. assumed the data are uncompromisingly real. In other words, they ignored the noise and error in the dataset. A complete solution remains a direction for future study.

2) Predicting Risk Distributions Over Fine-Grained Data Breach Types: Nowadays, every business is facing various kinds of security incidents, including targeted attacks and internal errors. Once a security incident happens, the business data, involving private as well as public information, has an extremely high possibility of being leaked. Furthermore, the business will be affected not only in its assets but also in its reputation. Therefore, organizations are devoted to assigning resources to protect themselves from ever-changing security incidents. If the risk distributions can be assessed and predicted, organizations can prioritize their protection, so as to achieve more effective protection and save more resources.

Sarabi et al. [18] leveraged business details to train and test a sequence of predictors, which can help organizations prioritize preventive resource allocation. When the authors analyzed security incidents, they found that no business gravitates toward a single sort of incident. Meanwhile, they noticed that incident reports usually provide security recommendations based only on business sector information. Hence, different from the work in [9], the ultimate goal of [18] is to employ business details about an organization (a much sparser set of inputs than the features used in [9]) to predict risk distributions over fine-grained incident types, so as to provide protection resource allocation recommendations to an arbitrary organization.

For the study, the incidents that happened in 2013 and 2014, including 1729 and 592 entries respectively, were collected from the VERIS Community Database [17] as groundtruth. In order to achieve fine-grained cyber incident prediction, each incident was labeled along three fields. The first field is the type of the cyber incident, including environmental, error, hacking, malware, misuse, physical or social. The second is the actor responsible for the attack, labeled as external, internal or partner. The last is the assets compromised in the incident, containing kiosk/terminal, media, social, network, people, server and device.

The features gathered for training and testing the predictors are business details from the organization's profiles and websites, which combine information obtained from the VCDB and the Alexa Web Information Service (AWIS). The industry code, number of employees, and region of operation of the victim organization are three business profile features extracted from the VCDB. AWIS provides the organization's website and statistics information, including the traffic volume of the website, number of visitors, speed, number of pages linking to the website, and information about the organization that maintains the website (e.g., address, contact information and stock ticker symbol).

To forecast the risk of various kinds of data breach incidents, the authors designed multi-label classifiers using the random forest algorithm. Specifically, each binary classifier predicted one field of the incident signature, namely the action type, actor type or asset type, as mentioned above.
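A hypothetical sketch of this per-field setup is given below: one classifier per incident-signature field, trained on one year and tested on the next. The CSV files and column names are placeholders, not the actual VCDB/AWIS feature set.

```python
# Hypothetical sketch of a per-field, multi-label setup: one classifier per
# incident-signature field. File and column names are placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("incidents_2013.csv")
test = pd.read_csv("incidents_2014.csv")

feature_cols = ["industry_code", "employee_count", "region", "site_traffic", "site_speed"]
label_cols = ["action_type", "actor_type", "asset_type"]

models = {}
for label in label_cols:
    clf = RandomForestClassifier(n_estimators=300, random_state=0)
    clf.fit(train[feature_cols], train[label])
    models[label] = clf
    acc = (clf.predict(test[feature_cols]) == test[label]).mean()
    print(f"{label}: hold-out accuracy = {acc:.2f}")
```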


Concerning the evaluation, the predictors were trained on the 2013 incident data and tested using the 2014 data. The prediction model was evaluated in terms of the risk profiles of the company and the accuracy of the risk assessment model. The result showed that, on average, an organization can cover 90% of its incidents with 70% of the incident types.

It is worth mentioning that the prediction results seem too ambiguous to act on. That is, more practical recommendations could be provided for a security-ignorant business operator. For instance, SANS [73] contributes 20 kinds of security controls that specify operable guidance for businesses to increase their security level. Hence, transferring the risk profiles into more actionable security recommendations could be a future direction.

3) Discovering Previously Unknown Malware With Downloader Graph Analytics: Malicious software, commonly known as malware, has been a vital threat to cybersecurity for a long time. As reported by PandaLabs, 18 million new malware samples were captured in the third quarter of 2016, an average of 200,000 each day [74]. Due to human error, zero-day exploits or other factors, even the most protected and state-of-the-art system in the world can be infected. Hence, detecting malware that is previously unknown to the public as early as possible is an effective way to minimize stress, time cost and damage, as well as to defeat incidents from their very beginning. However, some malware is hard to detect using the traditional malware detection methods that focus on analyzing the content and behavior of software. Downloader graphs have the potential to provide indicators of malicious activities and to discover the vast majority of the malware download activities that may otherwise remain undetected.

Kwon et al. [19] presented a malware early-detection system based on insights from analyzing downloader graphs. The authors observed that, due to social engineering or drive-by attacks, users may download additional malware even when they are downloading benign applications. Hence, they proposed a graph-based abstraction model to describe the download activities on end hosts. Based on the abstraction model, they performed a large-scale measurement to investigate the differences in the growth patterns between malicious and benign downloader graphs. Lastly, they employed features extracted from the measurements to build a malware early-detection system.

The dataset used to build the malware early-detection system consists of malicious and benign download activity graphs. The graphs are generated by reconstructing download events obtained from anti-virus (AV) telemetry and Symantec's intrusion prevention systems (IPS). To represent the download activities, 19 million influence graphs, which the downloaders induced on 5 million real hosts, were included in the dataset. Explicitly, the nodes of a graph represent Portable Executable (PE) files (including benign downloaders and malicious downloaders), and the edges of the graph represent the download events. According to the groundtruth of malicious and benign downloaders, 15 million of the influence graphs (IGs) were labeled as benign and 0.25 million as malicious. The groundtruth was checked against three data sources, containing the downloader records from VirusTotal, the National Software Reference Library (NSRL), and supplementary groundtruth data derived from Symantec.

To explore the differences between malicious and benign downloaders' IGs, the authors conducted an extensive measurement, providing the features to be utilized by the malware early-detection system. According to the analysis, there are four apparent indicators of malicious activities: (1) IGs with a large diameter are mostly malicious. (2) IGs with slow growth rates are primarily malicious. (3) URL access patterns can be distinguished between malicious and benign downloaders. (4) Malware is more prone to download fewer files per domain. Based on these observations, 16 features (including four internal dynamic features, three domain property features, two downloader score property features, four life cycle features and three global behavior features) were calculated from the IGs.


Utilizing the features obtained from the measurements and adopting random forest classifiers, the malware early-detection system was built. To evaluate how early the detection system can detect previously unknown malicious executables, the authors defined "early detection" as "we can flag unknown executables as malicious before their first submission to VirusTotal" [19]. It is shown that this early-detection system can detect unknown malware 9.24 days earlier, on average, than the VirusTotal anti-virus products. Besides, the authors attempted to perform online detection experiments, simulating the early-detection system being employed operationally. In the experiment, the training dataset contained data collected before the year 2014, including 21,543 malicious and 21,755 benign samples. For the testing dataset, 12,299 malicious and 12,594 benign samples were gathered from the year 2014. The resulting 99.8% TPR and 1.9% FPR demonstrate the robustness of the system. Although there is a limitation that droppers with rootkit functionality would escape this technique, the method still provides a novel signal that is complementary to current anti-virus mechanisms.

As a concluding remark for this subsection, an overall description of the organization's reports and datasets is produced in Table III.

TABLE III. ORGANIZATION'S REPORTS AND DATASETS

B. Executables Datasets

1) Early-Stage Malware Prediction Using Recurrent Neural Networks: In the face of the rapidly increasing rate and number of new malware, both research communities and industries are devoted to detecting malware. Static malware analysis, derived from analyzing static code, can be conducted quickly and easily. However, static analysis is vulnerable to obfuscation techniques and performs poorly when facing entirely new malware. Behavioral data generated from file execution for dynamic malware analysis is more difficult to obfuscate, but the process of file execution takes a longer time, which means the malware may have been delivered before detection. Security protection personnel, instead of repairing damage after detection, can block malicious payloads if the prediction works before any damage has taken place. Rhode et al. [20] proposed an early-stage malware prediction method, which can leverage the first five seconds of execution behavioral data to predict whether a file is malicious or not by using a recurrent neural network (RNN).

During the data collection process, the authors first gathered 1,000 malicious and 600 benign Windows 7 executables from VirusTotal [78], as well as 800 benign samples from a fresh Windows 7 64-bit installation's system files. The authors also collected an extra 4,000 Windows 7 applications from free software sources (e.g., Softonic [79], PortableApps [80] and SourceForge [81]) to better represent the real workload of an anti-virus system. The 4,000 Windows 7 applications were labeled as "malicious" or "benign" by around 60 anti-virus engines from the VirusTotal API [78]. Finally, the dataset included 2,345 benign samples and 2,286 malicious samples.

After collecting the data, the executable samples were executed using Cuckoo Sandbox [82]. During the execution, ten sequential machine activity metrics were taken as features by employing the Python psutil library [83] and then applied as input data to the neural network. The captured activities, recorded every second during sample execution, included system CPU usage, user CPU usage, packets sent, packets received, bytes sent, bytes received, memory use, swap use, the total number of processes currently running and the maximum process ID assigned. Regarding setting up the prediction model, the architecture of the model is based on RNNs along with Gated Recurrent Units (GRUs). On one hand, the RNN has the capability of capturing the input features, processing time-series data, and capturing information that changes over time. On the other hand, GRU cells are used to speed up the training process. Furthermore, to face the rapid evolution of malware, the authors adopted a random search of the hyperparameter space to adjust the hyperparameters of the model. Finally, the customized configuration was the setting that achieved the best performance on 10-fold cross-validation over the training set.
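A hedged sketch in the spirit of this approach is given below: per-second machine activity metrics are sampled with psutil and the resulting sequences are fed to a small GRU classifier. This is not the authors' implementation; the exact metric set, sampling loop and network sizes are illustrative choices.

```python
# Hedged sketch: sample per-second machine activity metrics with psutil and
# classify the resulting sequences with a small GRU. Illustrative only; not
# the configuration reported in the reviewed work.
import time
import psutil
import numpy as np
import tensorflow as tf

def snapshot():
    """One per-second feature vector of machine activity metrics."""
    cpu = psutil.cpu_times_percent(interval=None)
    net = psutil.net_io_counters()
    pids = psutil.pids()
    return [
        cpu.system, cpu.user,
        net.packets_sent, net.packets_recv,
        net.bytes_sent, net.bytes_recv,
        psutil.virtual_memory().percent,
        psutil.swap_memory().percent,
        len(pids), max(pids),
    ]

def record(seconds=5):
    rows = []
    for _ in range(seconds):
        rows.append(snapshot())
        time.sleep(1)
    return np.array(rows, dtype=np.float32)      # shape: (seconds, 10)

# Sequence classifier: variable-length sequences of 10 metrics -> malicious probability.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 10)),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(...) would then be run on sequences captured while executing
# labeled samples in a sandbox.
```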


Secondly, because adversaries can easily tamper with the first 5 seconds of a malicious file's execution, the robustness of the prediction model should be evaluated with adversarially crafted samples. Lastly, to be of good practical value, the model should have the capability to block the malicious payload if the predictor determines that the process is malicious or suspects that it might be malicious.

2) Hidden Sensitive Operations Discovery in Android Apps: Sensitive operations that are only conducted under certain conditions in order to hide from automated runtime analysis are called Hidden Sensitive Operations (HSOs). HSOs are increasingly used by malicious mobile apps or other potentially harmful apps (PHAs) to evade detection. Finding previously unknown HSOs is critical to mitigating emerging mobile threats. However, existing static analysis approaches rely on trigger conditions and hidden behaviors that are known beforehand. Therefore, discovering unknown HSOs is invaluable for understanding the evolving trends of HSO techniques and provides insights into how to defend against incidents caused by mobile security threats, which encourages researchers to devote themselves to this area.

The branching structure of an HSO usually consists of one condition and multiple paths. Based on observations of HSO branches, Pan et al. [21] proposed a machine learning based approach that employs a set of lightweight features extracted from an app to conduct large-scale unknown HSO discovery.

There are three unique observations concerning an HSO condition, its paths, and the relations between the condition and the paths, as follows: (1) HSO trigger conditions are always only relevant to system input (time, location, screen touches, etc.), rather than the hosting app's internal input. (2) Behaviors between the two paths of an HSO are considerably different. (3) Data and semantic dependency between conditions and paths in an HSO are remarkably weak. Inspired by the above unique observations, three sets of features were extracted from trigger conditions and the corresponding paths: (1) System Input (SI) is a binary feature that indicates whether the trigger condition contains system input. System input refers to system properties (e.g., hardware traces of a mobile phone) or environment parameters (e.g., time, location, user input, etc.). (2) Activity Distance (AD) and Data Distance (DD) are features used to measure the similarity of the two paths in an HSO branch statement. Both of them were calculated by the Jaccard distance (a small illustrative sketch of these two distance features is given below). Specifically, $AD = 1 - \frac{|O_l \cap O_r|}{|O_l \cup O_r|}$, where $O_l$ and $O_r$ represent the sets of sensitive operations on the two paths of an HSO branch. DD is set to $1 - \frac{1}{2}\left(\frac{|V_l \cap V_r|}{|V_l \cup V_r|} + \frac{|F_l \cap F_r|}{|F_l \cup F_r|}\right)$, where $V_l$, $V_r$ and $F_l$, $F_r$ are respectively the sets of variables and referenced class fields of the branch statement. (3) Data Dependency (DF) and Implicit Relation (IR) are features that describe the relationship between trigger conditions and behaviors. DF refers to the ratio of the variables on a path connected to the condition through data flows; IR is the number of variables, keys, and APIs implicitly related to the condition.

A support vector machine (SVM) was chosen as the machine learning algorithm for the implementation of the approach. The evaluation was conducted by leveraging three datasets: (1) A labeled good dataset: this dataset contains 213 benign apps with non-HSO branches from Google Play that have never been flagged by VirusTotal [84]. (2) A labeled bad dataset: this dataset includes 213 PHAs, each of which has one HSO branch. (3) An unknown dataset: this dataset comprises 124,207 apps downloaded from Google Play and 214,147 collected from VirusTotal. The labeled datasets were used for training, and the unlabeled dataset was applied to testing and discovering new knowledge.

Eventually, 63,372 apps with 70,660 branches were labeled as HSO by implementing this approach. For the evaluation of this approach, the researchers in [21] randomly sampled 125 apps and manually inspected each of them. It is shown that the new approach can achieve 98% precision and over 94% coverage. Furthermore, the approach was used in a measurement study which aims to discover new knowledge about HSO. By applying the approach to 338,354 apps in the wild, the researchers presented evolving trends and discoveries on HSO activities, which contributes to more effective defense against mobile phone threats.

One point that should be affirmed is that the work in [21] is significant for understanding and defeating HSO. However, it is still preliminary work in this field, and a few limitations can be observed. For instance, the accuracy and completeness could be improved. Also, it should be validated whether the model can withstand carefully crafted HSO evasion techniques. Lastly, the few existing heavyweight techniques can be optimized to improve the scalability of the approach.

3) Code Vulnerability Discovery: Exploitable software vulnerabilities are one of the primary causes of security incidents and data breaches [85], [86]. For instance, a recently disclosed vulnerability in the Server Message Block (SMB) protocol that was exploited by the WannaCry ransomware affected a large number of users and systems worldwide. This resulted in not only financial loss but also reputation damage to the affected companies and organizations.

Efficiently discovering previously unknown vulnerabilities can be a feasible solution against potential attacks. Lin et al. [22] proposed a framework to discover vulnerabilities at function-level granularity. Besides that, by utilizing transfer representation learning based on a deep learning algorithm, the approach they proposed can be applied to within-project and cross-project vulnerability discovery.

The authors claimed that function-level vulnerability groundtruth datasets are scarce. Hence, they collected the code of six open-source projects from GitHub. The six projects are LibTIFF, LibPNG, FFmpeg, Pidgin, VLC Media Player, and Asterisk. For each project, the authors manually labeled the vulnerabilities at the function level according to the Common Vulnerabilities and Exposures (CVE) and National Vulnerability Database data repositories. Finally, they obtained 457 vulnerable functions and 32,531 non-vulnerable functions.

After setting up the groundtruth dataset, the authors needed to extract features from functions so as to discover vulnerabilities at the function level. Firstly, each function was converted to an Abstract Syntax Tree (AST) in a serialized form. All of the serialized ASTs are processed to the same length on the condition that the structural and semantic features are preserved. As usual, the second step is to craft features.
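The Activity Distance and Data Distance features defined above are plain Jaccard distances over sets extracted from the two paths of an HSO branch. The following minimal Python sketch shows how they can be computed; the function names and inputs (operation, variable and class-field sets) are illustrative assumptions rather than the implementation of [21].

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; 0 for identical sets, 1 for disjoint sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def activity_distance(ops_left, ops_right):
    """AD: Jaccard distance between the sensitive-operation sets of the two paths."""
    return jaccard_distance(set(ops_left), set(ops_right))


def data_distance(vars_left, vars_right, fields_left, fields_right):
    """DD: 1 minus the average Jaccard similarity over variables and class fields."""
    sim_vars = 1.0 - jaccard_distance(set(vars_left), set(vars_right))
    sim_fields = 1.0 - jaccard_distance(set(fields_left), set(fields_right))
    return 1.0 - 0.5 * (sim_vars + sim_fields)
```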


It is worth mentioning that conventional feature engineering, which requires specific domain knowledge, was not adopted in this framework. Instead, the authors applied a long short-term memory (LSTM) [87] based recurrent neural network (RNN) with Word2vec [88] embeddings to learn representations of programming patterns that help to differentiate between vulnerable and non-vulnerable functions. Besides, for a target project with a small number of labeled data, the same representation learning method is applied and fed into the pre-trained network as a complement. Lastly, a random forest classifier is trained by leveraging these learned representations.

The evaluation was generally based on a comparison with the traditional code metrics (CMs) feature extraction method, which is commonly adopted in finding vulnerabilities. On one hand, top-k precision was applied to measure the performance of the model. Top-k precision [89] is widely adopted in information retrieval systems; for instance, when measuring search engine performance, top-k precision indicates how much relevant information is retrieved among the top-k results. In this work [22], top-k precision is the percentage of functions that are vulnerable among the top-k fetched functions (illustrated in the short sketch below). On the other hand, they defined the Function Inspection Reduction Rate (FIRER) to measure cost reduction; in other words, it quantifies the effort saved by applying the proposed method compared with the traditional feature extraction approach. The empirical results showed that the transfer representation learning method for vulnerability discovery was more effective than the method based on CMs, in both the within-project and cross-project detection scenarios.

However, the classification is not fine-grained enough at the function level. As a future direction, code-gadget-level and statement-level classification can be considered. Also, although the imbalance problem was alleviated to a certain extent by the random forest, oversampling and undersampling methods should be considered to further address this problem.

As a concluding remark for this section, an overall description of the executables datasets is provided in Table IV.

C. Network Datasets

1) Mismanagement and Maliciousness of Networks: There is a hypothesis that mismanagement is correlated with maliciousness. Mismanaged networks are more likely to expose more attack vectors, resulting in more attackers and infected hosts [105]. Moreover, mismanaged networks have less opportunity to adopt reactive approaches to reduce the bad impact of compromise.

In other words, mismanaged networks might account for wide-ranging malicious networks and security incidents. Hence, exploring the relationship between the mismanagement and maliciousness of a network is a stepping stone to discovering security incidents. Zhang et al. [24] found a statistical correlation between the mismanagement of networks and the maliciousness of the systems. By utilizing the information drawn from mismanagement leading to maliciousness, proactive protection could be applied to prevent compromises and damages.

Mismanagement is defined as "the failure to adopt commonly accepted guidelines or policies when administrating and operating networks" in [24]. Eight network mismanagement symptoms are collected as follows:
• Open DNS recursive resolver: Open recursive queries can be exploited in an amplification attack, which poses a direct threat to the networks. The authors considered hosts supporting open DNS recursive queries as misconfigured. According to the data provided by the Open Resolver Project [90] from June 2013, there were 27 million open DNS recursive resolvers.
• DNS source port randomization: Randomizing source ports can prevent DNS cache poisoning to some extent. Source ports without randomization are considered misconfigured. Data were collected by analyzing a series of DNS queries against top-level domain (TLD) servers in February 2013. In total, 226,976 DNS resolvers that had not been patched with source port randomization were collected.
• Consistent A and PTR records: According to DNS configuration guidelines (RFC1912 [106]), every Address (A) record should have a matching Pointer (PTR) record. By utilizing the records stored in VeriSign zone files [91] and the Alexa Top 1 Million popular websites [92], 27.4 million A records without a corresponding PTR record were gathered.
• BGP misconfiguration: According to Mahajan et al. [107], 90% of announcements made in less than 24 hours are due to Border Gateway Protocol (BGP) misconfiguration. By using this heuristic, 42.4 million short-lived routes were detected in the Route Views project [93].
• Egress filtering: Networks that do not implement egress filtering are considered misconfigured. By utilizing data from the Spoofer Project [94], 7,861 netblocks without egress filtering were detected and gathered in the misconfiguration symptoms dataset.
• Untrusted HTTPS certificates: 10.2 million HTTP servers with untrusted certificates were found in the process of the ZMap network scanner project [95] scanning the HTTPS ecosystem.
• SMTP server relaying: Servers that use open mail relays are easily abused by spammers, since they do not conduct any filtering before sending messages to any destination. By using ZMap in July 2013 to investigate the popularity of open mail relays, 22,284 SMTP servers that allow open mail relays were collected.
• Publicly available out-of-band management cards: Out-of-band management cards are valuable for system administrators. However, if management cards are publicly available, devices will be riddled with vulnerabilities, posing severe security risks. Based on the above insight, 98,274 publicly accessible Intelligent Platform Management Interface (IPMI) cards were collected.
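Returning to the function-level vulnerability discovery work of Lin et al. [22] discussed above, the top-k precision metric can be computed directly from a classifier's ranked outputs. The short sketch below is illustrative only; scores and labels are hypothetical arrays of predicted scores and groundtruth vulnerability flags.

```python
import numpy as np


def top_k_precision(scores, labels, k):
    """Fraction of truly vulnerable functions among the k functions
    ranked highest by the model's predicted score."""
    order = np.argsort(scores)[::-1][:k]          # indices of the top-k scores
    return float(np.sum(np.asarray(labels)[order])) / k


# usage: scores from a classifier's predict_proba, labels with 1 = vulnerable
# print(top_k_precision(scores, labels, k=50))
```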


TABLE IV: EXECUTABLES DATASETS

Moreover, maliciousness in [24] indicates that an IP address is labeled as sending SPAM messages, hosting phishing websites, or performing malicious port scans by IP blacklists (including BRBL [96], CBL [97], SBL [98], SpamCop [99], WPBL [100], UCEPROTECT [101], SURBL [102], PhishTank [68], hpHosts [69], the Darknet Scanners list, Dshield [70] and OpenBL [71]).

Zhang et al. [24] leveraged statistical analysis methods to demonstrate the relationship between mismanagement and maliciousness. The autonomous system (AS) level was chosen as the aggregation granularity to quantify the mismanagement and maliciousness of a network. On one hand, the eight mismanagement symptoms were normalized as eight corresponding mismanagement metrics. A whole-network mismanagement metric was also considered by combining the individual symptoms into an overall metric. On the other hand, the maliciousness of an AS was normalized by the malicious IPs based on IP blacklists.

The hypothesis is that mismanagement is positively correlated with maliciousness. To prove it, firstly, Spearman's correlation was calculated between maliciousness and each mismanagement symptom. The results showed that a statistically significant positive relationship exists between all of the symptoms and the network's maliciousness. Among these, the overall mismanagement metric has the strongest correlation with the maliciousness metric, which encourages researchers to consider the overall network health instead of specific vulnerabilities or symptoms. Furthermore, the authors explored whether mismanagement will lead to maliciousness if social and economic factors are controlled. Ultimately, by using the Fast Causal Inference (FCI) algorithm [108], they found an inferred causal relationship between mismanagement and maliciousness when considering social and economic factors.

This paper [24] demonstrated that different kinds of mismanagement symptoms are profoundly correlated with a network's maliciousness and will ultimately lead to maliciousness. The security community should pay attention to networks with mismanagement symptoms in order to prevent security incidents. For example, the symptoms of mismanagement can be utilized to develop a prediction system that proactively predicts which network has a high probability of behaving maliciously, instead of waiting for it to behave maliciously in the future.

There are several limitations regarding the data in this work [24]. Firstly, not all symptoms that reveal network mismanagement are collected and observed. To be specific, besides the case of the open DNS recursive resolver, which can be maliciously used for DDoS attacks by amplification, there are other amplification attack vectors, such as UDP Memcached servers, to be considered. More generally, servers exposed to the Internet and accessible without authentication can also be observed. Secondly, the collection methodology, coverage and time frames are inconsistent, which may introduce biases. Thirdly, the data collection and processing focus on the AS level; it may be of independent interest to extend them to a more granular level in future work.

2) Discovering Zero-Day Applications in Traffic Classification Systems: Network traffic analytics, as a critical technology, is widely applied in intrusion detection, malware analysis and botnet detection, according to the recent review by Miao et al. [109]. By capturing abnormal patterns in traffic data, traffic classification is fundamental to cybersecurity and network management. Furthermore, discovering previously unknown zero-day traffic can significantly improve the accuracy of traffic classification, which is critical to improving incident response and mitigating cybersecurity incidents.

To solve the zero-day applications problem, Zhang et al. [26] proposed a Robust Traffic Classification (RTC) scheme to discover previously unknown applications in traffic classification systems by utilizing supervised and unsupervised machine learning techniques.

The RTC framework is composed of three modules, namely unknown discovery, "bag of flows" (BoF)-based traffic classification, and system update. In the unknown discovery module, a k-means based clustering algorithm is first applied to mixed unlabeled and labeled data. If a cluster does not include any prelabeled samples, this cluster is a zero-day cluster. However, this rough estimation can lead to a high false positive rate. Thus, the authors set up a multiclass random forest classifier with the various known classes and one unknown class to purify the zero-day samples. In the "bag of flows" (BoF)-based traffic classification module, Zhang et al. [110] proposed a novel classification method that leverages flow correlation [110] in real-world traffic to conduct traffic classification and aggregate the prediction results.
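The unknown-discovery step of the RTC scheme described above can be approximated in a few lines: cluster labeled and unlabeled flows together and flag clusters that contain no prelabeled samples. The sketch below uses scikit-learn's k-means as a stand-in; it is not the authors' implementation, and the marker value -1 for unlabeled flows is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans


def find_zero_day_clusters(features, labels, n_clusters=50):
    """Cluster labeled and unlabeled flows together; a cluster that contains
    no prelabeled flow (label == -1 marks unlabeled) is treated as a
    zero-day candidate, mirroring the unknown-discovery module above.

    features: ndarray of flow statistical features, labels: ndarray of ints.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    zero_day = []
    for c in range(n_clusters):
        member_labels = labels[km.labels_ == c]
        if np.all(member_labels == -1):   # no known application in this cluster
            zero_day.append(c)
    return km, zero_day
```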


Finally, the system update module was designed to learn newly identified zero-day traffic as complementary knowledge for the system, by repeating the k-means based clustering and inspecting the results manually.

The datasets used for the evaluation experiments combined four disparate Internet traffic traces, in order to minimize the effects of data bias caused by heterogeneous sampling points. Specifically, three traces were gathered from public traffic data repositories, and another one was captured by using a probe at an Australian Internet service provider. The groundtruth was set up by using previous experience and tools concerning the signatures of applications. Besides, 20 unidirectional flow statistical features were extracted to represent traffic flows, including 2 features about packets, 2 features regarding bytes, 8 features describing packet size and 8 features concerning inter-packet time.

Zhang et al. [26] designed comprehensive experiments to evaluate the performance of the RTC framework. They compared their method with four state-of-the-art traffic classification methods, which are respectively random forest [111], the BoF-based method [110], the semi-supervised method [112] and one-class SVM [113]. Accuracy and F-measure were used as evaluation metrics, and the experiments were repeated 100 times to illustrate that the results are stable. Besides, they also explored the robustness of their method when considering various training datasets, zero-day applications, and performance with vs. without the system update module. All of the results showed that the RTC framework significantly outperformed the other four methods.

Network data in computer networks is usually encapsulated in packets, each of which consists of a packet header and a packet payload. Moreover, a network flow refers to a sequence of packets that originate from a source computer and are destined for a destination. This work [110] demonstrated that traffic data is capable of supporting intrusion detection, and therefore incident prediction as well. In general, works that inspect traffic at the level of the packet payload focus on text-based protocols, while the work in [110], which focuses on the statistics of network flow information, is more generic.

3) Predicting Which Machines Are at Risk of Infection: The current situation of the cyber-threat ecosystem is that no system seems to be invulnerable. Hence, IT administrators are progressively shifting to look for proactive measures that can reduce the damage caused by cybersecurity incidents. As discussed earlier, Liu et al. [9] utilized organizations' historical incident reports to predict cyber incidents. At a finer granularity, Bilge et al. [23] employed binary file appearance logs to forecast whether an enterprise machine would be infected.

Bilge et al. [23] come from Symantec Research Labs [114], which indicates that they can comparatively readily collect data with respect to the binaries appearing on machines. In total, the data was collected from 600K machines involving 18 enterprises. Based on the analysis, 4.4 billion binary file appearance events were reported among these machines. 89 features were designed to establish the profiles of the machines, including file download statistics, vulnerability patching behavior, application download behavior, and historical threat analysis. These features were extracted from the file appearance logs of each machine, which capture the usage pattern and user behavior of a device.

Each machine was identified as "clean" or "infected" by using three different datasets, which were respectively a labeled dataset obtained from the AV company about known benign files and malware, a dataset acquired from the AV product concerning known malware, and a telemetry dataset generated by the IPS product regarding infection records. On the one hand, it is generally acknowledged that the quality of the groundtruth is vital to the quality of predictors. On the other hand, the amount of data waiting to be labeled is enormous. To address these problems, Bilge et al. [23] introduced a semi-supervised method that makes use of profile similarity with labeled machines to infer fuzzy labels for unlabeled machines.

Once the features and labels of machines were well prepared, the predictive model was established by using the random forest machine learning algorithm. Following that, a comprehensive evaluation was conducted to evaluate the predictor. As reported in [23], the prediction could reach 96% TPR with only 5% FPR, which is the best result at machine-level granularity up to now. Additionally, the semi-supervised learning method proposed to estimate labels for the unlabeled dataset has been proved to enrich the groundtruth as well as to remain consistently accurate.

As a concluding remark for this section, an overall description of the network datasets is provided in Table V.

D. Synthetic Dataset

1) Predicting the Resilience of Obfuscated Code Against Automated Attacks: Code obfuscation applies transformations to the original code with the intention of increasing the difficulty of analysis and tampering while maintaining the functionality of the program. The obfuscating code transformation technique is motivated by the need to hide the particular implementation of a program from unauthorized reverse engineering [116]. If attackers exploit the hidden information/code, not only will intellectual property be stolen, but threats and incidents may also happen. However, it is known that attackers can reverse engineer a program if given enough time and resources. Therefore, estimating the period for which an obfuscated program can withstand a given reverse engineering attack is an open challenge for software obfuscation. Banescu et al. [27] proposed a framework to predict the resilience of different obfuscating transformations against automated attacks.

Resilience is defined as a function of deobfuscator effort and programmer effort, namely, the time spent on the deobfuscation process [117]. Banescu et al. [27] proposed an approach to predict the deobfuscation time given software-relevant features. The groundtruth data was obtained by running automated attacks on the obfuscated C code and recording the deobfuscation effort, which was assessed by the execution time needed to complete an attack successfully. Ideally, the obfuscated C code should be collected from the real world as presented in [118]. However, the authors found that it was hard to obtain a sufficient amount of program data with the security checks required by the study from code sharing platforms (e.g., GitHub).
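The semi-supervised idea of Bilge et al. [23], inferring fuzzy labels for unlabeled machines from their profile similarity to labeled machines, can be sketched as a similarity-weighted label propagation. The code below is a simplified illustration under that assumption, not the procedure used in [23].

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def fuzzy_labels(unlabeled_profiles, labeled_profiles, labeled_risk, k=10):
    """Infer a fuzzy infection-risk label for each unlabeled machine as the
    similarity-weighted average risk of its k most similar labeled machines.

    unlabeled_profiles, labeled_profiles: ndarrays of 89-dimensional profiles,
    labeled_risk: ndarray of 0/1 infection labels for the labeled machines.
    """
    sim = cosine_similarity(unlabeled_profiles, labeled_profiles)
    fuzzy = np.empty(len(unlabeled_profiles))
    for i, row in enumerate(sim):
        nearest = np.argsort(row)[::-1][:k]
        weights = row[nearest]
        fuzzy[i] = np.dot(weights, labeled_risk[nearest]) / (weights.sum() + 1e-12)
    return fuzzy  # values in [0, 1]; can be used to enrich the groundtruth
```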


TABLE V: NETWORK DATASETS

Therefore, synthetic code datasets were generated for the proposed approach as a substitute solution. A C program generator was designed, and 4,608 C programs with various license checking algorithms were produced. After applying five obfuscating transformations [115] to each of the raw C programs, 23,040 synthetic obfuscated programs were created.


TABLE VI: SYNTHETIC DATASETS

Afterward, the deobfuscation process based on the symbolic execution attack was conducted by using free and open source software tools (e.g., KLEE [119], angr [120], etc.). Additionally, the time spent on completing the attack was recorded.

After setting up the groundtruth dataset, the next step is extracting features. There were 64 features extracted from the synthetic programs, including 49 features characterizing the complexity of symbolic variables and 15 program features characterizing the code. By utilizing two lightweight feature selection algorithms (Pearson correlation and variable importance, respectively), the top 15 most relevant features were selected. For prediction, the approach utilized these top 15 features to construct a regression model via several machine learning algorithms, including support vector machines (SVM), random forests (RF), genetic programming (GP) and neural networks (NNs). It can be noted that the authors directly used packages for regression algorithms in R [121] to do the statistical computing.

Lastly, the evaluation of the approach was conducted from the following two aspects: (1) the accuracy of prediction was calculated; (2) the Smart Obfuscation Engine (SObE) was proposed to combine the approach with obfuscation tools. The accuracy of predicting the execution time of symbolic-execution-based deobfuscation attacks could reach 90% for 80% of the programs in their synthetic dataset. Also, the approach could be applied in SObE, which compares the predicted deobfuscation time with the attacker's budget (sketched below). If the predicted time is longer than the attacker's budget, the obfuscated program is the output; otherwise, the developers should change the obfuscating transformation method to protect their code and withstand the attack. Although this work is significant for software protection as well as for decreasing the risk of cybersecurity attacks, real-world programs should be considered for addition to the dataset to improve the robustness of the model.

As a concluding remark for this section, an overall description of the synthetic datasets is provided in Table VI.

E. Webpage Data

1) Discovering Black Keywords Used by the Underground Economy: Yang et al. [28] developed the Keyword Detection and Expansion System (KDES) to discover black keywords used by the underground economy automatically. The underground economy is a kind of activity that transacts illegal products between buyers and merchants. Communicating online with black keywords is commonly used by the underground economy to evade outsiders. Nonetheless, black keywords are continually changing and updating. Consequently, capturing black keywords before they come to the surface is significant for predicting future online shady business and destroying the underground economy.

KDES takes advantage of blackhat search engine optimization (SEO) Web pages as a data source to extract unknown black keywords. The reasons why blackhat SEO Web pages are appropriate for discovering black keywords are two-fold: first, due to the illegal nature of the transactions, underground merchants rely on black keywords to present and promote their products; second, underground merchants tend to make use of blackhat SEO to facilitate their websites' ranking and take up the online marketplace. Based on these two reasons, the raw data collected in [28] was Web page data, including 2,733,728 SEO pages, 60,000 porn pages and 3,424 gambling pages marked as "evil" by Baidu. KDES suggests a new direction, namely big data analytics, for dealing with security problems believed difficult by traditional strategies.

There are three components involved in the KDES architecture: (1) keywords extraction; (2) keywords expansion; (3) core words identification. The keywords extraction module extracts keywords from the text inside the HTML tag a href. After removing duplicate words, a gigantic word list is left. By restricting the length of keywords and exploring the consequences resulting from the keywords, the range of keywords is greatly narrowed down. Similar keywords are added to the keywords list in the keyword expansion module, which leverages the functionality of related search. However, the great number of keywords is a heavy burden on security analysts. To solve this problem, the core words identification module distinguishes "core words" (closely related to the underground economy) from "filter words" (less meaningful words). The remaining keywords list is important for analysts to investigate and understand the online underground economy as well as to prioritize their tasks.

The performance of the KDES system was evaluated on the accuracy of identifying keywords as black. For those black keywords, the security analysts queried them on popular online underground economy communication channels, including Baidu Tieba, QQ groups, and Baidu. They sampled 1,000 keywords and verified 943 keywords as black (94.3% accuracy). Based on the obtained black keywords, an extensive measurement of the online underground economy in China was carried out regarding the underlying infrastructure, the criminals behind it, and the impact on the cyber environment, which effectively provided solutions to prevent illegal promotion by the underground economy. However, the system resilience can be strengthened.
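The SObE decision rule mentioned above, comparing the predicted deobfuscation time with the attacker's budget, can be expressed compactly. The sketch below substitutes a scikit-learn random forest regressor for the R packages used in [27]; the feature and variable names are assumptions.

```python
from sklearn.ensemble import RandomForestRegressor


def should_release(obfuscated_features, attacker_budget_seconds, model):
    """SObE-style decision: release the obfuscated program only if the
    predicted deobfuscation time exceeds the attacker's budget."""
    predicted_time = model.predict([obfuscated_features])[0]
    return predicted_time > attacker_budget_seconds


# Training sketch: X holds the 15 selected program/symbolic-variable features,
# y holds measured KLEE/angr attack times (seconds) on the synthetic corpus.
# model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# if not should_release(features_of_new_build, 72 * 3600, model):
#     pass  # try a different obfuscating transformation instead
```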


For instance, the attackers could relocate the black keywords from anchors to other sections of the page, which can escape the parser. Meanwhile, KDES can be advanced by defending against evasion implemented by adversaries.

2) Predicting Whether a Website Will Become Malicious or Not: Successfully detecting whether a target website is malicious or benign with high accuracy has been achieved through concerted efforts by researchers. However, a more effective and less passive method has been proposed, which is to forecast whether a website will become malicious or not in advance. Soska and Christin [7] proposed a classification system which can predict whether a currently benign website has a high risk of becoming malicious in the future. This system is not only beneficial to search engines but also useful to blacklists and website operators.

The authors designed, implemented and evaluated a machine learning classifier which could proactively identify whether a website would be compromised or not within one year. When it comes to the groundtruth, both benign and malicious websites were included in the dataset to set up the machine learning classifier. Two blacklists were used as the groundtruth for malicious websites: PhishTank [68] and a list of websites that have been injected by "search-redirection attacks" [122], [123]. 34,922 websites from PhishTank and 14,425 websites from the "search-redirection attacks" list were archived as the groundtruth for malicious websites. To collect benign websites, the authors randomly sampled the entire .com zone file and yielded 337,191 website archives. Among these 337,191 websites, 27 websites infected by "search-redirection attacks" and 72 websites matching PhishTank entries were discarded. In addition, a complementary 421 sites collected from the DNS-BH [124], Google SafeBrowsing [125], and hpHosts [126] blacklists were removed from the benign corpus. Eventually, the number of websites in the benign corpus was 336,671.

Features are critical to the decisions of a classifier. To effectively differentiate the websites that will become malicious, the features derived for the classifier should characterize the websites from various perspectives, including the appearance of the website, traffic information, textual content, and so on. In [7], the classifier adopted two main kinds of features, by referencing the Alexa Web Information Service (AWIS) [92] and the content information of a website. The AWIS information covers the popularity ranking of a website, the number of links to the website, load percentile, whether it is an adult site or not, and the number of reaches per million; the content information of a website includes the lists of tags from all the pages that survived the acquisition and filtering process. To yield the best classification performance, statistic-based dynamic feature extraction was conducted: for each tag, the balanced accuracy of the tag was calculated and ranked. To deal with the problem that the tags used for classification may change as the attacks against websites evolve, a windowing technique was applied to the feature extraction process. However, the limitations of dynamic features introduce the possibility that adversarial machine learning approaches affect the performance of the system, which deserves further analysis.

A C4.5 decision tree classifier was chosen as the prediction model in this system. It is hard to evaluate whether a prediction result is correct or not immediately, so the alternative is to use past data to simulate the prediction done in the past and evaluate the prediction results by using present data. A ROC curve was generated for the evaluation of the classifier's performance. Within a one-year time horizon, the prediction model could achieve 66% true positives and 17% false positives.

3) Automatic Identification of Unknown Web-Based Infection Campaigns: While people enjoy various kinds of wonderful services online and communicate globally, the software used to support the functionality of a website might be vulnerable to attackers. Attackers inject malicious code snippets by exploiting server-side vulnerabilities. If users download or install the malware deriving from client-side vulnerabilities, they will be infected. For the attackers, finding a vulnerability and generating a malicious code snippet requires investing a lot of time and effort. To save resources, the attackers usually launch an infection campaign across multiple websites (potentially thousands) by using carefully crafted infection vectors. For this reason, there is a great opportunity to discover probably thousands of malicious websites if an unknown infection campaign is identified.

Borgolte et al. [8] proposed the δ-system, which can automatically identify Web-based infection campaigns. The system adopts a static analysis method to discover previously unknown infection campaigns which may involve thousands of malicious websites. In prior work, researchers were devoted to detecting malicious activities based on single website/URL detection using dynamic analysis approaches. However, the δ-system is able to identify the infection campaigns and predict malicious websites which were previously under the surface.

The δ-system follows a four-step process. The first step is to retrieve and normalize the website: the current version and the base version of a website are retrieved, and the source code is saved after normalization. The normalization includes normalizing capitalization, reordering attributes, discarding invalid attributes and normalizing attribute values. Secondly, the similarity between the base and up-to-date versions of the website is measured. The similarities are computed via a fuzzy tree difference algorithm, which performs the comparison between the Document Object Model (DOM) tree of the base website and the DOM tree of the current website. Thirdly, the similarity vectors are clustered based on the similarity measurement. A density-based clustering algorithm is applied for this step, namely the Ordering Points To Identify the Clustering Structure, with Outlier Factors (OPTICS-OF) algorithm proposed by Breunig et al. [127]. The outputs of the clustering are defined as infection campaign, benign trend and new campaign, separately. The δ-system relies on an external detection system instead of detecting the malicious behaviors by itself. The last step is to generate the identifying signature. A signature is simply generated by illustrating each node's textual representation as a Deterministic Finite Automaton (DFA). The identified signature can be served to security analysts and intrusion detection/prevention systems.
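A minimal stand-in for the training and ROC evaluation pipeline described above is shown below. Note that scikit-learn provides CART rather than C4.5, so the decision tree here only approximates the classifier used in [7], and the data are synthetic placeholders for the AWIS and content-tag features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve

# Stand-in data: in the real pipeline each row would hold AWIS statistics and
# ranked content-tag indicators for one website, and y would mark whether the
# site was blacklisted within the one-year horizon.
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.9], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# CART decision tree used here as a stand-in for the C4.5 classifier of [7].
clf = DecisionTreeClassifier(max_depth=10, class_weight="balanced").fit(X_train, y_train)

fpr, tpr, _ = roc_curve(y_test, clf.predict_proba(X_test)[:, 1])  # points for the ROC curve
```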


TABLE VII: WEBPAGE DATA

The dataset in [8] contains the websites crawled from January 2013 to May 2013 (4 months), counting 26,459,103 distinct website pairs. Also, there are six main features selected for clustering, including three kinds of binary features (template propagation, script inclusion, and inline frames, respectively) and three categories of attribute values (Shannon entropy, character count and Kolmogorov complexity, respectively).

The δ-system successfully identified infection campaigns that were previously unknown to the public. Once a new infection campaign is identified, the websites in the same clusters have a significantly high probability of being infected by the same infection campaign in the past or future. For the potentially malicious websites, the prediction is significant to the websites and users for preventative purposes. A real case which was successfully identified by the δ-system was the Cool exploit kit infections of Discuz!X: 15 different websites in the same cluster were using the discussion platform "Discuz!X" that redirected to a specific infection campaign. It has been proved, via in-depth analysis and evaluation, that the system significantly improved the detection speed and reduced the evaluation overheads compared with previous work, which contributes to real-world deployment. Nevertheless, some limitations, such as relying on an external analysis system that utilizes dynamic analysis, defending against step-by-step injection, and countering the evolution of infection vectors, still exist in the detection system.

As a concluding remark for this section, an overall description of the webpage data is provided in Table VII.

F. Social Media Data

1) IOC Discovery: Cyber Threat Intelligence (CTI) is defined as "evidence-based knowledge, including context, mechanisms, indicators, implications and actionable advice, about an existing or emerging menace or hazard to assets that can be used to inform decisions regarding the subject's response to that menace or hazard", according to Gartner [130]. CTI is usually collected in the form of indicators of compromise (IOC), which can be automatically transformed and deployed to different kinds of security defense mechanisms, such as intrusion detection systems, when the IOCs are recorded in the format of a specific threat information sharing standard. IOC is significant for an organization to gain visibility into ever-changing security threats, identify early indicators of threats and change protection strategies. However, facing the tremendous growth of information sources, discovering and generating high-quality IOC is a challenge as well as an opportunity.

Liao et al. [29] implemented the iACE system to automatically discover IOC and generate OpenIOC (a kind of IOC standard framework) compatible, semantic-rich intelligence from popular technical blogs (including AlienVault, Malwarebytes, Webroot, etc.). Furthermore, the insights gained from more than 71,000 articles gathered from 45 popular technical blogs shed light on the unknown relationships across different security attack incidents, especially their shared infrastructure resources, and the impacts on security protection and attack evaluations.

The architecture of iACE consists of 5 modules: (1) Blog Scraper (BS): BS is designed to crawl technical articles from technical blogs. (2) Blog Preprocessor (BP): by leveraging NLP techniques, the IOC-irrelevant articles are filtered out in the BP. (3) Relevant-content Picker (RCP): RCP converts pictures and other special contents in the articles to text, splits sentences and selects the candidate sentences using a kit of context terms and regexes. (4) Relation Checker (RC): RC parses the grammatical structure correlating the context terms and the IOC candidates, and decides whether the latter is indeed an IOC. (5) IOC Generator (IG): according to the OpenIOC standard, IG automatically generates the header and definition parts for all identified IOCs.

There are two machine learning classifiers deployed in the system. The first is a topic classifier used in the RCP module, separating the non-IOC articles from the IOC articles. The classifier is trained on a dataset including 150 IOC articles and 300 non-IOC articles by using the support vector machine (SVM) algorithm. It is worth mentioning that the IOC files are from 2 public feeds, including iocbucket [131] and openiocdb [132]. Topic words, article length and dictionary-word density are calculated and used as features. The second classifier is a relation checker, confirming the presence of IOC relationships between a context term and an IOC candidate within a sentence.
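Two of the topic-classifier features named above, article length and dictionary-word density, are easy to compute from raw article text. The sketch below is illustrative; the dictionary, tokenization and downstream SVM are simplifications rather than the iACE implementation.

```python
from sklearn.svm import SVC

ENGLISH_WORDS = {"the", "of", "malware", "campaign", "hash", "domain"}  # stand-in dictionary


def article_features(text):
    """Two of the topic-classifier features named above: article length and
    dictionary-word density (share of tokens found in a reference dictionary)."""
    tokens = text.lower().split()
    density = sum(t in ENGLISH_WORDS for t in tokens) / max(len(tokens), 1)
    return [len(tokens), density]


# X would hold [length, density, topic-word counts, ...] per article, and y
# would mark IOC articles (1) vs non-IOC articles (0), as in the 150/300 set.
# clf = SVC(kernel="linear").fit(X, y)
```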


The classifier utilizes a logistic regression algorithm based on features calculated from dependency graphs, which are transformed from the candidate sentences consisting of IOC candidates and context terms. 1,500 true IOC sentences and 3,000 false IOC sentences constitute the training dataset.

The dataset used for evaluation was collected from 45 security-related technical blogs between April 2003 and May 2016, counting 71K articles. The system can achieve 98% precision and 100% recall on the problem of finding IOC articles. In terms of identifying true IOCs and context terms, the classifier can reach 98% precision and 92% recall. On average, inspecting a real-world article costs about 0.25 seconds. The accuracy and performance of the system can satisfy the demand of discovering high-quality IOCs, which provides immediate protection to organizations' security systems.

Eventually, the iACE system automatically discovered 900K IOCs with their context. By inspecting and analyzing the IOCs, the authors reported unprecedented findings which give valuable guidance and warnings for protecting organizations' assets. Some examples: (1) the authors found that seemingly unrelated attack incidents were related, such as by sharing the same infrastructures; (2) the attackers might change their attack strategies when facing the new release of exposed IOCs; (3) organizations, however, usually react slowly to the release of IOCs; (4) regarding the quality of open-source intelligence, they found that Hexacorn and Naked Security can provide timely and comprehensive messages about an emerging attack. These findings uncovered security insights that were never known before, as well as providing profound suggestions on security defense and protection.

The limitations of the NLP techniques and presentation methods affect the discovery results. How to design and develop domain- and topic-specific tools is crucial in this field. On the other hand, other intelligence sources, such as research papers, can be gathered to extend this system.

2) Extracting and Encoding Cyber Attacks: Social media, as a gold mine of information, is popularly used as a sensor for different kinds of events, such as earthquake prediction, disease outbreak forecasting, and presidential election prediction [133]–[135]. It is an opportunity to make use of social media as a crowdsourced sensor of cyber attacks (e.g., data breaches, distributed denial of service (DDoS) attacks, and account hijacking). Gaining insights into cyber incidents is helpful for protecting individuals, organizations and nations from the losses and threats coming from various cyber attacks.

Khandpur et al. [31] proposed a dynamic event trigger expansion (DETE) approach to extract cyber events, accordingly providing situational awareness into cybersecurity events. Compared with [33], this approach extracts and furthermore characterizes security events evolving over time in a weakly supervised manner.

The approach consists of three modules, namely target domain generation, dynamic typed query expansion, and event extraction. The target domain generation module is designed for filtering related tweets which serve as the source of cyber event extraction. Given a limited and fixed seed query for cyber attack events and the collection of tweets, the tweets most related to the seed query are retrieved in this step. After converting the tweets and the given seed query into dependency tree form, the convolution tree kernel [136] algorithm is used to calculate the similarities between the seed query and all collected tweets based on shared longest common paths. Both semantic and syntactic constraints are considered when measuring the similarities between seed queries and tweets, to filter out noise tweets. Secondly, the authors proposed a dynamically typed query expansion method by leveraging convolution kernels as well as dependency parses. A set of expanded queries is generated which represents the relevant concepts delivered by tweets from the target domain. The final step is event extraction. The query expansions are clustered by using the affinity propagation [137] algorithm. The representatives of the clusters are extracted as exemplars and annotated with the type of cyber attack. The representative of a cluster comes from the expanded query with the highest similarity value matching the seed query. Finally, an event is represented by the representative (discovered event), date and seed query (event type).

A large stream of tweets was gathered from August 2014 to October 2016. After removing retweets, the number of collected tweets is 5,146,666,178. To evaluate the approach, the authors used Gold Standard Reports (GSR) on cybersecurity incidents to serve as the groundtruth, including Hackmageddon [128] and Privacy Rights [129]. From the above two sources, the security event type, date, victim organization as well as description can be checked.

Regarding the performance evaluation, the experiments focused on three high-impact security incidents, namely data breach, DDoS and account hijacking. Two highly-cited previous works were chosen as baselines for comparison, specifically, using expectation regularization to generate the target domain [33] and using bursty keywords to discover cyber events [138]. The approach can achieve around 80% precision for data breach and DDoS events, and 66% precision for account hijacking, outperforming the two baselines. The authors also used case studies to illustrate the performance of their method. It is shown that this approach not only discovered typical security events (e.g., targeted DDoS attacks on Sony and Dyn, the Ashley Madison website data breach and Twitter account hijacking) but also successfully extracted cybersecurity attack events that had not been recorded in the GSR, by validating the results using Google search. The classes of attacks can be broadened in future work. Besides, the sequential dependencies of a kind of attack can be modeled to characterize the prevalence of the cybercrime.

3) Predicting Mobile App Security-Related Behaviors: With the popularity of mobile phones, the security and privacy of mobile apps have become a concern to end users, app developers and app markets [139]. Google Play provides a platform on which users can post valuable reviews from the end users' perspective. Kong et al. [30] designed a system named AUTOREB, which leverages the reviews from Google Play to predict an app's security behaviors, although the security-related behaviors are restricted to spamming, financial issues, over-privileged permissions and data leakage in this work. This is the first work that infers the security-related behaviors of an app according to users' reviews.
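The affinity propagation step described above groups the expanded queries and returns an exemplar per cluster, which is then annotated with an attack type. The snippet below illustrates the idea with scikit-learn's AffinityPropagation on random placeholder vectors; it is not the pipeline of [31].

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Stand-in input: one similarity-derived feature vector per expanded query term.
query_vectors = np.random.rand(40, 16)

ap = AffinityPropagation(random_state=0).fit(query_vectors)
for cluster_id, exemplar_index in enumerate(ap.cluster_centers_indices_):
    members = np.where(ap.labels_ == cluster_id)[0]
    # The exemplar plays the role of the cluster representative that gets
    # annotated with an attack type (e.g., data breach, DDoS, account hijacking).
    print(cluster_id, exemplar_index, len(members))
```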


TABLE VIII: SOCIAL MEDIA DATA

The goal of the system is to automatically predict the security-related behaviors based on numerous users' reviews. Because one review can be assigned to more than one security behavior category, the prediction model should be set up using a multi-label classifier. The authors proposed to predict the security behavior of an app at the review level as well as the app level. The review-level security behavior inference engine aims to label a review with a particular behavior category automatically. The latter utilizes a crowdsourcing technique to predict the security behavior of an app, by assigning more credit to trustworthy users.

Each review is treated as a sample and fed into the machine learning model after being manually labeled with the corresponding security-related behaviors and having security-related features extracted. With respect to review-level security behavior inference, three annotators worked together to decide the label of each user review. The security-related features are extracted from each review, including words and phrases closely related to the four security-concerned categories. In addition, the features are augmented and expanded by adopting the "relevance feedback" information retrieval technique [140] to add more "relevant" words and phrases. Each review is abstracted into a feature vector denoted by a bag-of-words (BOW). The features and labels of the instances are fed into a sparse SVM machine learning classifier to train a multi-label classifier. For the app-level security behavior inference engine, a crowdsourcing technique [141] is applied to aggregate the security labels from the review level to the app level. By considering the credibility of different users, the trustworthy users are given more credit based on the two-coin model [141] instead of the majority voting model [141].

The dataset includes reviews crawled from Google Play. One dataset L was collected during November 2014 for validating review-level security behavior inference; it contains 19,413 user reviews on 3,174 apps, and each review was labeled by three workers reaching a consensus. The other dataset D was collected from December 2013 to May 2014 for validating the effectiveness of app-level security behavior inference; it includes 12,783 apps with 13,129,783 reviews from 2,614,186 users. The authors did not hire enough labor to label the security behaviors of each app in dataset D.

For review-level security behavior inference, the dataset L is evenly split into a training set and a testing set. The metrics used for evaluation are precision, recall and F1 value. The accuracies of the classifier for identifying "spamming", "over-privileged permission", and "data leakage" are 91.96%, 95.99% and 93.46% respectively. The exceptional case is "financial issue", for which the classifier labeled reviews as "financial issues" while the users did not actually complain about financial issues. Besides, a baseline using a keyword-based method was compared with AUTOREB; AUTOREB exceeded it by a large margin of 51.36% in accuracy. For app-level security behavior inference, due to the lack of groundtruth, the authors listed 50 apps that had user complaints about security issues. The security-related behaviors predicted by AUTOREB can be regarded as cyber threat indicators to alert users, mobile app developers, and app market administrators.

In general, this is the first work leveraging user reviews to analyze the security risk of an app. The four kinds of security behaviors intuitively come from common sense, and they can be extended to other categories of risk behaviors.

As a concluding remark for this section, an overall description of the social media data is provided in Table VIII.

G. Mixed-Type Data

1) Vulnerability Exploits Prediction: The number of software vulnerabilities has been growing dramatically in recent years. Certain vulnerabilities might only be exploited after a long time, while numerous vulnerabilities need to be responded to quickly. Hence, it is of great importance to prioritize the response to the vulnerabilities.

Before vulnerabilities are exploited, hackers, security vendors and system administrators frequently discuss them on social media, such as Twitter, to exchange technical details and share experiences. Sabottke et al. [32] designed an exploit detector based on Twitter, which can predict the vulnerabilities to be exploited in the real world.
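The review-level, multi-label bag-of-words classification described above can be prototyped with a one-vs-rest linear SVM. The toy example below is a hedged sketch: the reviews, labels, and the use of LinearSVC (rather than the sparse SVM formulation in [30]) are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy labeled reviews; in AUTOREB a review may carry several of the four labels.
reviews = ["app keeps sending spam notifications",
           "why does a flashlight need my contacts and location",
           "charged me twice without asking"]
labels = [["spamming"], ["over-privileged permission", "data leakage"], ["financial issue"]]

categories = ["spamming", "financial issue", "over-privileged permission", "data leakage"]
Y = [[1 if c in ls else 0 for c in categories] for ls in labels]  # multilabel indicator matrix

X = CountVectorizer().fit_transform(reviews)        # bag-of-words features
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)     # one binary SVM per behavior category
```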


Based on the groundtruth about exploits and the information posted on Twitter before vulnerabilities are exploited, the exploit detector was built using supervised machine learning techniques. The authors collected 287,717 tweets in total that included explicit CVE IDs. The groundtruth about exploits comes from three data sources, identified according to the CVE IDs mentioned in the descriptions of Symantec's anti-virus (AV) and intrusion-protection (IPS) signatures [142]: (1) public proof-of-concept exploits, by querying ExploitDB [75]; (2) private proof-of-concept exploits, by querying the Exploitability Index of Microsoft security advisories [76]; and (3) real-world exploits, by querying Symantec's Worldwide Intelligence Network Environment (WINE) [77]. The dates when each vulnerability became known to the security community and when it was exploited were both recorded. Furthermore, the vulnerabilities are labeled as "real-world exploits" or "not exploited".

Features selected for the classifiers include Twitter features and database information features. Twitter features are extracted from the word distributions of tweets containing the keyword "CVE" and from the Twitter traffic data (e.g., number of tweets) involving the corresponding CVEs. In order to improve the performance and robustness of the detector, Common Vulnerability Scoring System (CVSS) features (e.g., the CVSS score) and database features from the National Vulnerability Database (NVD) and the Open Sourced Vulnerability Database (OSVDB) (e.g., the NVD last-modified date and the OSVDB category) are also considered in this study, as they were proven useful for predicting exploits in previous work [143]. After a mutual-information-based feature selection process, 67 features are retained for training and testing the exploit classifier.

After feature engineering, these features are fed into a support vector machine (SVM) classifier. The output of the binary classifier is whether the vulnerability mentioned in tweets will be exploited or not. Samples are randomly selected from the dataset after shuffling ten times. For each round of sampling, 50% of the available data is used for training, and the remaining 50% is used for testing.

The detector is evaluated in terms of precision and recall, and assessed by how far its detection time precedes the exploit dates recorded in the existing datasets. The detector can detect exploits a median of two days earlier than the existing datasets. It is shown that the Twitter-based detector produces fewer false positives than a Common Vulnerability Scoring System (CVSS) detector and improves the precision by one order of magnitude. Moreover, the authors introduced three kinds of adversarial machine learning attacks against the exploit detector: randomly posting tweets without knowledge of the features (blabbering adversary), mirroring the word statistics of exploited vulnerabilities (word copycat adversary), and manipulating all Twitter features (full copycat adversary). They simulated these three types of adversaries and presented bounds on the damage each can cause to the Twitter-based exploit detector, which illustrates the robustness of their method.

Lastly, it should be mentioned that there are still two limitations in the above method. First, the training and testing data are randomly split, which results in temporal mixing of future and past data. Second, although the authors evaluated the utility of tweet data, they never compared it with detectors that use the readily available summaries of the vulnerabilities.

2) Predicting the Future Structural Changes of the Network: Network datasets include abundant data recording the interactions and/or communications among different activities (for example, personal communications via phone or e-mail, interactions on social media, network traffic information between hosts and servers, and so on).

The structure of an active network has a notable pattern and will change as time goes on. The temporal dynamics are critical to a system: they facilitate finding anomalies in the system, detecting fraud and intrusions, and allocating additional resources. Rossi et al. [25] proposed a dynamic behavioral mixed-membership (DBMM) model to capture the "roles" of individual nodes in the network graph and predict how they change over time.

Given a sequence of time-evolving network snapshots, the authors proposed a DBMM framework that can investigate the properties of the network, understand its behaviors, detect anomalies in the system, and predict future structural changes through the following steps: (1) The first step is to automatically learn a representative feature for each node in a given graph by leveraging the method proposed in [144]. (2) The second step is to extract features from each graph. (3) The third step is to discover behavioral roles that represent the common patterns of behavior based on the extracted features, by assigning a probability distribution to each node in the network. (4) The next step is to extract these roles from the network snapshots iteratively. (5) Finally, a predictive model which depicts these time-varying behaviors is learned. Therefore, abnormal behavior can be found from unusual structural changes. In a nutshell, DBMM has four main advantages: (1) User-defined parameters are not required in this algorithm. (2) By parallel computing, the features, roles and transition models are learned individually, improving the scalability of the model. (3) The behavior representation is interpretable. (4) The model is flexible and applicable to all kinds of networks which change over time.

In order to validate the DBMM, real-world datasets and synthetic datasets were both applied to the model. The nine real-world datasets were the Twitter "who-follows-whom" network dataset, the Twitter "reply-to-messages" network dataset, a university email network dataset, enterprise network traces, a Facebook network dataset [103], the Enron email communications network dataset [104], the Oregon RouteViews project Internet dataset [93], the Internet Movie Database, and an MIT mobile-phone communications network dataset. In addition, the synthetic data was generated in the form of graphs that were probabilistically constructed with four main patterns: "center of a star", "edge of a star", "bridge nodes" (connecting stars/cliques), and "clique nodes".
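To make the role-transition idea concrete, the sketch below estimates node roles with non-negative matrix factorization and fits a single global transition matrix between two snapshots. Both simplifications are our own assumptions for illustration and do not reproduce the authors' DBMM estimation procedure.

# Illustrative sketch of the role-transition idea behind DBMM-style models:
# roles are estimated with NMF over per-node structural features, and a single
# global transition matrix maps role memberships at time t to time t+1.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_nodes, n_feats, n_roles = 50, 6, 3

# Per-node structural features (e.g., degree, triangle count, egonet size)
# for two consecutive snapshots of the same network (placeholder values).
F_t = rng.random((n_nodes, n_feats))
F_t1 = rng.random((n_nodes, n_feats))

# Learn roles on snapshot t and express both snapshots in the same role space.
nmf = NMF(n_components=n_roles, init="nndsvda", max_iter=500, random_state=0)
G_t = nmf.fit_transform(F_t)   # node-by-role memberships at time t
G_t1 = nmf.transform(F_t1)     # memberships at time t+1 under the same roles

# Fit a role-transition matrix T by least squares so that G_t @ T ~ G_t1.
T, *_ = np.linalg.lstsq(G_t, G_t1, rcond=None)

# Predict memberships at t+1 from time t; nodes whose observed behavior
# deviates most from the prediction are flagged as potential anomalies.
residual = np.linalg.norm(G_t @ T - G_t1, axis=1)
print("most anomalous nodes:", np.argsort(residual)[-5:])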


Based on the above real-world and synthetic datasets, the performance of predicting the future behavior of nodes was evaluated in two ways: (1) using a loss function to compare the predicted behavioral snapshot to the true behavioral snapshot; and (2) using the predicted behavioral snapshot to predict the role of each node in the network, and then evaluating the predictions by using the Area Under the ROC Curve (AUC). The results showed that the model could outperform sensible baseline models with few exceptions. When using the synthetic dataset to validate the model, the accuracy in detecting anomalous behavior reached 88.5%.

3) Discovering Security Events on a Specific Event Category: Twitter, as a popular social media platform, contains rich and timely information, including the events happening in the world. Abundant information from social media not only facilitates people's lives but also boosts the economy. However, information overload has become more and more common as ever-increasing amounts of information spread online. When it comes to security events, most users prefer to get a simple indicator (e.g., breaking news) or a prediction of a specific attack rather than receiving a large amount of aimless security-related information. For security analysts, the enormous number of messages spread on social media is hard to monitor and analyze.

Faced with this overload of security-related social media information, Ritter et al. [33] proposed an approach to discover cybersecurity-related events by using a weakly supervised method. This approach is the first work to discover and examine security-related incidents from social media. The approach automatically extracts and defines a new security event category based on the raw Twitter stream, given a handful of historical seed event samples.

The architecture of the approach consists of two modules: extracting event candidates from Twitter, and training a classifier with positive seed and unlabeled event candidate data to determine whether an event candidate describes a new instance of the event category.

Three traditional security event categories were studied in this work as a proof-of-concept, namely, Denial of Service (DoS) attacks, account hijacking and data breach. Each historical seed event was represented by the entity involved in the event, as well as the event date. The numbers of seeds for the three categories of events mentioned above are respectively 15 (e.g., (Spamhaus, 2013/03/18)), 10 (e.g., (associated press, 2013/04/23)) and 11 (e.g., (citi, 2011/06/09)). The unlabeled event candidates were collected from January 17, 2014, until February 20 by tracking user-provided keywords associated with the event type. In their experiment, the keywords were respectively DDoS, hacked and breach for DoS attacks, account hijacking and data breaches. Candidate events matching keywords associated with the event type were gathered by using the Twitter API. In total, 570 candidate DoS attack events, 4,014 candidate account hijacking events, and 1,728 candidate data breach events were gathered during the data collection process.

For feature engineering, two collections of binary features were considered for identifying the event type. The first feature set characterized the entity by extracting the contextual words and parts of speech near the entity. The second set consisted of contextual features around the tracked keyword. In total, 3,790 features for DoS attacks, 52,995 features for account hijacking and 11,271 features for data breaches were extracted.

Because not all event candidates fit the event category, the classifier was designed to determine which candidates fit the category and to remove distractors. For example, some event candidates only present general knowledge of an event category or promote a product, and should be filtered out. In traditional information extraction, substantial annotation effort by security experts is required to create a comprehensive corpus of event triggers and arguments; according to the annotations, a supervised machine learning classifier is then set up to extract new event instances. In this work, the authors instead formulated the learning problem using positive and unlabeled examples, and proposed a new strategy that regularizes the label distribution over unlabeled samples towards a user-specified expectation of the label distribution for the keyword.

To evaluate the performance of the model, the authors manually sampled candidate events for each event category and compared the model's prediction results against expert judgments. Furthermore, discovered events were also validated and confirmed by measuring computer network traffic. The precision and recall curves of the model were presented and compared with three baselines (heuristically labeling negative examples [145], one-class SVMs [146] and semi-supervised EM), and it was demonstrated that the approach dramatically outperforms these novel and competitive baselines. Aggregating multiple sources of social media to discover security events should be considered as a future direction.

We put the descriptions of the datasets mentioned in this section in the corresponding Tables III and IV.
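To make the feature construction and the positive/unlabeled setup concrete, the sketch below builds binary contextual-word features around a candidate entity and trains a simple classifier that treats unlabeled candidates as tentative negatives. This naive heuristic, the helper function and the toy tweets are illustrative assumptions; they do not implement the expectation-regularization strategy described above.

# Illustrative sketch: binary contextual-word features around an entity in
# tweets matching a tracked keyword, plus a naive positive-vs-unlabeled
# classifier (treating unlabeled candidates as tentative negatives).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def context_window(tweet, entity, size=3):
    """Return the words within `size` positions of the entity mention."""
    words = tweet.lower().split()
    if entity not in words:
        return ""
    i = words.index(entity)
    return " ".join(words[max(0, i - size):i] + words[i + 1:i + 1 + size])

# Hypothetical candidates: (tweet, entity, label); 1 = seed event (positive),
# 0 = unlabeled candidate.
candidates = [
    ("spamhaus hit by massive ddos attack today", "spamhaus", 1),
    ("citi confirms data breach affecting card holders", "citi", 1),
    ("our new guide explains what a ddos attack is", "guide", 0),
    ("win a prize if your account gets hacked click here", "prize", 0),
]

contexts = [context_window(t, e) for t, e, _ in candidates]
labels = [y for _, _, y in candidates]

vec = CountVectorizer(binary=True)          # binary contextual-word features
X = vec.fit_transform(contexts)
clf = LogisticRegression().fit(X, labels)

new = context_window("breaking ddos attack takes github offline", "github")
print(clf.predict_proba(vec.transform([new]))[0, 1])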


TABLE IX: A Summary of Reviewed Papers on Model Methodology

IV. CHALLENGES AND FUTURE DIRECTIONS

In the preceding section, we have surveyed the state-of-the-art systems/schemes for predicting and discovering cybersecurity incidents. Nevertheless, the progress is still in the infant stage, and many critical issues might have been overlooked for simplicity. In the following, we discuss crucial research challenges that need to be addressed and propose future directions for researchers who are interested in this area, in line with the research methodology described in Section II.

A. Cybersecurity Incident Analysis

The first step in predicting and discovering cybersecurity incidents is to analyze them from different points of view, as illustrated in Section II. The security model is set up based on observations from cyber incident analysis. From Table IX, we find that most of the work analyzed cyber incidents from different perspectives, except [25], which proposed a model to detect anomalies in extensive network data instead of targeting a specific cyber incident.

To thoroughly analyze a cybersecurity incident, some analysts attempted to find information that reflects a cybersecurity incident indirectly. In [30], when trying to utilize reviews from Google Play to discover potentially malicious apps, the researchers found that only a few mobile app reviews relate to security and privacy issues. Also, some researchers [31]-[33] made an attempt to get indirect evidence on cybersecurity incidents from Twitter; however, they found that only a small subset of Twitter users discuss vulnerability exploits [32] and security incidents [31], [33]. Although these kinds of information are valuable, relevant information is hard to extract with high quality and in sufficient quantity, which makes it challenging to infer security-incident-related characteristics. Hence, expanding the way of thinking rather than focusing on a single angle can be helpful when analyzing a security incident. As a complementary solution, Howard and Longstaff's security incident taxonomy provides a promising direction to comprehensively analyze an incident. Specifically, as shown in Figure 3, an incident can be analyzed step by step from seven viewpoints, including the types of attackers, the exploit tools, the vulnerabilities that have been exploited, the attackers' actions, the targets, the unauthorized results of the incident, and the objectives of the attackers [147], [148]. The taxonomy also gives some tips for collecting data related to cyber incidents.

B. Security Problem Modeling

It is worth mentioning that the ultimate objective of cybersecurity incident prediction and discovery is to protect organizations, governments, business operators, and even end users from incidents, instead of generating an ambiguous prediction result. If a security-unaware operator cannot act upon the result of a prediction or discovery, there seems to be little point in the forecast.

Although the existing work has defined and modeled prediction/discovery problems in accordance with different kinds of incidents, as summarized in Table II, there is still a long way to go in refining the problem, as well as in improving cyber resilience, defined as "the ability to continuously deliver the intended outcome despite adverse cyber events" [149]. For example, translating risk profiles into actionable security recommendations is a direction for future work [10], [18]. Also, by obtaining information on the monetary impact of each incident type, the prediction can provide more economically-informed recommendations. Recent work has shown that developing new technical solutions may be less efficient than giving social or financial incentives for improving overall security [150], [151]. Studying proper incentives that encourage the kind of management that reduces incidents is therefore worthwhile.
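As a toy illustration of how monetary impact estimates could drive economically-informed recommendations, the sketch below ranks incident types by expected annual loss; the frequencies and cost figures are entirely hypothetical placeholders.

# Illustrative sketch: rank incident types by expected annual loss
# (frequency x average cost). All numbers are hypothetical placeholders.
incident_profiles = {
    # incident type: (expected incidents per year, average cost per incident in USD)
    "data breach": (0.8, 250_000),
    "dos attack": (2.5, 40_000),
    "account hijacking": (1.2, 15_000),
}

expected_loss = {
    kind: freq * cost for kind, (freq, cost) in incident_profiles.items()
}
for kind, loss in sorted(expected_loss.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kind}: expected annual loss ${loss:,.0f}")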


Fig. 3. Howard and Longstaff’s security incident taxonomy.

C. Data Collection and Processing

Predicting and discovering cybersecurity incidents is a data-driven problem. Forecasting future incidents with high accuracy rests on the assumption that the dataset is representative, authentic and comprehensive. Hence, how to guarantee that the collected data satisfies these three criteria is a challenge.

First, representative data refers to samples that are unbiased and typical of the incidents they describe. Whether the results of a model can generalize to incidents that were not reported depends on how representative the incident samples are, as stated in [18]. Specifically, Sarabi et al. [18] clarified that self-reported incidents externally detected by a third party usually have high biases. In the reviewed work, the researchers attempted to employ different ways to collect vast amounts of data; however, the representativeness of the data may not be guaranteed. For instance, in [27], the researchers could not find representative real-world programs containing the sorts of security checks required by their study. To resolve this problem, they designed a C program generator which produced a large number of programs with various license checking algorithms. Nonetheless, compared with real-world data, synthetic programs seem questionable when faced with new and existing obfuscation and deobfuscation techniques.

Second, comprehensive data means that the dataset includes everything needed or relevant. In the existing work [9], [18], when dealing with the problem of forecasting whether an organization will suffer a data breach incident in the future, both works leverage the VERIS community database [17] to collect previous incident records. However, the reports from the VERIS community database mainly focus on U.S. incidents. A possible solution is to utilize more comprehensive sources of data breaches (e.g., Hackmageddon [60], the Web Hacking Incidents Database [61] and incident reports from the Australian Cyber Security Centre [152]) and aggregate them at the granularity of individual organizations. Another case is [24]. Facing many symptoms that reflect poor management, the authors attempt to comprehensively describe all the ways in which a network could be mismanaged. Combining eight misconfiguration symptoms into one mismanagement metric and aggregating these symptoms at the autonomous-system level might be a solution to capture mismanagement characteristics as comprehensively as possible.
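As a small illustration of aggregating breach records from several such sources at the granularity of an organization, consider the sketch below; the sources, column names, and deduplication rule are hypothetical assumptions rather than an established pipeline.

# Illustrative sketch: combine incident records from multiple (hypothetical)
# sources and aggregate them per organization. Column names and the
# deduplication rule (same organization, incident type and month) are assumptions.
import pandas as pd

veris = pd.DataFrame({
    "organization": ["acme corp", "globex"],
    "incident_type": ["data breach", "data breach"],
    "date": ["2015-03-12", "2015-06-01"],
    "source": "veris",
})
hackmageddon = pd.DataFrame({
    "organization": ["Acme Corp", "initech"],
    "incident_type": ["data breach", "dos"],
    "date": ["2015-03-15", "2015-07-20"],
    "source": "hackmageddon",
})

incidents = pd.concat([veris, hackmageddon], ignore_index=True)
incidents["organization"] = incidents["organization"].str.lower().str.strip()
incidents["month"] = pd.to_datetime(incidents["date"]).dt.to_period("M")

# Treat reports of the same type for the same organization in the same month
# as one incident, then count incidents per organization.
deduped = incidents.drop_duplicates(subset=["organization", "incident_type", "month"])
per_org = deduped.groupby("organization").size().rename("incident_count")
print(per_org)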


Third, authentic data refers to data that is reliable and accurate. That is to say, the quality of the groundtruth determines the performance of prediction. Filtering and validating samples using multiple criteria might be able to confirm the authenticity of data to a certain degree. For example, benign websites are validated against five reputation blacklists in [7], and must never have been listed in any of these blacklists. Moreover, Kong et al. [30] adopt a crowdsourcing approach that gives more credit to trustworthy users, which offers a way to distinguish and select more reliable samples.

D. Feature Engineering/Representation Learning

The performance of a machine learning algorithm heavily depends on the choice of features or data representation. For this reason, much effort is directed towards data preprocessing and transformations that can lead to a representation of the data that supports efficient machine learning. From Table IX, it is shown that most of the reviewed work adopts feature engineering, which relies on human ingenuity to extract discriminative features from data; this is essential but requires domain-specific expert knowledge, human resources and funding. Sometimes, critical underlying factors hidden behind the data are even overlooked by humans. Representation learning is highly desirable and has achieved remarkable success both in academia and in industry, including speech recognition and signal processing, object recognition, natural language processing, and transfer learning. Bengio et al. [48] summarized the recent work in representation learning, providing multiple solutions for generating a good representation. Using representation learning that identifies and disentangles the underlying explanatory factors hidden in the observed data, significant advances could be made in cybersecurity incident prediction and discovery. If successful, tremendous breakthroughs are possible.

E. Model Customization

When dealing with security-incident-related problems, the existing DM/ML algorithms and models, as well as NLP tools, seem hard to use without customization, let alone able to achieve satisfactory performance directly.

When it comes to DM/ML algorithms and models, except for a few works that customize the model according to the research problem, features, and outcome, as shown in Table IX, most of the work directly uses machine learning models as black boxes by running packages in Python or R.

The challenge is also apparent when processing security-related text. On the one hand, it is known that NLP is highly domain-specific: NLP systems designed for one domain hardly work well on other domains. On the other hand, the cybersecurity-related vocabularies (e.g., online underground black keywords [28] and IOCs [29]) are rapidly evolving and significantly different from commonly used vocabularies.

In a nutshell, customizing models according to project requirements and designing domain-specific NLP tools can be a stepping stone towards higher-performance cybersecurity incident prediction and discovery approaches.

F. Evaluation

The output of a prediction should be a fact about the future. No one can give the right answers until the day really comes. Similarly, the outcome of discovery is supposed to be a previously unknown issue. Due to the lack of knowledge about the future and the significant amount of new findings, properly evaluating the results of prediction and discovery is challenging.

With respect to verifying prediction results, the alternative to waiting for the future to arrive is to use past data to simulate a prediction made in the past. That is, past data is applied in the training phase to set up the prediction model, and present data is used to validate the prediction results. This approach is commonly adopted in the existing cybersecurity incident prediction work [7], [9], [18], [32].

For discovery problems, the results are supposed to be previously unknown issues. Performing a manual check on all discovered results seems unrealistic and time-consuming. Hence, the solution is to sample the discovered results and then validate them manually. For example, in [28], after finding new black keywords, the authors randomly sampled 1,000 keywords and manually validated them by querying forums, chat groups and search engines to determine their real meanings.

After performing the above processes, traditional evaluation metrics can be applied to evaluate the designed model. Table IX summarizes the evaluation metrics used in the reviewed work. It can be found that most of the work presently applies simple evaluation metrics to evaluate the model, and there is still a lot of space for further research in this step. For example, while most of the models can achieve high accuracy on sample data, the models' usability needs to be carefully evaluated. Besides, with the passing of time and the development of technology, whether the performance of the models remains stable is worth investigating. Furthermore, to achieve the ultimate goal of enhancing cyber resilience for governments, enterprises and individual users, time, speed, deployment requirements, and other considerations also need to be assessed.
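A minimal sketch of this back-testing idea, training only on records observed before a cutoff date and evaluating on later records so that future and past data are not mixed, is given below; the feature columns, the label construction, the cutoff and the classifier choice are assumptions for illustration.

# Illustrative sketch of simulating "a prediction done in the past": train on
# records observed strictly before a cutoff date, evaluate on later records.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
n = 200
records = pd.DataFrame({
    "observed": pd.date_range("2014-01-01", periods=n, freq="D"),
    "mismanagement_score": rng.random(n),
    "blacklisted_ips": rng.integers(0, 50, n),
})
# Hypothetical label: whether an incident followed within the next 90 days.
records["incident"] = (records["mismanagement_score"] + rng.normal(0, 0.2, n) > 0.7).astype(int)

cutoff = pd.Timestamp("2014-05-01")
train = records[records["observed"] < cutoff]
test = records[records["observed"] >= cutoff]

features = ["mismanagement_score", "blacklisted_ips"]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train[features], train["incident"])
pred = clf.predict(test[features])

print("precision:", precision_score(test["incident"], pred, zero_division=0))
print("recall:", recall_score(test["incident"], pred, zero_division=0))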


V. CONCLUSION

In this survey, we presented an overview and research outlook of the emerging field of cybersecurity incident prediction. Firstly, we summarized the research methodology covering the critical phases of predicting cybersecurity incidents, which is an incremental, circular process composed of cybersecurity incident analysis, security problem modeling, data collection and processing, feature engineering/representation learning, model customization, and evaluation. Based on this research methodology, a thorough literature review was conducted on recent research efforts on schemes and methods for cybersecurity incident prediction. Furthermore, since data is an essential and indispensable element, which drives security problems, determines representation methods, and supports model setup, we categorized all of the reviewed work into six data types: the organization's report and dataset, network dataset, synthetic dataset, webpage data, social media data, and mixed-type dataset. References and crucial information for each dataset are organized and made public. Finally, conforming to the research methodology, many challenges in this infant-stage research area were addressed, and future directions were elaborated. Hopefully, cybersecurity incident prediction can raise concern among academia and industry. Also, the survey can help to characterize the latency and serve as a useful reference and valuable guideline for further research.

ACKNOWLEDGMENT

The authors wish to acknowledge the anonymous reviewers for their valuable comments, and special thanks go to Xiaoxing Mo (Deakin University) for helping to prepare this manuscript.

[19] B. J. Kwon, J. Mondal, J. Jang, L. Bilge, and T. Dumitras, "The dropper effect: Insights into malware distribution with downloader graph analytics," in Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Security, 2015, pp. 1118-1129.
[20] M. Rhode, P. Burnap, and K. Jones, "Early-stage malware prediction using recurrent neural networks," Comput. Security, vol. 77, pp. 578-594, Aug. 2018.
[21] X. Pan, X. Wang, Y. Duan, X. Wang, and H. Yin, "Dark hazard: Learning-based, large-scale discovery of hidden sensitive operations in Android apps," in Proc. Symp. Netw. Distrib. Syst. Security (NDSS), 2017, pp. 1-15.
[22] G. Lin et al., "Cross-project transfer representation learning for vulnerable function discovery," IEEE Trans. Ind. Informat., vol. 14, no. 7, pp. 3289-3297, Jul. 2018.
[23] L. Bilge, Y. Han, and M. Dell'Amico, "RiskTeller: Predicting the risk of cyber incidents," in Proc. ACM SIGSAC Conf. Comput. Commun. Security, 2017, pp. 1299-1311.
R EFERENCES [24] J. Zhang, Z. Durumeric, M. Bailey, M. Liu, and M. Karir, “On the
mismanagement and maliciousness of networks,” in Proc. Symp. Netw.
[1] Australia Cyber Security Centre. Australia Cyber Security Centre Distrib. Syst. Security (NDSS), 2014, pp. 1–12.
Threat Report 2017. Accessed: Apr. 2, 2018. [Online]. Available: [25] R. A. Rossi, B. Gallagher, J. Neville, and K. Henderson, “Modeling
https://www.acsc.gov.au/publications/ACSC_Threat_Report_2017.pdf dynamic behavior in large evolving graphs,” in Proc. 6th ACM Int.
[2] M. A. Kuypers, T. Maillart, and E. Pate-Cornell, An Empirical Conf. Web Search Data Min., 2013, pp. 667–676.
Analysis of Cyber Security Incidents at a Large Organization, [26] J. Zhang, X. Chen, Y. Xiang, W. Zhou, and J. Wu, “Robust
Dept. Manag. Sci. Eng., Stanford Univ., Stanford, CA, USA, network traffic classification,” IEEE/ACM Trans. Netw., vol. 23, no. 4,
and School Inf., Univ. California at Berkeley, Berkeley, CA, pp. 1257–1270, Aug. 2015.
USA, 2016. Accessed: Jul. 30, 2016. [Online]. Available: [27] S. Banescu, C. Collberg, and A. Pretschner, “Predicting the resilience
http://fsi.stanford.edu/sites/default/files/kuypersweis_v7.pdf of obfuscated code against symbolic execution attacks via machine
[3] C. Blackwell, “A security ontology for incident analysis,” in Proc. 6th learning,” in Proc. 26th USENIX Security Symp., 2017, pp. 661–678.
Annu. Workshop Cyber Security Inf. Intell. Res., 2010, p. 46. [28] H. Yang et al., “How to learn Klingon without a dictionary: Detection
[4] Australian Cyber Security Centre. Australian Cyber Security and measurement of black keywords used by the underground econ-
Centre Survey 2016. Accessed: Apr. 2, 2018. [Online]. omy,” in Proc. IEEE Symp. Security Privacy (SP), 2017, pp. 751–769.
Available: https://www.acsc.gov.au/publications/ACSC_Cyber_Security [29] X. Liao et al., “Acing the IOC game: Toward automatic discovery
_Survey_2016.pdf and analysis of open-source cyber threat intelligence,” in Proc. ACM
[5] Cisco. Cisco 2018 Annual Cybersecurity Report. Accessed: Apr. 2, SIGSAC Conf. Comput. Commun. Security, 2016, pp. 755–766.
2018. [Online]. Available: https://www.cisco.com/ c/dam/m/digital/elq- [30] D. Kong, L. Cen, and H. Jin, “AUTOREB: Automatically under-
cmcglobal/witb/acr2018/acr2018final.pdf?dtid=odicdc000016&ccid=cc standing the review-to-behavior fidelity in Android applications,” in
000160&oid=anrsc005679&ecid=8196&elqTrackId=686210143d3449 Proc. 22nd ACM SIGSAC Conf. Comput. Commun. Security, 2015,
4fa27ff73da9690a5b&elqaid=9452&elqat=2 pp. 530–541.
[6] J. Li, X. Huang, J. Li, X. Chen, and Y. Xiang, “Securely outsourc- [31] R. P. Khandpur et al., “Crowdsourcing cybersecurity: Cyber attack
ing attribute-based encryption with checkability,” IEEE Trans. Parallel detection using social media,” in Proc. ACM Conf. Inf. Knowl. Manag.,
Distrib. Syst., vol. 25, no. 8, pp. 2201–2210, Aug. 2014. 2017, pp. 1049–1057.
[7] K. Soska and N. Christin, “Automatically detecting vulnerable websites [32] C. Sabottke, O. Suciu, and T. Dumitraş, “Vulnerability disclosure in
before they turn malicious,” in Proc. USENIX Security Symp., 2014, the age of social media: Exploiting Twitter for predicting real-world
pp. 625–640. exploits,” in Proc. USENIX Security Symp., 2015, pp. 1041–1056.
[8] K. Borgolte, C. Kruegel, and G. Vigna, “Delta: Automatic identification [33] A. Ritter, E. Wright, W. Casey, and T. Mitchell, “Weakly supervised
of unknown Web-based infection campaigns,” in Proc. ACM SIGSAC extraction of computer security events from Twitter,” in Proc. 24th Int.
Conf. Comput. Commun. Security, 2013, pp. 109–120. Conf. World Wide Web, 2015, pp. 896–905.
[9] Y. Liu et al., “Cloudy with a chance of breach: Forecasting cyber secu- [34] CD Team. Cert/cc. Computer Security Incident Response Team
rity incidents,” in Proc. USENIX Security Symp., 2015, pp. 1009–1024. Frequently Asked Questions. Accessed: Apr. 3, 2018. [Online].
[10] Y. Liu, M. Dong, K. Ota, and A. Liu, “ActiveTrust: Secure and trustable Available: https://resources.sei.cmu.edu/asset_files/WhitePaper/2017_
routing in wireless sensor networks,” IEEE Trans. Inf. Forensics 019_001_485654.pdf
Security, vol. 11, no. 9, pp. 2013–2027, Sep. 2016. [35] AusCERT Team. Auscert Is a Leading Cyber Emergency Response
[11] Chronicle. Accessed: Sep. 13, 2018. [Online]. Available: Team (CERT) in Australia and the Asia/Pacific Region. Accessed:
https://chronicle.security/ Apr. 3, 2018. [Online]. Available: http://www.auscert.org.au/
[12] BizCover: Compare Small Business Insurance Quotes [36] TS Institute. Computer Security Incident Handling Step-by-Step.
Australia. Accessed: Sep. 13, 2018. [Online]. Available: Accessed: Apr. 3, 2018. [Online]. Available: https://www.sans.org/
https://www.bizcover.com.au reading-room/whitepapers/incident/incident-handlers-handbook-33901
[13] J. Jang-Jaccard and S. Nepal, “A survey of emerging threats in [37] Department of the Navy. Computer Incident Response
cybersecurity,” J. Comput. Syst. Sci., vol. 80, no. 5, pp. 973–993, 2014. Guidebook. Accessed: Apr. 3, 2018. [Online]. Available:
[14] L. Liu, O. De Vel, Q.-L. Han, J. Zhang, and Y. Xiang, “Detecting and http://www.csirt.org/publications/navy.htm
preventing cyber insider threats: A survey,” IEEE Commun. Surveys [38] G. Killcrece, K.-P. Kossakowski, R. Ruefle, and M. Zajicek,
Tuts., vol. 20, no. 2, pp. 1397–1417, 2nd Quart., 2018. “State of the practice of computer security incident response
[15] A. L. Buczak and E. Guven, “A survey of data mining and teams (CSIRTs),” CSIRT Develop. Team, Waldorf, MD, USA,
machine learning methods for cyber security intrusion detection,” IEEE Rep. CMU/SEI-2003-TR-001, 2003.
Commun. Surveys Tuts., vol. 18, no. 2, pp. 1153–1176, 2nd Quart., [39] I. David and S. Karl, Computer Crime: A Crime Fighter’s Handbook.
2016. Sebastopol, CA, USA: O’Reilly Assoc., 1995.
[16] Y. Kodratoff, “Knowledge discovery in texts: A definition, and appli- [40] T. Grance, K. Kent, and B. Kim, “Computer security incident handling
cations,” in Proc. Int. Symp. Methodol. Intell. Syst., 1999, pp. 16–29. guide,” document SP 800-61, NIST, Gaithersburg, MD, USA, 2004.
[17] VERIS. VERIS Community Database (VCDB). Accessed: [41] W. R. Cheswick, S. M. Bellovin, and A. D. Rubin, Firewalls and
Sep. 13, 2018. [Online]. Available: http://veriscommunity.net/ Internet Security: Repelling the Wily Hacker. Boston, MA, USA:
index.html Addison-Wesley, 2003.
[18] A. Sarabi, P. Naghizadeh, Y. Liu, and M. Liu, “Prioritizing security [42] W. Stallings, Network and Internetwork Security: Principles and
spending: A quantitative analysis of risk distributions for different busi- Practice, vol. 1. Englewood Cliffs, NJ, USA: Prentice-Hall, 1995.
ness profiles,” in Proc. Workshop Econ. Inf. Security (WEIS), 2015, [43] F. B. Cohen and F. B. Cohen, Protection and Security on the
pp. 1–12. Information Superhighway. New York, NY, USA: Wiley, 1995.


[44] J. Li, Y. Zhang, X. Chen, and Y. Xiang, “Secure attribute-based [74] PandaLabs. (2015). Cybercrime Reaches New Heights in the
data sharing for resource-limited users in cloud computing,” Comput. Third Quarter. Accessed: May 3, 2018. [Online]. Available:
Security, vol. 72, pp. 1–12, Jan. 2018. https://www.pandasecurity.com/mediacenter/pandalabs/pandalabs-q3/
[45] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. [75] (Dec. 2014). Exploits Database by Offensive Security. [Online].
Kuala Lumpur, Malaysia: Pearson Edu. Ltd., 2016. Available: https:http://exploit-db.com/
[46] National Vulnerability Database. Accessed: Sep. 13, 2018. [Online]. [76] Microsoft Exploitability Index. Accessed: Dec. 15, 2018. [Online].
Available: https://nvd.nist.gov/vuln Available: https://www.microsoft.com/en-us/msrc/exploitability-index
[47] 2018 Verizon Annual Data Breach Investigations Report. Accessed: [77] T. Dumitras and D. Shou, “Toward a standard benchmark for computer
Sep. 13, 2018. [Online]. Available: https://www.verizonenterprise.com/ security research: The worldwide intelligence network environment
verizon-insights-lab/dbir/ (WINE),” in Proc. 1st Workshop Build. Anal. Datasets Gathering Exp.
[48] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A Returns Security, 2011, pp. 89–96.
review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell., [78] B. Quintero et al. (2004). Virustotal. [Online]. Available:
vol. 35, no. 8, pp. 1798–1828, Aug. 2013. https://virustotal.com/
[49] A. Ng. (2013). Machine Learning and AI Via Brain [79] (2017). Softonic. [Online]. Available: https://en.softonic.com/
Simulations. Accessed: May 3, 2018. [Online]. Available: [80] (2017). Portableapps. [Online]. Available: https://portableapps.com/
http://ai.stanford.edu/Ëoeang/slides/DeepLearning-Mar2013.pptx [81] (2017). Sourceforge. [Online]. Available: https://sourceforge.net/
[50] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach, [82] C. Guarnieri, A. Tanasi, J. Bremer, and M. Schloesser. (2012).
“Supervised dictionary learning,” in Proc. Adv. Neural Inf. Process. The Cuckoo Sandbox. Accessed: Dec. 16, 2018. [Online]. Available:
Syst., 2009, pp. 1033–1040. https://media.readthedocs.org/pdf/cuckoo/latest/cuckoo.pdf
[51] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” [83] Psutil Python Library, PS Found., Mumbai, India, 2017.
Chemometrics Intell. Lab. Syst., vol. 2, nos. 1–3, pp. 37–52, 1987. [84] Virustotal-Free Online Virus, Malware and URL Scanner. Accessed:
[52] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: A next- Sep. 13, 2018. [Online]. Available: https://www.virustotal.com
generation open source framework for deep learning,” in Proc. [85] I. Chowdhury and M. Zulkernine, “Using complexity, coupling, and
Workshop Mach. Learn. Syst. (LearningSys) 29th Annu. Conf. Neural cohesion metrics as early indicators of vulnerabilities,” J. Syst. Archit.,
Inf. Process. Syst. (NIPS), vol. 5, 2015, pp. 1–6. vol. 57, no. 3, pp. 294–313, 2011.
[53] M. M. Najafabadi et al., “Deep learning applications and challenges [86] F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, “Chucky:
in big data analytics,” J. Big Data, vol. 2, no. 1, p. 1, 2015. Exposing missing checks in source code for vulnerability discov-
[54] B. Feng, Q. Fu, M. Dong, D. Guo, and Q. Li, “Multistage and elastic ery,” in Proc. ACM SIGSAC Conf. Comput. Commun. Security, 2013,
spam detection in mobile social networks through deep learning,” IEEE pp. 499–510.
Netw., vol. 32, no. 4, pp. 15–21, Jul./Aug. 2018. [87] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
[55] H. Li, K. Ota, and M. Dong, “Learning IoT in edge: Deep learning Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
for the Internet of Things with edge computing,” IEEE Netw., vol. 32, [88] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
no. 1, pp. 96–101, Jan./Feb. 2018. of word representations in vector space,” in Proc. Int. Conf. Learn.
[56] L. Li, K. Ota, and M. Dong, “When weather matters: IoT-based elec- Represent. (ICLR), 2013, pp. 313–317.
trical load forecasting for smart grid,” IEEE Commun. Mag., vol. 55,
[89] R. R. Larson, “Introduction to information retrieval,” J. Amer. Soc. Inf.
no. 10, pp. 46–51, Oct. 2017.
Sci. Technol., vol. 61, no. 4, pp. 852–853, 2010.
[57] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, and
[90] Open Resolver Project. Accessed: Sep. 13, 2018. [Online]. Available:
E. Vázquez, “Anomaly-based network intrusion detection: Techniques,
http://openresolverproject.org/
systems and challenges,” Comput. Security, vol. 28, nos. 1–2,
[91] Verisign. Inc. Accessed: Sep. 13, 2018. [Online]. Available:
pp. 18–28, 2009.
www.verisigninc.com
[58] S. Axelsson, “The base-rate fallacy and its implications for the dif-
[92] Alexa Web Information Service. Accessed: Sep. 13, 2018. [Online].
ficulty of intrusion detection,” in Proc. 6th ACM Conf. Comput.
Available: http://aws.amazon.com/awis
Commun. Security, 1999, pp. 1–7.
[59] Administration for Children and Families. (2015). United [93] University of Oregon RouteViews Project. Accessed: Sep. 13, 2018.
States Department of Health and Human Services, Information [Online]. Available: http://www.routeviews.org/
Memorandum. Accessed: May 3, 2018. [Online]. Available: [94] Spoofer Project. Accessed: Sep. 13, 2018. [Online]. Available:
https://www.acf.hhs.gov/sites/default/files/cb/im1504.pdf http://spoofer.cmand.org/index.php
[60] Oregon Route Views Project. Accessed: Sep. 13, 2018. [Online]. [95] Z. Durumeric, E. Wustrow, and J. A. Halderman, “Zmap: Fast Internet-
Available: http://www. routeviews.org/ wide scanning and its security applications,” in Proc. USENIX Security
[61] VERIS. T.W.A.S. Web-Hacking-Incident-Database. Accessed: Symp., vol. 8, 2013, pp. 605–620.
Sep. 13, 2018. [Online]. Available: http://projects.webappsec.org/ [96] Barracuda Reputation Blocklist. Accessed: Sep. 13, 2018. [Online].
w/page/13246995/Web-Hacking-Incident-Database Available: http://www.barracudacentral.org/
[62] Composite Blocking List. Accessed: Sep. 13, 2018. [Online]. Available: [97] CBL: Composite Blocking List. Accessed: Sep. 13, 2018. [Online].
http://cbl.abuseat.org/ Available: http://cbl.abuseat.org/
[63] The SPAMHAUS Project: SBL, XBL, PBL, ZEN Lists. Accessed: [98] The SPAMHAUS Project: SBL, XBL, PBL, ZEN Lists. Accessed:
Sep. 13, 2018. [Online]. Available: http://www.spamhaus.org/ Sep. 13, 2018. [Online]. Available: http://www.spamhaus.org/
[64] SpamCop Blocking List. Accessed: Sep. 13, 2018. [Online]. Available: [99] SpamCop Blocking List. Accessed: Sep. 13, 2018. [Online]. Available:
http://www.spamcop.net/ http://www.spamhaus.org/
[65] WPBL: Weighted Private Block List. Accessed: Sep. 13, 2018. [Online]. [100] WPBL: Weighted Private Block List. Accessed: Sep. 13, 2018. [Online].
Available: http://wpbl.info/ Available: http://www.wpbl.info/
[66] UCEPROTECTOR Network. Accessed: Sep. 13, 2018. [Online]. [101] UCEPROTECTOR Network. Accessed: Sep. 13, 2018. [Online].
Available: http://uceprotect.net/ Available: http://www.uceprotect.net/
[67] SURBL: URL Reputation Data. Accessed: Sep. 13, 2018. [Online]. [102] SURBL: URL REPUTATION DATA. Accessed: Sep. 13, 2018. [Online].
Available: http://www.surbl.org/ Available: http://www.surbl.org/
[68] PhishTank. Accessed: Sep. 13, 2018. [Online]. Available: [103] B. Viswanath, A. Mislove, M. Cha, and K. P. Gummadi, “On the evo-
http://www.PhishTank.com/ lution of user interaction in Facebook,” in Proc. 2nd ACM Workshop
[69] hpHosts for Your Pretection. Accessed: Sep. 13, 2018. [Online]. Online Soc. Netw., 2009, pp. 37–42.
Available: http://hosts-file.net/ [104] Enron Email Dataset. Accessed: Sep. 28, 2018. [Online]. Available:
[70] DShield. Accessed: Sep. 13, 2018. [Online]. Available: http://www.cs.cmu.edu/ enron/
http://www.dshield.org/ [105] Hackers Focus on Misconfigured Networks. Accessed: May 3,
[71] OpenBL. Accessed: Sep. 13, 2018. [Online]. Available: 2018. [Online]. Available: http://forums.cnet.com/7726-6132102-
http://www.openbl.org/ 3366976.html
[72] B. Prince. Top Data Breaches of 2014. Accessed: May 3, 2018. [106] D. Barr, “Common DNS operational and configuration errors,” Internet
[Online]. Available: www.securityweek.com/top-data-breaches-2014 Eng. Task Force, Fremont, CA, USA, RFC 1912, 1996.
[73] T.S. Institute. CSANS Institute Critical Security Controls. Accessed: [107] R. Mahajan, D. Wetherall, and T. Anderson, “Understanding BGP
Aug. 20, 2018. [Online]. Available: https://www.sans.org/critical- misconfiguration,” in Proc. ACM SIGCOMM Comput. Commun. Rev.,
security-controls vol. 32, 2002, pp. 3–16.


[108] P. Spirtes, C. Meek, and T. Richardson, “Causal inference in the [136] R. J. Kate, “A dependency-based word subsequence kernel,” in Proc.
presence of latent variables and selection bias,” in Proc. 11th Conf. Conf. Empir. Methods Nat. Lang. Process., 2008, pp. 400–409.
Uncertainty Artif. Intell., 1995, pp. 499–506. [137] B. J. Frey and D. Dueck, “Clustering by passing messages between
[109] Y. Miao et al., “Automated big traffic analytics for cyber security,” data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.
Comput. Res. Repository, vol. abs/1804.09023, 2018. [138] J. Kleinberg, “Bursty and hierarchical structure in streams,” Data Min.
[110] J. Zhang et al., “Network traffic classification using correlation Knowl. Disc., vol. 7, no. 4, pp. 373–397, 2003.
information,” IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 1, [139] W. Z. Khan, Y. Xiang, M. Y. Aalsalem, and Q. Arshad, “Mobile phone
pp. 104–117, Jan. 2013. sensing systems: A survey,” IEEE Commun. Surveys Tuts., vol. 15,
[111] L. Breiman, “Random forests,” Mach. Learn., vol. 45, no. 1, pp. 5–32, no. 1, pp. 402–427, 1st Quart., 2013.
2001. [140] J. Xu and W. B. Croft, “Query expansion using local and global
[112] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, document analysis,” in Proc. 19th Annu. Int. ACM SIGIR Conf. Res.
“Offline/realtime traffic classification using semi-supervised learning,” Develop. Inf. Retrieval, 1996, pp. 4–11.
Perform. Eval., vol. 64, nos. 9–12, pp. 1194–1213, 2007. [141] V. C. Raykar et al., “Learning from crowds,” J. Mach. Learn. Res.,
[113] A. Este, F. Gringoli, and L. Salgarelli, “Support vector machines vol. 11, pp. 1297–1322, Jan. 2010.
for TCP traffic classification,” Comput. Netw., vol. 53, no. 14, [142] Symantec Attack Signatures. Accessed: Sep. 28, 2018.
pp. 2476–2490, 2009. [Online]. Available: http://www.symantec.com/security_response/
[114] Symantec Research Labs. Accessed: Sep. 28, 2018. attacksignatures/
[Online]. Available: https://www.symantec.com/about/corporate- [143] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm
profile/technology/research-labs for optimal margin classifiers,” in Proc. 5th Annu. Workshop Comput.
[115] C. Collberg, S. Martin, J. Myers, and J. Nagra, “Distributed application Learn. Theory, 1992, pp. 144–152.
tamper detection via continuous software updates,” in Proc. 28th Annu. [144] K. Henderson et al., “It’s who you know: Graph mining using recursive
Comput. Security Appl. Conf., 2012, pp. 319–328. structural features,” in Proc. 17th ACM SIGKDD Int. Conf. Knowl.
[116] S. Schrittwieser, S. Katzenbeisser, J. Kinder, G. Merzdovnik, and Disc. Data Min., 2011, pp. 663–671.
E. Weippl, “Protecting software through obfuscation: Can it keep pace [145] M. Mintz, S. Bills, R. Snow, and D. Jurafsky, “Distant supervision
with progress in code analysis?” ACM Comput. Surveys, vol. 49, no. 1, for relation extraction without labeled data,” in Proc. Joint Conf. 47th
p. 4, 2016. Annu. Meeting ACL 4th Int. Joint Conf. Nat. Lang. Process. (AFNLP),
[117] C. Collberg, C. Thomborson, and D. Low, “A taxonomy of obfuscat- vol. 2, 2009, pp. 1003–1011.
ing transformations,” Dept. Comput. Sci., Univ. Auckland, Auckland, [146] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and
New Zealand, Rep. 148, 1997. R. C. Williamson, “Estimating the support of a high-dimensional
[118] S. Banescu, C. Collberg, V. Ganesh, Z. Newsham, and A. Pretschner, distribution,” Neural Comput., vol. 13, no. 7, pp. 1443–1471, 2001.
“Code obfuscation against symbolic execution attacks,” in Proc. 32nd [147] J. D. Howard, “An analysis of security incidents on the Internet
Annu. Conf. Comput. Security Appl., 2016, pp. 189–200. 1989–1995,” Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA,
[119] C. Cadar, D. Dunbar, and D. R. Engler, “Klee: Unassisted and USA, 1997.
automatic generation of high-coverage tests for complex systems pro- [148] J. D. Howard and T. A. Longstaff, “A common language for computer
grams,” in Proc. USENIX Symp. Oper. Syst. Design Implement., vol. 8, security incidents,” Sandia Nat. Lab., Albuquerque, NM, USA, and
2008, pp. 209–224. Sandia Nat. Lab., Livermore, CA, USA, Rep. SAND98-8667, 1998.
[120] Y. Shoshitaishvili, R. Wang, C. Hauser, C. Kruegel, and G. Vigna, [149] F. Björck, M. Henkel, J. Stirna, and J. Zdravkovic, “Cyber resilience—
“Firmalice—Automatic detection of authentication bypass vulnerabil- Fundamentals for a definition,” in New Contributions in Information
ities in binary firmware,” in Proc. Symp. Netw. Distrib. Syst. Security Systems and Technologies. Cham, Switzerland: Springer, 2015,
(NDSS), 2015, pp. 1–15. pp. 311–316.
[121] The R Project for Statistical Computing. Accessed: Sep. 28, 2018. [150] L. Jiang, V. Anantharam, and J. Walrand, “How bad are selfish invest-
[Online]. Available: https://www.r-project.org/ ments in network security?” IEEE/ACM Trans. Netw., vol. 19, no. 2,
[122] N. Leontiadis, T. Moore, and N. Christin, “Measuring and analyzing pp. 549–560, Apr. 2011.
search-redirection attacks in the illicit online prescription drug trade,” [151] J. Wu, K. Ota, M. Dong, and C. Li, “A hierarchical security framework
in Proc. USENIX Security Symp., vol. 11, 2011, p. 19. for defending against sophisticated attacks on wireless sensor networks
[123] L. Lu, R. Perdisci, and W. Lee, “SURF: Detecting and measur- in smart cities,” IEEE Access, vol. 4, pp. 416–424, 2016.
ing search poisoning,” in Proc. 18th ACM Conf. Comput. Commun. [152] Australia Cyber Security Centre. Accessed: Sep. 28, 2018. [Online].
Security, 2011, pp. 467–476. Available: https://www.acsc.gov.au/incident.html
[124] DNS-BH: Malware Domain Blocklist. Accessed: Sep. 28, 2018.
[Online]. Available: http:// www.malwaredomains.com/
[125] Google. Google Safe Browsing API. Accessed: Sep. 28, 2018. [Online].
Available: https://code.google.com/apis/safebrowsing/
[126] MalwareBytes. hpHosts Online. Accessed: Sep. 28, 2018. [Online].
Available: http://www. hosts-file.net/
[127] M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander, “OPTICS-OF:
Identifying local outliers,” in Proc. Principles Data Min. Knowl. Disc.,
1999, pp. 262–270.
[128] Hackmageddon. Accessed: Sep. 28, 2018. [Online]. Available:
https://www.hackmageddon.com
[129] Privacy Rights. Accessed: Sep. 28, 2018. [Online]. Available:
https://www.privacyrights.org/data-breaches
[130] R. McMillan. (2013). Open Threat Intelligence. [Online]. Available:
https://www.gartner.com/doc/2487216/definition-threat-intelligence
[131] (2016). IOCbucket. [Online]. Available: https://www.iocbucket.com/
[132] A Community OpenIOC Resource. Accessed: Sep. 28, 2018. [Online].
Available: https://openiocdb.com/
[133] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes twitter
users: Real-time event detection by social sensors,” in Proc. 19th Int. Nan Sun received the bachelor’s degree (Hons.)
Conf. World Wide Web, 2010, pp. 851–860. in information technology from Deakin University,
[134] A. Signorini, A. M. Segre, and P. M. Polgreen, “The use of Twitter to Geelong, VIC, Australia, in 2016, where she is
track levels of disease activity and public concern in the U.S. during currently pursuing the Ph.D. degree. Her current
the influenza A H1N1 pandemic,” Public Library Sci. (PLoS) ONE, research interests include cybersecurity and social
vol. 6, no. 5, 2011, Art. no. e19467. network security.
[135] A. Tumasjan, T. O. Sprenger, P. G. Sandner, and I. M. Welpe,
“Predicting elections with Twitter: What 140 characters reveal about
political sentiment,” in Proc. Int. Conf. Weblogs Soc. Media (ICWSM),
vol. 10, 2010, pp. 178–185.


Jun Zhang (M'12-SM'18) received the Ph.D. degree in computer science from the University of Wollongong, Wollongong, Australia, in 2011. He is an Associate Professor with the School of Software and Electrical Engineering and the Deputy Director of the Swinburne Cybersecurity Lab, Swinburne University of Technology, Australia. He has published over 90 research papers in refereed international journals and conferences, such as the IEEE COMMUNICATIONS SURVEYS AND TUTORIALS, the IEEE/ACM TRANSACTIONS ON NETWORKING, the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, the ACM Conference on Computer and Communications Security, and the ACM Asia Conference on Computer and Communications Security. His research interests include cybersecurity and applied machine learning. He has been internationally recognized as an active researcher in cybersecurity, evidenced by his chairing (PC Chair, Workshop Chair, or Publicity Chair) of eight international conferences since 2013, and by invited keynote addresses at two conferences and an invited lecture to the IEEE SMC Victorian Chapter.

Paul Rimba received the Ph.D. degree in software engineering from the University of New South Wales, Australia, in 2016, with a focus on building high-assurance secure applications using security patterns. He is a Research Scientist with the Software and Computational Systems Group of Data61, CSIRO. He specializes in security patterns for capability-based platforms and provides assurances about the specialized patterns through formal verification. His research focuses on empirical and security analysis of blockchain systems.

Shang Gao received the Ph.D. degree in computer science from Northeastern University, Shenyang, China, in 2000. She was a Post-Doctoral Research Fellow with the University of Technology Sydney and an Associate Lecturer with Central Queensland University, Australia. She is currently a Lecturer with the School of IT, Deakin University, Geelong, Australia. Her research interests include networking, adaptive learning, big data processing, cyber security, and cloud computing.

Leo Yu Zhang (S'14-M'17) received the B.S. and M.S. degrees in computational mathematics from Xiangtan University in 2009 and 2012, respectively, and the Ph.D. degree from the City University of Hong Kong in 2016. He is currently a Lecturer with the School of Information Technology, Deakin University, Australia. He has held various research positions with the City University of Hong Kong, the University of Macau, the University of Ferrara, and the University of Bologna. His research interests include cloud security, multimedia security, and compressed sensing.

Yang Xiang (M'07-SM'12) received the Ph.D. degree in computer science from Deakin University, Australia. He is currently a Full Professor and the Dean of the Digital Research and Innovation Capability Platform, Swinburne University of Technology, Australia. His research interests include cyber security, which covers network and system security, data analytics, distributed systems, and networking. In particular, he is currently leading his team in developing active defense systems against large-scale distributed network attacks. He is the Chief Investigator of several projects in network and system security, funded by the Australian Research Council. He has published over 200 research papers in many international journals and conferences. He served as an Associate Editor for the IEEE TRANSACTIONS ON COMPUTERS, the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, and Security and Communication Networks (Wiley), and as an Editor for the JOURNAL OF NETWORK AND COMPUTER APPLICATIONS. He is the Coordinator (Asia) for the IEEE Computer Society Technical Committee on Distributed Processing.
