Breaching Data Analytics Paper
Abstract—Analyzing cyber incident data sets is an important method for deepening our understanding of the evolution of the threat situation. This is a relatively new research topic, and many studies remain to be done. In this paper, we report a statistical analysis of a breach incident data set corresponding to 12 years (2005–2017) of cyber hacking activities that include malware attacks. We show that, in contrast to the findings reported in the literature, both hacking breach incident inter-arrival times and breach sizes should be modeled by stochastic processes, rather than by distributions, because they exhibit autocorrelations. Then, we propose particular stochastic process models to, respectively, fit the inter-arrival times and the breach sizes. We also show that these models can predict the inter-arrival times and the breach sizes. In order to get deeper insights into the evolution of hacking breach incidents, we conduct both qualitative and quantitative trend analyses on the data set. We draw a set of cybersecurity insights, including that the threat of cyber hacks is indeed getting worse in terms of their frequency, but not in terms of the magnitude of their damage.

Index Terms—Hacking breach, data breach, cyber threats, cyber risk analysis, breach prediction, trend analysis, time series, cybersecurity data analytics.

I. INTRODUCTION

DATA breaches are among the most devastating cyber incidents. The Privacy Rights Clearinghouse [1] reports 7,730 data breaches between 2005 and 2017, accounting for 9,919,228,821 breached records. The Identity Theft Resource Center and Cyber Scout [2] report 1,093 data breach incidents in 2016, which is 40% higher than the 780 data breach incidents in 2015. The United States Office of Personnel Management (OPM) [3] reports that the personnel information of 4.2 million current and former Federal government employees, and the background investigation records of current, former, and prospective Federal employees and contractors (including 21.5 million Social Security Numbers), were stolen in 2015. The monetary price incurred by data breaches is also substantial. IBM [4] reports that in year 2016, the global average cost for each lost or stolen record containing sensitive or confidential information was $158. NetDiligence [5] reports that in year 2016, the median number of breached records was 1,339, the median per-record cost was $39.82, the average breach cost was $665,000, and the median breach cost was $60,000.

While technological solutions can harden cyber systems against attacks, data breaches continue to be a big problem. This motivates us to characterize the evolution of data breach incidents. This will not only deepen our understanding of data breaches, but also shed light on other approaches for mitigating the damage, such as insurance. Many believe that insurance will be useful, but the development of accurate cyber risk metrics to guide the assignment of insurance rates is beyond the reach of the current understanding of data breaches (e.g., owing to the lack of modeling approaches) [6].

Recently, researchers started modeling data breach incidents. Maillart and Sornette [7] studied the statistical properties of the personal identity losses in the United States between year 2000 and 2008 [8]. They found that the number of breach incidents increased dramatically from 2000 to July 2006 but remained stable thereafter. Edwards et al. [9] analyzed a dataset containing 2,253 breach incidents that span over a decade (2005 to 2015) [1]. They found that neither the size nor the frequency of data breaches has increased over the years. Wheatley et al. [10] analyzed a dataset that is combined from [8] and [1] and corresponds to organizational breach incidents between year 2000 and 2015. They found that the frequency of large breach incidents (i.e., the ones that breach more than 50,000 records) occurring to US firms is independent of time, but the frequency of large breach incidents occurring to non-US firms exhibits an increasing trend.

The present study is motivated by several questions that have not been investigated until now, such as: Are data breaches caused by cyber attacks increasing, decreasing, or stabilizing? A principled answer to this question will give us a clear insight into the overall situation of cyber threats. This question was not answered by previous studies. Specifically, the dataset analyzed in [7] only covered the time span from 2000 to 2008 and does not necessarily contain the breach incidents that are caused by cyber attacks; the dataset analyzed in [9] is more recent, but contains two kinds of incidents: negligent breaches (i.e., incidents caused by lost, discarded, or stolen devices and other reasons) and malicious breaching. Since negligent breaches represent more human errors than cyber attacks, we do not consider them in the present study. Because the malicious breaches studied in [9] contain four sub-categories, namely hacking (including malware), insider, payment card fraud, and unknown, this study will focus on the hacking sub-category (called the hacking breach dataset hereafter), while noting that the other three sub-categories are interesting on their own and should be analyzed separately.

Manuscript received November 22, 2017; revised March 16, 2018 and April 23, 2018; accepted April 28, 2018. Date of publication May 16, 2018; date of current version May 23, 2018. This work was supported in part by ARL under Grant W911NF-17-2-0127. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mauro Conti. (Corresponding author: Shouhuai Xu.)
M. Xu is with the Department of Mathematics, Illinois State University, Normal, IL 61761 USA.
K. M. Schweitzer and R. M. Bateman are with the U.S. Army Research Laboratory South (Cyber), San Antonio, TX 78284 USA.
S. Xu is with the Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX 78249 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2018.2834227
A. Our Contributions

In this paper, we make the following three contributions. First, we show that both the hacking breach incident inter-arrival times (reflecting incident frequency) and the breach sizes should be modeled by stochastic processes, rather than by distributions. We find that a particular point process can adequately describe the evolution of the hacking breach incidents inter-arrival times and that a particular ARMA-GARCH model can adequately describe the evolution of the hacking breach sizes, where ARMA is the acronym for "AutoRegressive and Moving Average" and GARCH is the acronym for "Generalized AutoRegressive Conditional Heteroskedasticity." We show that these stochastic process models can predict the inter-arrival times and the breach sizes. To the best of our knowledge, this is the first paper showing that stochastic processes, rather than distributions, should be used to model these cyber threat factors.

Second, we discover a positive dependence between the incidents inter-arrival times and the breach sizes, and show that this dependence can be adequately described by a particular copula. We also show that when predicting inter-arrival times and breach sizes, it is necessary to consider this dependence; otherwise, the prediction results are not accurate. To the best of our knowledge, this is the first work showing the existence of this dependence and the consequence of ignoring it.

Third, we conduct both qualitative and quantitative trend analyses of the cyber hacking breach incidents. We find that the situation is indeed getting worse in terms of the incidents inter-arrival times because hacking breach incidents become more and more frequent, but the situation is stabilizing in terms of the incident breach size, indicating that the damage of individual hacking breach incidents will not get much worse.

We hope the present study will inspire more investigations, which can offer deep insights into alternate risk mitigation approaches. Such insights are useful to insurance companies, government agencies, and regulators because they need to deeply understand the nature of data breach risks.

B. Related Work

1) Prior Works Closely Related to the Present Study: Maillart and Sornette [7] analyzed a dataset [8] of 956 personal identity loss incidents that occurred in the United States between year 2000 and 2008. They found that the personal identity losses per incident, denoted by X, can be modeled by a heavy-tail distribution Pr(X > n) ∼ n^(−α), where α = 0.7 ± 0.1. This result remains valid when dividing the dataset per type of organization: business, education, government, and medical institution. Because the probability density function of the identity losses per incident is static, the situation of identity loss is stable from the point of view of the breach size.

Edwards et al. [9] analyzed a different breach dataset [1] of 2,253 breach incidents that span over a decade (2005 to 2015). These breach incidents include two categories: negligent breaches (i.e., incidents caused by lost, discarded, or stolen devices, or other reasons) and malicious breaching (i.e., incidents caused by hacking, insider, and other reasons). They showed that the breach size can be modeled by the log-normal or log-skewnormal distribution and the breach frequency can be modeled by the negative binomial distribution, implying that neither the breach size nor the breach frequency has increased over the years.

Wheatley et al. [10] analyzed an organizational breach incidents dataset that is combined from [8] and [1] and spans over a decade (year 2000 to 2015). They used the Extreme Value Theory [11] to study the maximum breach size, and further modeled the large breach sizes by a doubly truncated Pareto distribution. They also used linear regression to study the frequency of the data breaches, and found that the frequency of large breaching incidents is independent of time for United States organizations, but shows an increasing trend for non-US organizations.

There are also studies on the dependence among cyber risks. Böhme and Kataria [12] studied the dependence between cyber risks at two levels: within a company (internal dependence) and across companies (global dependence). Herath and Herath [13] used the Archimedean copula to model cyber risks caused by virus incidents, and found that there exists some dependence between these risks. Mukhopadhyay et al. [14] used a copula-based Bayesian Belief Network to assess cyber vulnerability. Xu and Hua [15] investigated using copulas to model dependent cyber risks. Xu et al. [16] used copulas to investigate the dependence encountered when modeling the effectiveness of cyber defense early-warning. Peng et al. [17] investigated multivariate cybersecurity risks with dependence.

Compared with all of these studies, the present paper is unique in that it uses a new methodology to analyze a new perspective of breach incidents (i.e., cyber hacking breach incidents). This perspective is important because it reflects the consequence of cyber hacking (including malware). The new methodology found, for the first time, that both the incidents inter-arrival times and the breach sizes should be modeled by stochastic processes rather than distributions, and that there exists a positive dependence between them.

2) Other Prior Works Related to the Present Study: Eling and Loperfido [18] analyzed a dataset [1] from the point of view of actuarial modeling and pricing. Bagchi and Udo [19] used a variant of the Gompertz model to analyze the growth of computer and Internet-related crimes. Condon et al. [20] used the ARIMA model to predict security incidents based on a dataset provided by the Office of Information Technology at the University of Maryland. Zhan et al. [21] analyzed the posture of cyber threats by using a dataset collected at a network telescope. Using datasets collected at a honeypot, Zhan et al. [22], [23] exploited their statistical properties, including long-range dependence and extreme values, to describe and predict the number of attacks against the honeypot; a predictability evaluation of a related dataset is described in [24]. Peng et al. [25] used a marked point process to predict extreme attack rates. Bakdash et al. [26] extended these studies into related cybersecurity scenarios. Liu et al. [27] investigated how to use externally observable features of a network (e.g., mismanagement symptoms) to forecast the potential of data breach incidents to that network. Sen and Borle [28] studied the factors that could increase or decrease the contextual risk of data breaches, by using tools that include the opportunity theory of crime, the institutional anomie theory, and the institutional theory.
2858 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 13, NO. 11, NOVEMBER 2018
TABLE I
SUMMARY OF NOTATIONS (r.v. STANDS FOR RANDOM VARIABLE)
D. Remark

In this paper, we use a number of statistical techniques, a thorough review of which would be lengthy. In order to comply with the space requirement, we only briefly review these techniques at a high level, and refer the reader to specific references for each technique when it is used. We use the autoregressive conditional mean point process [30], [31], which was introduced for describing the evolution of conditional means, to model the evolution of the inter-arrival times. We use the ARMA-GARCH time series model [32], [33] to model the evolution of the breach sizes, where the ARMA part models the evolution of the mean of the breach sizes and the GARCH part models their high volatility. We use copulas [34], [35] to model the nonlinear dependence between the inter-arrival times and the breach sizes. Table I summarizes the main notations used in the paper.

skewness, which make it difficult to model the breach sizes. We observe a large volatility in the breach sizes and the volatility clustering phenomenon of large (small) changes followed by large (small) changes. We also observe that some breach sizes are especially large (meaning severe hacking breach incidents). We will pay particular attention to modeling these extreme breach incidents.
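The division of labor described above, with the ARMA part driving the conditional mean and the GARCH part driving the conditional volatility, can be illustrated by a short simulation. This is only a sketch: the parameter values below are made up for illustration and are not the estimates fitted to the breach data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Illustrative parameters (not the paper's fitted values).
phi, theta, mu = 0.5, 0.2, 0.0          # ARMA(1, 1): conditional mean
omega, alpha, beta = 0.1, 0.1, 0.8      # GARCH(1, 1): conditional variance

y = np.zeros(n)                                 # simulated series (e.g., log breach sizes)
eps = np.zeros(n)                               # innovations
sig2 = np.full(n, omega / (1 - alpha - beta))   # conditional variances

for t in range(1, n):
    # GARCH part: today's variance reacts to yesterday's squared shock and
    # yesterday's variance, which produces volatility clustering.
    sig2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sig2[t - 1]
    eps[t] = np.sqrt(sig2[t]) * rng.normal()
    # ARMA part: the conditional mean of the series.
    y[t] = mu + phi * y[t - 1] + theta * eps[t - 1] + eps[t]
```

Runs of large shocks follow earlier large shocks (the clustering phenomenon noted for the breach sizes), while the level of the series is governed by the ARMA recursion.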
Fig. 3. The sample ACF and PACF of incidents inter-arrival times. (a) ACF of inter-arrival times. (b) PACF of inter-arrival times.

Fig. 4. The sample ACF and PACF of log-transformed breach sizes. (a) ACF of transformed breach sizes. (b) PACF of transformed breach sizes.
the observations in between them, and PACF measures the correlation between the observations at earlier times and the observations at later times while disregarding the observations in between them. The formal definitions of ACF and PACF are given in Appendix A. ACF and PACF are widely used to detect temporal correlations in time series [36], [37].

Figure 3 plots the sample ACF and PACF, respectively. We observe correlations in both plots because there are correlation values that exceed the dashed blue lines (i.e., the threshold values, which are derived based on the asymptotic statistical theory [36], [38]). This means that there are significant correlations between the inter-arrival times and that the inter-arrival times do not follow the exponential distribution. Moreover, we should use a stochastic process to describe the inter-arrival times [39]. In summary, we have:

Insight 1: The hacking breach incidents inter-arrival times exhibit some clusters of small inter-arrival times (i.e., multiple incidents occur within a short period of time) and the incidents are irregularly spaced. Moreover, there are correlations between the inter-arrival times, meaning that the inter-arrival times should be modeled by an appropriate stochastic process rather than by a distribution.

B. Basic Analysis of Hacking Breach Sizes

TABLE III
STATISTICS OF HACKING BREACH SIZES, WHERE 'SD' STANDS FOR STANDARD DEVIATION

Table III summarizes the basic statistics of the hacking breach sizes. We observe that three Business categories have much larger mean breach sizes than the others. We further observe that there exists a large standard deviation for the breach size in each of the victim categories, and that the standard deviation is always much larger than the corresponding mean. Figure 2(b) plots the log-transformed breach sizes because, as we can observe from Table III, the breach sizes exhibit large volatility and skewness (indicated by the substantial difference between the median and the mean values), which make them hard to model without transformations.

In order to answer the question whether the breach sizes should be modeled by a distribution or by a stochastic process, we plot the temporal correlations between the breach sizes. Figures 4(a) and 4(b) plot the sample ACF and PACF for the log-transformed breach sizes, respectively. We observe correlations between the breach sizes, meaning that we should use a stochastic process, rather than a distribution, to model the breach sizes [33], [36]. This is in contrast to the insight offered by previous studies [7], [18], which suggest using a skewed distribution to model the breach sizes. We attribute this difference to the fact that these studies [7], [18] did not look into the perspective of temporal correlations. Whether a distribution or a stochastic process should be used to describe a quantity depends on whether or not there is temporal autocorrelation between the individual samples: zero temporal autocorrelation means that the samples are independent of each other, whereas non-zero temporal autocorrelation means that they are not independent of each other and should not be modeled by a distribution.

Insight 2: The hacking breach sizes exhibit a large volatility, a large skewness, and a volatility clustering phenomenon, namely large (small) changes followed by large (small) changes. Moreover, there are correlations between the breach sizes, implying that they should be modeled by an appropriate stochastic process rather than by a distribution.

IV. MODELING THE HACKING BREACH DATASET

In this section, we develop a novel statistical model to fit the breach dataset, or more specifically the in-sample of 320 incidents. The fitted model will be used for prediction, which will be evaluated on the out-of-sample of 280 incidents (Section V).

A. Modeling the Inter-Arrival Times

Insight 1 suggests that we model the hacking breach incidents inter-arrival times with an autoregressive conditional duration (ACD) model, which was originally introduced to model the evolution of the inter-arrival time, or duration, between
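The ACF-based diagnostic behind Insights 1 and 2 can be reproduced in a few lines. The two series below are synthetic stand-ins, not the breach data: memoryless exponential inter-arrival times stay inside the approximate white-noise band (the "dashed blue lines"), while an autocorrelated series escapes it.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation function at lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xm = x - x.mean()
    denom = np.sum(xm ** 2)
    return np.array([np.sum(xm[k:] * xm[:n - k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
n = 2000
band = 1.96 / np.sqrt(n)        # approximate 95% band for white noise

iid = rng.exponential(size=n)   # memoryless durations: ACF stays within the band

ar = np.zeros(n)                # autocorrelated series: ACF exceeds the band
for t in range(1, n):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()
```

When the sample ACF at low lags exceeds the band, the samples are not independent, which is exactly the criterion used above to rule out a plain distributional model.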
XU et al.: MODELING AND PREDICTING CYBER HACKING BREACHES 2861
TABLE V
THE p-VALUES OF STATISTICAL TESTS FOR THE RESIDUALS
Fig. 7. The qq-plots of the residuals of ARMA(1, 1)-GARCH(1, 1) with innovations following different distributions for fitting the log-transformed breach sizes. (a) The qq-plot of the skewed Student-t. (b) The qq-plot of the mixed distribution.

Fig. 8. Normal score plot and fitted contour plot. (a) Normal scores plot. (b) Gumbel contour plot.

propose an extreme value mixture distribution for describing the innovations.

The Extreme Value Theory (EVT) [32], [45] is a useful tool for modeling heavy-tailed distributions. A popular method is known as the peaks over threshold (POT) approach. Given a sequence of i.i.d. observations X_1, . . . , X_n, the excesses X_i − μ over some suitably high threshold μ can be modeled, under certain mild conditions, by the generalized Pareto distribution (GPD). The survival function of the GPD is

    Ḡ_{ξ,σ,μ}(x) = 1 − G_{ξ,σ,μ}(x) = (1 + ξ(x − μ)/σ)_+^(−1/ξ),   if ξ ≠ 0,
    Ḡ_{ξ,σ,μ}(x) = 1 − G_{ξ,σ,μ}(x) = exp(−(x − μ)/σ),            if ξ = 0,

where x ≥ μ if ξ ∈ R_+ and x ∈ [μ, μ − σ/ξ] if ξ ∈ R_−, and ξ and σ are respectively called the shape and scale parameters.

Because Figure 7(a) shows that both tails cannot be modeled by the skewed Student-t distribution, we propose modeling both tails with the GPD and modeling the middle part with the normal distribution. This leads to a mixed extreme value distribution that is used to model the innovations as follows:

    G_m(x) = p_l [1 − G(−x | ξ_l, σ_l, −μ_l)],   if x ≤ μ_l,
    G_m(x) = p_l + (1 − p_l − p_u) [Φ(x | μ_m, σ_m) − Φ(μ_l | μ_m, σ_m)] / [Φ(μ_u | μ_m, σ_m) − Φ(μ_l | μ_m, σ_m)],   if μ_l < x < μ_u,
    G_m(x) = 1 − p_u + p_u G(x | ξ_u, σ_u, μ_u),   if x ≥ μ_u,

where p_l = P(X ≤ μ_l) and p_u = P(X > μ_u) are the probabilities corresponding to the tails, Φ(· | μ_m, σ_m) denotes the normal distribution function, and μ_m and σ_m are respectively the mean and the standard deviation of the normal distribution. It is worth mentioning that a similar idea has been used to model the impact of the financial crisis on stock and index returns [46], [47].

The estimated parameters for the tail proportions are (p_l, p_u) = (0.126, 0.098), which means that each of the two tails accounts for about 10% of the observations. The estimated parameters (μ̂_m, σ̂_m, μ̂_l, σ̂_l, ξ̂_l, μ̂_u, σ̂_u, ξ̂_u) for the normal and GPD distributions are (−0.002, 0.963, −1.105, 0.877, −0.694, 1.243, 0.471, 0.001). It is interesting to note that the upper tail shape parameter ξ̂_u = .001 indicates that the upper tail is heavy. The qq-plot in Figure 7(b) indicates that the mixed distribution describes the tails well because all of the points are around the 45-degree line. This leads to:

Insight 4: The log-transformed hacking breach sizes exhibit a significant temporal correlation, and therefore should be modeled by a stochastic process rather than a distribution. Moreover, the log-transformed hacking breach sizes exhibit the volatility clustering phenomenon with possibly extremely large breach sizes. These two properties lead to the development of ARMA(1, 1)-GARCH(1, 1) with innovations that follow a mixed extreme value distribution, which can adequately describe the evolution of the log-transformed breach sizes. Note that the ARMA(1, 1) part models the means of the observations and the GARCH(1, 1) part models the large volatility exhibited by the data.

C. Dependence Between Inter-Arrival Times and Breach Sizes

In order to answer the question whether or not there exists dependence between the inter-arrival times and the breach sizes, we propose applying the normal score transformation [35] to the residuals that are obtained after fitting these two time series. For the residuals of the LACD1 fitting, denoted by e_1, . . . , e_n, we use the fitted generalized gamma distribution G(· | γ, k) to convert them into empirical normal scores:

    e_i → Φ^(−1)(G(e_i | γ, k)),   i = 1, . . . , n,

where Φ^(−1) is the inverse of the standard normal distribution function. For the residuals of the ARMA(1, 1)-GARCH(1, 1) fitting, we use the estimated mixed extreme value distribution to convert them into empirical normal scores.

Figure 8(a) plots the bivariate normal scores. We observe that large transformed durations are associated with large transformed sizes, implying a positive dependence between the inter-arrival times and the breach sizes. In order to statistically test the dependence, we compute the sample Kendall's τ and Spearman's ρ for the incidents inter-arrival times and the breach sizes, which are 0.07578 and 0.11515, respectively. The nonparametric rank tests [43] for the two statistics lead to p-values of 0.04313 and 0.03956, respectively, which are very small. This means that there indeed exists some positive dependence between the inter-arrival times and the breach sizes.
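The peaks-over-threshold construction behind the mixed extreme value distribution can be sketched with scipy. The data below are a synthetic heavy-tailed stand-in (not the fitted residuals), and the 10%/90% thresholds are illustrative.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(7)
x = rng.standard_t(df=4, size=20000)   # synthetic heavy-tailed "innovations"

# Upper tail: model the excesses over a high threshold with the GPD.
mu_u = np.quantile(x, 0.90)
excess = x[x > mu_u] - mu_u
xi_u, _, sigma_u = genpareto.fit(excess, floc=0.0)  # shape and scale, loc fixed at 0

# Middle part: model the bulk between the two thresholds with a normal
# distribution; the lower tail would be handled symmetrically by fitting
# a GPD to -(x - mu_l) for x below mu_l.
mu_l = np.quantile(x, 0.10)
middle = x[(x > mu_l) & (x < mu_u)]
m_mean, m_sd = middle.mean(), middle.std()
```

Stitching the two GPD tails and the normal middle together with the weights p_l and p_u yields exactly the piecewise form of G_m(x) described in the text.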
1: for i = m + 1, . . . , n do
2:   Estimate the LACD1 model of the incidents inter-arrival times based on {d_s | s = 1, . . . , i − 1}, and predict the conditional mean ψ_i = exp(ω + a_1 log d_{i−1} + b_1 log ψ_{i−1});
3:   Estimate the ARMA-GARCH model of the log-transformed sizes, and predict the next mean μ̂_i and standard error σ̂_i;
4:   Select a suitable copula for the bivariate residuals from the previous models based on AIC;
5:   Based on the estimated copula, simulate 10000 2-dimensional copula samples (u_{1,i}^(k), u_{2,i}^(k)), k = 1, . . . , 10000;
6:   For the incidents inter-arrival times, convert the simulated dependent samples u_{1,i}^(k) into the z_{1,i}^(k) by using the inverse of the estimated generalized gamma distribution, k = 1, . . . , 10000;
7:   For the breach sizes, convert the simulated dependent samples u_{2,i}^(k) into the z_{2,i}^(k) by using the inverse of the estimated mixed extreme value distribution, k = 1, . . . , 10000;
8:   Compute the predicted 10000 2-dimensional breach data (d_i^(k), y_i^(k)), k = 1, . . . , 10000, based on Eq. (IV.1) and (IV.3), respectively;
9:   Compute the VaR_{α,d}(i) for the incidents inter-arrival times and the VaR_{α,y}(i) for the log-transformed breach sizes based on the simulated breach data;
10:  if d_i > VaR_{α,d}(i) then
11:    A violation to the incidents inter-arrival time occurs;
12:  end if
13:  if y_i > VaR_{α,y}(i) then
14:    A violation to the breach size occurs;
15:  end if
16: end for
Output: Numbers of violations in inter-arrival times and breach sizes.

values, we use the following three popular tests [54]. The first test is the unconditional coverage test, denoted by LRuc, which evaluates whether or not the observed fraction of violations is significantly different from the fraction expected under the model. The second test is the conditional coverage test, denoted by LRcc, which is a joint likelihood ratio test for the independence of violations and unconditional coverage. The third test is the dynamic quantile test (DQ) [55], which is based on the sequence of 'hit' variables.

B. Algorithm for Separate Prediction and Results

We use Algorithm 1 to perform the recursive rolling prediction for the inter-arrival times and the breach sizes. Because we use rolling prediction, the training data grows as the prediction operation moves forward, and the newly enlarged training data needs to be re-fitted, possibly requiring different copula models. As such, we need to consider more dependence structures. This explains why we need to re-select the copula structure, which can fit the newly updated training data better, via the AIC criterion (see Step 4 of Algorithm 1).

Table VIII reports the prediction results. We observe that the prediction models pass all of the tests at the .1 significance level. In particular, the models can predict the future inter-arrival times for all of the α levels. For the breach sizes, at level α = .90, the model predictions have 28 violations, while the number of violations from the observed values is 31, which is fairly close. For α = .95, the number of violations from the observed values is 20, while the model's expected number of violations is 14. This indicates that the models for predicting the future breach sizes are somewhat conservative.

Figure 9 plots the prediction results for the 280 out-of-samples. Figure 9(a) plots the prediction results for the incidents inter-arrival times. Figure 9(c) plots the original breach sizes, but it is hard to inspect visually. For a better visualization effect, we plot in Figure 9(b) the log-transformed breach sizes. We observe from Figure 9(c) that for the breach sizes, there are several extremely large values, which are far from the predicted VaR.95's. This means that the prediction missed some of the extremely large breaches, the prediction of which is left as an open problem.

In conclusion, the proposed models can effectively predict the VaR's of both the incidents inter-arrival times and the breach sizes, because they both pass the three statistical tests. However, there are several extremely large inter-arrival times and extremely large breach sizes that are far above the predicted VaR.95's, meaning that the proposed models may not be able to precisely predict the exact values of the extremely large inter-arrival times or the extremely large breach sizes. Nevertheless, as shown in Section V-C below, our models can predict the joint probabilities that an incident of a certain magnitude of breach size will occur during a future period of time.

C. Algorithm for Joint Prediction and Results

In practice, it is important to know the joint probability that the next breach incident of a particular size happens at a particular time (i.e., with a particular inter-arrival time). For this purpose, we consider the 10000 values predicted
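Steps 5-7 of Algorithm 1 can be sketched as follows. For illustration we assume a Gaussian copula and simple stand-in marginals; the paper itself re-selects the copula family by AIC at each step and inverts the fitted generalized gamma and mixed extreme value distributions.

```python
import numpy as np
from scipy.stats import norm, gamma, lognorm

rng = np.random.default_rng(0)
K = 10000
rho = 0.3  # assumed copula parameter (illustrative)

# Step 5 (sketch): K samples from a bivariate Gaussian copula.
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=K)
u = norm.cdf(z)                                   # dependent uniforms (u1, u2)

# Steps 6-7 (sketch): invert hypothetical fitted marginals.
d_sim = gamma.ppf(u[:, 0], a=1.2, scale=2.0)      # simulated inter-arrival times
y_sim = lognorm.ppf(u[:, 1], s=1.5, scale=1e4)    # simulated breach sizes

# Step 9 (sketch): VaR_alpha as the empirical alpha-quantile of the simulation.
var_d = np.quantile(d_sim, 0.95)
var_y = np.quantile(y_sim, 0.95)
```

A violation is then flagged whenever an observed value exceeds the corresponding simulated quantile, mirroring steps 10-15.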
Fig. 9. Predicted inter-arrival times and breach sizes, where black-colored circles represent the observed values. (a) Incidents inter-arrival times.
(b) Log-transformed breach sizes. (c) Breach sizes (prior to the transformation).
TABLE IX
PREDICTED JOINT PROBABILITIES OF INCIDENTS INTER-ARRIVAL TIMES AND BREACH SIZES, WHERE "PROB." IS THE PROBABILITY OF A CERTAIN PREDICTED BREACH SIZE y_t OCCURRING WITHIN THE NEXT TIME d_t ∈ (0, ∞)
by Algorithm 1. Specifically, we consider several combinations of (d_i, y_{t_i}), where d_i = t_i − t_{i−1} and y_{t_i} is the breach size at time t_i, for i = 1, . . . , n as mentioned above.

We divide the predicted inter-arrival time of the next breach incident into the following time intervals: (i) longer than one month, or d_t ∈ (30, ∞); (ii) between two weeks and one month, or d_t ∈ (14, 30]; (iii) between one and two weeks, or d_t ∈ (7, 14]; (iv) between one day and one week, or d_t ∈ (1, 7]; (v) within one day, or d_t ∈ (0, 1]. Similarly, we divide the predicted breach size of the next breach incident into the following size intervals: (i) greater than one million records, or y_t ∈ (1 × 10^6, ∞), indicating a large breach; (ii) y_t ∈ (5 × 10^5, 1 × 10^6]; (iii) y_t ∈ (1 × 10^5, 5 × 10^5]; (iv) y_t ∈ (5 × 10^4, 1 × 10^5]; (v) y_t ∈ (1 × 10^4, 5 × 10^4]; (vi) y_t ∈ (5 × 10^3, 1 × 10^4]; (vii) y_t ∈ (1 × 10^3, 5 × 10^3]; (viii) smaller than 1,000 records, or y_t ∈ [1, 1 × 10^3], indicating a small breach. We use the models mentioned above to fit these bivariate observations, and predict the joint events by using Algorithm 1 (steps 2-8).

Table IX describes the predicted probabilities of joint events (d_t, y_t) using the copula model, as well as the joint probabilities predicted by the benchmark model, which makes the independence assumption between the incidents inter-arrival times and the breach sizes. We observe that the copula model's probabilities are different from those of the benchmark model. For example, the probability of a data breach is .0460 for breach sizes exceeding one million records (i.e., severe breach incidents), namely y_t ∈ (1 × 10^6, ∞), while the probability based on the benchmark model is only .0339. Moreover, when we look at the joint event of inter-arrival time d_t ∈ (0, 7) and breach size y_t ∈ (1 × 10^6, ∞), the copula model predicts the probability as .0332, whereas the benchmark model predicts the probability as .0255. This means that the benchmark model underestimates the severity of data breach incidents.

We further observe that both models predict that there will be a breach incident occurring within a month, where the copula model predicts the probability of this incident as .9976, and the benchmark model predicts this probability as .9969. This indicates that almost certainly a data breach incident will happen within a month. Further, the copula model predicts a probability of .7783 that a breach incident will occur within a week, while the benchmark model predicts
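The interval probabilities reported in Table IX are, in effect, empirical frequencies over the simulated pairs. A minimal sketch, using synthetic stand-ins for the 10000 simulated (d, y) pairs rather than the fitted models:

```python
import numpy as np

rng = np.random.default_rng(2)
d_sim = rng.exponential(5.0, size=10000)                # stand-in inter-arrival times
y_sim = rng.lognormal(mean=9.0, sigma=2.5, size=10000)  # stand-in breach sizes

# Joint probability of a breach within a week that exceeds one million records.
p_joint = np.mean((d_sim <= 7) & (y_sim > 1e6))

# Marginal probability of some breach occurring within a month.
p_month = np.mean(d_sim <= 30)
```

Under an independence (benchmark) model the joint probability would instead be the product of the two marginal frequencies, which is the assumption the copula model corrects.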
TABLE X
QUANTITATIVE TREND ANALYSIS STATISTICS OF HACKING BREACH INCIDENTS, WHERE 'SD' STANDS FOR STANDARD DEVIATION
Fig. 11. The estimated VaR.9 ’s of the hacking breach incidents inter-arrival Fig. 12. Using the ARMA-GARCH model to decompose the log-transformed
times based on the LACD1 model. breach sizes into a trend part and a random part.
This finding is different from the conclusion drawn in [9], which was based on a superset of our dataset in terms of incident types (i.e., both negligent breaches and malicious breaches, as we will discuss in Section I-B); whereas the present study focuses on hacking breach incidents only (i.e., a proper sub-type of the malicious breaches analyzed in [9]).

2) Qualitative Trend Analysis of the Hacking Breach Sizes: In Section IV-B, we used the ARMA-GARCH model with innovations that follow the mixed extreme value distribution to describe the log-transformed breach sizes. Figure 12 plots the decomposition of the time series using this model. The trend is defined as

Y_t = μ + φ_1 Y_{t−1} + θ_1 ε_{t−1},

and the random part is defined as ε_t, which is modeled by the GARCH(1, 1) model described in Eq. (IV.4). We observe that although the breach sizes vary over time, there is no clear trend. This conclusion coincides with what was concluded in [9], which is drawn from, as mentioned above, a proper superset of the dataset we analyze.

B. Quantitative Trend Analysis

In order to quantify the trend, we propose using the following metrics to characterize the growth of hacking breach incidents. Recall that {(t_i, y_{t_i})}_{i=1,...,n} is the sequence of breach incidents occurring at time t_i with breach size y_{t_i}. Inspired by the growth rate analysis in economics [56], we propose:

• Growth Rate (GR): We define the breach-size GR as

GR_i = (y_{t_{i+1}} − y_{t_i}) / y_{t_i}.

The inter-arrival time GR can be defined similarly.

• Average Growth Rate over Time (AGRT): We define the AGRT as

AGRT_i = (1 / d_{i+1}) · (y_{t_{i+1}} − y_{t_i}) / y_{t_i}.

• Compound Growth Rate over Time (CGRT): We define the CGRT as

CGRT_i = (y_{t_{i+1}} / y_{t_i})^{1/d_{i+1}} − 1.

Note that AGRT represents the percentage change of the breach size over time, and CGRT describes the rate at which the breach size would grow.

Table X summarizes the results of the quantitative trend analysis. For the breach-size GR, we observe that the means of the GR are all positive, meaning that the breach size becomes increasingly larger each year. Note that the means of the GR
are largely affected by the extreme GR. For example, for year 2016, we have the maximum GR 411,999, which leads to a very large mean GR (i.e., 3,917.1173). In terms of the medians, we observe that from 2005 to 2008, the GRs are negative, meaning that the breach sizes decrease during these years. The negative GRs of breach sizes are also observed for years 2010, 2013 and 2014. For years 2015 and 2016, we observe positive GRs, 2.0172 and 0.2699, meaning that the breach size increases for these two years. For year 2017, we have a negative median GR (i.e., −0.3092) until April 7, 2017. It is worth mentioning that for years 2010, 2013, and 2016, we have very large standard deviations, which indicate that there exist extreme breach sizes during these years.

For the inter-arrival time GR, we observe that the median GR for each year is relatively small. In particular, we observe that the median is 0 for years 2007, 2009, 2016, and 2017, meaning that during these years, the breach inter-arrival times are relatively stable. We also observe that for years 2014 and 2015, the medians of the inter-arrival time GR are negative, meaning that the inter-arrival time decreases for these years. We also note that since year 2012 (except for year 2015), the standard deviations of the GRs of the inter-arrival time are relatively small (smaller than 3.6). We conclude that the hacking breach incidents' inter-arrival time decreases in recent years. This deepens the qualitative trend analysis in the previous section.

The AGRT and CGRT metrics consider both the breach size and the inter-arrival time. We observe that the means of the AGRT are all positive, meaning that the breach size increases on average. In terms of the median, we observe that the AGRTs of years 2013 and 2014 are negative. Compared to the GRs of these two years, we observe that the absolute values of the AGRTs are smaller, namely, 0.0318 and 0.0360 for the AGRTs versus 0.2633 and 0.2878 for the GRs, respectively. This can be explained by the evolution of the inter-arrival times. Based on AGRT, we conclude that although the breach size turns out to be smaller (negative growth) in years 2013 and 2014, it becomes larger (positive growth) in years 2015 and 2016, and becomes smaller at the beginning of year 2017. A similar conclusion can be drawn for the CGRT metric. The median value 0.0808 of CGRT in year 2016 can be interpreted as a median daily growth rate of 0.0808 for year 2016.

By summarizing the preceding qualitative and quantitative trend analysis, we draw:

Insight 7: The situation of hacking breach incidents is getting worse in terms of their frequency, but appears to be stabilizing in terms of their breach sizes, meaning that more devastating breach incidents are unlikely in the future.

VII. CONCLUSION

We analyzed a hacking breach dataset from the points of view of the incidents' inter-arrival time and the breach size, and showed that they both should be modeled by stochastic processes rather than distributions. The statistical models developed in this paper show satisfactory fitting and prediction accuracies. In particular, we propose using a copula-based approach to predict the joint probability that an incident with a certain magnitude of breach size will occur during a future period of time. Statistical tests show that the methodologies proposed in this paper are better than those which are presented in the literature, because the latter ignored both the temporal correlations and the dependence between the incidents' inter-arrival times and the breach sizes. We conducted qualitative and quantitative analyses to draw further insights. We drew a set of cybersecurity insights, including that the threat of cyber hacking breach incidents is indeed getting worse in terms of their frequency, but not in terms of the magnitude of their damage. The methodology presented in this paper can be adopted or adapted to analyze datasets of a similar nature.

There are many open problems left for future research. For example, it is both interesting and challenging to investigate how to predict the extremely large values and how to deal with missing data (i.e., breach incidents that are not reported). It is also worthwhile to estimate the exact occurrence times of breach incidents. Finally, more research needs to be conducted towards understanding the predictability of breach incidents (i.e., the upper bound of prediction accuracy [24]).

APPENDIX

A. ACF and PACF

ACF and PACF [36] are two important tools for examining temporal correlations. Consider a sequence of samples {Y_1, ..., Y_n}. The sample ACF is defined as

r_k = [ Σ_{t=k+1}^{n} (Y_t − Ȳ)(Y_{t−k} − Ȳ) ] / [ Σ_{t=1}^{n} (Y_t − Ȳ)^2 ],  k = 1, ..., n − 1,

where Ȳ = Σ_{t=1}^{n} Y_t / n is the sample mean. The PACF is defined as a conditional correlation of two variables given the information of the other variables. Specifically, the PACF of (Y_t, Y_{t−k}) is the autocorrelation between Y_t and Y_{t−k} after removing any linear dependence on Y_{t−1}, Y_{t−2}, ..., Y_{t−k+1}; see [36] for more details.

B. AIC and BIC

AIC and BIC are the most commonly used criteria for model selection in statistics [36], [37], [53]. AIC is meant to balance the goodness-of-fit and the penalty for model complexity (the smaller the AIC value, the better the model). Specifically,

AIC = −2 log(MLE) + 2k,

where MLE is the likelihood associated with the fitted model and measures the goodness-of-fit, and k is the number of estimated parameters and measures the model complexity. Similarly, the smaller the BIC value, the better the model. Specifically,

BIC = −2 log(MLE) + k log(n),

where n is the sample size. BIC penalizes complex models more heavily than AIC, thus favoring simpler models.

C. Ljung-Box and McLeod-Li Tests

The Ljung-Box test considers a group of ACFs of a time series [37], [57]. The null hypothesis is

H0: The time series are independent,

and the alternative is

Ha: The time series are not independent.
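The sample ACF and the Ljung-Box test described in the Appendix can be sketched as follows. This minimal implementation uses the common full-sample denominator for r_k, and the white-noise and AR(1) example series are made-up illustrations.

```python
import numpy as np
from scipy import stats

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    denom = np.sum(yc ** 2)
    return np.array([np.sum(yc[k:] * yc[:-k]) / denom
                     for k in range(1, max_lag + 1)])

def ljung_box(y, k):
    """Ljung-Box Q statistic and chi-squared p-value for lags 1..k."""
    n = len(y)
    r = sample_acf(y, k)
    q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, k + 1)))
    p_value = stats.chi2.sf(q, df=k)  # reject H0 (independence) if p is small
    return q, p_value

rng = np.random.default_rng(0)
white = rng.standard_normal(500)          # independent noise: large p expected
q_w, p_w = ljung_box(white, k=10)

ar = np.zeros(500)                        # strongly autocorrelated AR(1) series
for i in range(1, 500):
    ar[i] = 0.8 * ar[i - 1] + rng.standard_normal()
q_a, p_a = ljung_box(ar, k=10)

print(f"white noise: Q={q_w:.2f}, p={p_w:.3f}; AR(1): Q={q_a:.2f}, p={p_a:.2e}")
```

The McLeod-Li test is obtained by applying the same `ljung_box` routine to the squared (and centered) series.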
2870 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 13, NO. 11, NOVEMBER 2018
The Ljung-Box test statistic is defined as

Q = n(n + 2) ( r̂_1^2 / (n − 1) + ··· + r̂_k^2 / (n − k) ),

where r̂_i is the estimated correlation coefficient at lag i. We reject the null hypothesis if Q > χ^2_{1−α,k}, where χ^2_{1−α,k} is the (1 − α)th quantile of the chi-squared distribution with k degrees of freedom.

The McLeod-Li test is similarly defined, but it tests whether the first m autocorrelations of the squared data are zero, using the Ljung-Box test [31], [57].

D. Goodness-of-Fit Test Statistics

The goodness-of-fit of a distribution describes how well the distribution fits a set of samples. Three commonly used test statistics are: the Kolmogorov-Smirnov (KS) test, the Anderson-Darling (AD) test, and the Cramér-von Mises (CM) test [58], [59]. Specifically, let X_1, ..., X_n be independent and identically distributed random variables with distribution F. The empirical distribution F_n is defined as

F_n(x) = (1/n) Σ_{i=1}^{n} I(X_i ≤ x),

where I(X_i ≤ x) is the indicator function: I(X_i ≤ x) = 1 if X_i ≤ x, and 0 otherwise.

The KS, CM, and AD test statistics are defined as:

KS = √n sup_x |F_n(x) − F(x)|,

CM = n ∫ (F_n(x) − F(x))^2 dF(x),

AD = n ∫ (F_n(x) − F(x))^2 w(x) dF(x),

where w(x) = [F(x)(1 − F(x))]^{−1}.

ACKNOWLEDGMENT

The authors thank the reviewers for their constructive comments that helped improve the paper. In Section V, they incorporated some insightful comments of one reviewer on how to connect the prediction models to real-world quantitative cyber defense risk management.

REFERENCES

[1] P. R. Clearinghouse. Privacy Rights Clearinghouse's Chronology of Data Breaches. Accessed: Nov. 2017. [Online]. Available: https://www.privacyrights.org/data-breaches
[2] ITR Center. Data Breaches Increase 40 Percent in 2016, Finds New Report From Identity Theft Resource Center and CyberScout. Accessed: Nov. 2017. [Online]. Available: http://www.idtheftcenter.org/2016databreaches.html
[3] C. R. Center. Cybersecurity Incidents. Accessed: Nov. 2017. [Online]. Available: https://www.opm.gov/cybersecurity/cybersecurity-incidents
[4] IBM Security. Accessed: Nov. 2017. [Online]. Available: https://www.ibm.com/security/data-breach/index.html
[5] NetDiligence. The 2016 Cyber Claims Study. Accessed: Nov. 2017. [Online]. Available: https://netdiligence.com/wp-content/uploads/2016/10/P02_NetDiligence-2016-Cyber-Claims-Study-ONLINE.pdf
[6] M. Eling and W. Schnell, "What do we know about cyber risk and cyber risk insurance?" J. Risk Finance, vol. 17, no. 5, pp. 474–491, 2016.
[7] T. Maillart and D. Sornette, "Heavy-tailed distribution of cyber-risks," Eur. Phys. J. B, vol. 75, no. 3, pp. 357–364, 2010.
[8] R. B. Security. Datalossdb. Accessed: Nov. 2017. [Online]. Available: https://blog.datalossdb.org
[9] B. Edwards, S. Hofmeyr, and S. Forrest, "Hype and heavy tails: A closer look at data breaches," J. Cybersecur., vol. 2, no. 1, pp. 3–14, 2016.
[10] S. Wheatley, T. Maillart, and D. Sornette, "The extreme risk of personal data breaches and the erosion of privacy," Eur. Phys. J. B, vol. 89, no. 1, p. 7, 2016.
[11] P. Embrechts, C. Klüppelberg, and T. Mikosch, Modelling Extremal Events: For Insurance and Finance, vol. 33. Berlin, Germany: Springer-Verlag, 2013.
[12] R. Böhme and G. Kataria, "Models and measures for correlation in cyber-insurance," in Proc. Workshop Econ. Inf. Secur. (WEIS), 2006, pp. 1–26.
[13] H. Herath and T. Herath, "Copula-based actuarial model for pricing cyber-insurance policies," Insurance Markets Companies: Anal. Actuarial Comput., vol. 2, no. 1, pp. 7–20, 2011.
[14] A. Mukhopadhyay, S. Chatterjee, D. Saha, A. Mahanti, and S. K. Sadhukhan, "Cyber-risk decision models: To insure it or not?" Decision Support Syst., vol. 56, pp. 11–26, Dec. 2013.
[15] M. Xu and L. Hua. (2017). Cybersecurity Insurance: Modeling and Pricing. [Online]. Available: https://www.soa.org/research-reports/2017/cybersecurity-insurance
[16] M. Xu, L. Hua, and S. Xu, "A vine copula model for predicting the effectiveness of cyber defense early-warning," Technometrics, vol. 59, no. 4, pp. 508–520, 2017.
[17] C. Peng, M. Xu, S. Xu, and T. Hu, "Modeling multivariate cybersecurity risks," J. Appl. Stat., pp. 1–23, 2018.
[18] M. Eling and N. Loperfido, "Data breaches: Goodness of fit, pricing, and risk measurement," Insurance, Math. Econ., vol. 75, pp. 126–136, Jul. 2017.
[19] K. K. Bagchi and G. Udo, "An analysis of the growth of computer and Internet security breaches," Commun. Assoc. Inf. Syst., vol. 12, no. 1, p. 46, 2003.
[20] E. Condon, A. He, and M. Cukier, "Analysis of computer security incident data using time series models," in Proc. 19th Int. Symp. Softw. Rel. Eng. (ISSRE), Nov. 2008, pp. 77–86.
[21] Z. Zhan, M. Xu, and S. Xu, "A characterization of cybersecurity posture from network telescope data," in Proc. 6th Int. Conf. Trusted Syst., 2014, pp. 105–126. [Online]. Available: http://www.cs.utsa.edu/~shxu/socs/intrust14.pdf
[22] Z. Zhan, M. Xu, and S. Xu, "Characterizing honeypot-captured cyber attacks: Statistical framework and case study," IEEE Trans. Inf. Forensics Security, vol. 8, no. 11, pp. 1775–1789, Nov. 2013.
[23] Z. Zhan, M. Xu, and S. Xu, "Predicting cyber attack rates with extreme values," IEEE Trans. Inf. Forensics Security, vol. 10, no. 8, pp. 1666–1677, Aug. 2015.
[24] Y.-Z. Chen, Z.-G. Huang, S. Xu, and Y.-C. Lai, "Spatiotemporal patterns and predictability of cyberattacks," PLoS ONE, vol. 10, no. 5, p. e0124472, 2015.
[25] C. Peng, M. Xu, S. Xu, and T. Hu, "Modeling and predicting extreme cyber attack rates via marked point processes," J. Appl. Stat., vol. 44, no. 14, pp. 2534–2563, 2017.
[26] J. Z. Bakdash et al. (2017). "Malware in the future? Forecasting analyst detection of cyber events." [Online]. Available: https://arxiv.org/abs/1707.03243
[27] Y. Liu et al., "Cloudy with a chance of breach: Forecasting cyber security incidents," in Proc. 24th USENIX Secur. Symp., Washington, DC, USA, 2015, pp. 1009–1024.
[28] R. Sen and S. Borle, "Estimating the contextual risk of data breach: An empirical approach," J. Manage. Inf. Syst., vol. 32, no. 2, pp. 314–341, 2015.
[29] F. Bisogni, H. Asghari, and M. Eeten, "Estimating the size of the iceberg from its tip," in Proc. Workshop Econ. Inf. Secur. (WEIS), La Jolla, CA, USA, 2017.
[30] R. F. Engle and J. R. Russell, "Autoregressive conditional duration: A new model for irregularly spaced transaction data," Econometrica, vol. 66, no. 5, pp. 1127–1162, 1998.
[31] N. Hautsch, Econometrics of Financial High-Frequency Data. Berlin, Germany: Springer-Verlag, 2011.
[32] P. Embrechts, C. Klüppelberg, and T. Mikosch, Modelling Extremal Events: For Insurance and Finance. Berlin, Germany: Springer, 1997.
[33] T. Bollerslev, J. Russell, and M. Watson, Volatility and Time Series Econometrics: Essays in Honor of Robert Engle. London, U.K.: Oxford Univ. Press, 2010.
[34] R. B. Nelsen, An Introduction to Copulas. New York, NY, USA: Springer-Verlag, 2007.
[35] H. Joe, Dependence Modeling With Copulas. Boca Raton, FL, USA: CRC Press, 2014.
[36] J. D. Cryer and K.-S. Chan, Time Series Analysis With Applications in R. New York, NY, USA: Springer, 2008.
[37] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting. New York, NY, USA: Springer-Verlag, 2002.
[38] P. J. Brockwell and R. A. Davis, Introduction to Time Series and Forecasting. New York, NY, USA: Springer-Verlag, 2016.
[39] D. J. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes, vol. 1, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[40] M. Y. Zhang, J. R. Russell, and R. S. Tsay, "A nonlinear autoregressive conditional duration model with applications to financial transaction data," J. Econ., vol. 104, no. 1, pp. 179–207, 2001.
[41] L. Bauwens and P. Giot, "The logarithmic ACD model: An application to the bid-ask quote process of three NYSE stocks," Ann. Économie Stat., no. 60, pp. 117–149, Oct./Dec. 2000.
[42] L. Bauwens, P. Giot, J. Grammig, and D. Veredas, "A comparison of financial duration models via density forecasts," Int. J. Forecasting, vol. 20, no. 4, pp. 589–609, 2004.
[43] G. W. Corder and D. I. Foreman, Nonparametric Statistics: A Step-by-Step Approach. Hoboken, NJ, USA: Wiley, 2014.
[44] P. R. Hansen and A. Lunde, "A forecast comparison of volatility models: Does anything beat a GARCH(1, 1)?" J. Appl. Econ., vol. 20, no. 7, pp. 873–889, 2005.
[45] S. I. Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. New York, NY, USA: Springer-Verlag, 2007.
[46] X. Zhao, C. Scarrott, L. Oxley, and M. Reale, "Extreme value modelling for forecasting market crisis impacts," Appl. Financial Econ., vol. 20, nos. 1–2, pp. 63–72, 2010.
[47] C. Scarrott, "Univariate extreme value mixture modeling," in Extreme Value Modeling and Risk Analysis: Methods and Applications, J. Yan and D. K. Dey, Eds. London, U.K.: Chapman & Hall, 2016, pp. 41–67.
[48] H. Joe, Multivariate Models and Dependence Concepts (Monographs on Statistics and Applied Probability), vol. 73. London, U.K.: Chapman & Hall, 1997.
[49] H. White, "Maximum likelihood estimation of misspecified models," Econometrica, vol. 50, no. 1, pp. 1–25, 1982.
[50] W. Huang and A. Prokhorov, "A goodness-of-fit test for copulas," Econ. Rev., vol. 33, no. 7, pp. 751–771, 2014.
[51] W. Wang and M. T. Wells, "Model selection and semiparametric inference for bivariate failure-time data," J. Amer. Statist. Assoc., vol. 95, no. 449, pp. 62–72, 2000.
[52] C. Genest, J.-F. Quessy, and B. Rémillard, "Goodness-of-fit procedures for copula models based on the probability integral transformation," Scandin. J. Stat., vol. 33, no. 2, pp. 337–366, 2006.
[53] A. McNeil, R. Frey, and P. Embrechts, Quantitative Risk Management: Concepts, Techniques, and Tools. Princeton, NJ, USA: Princeton Univ. Press, 2010.
[54] P. F. Christoffersen, "Evaluating interval forecasts," Int. Econ. Rev., vol. 39, no. 4, pp. 841–862, 1998.
[55] R. F. Engle and S. Manganelli, "CAViaR: Conditional autoregressive value at risk by regression quantiles," J. Bus. Econ. Stat., vol. 22, no. 4, pp. 367–381, 2004.
[56] P. M. Romer, "Increasing returns and long-run growth," J. Political Econ., vol. 94, no. 5, pp. 1002–1037, 1986.
[57] G. M. Ljung and G. E. P. Box, "On a measure of lack of fit in time series models," Biometrika, vol. 65, no. 2, pp. 297–303, 1978.
[58] G. R. Shorack and J. A. Wellner, Empirical Processes With Applications to Statistics. Philadelphia, PA, USA: SIAM, 1986.
[59] M. A. Stephens, "Tests based on EDF statistics," in Goodness-of-Fit Techniques, R. B. d'Agostino and M. A. Stephens, Eds. New York, NY, USA: Marcel Dekker, 1986, pp. 97–193.

Maochao Xu received the Ph.D. degree in statistics from Portland State University in 2010. He is currently an Associate Professor of mathematics with Illinois State University. His research interests include statistical modeling, cyber risk analysis, and ensuring cyber security. He also serves as an Associate Editor for Communications in Statistics.

Kristin M. Schweitzer is a Mechanical Engineer with the U.S. Army Research Laboratory (ARL), Cyber and Networked Systems Branch. Her current role is to conduct and coordinate use-inspired basic research in cyber security for the ARL South office located at the University of Texas at San Antonio. Previously for ARL, she provided Human Systems Integration analyses for U.S. Army, Marine Corps, Air Force, and Department of Homeland Security systems. She also conducted research on human performance in uncontrolled environments.

Raymond M. Bateman received the Ph.D. degree in mathematical and computer sciences (operations research) from the Colorado School of Mines. He retired as a Lieutenant Colonel from the U.S. Army Special Forces with 20 years of enlisted and officer service. He conducted research for significant and relevant issues affecting the U.S. Army Medical Department Center and School, Health Readiness Center of Excellence by applying human systems integration (HSI) and operations research techniques. He currently serves as the Army Research Laboratory (ARL) South Lead for cybersecurity for use-inspired basic research at The University of Texas, San Antonio. His projects included serving as the Non-Medical Operations Research Systems Analyst and HSI Expert for the Medical Command Root-Cause Analysis Event Support and the Engagement Team that investigates sentinel events that result in permanent harm or death. He has two deployments to Iraq as the Army Civilian Science Advisor to Commander III Corps and Army Materiel Command.

Shouhuai Xu received the Ph.D. degree in computer science from Fudan University. He is currently a Full Professor with the Department of Computer Science, The University of Texas at San Antonio. He is also the Founding Director of the Laboratory for Cybersecurity Dynamics. He pioneered the Cybersecurity Dynamics framework for modeling and analyzing cybersecurity from a holistic perspective. He is interested in both theoretical modeling and analysis of cybersecurity and devising practical cyber defense solutions. He co-initiated the International Conference on Science of Cyber Security (SciSec) in 2018 and the ACM Scalable Trusted Computing Workshop. He is/was a Program Committee Co-Chair of SciSec'18, ICICS'18, NSS'15, and Inscrypt'13. He was/is an Associate Editor of IEEE TDSC, IEEE T-IFS, and IEEE TNSE.