1 Introduction
Each year a large number of new software vulnerabilities are discovered in var-
ious applications (see Figure 1). Evaluation of network security has focused on
known vulnerabilities and their effects on the hosts and networks. However, the
potential for unknown vulnerabilities (a.k.a. zero-day vulnerabilities) cannot be
ignored because more and more cyber attacks utilize these unknown security
holes. A zero-day vulnerability could last a long period of time (e.g. in 2010
Microsoft confirmed a vulnerability in Internet Explorer, which affected some
versions that were released in 2001). Therefore, in order to have more accurate
results on network security evaluation, one must consider the effect from zero-
day vulnerabilities. The National Vulnerability Database (NVD) is a well-known
data source for vulnerability information, which could be useful to estimate the
likelihood that a specific application contains zero-day vulnerabilities based on
historical information. We have adopted a data-mining approach in an attempt to
build a prediction model for the attribute “time to next vulnerability” (TTNV),
i.e. the time that it will take before the next vulnerability about a particular
application will be found. The predicted TTNV metrics could be translated into
the likelihood that a zero-day vulnerability exists in the software.
Past research has addressed the problem of predicting software vulnerabilities
from different angles. Ingols et al. [10] pointed out the importance of estimating
the risk level of zero-day vulnerabilities. McQueen et al. [15] did experiments on
estimating the number of zero-day vulnerabilities on each given day. Alhazmi
and Malaiya [3] introduced the definition of TTNV. Ozment [19] did a number of
studies on analyzing NVD, and pointed out several limitations of this database.
In this paper we present an empirical study that explains why it is unlikely
that one can construct a reliable prediction model for TTNV given the
information available in NVD.
Each data entry in NVD consists of a large number of fields. We represent them
as <D, CPE, CVSS>. D is a set of data including published time, summary
of the vulnerability and external links about each vulnerability. CPE [6] and
CVSS [21] will be described below.
For example, the CPE name cpe:/a:acme:product:1.0:update2:pro:en-us identifies the
Professional edition of the "Acme Product 1.0 Update 2 English".
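To make the CPE naming convention concrete, the following Python sketch splits a CPE 2.2 URI into its named fields. It is purely illustrative and not part of our processing pipeline; the function and field names are our own.

CPE_FIELDS = ["part", "vendor", "product", "version", "update", "edition", "language"]

def parse_cpe(uri):
    """Split a CPE 2.2 URI such as cpe:/a:acme:product:1.0:update2:pro:en-us
    into its named components; missing trailing fields are left as None."""
    values = uri[len("cpe:/"):].split(":")
    values += [None] * (len(CPE_FIELDS) - len(values))
    return dict(zip(CPE_FIELDS, values))

print(parse_cpe("cpe:/a:acme:product:1.0:update2:pro:en-us"))
# {'part': 'a', 'vendor': 'acme', 'product': 'product', 'version': '1.0',
#  'update': 'update2', 'edition': 'pro', 'language': 'en-us'}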
The CVSS Score is calculated based on the metric vector, with the objective
of indicating the severity of a vulnerability.
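For completeness, the sketch below reproduces the CVSS v2 base-score equation with the standard metric weights published in the CVSS v2 specification. NVD already stores the computed score for each entry, so this is only an illustration of how the metric vector determines the severity value.

# Standard CVSS v2 weights (see the public CVSS v2 specification).
AV = {"L": 0.395, "A": 0.646, "N": 1.0}     # Access Vector
AC = {"H": 0.35, "M": 0.61, "L": 0.71}      # Access Complexity
AU = {"M": 0.45, "S": 0.56, "N": 0.704}     # Authentication
CIA = {"N": 0.0, "P": 0.275, "C": 0.660}    # Conf./Integ./Avail. impact

def cvss2_base(av, ac, au, c, i, a):
    impact = 10.41 * (1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a]))
    exploitability = 20 * AV[av] * AC[ac] * AU[au]
    f_impact = 0.0 if impact == 0 else 1.176
    # Base score, rounded to one decimal place.
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f_impact, 1)

print(cvss2_base("N", "L", "N", "P", "P", "P"))   # AV:N/AC:L/Au:N/C:P/I:P/A:P -> 7.5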
3 Our Approach
Removing obvious errors: Some NVD entries are obviously erroneous (e.g.
in one entry for Linux the kernel version was given as 390). To prevent these
entries from polluting the learning process, we removed them from the database.
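A minimal sketch of such a sanity filter is shown below; the field names and the threshold of 100 on the Linux kernel major version are illustrative assumptions rather than the exact rules we applied.

def plausible(entry):
    """Reject entries whose version field is clearly erroneous."""
    if entry["product"] == "linux_kernel":
        try:
            major = int(entry["version"].split(".")[0])
        except ValueError:
            return False
        return major < 100   # rejects, e.g., the erroneous kernel version "390"
    return True

def clean(entries):
    return [e for e in entries if plausible(e)]

print(clean([{"product": "linux_kernel", "version": "390"},
             {"product": "linux_kernel", "version": "2.6.3"}]))
# keeps only the 2.6.3 entry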
Time: We investigated two schemes for constructing time features. One uses epoch
time; the other uses the month and day separately, without the year. As explained
earlier, epoch time is unlikely to provide useful prediction capability, as it
increases monotonically. Intuitively, the second scheme should be better, as the
month and day on which a vulnerability is published may show some repeating
pattern, even in future years.
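The two schemes can be summarized by the following sketch, assuming the NVD published time has been parsed into a datetime object; the function names are ours.

from datetime import datetime

def epoch_feature(published):
    # Scheme 1: seconds since the Unix epoch (monotonically increasing).
    return published.timestamp()

def month_day_features(published):
    # Scheme 2: month and day only, discarding the year, so that seasonal
    # publication patterns can repeat across years.
    return published.month, published.day

ts = datetime(2010, 6, 15)
print(epoch_feature(ts), month_day_features(ts))   # epoch seconds and (month, day)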
Version: We calculate the difference between the versions of two adjacent in-
stances and use the versiondiff as a predictive feature. An instance here refers to
an entry where a specific version of an application contains a specific vulnera-
bility. The rationale for using versiondiff as a predictive feature is that we want
to use the trend of versions over time to estimate future situations. Two
versiondiff schemes are introduced in our approach: the first calculates
the versiondiff based on version counters (rank), while the second calculates
the versiondiff by radix.
Counter versiondiff: In this versiondiff scheme, differences between minor versions
and differences between major versions are treated alike. For example,
if a piece of software has three versions, 1.1, 1.2, and 2.0, the versions will be assigned
counters 1, 2, 3 based on the rank of their values. Therefore, the versiondiff
between 1.1 and 1.2 is the same as the one between 1.2 and 2.0.
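A sketch of the counter-based scheme follows; it assumes purely numeric dotted versions and is only meant to illustrate the rank computation.

def version_key(v):
    # Numeric sort key; assumes purely numeric dotted versions.
    return tuple(int(part) for part in v.split("."))

def counter_versiondiff(versions):
    """Assign each distinct version a counter based on its rank, then return
    the counter differences between adjacent instances."""
    rank = {v: i + 1 for i, v in enumerate(sorted(set(versions), key=version_key))}
    return [rank[b] - rank[a] for a, b in zip(versions, versions[1:])]

print(counter_versiondiff(["1.1", "1.2", "2.0"]))   # [1, 1]: both adjacent diffs are equal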
When analyzing the data, we found that versiondiff did not work very
well for our problem because, in most cases, a new vulnerability affects all
previous versions as well. Therefore, most values of versiondiff are zero, as the
new vulnerability instance must affect an older version that also exists in the
previous instance, resulting in a versiondiff of zero. To mitigate
this limitation, we created an additional predictive feature for our later experiments:
the number of occurrences of each version of a given software. More details
will be provided in Section 4.
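One plausible way to compute this occurrence feature is sketched below, as a running count over instances in time order; the exact construction we used is described in Section 4.

from collections import Counter

def occurrence_features(instances):
    """instances: (application, version) pairs in chronological order.
    Returns, per instance, how often that version has been seen so far."""
    seen = Counter()
    feats = []
    for app, version in instances:
        seen[(app, version)] += 1
        feats.append(seen[(app, version)])
    return feats

print(occurrence_features([("linux", "2.6.3"), ("linux", "2.6.3"), ("linux", "2.6.4")]))
# [1, 2, 1]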
4 Experimental Results
We conducted the experiments on our department’s computer cluster, Beocat.
We used a single node with 4 GB of RAM for each experiment. As mentioned above,
WEKA [5], a data-mining suite, was used in all experiments.
Root Mean Squared Error: The mean squared error (MSE) of a predictive
regression model is another way to quantify the difference between a set of
predicted values, x_p, and the set of actual (target) values, x_t, of the attribute
being predicted. The root mean squared error (RMSE) can be defined as:
\[
\mathrm{RMSE}(x_p, x_t) = \sqrt{\mathrm{MSE}(x_p, x_t)} = \sqrt{E\left[(x_p - x_t)^2\right]} = \sqrt{\frac{\sum_{i=1}^{n} (x_{p,i} - x_{t,i})^2}{n}}
\]
Root Relative Squared Error: According to [1], the root relative squared
error (RRSE) is relative to what the error would have been if a simple predictor
had been used. The simple predictor is considered to be the mean/majority of
the actual values. Thus, the relative squared error takes the total squared error
and normalizes it by dividing by the total squared error of the simple predictor.
By taking the root of the relative squared error one reduces the error to the
same dimensions as the quantity being predicted.
\[
\mathrm{RRSE}(x_p, x_t) = \sqrt{\frac{\sum_{i=1}^{n} (x_{p,i} - x_{t,i})^2}{\sum_{i=1}^{n} (x_{t,i} - \bar{x})^2}}
\]
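Both error measures follow directly from the definitions above; the sketch below computes them for a list of predicted and actual values (illustrative only, since WEKA reports these metrics for us).

import math

def rmse(xp, xt):
    # Root mean squared error over paired predicted/actual values.
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(xp, xt)) / len(xt))

def rrse(xp, xt):
    # Root relative squared error: normalized by the error of the mean predictor.
    mean_t = sum(xt) / len(xt)
    numerator = sum((p - t) ** 2 for p, t in zip(xp, xt))
    denominator = sum((t - mean_t) ** 2 for t in xt)
    return math.sqrt(numerator / denominator)

predicted, actual = [12.0, 8.0, 30.0], [10.0, 9.0, 25.0]
print(rmse(predicted, actual), rrse(predicted, actual))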
4.2 Experiments
4.3 Results
The results of our first set of experiments did
not show a significant difference between the two time schemes that we used,
although we expected the month and day features to provide better results than
the absolute epoch time, as explained in Section 3.2. Thus, neither scheme has
acceptable correlation capability on the test data. We adopted the month and
day time scheme for all of the following experiments.
Table 2 shows a comparison of the results of the two different versiondiff
schemes. As can be seen, both perform poorly as well. Given the unsatisfactory
results, we believed that the large number of Linux sub-versions could
potentially be a problem. Thus, we also investigated constructing the versiondiff
feature by binning versions of the Linux kernel (to obtain a smaller set of
sub-versions). We bin each version by keeping its first three sub-version levels
(e.g. Bin(2.6.3.1) = 2.6.3). We bin based on the three most significant sub-version
levels because more than half of the instances (31,834 out of 56,925) have versions
with more than three levels, while only about 1% (665 out of 56,925) have versions
with more than four levels. Moreover, a difference at the third sub-version level
already represents a substantial dissimilarity between Linux kernels. We should note
that the sub-version problem may not exist for other vendors, such as Microsoft,
where the versions of the software are naturally discrete (all Microsoft products
have versions less than 20). Table 3 shows the comparison between regression models
that use binned versions and regression models that do not. The results are still
not good enough, because many of the versiondiff values are zero, as explained in
Section 3.2 (new vulnerabilities affect previous versions as well).
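The binning step itself is straightforward; a sketch follows (the function name is ours).

def bin_version(version, levels=3):
    """Keep only the first `levels` sub-version components, e.g. Bin(2.6.3.1) = 2.6.3."""
    return ".".join(version.split(".")[:levels])

print(bin_version("2.6.3.1"))   # 2.6.3
print(bin_version("2.6.3"))     # 2.6.3 (unchanged)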
TTNV Binning: Since we found that the feature (TTNV) of Linux shows distinct
clusters, we divided the feature values into two categories, more than 10 days
and no more than 10 days, thus transforming the original regression problem
into an easier binary classification problem. The resulting models are evaluated
in terms of correctly classified rates, as shown in Table 4. While the models are
better in this case, the false positive rates are still high (typically above 0.4). In
this case, as before, we used default parameters for all classification functions.
However, for the SMO function, we also used the Gaussian (RBF) kernel. The
results of the SMO (RBF kernel) classifier are better than the results of most
other classifiers, in terms of correctly classified rate. However, even this model
has a false positive rate of 0.436, which is far from acceptable.
Classification functions   Correctly classified (training)   Correctly classified (test)   FPR     TPR
Simple logistic            97.6101%                          69.6121%                      0.372   0.709
Logistic regression        97.9856%                          57.9542%                      0.777   0.647
Multi-layer perceptron     98.13%                            64.88%                        0.689   0.712
RBF network                95.083%                           55.18%                        0.76    0.61
SMO                        97.9061%                          61.8259%                      0.595   0.658
SMO (RBF kernel)           96.8303%                          62.8392%                      0.436   0.641
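The transformation from the regression target to a binary class label, as described above, amounts to thresholding TTNV at 10 days; a sketch with our own label names is given below.

def ttnv_class(ttnv_days):
    """Binary label: 'more than 10 days' vs. 'no more than 10 days'."""
    return "more_than_10_days" if ttnv_days > 10 else "at_most_10_days"

print([ttnv_class(d) for d in (3, 10, 42)])
# ['at_most_10_days', 'at_most_10_days', 'more_than_10_days']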
CVSS Metrics: In all cases, we also performed experiments with the CVSS metrics
added as predictive features. However, we did not observe much difference in the results.
We also examined the distribution of
TTNV values. However, we did not find any obvious clusters for either Windows
or non-Windows instances. Therefore, we only used regression functions. The
results obtained using the aforementioned features for both Windows and non-
Windows instances are presented in Table 5. As can be seen, the correlation
coefficients are still less than 0.4.
Mozilla: Finally, we built classification models for Firefox, with and without the
CVSS metrics. The results are shown in Table 7. As can be seen, the correctly
classified rates are relatively good (approximately 0.7) in both cases. However,
the number of instances in this dataset is rather small (less than 5000), therefore
it is unclear how stable the prediction model is.
As mentioned above, we used default parameters for all regression and classifi-
cation models that we built. To investigate if different parameter settings could
produce better results, we chose to tune parameters for the support vector ma-
chines algorithm (SVM), whose WEKA implementations for classification and
regression are called SMO and SMO regression, respectively. There are two main
parameters that can be tuned for SVM, denoted by C and σ. The C parameter
is a cost parameter which controls the trade-off between model complexity and
training error, while σ controls the width of the Gaussian kernel [2].
To find the best combination of values for C and σ, we generated a grid
consisting of the following values for C: 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 10, 15, 20,
and the following values for σ: 0, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 5.0, and ran
the SVM algorithm for all possible combinations. We used a separate validation
set to select the combination of values that gives the best correlation coefficient,
and, separately, the combination that gives the best root mean squared error
and root relative squared error together.
The validation and test datasets have approximately equal sizes; the test set
consists of chronologically newer data, as compared to the validation data, while
the validation data is newer than the training data.
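The grid search can be summarized by the following sketch; it assumes a generic fit/score interface rather than the actual WEKA SMO classes, and the grids are the ones listed above.

from itertools import product

C_GRID = [0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 10, 15, 20]
SIGMA_GRID = [0, 0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 2.0, 5.0]

def grid_search(train, validation, fit, score):
    """fit(train, C, sigma) -> model; score(model, data) -> value to maximize
    on the validation set (e.g. the correlation coefficient)."""
    best = None
    for C, sigma in product(C_GRID, SIGMA_GRID):
        model = fit(train, C, sigma)
        value = score(model, validation)
        if best is None or value > best[0]:
            best = (value, C, sigma, model)
    return best

# Dummy fit/score pair, just to exercise the search loop.
fit = lambda data, C, sigma: (C, sigma)
score = lambda model, data: -abs(model[0] - 5.0) - abs(model[1] - 0.1)
print(grid_search(None, None, fit, score))   # picks C=5.0, sigma=0.1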
Table 8 shows the best parameter values when tuning was performed based
on the correlation coefficient, together with results corresponding to these pa-
rameter values, in terms of correlation coefficient, RRSE and RMSE (for both
validation and test datasets). Table 9 shows similar results when parameters are
tuned on RRSE and RMSE together.
4.5 Summary
The experiments above indicate that it is hard to build good prediction models
based on the limited data available in NVD. For example, there is no version
information for most Microsoft instances (especially Windows instances). Some
results look promising (e.g. the models we built for Firefox), but they are far
from usable in practice. Below, we discuss what we believe to be the main reasons
for the difficulty of building good prediction models for TTNV from NVD.
4.6 Discussion
We believe the main factor affecting the predictive power of our models is the
low quality of the data from the National Vulnerability Database. Following are
several limitations of the data:
5 Related Work
Alhazmi and Malaiya [3] have addressed the problem of building models for
predicting the number of vulnerabilities that will appear in the future. They
targeted operating systems instead of applications. The Alhazmi-Malaiya Logis-
tic model works well for fitting existing data, when evaluated in terms of average
error (AE) and average bias (AB) of the number of vulnerabilities over time. How-
ever, fitting existing data is only a prerequisite for testing models: predictive power is
the most important criterion [18]. They did test the predictive accuracy of their
models and got satisfactory results [18].
Ozment [19] examined the vulnerability discovery models (proposed by Alhazmi
and Malaiya [3]) and pointed out some limitations that make these models inapplicable.
One of them is that there is not enough information included in government-supported
vulnerability databases (e.g. the National Vulnerability Database).
This is confirmed by our empirical study.
McQueen et al. [15] designed algorithms for estimating the number of zero-
day vulnerabilities on each given day. This number can indicate the overall risk
level from zero-day vulnerabilities. However, for different applications the risks
could be different. Our work aimed to construct software-specific prediction mod-
els.
Massacci et al. [14, 16] compared several existing vulnerability databases
based on the type of vulnerability features available in each of them. They men-
tioned that many important features are not included in most databases; e.g.,
the discovery date is hard to find. Even though certain databases (such as OSVDB,
which we also studied) claim to include these features, most of the entries are
blank. For their Firefox vulnerability database, they employed textual retrieval
techniques and took keywords from the CVS developers’ commit logs to derive several
other features by cross-referencing through CVE ids. They showed that by using
two different data sources for doing the same experiment, the results could be
quite different due to the high degree of inconsistency in the data available for
the research community at the current time. They further tried to confirm the
6 Conclusions
In this paper we present our effort in building prediction models for zero-day
vulnerabilities based on the information contained in the National Vulnerability
Database. Our research found that due to a number of limitations of this data
source, it is unlikely that one can build a practically usable prediction model at
this time. We presented our rigorous evaluation of various feature construction
schemes and parameter tuning for learning algorithms, and noticed that none
of the results obtained shows acceptable performance. We discussed possible
reasons why the data source may not be well suited to predict the desired
features for zero-day vulnerabilities.
References
1. Root relative squared error. Website. http://www.gepsoft.com/gxpt4kb/
Chapter10/Section1/SS07.htm.
2. Support vector machines. Website. http://www.dtreg.com/svm.htm.
3. Omar H. Alhazmi and Yashwant K. Malaiya. Prediction capabilities of vulner-
ability discovery models. In Annual Reliability and Maintainability Symposium
(RAMS), 2006.
4. Paul Ammann, Duminda Wijesekera, and Saket Kaushik. Scalable, graph-based
network vulnerability analysis. In 9th ACM Conference on Computer and Com-
munications Security(CCS), 2002.
5. Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann,
Alex Seewald, and David Scuse. WEKA Manual for Version 3.7. The University
of Waikato, 2010.
6. Andrew Buttner and Neal Ziring. Common platform enumeration (CPE) specification.
Technical report, The MITRE Corporation and National Security Agency, 2009.
7. Marc Dacier, Yves Deswarte, and Mohamed Kaâniche. Models and tools for quan-
titative assessment of operational security. In IFIP SEC, 1996.
8. J. Dawkins and J. Hale. A systematic approach to multi-stage network attack anal-
ysis. In Proceedings of Second IEEE International Information Assurance Work-
shop, pages 48 – 56, April 2004.
9. Rinku Dewri, Nayot Poolsappasit, Indrajit Ray, and Darrell Whitley. Optimal
security hardening using multi-objective optimization on attack tree models of
networks. In 14th ACM Conference on Computer and Communications Security
(CCS), 2007.
10. Kyle Ingols, Matthew Chu, Richard Lippmann, Seth Webster, and Stephen Boyer.
Modeling modern network attacks and countermeasures using attack graphs. In
25th Annual Computer Security Applications Conference (ACSAC), 2009.
11. Kyle Ingols, Richard Lippmann, and Keith Piwowarski. Practical attack graph
generation for network defense. In 22nd Annual Computer Security Applications
Conference (ACSAC), Miami Beach, Florida, December 2006.
12. Sushil Jajodia, Steven Noel, and Brian O’Berry. Topological analysis of network
attack vulnerability. In V. Kumar, J. Srivastava, and A. Lazarevic, editors, Manag-
ing Cyber Threats: Issues, Approaches and Challenges, chapter 5. Kluwer Academic
Publisher, 2003.
13. Richard Lippmann and Kyle W. Ingols. An annotated review of past papers on
attack graphs. Technical report, MIT Lincoln Laboratory, March 2005.
14. Fabio Massacci and Viet Hung Nguyen. Which is the right source for vulnerability
studies? an empirical analysis on mozilla firefox. In MetriSec, 2010.
15. Miles McQueen, Trever McQueen, Wayne Boyer, and May Chaffin. Empirical
estimates and observations of 0day vulnerabilities. In 42nd Hawaii International
Conference on System Sciences, 2009.
16. Viet Hung Nguyen and Le Minh Sang Tran. Predicting vulnerable software com-
ponents with dependency graphs. In MetriSec, 2010.
17. Xinming Ou, Wayne F. Boyer, and Miles A. McQueen. A scalable approach to
attack graph generation. In 13th ACM Conference on Computer and Communica-
tions Security (CCS), pages 336–345, 2006.
18. Andy Ozment. Improving vulnerability discovery models. In QoP ’07, 2007.
19. Andy Ozment. Vulnerability Discovery & Software Security. PhD thesis, University
of Cambridge, 2007.
20. Cynthia Phillips and Laura Painton Swiler. A graph-based system for network-
vulnerability analysis. In NSPW ’98: Proceedings of the 1998 workshop on New
security paradigms, pages 71–79. ACM Press, 1998.
21. Mike Schiffman, Gerhard Eschelbeck, David Ahmad, Andrew Wright, and Sasha
Romanosky. CVSS: A Common Vulnerability Scoring System. National Infrastruc-
ture Advisory Council (NIAC), 2004.
22. Oleg Sheyner, Joshua Haines, Somesh Jha, Richard Lippmann, and Jeannette M.
Wing. Automated generation and analysis of attack graphs. In Proceedings of the
2002 IEEE Symposium on Security and Privacy, pages 254–265, 2002.