
Enhancing Vulnerability Prioritization: Data-Driven Exploit Predictions with Community-Driven Insights

Jay Jacobs, Cyentia Institute ([email protected])
Sasha Romanosky, RAND Corporation ([email protected])
Octavian Suciu, University of Maryland ([email protected])
Ben Edwards, Cyentia Institute ([email protected])
Armin Sarabi, University of Michigan ([email protected])
arXiv:2302.14172v1 [cs.CR] 27 Feb 2023

ABSTRACT

The number of disclosed vulnerabilities has been steadily increasing over the years. At the same time, organizations face significant challenges patching their systems, leading to a need to prioritize vulnerability remediation in order to reduce the risk of attacks. Unfortunately, existing vulnerability scoring systems are either vendor-specific, proprietary, or only commercially available. Moreover, these and other prioritization strategies based on vulnerability severity are poor predictors of actual vulnerability exploitation because they do not incorporate new information that might impact the likelihood of exploitation.

In this paper we present the efforts behind building a Special Interest Group (SIG) that seeks to develop a completely data-driven exploit scoring system that produces scores for all known vulnerabilities, that is freely available, and which adapts to new information. The Exploit Prediction Scoring System (EPSS) SIG consists of more than 170 experts from around the world and across all industries, providing crowd-sourced expertise and feedback.

Based on these collective insights, we describe the design decisions and trade-offs that led to the development of the next version of EPSS. This new machine learning model provides an 82% performance improvement over past models in distinguishing vulnerabilities that are exploited in the wild and thus may be prioritized for remediation.

KEYWORDS

software vulnerabilities, exploit prediction, machine learning, EPSS, CVE

1 INTRODUCTION

Vulnerability management, the practice of identifying, prioritizing, and patching known software vulnerabilities, has been a continuous challenge for defenders for decades. This issue is exacerbated by the increasing number of new vulnerabilities that are disclosed annually. For example, MITRE published¹ 25,068 new vulnerabilities during the 2022 calendar year, a 24.3% increase over 2021.

Adding to the increasing rate of published vulnerabilities are the challenges incurred by practitioners when trying to remediate them. Recent research conducted by Kenna Security and Cyentia tracked exposed vulnerabilities at hundreds of companies and found that the monthly median rate of remediation was only 15.5%, while a quarter of companies remediated less than 6.6% of their open vulnerabilities per month [20]. As a consequence of the increasing awareness of software flaws and the limited capacity to remediate them, vulnerability prioritization has become both a chronic and an acute concern for every organization attempting to reduce its attack surface.

The prioritization process involves scoring and ranking vulnerabilities according to assessments, often based on the industry-standard Common Vulnerability Scoring System (CVSS) [17]. However, only the Base metric group of CVSS is assigned and distributed at scale by NIST, and this group of metrics is unable to adapt to post-disclosure information, such as the publication of exploits or technical artifacts, which can affect the odds of attacks against a vulnerability being observed in the wild. As a result, while only 5% of known vulnerabilities are exploited in the wild [21], numerous prior studies have shown that CVSS does not perform well when used to prioritize exploited vulnerabilities over those without evidence of exploitation [1, 3, 15]. While several other efforts have been made to capture exploitation likelihood in vulnerability assessments, these approaches are either vendor-specific [24, 28] or proprietary and not publicly available [26, 27, 36].

In order to improve remediation practices, network defenders need a scoring system that can accurately quantify the likelihood of exploits in the wild, and that is able to adapt to new information published after the initial disclosure of a vulnerability.

Any effort to develop a new capability to understand, anticipate, and respond to new cyber threats must overcome three main challenges: i) it must address the requirements of practitioners who rely on it; ii) it must provide significant performance improvements over existing scoring systems; and iii) it must have a low barrier to entry for adoption and use.

To address these challenges, a Special Interest Group (SIG) was formed in early 2020 at the Forum of Incident Response and Security Teams (FIRST). From its inception until the time of this writing, the Exploit Prediction Scoring System (EPSS) SIG has gathered 170 members from across the world, representing practitioners, researchers, government agencies, and software developers.² The SIG was created with the publication of the first EPSS model for predicting the likelihood of exploits in the wild [22] and is organized around a mailing list, a discussion forum, and bi-weekly meetings. This unique environment represented an opportunity to understand the challenges faced by practitioners when performing vulnerability prioritization, and therefore to address the first challenge raised above by designing a scoring system that takes practitioner requirements into account.

¹ Not marked as REJECT or RESERVED.
² See https://www.first.org/epss.
To address the second challenge and achieve significant performance improvements, the SIG provided subject matter expertise, which guided the engineering of features with high utility for predicting exploits in the wild. Finally, to address the challenge of designing a public and readily available scoring system, the SIG attracted a set of industry partners willing to share proprietary data for the development of the model, the output of which can then be made public. This allowed EPSS scores to be publicly available at scale, lowering the barrier to entry for those wanting to integrate EPSS into their prioritization pipeline.

This paper presents the latest (third) iteration of the EPSS model, as well as the lessons learned in its design and their impact on designing a scoring system. The use of a novel and diverse feature set and state-of-the-art machine learning techniques allows EPSS to improve prediction performance by 82% over its predecessor (as measured by the precision/recall area under the curve, which improved from 0.429 to 0.779). EPSS is able to score all vulnerabilities published on MITRE's CVE List (and the National Vulnerability Database), and can reduce the amount of effort required to patch critical vulnerabilities to one-eighth of that of a comparable strategy based on CVSS. This paper makes the following contributions:

(1) We present lessons learned from developing an exploit prediction model that integrates the functional requirements of a community of nearly 200 practitioners and researchers.
(2) We engineer novel features for exploit prediction and use them to train the EPSS classifier for predicting the likelihood of exploits in the wild.
(3) We analyze the practical utility of EPSS by showing that it can significantly improve remediation strategies compared to static baselines.

2 EVOLUTION OF EPSS

EPSS was initially inspired by the Common Vulnerability Scoring System (CVSS). The first EPSS model [22] was designed to be lightweight, portable (i.e. implemented in a spreadsheet), and parsimonious in terms of the data required to score vulnerabilities. Because of these design goals, the first model used a logistic regression, which produced interpretable and intuitive scores, and predicted the probability of exploitation activity being observed in the first year following the publication of a vulnerability. In order to be parsimonious, the logistic regression model was trained on only 16 independent variables (features) extracted at the time of vulnerability disclosure. While it outperformed CVSS, the SIG highlighted some key limitations which hindered its practical adoption.

Informed by this feedback, the second version of EPSS aimed to address the major limitations of the first version. The first design decision was to switch to a centralized architecture. By centralizing and automating the data collection and scoring, a more complex model could be developed to improve performance. This decision came with a trade-off, namely a loss of the model's portability and, thus, of the ability to score vulnerabilities which are not publicly disclosed (e.g., zero-day vulnerabilities, or flaws that may never be assigned a CVE ID). Nevertheless, focusing on public vulnerabilities under the centralized model removed the need for each implementation of EPSS to perform its own data collection, and further allowed more complex features and models. The model used in v2 is XGBoost [12], and the feature set was greatly expanded from 16 to 1,164. These efforts led to a significant improvement in predictive performance over the previous version by capturing higher-order interactions in the extended feature set. Another major benefit of a centralized architecture was being able to adapt to new vulnerability artifacts (e.g., the publication of exploits) and produce new predictions daily. Moreover, the SIG also commented that producing scores based on the likelihood of exploitation within the first year of a vulnerability's lifecycle was not very practical, since most prioritization decisions are made with respect to an upcoming patching cycle. As a result, v2 switched to predicting exploitation activity within the 30-day window following the time of scoring, which aligns with the typical remediation window of practitioners in the SIG.

For the third version of EPSS, the SIG highlighted a requirement for improved precision at identifying vulnerabilities likely to be exploited in the wild. This drove an effort to expand the sources of exploit data by partnering with multiple organizations willing to share data for model development, and to engineer more complex and informative features. These label and feature improvements, along with a methodical hyper-parameter tuning approach, enabled improved training of an XGBoost classifier. This allowed the proposed v3 model to achieve an overall 82% improvement in classifier performance over v2, with the area under the precision/recall curve increasing from 0.429 to 0.779. This boost in prediction performance allows organizations to substantially improve their prioritization practices and design data-driven patching strategies.

3 DATA

The data used in this research is based on 192,035 published vulnerabilities (not marked as "REJECT" or "RESERVED") listed in MITRE's Common Vulnerabilities and Exposures (CVE) list through December 31, 2022. The CVE identifier is used to combine records across our disparate data sources. Table 1 lists the categories of data, the number of features in each category, and the source(s) or other notes. In total, EPSS collects 1,477 unique independent variables for every vulnerability.

3.1 Ground truth: exploitation in the wild

EPSS collects and aggregates evidence of exploits from multiple sources: Fortiguard, Alienvault OTX, the Shadow Server Foundation, and GreyNoise (though not all sources cover the full time period). Each of these data sources employs network- or host-layer intrusion detection/prevention systems (IDS/IPS), or honeypots, in order to identify attempted exploitation. These systems are also predominantly signature-based (as opposed to anomaly-based) detection systems. Moreover, all of these organizations have large enterprise infrastructures of sensor and collection networks. Fortiguard, for example, manages tens of thousands of IDS/IPS devices that identify and report exploitation activity from across the globe. Alienvault OTX, GreyNoise, and the Shadow Server Foundation also maintain worldwide networks of sensors for detecting exploitation activity.
Table 1: Description of data sources used in EPSS.

Description | # of variables | Sources
Exploitation activity in the wild (ground truth) | 1 (with dates) | Fortinet, AlienVault, ShadowServer, GreyNoise
Publicly available exploit code | 3 | Exploit-DB, GitHub, Metasploit
CVE is listed/discussed on a list or website ("site") | 3 | CISA KEV, Google Project Zero, Trend Micro's Zero Day Initiative (ZDI)
Social media | 3 | Mentions/discussion on Twitter
Offensive security tools and scanners | 4 | Intrigue, sn1per, jaeles, nuclei
References with labels | 17 | MITRE CVE List, NVD
Keyword description of the vulnerability | 147 | Text description in MITRE CVE List
CVSS metrics | 15 | National Vulnerability Database (NVD)
CWE | 188 | National Vulnerability Database (NVD)
Vendor labels | 1,096 | National Vulnerability Database (NVD)
Age of the vulnerability | 1 | Days since CVE published in MITRE CVE list

These data sources include the list of CVEs observed to be exploited on a daily basis. The data are then cleaned, and exploitation activity is consolidated into a single boolean value (0 or 1) identifying the days on which exploitation activity was reported for any given CVE across any of the available data sources. Structuring the training data according to this boolean time series enables us to estimate the probability of exploitation activity in any upcoming window of time, though the consensus in the EPSS Special Interest Group was to standardize on a 30-day window to align with most enterprise patch cycles.

The exploit data used in this research covers activity from July 1, 2016 to December 31, 2022 (2,374 days / 78 months / 6.5 years), over which we collected 6.4 million exploitation observations (date and CVE combinations), targeting 12,243 unique vulnerabilities. Based on this data, we find that 6.4% (12,243 of 192,035) of all published vulnerabilities were observed to be exploited during this period, which is consistent with previous findings [21, 22].
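To make this consolidation step concrete, the following is a minimal sketch (not the production EPSS pipeline) of collapsing raw per-source exploitation records into the daily boolean ground truth described above; the frame and column names are hypothetical:

import pandas as pd

# Hypothetical raw records: one row per (date, CVE, source) observation
# reported by any data partner.
raw = pd.DataFrame({
    "date":   pd.to_datetime(["2022-12-01", "2022-12-01", "2022-12-02"]),
    "cve":    ["CVE-2021-44228", "CVE-2021-44228", "CVE-2017-0144"],
    "source": ["partner_a", "partner_b", "partner_a"],
})

# Consolidate across sources: 1 if any source reported exploitation
# activity for that CVE on that day, regardless of how many did.
ground_truth = (
    raw.groupby(["cve", "date"])["source"]
       .size()            # number of sources reporting that day
       .clip(upper=1)     # collapse counts to a 0/1 indicator
       .rename("exploited")
       .reset_index()
)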
3.2 Explanatory variables/features

In total, EPSS leverages 1,477 features for predicting exploitation activity. Next, we describe the data sources used to construct these features.

Published exploit code. We first consider the correlation between exploitation in the wild and the existence of publicly available exploit code, which is collected from three sources (courtesy of Cyentia³): Exploit-DB, GitHub, and Metasploit. In total we identified 24,133 CVEs with published exploit code, consisting of 20,604 CVEs from Exploit-DB, 4,049 published on GitHub, and 1,905 published as Metasploit modules. Even though Exploit-DB contains the majority of published exploits, GitHub has become a valuable source in recent years. For example, in 2022, 1,591 exploits were published on GitHub, while Exploit-DB and Metasploit added 196 and 94 entries, respectively.

Public vulnerability lists. Next, we consider that exploitation activity may be forecasted by the presence of vulnerabilities on popular lists and/or websites that maintain and share information about selective vulnerabilities. Google Project Zero maintains a listing⁴ of "publicly known cases of detected zero-day exploits".⁵ This may help us forecast exploitation activity as the vulnerability slides into N-day status. We include 162 unique CVEs listed by Google Project Zero.

Trend Micro's Zero Day Initiative (ZDI), the "world's largest vendor-agnostic bug bounty program",⁶ works with researchers and vendors to responsibly disclose zero-day vulnerabilities and issues public advisories about vulnerabilities at the conclusion of their process. We include 7,356 CVEs that have public advisories issued by ZDI.

The Known Exploited Vulnerabilities (KEV) catalog from the US Department of Homeland Security's Cybersecurity and Infrastructure Security Agency (CISA) is an "authoritative source of vulnerabilities that have been exploited in the wild".⁷ We include 866 CVEs from CISA's KEV list.

These sources lack transparency about when exploitation activity was observed, and for how long this activity was ongoing. However, because past exploitation attempts might influence the likelihood of future attacks, we include these indicators as binary features for our model.

Social media. Exploitation may also be correlated with social media discussions, and therefore we collect Twitter mentions of CVEs, counting these mentions within three different historical time windows (7, 30, and 90 days). We only count primary and original tweets and exclude retweets and quoted retweets. The median number of daily unique tweets mentioning CVEs is 1,308, with the 25th and 75th percentiles of daily tweets being 607 and 1,400 respectively. We currently make no attempt to validate the content or filter out automated posts (from bots).
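As an illustration, windowed mention counts of this kind can be derived as follows (a sketch with hypothetical data and column names, not the EPSS collection code):

import pandas as pd

# Hypothetical daily counts of original (non-retweet) tweets per CVE.
tweets = pd.DataFrame(
    {"cve": ["CVE-2021-44228"] * 6, "mentions": [12, 40, 7, 3, 9, 5]},
    index=pd.date_range("2022-11-25", periods=6, freq="D"),
)

# Trailing mention counts over the three historical windows.
for days in (7, 30, 90):
    tweets[f"mentions_{days}d"] = (
        tweets.groupby("cve")["mentions"]
              .transform(lambda s: s.rolling(f"{days}D").sum())
    )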
Offensive security tools. We also collect evidence of vulnerabilities being used in offensive security tools that are designed, in part, to identify vulnerabilities during penetration tests. We currently gather information from four different offensive security tools, with varying numbers of CVEs identified in each: Nuclei with 1,548 CVEs, Jaeles with 206 CVEs, Intrigue with 169 CVEs, and Sn1per with 63 CVEs. These are encoded as binary features which indicate whether each particular source is capable of scanning for and reporting on the presence of each vulnerability.

³ https://www.cyentia.com/services/exploit-intelligence-service
⁴ https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/view#gid=1190662839
⁵ https://googleprojectzero.blogspot.com/p/0day.html
⁶ https://www.zerodayinitiative.com/about
⁷ https://www.cisa.gov/known-exploited-vulnerabilities
References. In order to capture metrics around the activity and analysis related to vulnerabilities, for each CVE we count the number of references listed in MITRE's CVE List, as well as the number of references with each of the 16 reference tags assigned by NVD. The labels and their associated prevalence across CVEs are: Vendor Advisory (102,965), Third Party Advisory (84,224), Patch (59,660), Exploit (54,633), VDB Entry (31,880), Issue Tracking (16,848), Mailing List (15,228), US Government Resource (11,164), Release Notes (9,308), Permissions Required (3,980), Broken Link (3,934), Product (3,532), Mitigation (2,983), Technical Description (1,686), Not Applicable (961), and Press/Media Coverage (124).

Keyword description of the vulnerability. To capture attributes of the vulnerabilities themselves, we use the same process as described in previous research [21, 22]. This process detects and extracts hundreds of common multiword expressions used to describe and discuss vulnerabilities. These expressions are then grouped and normalized into common vulnerability concepts. The top tags included, with their associated CVE counts, are as follows: "remote attacker" (80,942), "web" (31,866), "code execution" (31,330), "denial of service" (28,478), and "authenticated" (21,492). In total, we include 147 binary features for identifying such tags.

We followed the same process as EPSS v1 for extracting multiword expressions from the text of references using Rapid Automatic Keyword Extraction [31].
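The grouping and normalization pipeline is described in [21, 22]; purely for illustration, a stripped-down version of the final step, mapping normalized expressions to binary tag features, might look like the following (the tag vocabulary and patterns here are hypothetical; the real vocabulary has 147 entries):

import re
import pandas as pd

# A few normalized vulnerability concepts and illustrative surface forms.
TAGS = {
    "remote attacker": [r"remote attacker[s]?"],
    "code execution": [r"(execute|execution of) arbitrary code", r"code execution"],
    "denial of service": [r"denial of service", r"\bdos\b"],
}

def tag_features(description: str) -> dict:
    """Binary indicator per concept: does any surface form appear?"""
    text = description.lower()
    return {
        tag: int(any(re.search(p, text) for p in patterns))
        for tag, patterns in TAGS.items()
    }

desc = "A remote attacker could execute arbitrary code via a crafted packet."
features = pd.Series(tag_features(desc))  # remote attacker=1, code execution=1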
CVSS metrics. To capture other attributes of vulnerabilities, we collect the CVSS base metrics. These consist of the exploitability measurements (attack vector, attack complexity, privileges required, user interaction, scope) and the three impact measurements (confidentiality, integrity, and availability). These categorical variables are encoded using one-hot encoding. We collected CVSS version 3 information from NVD for 118,087 vulnerabilities. However, 73,327 vulnerabilities were published before CVSSv3 was created and are only scored in NVD using CVSSv2. To address this, we developed a separate and dedicated machine learning model to estimate the CVSSv3 measurement values for each of these vulnerabilities.

We use a process similar to prior work [25], where for each CVE we use the CVSSv2 sub-components for CVEs which have both CVSSv2 and CVSSv3 scores. We then train a feedforward neural network to predict CVSSv3 vectors. The model was validated using 8-fold, yearly stratified, cross-validation, achieving 74.9% accuracy when predicting the exact CVSSv3 vector. For 99.9% of vectors, we predict the majority (5 or more) of the individual metrics correctly. For each individual portion of the CVSSv3 vector we were able to achieve a minimum of 93.4% accuracy (on the Privileges Required metric). We note that this exceeds the accuracy achieved by [25], and likely warrants further research into the robustness of CVSSv3 prediction and its possible application to future versions of CVSS.
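A minimal sketch of the one-hot encoding step (toy data; the NVD parsing that produces these columns is omitted, and column names are hypothetical):

import pandas as pd

# CVSS v3 base metrics as categorical columns. One-hot encoding expands
# each category into a binary indicator, e.g. av = "N" becomes av_N = 1.
cvss = pd.DataFrame({
    "cve": ["CVE-2021-44228", "CVE-2017-0144"],
    "av": ["N", "N"],   # attack vector
    "ac": ["L", "H"],   # attack complexity
    "pr": ["N", "N"],   # privileges required
    "ui": ["N", "N"],   # user interaction
})
encoded = pd.get_dummies(cvss, columns=["av", "ac", "pr", "ui"])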
CWE. We also capture the observation that different types of vulnerabilities may be more or less attractive to attackers, using the Common Weakness Enumeration (CWE), which is a "community-developed list of software and hardware weakness types".⁸ We collect the CWE assignments from NVD, noting that 21,570 CVEs do not have a CWE assigned. We derived binary features for CWEs found across at least 10 vulnerabilities, resulting in 186 CWE identifiers being included. In addition, we maintain two features for vulnerabilities where CWE information is not available, or where the assigned CWEs are not among the common ones. The top CWE identifiers and their vulnerability counts are CWE 79 (20,797), CWE 119 (11,727), CWE 20 (9,590), CWE 89 (8,790), CWE 787 (7,624), CWE 200 (7,270), CWE 264 (5,485), CWE 22 (4,918), CWE 125 (4,743), and CWE 352 (4,081).

Vulnerable vendors. We suspect exploitation activity may be correlated with the market share and/or install base companies achieve. Therefore, we parse the Common Platform Enumeration (CPE) data provided by NVD in order to identify platform records marked as "vulnerable", and extract only the vendor portion of the record. We did not make any attempt to fill in missing information or correct any typos or misspellings that may occasionally appear in the records. We ranked vendors according to the number of vulnerabilities, creating one binary feature for each vendor, and evaluated the effect of including less frequent vendors as features. We observed no performance improvements by including vendors with fewer than 10 CVEs in our dataset. As a result, we extracted 1,040 unique vendor features in the final model. The most prevalent vendors and their vulnerability counts are Microsoft (10,127), Google (9,100), Oracle (8,970), Debian (7,627), Apple (6,499), IBM (6,409), Cisco (5,766), RedHat (4,789), Adobe (4,627), and Fedora Project (4,166).

Age of the vulnerability. Finally, the age of a vulnerability might contribute to or detract from the likelihood of exploitation. Intuitively, we expect old vulnerabilities to be less attractive to attackers due to a smaller vulnerable population. To capture this, we create a feature which records the number of days elapsed from CVE publication to the time of feature extraction in our model.

4 MODELING APPROACH

4.1 Preparing ground truth and features

Exploitation activity is considered as any recorded attempt to exploit a vulnerability, regardless of the success of the attempt, and regardless of whether the targeted vulnerability is present. All observed exploitation activity is recorded with the date the activity occurred and aggregated across all data sources by the date and CVE identifier. The resulting ground truth is a binary value for each vulnerability, for each day, indicating whether exploitation activity was observed or not.

Since many of the features may change day by day, we construct features for the training data on a daily basis. In order to reduce the size of our data (and thus the time and memory needed to train models) we aggregate consecutive daily observations where features do not change. The size of the exposure and the number of days with exploitation activity are included in the model training.

⁸ https://cwe.mitre.org
When constructing the test data, a single date is selected (typically "today"; see the next section) and all of the features are generated based on the state of vulnerabilities on that date. Since the final model is intended to estimate the probability of exploitation in the next 30 days, we construct the ground truth for the test data by looking for exploitation activity over the 30 days following the selected test date.
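A minimal sketch of this 30-day label construction, reusing the daily boolean ground truth from Section 3.1 (names hypothetical, not the production code):

import pandas as pd

def label_next_30_days(ground_truth: pd.DataFrame, test_date: str) -> pd.Series:
    """1 if a CVE has any exploitation activity in the 30 days after test_date.

    ground_truth has columns (cve, date, exploited). In practice the label
    index would cover all published CVEs, not only those with any
    observed activity, as here.
    """
    start = pd.Timestamp(test_date)
    end = start + pd.Timedelta(days=30)
    window = ground_truth[(ground_truth["date"] > start) & (ground_truth["date"] <= end)]
    exploited = set(window.loc[window["exploited"] == 1, "cve"])
    all_cves = ground_truth["cve"].unique()
    return pd.Series([int(c in exploited) for c in all_cves], index=all_cves, name="y")

# Example: labels for the paper's test date.
# labels = label_next_30_days(ground_truth, "2021-12-01")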
4.2 Model selection

The first EPSS model [22] sought not only to accurately predict exploitation, but to do so in a parsimonious, easy-to-implement way. As a result, regularized logistic regression (Elasticnet) was chosen to produce a generalized linear model with only a handful of variables. The current model relaxes this requirement in the hopes of improving performance and providing more accurate exploitation predictions. In particular, capturing non-linear relationships between inputs and exploitation activity will better predict finer-grained exploitation activity.

Replacing the requirement of a simple model with the need to model complex relationships expands the universe of potential models. Indeed, many machine learning algorithms have been developed for this exact purpose. However, testing all models is impractical because each model requires significant engineering and calibration to achieve an optimal outcome. We therefore focus on a single type of model that has proven to be particularly performant on data such as ours. Recent research has illustrated that panel (tabular) data like ours can be most successfully modeled using tree-based methods (in particular, gradient boosted trees) [18], arriving at similar or better predictive performance with less computation and tuning in comparison to other methods such as neural networks. Given the results in [18], we focus our efforts on tuning a common implementation of gradient boosted trees, XGBoost [12]: a popular, well-documented, and performant implementation of the gradient boosted tree algorithm, in which successive decision trees are trained to iteratively reduce prediction error.
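For illustration, training such a classifier with the non-default hyperparameter values reported later in Table 2 might look like the following sketch (toy data; this is not the production training code):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 20))  # toy binary feature matrix
y = rng.integers(0, 2, size=1000)        # toy 30-day exploitation labels

# Hyperparameter values follow Table 2; everything else is left at
# the XGBoost defaults.
model = xgb.XGBClassifier(
    learning_rate=0.11,
    max_depth=20,
    subsample=0.75,
    gamma=10,            # minimum loss reduction for a leaf node partition
    max_delta_step=0.9,
    n_estimators=65,     # number of boosting rounds
    objective="binary:logistic",
)
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]  # probability of exploitation activity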
4.3 Train/test split and measuring performance

In order to reduce over-fitting, we implement two restrictions. First, we implement a time-based train/test split, constructing our training data sets on data up to and including October 31, 2021. We then construct the test data set based on the state of vulnerabilities on December 1st, 2021, providing one month between the end of the training data and the test data. As mentioned above, the ground truth in the test data is any exploitation activity from December 1st to December 30th, 2021. Second, we use 5-fold cross validation, with the folds based on each unique CVE identifier. This selectively removes vulnerabilities from the training data and tests the performance on the hold-out set, thus further reducing the likelihood of over-fitting.

Finally, we measure performance by calculating the area under the curve (AUC) based on precision and recall across the full range of predictions. We selected precision-recall since we have severe class imbalance in exploited vulnerabilities, and using accuracy or traditional Receiver Operating Characteristic (ROC) curves may be misleading due to that imbalance.
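A sketch of the CVE-grouped cross-validation and the precision-recall AUC measurement (toy data; the real folds are built from the time-split training data described above):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import GroupKFold
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(500, 10))
y = rng.integers(0, 2, size=500)
cve_ids = rng.integers(0, 100, size=500)  # fold membership follows the CVE id

pr_aucs = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=cve_ids):
    clf = xgb.XGBClassifier(n_estimators=65, learning_rate=0.11)
    clf.fit(X[train_idx], y[train_idx])
    p = clf.predict_proba(X[test_idx])[:, 1]
    precision, recall, _ = precision_recall_curve(y[test_idx], p)
    pr_aucs.append(auc(recall, precision))  # area under the PR curve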
4.4 Tuning and optimizing model performance

Despite being a well-studied approach, the use of gradient boosted trees and XGBoost for prediction problems still requires some effort to identify useful features, and model tuning, to achieve good model performance. This requires a priori decisions about which features to include and about the hyperparameter values for the XGBoost algorithm.

The features outlined in subsection 3.2 include 28,724 variables. Many of these variables are binary features indicating whether a vulnerability affects a particular vendor or can be described by a specific CWE. While the XGBoost algorithm is efficient, including all of these variables in our inference is technically infeasible. To reduce the scope of features, we take a naive, yet demonstrably effective, approach of removing variables below a specific occurrence rate [39]. This reduced the input feature set to 1,477 variables.

One additional challenge with our data is the temporal nature of our predictions: in particular, exactly how much historical data should be included in the data set. In addition to the XGBoost hyperparameters and the sparsity threshold, we also constructed four different sets of training data covering 6 months and 1, 2, and 3 years, to determine which time horizon would provide the best predictions.

To identify the time horizon and sparsity threshold described above, as well as the other hyperparameters needed by our implementation of gradient boosted trees, we take a standard approach described in [38]. We first define reasonable ranges for the hyperparameters, use Latin Hypercube sampling over the set of possible combinations, and compute model performance for each sampled set of hyperparameters; we then build an additional model (also a gradient boosted tree) to predict performance given a set of hyperparameters, using that model to maximize performance.

The above process results in the parameters selected in Table 2. Note that of the tested time horizons, none dramatically outperformed the others, with 1 year only slightly outperforming the other tested possibilities.

Table 2: Non-default hyperparameter values for the XGBoost algorithm and data selection

Parameter | Value
Time horizon | 1 year
Learning rate | 0.11
Max tree depth | 20
Subsample ratio of the training instances | 0.75
Minimum loss reduction for leaf node partition | 10
Maximum delta step | 0.9
Number of boosting rounds | 65
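A sketch of the Latin Hypercube sampling step (the ranges below are illustrative assumptions, not the ones the EPSS team used):

from scipy.stats import qmc

# Hyperparameter ranges to explore (illustrative).
space = {
    "learning_rate": (0.01, 0.3),
    "max_depth": (3, 25),
    "subsample": (0.5, 1.0),
    "gamma": (0, 20),
}
names = list(space)
sampler = qmc.LatinHypercube(d=len(names), seed=7)
unit = sampler.random(n=50)  # 50 candidate settings in the unit hypercube
lows = [space[k][0] for k in names]
highs = [space[k][1] for k in names]
candidates = qmc.scale(unit, lows, highs)

for row in candidates:
    params = dict(zip(names, row))
    params["max_depth"] = int(round(params["max_depth"]))
    # A hypothetical evaluate_pr_auc(params) would train a model and
    # return its PR AUC; a surrogate gradient boosted tree fit on the
    # (params, PR AUC) pairs then proposes the best-performing setting.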
5 EVALUATION

5.1 Precision (efficiency) and recall (coverage)

Precision and recall are commonly used machine learning performance metrics, but they are not intuitive for security practitioners, and it can therefore be difficult to contextualize what these performance metrics represent in practice.
Precision (efficiency) measures how well resources are being allocated (where low efficiency represents wasted effort), and is calculated as the true positives divided by the sum of the true and false positives.

In the vulnerability management context, efficiency addresses the question, "out of all the vulnerabilities remediated, how many were actually exploited?" If a remediation strategy suggests patching 100 vulnerabilities, 60 of which were exploited, the efficiency would be 60%.

Recall (coverage), on the other hand, considers how well a remediation strategy actually addresses those vulnerabilities that should be patched (e.g., that have observed exploitation activity), and is calculated as the true positives divided by the sum of the true positives and false negatives.

In the vulnerability management context, coverage addresses the question, "out of all the vulnerabilities that are being exploited, how many were actually remediated?" If 100 vulnerabilities are exploited, 40 of which are patched, the coverage would be 40%.

Therefore, for the purpose of this article, we use the terms efficiency and coverage interchangeably with precision and recall, respectively, in the discussions below.

[Figure 1: Performance of EPSS v3 compared to previous versions and the CVSS Base Score. Precision (efficiency) versus recall (coverage) curves for EPSS v3, EPSS v2, EPSS v1, and the CVSS v3.x base score. Labeled points show thresholds; CVEs scoring at or above a threshold are prioritized.]
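In other words, efficiency = TP / (TP + FP) and coverage = TP / (TP + FN). A small self-contained helper makes this mapping explicit (a sketch; variable names are hypothetical):

def efficiency_and_coverage(scores, exploited, threshold):
    """Efficiency = TP/(TP+FP); coverage = TP/(TP+FN), per Section 5.1."""
    remediated = [s >= threshold for s in scores]
    tp = sum(r and e for r, e in zip(remediated, exploited))
    fp = sum(r and not e for r, e in zip(remediated, exploited))
    fn = sum((not r) and e for r, e in zip(remediated, exploited))
    return tp / (tp + fp), tp / (tp + fn)

scores = [0.9, 0.8, 0.1, 0.05]
exploited = [True, False, True, False]
eff, cov = efficiency_and_coverage(scores, exploited, threshold=0.5)
# eff = 0.5 (1 of 2 patched CVEs was exploited)
# cov = 0.5 (1 of 2 exploited CVEs was patched)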

5.2 Model performance

After several rounds of experiments to find the optimal set of features, amount of historical data, and model parameters, as discussed in the previous section, we generated one final model using all vulnerabilities from November 1st, 2021 to October 31st, 2022. We then predicted the probability of exploitation activity in the next 30 days based on the state of vulnerabilities on December 1st, 2022. Using evidence of exploitation activity for the following 30 days (through Dec 30th, 2022), we measured overall performance as shown in Figure 1. For comparison, we also show performance metrics for EPSS versions 1 and 2, as well as the CVSS v3 base score, for the same date and exploitation activity (Dec 1st, 2022). Figure 1 includes points along the precision-recall curves that represent thresholds for each prioritization strategy.

Figure 1 clearly illustrates the significant improvement of the EPSS v3 model over previous versions, as well as over the CVSS version 3 base score.
EPSS v3 produces an area under the curve (AUC) of 0.7795 and an F1 score of 0.728. A remediation strategy based on this F1 score would prioritize remediation for vulnerabilities with EPSS probabilities of 0.36 and above, and would achieve an efficiency of 78.5% and coverage of 67.8%. In addition, this strategy would prioritize remediation of 3.5% of all published vulnerabilities (representing the level of effort).

EPSS v2 has an AUC of 0.4288 and a calculated F1 score of 0.451, which prioritizes vulnerabilities with a probability of 0.16 and above. At the F1 threshold, EPSS v2 achieves an efficiency rating of 45.5% and coverage of 44.8%, and prioritizes 4% of the vulnerabilities in our study. EPSS v1 has an AUC of 0.2998 and a calculated F1 score of 0.361, which prioritizes vulnerabilities with a probability of 0.2 and above. At the F1 threshold, EPSS v1 achieves an efficiency rating of 43% and coverage of 31.1%, and prioritizes 2.9% of the vulnerabilities in our study. Finally, the CVSS v3.x base score has an AUC of 0.051 and a calculated F1 score of 0.108, which prioritizes vulnerabilities with a CVSS base score of 9.7 or higher. At the F1 threshold, CVSS v3.x achieves an efficiency rating of 6.5% and coverage of 32.3%, and prioritizes 13.7% of the vulnerabilities in our study.

5.3 Probability calibrations

A significant benefit of this model over alternative exploit scoring systems (described above) is that the output scores are true probabilities (i.e., the probability of any exploitation activity being observed in the next 30 days) and can therefore be scaled to produce a threat score based on one or more vulnerabilities, such as would be found in a single network device (laptop, server), a network segment, or an entire enterprise. For example, standard mathematical techniques can be used to answer questions like "what is the probability that at least one of this asset's vulnerabilities will be exploited in the next 30 days?" Such estimates, however, are only useful if they are calibrated and therefore reflect the true likelihood of the event occurring.

In order to address this, we measure calibration in two ways. First, we calculate a Brier score [8], which produces a score between 0 and 1, with 0 being perfectly calibrated and 1 being perfectly uncalibrated (the original 1950 paper doubles the range, from 0 to 2). Our final estimate revealed a Brier score of 0.0162, which is objectively very low (good). We also plot the predicted (binned) values against the observed (binned) exploitation activity (commonly referred to as a "calibration plot"), as shown in Figure 2. The closer the plotted line is to a 45-degree line (i.e. a line with a slope of 1, represented by the dashed line), the better the calibration. By visual inspection, our plotted line very closely matches the 45-degree line.
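Both checks are easy to reproduce. The sketch below computes a Brier score on toy values, along with the "at least one vulnerability exploited" aggregation, which treats the per-vulnerability events as independent (an assumption made here for illustration, not a claim made by the model):

import numpy as np
from sklearn.metrics import brier_score_loss

# Toy predicted probabilities and observed 30-day outcomes.
p = np.array([0.9, 0.2, 0.05])
observed = np.array([1, 0, 0])
brier = brier_score_loss(observed, p)  # 0 = perfectly calibrated

# Probability that at least one of an asset's vulnerabilities is
# exploited in the next 30 days, assuming independent events.
p_any = 1 - np.prod(1 - p)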
[Figure 2: Calibration plot comparing predicted probabilities of exploitation on Dec 1, 2022 (x-axis, log scale) to the share of CVEs observed with exploitation activity in the 30 days following Dec 1, 2022 (y-axis, log scale).]

5.4 Simple Remediation Strategies

Research conducted by Kenna Security and Cyentia tracked vulnerabilities at hundreds of companies and found that, on average, companies were only able to remediate about 15.5% of their open vulnerabilities in a month [20]. This research also found that resource capacity for remediating vulnerabilities varies considerably across companies, which suggests that any vulnerability remediation strategy should accommodate varying levels of corporate resources and budgets. Indeed, organizations with fewer resources (presumably smaller organizations) may prefer to emphasize efficiency over coverage, to optimize their spending, while larger organizations may accept less efficient strategies in exchange for greater coverage (i.e. more vulnerabilities patched).

Therefore, we compare the amount of effort required (as measured by the number of vulnerabilities needing to be remediated) for differing remediation strategies. Figure 3 highlights the performance of six simple (but practical) vulnerability prioritization strategies based on our test data (December 1st, 2022).⁹

[Figure 3: Alternative strategies based on simple heuristics. Each panel shows, for one strategy, the set of all CVEs, the CVEs prioritized, and the CVEs exploited:
CVSS:3.1/PR:N – Effort: 70.4% of CVEs, Coverage: 88.1%, Efficiency: 5.1%
Tag:Code Execution – Effort: 17.2% of CVEs, Coverage: 48.0%, Efficiency: 11.4%
Exploit:Exploit DB – Effort: 10.9% of CVEs, Coverage: 34.7%, Efficiency: 13.0%
CWE-119: Buffer Overflow – Effort: 6.2% of CVEs, Coverage: 16.9%, Efficiency: 11.1%
Exploit:metasploit – Effort: 1.0% of CVEs, Coverage: 14.9%, Efficiency: 60.5%
Site:KEV – Effort: 0.5% of CVEs, Coverage: 5.9%, Efficiency: 53.2%]

The first diagram in the upper row considers a strategy based on the CVSS v3.x vector of "Privileges Required: None". From an attacker's perspective, a vulnerability that doesn't require any established account credentials is an attractive one to exploit. While this strategy would yield 88.1% coverage, it would achieve only 5.1% efficiency. That is, from a defender perspective, this class of vulnerabilities represents over 130,000 (70%) of all published CVEs, and would easily surpass the resource capacity of most organizations.

"Code Execution" is another attractive vulnerability attribute for attackers, since these vulnerabilities could allow the attacker to achieve full control of a target asset. However, remediating all the code execution vulnerabilities (17% or about 32,000 of all CVEs) would achieve 48% coverage and 11.4% efficiency.

The middle row of Figure 3 shows remediation strategies for vulnerabilities published in Exploit DB (left) and Buffer Overflows (CWE-119; right), respectively.

The bottom row of Figure 3 is especially revealing. The bottom right diagram shows performance metrics for a remediation strategy based on patching vulnerabilities from the Known Exploited Vulnerabilities (KEV) list (as of Dec 1, 2022) from DHS/CISA. The KEV list is meant to prioritize vulnerability remediation for US Federal agencies as per Binding Operational Directive 22-01.¹⁰ Strictly following the KEV would remediate half of one percent (0.5%) of all published CVEs, and produce a relatively high efficiency of 53.2%. However, with almost 8,000 unique CVEs with exploitation activity in December, the coverage obtained from this strategy is only 5.9%.

Alternatively, the bottom left diagram shows a remediation strategy based on whether a vulnerability appears in a Metasploit module. In this case, a network defender would need to remediate almost twice as many vulnerabilities as on the KEV list, but would enjoy 13% greater efficiency (60.5% vs 53.2%) and almost three times more coverage (14.9% vs 5.9%). Therefore, based on this simple heuristic comparison (KEV vs Metasploit), the Metasploit strategy outperforms the KEV strategy.

⁹ Performance is then measured based on exploitation activity in the following 30 days.
¹⁰ See https://www.cisa.gov/binding-operational-directive-22-01
[Figure 4: Strategy comparisons holding the level of effort constant. CVSS v3.x (threshold 9.1+): Effort 15.1% of CVEs, Coverage 33.5%, Efficiency 6.1%. EPSS v1 (threshold 0.062+): Effort 15.1% of CVEs, Coverage 57.0%, Efficiency 15.4%. EPSS v2 (threshold 0.037+): Effort 15.4% of CVEs, Coverage 69.9%, Efficiency 18.5%. EPSS v3 (threshold 0.022+): Effort 15.3% of CVEs, Coverage 90.4%, Efficiency 24.1%. Legend: All CVEs / CVEs Above Threshold / Exploited.]

[Figure 5: Strategy comparisons holding the coverage constant. CVSS v3.x (threshold 7+): Effort 58.1% of CVEs, Coverage 82.1%, Efficiency 3.9%. EPSS v1 (threshold 0.015+): Effort 44.3% of CVEs, Coverage 82.2%, Efficiency 7.6%. EPSS v2 (threshold 0.012+): Effort 39.0% of CVEs, Coverage 84.7%, Efficiency 8.9%. EPSS v3 (threshold 0.088+): Effort 7.3% of CVEs, Coverage 82.0%, Efficiency 45.5%. Legend: All CVEs / CVEs Above Threshold / Exploited.]

5.5 Advanced remediation strategies

Next we explore the real-world performance of our model using two separate approaches. We first compare coverage among four remediation strategies while holding the level of effort constant (i.e. the number of vulnerabilities needing to be remediated); we then compare levels of effort while holding coverage constant.

Figure 4 compares the four strategies while maintaining approximately the same level of effort. That is, the blue circle in the middle of each diagram – representing the number of vulnerabilities that would need to be remediated – is fixed to the same size for each strategy, at approximately 15% or about 28,000 vulnerabilities. The CVSS strategy, for example, would remediate vulnerabilities with a base score of 9.1 or greater, and would achieve coverage and efficiency of 33.5% and 6.1%, respectively.

A remediation strategy based on EPSS v2, on the other hand, would remediate vulnerabilities with an EPSS v2 score of 0.037 and greater, yielding 69.9% coverage and 18.5% efficiency. Already, this strategy doubles the coverage and triples the efficiency relative to the CVSS strategy. Even better results are achieved with a remediation strategy based on EPSS v3, which enjoys 90.4% coverage and 24.1% efficiency.

Figure 5 compares the four strategies while maintaining approximately the same level of coverage, that is, the proportion of the red circle (exploitation activity) covered by the blue circle (number of vulnerabilities needing to be remediated). The baseline for coverage is set by a CVSS strategy of remediating vulnerabilities with a base score of 7 and above (CVEs with a "High" or "Critical" CVSS score). Such a strategy yields a respectable coverage of 82.1%, but at the cost of a higher level of effort, needing to remediate 58.1%, or about 110,000, of all published CVEs. Practitioners can achieve a similar level of coverage (82%) using EPSS v3 by prioritizing vulnerabilities scored at 0.088 and above, but with a much lower level of effort, needing to remediate only 7.3%, or just under 14,000, vulnerabilities.

In other words, remediating CVEs rated High or Critical by CVSS v3 gives a respectable level of coverage at 82.1% but requires remediating 58.1% of published CVEs, whereas EPSS v3 achieves the same level of coverage while reducing the effort from 58.1% to 7.3% of all CVEs, or fewer than 14,000 vulnerabilities.

6 DISCUSSION AND FUTURE WORK

Currently, the EPSS model ingests data concerning which vulnerabilities were exploited on which days. However, exploitation has many other characteristics which may be useful to capture and examine. For example, we may be interested in studying the number of exploits per vulnerability (volume), the fragmentation of exploitation over time (that is, the pattern of periods of exploitation), or prevalence, which would measure the spread of exploitation, typically by counting the number of devices detecting exploitation. We leave these topics for future work.
6.1 Limitations and adversarial considerations

This research is conducted with a number of limitations. First, insights are limited to data collected from our data partners and the geographic and organizational coverage of their network collection devices. While these data providers collectively manage hundreds of thousands of sensors across the globe, and across organizations of all sizes and industries, they do not observe every attempted exploit event in every network. Nevertheless, it is plausible to think that the data used, and therefore any inferences provided, are representative of all mass exploitation activity.

In regard to the nature of how vulnerabilities are detected, any signature-based detection device is only able to alert on events that it was programmed to observe. Therefore, we are not able to observe vulnerabilities that were exploited but undetected by the sensor because a signature was not written.

Moreover, the nature of the detection devices generating the events will be biased toward detecting network-based attacks, as opposed to attacks from other attack vectors such as host-based attacks or methods requiring physical proximity.¹¹ Similarly, these detection systems will typically be installed on public-facing perimeter internet devices, and are therefore less suited to detecting attacks against internet of things (IoT) devices, automotive networks, ICS, SCADA, operational technology (OT), medical devices, etc.

Given the exploit data from the data partners, we are not able to distinguish between exploit activity generated by researchers or commercial entities and actual malicious exploit activity. While it is likely that some proportion of exploitation does originate from non-malicious sources, at this point we have no reliable way of estimating the true proportion. However, based on the collective authors' experience, and on discussions with our data providers, we do not believe that this represents a significant percentage of exploitation activity.

While these points may limit the scope of our inferences, to the extent that our data collection is representative of an ecosystem of public-facing, network-based attacks, we believe that many of the insights presented here are generalizable beyond this dataset.

In addition to these limitations, there are other adversarial considerations that fall outside the scope of this paper. For example, one potential concern is the opportunity for adversarial manipulation, either of the EPSS model or using the EPSS scores. It may be possible for malicious actors to poison or otherwise manipulate the input data to the EPSS model (e.g. GitHub, Twitter). These issues have been studied extensively in the context of machine learning for exploit prediction [32] and other tasks [10, 33], and their potential impact is well understood. Given that we have no evidence of such attacks in practice, and given our reliance on data from many distinct sources, which would reduce the leverage of adversaries, we leave an in-depth investigation of the matter for future work. Additionally, it is possible that malicious actors may change their strategies based on EPSS scores. For example, if network defenders increasingly adopt EPSS as the primary method for prioritizing vulnerability remediation, thereby deprioritizing vulnerabilities with lower EPSS scores, it is conceivable that attackers would begin to strategically incorporate these lower-scoring vulnerabilities into their tactics and malware. While possible, we are not aware of any actual or suggestive evidence to this effect.

Finally, while evolving the model from a logistic regression to a more sophisticated machine learning approach greatly improved the performance of EPSS, an important consequence is that the interpretability of variable contributions is more difficult to quantify, as we discuss in the next section.

6.2 Variable importance and contribution

While an XGBoost model is not nearly as intuitive or interpretable as linear regression, we can use SHAP values [23] to reduce the opacity of a trained model by quantifying feature contributions, breaking down the score assigned to a CVE as ϕ_0 + Σ_i ϕ_i, where ϕ_i is the contribution from feature i, and ϕ_0 is a bias term. We use SHAP values due to their good properties, such as local accuracy (attributions sum up to the output of the model), missingness (missing features are given no importance), and consistency (modifying a model so that a feature is given more weight never decreases its attribution).
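For illustration, per-feature SHAP contributions for a trained XGBoost model can be computed with the shap library as follows (toy data; not the EPSS pipeline):

import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = rng.integers(0, 2, size=200)

model = xgb.XGBClassifier(n_estimators=20).fit(X, y)
explainer = shap.TreeExplainer(model)
phi = explainer.shap_values(X)       # phi[i, j]: contribution of feature j to row i
base = explainer.expected_value      # the bias term phi_0
# Local accuracy: base + phi[i].sum() reproduces the model's raw
# (log-odds) output for row i.
mean_abs = np.abs(phi).mean(axis=0)  # mean |SHAP| per feature, as in Figure 7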
[Figure 6: Density plots of the absolute SHAP values for each family of features (Exploit Code, CVE (age+refs), CVSS Vectors, Sites, Scanners, Twitter, Tag, CWE, Vendor), on a log-scaled Shapley value axis.]

The contributions from different classes of variables in the kernel density plot are shown in Figure 6. First, note that the figure displays the absolute value of the SHAP values, in order to infer the contribution of each variable away from zero. Second, note that the horizontal axis is presented on a log scale to highlight that the majority of features do not contribute much weight to the final output. In addition, the thin line extending out to the right in Figure 6 illustrates how there are instances of features within each class that contribute a significant amount. Finally, note that Figure 6 is sorted in decreasing mean absolute SHAP value for each class of features, highlighting the observation that published exploit code is the strongest contributor to the estimated probability of exploitation activity.

Figure 7 identifies the 30 most significant features with their calculated mean absolute SHAP values. Again, note that higher values imply a greater influence (either positive or negative) on the final predicted value.

¹¹ For example, it is unlikely to find evidence of exploitation for CVE-2022-37418 in our data set, a vulnerability in the remote keyless entry systems on specific makes and models of automobiles.
[Figure 7: Mean absolute SHAP value for individual features. In decreasing order: CVE: Count of References; Tag: Remote; Tag: Code Execution; Exploit: Exploit DB; CVE: Age of CVE; Vendor: Microsoft; CVSS: 3.1/AV:N; CVSS: 3.1/PR:N; CVSS: 3.1/A:H; CVSS: 3.1/C:H; Site: ZDI; Exploit: metasploit; NVD: Exploit Ref; NVD: VDB Ref; NVD: US Gov Ref; Tag: SQLi; Scanner: Nuclei; Vendor: Adobe; CVSS: 3.1/UI:N; NVD: Vendor Advisory Ref; Tag: Local; NVD: 3party Advisory Ref; NVD: Patch Ref; CVSS: 3.1/I:H; Tag: XSS; Tag: Denial of Service; Site: KEV; CVSS: 3.1/Scored; Exploit: Github; Tag: Buffer Overflow.]

Note that Figure 6 shows the mean absolute SHAP value for an entire class of features. So even though Exploit Code as a class of features has a higher mean absolute SHAP value, the largest individual feature is the count of references in the published CVE (which is in the "CVE" class). Thus, the most influential individual feature is the count of the number of references in MITRE's CVE List, followed by "remote attacker," "code execution," and published exploit code in Exploit-DB, respectively.

7 LITERATURE REVIEW AND RELATED SCORING SYSTEMS

This research is informed by multiple bodies of literature. First, there are a number of industry efforts that seek to provide some measure of exploitability for individual vulnerabilities, though there is wide variation in their scope and availability. The base metric group of CVSS, the leading standard for measuring the severity of a vulnerability, is composed of two parts, measuring impact and exploitability [17]. The score is built on expert judgements, capturing, for example, the observation that a broader ability to exploit a vulnerability (i.e., remotely across the Internet, as opposed to requiring local access to the device) increases the apparent likelihood that it will be exploited, while a more complex exploit or more required user interaction decreases it, all else being equal. CVSS has been repeatedly shown by prior work [2, 3], as well as our own evidence, to be insufficient for capturing all the factors that drive exploitation in the wild. The U.S. National Vulnerability Database (NVD) includes a CVSS base score with nearly all vulnerabilities it has published. Because of the widespread use of CVSS, specifically the base score, as a prioritization strategy, we compare our performance against CVSS as well as against our previous models.

Exploit likelihood is also modeled through various vendor-specific metrics. In 2008, Microsoft introduced the Exploitability Index for vulnerabilities in their products [24]. It provides four measures of the likelihood that a vulnerability will be exploited: whether exploitation has already been detected, and whether exploitation is more likely, less likely, or unlikely. The metric has been investigated before [15, 30, 40] and was shown to have limited performance at predicting exploitation in the wild [13, 30] or the development of functional exploits [34].

Redhat provides a 4-level severity rating: low, moderate, important, and critical [28]. In addition to capturing a measure of the impact to a vulnerable system, this index also captures some notion of exploitability. For example, the "low" severity rating represents vulnerabilities that are unlikely to be exploited, whereas the "critical" severity rating reflects vulnerabilities that could be easily exploited by an unauthenticated remote attacker. Like the Exploitability Index, Redhat's metric is vendor-specific and has limitations in reflecting exploitation likelihood [34].

A series of commercial solutions also aim to capture the likelihood of exploits. Tenable, a leading vendor of intrusion detection systems, created the Vulnerability Priority Rating (VPR), which, like CVSS, combines information about both the impact to a vulnerable system and the exploitability (threat) of a vulnerability, in order to help network defenders better prioritize remediation efforts [36]. For example, the threat component of VPR "reflects both recent and potential future threat activity" by examining whether exploit code is publicly available, whether there are mentions of active exploitation on social media or in the dark web, etc. Rapid7's Real Risk Score product uses its own collection of data feeds to produce a score between 1 and 1,000. This score is a combination of the CVSS base score, "malware exposure, exploit exposure and ease of use, and vulnerability age", and seeks to produce a better measure of both exploitability and "risk" [26]. Recorded Future's Vulnerability Intelligence product integrates multiple data sources, including threat information and localized asset criticality [27]. The predictions, performance evaluations, and implementation details of these solutions are not publicly available.

These industry efforts are either vendor-specific, score only subsets of vulnerabilities, are based on expert opinion and assessments and therefore not entirely data-driven, or are proprietary and not publicly available.

Our work is also related to a growing academic research field of predicting and detecting vulnerability exploitation. A large body of work focuses on predicting the emergence of proof-of-concept or functional exploits [5-7, 9, 14, 29, 34], not necessarily whether these exploits will be used in the wild, as is done with EPSS. Papers predicting exploitation in the wild have used alternative sources of exploitation, most notably data from Symantec's IDS, to build prediction models [4, 11, 16, 19, 32, 35, 37]. Most of these papers build vulnerability feature sets from commonly used data sources such as NVD or OSVDB, although some of them use novel identifiers for exploitation: [32] infers exploitation using Twitter data, [37] uses patching patterns and blacklist information to predict whether organizations are facing new exploits, while [35] uses natural language processing methods to infer the context of darkweb/deepweb discussions.
uses patching patterns and blacklist information to predict whether [6] Navneet Bhatt, Adarsh Anand, and Venkata SS Yadavalli. 2021. Exploitabil-
organizations are facing new exploits, while [35] uses natural lan- ity prediction of software vulnerabilities. Quality and Reliability Engineering
International 37, 2 (2021), 648–663.
guage processing methods to infer context of darkweb/deepweb [7] Mehran Bozorgi, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. 2010.
discussions. Beyond Heuristics: Learning to Classify Vulnerabilities and Predict Exploits.
In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Compared to the other scoring systems and research described above, EPSS is a rigorous, ongoing, and international community-driven research effort that is designed to predict vulnerability exploitation in the wild; available for all known and published vulnerabilities; updated daily to reflect new vulnerabilities and new exploit-related information; and made freely available to the public.
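Because scores are updated daily and published freely, practitioners can retrieve them programmatically. As one hedged illustration, the sketch below queries FIRST's public EPSS API; the endpoint and response field names are assumptions based on FIRST's published documentation at the time of writing, not something specified in this paper, and may change.

```python
# Sketch: fetching a daily EPSS score from FIRST's public API.
# Endpoint and response fields are assumptions drawn from FIRST's public
# documentation; they are not defined in this paper.
import requests

resp = requests.get("https://api.first.org/data/v1/epss",
                    params={"cve": "CVE-2021-44228"}, timeout=30)
resp.raise_for_status()
for item in resp.json().get("data", []):
    # "epss" is the estimated probability of exploitation activity in the
    # next 30 days; "percentile" is the score's rank among all scored CVEs.
    print(item["cve"], item["epss"], item["percentile"])
```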
8 CONCLUSION
In this paper, we presented results from an international, community-driven effort to collect and analyze software vulnerability exploit data, and to build a machine learning model capable of estimating the probability that a vulnerability will be exploited within the 30 days following a prediction. In particular, we described the process of collecting each of the additional variables, and the approaches used to create the machine learning model based on 6.4 million observed exploit attempts. Through the expanded data sources we achieved an unprecedented 82% improvement in classifier performance over previous iterations of EPSS.

We illustrated the practical use of EPSS by comparing it with a set of alternative vulnerability remediation strategies. In particular, we showed the sizeable and meaningful improvement in coverage, efficiency, and level of effort (as measured by the number of vulnerabilities that would need to be remediated) gained by using EPSS v3 over current remediation approaches, including CVSS, CISA's KEV list, and Metasploit.
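This style of comparison can be reproduced for any scoring system. In the EPSS line of work [21, 22], coverage corresponds to recall (the share of exploited vulnerabilities that a strategy remediates), efficiency to precision (the share of remediated vulnerabilities that were actually exploited), and effort to the raw count of vulnerabilities queued for remediation. A minimal sketch with illustrative stand-in data follows.

```python
# Sketch: coverage (recall), efficiency (precision), and effort for a
# remediation strategy, following the definitions used in [21, 22].
# `scores` and `exploited` are illustrative stand-ins, not EPSS data.
import numpy as np

def strategy_metrics(scores, exploited, threshold):
    """Remediate every vulnerability scoring at or above `threshold`."""
    remediate = scores >= threshold
    tp = np.sum(remediate & exploited)          # remediated and later exploited
    coverage = tp / max(exploited.sum(), 1)     # share of exploited vulns covered
    efficiency = tp / max(remediate.sum(), 1)   # share of effort well spent
    effort = remediate.sum()                    # number of vulns to remediate
    return coverage, efficiency, effort

rng = np.random.default_rng(0)
scores = rng.random(10_000)                     # e.g., predicted probabilities
exploited = rng.random(10_000) < 0.05           # ~5% exploited, illustrative
print(strategy_metrics(scores, exploited, threshold=0.5))
```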
As the EPSS effort continues to grow, acquiring and ingesting new data and improving modeling techniques with each new version, we believe it will continue to improve in performance and provide new and fundamental insights into vulnerability exploitation for many years to come.
9 ACKNOWLEDGEMENTS
We would like to acknowledge the participants of the EPSS Special Interest Group (SIG), as well as the organizations that have contributed to the EPSS data model, including Fortinet, the Shadowserver Foundation, GreyNoise, AlienVault, Cyentia, and FIRST.
REFERENCES
[1] Luca Allodi and Fabio Massacci. 2012. A Preliminary Analysis of Vulnerability Scores for Attacks in Wild. In CCS BADGERS Workshop. Raleigh, NC.
[2] Luca Allodi and Fabio Massacci. 2012. A preliminary analysis of vulnerability scores for attacks in wild: The EKITS and SYN datasets. In Proceedings of the 2012 ACM Workshop on Building Analysis Datasets and Gathering Experience Returns for Security. 17–24.
[3] Luca Allodi and Fabio Massacci. 2014. Comparing vulnerability severity and exploits using case-control studies. ACM Transactions on Information and System Security (TISSEC) 17, 1 (2014), 1–20.
[4] Mohammed Almukaynizi, Eric Nunes, Krishna Dharaiya, Manoj Senguttuvan, Jana Shakarian, and Paulo Shakarian. 2017. Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online. In 2017 International Conference on Cyber Conflict (CyCon US). IEEE, 82–88.
[5] Kenneth Alperin, Allan Wollaber, Dennis Ross, Pierre Trepagnier, and Leslie Leonard. 2019. Risk prioritization by leveraging latent vulnerability features in a contested environment. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 49–57.
[6] Navneet Bhatt, Adarsh Anand, and Venkata SS Yadavalli. 2021. Exploitability prediction of software vulnerabilities. Quality and Reliability Engineering International 37, 2 (2021), 648–663.
[7] Mehran Bozorgi, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. 2010. Beyond Heuristics: Learning to Classify Vulnerabilities and Predict Exploits. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 105–114.
[8] Glenn W Brier. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1 (1950), 1–3.
[9] Benjamin L Bullough, Anna K Yanchenko, Christopher L Smith, and Joseph R Zipkin. 2017. Predicting Exploitation of Disclosed Software Vulnerabilities Using Open-source Data. In Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics. 45–53.
[10] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. 2018. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069 (2018).
[11] Haipeng Chen, Rui Liu, Noseong Park, and VS Subrahmanian. 2019. Using Twitter to predict when vulnerabilities will be exploited. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3143–3152.
[12] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[13] DarkReading 2008. Black Hat: The Microsoft Exploitability Index: More Vulnerability Madness. DarkReading. https://www.darkreading.com/risk/black-hat-the-microsoft-exploitability-index-more-vulnerability-madness.
[14] Michel Edkrantz and Alan Said. 2015. Predicting Cyber Vulnerability Exploits with Machine Learning. In SCAI. 48–57.
[15] C Eiram. 2013. Exploitability/Priority Index Rating Systems (Approaches, Value, and Limitations).
[16] Yong Fang, Yongcheng Liu, Cheng Huang, and Liang Liu. 2020. FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm. PLoS ONE 15, 2 (2020), e0228439.
[17] FIRST 2019. A complete guide to the Common Vulnerability Scoring System. https://www.first.org/cvss/v3.0/specification-document.
[18] Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[19] Mohammad Shamsul Hoque, Norziana Jamil, Nowshad Amin, and Kwok-Yan Lam. 2021. An Improved Vulnerability Exploitation Prediction Model with Novel Cost Function and Custom Trained Word Vector Embedding. Sensors 21, 12 (2021), 4220.
[20] Cyentia Institute and Kenna Security. 2022. Prioritization to Prediction, Vol. 8. https://www.kennasecurity.com/resources/prioritization-to-prediction-reports/.
[21] Jay Jacobs, Sasha Romanosky, Idris Adjerid, and Wade Baker. 2020. Improving vulnerability remediation through better exploit prediction. Journal of Cybersecurity 6, 1 (2020), tyaa015.
[22] Jay Jacobs, Sasha Romanosky, Benjamin Edwards, Idris Adjerid, and Michael Roytman. 2021. Exploit Prediction Scoring System (EPSS). Digital Threats: Research and Practice 2, 3 (2021), 1–17.
[23] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[24] Microsoft 2020. Microsoft Exploitability Index. Microsoft. https://www.microsoft.com/en-us/msrc/exploitability-index.
[25] Maciej Nowak, Michał Walkowski, and Sławomir Sujecki. 2021. Conversion of CVSS Base Score from 2.0 to 3.1. In 2021 International Conference on Software, Telecommunications and Computer Networks (SoftCOM). IEEE, 1–3.
[26] Rapid7 2023. Prioritize Vulnerabilities Like an Attacker. Rapid7. https://www.rapid7.com/products/insightvm/features/real-risk-prioritization/.
[27] Recorded Future 2023. Prioritize patching based on risk. Recorded Future. https://www.recordedfuture.com/platform/vulnerability-intelligence.
[28] RedHat 2023. Severity ratings. RedHat. https://access.redhat.com/security/updates/classification/.
[29] Alexander Reinthal, Eleftherios Lef Filippakis, and Magnus Almgren. 2018. Data modelling for predicting exploits. In Nordic Conference on Secure IT Systems. Springer, 336–351.
[30] Reuters. [n. d.]. Microsoft correctly predicts reliable exploits just 27% of the time. https://www.reuters.com/article/urnidgns852573c400693880002576630073ead6/microsoft-correctly-predicts-reliable-exploits-just-27-of-the-time-idUS186777206820091104.
[31] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory (2010), 1–20.
[32] Carl Sabottke, Octavian Suciu, and Tudor Dumitraș. 2015. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits. In 24th USENIX Security Symposium (USENIX Security 15). 1041–1056.
[33] Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitraș. 2018. When does machine learning FAIL? Generalized transferability for evasion and poisoning attacks. In 27th USENIX Security Symposium (USENIX Security 18). 1299–1316.
[34] Octavian Suciu, Connor Nelson, Zhuoer Lyu, Tiffany Bao, and Tudor Dumitraș. 2022. Expected exploitability: Predicting the development of functional vulnerability exploits. In 31st USENIX Security Symposium (USENIX Security 22). 377–394.
[35] Nazgol Tavabi, Palash Goyal, Mohammed Almukaynizi, Paulo Shakarian, and Kristina Lerman. 2018. DarkEmbed: Exploit prediction with neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[36] Tenable 2020. What Is VPR and How Is It Different from CVSS? Tenable. https://www.tenable.com/blog/what-is-vpr-and-how-is-it-different-from-cvss.
[37] Chaowei Xiao, Armin Sarabi, Yang Liu, Bo Li, Mingyan Liu, and Tudor Dumitraș. 2018. From patching delays to infection symptoms: Using risk profiles for an early discovery of vulnerabilities exploited in the wild. In 27th USENIX Security Symposium (USENIX Security 18). 903–918.
[38] Li Yang and Abdallah Shami. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415 (2020), 295–316.
[39] Yiming Yang and Jan O Pedersen. 1997. A comparative study on feature selection in text categorization. In ICML, Vol. 97. Citeseer, 35.
[40] Awad A Younis and Yashwant K Malaiya. 2015. Comparing and evaluating CVSS base metrics and Microsoft rating system. In 2015 IEEE International Conference on Software Quality, Reliability and Security. IEEE, 252–261.