
Enhancing Vulnerability Prioritization: Data-Driven Exploit Predictions with Community-Driven Insights

Jay Jacobs (Cyentia Institute), Sasha Romanosky (RAND Corporation), Octavian Suciu (University of Maryland), Ben Edwards (Cyentia Institute), Armin Sarabi (University of Michigan)

arXiv:2302.14172v2 [cs.CR] 15 Jun 2023

Abstract—The number of disclosed vulnerabilities has been steadily increasing over the years. At the same time, organizations face significant challenges patching their systems, leading to a need to prioritize vulnerability remediation in order to reduce the risk of attacks. Unfortunately, existing vulnerability scoring systems are either vendor-specific, proprietary, or only commercially available. Moreover, these and other prioritization strategies based on vulnerability severity are poor predictors of actual vulnerability exploitation because they do not incorporate new information that might impact the likelihood of exploitation. In this paper we present the efforts behind building a Special Interest Group (SIG) that seeks to develop a completely data-driven exploit scoring system that produces scores for all known vulnerabilities, that is freely available, and that adapts to new information. The Exploit Prediction Scoring System (EPSS) SIG consists of more than 170 experts from around the world and across all industries, providing crowd-sourced expertise and feedback. Based on these collective insights, we describe the design decisions and trade-offs that led to the development of the next version of EPSS. This new machine learning model provides an 82% performance improvement over past models in distinguishing vulnerabilities that are exploited in the wild and thus may be prioritized for remediation.

1. Introduction

Vulnerability management, the practice of identifying, prioritizing, and patching known software vulnerabilities, has been a continuous challenge for defenders for decades. This issue is exacerbated by the increasing number of new vulnerabilities disclosed annually. For example, MITRE published1 25,068 new vulnerabilities during the 2022 calendar year, a 24.3% increase over 2021.

Adding to the increasing rate of published vulnerabilities are the challenges incurred by practitioners when trying to remediate them. Recent research conducted by Kenna Security and Cyentia tracked exposed vulnerabilities at hundreds of companies and found that the monthly median rate of remediation was only 15.5%, while a quarter of companies remediated less than 6.6% of their open vulnerabilities per month [Institute and Security(2022)]. As a consequence of the increasing awareness of software flaws and the limited capacity to remediate them, vulnerability prioritization has become both a chronic and an acute concern for every organization attempting to reduce its attack surface.

The prioritization process involves scoring and ranking vulnerabilities according to assessments, often based on the industry-standard Common Vulnerability Scoring System (CVSS) [FIRST(2019)]. However, only the Base metric group of CVSS is assigned and distributed at scale by NIST, and this group of metrics is unable to adapt to post-disclosure information, such as the publication of exploits or technical artifacts, which can affect the odds of attacks against a vulnerability being observed in the wild. As a result, while only 5% of known vulnerabilities are exploited in the wild [Jacobs et al.(2020)], numerous prior studies have shown that CVSS does not perform well when used to prioritize exploited vulnerabilities over those without evidence of exploitation [Allodi and Massacci(2012a)], [Eiram(2013)], [Allodi and Massacci(2014)]. While several other efforts have been made to capture exploitation likelihood in vulnerability assessments, these approaches are either vendor-specific [Microsoft(2020)], [RedHat(2023)] or proprietary and not publicly available [Tenable(2020)], [Rapid7(2023)], [Recorded Future(2023)].

In order to improve remediation practices, network defenders need a scoring system that can accurately quantify the likelihood of exploits in the wild, and that is able to adapt to new information published after the initial disclosure of a vulnerability.

Any effort to develop a new capability to understand, anticipate, and respond to new cyber threats must overcome three main challenges: i) it must address the requirements of the practitioners who rely on it; ii) it must provide significant performance improvements over existing scoring systems; and iii) it must have a low barrier for adoption and use.

To address these challenges, a Special Interest Group (SIG) was formed in early 2020 at the Forum of Incident Response and Security Teams (FIRST). From its inception, the Exploit Prediction Scoring System (EPSS) SIG has gathered 170 members from across the world, representing practitioners, researchers, government agencies, and software developers.2 The SIG was created with the publication of the first EPSS model for predicting the likelihood of exploits in the wild [Jacobs et al.(2021)] and is organized around a mailing list, a discussion forum, and bi-weekly meetings.

1. Not marked as REJECT or RESERVED.
2. See https://www.first.org/epss.
This unique environment represented an opportunity to understand the challenges faced by practitioners when performing vulnerability prioritization, and therefore to address the first challenge raised above by designing a scoring system that takes practitioner requirements into account.

To address the second challenge and achieve significant performance improvements, the SIG provided subject matter expertise, which guided the engineering of features with high utility for predicting exploits in the wild. Finally, to address the challenge of designing a public and readily available scoring system, the SIG attracted a set of industry partners willing to share proprietary data for the development of the model, the output of which can then be made public. This allowed EPSS scores to be publicly available at scale, lowering the barrier to entry for those wanting to integrate EPSS into their prioritization pipeline.

This paper presents the latest (third) iteration of the EPSS model, as well as lessons learned in its design and their impact on designing a scoring system. The use of a novel and diverse feature set and optimized machine learning techniques allows EPSS to improve prediction performance by 82% over its predecessor (as measured by the precision/recall Area Under the Curve, improved to 0.779 from 0.429). EPSS is able to score all vulnerabilities published on MITRE's CVE List (and the National Vulnerability Database), and can reduce the amount of effort required to patch critical vulnerabilities to one-eighth of that of a comparable strategy based on CVSS. This paper makes the following contributions:

1) Presents lessons learned from developing an exploit prediction model that integrates the functional requirements of a community of nearly 200 practitioners and researchers.
2) Engineers novel features for exploit prediction and uses them to train the EPSS classifier for predicting the likelihood of exploits in the wild.
3) Analyzes the practical utility of EPSS by showing that it can significantly improve remediation strategies compared to static baselines.

2. Evolution of EPSS

EPSS was initially inspired by the Common Vulnerability Scoring System (CVSS). The first EPSS model [Jacobs et al.(2021)] was designed to be lightweight, portable (i.e., implementable in a spreadsheet), and parsimonious in terms of the data required to score vulnerabilities. Because of these design goals, the first model used a logistic regression, which produced interpretable and intuitive scores, and predicted the probability of exploitation activity being observed in the first year following the publication of a vulnerability. In order to be parsimonious, the logistic regression model was trained on only 16 independent variables (features) extracted at the time of vulnerability disclosure. While it outperformed CVSS, the SIG highlighted some key limitations which hindered its practical adoption.

Informed by this feedback, the second version of EPSS aimed to address the major limitations of the first version. The first design decision was to switch to a centralized architecture. By centralizing and automating the data collection and scoring, a more complex model could be developed to improve performance. This decision came with a trade-off, namely a loss of the model's portability, and thus of the ability to score vulnerabilities which are not publicly disclosed (e.g., zero-day vulnerabilities, or flaws that may never be assigned a CVE ID). Nevertheless, focusing on public vulnerabilities under the centralized model removed the need for each implementation of EPSS to perform its own data collection, and further allowed more complex features and models. The model used in v2 is XGBoost [Chen and Guestrin(2016)], and the feature set was greatly expanded from 16 to 1,164 features. These efforts led to a significant improvement in predictive performance over the previous version by capturing higher-order interactions in the extended feature set. Another major benefit of a centralized architecture was being able to adapt to new vulnerability artifacts (e.g., the publication of exploits) and produce new predictions daily. Moreover, the SIG also commented that producing scores based on the likelihood of exploitation within the first year of a vulnerability's lifecycle was not very practical, since most prioritization decisions are made with respect to an upcoming patching cycle. As a result, v2 switched to predicting exploitation activity within the 30-day window following the time of scoring, which aligns with the typical remediation window of practitioners in the SIG.

For the third version of EPSS, the SIG highlighted a requirement for improved precision at identifying vulnerabilities likely to be exploited in the wild. This drove an effort to expand the sources of exploit data by partnering with multiple organizations willing to share data for model development, and to engineer more complex and informative features. These label and feature improvements, along with a methodical hyper-parameter tuning approach, enabled improved training of an XGBoost classifier. This allowed the proposed v3 model to achieve an overall 82% improvement in classifier performance over v2, with the Area Under the Precision/Recall Curve increasing from 0.429 to 0.779. This boost in prediction performance allows organizations to substantially improve their prioritization practices and design data-driven patching strategies.

3. Data

The data used in this research is based on 192,035 published vulnerabilities (not marked as "REJECT" or "RESERVED") listed in MITRE's Common Vulnerabilities and Exposures (CVE) list through December 31, 2022. The CVE identifier has been used to combine records across our disparate data sources. Table 1 lists the categories of data, the number of features in each category, and the source(s) or other notes. In total, EPSS collects 1,477 unique independent variables for every vulnerability.

TABLE 1. DESCRIPTION OF DATA SOURCES USED IN EPSS.

Description | # of variables | Type | Sources
Exploitation activity in the wild (labels) | 1 (with dates) | Binary | Fortinet, AlienVault, Shadowserver, GreyNoise
Publicly available exploit code | 3 | Binary | Exploit-DB, GitHub, MetaSploit
CVE mentioned on list or website | 3 | Binary | CISA KEV, Google Project Zero, Trend Micro ZDI
Social media | 3 | Numeric | Mentions/discussion on Twitter
Offensive security tools and scanners | 4 | Binary | Intrigue, sn1per, jaeles, nuclei
References with labels | 17 | Numeric | MITRE CVE List, NVD
Keyword description of vulnerability | 147 | Binary | Text description in MITRE CVE List
CVSS metrics | 15 | One-Hot | National Vulnerability Database (NVD)
CWE | 188 | Binary | National Vulnerability Database (NVD)
Vendor labels | 1,096 | Binary | National Vulnerability Database (NVD)
Age of the vulnerability | 1 | Numeric | Days since CVE published in MITRE CVE list

3.1. Labeling data: exploitation in the wild

EPSS collects and aggregates evidence of exploits from multiple sources: Fortiguard, AlienVault OTX, the Shadowserver Foundation, and GreyNoise (though not all sources cover the full time period). Each of these data sources employs network- or host-layer intrusion detection/prevention systems (IDS/IPS), or honeypots, in order to identify attempted exploitation. These systems are also predominantly signature-based (as opposed to anomaly-based) detection systems. Moreover, all of these organizations have large enterprise infrastructures of sensor and collection networks. Fortiguard, for example, manages tens of thousands of IDS/IPS devices that identify and report exploitation activity from across the globe. AlienVault OTX, GreyNoise, and the Shadowserver Foundation also maintain worldwide networks of sensors for detecting exploitation activity. Aggregating exploit evidence from multiple sources does not guarantee uniform coverage of labels across all types of vulnerabilities, and this could lead to class- and feature-dependent noise when used to train machine learning models [Suciu et al.(2022)]. We discuss these limitations in Section 6.

These data sources provide the list of CVEs observed to be exploited on a daily basis. The data are then cleaned, and exploitation activity is consolidated into a single boolean value (0 or 1) identifying days on which exploitation activity was reported for any given CVE across any of the available data sources. Structuring the training data according to this boolean time series enables us to estimate the probability of exploitation activity in any upcoming window of time, though the consensus in the EPSS Special Interest Group was to standardize on a 30-day window to align with most enterprise patch cycles. The exploit data used in this research cover activity from July 1, 2016 to December 31st, 2022 (2,374 days / 78 months / 6.5 years), over which we collected 6.4 million exploitation observations (date and CVE combinations) targeting 12,243 unique vulnerabilities. Based on these data, we find that 6.4% (12,243 of 192,035) of all published vulnerabilities were observed to be exploited during this period, which is consistent with previous findings [Jacobs et al.(2020)], [Jacobs et al.(2021)].
3.2. Explanatory variables/features

In total, EPSS leverages 1,477 features for predicting exploitation activity. Next, we describe the data sources used to construct these features, as well as the engineering behind them.

Published exploit code. We first consider the correlation between exploitation in the wild and the existence of publicly available exploit code, which is collected from three sources (courtesy of Cyentia3): Exploit-DB, GitHub, and Metasploit. In total we identified 24,133 CVEs with published exploit code: 20,604 CVEs from Exploit-DB, 4,049 published on GitHub, and 1,905 published as Metasploit modules. Even though Exploit-DB contains the majority of published exploits, GitHub has become a valuable source in recent years. For example, in 2022, 1,591 exploits were published on GitHub, while Exploit-DB and Metasploit added 196 and 94 entries, respectively. We derive three binary features from this category.

3. https://www.cyentia.com/services/exploit-intelligence-service

Public vulnerability lists. Next, we consider that exploitation activity may be forecasted by the presence of vulnerabilities on popular lists and/or websites that maintain and share information about select vulnerabilities. Google Project Zero maintains a listing4 of "publicly known cases of detected zero-day exploits,"5 which may help forecast exploitation activity as a vulnerability slides into N-day status. We include 162 unique CVEs listed by Google Project Zero.

Trend Micro's Zero Day Initiative (ZDI), the "world's largest vendor-agnostic bug bounty program,"6 works with researchers and vendors to responsibly disclose zero-day vulnerabilities and issues public advisories about vulnerabilities at the conclusion of their process. We include 7,356 CVEs that have public advisories issued by ZDI.

The Known Exploited Vulnerabilities (KEV) catalog from the US Department of Homeland Security's Cybersecurity and Infrastructure Security Agency (CISA) is an "authoritative source of vulnerabilities that have been exploited in the wild."7 We include 866 CVEs from CISA's KEV list.

These sources lack transparency about when exploitation activity was observed and for how long it was ongoing. However, because past exploitation attempts might influence the likelihood of future attacks, we include these indicators as binary features for our model.

4. https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/view#gid=1190662839
5. https://googleprojectzero.blogspot.com/p/0day.html
6. https://www.zerodayinitiative.com/about
7. https://www.cisa.gov/known-exploited-vulnerabilities

Social media. Exploitation may also be correlated with social media discussions, and therefore we collect Twitter mentions of CVEs, creating three features counting these mentions within three different historical time windows (7, 30, and 90 days). We only count primary and original tweets and exclude retweets and quoted retweets. The median number of daily unique tweets mentioning CVEs is 1,308, with the 25th and 75th percentiles of daily tweets being 607 and 1,400, respectively. We currently make no attempt to validate the content or filter out automated posts (from bots).
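For illustration, such trailing-window counts could be computed as in the sketch below; the tweet-log schema is an assumption, and retweet filtering is presumed to have happened upstream.

```python
import pandas as pd

# Hypothetical log of original (non-retweet) tweets mentioning a CVE.
tweets = pd.DataFrame({
    "cve": ["CVE-2022-0001"] * 4 + ["CVE-2022-0002"],
    "date": pd.to_datetime(["2022-09-10", "2022-11-20",
                            "2022-11-28", "2022-11-30", "2022-06-01"]),
})

def mention_counts(tweets, as_of, windows=(7, 30, 90)):
    """Count tweets per CVE in each trailing window ending at as_of."""
    end = pd.Timestamp(as_of)
    out = {}
    for w in windows:
        start = end - pd.Timedelta(days=w)
        mask = (tweets["date"] > start) & (tweets["date"] <= end)
        out[f"tweets_{w}d"] = tweets[mask].groupby("cve").size()
    return pd.DataFrame(out).fillna(0).astype(int)

print(mention_counts(tweets, as_of="2022-12-01"))
# CVE-2022-0001 has 2, 3, and 4 mentions in the 7-, 30-, and 90-day windows.
```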
Offensive security tools. We also collect evidence of vulnerabilities being used in offensive security tools, which are designed, in part, to identify vulnerabilities during penetration tests. We are currently gathering information from four different offensive security tools, with varying numbers of CVEs identified in each: Nuclei with 1,548 CVEs, Jaeles with 206 CVEs, Intrigue with 169 CVEs, and Sn1per with 63 CVEs. These are encoded as binary features which indicate whether each particular source is capable of scanning for and reporting on the presence of each vulnerability.

References. In order to capture metrics around the activity and analysis related to vulnerabilities, for each CVE we count the number of references listed in MITRE's CVE list, as well as the number of references with each of the 16 reference tags assigned by NVD. The labels and their associated prevalence across CVEs are: Vendor Advisory (102,965), Third Party Advisory (84,224), Patch (59,660), Exploit (54,633), VDB Entry (31,880), Issue Tracking (16,848), Mailing List (15,228), US Government Resource (11,164), Release Notes (9,308), Permissions Required (3,980), Broken Link (3,934), Product (3,532), Mitigation (2,983), Technical Description (1,686), Not Applicable (961), and Press/Media Coverage (124).

Keyword description of the vulnerability. To capture attributes of the vulnerabilities themselves, we use the same process as described in previous research [Jacobs et al.(2020)], [Jacobs et al.(2021)]. This process detects and extracts hundreds of common multiword expressions used to describe and discuss vulnerabilities. These expressions are then grouped and normalized into common vulnerability concepts. The top tags we included, and their associated CVE counts, are as follows: "remote attacker" (80,942), "web" (31,866), "code execution" (31,330), "denial of service" (28,478), and "authenticated" (21,492). In total, we include 147 binary features for identifying such tags. We followed the same process as EPSS v1, extracting multi-word expressions from the text of references using Rapid Automatic Keyword Extraction [Rose et al.(2010)].

CVSS metrics. To capture other attributes of vulnerabilities, we collect the CVSS base metrics. These consist of exploitability measurements (attack vector, attack complexity, privileges required, user interaction, scope) and the three impact measurements (confidentiality, integrity, and availability). These categorical variables are encoded using one-hot encoding. We collected CVSS version 3 information from NVD for 118,087 vulnerabilities. However, 73,327 vulnerabilities published before CVSSv3 was created are only scored in NVD using CVSSv2. To address this, we developed a separate, dedicated machine learning model to estimate the CVSSv3 measurement values for each of these vulnerabilities. We use a process similar to prior work [Nowak et al.(2021)]: for CVEs which have both CVSSv2 and CVSSv3 scores, we use the CVSSv2 sub-components as inputs and train a feedforward neural network to predict the CVSSv3 vectors. The model was validated using 8-fold, yearly stratified cross-validation, achieving 74.9% accuracy when predicting the exact CVSSv3 vector. For 99.9% of vectors, we predict the majority (5 or more) of the individual metrics correctly. For each individual portion of the CVSSv3 vector we were able to achieve a minimum of 93.4% accuracy (on the Privileges Required metric). We note that this exceeds the accuracy achieved by [Nowak et al.(2021)], and likely warrants further research into the robustness of CVSSv3 prediction and its possible application to future versions of CVSS.
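The sketch below illustrates the idea of that estimation step with a small feedforward network predicting a single CVSSv3 metric from one-hot encoded CVSSv2 sub-components. The toy vectors, network shape, and per-metric setup are assumptions for illustration, not the exact architecture described above.

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows: CVEs that carry both CVSSv2 and CVSSv3.
v2 = pd.DataFrame({"AV": ["N", "N", "L", "A"],
                   "AC": ["L", "M", "L", "H"],
                   "Au": ["N", "N", "S", "N"]})
v3_pr = ["N", "N", "L", "N"]  # target: CVSSv3 Privileges Required

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(v2)

# One small feedforward network per CVSSv3 metric.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
clf.fit(X, v3_pr)

# Estimate the v3 metric for a CVE that only has a CVSSv2 vector.
new = pd.DataFrame({"AV": ["N"], "AC": ["L"], "Au": ["N"]})
print(clf.predict(enc.transform(new)))
```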
CWE. We also capture the observation that different types of vulnerabilities may be more or less attractive to attackers, using the Common Weakness Enumeration (CWE), which is a "community-developed list of software and hardware weakness types."8 We collect the CWE assignments from NVD, noting that 21,570 CVEs do not have a CWE assigned. We derived binary features for CWEs found across at least 10 vulnerabilities, resulting in 186 CWE identifiers being included. In addition, we maintain two features for vulnerabilities where CWE information is not available or the assigned CWEs are not among the common ones. The top CWE identifiers and their vulnerability counts are CWE 79 (20,797), CWE 119 (11,727), CWE 20 (9,590), CWE 89 (8,790), CWE 787 (7,624), CWE 200 (7,270), CWE 264 (5,485), CWE 22 (4,918), CWE 125 (4,743), and CWE 352 (4,081).

8. https://cwe.mitre.org

Vulnerable vendors. We suspect exploitation activity may be correlated with the market share and/or install base that companies achieve. Therefore, we parse the Common Platform Enumeration (CPE) data provided by NVD in order to identify platform records marked as "vulnerable," and extract only the vendor portion of each record. We did not make any attempt to fill in missing information or correct any typos or misspellings that may occasionally appear in the records. We ranked vendors according to the number of vulnerabilities, creating one binary feature for each vendor, and evaluated the effect of including less frequent vendors as features. We observed no performance improvements from including vendors with fewer than 10 CVEs in our dataset. As a result, we extracted 1,040 unique vendor features in the final model. The most prevalent vendors and their vulnerability counts are Microsoft (10,127), Google (9,100), Oracle (8,970), Debian (7,627), Apple (6,499), IBM (6,409), Cisco (5,766), RedHat (4,789), Adobe (4,627), and Fedora Project (4,166).

Age of the vulnerability. Finally, the age of a vulnerability might contribute to or detract from the likelihood of exploitation. Intuitively, we expect old vulnerabilities to be less attractive to attackers due to a smaller vulnerable population. To capture this, we create a feature which records the number of days elapsed from CVE publication to the time of feature extraction in our model.

4. Modeling Approach

4.1. Preparing labels and features

Exploitation activity is considered as any recorded attempt to exploit a vulnerability, regardless of the success of the attempt, and regardless of whether the targeted vulnerability is present.
All observed exploitation activity is recorded with the date the activity occurred and aggregated across all data sources by date and CVE identifier. The resulting labeling data is a binary value for each vulnerability, for each day, indicating whether exploitation activity was observed.

Since many of the features may change day by day, we construct features for the training data on a daily basis. In order to reduce the size of our data (and thus the time and memory needed to train models), we aggregate consecutive daily observations where features do not change. The size of the exposure (the length of each such run, in days) and the number of days with exploitation activity are included in the model training.

When constructing the test data, a single date is selected (typically "today"; see the next section) and all of the features are generated based on the state of vulnerabilities on that date. Since the final model is intended to estimate the probability of exploitation in the next 30 days, we construct labels for the test data by looking for exploitation activity over the 30 days following the selected test date.
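The run-length aggregation can be pictured with the following sketch, which collapses consecutive days with unchanged features into single rows carrying the exposure length and exploited-day count; the feature columns are hypothetical.

```python
import pandas as pd

# Hypothetical daily snapshots for one CVE: feature columns plus the
# daily exploitation label.
daily = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=6, freq="D"),
    "n_refs": [2, 2, 2, 3, 3, 3],     # feature changes on day 4
    "has_poc": [0, 0, 0, 1, 1, 1],
    "exploited": [0, 1, 0, 0, 0, 1],  # daily label
})

feature_cols = ["n_refs", "has_poc"]
# A new run starts whenever any feature differs from the previous day.
run_id = (daily[feature_cols].ne(daily[feature_cols].shift())
          .any(axis=1).cumsum())

runs = daily.groupby(run_id).agg(
    start=("date", "first"),
    exposure_days=("date", "size"),
    exploited_days=("exploited", "sum"),
    **{c: (c, "first") for c in feature_cols},
)
print(runs)  # two runs of 3 days each, with 1 exploited day in each run
```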
4.2. Model selection

The first EPSS model [Jacobs et al.(2021)] sought not only to accurately predict exploitation but to do so in a parsimonious, easy-to-implement way. As a result, regularized logistic regression (Elasticnet) was chosen to produce a generalized linear model with only a handful of variables. The current model relaxes this requirement in the hopes of improving performance and providing more accurate exploitation predictions. In particular, capturing non-linear relationships between inputs and exploitation activity will better predict the finer patterns of exploitation activity.

Removing the requirement of a simple model, together with the need to model complex relationships, expands the universe of potential models. Indeed, many machine learning algorithms have been developed for this exact purpose. However, testing all models is impractical because each model requires significant engineering and calibration to achieve an optimal outcome. We therefore focus on a single type of model that has proven to be particularly performant on these data. Recent research has illustrated that panel (tabular) data, such as ours, can be most successfully modeled using tree-based methods (in particular gradient boosted trees) [Grinsztajn et al.(2022)], arriving at similar or better predictive performance with less computation and tuning in comparison to other methods such as neural networks. Given the results in [Grinsztajn et al.(2022)], we focus our efforts on tuning a common implementation of gradient boosted trees [Chen and Guestrin(2016)]; we also provide a comparison to a transformer-based neural network in subsection 6.1. XGBoost is a popular, well-documented, and performant implementation of the gradient boosted tree algorithm, in which successive decision trees are trained to iteratively reduce prediction error.
4.3. Train/test split and measuring performance

In order to reduce overfitting, we implement two restrictions. First, we implement a time-based train/test split, constructing our training data sets on data up to and including October 31, 2021. We then construct the test data set based on the state of vulnerabilities on December 1st, 2021, providing one month between the end of the training data and the test data. As mentioned above, the ground truth in the test data is any exploitation activity from December 1st to December 30th, 2021.

Second, we use 5-fold cross validation, with the folds based on each unique CVE identifier. This selectively removes vulnerabilities from the training data and tests the performance on the hold-out set, thus further reducing the likelihood of overfitting. We chose k = 5 for our procedure as it corresponds to an 80%/20% split in training and test data. This larger validation size (20%) is less likely to induce overfitting, and therefore poor hyperparameter selection, than k = 10 (a 90%/10% train/test split), as described in [Cawley and Talbot(2010)]. Additionally, we stratify the folds to ensure the same proportion of exploitation activity in each fold, as recommended in [Kohavi et al.(1995)]. Other values of k may provide better performance, but due to computational constraints we rely on the literature as a guide for this particular parameter rather than adding an additional dimension to our model search space.

Finally, we measure performance by calculating the area under the curve (AUC) based on precision and recall across the full range of predictions. We selected precision-recall since we have severe class imbalance in exploited vulnerabilities, and using accuracy or traditional Receiver Operating Characteristic (ROC) curves may be misleading due to that imbalance.
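A minimal sketch of this validation scheme follows, with folds stratified on the exploitation label and scored by the area under the precision-recall curve (average precision). The synthetic data and stand-in classifier are assumptions; with one row per CVE, row-wise folds correspond to folds over unique CVE identifiers.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import average_precision_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
# Imbalanced labels: roughly 13% positives, mimicking exploited CVEs.
y = (X[:, 0] + rng.normal(scale=2.0, size=1000) > 2.5).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in skf.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    probs = model.predict_proba(X[test_idx])[:, 1]
    scores.append(average_precision_score(y[test_idx], probs))

print(f"PR-AUC per fold: {np.round(scores, 3)}; mean {np.mean(scores):.3f}")
```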
4.4. Tuning and optimizing model performance

Despite being a well-studied approach, the use of gradient boosted trees and XGBoost for prediction problems still requires some effort to identify useful features and to tune the model to achieve good performance. This requires a-priori decisions about which features to include and the hyperparameter values for the XGBoost algorithm. The features outlined in subsection 3.2 include 28,724 variables. Many of these variables are binary features indicating whether a vulnerability affects a particular vendor or can be described by a specific CWE. While the XGBoost algorithm is efficient, including all variables in our inference is technically infeasible. To reduce the scope of features we take a naive, yet demonstrably effective, approach of removing variables below a specific occurrence rate [Yang and Pedersen(1997)]. This reduced the input feature set to 1,477 variables.

One additional challenge with our data is the temporal nature of our predictions; in particular, exactly how much historical data should be included in the data set. In addition to the XGBoost hyperparameters and the sparsity threshold, we also constructed four different sets of training data, spanning 6 months and 1, 2, and 3 years, to determine which time horizon would provide the best predictions.

To identify the time horizon and sparsity threshold described above, as well as the other hyperparameters needed by our implementation of gradient boosted trees, we take a standard approach described in [Yang and Shami(2020)]. We first define reasonable ranges for the hyperparameters,

use Latin Hypercube sampling over the set of possible combinations, compute model performance for each sampled set of hyperparameters, and then finally build an additional model (also a gradient boosted tree) to predict performance given a set of hyperparameters, using that model to maximize performance.

This process results in the parameters listed in Table 2. Note that of the tested time horizons, none dramatically outperformed the others, with 1 year only slightly outperforming the other tested possibilities.

TABLE 2. NON-DEFAULT HYPERPARAMETER VALUES FOR XGBOOST ALGORITHM AND DATA SELECTION

Parameter | Value
Time horizon | 1 year
Learning rate | 0.11
Max tree depth | 20
Subsample ratio of the training instances | 0.75
Minimum loss reduction for leaf node partition | 10
Maximum delta step | 0.9
Number of boosting rounds | 65

[Figure 1: precision (efficiency) vs. recall (coverage) curves for EPSS v3, EPSS v2, EPSS v1, and the CVSS v3.x base score. Labeled points along each curve show thresholds; CVEs scoring at or above a threshold are prioritized.]
Figure 1. Performance of EPSS v3 compared to previous versions and CVSS Base Score
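For concreteness, the following sketch maps the non-default values in Table 2 onto the XGBoost training API (eta, max_depth, subsample, gamma, max_delta_step, and the number of boosting rounds); the time horizon and sparsity threshold are data-selection choices rather than XGBoost parameters. The synthetic matrix stands in for the real 1,477-column feature set, so this is an illustrative sketch, not the production pipeline.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 50))                     # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=5000) > 2).astype(int)

params = {
    "objective": "binary:logistic",
    "eta": 0.11,             # learning rate
    "max_depth": 20,         # max tree depth
    "subsample": 0.75,       # subsample ratio of the training instances
    "gamma": 10,             # minimum loss reduction for a leaf partition
    "max_delta_step": 0.9,
    "eval_metric": "aucpr",  # precision-recall AUC, matching the paper
}
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=65)

scores = booster.predict(dtrain)  # probabilities of exploitation activity
print(scores[:5])
```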

5. Evaluation

5.1. Precision (efficiency) and recall (coverage)

Precision and recall are commonly used machine learning performance metrics, but they are not intuitive for security practitioners, and it can therefore be difficult to contextualize what these performance metrics represent in practice.

Precision (efficiency) measures how well resources are being allocated (where low efficiency represents wasted effort), and is calculated as the true positives divided by the sum of the true and false positives. In the vulnerability management context, efficiency addresses the question, "out of all the vulnerabilities remediated, how many were actually exploited?" If a remediation strategy suggests patching 100 vulnerabilities, 60 of which were exploited, the efficiency would be 60%.

Recall (coverage), on the other hand, considers how well a remediation strategy actually addresses those vulnerabilities that should be patched (e.g., those that have observed exploitation activity), and is calculated as the true positives divided by the sum of the true positives and false negatives. In the vulnerability management context, coverage addresses the question, "out of all the vulnerabilities that are being exploited, how many were actually remediated?" If 100 vulnerabilities are exploited, 40 of which are patched, the coverage would be 40%.

Therefore, for the purpose of this article, we use the terms efficiency and coverage interchangeably with precision and recall, respectively, in the discussions below.
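A minimal helper makes the two definitions, and the worked examples from the text, concrete:

```python
def efficiency_coverage(remediated: set, exploited: set):
    """Efficiency (precision) = TP / (TP + FP);
    coverage (recall) = TP / (TP + FN)."""
    tp = len(remediated & exploited)
    efficiency = tp / len(remediated) if remediated else 0.0
    coverage = tp / len(exploited) if exploited else 0.0
    return efficiency, coverage

# 100 CVEs remediated, 60 of them exploited -> 60% efficiency.
remediated = {f"CVE-X-{i}" for i in range(100)}
exploited = {f"CVE-X-{i}" for i in range(60)}
print(efficiency_coverage(remediated, exploited))  # (0.6, 1.0)

# 100 CVEs exploited, 40 of them remediated -> 40% coverage.
exploited = {f"CVE-Y-{i}" for i in range(100)}
remediated = {f"CVE-Y-{i}" for i in range(40)}
print(efficiency_coverage(remediated, exploited))  # (1.0, 0.4)
```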
5.2. Model performance

After several rounds of experiments to find the optimal set of features, amount of historical data, and model parameters as discussed in the previous section, we generated one final model using all vulnerabilities from November 1st, 2021 to October 31st, 2022. We then predicted the probability of exploitation activity in the next 30 days based on the state of vulnerabilities on December 1st, 2022. Using evidence of exploitation activity for the following 30 days (through Dec 30th, 2022), we measured overall performance as shown in Figure 1. For comparison, we also show performance metrics for EPSS versions 1 and 2, as well as for CVSS v3 base scores, for the same date and exploitation activity (Dec 1st, 2022). Figure 1 includes points along the precision-recall curves that represent the thresholds of each prioritization strategy.

Figure 1 clearly illustrates the significant improvement of the EPSS v3 model over previous versions, as well as over the CVSS version 3 base score.

EPSS v3 produces an area under the curve (AUC) of 0.7795 and an F1 score of 0.728. A remediation strategy based on this F1 score would prioritize remediation for vulnerabilities with EPSS probabilities of 0.36 and above, and would achieve an efficiency of 78.5% and coverage of 67.8%. In addition, this strategy would prioritize remediation of 3.5% of all published vulnerabilities (representing the level of effort).

EPSS v2 has an AUC of 0.4288 and a calculated F1 score of 0.451, which prioritizes vulnerabilities with a probability of 0.16 and above. At the F1 threshold, EPSS v2 achieves an efficiency rating of 45.5% and coverage of 44.8%, and prioritizes 4% of the vulnerabilities in our study. EPSS v1 has an AUC of 0.2998 and a calculated F1 score of 0.361, which prioritizes vulnerabilities with a probability of 0.2 and above. At the F1 threshold, EPSS v1 achieves an efficiency rating of 43% and coverage of 31.1%, and prioritizes 2.9% of the vulnerabilities in our study. Finally, the CVSS v3.x base score has an AUC of 0.051 and a calculated F1 score of 0.108, which prioritizes vulnerabilities with a CVSS base score of 9.7 or higher. At the F1 threshold, CVSS v3.x achieves an efficiency rating of 6.5% and coverage of 32.3%, and prioritizes 13.7% of the vulnerabilities in our study.
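Each strategy's operating point above is the threshold that maximizes F1 along its precision-recall curve; a minimal sketch of that computation, with toy scores standing in for real model output, follows.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def f1_optimal_threshold(y_true, y_score):
    """Return (threshold, F1, PR-AUC) for the cutoff maximizing F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # precision/recall have one more entry than thresholds; drop the last.
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(
        precision[:-1] + recall[:-1], 1e-12, None)
    best = int(np.argmax(f1))
    return thresholds[best], float(f1[best]), auc(recall, precision)

# Toy scores: higher scores concentrate on the exploited (label 1) CVEs.
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
s = np.array([0.01, 0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9])
thr, f1, pr_auc = f1_optimal_threshold(y, s)
print(f"prioritize CVEs scoring >= {thr}; F1={f1:.3f}, PR-AUC={pr_auc:.3f}")
```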
5.3. Probability calibrations

A significant benefit of this model over alternative exploit scoring systems (described above) is that the output scores are true probabilities (i.e., the probability of any exploitation activity being observed in the next 30 days) and can therefore be scaled to produce a threat score based on one or more vulnerabilities, such as would be found on a single network device (laptop, server), network segment, or an entire enterprise. For example, standard mathematical techniques can be used to answer questions like "what is the probability that at least one of this asset's vulnerabilities will be exploited in the next 30 days?" (for n vulnerabilities with probabilities p1, ..., pn, and assuming independence, this is 1 − (1 − p1)(1 − p2)···(1 − pn)). Such estimates, however, are only useful if they are calibrated and therefore reflect the true likelihood of the event occurring.

In order to address this, we measure calibration in two ways. First, we calculate a Brier score [Brier et al.(1950)], which produces a score between 0 and 1, with 0 being perfectly calibrated and 1 being perfectly uncalibrated (the original 1950 paper doubles the range, from 0 to 2). Our final estimate revealed a Brier score of 0.0162, which is objectively very low (good). We also plot the predicted (binned) values against the observed (binned) exploitation activity (commonly referred to as a "calibration plot"), as shown in Figure 2. The closer the plotted line is to a 45-degree line (i.e., a line with a slope of 1, represented by the dashed line), the better the calibration. Again, by visual inspection, our plotted line very closely matches the 45-degree line.

[Figure 2: calibration plot; x-axis: predicted probability of exploitation on Dec 1, 2022; y-axis: share of CVEs observed with exploitation activity in the 30 days following Dec 1, 2022; both axes log-scaled from 0.1% to 100%.]
Figure 2. Calibration plot comparing predicted probabilities to observed exploitation in the following 30 days
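A minimal sketch of both calibration checks is shown below, using synthetic predictions that are calibrated by construction. Note that the Brier score reflects sharpness as well as calibration, so its absolute value also depends on how concentrated the predictions are.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
p = rng.uniform(0, 1, size=20000)              # stand-in predicted probabilities
y = (rng.uniform(size=20000) < p).astype(int)  # outcomes drawn at rate p

# Mean squared difference between prediction and outcome.
print(f"Brier score: {brier_score_loss(y, p):.4f}")

# Binned observed frequency vs. binned mean prediction; a calibrated
# model tracks the 45-degree line.
frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```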
5.4. Simple Remediation Strategies

Research conducted by Kenna Security and Cyentia tracked vulnerabilities at hundreds of companies and found that, on average, companies were only able to remediate about 15.5% of their open vulnerabilities in a month [Institute and Security(2022)]. This research also found that resource capacity for remediating vulnerabilities varies considerably across companies, which suggests that any vulnerability remediation strategy should accommodate varying levels of corporate resources and budgets. Indeed, organizations with fewer resources (presumably smaller organizations) may prefer to emphasize efficiency over coverage, to optimize their spending, while larger organizations may accept less efficient strategies in exchange for greater coverage (i.e., more vulnerabilities patched).

Therefore, we compare the amount of effort required (as measured by the number of vulnerabilities needing to be remediated) for differing remediation strategies. Figure 3 highlights the performance of 6 simple (but practical) vulnerability prioritization strategies based on our test data (December 1st, 2022).9

9. Performance is measured based on exploitation activity in the following 30 days.

[Figure 3: six Venn-style panels comparing All CVEs, CVEs Prioritized, and Exploited CVEs for each heuristic:
CVSS:3.1/PR:N | Effort: 70.4% of CVEs | Coverage: 88.1% | Efficiency: 5.1%
Tag: Code Execution | Effort: 17.2% of CVEs | Coverage: 48.0% | Efficiency: 11.4%
Exploit: Exploit DB | Effort: 10.9% of CVEs | Coverage: 34.7% | Efficiency: 13.0%
CWE-119: Buffer Overflow | Effort: 6.2% of CVEs | Coverage: 16.9% | Efficiency: 11.1%
Exploit: metasploit | Effort: 1.0% of CVEs | Coverage: 14.9% | Efficiency: 60.5%
Site: KEV | Effort: 0.5% of CVEs | Coverage: 5.9% | Efficiency: 53.2%]
Figure 3. Alternative strategies based on simple heuristics

The first diagram in the upper row considers a strategy based on the CVSS v3.x vector "Privileges Required: None". A vulnerability that can be exploited without any established account credentials is attractive to an attacker. While this strategy would yield 88.1% coverage, it would achieve only 5.1% efficiency. That is, from a defender's perspective, this class of vulnerabilities represents over 130,000 (70%) of all published CVEs, and would easily surpass the resource capacity of most organizations.

"Code Execution" is another attractive vulnerability
attribute for attackers, since these vulnerabilities could allow the attacker to achieve full control of a target asset. However, remediating all the code execution vulnerabilities (17%, or about 32,000 of all CVEs) would achieve 48% coverage and 11.4% efficiency.

The middle row of Figure 3 shows remediation strategies for vulnerabilities published in Exploit DB (left) and Buffer Overflows (CWE-119; right), respectively.

The bottom row of Figure 3 is especially revealing. The bottom right diagram shows performance metrics for a remediation strategy based on patching vulnerabilities from the Known Exploited Vulnerabilities (KEV) list (as of Dec 1, 2022) from DHS/CISA. The KEV list is meant to prioritize vulnerability remediation for US Federal agencies as per Binding Operational Directive 22-01.10 Strictly following the KEV would remediate half of one percent (0.5%) of all published CVEs and produce a relatively high efficiency of 53.2%. However, with almost 8,000 unique CVEs showing exploitation activity in December, the coverage obtained from this strategy is only 5.9%. Alternatively, the bottom left diagram shows a remediation strategy based on whether a vulnerability appears in a Metasploit module. In this case, a network defender would need to remediate almost twice as many vulnerabilities as appear on the KEV list, but would enjoy 13% greater efficiency (60.5% vs 53.2%) and almost three times more coverage (14.9% vs 5.9%). Therefore, based on this simple heuristic (KEV vs Metasploit), the Metasploit strategy outperforms the KEV strategy.

10. See https://www.cisa.gov/binding-operational-directive-22-01

5.5. Advanced remediation strategies

Next we explore the real-world performance of our model, using two separate approaches. We first compare coverage among four remediation strategies while holding the level of effort constant (i.e., the number of vulnerabilities needing to be remediated); we then compare levels of effort while holding coverage constant.

Figure 4 compares the four strategies while maintaining approximately the same level of effort. That is, the blue circle in the middle of each figure – representing the number of vulnerabilities that would need to be remediated – is fixed to the same size for each strategy, at approximately 15%, or about 28,000 vulnerabilities.

[Figure 4: Venn-style panels (All CVEs, CVEs Above Threshold, Exploited) at a fixed effort of roughly 15% of CVEs:
CVSS v3.x | Threshold: 9.1+ | Effort: 15.1% of CVEs | Coverage: 33.5% | Efficiency: 6.1%
EPSS v1 | Threshold: 0.062+ | Effort: 15.1% of CVEs | Coverage: 57.0% | Efficiency: 15.4%
EPSS v2 | Threshold: 0.037+ | Effort: 15.4% of CVEs | Coverage: 69.9% | Efficiency: 18.5%
EPSS v3 | Threshold: 0.022+ | Effort: 15.3% of CVEs | Coverage: 90.4% | Efficiency: 24.1%]
Figure 4. Strategy comparisons holding the level of effort constant

The CVSS strategy, for example, would remediate vulnerabilities with a base score of 9.1 or greater, and would achieve coverage and efficiency of 33.5% and 6.1%, respectively. A remediation strategy based on EPSS v2, on the other hand, would remediate vulnerabilities with an EPSS v2 score of 0.037 and greater, yielding 69.9% coverage and 18.5% efficiency. Already, this strategy doubles the coverage and triples the efficiency relative to the CVSS strategy. Even better results are achieved with a remediation strategy based on EPSS v3, which enjoys 90.4% coverage and 24.1% efficiency.
and 24.1% efficiency.
Figure 5 compares the four strategies while maintain- 6.1. Comparison to neural networks
ing approximately the same level of coverage. That is, the
proportion of the red circle (exploitation activity) covered In addition to the XGBoost model presented in sec-
by the blue circle (number of vulnerabilities needing to be tion 5, we also train a transformer-based classifier on
our data set. Transformers [Vaswani et al.(2017)] have
10. ”See https://www.cisa.gov/binding-operational-directive-22-01” achieved state-of-the-art performance in a wide range of

8
sequence modeling tasks, especially for natural language processing. Note that our feature set can be thought of as a sequence of tag/value pairs (ti, vi) that have been assigned to a CVE, where ti contains the integer index assigned to a tag, and vi represents the associated value (e.g., a count, or simply one for binary features).11 To feed this sequence to a transformer model, we convert each item/tag to an n-dimensional embedding using fθ(ti, vi) := eθ(ti) + gθ(vi), where eθ(·) is an embedding lookup table, and gθ(·) maps a value to an n-dimensional embedding. We use n = 256, 4 layers, 4 attention heads, and an intermediate layer size of 1024. For gθ(·), we use a fully connected neural network with two layers, a hidden layer size of 256, and the tanh activation function.

11. We normalize values associated with each feature/tag to have a maximum of one.

We train the above classifier for 100,000 iterations with a batch size of 128 and a learning rate of 0.0001, achieving a precision-recall AUC of 0.7374 (as opposed to 0.7795 for the XGBoost model presented in section 5). We believe the slightly lower performance is due to the aptness of XGBoost for modeling tabular data and its lower susceptibility to overfitting. This further justifies our original model choice for predicting exploitation in the wild.
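A sketch of this embedding scheme in PyTorch follows; the classification head and training loop are omitted, and details not specified above (such as the encoder's internal activation) are left at library defaults, so this is illustrative rather than the exact model.

```python
import torch
import torch.nn as nn

class TagValueEmbedding(nn.Module):
    """f(t, v) = e(t) + g(v): a learned lookup for the tag index plus a
    small MLP mapping the (normalized) value to the same dimension."""
    def __init__(self, n_tags: int, dim: int = 256):
        super().__init__()
        self.e = nn.Embedding(n_tags, dim)        # e_theta: lookup table
        self.g = nn.Sequential(                   # g_theta: two layers,
            nn.Linear(1, 256), nn.Tanh(),         # hidden size 256, tanh
            nn.Linear(256, dim),
        )

    def forward(self, tags: torch.Tensor, values: torch.Tensor):
        return self.e(tags) + self.g(values.unsqueeze(-1))

embed = TagValueEmbedding(n_tags=1477, dim=256)
tags = torch.tensor([[3, 17, 902]])       # tag indices for one CVE
vals = torch.tensor([[1.0, 0.25, 1.0]])   # values normalized to <= 1
tokens = embed(tags, vals)                # shape (1, 3, 256)

# The token sequence then feeds a standard transformer encoder
# (4 layers, 4 heads, intermediate size 1024, as in the text).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4,
                               dim_feedforward=1024, batch_first=True),
    num_layers=4)
print(encoder(tokens).shape)  # torch.Size([1, 3, 256])
```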
6.2. Limitations and adversarial considerations

This research is conducted with a number of limitations. First, insights are limited to data collected from our data partners and the geographic and organizational coverage of their network collection devices. While these data providers collectively manage hundreds of thousands of sensors across the globe, and across organizations of all sizes and industries, they do not observe every attempted exploit event in every network. Nevertheless, it is plausible to think that the data used, and therefore any inferences provided, are representative of mass exploitation activity.

In regard to the nature of how vulnerabilities are detected, any signature-based detection device is only able to alert on events that it was programmed to observe. Therefore, we are not able to observe vulnerabilities that were exploited but went undetected by the sensor because a signature was not written.

Moreover, the nature of the detection devices generating the events will be biased toward detecting network-based attacks, as opposed to attacks from other attack vectors such as host-based attacks or methods requiring physical proximity.12 Similarly, these detection systems will typically be installed on public-facing perimeter internet devices, and are therefore less suited to detecting attacks against internet of things (IoT) devices, automotive networks, ICS, SCADA, operational technology (OT), medical devices, etc.

12. For example, it is unlikely to find evidence of exploitation for CVE-2022-37418 in our data set, a vulnerability in the remote keyless entry systems on specific makes and models of automobiles.

Given the exploit data from the data partners, we are not able to distinguish between exploit activity generated by researchers or commercial entities, versus actual malicious exploit activity. While it is likely that some proportion of exploitation does originate from non-malicious sources, at this point we have no reliable way of estimating the true proportion. However, based on the collective authors' experience, and on discussions with our data providers, we do not believe that this represents a significant percentage of exploitation activity.

While these points may limit the scope of our inferences, to the extent that our data collection is representative of an ecosystem of public-facing, network-based attacks, we believe that many of the insights presented here generalize beyond this dataset.

In addition to these limitations, there are other adversarial considerations that fall outside the scope of this paper. For example, one potential concern is the opportunity for adversarial manipulation, either of the EPSS model or using the EPSS scores. For example, it may be possible for malicious actors to poison or otherwise manipulate the input data to the EPSS model (e.g., GitHub, Twitter). These issues have been studied extensively in the context of machine learning for exploit prediction [Sabottke et al.(2015)] and other tasks [Suciu et al.(2018)], [Chakraborty et al.(2018)], and their potential impact is well understood. Given that we have no evidence of such attacks in practice, and given our reliance on data from many distinct sources, which would reduce the leverage of adversaries, we leave an in-depth investigation of the matter for future work. Additionally, it is possible that malicious actors may change their strategies based on EPSS scores. For example, if network defenders increasingly adopt EPSS as the primary method for prioritizing vulnerability remediation, thereby deprioritizing vulnerabilities with lower EPSS scores, it is conceivable that attackers would begin to strategically incorporate these lower-scoring vulnerabilities into their tactics and malware. While possible, we are not aware of any actual or suggestive evidence to this effect.
[Figure 6: density plots (log-scale x-axis, absolute SHAP values from 0 to 5) for each family of features, ordered by decreasing mean: Exploit Code, CVE (age+refs), CVSS Vectors, Sites, Scanners, Twitter, Tag, CWE, Vendor.]
Figure 6. Density plots of the absolute SHAP values for each family of features

Finally, while evolving the model from a logistic regression to a more sophisticated machine learning approach greatly improved the performance of EPSS, an important consequence is that the interpretability of variable contributions is more difficult to quantify, as we discuss in the next section.

6.3. Variable importance and contribution

While an XGBoost model is not nearly as intuitive or interpretable as linear regression, we can use SHAP values [Lundberg and Lee(2017)] to reduce the opacity of a trained model by quantifying feature contributions, breaking down the score assigned to a CVE as ϕ0 + Σi ϕi, where ϕi is the contribution from feature i, and ϕ0 is a bias term. We use SHAP values due to their good properties, such as local accuracy (attributions sum up to the output of the model), missingness (missing features are given no importance), and consistency (modifying a model so that a feature is given more weight never decreases its attribution).

The contributions from different classes of variables are shown in the kernel density plots of Figure 6. First, note that the figure displays the absolute value of the SHAP values, in order to capture the contribution of each variable away from zero. Second, note that the horizontal axis is presented on a log scale, highlighting that the majority of features do not contribute much weight to the final output. In addition, the thin line extending out to the right in Figure 6 illustrates that there are instances of features within each class that contribute a significant amount. Finally, note that Figure 6 is sorted by decreasing mean absolute SHAP value for each class of features, highlighting the observation that published exploit code is the strongest contributor to the estimated probability of exploitation activity.

[Figure 7: bar chart of mean absolute SHAP values (roughly 0 to 0.4) for the 30 most significant individual features, in decreasing order: CVE: Count of References; Tag: Remote; Tag: Code Execution; Exploit: Exploit DB; CVE: Age of CVE; Vendor: Microsoft; CVSS: 3.1/AV:N; CVSS: 3.1/PR:N; CVSS: 3.1/A:H; CVSS: 3.1/C:H; Site: ZDI; Exploit: metasploit; NVD: Exploit Ref; NVD: VDB Ref; NVD: US Gov Ref; Tag: SQLi; Scanner: Nuclei; Vendor: Adobe; CVSS: 3.1/UI:N; NVD: Vendor Advisory Ref; Tag: Local; NVD: 3party Advisory Ref; NVD: Patch Ref; CVSS: 3.1/I:H; Tag: XSS; Tag: Denial of Service; Site: KEV; CVSS: 3.1/Scored; Exploit: Github; Tag: Buffer Overflow.]
Figure 7. Mean absolute SHAP value for individual features

Figure 7 identifies the 30 most significant individual features with their calculated mean absolute SHAP value. Again, higher values indicate a greater influence (either positive or negative) on the final predicted value. Note that Figure 6 shows the mean absolute SHAP value for an entire class of features: even though Exploit Code as a class has a higher mean absolute SHAP value, the largest individual feature comes from the count of references in the published CVE (which is in the "CVE" class). The most influential individual feature is the count of references in MITRE's CVE List, followed by "remote attackers," "code execution," and published exploit code in Exploit-DB, respectively.
7. Literature Review and Related Scoring Systems

This research is informed by multiple bodies of literature. First, there are a number of industry efforts that seek to provide some measure of exploitability for individual vulnerabilities, though there is wide variation in their scope and availability. The base metric group of CVSS, the leading standard for measuring the severity of a vulnerability, is composed of two parts, measuring impact and exploitability [FIRST(2019)]. The score is built on expert judgements, capturing, for example, the observation that a broader ability to exploit a vulnerability (i.e., remotely across the Internet, as opposed to requiring local access to the device), a less complex exploit, or less required user interaction all serve to increase the apparent likelihood that a vulnerability could be exploited, all else being equal. CVSS has been repeatedly shown by prior work [Allodi and Massacci(2012b)], [Allodi and Massacci(2014)], as well as by our own evidence, to be insufficient for capturing all the factors that drive exploitation in the wild. The U.S. National Vulnerability Database (NVD) includes a CVSS base score with nearly all vulnerabilities it publishes. Because of the widespread use of CVSS, specifically the base score, as a prioritization strategy, we compare our performance against CVSS as well as against our previous models.
Exploit likelihood is also modeled through various vendor-specific metrics. In 2008, Microsoft introduced the Exploitability Index for vulnerabilities in their products [Microsoft(2020)]. It provides four ratings for the likelihood that a vulnerability will be exploited: whether exploitation has already been detected, and whether exploitation is more likely, less likely, or unlikely. The metric has been investigated before [Reuters([n. d.])], [Eiram(2013)], [Younis and Malaiya(2015)] and was shown to have limited performance at predicting exploitation in the wild [DarkReading(2008)], [Reuters([n. d.])] or the development of functional exploits [Suciu et al.(2022)].

Red Hat provides a four-level severity rating: low, moderate, important, and critical [RedHat(2023)]. In addition to capturing a measure of the impact to a vulnerable system, this index also captures some notion of exploitability. For example, the "low" severity rating represents vulnerabilities that are unlikely to be exploited, whereas the "critical" severity rating reflects vulnerabilities that could be easily exploited by an unauthenticated remote attacker. Like the Exploitability Index, Red Hat's metric is vendor-specific and has limitations in reflecting exploitation likelihood [Suciu et al.(2022)].

A series of commercial solutions also aim to capture the likelihood of exploits. Tenable, a leading vendor of intrusion detection systems, created the Vulnerability Priority Rating (VPR), which, like CVSS, combines information about both the impact to a vulnerable system and the exploitability (threat) of a vulnerability in order to help network defenders better prioritize remediation efforts [Tenable(2020)]. For example, the threat component of VPR "reflects both recent and potential future threat activity" by examining whether exploit code is publicly available, whether there are mentions of active exploitation on social media or in the dark web, etc. Rapid7's Real Risk Score product uses its own collection of data feeds to produce a score between 1 and 1,000. This score is a combination of the CVSS base score, "malware exposure, exploit exposure and ease of use, and vulnerability age" and seeks to produce a better measure of both exploitability and "risk" [Rapid7(2023)]. Recorded Future's Vulnerability Intelligence product integrates multiple data sources, including threat information and localized asset criticality [Recorded Future(2023)]. The predictions, performance evaluations, and implementation details of these solutions are not publicly available.

In short, these industry efforts are either vendor-specific, score only subsets of vulnerabilities, are based on expert opinion and assessments (and are therefore not entirely data-driven), or are proprietary and not publicly available.

Our work is also related to a growing academic research field of predicting and detecting vulnerability exploitation. A large body of work focuses on predicting the emergence of proof-of-concept or functional exploits [Bozorgi et al.(2010)], [Edkrantz and Said(2015)], [Bullough et al.(2017)], [Reinthal et al.(2018)], [Alperin et al.(2019)], [Bhatt et al.(2021)], [Suciu et al.(2022)], not necessarily whether these exploits will be used in the wild, as is done with EPSS. Papers predicting exploitation in the wild have used alternative sources of exploitation evidence, most notably data from Symantec's IDS, to build prediction models ([Sabottke et al.(2015)], [Almukaynizi et al.(2017)], [Chen et al.(2019)], [Xiao et al.(2018)], [Tavabi et al.(2018)], [Fang et al.(2020)], [Hoque et al.(2021)]). Most of these papers build vulnerability feature sets from commonly used data sources such as NVD or OSVDB, although some of them use novel identifiers of exploitation: [Sabottke et al.(2015)] infers exploitation using Twitter data, [Xiao et al.(2018)] uses patching patterns and blacklist information to predict whether organizations are facing new exploits, while [Tavabi et al.(2018)] uses natural language processing methods to infer the context of darkweb/deepweb discussions.

Compared to the other scoring systems and research described above, EPSS is: a rigorous and ongoing research effort; international and community-driven; designed to predict vulnerability exploitation in the wild; available for all known and published vulnerabilities; updated daily to reflect new vulnerabilities and new exploit-related information; and made freely available to the public.

8. Conclusion

In this paper, we presented results from an international, community-driven effort to collect and analyze software vulnerability exploit data, and to build a machine learning model capable of estimating the probability that a vulnerability will be exploited within 30 days following the prediction. In particular, we described the process of collecting each of the additional variables, and described the approaches used to create the machine learning model based on 6.4 million observed exploit attempts. Through the expanded data sources, we achieved an unprecedented 82% improvement in classifier performance over the previous iterations of EPSS.

We illustrated the practical use of EPSS by way of comparison with a set of alternative vulnerability remediation strategies. In particular, we showed the sizeable and meaningful improvement in coverage, efficiency, and level of effort (as measured by the number of vulnerabilities that would need to be remediated) achieved by using EPSS v3 over any and all current remediation approaches, including CVSS, CISA's KEV list, and Metasploit.

As the EPSS effort continues to grow, acquire and ingest new data, and improve modeling techniques with each new version, we believe it will continue to improve in performance and provide new and fundamental insights into vulnerability exploitation for many years to come.

Acknowledgements

We would like to acknowledge the participants of the EPSS Special Interest Group (SIG), as well as the organizations that have contributed to the EPSS data model, including Fortinet, Shadow Server Foundation, Greynoise, Alien Vault, Cyentia, and FIRST.

References

[Allodi and Massacci(2012a)] Luca Allodi and Fabio Massacci. 2012a. A Preliminary Analysis of Vulnerability Scores for Attacks in Wild. In CCS BADGERS Workshop. Raleigh, NC.

[Allodi and Massacci(2012b)] Luca Allodi and Fabio Massacci. 2012b. A preliminary analysis of vulnerability scores for attacks in wild: The EKITS and SYN datasets. In Proceedings of the 2012 ACM Workshop on Building Analysis Datasets and Gathering Experience Returns for Security. 17–24.

[Allodi and Massacci(2014)] Luca Allodi and Fabio Massacci. 2014. Comparing vulnerability severity and exploits using case-control studies. ACM Transactions on Information and System Security (TISSEC) 17, 1 (2014), 1–20.

[Almukaynizi et al.(2017)] Mohammed Almukaynizi, Eric Nunes, Krishna Dharaiya, Manoj Senguttuvan, Jana Shakarian, and Paulo Shakarian. 2017. Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online. In 2017 International Conference on Cyber Conflict (CyCon US). IEEE, 82–88.

[Alperin et al.(2019)] Kenneth Alperin, Allan Wollaber, Dennis Ross, Pierre Trepagnier, and Leslie Leonard. 2019. Risk prioritization by leveraging latent vulnerability features in a contested environment. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 49–57.

[Bhatt et al.(2021)] Navneet Bhatt, Adarsh Anand, and Venkata SS Yadavalli. 2021. Exploitability prediction of software vulnerabilities. Quality and Reliability Engineering International 37, 2 (2021), 648–663.

[Bozorgi et al.(2010)] Mehran Bozorgi, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. 2010. Beyond Heuristics: Learning to Classify Vulnerabilities and Predict Exploits. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 105–114.

[Brier et al.(1950)] Glenn W Brier et al. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1 (1950), 1–3.

[Bullough et al.(2017)] Benjamin L Bullough, Anna K Yanchenko, Christopher L Smith, and Joseph R Zipkin. 2017. Predicting Exploitation of Disclosed Software Vulnerabilities Using Open-source Data. In Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics. 45–53.

[Cawley and Talbot(2010)] Gavin C Cawley and Nicola LC Talbot. 2010. On over-fitting in model selection and subsequent selection bias in performance evaluation. The Journal of Machine Learning Research 11 (2010), 2079–2107.

[Chakraborty et al.(2018)] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. 2018. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069 (2018).

[Chen et al.(2019)] Haipeng Chen, Rui Liu, Noseong Park, and VS Subrahmanian. 2019. Using Twitter to predict when vulnerabilities will be exploited. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3143–3152.

[Chen and Guestrin(2016)] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.

[DarkReading(2008)] DarkReading 2008. Black Hat: The Microsoft Exploitability Index: More Vulnerability Madness. DarkReading. https://www.darkreading.com/risk/black-hat-the-microsoft-exploitability-index-more-vulnerability-madness.

[Edkrantz and Said(2015)] Michel Edkrantz and Alan Said. 2015. Predicting Cyber Vulnerability Exploits with Machine Learning. In SCAI. 48–57.

[Eiram(2013)] C Eiram. 2013. Exploitability/Priority Index Rating Systems (Approaches, Value, and Limitations).

[Fang et al.(2020)] Yong Fang, Yongcheng Liu, Cheng Huang, and Liang Liu. 2020. FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm. PLoS ONE 15, 2 (2020), e0228439.

[FIRST(2019)] FIRST 2019. A complete guide to the common vulnerability scoring system. https://www.first.org/cvss/v3.0/specification-document.

[Grinsztajn et al.(2022)] Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[Hoque et al.(2021)] Mohammad Shamsul Hoque, Norziana Jamil, Nowshad Amin, and Kwok-Yan Lam. 2021. An Improved Vulnerability Exploitation Prediction Model with Novel Cost Function and Custom Trained Word Vector Embedding. Sensors 21, 12 (2021), 4220.

[Institute and Security(2022)] Cyentia Institute and Kenna Security. 2022. Prioritization to Prediction Vol 8. (2022). https://www.kennasecurity.com/resources/prioritization-to-prediction-reports/

[Jacobs et al.(2020)] Jay Jacobs, Sasha Romanosky, Idris Adjerid, and Wade Baker. 2020. Improving vulnerability remediation through better exploit prediction. Journal of Cybersecurity 6, 1 (2020), tyaa015.

[Jacobs et al.(2021)] Jay Jacobs, Sasha Romanosky, Benjamin Edwards, Idris Adjerid, and Michael Roytman. 2021. Exploit Prediction Scoring System (EPSS). Digital Threats: Research and Practice 2, 3 (2021), 1–17.

[Kohavi et al.(1995)] Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, Vol. 14. Montreal, Canada, 1137–1145.

[Lundberg and Lee(2017)] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.

[Microsoft(2020)] Microsoft 2020. Microsoft Exploitability Index. Microsoft. https://www.microsoft.com/en-us/msrc/exploitability-index.

[Nowak et al.(2021)] Maciej Nowak, Michał Walkowski, and Sławomir Sujecki. 2021. Conversion of CVSS Base Score from 2.0 to 3.1. In 2021 International Conference on Software, Telecommunications and Computer Networks (SoftCOM). IEEE, 1–3.

[Rapid7(2023)] Rapid7 2023. Prioritize Vulnerabilities Like an Attacker. Rapid7. https://www.rapid7.com/products/insightvm/features/real-risk-prioritization/.

[Recorded Future(2023)] Recorded Future 2023. Prioritize patching based on risk. Recorded Future. https://www.recordedfuture.com/platform/vulnerability-intelligence.

[RedHat(2023)] RedHat 2023. Severity ratings. RedHat. https://access.redhat.com/security/updates/classification/.

[Reinthal et al.(2018)] Alexander Reinthal, Eleftherios Lef Filippakis, and Magnus Almgren. 2018. Data modelling for predicting exploits. In Nordic Conference on Secure IT Systems. Springer, 336–351.

[Reuters([n. d.])] Reuters. [n. d.]. Microsoft correctly predicts reliable exploits just 27% of the time. https://www.reuters.com/article/urnidgns852573c400693880002576630073ead6/microsoft-correctly-predicts-reliable-exploits-just-27-of-the-time-idUS1867772068

[Rose et al.(2010)] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory (2010), 1–20.

[Sabottke et al.(2015)] Carl Sabottke, Octavian Suciu, and Tudor Dumitraș. 2015. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits. In 24th USENIX Security Symposium (USENIX Security 15). 1041–1056.

[Suciu et al.(2018)] Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras. 2018. When does machine learning FAIL? Generalized transferability for evasion and poisoning attacks. In 27th USENIX Security Symposium (USENIX Security 18). 1299–1316.

[Suciu et al.(2022)] Octavian Suciu, Connor Nelson, Zhuoer Lyu, Tiffany Bao, and Tudor Dumitraș. 2022. Expected exploitability: Predicting the development of functional vulnerability exploits. In 31st USENIX Security Symposium (USENIX Security 22). 377–394.

[Tavabi et al.(2018)] Nazgol Tavabi, Palash Goyal, Mohammed Almukaynizi, Paulo Shakarian, and Kristina Lerman. 2018. Darkembed: Exploit prediction with neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.

[Tenable(2020)] Tenable 2020. What Is VPR and How Is It Different from CVSS? Tenable. https://www.tenable.com/blog/what-is-vpr-and-how-is-it-different-from-cvss.

[Vaswani et al.(2017)] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).

[Xiao et al.(2018)] Chaowei Xiao, Armin Sarabi, Yang Liu, Bo Li, Mingyan Liu, and Tudor Dumitras. 2018. From patching delays to infection symptoms: Using risk profiles for an early discovery of vulnerabilities exploited in the wild. In 27th USENIX Security Symposium (USENIX Security 18). 903–918.

[Yang and Shami(2020)] Li Yang and Abdallah Shami. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415 (2020), 295–316.

[Yang and Pedersen(1997)] Yiming Yang and Jan O Pedersen. 1997. A comparative study on feature selection in text categorization. In ICML, Vol. 97. Citeseer, 35.

[Younis and Malaiya(2015)] Awad A Younis and Yashwant K Malaiya. 2015. Comparing and evaluating CVSS base metrics and Microsoft rating system. In 2015 IEEE International Conference on Software Quality, Reliability and Security. IEEE, 252–261.
