Enhancing Vulnerability Prioritization
activity is consolidated into a single boolean value (0 or 1), identifying days on which exploitation activity was reported for any given CVE across any of the available data sources. Structuring the training data according to this boolean time series enables us to estimate the probability of exploitation activity in any upcoming window of time, though the consensus in the EPSS Special Interest Group was to standardize on a 30-day window to align with most enterprise patch cycles.

The exploit data used in this research paper covers activity from July 1, 2016 to December 31, 2022 (2,374 days / 78 months / 6.5 years), over which we collected 6.4 million exploitation observations (date and CVE combinations), targeting 12,243 unique vulnerabilities. Based on this data, we find that 6.4% (12,243 of 192,035) of all published vulnerabilities were observed to be exploited during this period, which is consistent with previous findings [21, 22].

3.2 Explanatory variables/features

In total, EPSS leverages 1,477 features for predicting exploitation activity. Next, we describe the data sources used to construct these features.

Published exploit code. We first consider the correlation between exploitation in the wild and the existence of publicly available exploit code, which is collected from three sources (courtesy of Cyentia³): Exploit-DB, GitHub, and Metasploit. In total we identified 24,133 CVEs with published exploit code, consisting of 20,604 CVEs from Exploit-DB, 4,049 published on GitHub, and 1,905 published as Metasploit modules. Even though Exploit-DB contains the majority of published exploits, GitHub has become a valuable source in recent years. For example, in 2022, 1,591 exploits were published on GitHub, while Exploit-DB and Metasploit added 196 and 94 entries, respectively.

Public vulnerability lists. Next, we consider that exploitation activity may be forecasted by the presence of vulnerabilities on popular lists and/or websites that maintain and share information about selective vulnerabilities. Google Project Zero maintains a listing⁴ of "publicly known cases of detected zero-day exploits".⁵ This may help us forecast exploitation activity as the vulnerability slides into N-day status. We include 162 unique CVEs listed by Google Project Zero.

Trend Micro's Zero Day Initiative (ZDI), the "world's largest vendor-agnostic bug bounty program",⁶ works with researchers and vendors to responsibly disclose zero-day vulnerabilities and issue public advisories about vulnerabilities at the conclusion of their process. We include 7,356 CVEs that have public advisories issued by ZDI.

The Known Exploited Vulnerabilities (KEV) catalog from the US Department of Homeland Security's Cybersecurity and Infrastructure Security Agency (CISA) is an "authoritative source of vulnerabilities that have been exploited in the wild".⁷ We include 866 CVEs from CISA's KEV list.

These sources lack transparency about when exploitation activity was observed, and for how long this activity was ongoing. However, because past exploitation attempts might influence the likelihood of future attacks, we include these indicators as binary features for our model.

Social media. Exploitation may also be correlated with social media discussions, and therefore we collect Twitter mentions of CVEs, counting these mentions within three different historical time windows (7, 30, and 90 days). We only count primary and original tweets and exclude retweets and quoted retweets. The median number of daily unique tweets mentioning CVEs is 1,308, with the 25th and 75th percentiles of daily tweets being 607 and 1,400, respectively. We currently make no attempt to validate the content or filter out automated posts (from bots).

Offensive security tools. We also collect evidence of vulnerabilities being used in offensive security tools that are designed, in part, to identify vulnerabilities during penetration tests. We are currently gathering information from four different offensive security tools with varying numbers of CVEs identified in each: Nuclei with 1,548 CVEs, Jaeles with 206 CVEs, Intrigue with 169 CVEs and

³ https://www.cyentia.com/services/exploit-intelligence-service
⁴ https://docs.google.com/spreadsheets/d/1lkNJ0uQwbeC1ZTRrxdtuPLCIl7mlUreoKfSIgajnSyY/view#gid=1190662839
⁵ https://googleprojectzero.blogspot.com/p/0day.html
⁶ https://www.zerodayinitiative.com/about
⁷ https://www.cisa.gov/known-exploited-vulnerabilities
Sn1per with 63 CVEs. These are encoded as binary features which indicate whether each particular source is capable of scanning for and reporting on the presence of each vulnerability.

References. In order to capture metrics around the activity and analysis related to vulnerabilities, for each CVE, we count the number of references listed in MITRE's CVE list, as well as the number of references with each of the 16 reference tags assigned by NVD. The labels and their associated prevalence across CVEs are: Vendor Advisory (102,965), Third Party Advisory (84,224), Patch (59,660), Exploit (54,633), VDB Entry (31,880), Issue Tracking (16,848), Mailing List (15,228), US Government Resource (11,164), Release Notes (9,308), Permissions Required (3,980), Broken Link (3,934), Product (3,532), Mitigation (2,983), Technical Description (1,686), Not Applicable (961), and Press/Media Coverage (124).

Keyword description of the vulnerability. To capture attributes of the vulnerabilities themselves, we use the same process as described in previous research [21, 22]. This process detects and extracts hundreds of common multiword expressions used to describe and discuss vulnerabilities. These expressions are then grouped and normalized into common vulnerability concepts. The top tags we included and their associated CVE counts are as follows: "remote attacker" (80,942), "web" (31,866), "code execution" (31,330), "denial of service" (28,478), and "authenticated" (21,492). In total, we include 147 binary features for identifying such tags.

We followed the same process as EPSS v1 for extracting multiword expressions from the text of references using Rapid Automatic Keyword Extraction [31].

CVSS metrics. To capture other attributes of vulnerabilities, we collect CVSS base metrics. These consist of exploitability measurements (attack vector, attack complexity, privileges required, user interaction, scope) and the three impact measurements (confidentiality, integrity and availability). These categorical variables are encoded using one-hot encoding. We collected CVSS version 3 information from NVD for 118,087 vulnerabilities. However, 73,327 vulnerabilities were published before CVSSv3 was created and are only scored in NVD using CVSSv2. To address this, we developed a separate and dedicated machine learning model to estimate the CVSSv3 measurement values for each of these vulnerabilities.

We use a process similar to prior work [25], where for each CVE, we use the CVSSv2 sub-components for CVEs which have both CVSSv2 and CVSSv3 scores. We then train a feedforward neural network to predict CVSSv3 vectors. The model was validated using 8-fold, yearly stratified, cross-validation, achieving 74.9% accuracy when predicting the exact CVSSv3 vector. For 99.9% of vectors, we predict the majority (5 or more) of the individual metrics correctly. For each individual portion of the CVSSv3 vector we were able to achieve a minimum of 93.4% accuracy (on the Privileges Required metric). We note that this exceeds the accuracy achieved by [25], and likely warrants further research into the robustness of CVSSv3 prediction and its possible application to future versions of CVSS.

CWE. We also capture the observation that different types of vulnerabilities may be more or less attractive to attackers, using the Common Weakness Enumeration (CWE), which is a "community-developed list of software and hardware weakness types".⁸ We collect the CWE assignments from NVD, noting that 21,570 CVEs do not have a CWE assigned. We derived binary features for CWEs found across at least 10 vulnerabilities, resulting in 186 CWE identifiers being included. In addition, we maintain two features for vulnerabilities where CWE information is not available, or the assigned CWEs are not among the common ones. The top CWE identifiers and their vulnerability counts are CWE 79 (20,797), CWE 119 (11,727), CWE 20 (9,590), CWE 89 (8,790), CWE 787 (7,624), CWE 200 (7,270), CWE 264 (5,485), CWE 22 (4,918), CWE 125 (4,743), and CWE 352 (4,081).

Vulnerable vendors. We suspect exploitation activity may be correlated to the market share and/or install base companies achieve. Therefore, we parse through the Common Platform Enumeration (CPE) data provided by NVD in order to identify platform records marked as "vulnerable", and extract only the vendor portion of the record. We did not make any attempt to fill in missing information or correct any typos or misspellings that may occasionally appear in the records. We ranked vendors according to the number of vulnerabilities, creating one binary feature for each vendor, and evaluated the effect of including less frequent vendors as features. We observed no performance improvements by including vendors with fewer than 10 CVEs in our dataset. As a result, we extracted 1,040 unique vendor features in the final model. The most prevalent vendors and their vulnerability counts are Microsoft (10,127), Google (9,100), Oracle (8,970), Debian (7,627), Apple (6,499), IBM (6,409), Cisco (5,766), RedHat (4,789), Adobe (4,627), and Fedora Project (4,166).

Age of the vulnerability. Finally, the age of a vulnerability might contribute or detract from the likelihood of exploitation. Intuitively, we expect old vulnerabilities to be less attractive to attackers due to a smaller vulnerable population. To capture this, we create a feature which records the number of days elapsed from CVE publication to the time of feature extraction in our model.

4 MODELING APPROACH

4.1 Preparing ground truth and features

Exploitation activity is considered as any recorded attempt to exploit a vulnerability, regardless of the success of the attempt, and regardless of whether the targeted vulnerability is present. All observed exploitation activity is recorded with the date the activity occurred and aggregated across all data sources by the date and CVE identifier. The resulting ground truth is a binary value for each vulnerability of whether exploitation activity was observed or not, for each day.

Since many of the features may change day by day, we construct features for the training data on a daily basis. In order to reduce the size of our data (and thus the time and memory needed to train models) we aggregate consecutive daily observations where features do not change. The size of the exposure and the number of days with exploitation activity are included in the model training.

When constructing the test data, a single date is selected (typically "today", see next section) and all of the features are generated

⁸ https://cwe.mitre.org
based on the state of vulnerabilities for that date. Since the final model is intended to estimate the probability of exploitation in the next 30 days, we construct the ground truth for the test data by looking for exploitation activity over the following 30 days from the test date selected.

4.2 Model selection

The first EPSS model [22] sought not only to accurately predict exploitation but to do so in a parsimonious, easy-to-implement way. As a result, regularized logistic regression (Elasticnet) was chosen to produce a generalized linear model with only a handful of variables. The current model relaxes this requirement in the hopes of improving performance and providing more accurate exploitation predictions. In particular, capturing non-linear relationships between inputs and exploitation activity will better predict the finer-grain exploitation activity.

Removing the requirement of a simple model with the need to model complex relationships expands the universe of potential models. Indeed, many machine learning algorithms have been developed for this exact purpose. However, testing all models is impractical because each model requires significant engineering and calibration to achieve an optimal outcome. We therefore focus on a single type of model that has proven to be particularly performant on these data. Recent research has illustrated that panel (tabular) data, such as ours, can be most successfully modeled using tree-based methods (in particular gradient boosted trees for regression) [18], arriving at similar or better predictive performance with less computation and tuning in comparison to other methods such as neural networks. Given the results in [18], we focus our efforts on tuning a common implementation of gradient boosted trees, XGBoost [12]. XGBoost is a popular, well-documented, and performant implementation of the gradient boosted tree algorithm in which successive decision trees are trained to iteratively reduce prediction error.

4.3 Train/test split and measuring performance

In order to reduce over-fitting, we implement two restrictions. First, we implement a time-based test/train split, constructing our training data sets on data up to and including October 31, 2021. We then construct the test data set based on the state of vulnerabilities on December 1st, 2021, providing one month between the end of the training data and the test data. As mentioned above, the ground truth in the test data is any exploitation activity from December 1st to December 30th, 2021. Second, we use 5-fold cross validation, with the folds based on each unique CVE identifier. This selectively removes vulnerabilities from the training data and tests the performance on the hold-out set, thus further reducing the likelihood of over-fitting.

Finally, we measure performance by calculating the area under the curve (AUC) based on precision and recall across the full range of predictions. We selected precision-recall since we have severe class imbalance in exploited vulnerabilities, and using accuracy or traditional Receiver Operator Characteristic (ROC) curves may be misleading due to that imbalance.

Table 2: Non-default hyperparameter values for XGBoost algorithm and data selection

Parameter                                        Value
Time horizon                                     1 year
Learning rate                                    0.11
Maximum tree depth                               20
Subsample ratio of the training instances        0.75
Minimum loss reduction for leaf node partition   10
Maximum delta step                               0.9
Number of boosting rounds                        65

4.4 Tuning and optimizing model performance

Despite being a well-studied approach, the use of gradient boosted trees and XGBoost for prediction problems still requires some effort to identify useful features and model tuning to achieve good model performance. This requires a priori decisions about which features to include and the hyperparameter values for the XGBoost algorithm.

The features outlined in subsection 3.2 include 28,724 variables. Many of these variables are binary features indicating whether a vulnerability affects a particular vendor or can be described by a specific CWE. While the XGBoost algorithm is efficient, including all of these variables in our inference is technically infeasible. To reduce the scope of features we take a naive, yet demonstrably effective, approach of removing variables below a specific occurrence rate [39]. This reduced the input feature set to 1,477 variables.

One additional challenge with our data is the temporal nature of our predictions: in particular, exactly how much historical data should be included in the data set. In addition to the XGBoost hyperparameters and the sparsity threshold, we also constructed four different sets of training data for 6 months and then 1, 2 and 3 years, to determine what time horizons would provide the best predictions.

To identify the time horizon and sparsity threshold described above, as well as the other hyperparameters needed by our implementation of gradient boosted trees, we take a standard approach described in [38]. We first define reasonable ranges for the hyperparameters, use Latin Hypercube sampling over the set of possible combinations, compute model performance for that set of hyperparameters, then finally build an additional model (also a gradient boosted tree) to predict performance given a set of hyperparameters, using the model to maximize performance.

The above process results in the parameters selected in Table 2. Note that of the tested time horizons, none dramatically outperformed the others, with 1 year only slightly outperforming the other tested possibilities.

5 EVALUATION

5.1 Precision (efficiency) and recall (coverage)

Precision and recall are commonly used machine learning performance metrics, but they are not intuitive for security practitioners, and therefore it can be difficult to contextualize what these performance metrics represent in practice.
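The Latin Hypercube sampling step of the tuning procedure above can be sketched in a few lines of Python. The hyperparameter ranges below are hypothetical (the search ranges are not listed here), and a full run would train an XGBoost model at each sampled point rather than just printing the samples:

```python
import random

def latin_hypercube(ranges, n, seed=0):
    """Draw n Latin Hypercube samples over a dict of (low, high) ranges.

    For each parameter, [low, high] is split into n equal strata; each
    sample takes exactly one stratum per parameter, with the stratum
    order shuffled independently across parameters.
    """
    rng = random.Random(seed)
    columns = {}
    for name, (low, high) in ranges.items():
        width = (high - low) / n
        strata = list(range(n))
        rng.shuffle(strata)
        # one uniform draw inside each assigned stratum
        columns[name] = [low + (s + rng.random()) * width for s in strata]
    return [{name: columns[name][i] for name in ranges} for i in range(n)]

# Hypothetical search ranges for a few XGBoost hyperparameters.
samples = latin_hypercube(
    {"learning_rate": (0.01, 0.3), "max_depth": (3, 25), "subsample": (0.5, 1.0)},
    n=10,
)
for s in samples[:2]:
    print(s)
```

Each sampled combination would then be scored (here, by precision-recall AUC on the hold-out set), and a surrogate gradient boosted tree fit to those (hyperparameters, score) pairs to pick the next candidates.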
Precision (efficiency) measures how well resources are being allocated (where low efficiency represents wasted effort), and is calculated as the true positives divided by the sum of the true and false positives. In the vulnerability management context, efficiency addresses the question, "out of all the vulnerabilities remediated, how many were actually exploited?" If a remediation strategy suggests patching 100 vulnerabilities, 60 of which were exploited, the efficiency would be 60%.

Recall (coverage), on the other hand, considers how well a remediation strategy actually addresses those vulnerabilities that should be patched (e.g., that have observed exploitation activity), and is calculated as the true positives divided by the sum of the true positives and false negatives. In the vulnerability management context, coverage addresses the question, "out of all the vulnerabilities that are being exploited, how many were actually remediated?" If 100 vulnerabilities are exploited, 40 of which are patched, the coverage would be 40%.

Therefore, for the purpose of this article, we use the terms efficiency and coverage interchangeably with precision and recall, respectively, in the discussions below.

[Figure 1: Performance of EPSS v3 compared to previous versions and CVSS Base Score. Axes: Recall (Coverage) vs. Precision (Efficiency), with curves for EPSS v3, EPSS v2, EPSS v1, and the CVSS v3.x Base Score. Labeled points show thresholds; CVEs scoring at or above a threshold are prioritized.]

[Figure 2: Calibration plot comparing predicted probabilities to observed exploitation in the following 30 days. Panel labels in the original include Exploit:metasploit and Site:KEV.]

⁹ Performance is then measured based on exploitation activity in the following 30 days.
¹⁰ See https://www.cisa.gov/binding-operational-directive-22-01
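The two worked examples above can be expressed directly in terms of the underlying confusion-matrix counts; a minimal sketch:

```python
def efficiency_coverage(true_pos, false_pos, false_neg):
    """Efficiency (precision) = TP / (TP + FP); coverage (recall) = TP / (TP + FN)."""
    efficiency = true_pos / (true_pos + false_pos)
    coverage = true_pos / (true_pos + false_neg)
    return efficiency, coverage

# First worked example: 100 CVEs remediated, 60 of them exploited.
# (FN does not enter the efficiency calculation, so it is set to 0 here.)
print(efficiency_coverage(60, 40, 0)[0])  # → 0.6

# Second worked example: 100 CVEs exploited, 40 of them remediated.
# (FP does not enter the coverage calculation, so it is set to 0 here.)
print(efficiency_coverage(40, 0, 60)[1])  # → 0.4
```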
[Figure 4: Strategy comparisons holding the level of effort constant. Panels for CVSS v3.x and EPSS v1 show, for each strategy, All CVEs, CVEs Above Threshold, and Exploited.]

[Figure 5: Strategy comparisons holding the coverage constant. Panels for CVSS v3.x and EPSS v1 show, for each strategy, All CVEs, CVEs Above Threshold, and Exploited.]

the KEV list, but would enjoy 13% greater efficiency (60.5% vs 53.2%) and almost three times more coverage (14.9% vs 5.9%). Therefore, based on this simple heuristic (KEV vs Metasploit), the Metasploit strategy outperforms the KEV strategy.

5.5 Advanced remediation strategies

Next we explore the real-world performance of our model, using two separate approaches. We first compare coverage among four remediation strategies while holding the level of effort constant (i.e., the number of vulnerabilities needing to be remediated); we then compare levels of effort while holding coverage constant.

Figure 4 compares the four strategies while maintaining approximately the same level of effort. That is, the blue circle in the middle of each figure – representing the number of vulnerabilities that would need to be remediated – is fixed to the same size for each strategy, at approximately 15%, or about 28,000 vulnerabilities. The CVSS strategy, for example, would remediate vulnerabilities with a base score of 9.1 or greater, and would achieve coverage and efficiency of 33.5% and 6.1%, respectively.

A remediation strategy based on EPSS v2, on the other hand, would remediate vulnerabilities with an EPSS v2 score of 0.037 and greater, yielding 69.9% coverage and 18.5% efficiency. Already, this strategy doubles the coverage and triples the efficiency, relative to the CVSS strategy. Even better results are achieved with a remediation strategy based on EPSS v3, which enjoys 90.4% coverage and 24.1% efficiency.

Figure 5 compares the four strategies while maintaining approximately the same level of coverage. That is, the proportion of the red circle (exploitation activity) covered by the blue circle (the number of vulnerabilities needing to be remediated) is held constant. The baseline for coverage is set by a CVSS strategy of remediating vulnerabilities with a base score of 7 and above (CVEs with a "High" or "Critical" CVSS score). Such a strategy yields a respectable coverage of 82.1%, but at the cost of a higher level of effort, needing to remediate 58.1%, or 110,000, of all published CVEs. Practitioners can achieve a similar level of coverage (82%) using EPSS v3 by prioritizing vulnerabilities scored at 0.088 and above, but with a much lower level of effort, needing to remediate only 7.3%, or just under 14,000, vulnerabilities.

6 DISCUSSION AND FUTURE WORK

Currently, the EPSS model ingests data concerning which vulnerabilities were exploited on which days. However, exploitation has many other characteristics which may be useful to capture and examine. For example, we may be interested in studying the number of exploits per vulnerability (volume), fragmentation of exploitation over time (that is, the pattern of periods of exploitation), or prevalence, which would measure the spread of exploitation, typically by counting the number of devices detecting exploitation. We leave these topics for future work.
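The effort/coverage/efficiency trade-off behind these strategy comparisons reduces to simple set arithmetic; a minimal sketch on a hypothetical ten-CVE universe (the CVE identifiers are made up):

```python
def strategy_stats(prioritized, exploited, total_cves):
    """Effort, coverage, and efficiency of a remediation strategy.

    prioritized: set of CVE ids the strategy flags for remediation
    exploited:   set of CVE ids with observed exploitation activity
    total_cves:  number of published CVEs in the universe
    """
    hits = prioritized & exploited
    effort = len(prioritized) / total_cves     # share of all CVEs remediated
    coverage = len(hits) / len(exploited)      # recall
    efficiency = len(hits) / len(prioritized)  # precision
    return effort, coverage, efficiency

# Hypothetical universe: 10 published CVEs, 4 of them exploited.
exploited = {"CVE-1", "CVE-2", "CVE-3", "CVE-4"}
strategy = {"CVE-1", "CVE-2", "CVE-9"}
print(tuple(round(x, 3) for x in strategy_stats(strategy, exploited, 10)))
# → (0.3, 0.5, 0.667)
```

Holding the level of effort constant (as in Figure 4) amounts to fixing the size of the prioritized set across strategies; holding coverage constant (as in Figure 5) amounts to choosing each strategy's threshold so that coverage matches the baseline.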
6.1 Limitations and adversarial considerations

This research is conducted with a number of limitations. First, insights are limited to data collected from our data partners and the geographic and organizational coverage of their network collection devices. While these data providers collectively manage hundreds of thousands of sensors across the globe, and across organizations of all sizes and industries, they do not observe every attempted exploit, nor will they observe vulnerabilities that were exploited but undetected by the sensor because a signature was not written.

Moreover, the nature of the detection devices generating the events will be biased toward detecting network-based attacks, as opposed to attacks from other attack vectors such as host-based attacks or methods requiring physical proximity.¹¹ Similarly, these detection systems will typically be installed on public-facing perimeter internet devices, and therefore are less suited to detecting computer attacks against internet of things (IoT) devices, automotive networks, ICS, SCADA, operational technology (OT), medical devices, etc.

Given the exploit data from the data partners, we are not able to distinguish between exploit activity generated by researchers or commercial entities, versus actual malicious exploit activity. While it is likely that some proportion of exploitation does originate from non-malicious sources, at this point we have no reliable way of estimating the true proportion. However, based on the collective authors' experience, and discussions with our data providers, we do not believe that this represents a significant percentage of exploitation activity.

While these points may limit the scope of our inferences, to the extent that our data collection is representative of an ecosystem of public-facing, network-based attacks, we believe that many of the insights presented here are generalizable beyond this dataset.

In addition to these limitations, there are other adversarial considerations that fall outside the scope of this paper. For example, one potential concern is the opportunity for adversarial manipulation either of the EPSS model, or using the EPSS scores. For example, it may be possible for malicious actors to poison or otherwise manipulate the input data to the EPSS model (e.g. Github, Twitter). These issues have been studied extensively in the context of machine learning for exploit prediction [32] and other tasks [10, 33], and their potential impact is well understood. Given that we have no evidence of such attacks in practice, and our reliance on data from many distinct sources which would reduce the leverage of adversaries, we leave an in-depth investigation of the matter for future work. Additionally, it is possible that malicious actors may change their strategies based on EPSS scores. For example, if network defenders increasingly adopt EPSS as the primary method for prioritizing vulnerability remediation, thereby deprioritizing vulnerabilities with lower EPSS scores, it is conceivable that attackers begin to strategically incorporate these lower scoring vulnerabilities into their tactics and malware. While possible, we are not aware of any actual or suggestive evidence to this effect.

Finally, while evolving the model from a logistic regression to a more sophisticated machine learning approach greatly improved the performance of EPSS, an important consequence is that the interpretability of variable contributions is more difficult to quantify, as we discuss in the next section.

6.2 Variable importance and contribution

While an XGBoost model is not nearly as intuitive or interpretable as linear regression, we can use SHAP values [23] to reduce the opacity of a trained model by quantifying feature contributions, breaking down the score assigned to a CVE as φ₀ + Σᵢ φᵢ, where φᵢ is the contribution from feature i, and φ₀ is a bias term. We use SHAP values due to their good properties such as local accuracy (attributions sum up to the output of the model), missingness (missing features are given no importance), and consistency (modifying a model so that a feature is given more weight never decreases its attribution).

The contributions from different classes of variables in the kernel density plot are shown in Figure 6. First, note that the figure displays the absolute value of the SHAP values, in order to infer the contribution of the variable away from zero. Second, note the horizontal axis is presented on a log scale to highlight that the majority of features do not contribute much weight to the final output. In addition, the thin line extending out to the right in Figure 6 illustrates how there are instances of features within each class that contribute a significant amount. Finally, note that Figure 6 is sorted in decreasing mean absolute SHAP value for each class of features, highlighting the observation that published exploit code is the strongest contributor to the estimated probability of exploitation activity.

[Figure 6: Density plots of the absolute SHAP values for each family of features. Families shown: Exploit Code, CVE (age+refs), CVSS Vectors, Sites, Scanners; horizontal axis: Shapley Value (log scale, 0 to 5); vertical axis: Density.]

Figure 7 identifies the 30 most significant features with their calculated mean absolute SHAP value. Again, note that higher values indicate a greater influence (either positive or negative) on

¹¹ For example, it is unlikely to find evidence of exploitation for CVE-2022-37418 in our data set, a vulnerability in the remote keyless entry systems on specific makes and models of automobiles.
the final predicted value. Note that Figure 6 shows the mean absolute SHAP value for an entire class of features. So even though Exploit Code as a class of features has a higher mean absolute SHAP value, the largest individual feature is the count of references in the published CVE (which is in the "CVE" class). Note how the most influential feature is the count of the number of references in MITRE's CVE List, followed by "remote attackers," "code execution," and published exploit code in Exploit-DB, respectively.

[Figure 7: Mean absolute SHAP value for individual features. Features shown, in decreasing order: CVE: Count of References; Tag: Remote; Tag: Code Execution; Exploit: Exploit DB; CVE: Age of CVE; Vendor: Microsoft; CVSS: 3.1/AV:N; CVSS: 3.1/PR:N; CVSS: 3.1/A:H; CVSS: 3.1/C:H; Site: ZDI; Exploit: metasploit; NVD: Exploit Ref; NVD: VDB Ref; NVD: US Gov Ref; Tag: SQLi; Scanner: Nuclei; Vendor: Adobe; CVSS: 3.1/UI:N; NVD: Vendor Advisory Ref; Tag: Local; NVD: 3party Advisory Ref; NVD: Patch Ref; CVSS: 3.1/I:H; Tag: XSS; Tag: Denial of Service; Site: KEV; CVSS: 3.1/Scored; Exploit: Github; Tag: Buffer Overflow. Horizontal axis: Mean Absolute Shapley Value (0.0 to 0.4).]

7 LITERATURE REVIEW AND RELATED SCORING SYSTEMS

This research is informed by multiple bodies of literature. First, there are a number of industry efforts that seek to provide some measure of exploitability for individual vulnerabilities, though there is wide variation in their scope and availability. First, the base metric group of CVSS, the leading standard for measuring the severity of a vulnerability, is composed of two parts, measuring impact and exploitability [17]. The score is built on expert judgements, capturing, for example, the observation that a broader ability to exploit a vulnerability (i.e., remotely across the Internet, as opposed to requiring local access to the device), a more complex exploit required, or more user interaction required, all serve to increase the apparent likelihood that a vulnerability could be exploited, all else being equal. CVSS has been repeatedly shown by prior work [2, 3], as well as our own evidence, to be insufficient for capturing all the factors that drive exploitation in the wild. The U.S. National Vulnerability Database (NVD) includes a CVSS base score with nearly all vulnerabilities it has published. Because of the widespread use of CVSS, specifically the base score, as a prioritization strategy, we will compare our performance against CVSS as well as our previous models.

Exploit likelihood is also modeled through various vendor-specific metrics. In 2008, Microsoft introduced the Exploitability Index for vulnerabilities in their products [24]. It provides 4 measures for the likelihood that a vulnerability will be exploited: whether an exploitation has already been detected, and whether exploitation is more or less likely, or unlikely. The metric has been investigated before [15, 30, 40] and was shown to have limited performance at predicting exploitation in the wild [13, 30] or the development of functional exploits [34].

Red Hat provides a 4-level severity rating: low, moderate, important, and critical [28]. In addition to capturing a measure of the impact to a vulnerable system, this index also captures some notion of exploitability. For example, the "low" severity rating represents vulnerabilities that are unlikely to be exploited, whereas the "critical" severity rating reflects vulnerabilities that could be easily exploited by an unauthenticated remote attacker. Like the Exploitability Index, Red Hat's metric is vendor-specific and has limitations reflecting exploitation likelihood [34].

A series of commercial solutions also aim to capture the likelihood of exploits. Tenable, a leading vendor of intrusion detection systems, created the Vulnerability Priority Rating (VPR), which, like CVSS, combines information about both the impact to a vulnerable system and the exploitability (threat) of a vulnerability in order to help network defenders better prioritize remediation efforts [36]. For example, the threat component of VPR "reflects both recent and potential future threat activity" by examining whether exploit code is publicly available, whether there are mentions of active exploitation on social media or in the dark web, etc. Rapid7's Real Risk Score product uses its own collection of data feeds to produce a score between 1 and 1000. This score is a combination of the CVSS base score, "malware exposure, exploit exposure and ease of use, and vulnerability age", and seeks to produce a better measure of both exploitability and "risk" [26]. Recorded Future's Vulnerability Intelligence product integrates multiple data sources, including threat information and localized asset criticality [27]. The predictions, performance evaluations, and implementation details of these solutions are not publicly available.

These industry efforts are either vendor-specific, score only subsets of vulnerabilities, are based on expert opinion and assessments and therefore not entirely data-driven, or are proprietary and not publicly available.

Our work is also related to a growing academic research field of predicting and detecting vulnerability exploitation. A large body of work focuses on predicting the emergence of proof-of-concept or functional exploits [5–7, 9, 14, 29, 34], not necessarily whether these exploits will be used in the wild, as is done with EPSS. Papers predicting exploitation in the wild have used alternative sources of exploitation, most notably data from Symantec's IDS, to build prediction models [4, 11, 16, 19, 32, 35, 37]. Most of these papers build vulnerability feature sets from commonly used data sources such as NVD or OSVDB, although some of them use novel identifiers for exploitation: [32] infers exploitation using Twitter data, [37]
uses patching patterns and blacklist information to predict whether organizations are facing new exploits, while [35] uses natural language processing methods to infer the context of darkweb/deepweb discussions.

Compared to the other scoring systems and research described above, EPSS is a rigorous and ongoing research effort; an international, community-driven effort; designed to predict vulnerability exploitation in the wild; available for all known and published vulnerabilities; updated daily to reflect new vulnerabilities and new exploit-related information; and made available freely to the public.

8 CONCLUSION

In this paper, we presented results from an international, community-driven effort to collect and analyze software vulnerability exploit data, and to build a machine learning model capable of estimating the probability that a vulnerability would be exploited within the 30 days following the prediction. In particular, we described the process of collecting each of the additional variables, and described the approaches used to create the machine learning model based on 6.4 million observed exploit attempts. Through the expanded data sources we achieved an unprecedented 82% improvement in classifier performance over previous iterations of EPSS.

We illustrated the practical use of EPSS by way of comparison with a set of alternative vulnerability remediation strategies. In particular, we showed the sizeable and meaningful improvement in coverage, efficiency, and level of effort (as measured by the number of vulnerabilities that would need to be remediated) from using EPSS v3 over any and all current remediation approaches, including CVSS, CISA's KEV list, and Metasploit.

As the EPSS effort continues to grow, acquire and ingest new data, and improve modeling techniques with each new version, we believe it will continue to improve in performance and provide new and fundamental insights into vulnerability exploitation for many years to come.

9 ACKNOWLEDGEMENTS

We would like to acknowledge the participants of the EPSS Special Interest Group (SIG), as well as the organizations that have contributed to the EPSS data model, including: Fortinet, the Shadowserver Foundation, GreyNoise, AlienVault, Cyentia, and FIRST.

REFERENCES
[1] Luca Allodi and Fabio Massacci. 2012. A Preliminary Analysis of Vulnerability Scores for Attacks in Wild. In CCS BADGERS Workshop. Raleigh, NC.
[2] Luca Allodi and Fabio Massacci. 2012. A preliminary analysis of vulnerability scores for attacks in wild: The EKITS and SYN datasets. In Proceedings of the 2012 ACM Workshop on Building Analysis Datasets and Gathering Experience Returns for Security. 17–24.
[3] Luca Allodi and Fabio Massacci. 2014. Comparing vulnerability severity and exploits using case-control studies. ACM Transactions on Information and System Security (TISSEC) 17, 1 (2014), 1–20.
[4] Mohammed Almukaynizi, Eric Nunes, Krishna Dharaiya, Manoj Senguttuvan, Jana Shakarian, and Paulo Shakarian. 2017. Proactive Identification of Exploits in the Wild Through Vulnerability Mentions Online. In 2017 International Conference on Cyber Conflict (CyCon US). IEEE, 82–88.
[5] Kenneth Alperin, Allan Wollaber, Dennis Ross, Pierre Trepagnier, and Leslie Leonard. 2019. Risk prioritization by leveraging latent vulnerability features in a contested environment. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 49–57.
[6] Navneet Bhatt, Adarsh Anand, and Venkata SS Yadavalli. 2021. Exploitability prediction of software vulnerabilities. Quality and Reliability Engineering International 37, 2 (2021), 648–663.
[7] Mehran Bozorgi, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. 2010. Beyond Heuristics: Learning to Classify Vulnerabilities and Predict Exploits. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 105–114.
[8] Glenn W Brier et al. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1 (1950), 1–3.
[9] Benjamin L Bullough, Anna K Yanchenko, Christopher L Smith, and Joseph R Zipkin. 2017. Predicting Exploitation of Disclosed Software Vulnerabilities Using Open-source Data. In Proceedings of the 3rd ACM on International Workshop on Security and Privacy Analytics. 45–53.
[10] Anirban Chakraborty, Manaar Alam, Vishal Dey, Anupam Chattopadhyay, and Debdeep Mukhopadhyay. 2018. Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069 (2018).
[11] Haipeng Chen, Rui Liu, Noseong Park, and VS Subrahmanian. 2019. Using Twitter to predict when vulnerabilities will be exploited. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3143–3152.
[12] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 785–794.
[13] DarkReading 2008. Black Hat: The Microsoft Exploitability Index: More Vulnerability Madness. DarkReading. https://www.darkreading.com/risk/black-hat-the-microsoft-exploitability-index-more-vulnerability-madness.
[14] Michel Edkrantz and Alan Said. 2015. Predicting Cyber Vulnerability Exploits with Machine Learning. In SCAI. 48–57.
[15] C Eiram. 2013. Exploitability/Priority Index Rating Systems (Approaches, Value, and Limitations).
[16] Yong Fang, Yongcheng Liu, Cheng Huang, and Liang Liu. 2020. FastEmbed: Predicting vulnerability exploitation possibility based on ensemble machine learning algorithm. PLoS ONE 15, 2 (2020), e0228439.
[17] FIRST 2019. A complete guide to the Common Vulnerability Scoring System. https://www.first.org/cvss/v3.0/specification-document.
[18] Leo Grinsztajn, Edouard Oyallon, and Gael Varoquaux. 2022. Why do tree-based models still outperform deep learning on typical tabular data? In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[19] Mohammad Shamsul Hoque, Norziana Jamil, Nowshad Amin, and Kwok-Yan Lam. 2021. An Improved Vulnerability Exploitation Prediction Model with Novel Cost Function and Custom Trained Word Vector Embedding. Sensors 21, 12 (2021), 4220.
[20] Cyentia Institute and Kenna Security. 2022. Prioritization to Prediction, Vol. 8. https://www.kennasecurity.com/resources/prioritization-to-prediction-reports/
[21] Jay Jacobs, Sasha Romanosky, Idris Adjerid, and Wade Baker. 2020. Improving vulnerability remediation through better exploit prediction. Journal of Cybersecurity 6, 1 (2020), tyaa015.
[22] Jay Jacobs, Sasha Romanosky, Benjamin Edwards, Idris Adjerid, and Michael Roytman. 2021. Exploit Prediction Scoring System (EPSS). Digital Threats: Research and Practice 2, 3 (2021), 1–17.
[23] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. 4765–4774.
[24] Microsoft 2020. Microsoft Exploitability Index. Microsoft. https://www.microsoft.com/en-us/msrc/exploitability-index.
[25] Maciej Nowak, Michał Walkowski, and Sławomir Sujecki. 2021. Conversion of CVSS Base Score from 2.0 to 3.1. In 2021 International Conference on Software, Telecommunications and Computer Networks (SoftCOM). IEEE, 1–3.
[26] Rapid7 2023. Prioritize Vulnerabilities Like an Attacker. Rapid7. https://www.rapid7.com/products/insightvm/features/real-risk-prioritization/.
[27] Recorded Future 2023. Prioritize patching based on risk. Recorded Future. https://www.recordedfuture.com/platform/vulnerability-intelligence.
[28] RedHat 2023. Severity ratings. RedHat. https://access.redhat.com/security/updates/classification/.
[29] Alexander Reinthal, Eleftherios Lef Filippakis, and Magnus Almgren. 2018. Data modelling for predicting exploits. In Nordic Conference on Secure IT Systems. Springer, 336–351.
[30] Reuters. [n. d.]. Microsoft correctly predicts reliable exploits just 27% of the time. https://www.reuters.com/article/urnidgns852573c400693880002576630073ead6/microsoft-correctly-predicts-reliable-exploits-just-27-of-the-time-idUS186777206820091104.
[31] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. 2010. Automatic keyword extraction from individual documents. Text Mining: Applications and Theory (2010), 1–20.
[32] Carl Sabottke, Octavian Suciu, and Tudor Dumitraș. 2015. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits. In 24th USENIX Security Symposium (USENIX Security 15). 1041–1056.
[33] Octavian Suciu, Radu Marginean, Yigitcan Kaya, Hal Daume III, and Tudor Dumitras. 2018. When does machine learning FAIL? Generalized transferability for evasion and poisoning attacks. In 27th USENIX Security Symposium (USENIX Security 18). 1299–1316.
[34] Octavian Suciu, Connor Nelson, Zhuoer Lyu, Tiffany Bao, and Tudor Dumitraș. 2022. Expected exploitability: Predicting the development of functional vulnerability exploits. In 31st USENIX Security Symposium (USENIX Security 22). 377–394.
[35] Nazgol Tavabi, Palash Goyal, Mohammed Almukaynizi, Paulo Shakarian, and Kristina Lerman. 2018. DarkEmbed: Exploit prediction with neural language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[36] Tenable 2020. What Is VPR and How Is It Different from CVSS? Tenable. https://www.tenable.com/blog/what-is-vpr-and-how-is-it-different-from-cvss.
[37] Chaowei Xiao, Armin Sarabi, Yang Liu, Bo Li, Mingyan Liu, and Tudor Dumitras. 2018. From patching delays to infection symptoms: Using risk profiles for an early discovery of vulnerabilities exploited in the wild. In 27th USENIX Security Symposium (USENIX Security 18). 903–918.
[38] Li Yang and Abdallah Shami. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 415 (2020), 295–316.
[39] Yiming Yang and Jan O Pedersen. 1997. A comparative study on feature selection in text categorization. In ICML, Vol. 97. Citeseer, 35.
[40] Awad A Younis and Yashwant K Malaiya. 2015. Comparing and evaluating CVSS base metrics and Microsoft rating system. In 2015 IEEE International Conference on Software Quality, Reliability and Security. IEEE, 252–261.