Predicting CVSS Metric Via Description Interpretation
June 8, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3179692
ABSTRACT Cybercrime affects companies worldwide, costing millions of dollars annually. The constant
increase of threats and vulnerabilities raises the need to handle vulnerabilities in a prioritized manner. This
prioritization can be achieved through the Common Vulnerability Scoring System (CVSS), typically used to
assign a score to a vulnerability. However, there is a temporal mismatch between the vulnerability finding
and score assignment, which motivates the development of approaches to aid in this aspect. We explore
the use of Natural Language Processing (NLP) models in CVSS score prediction given vulnerability
descriptions. We start by creating a vulnerability dataset from the National Vulnerability Database (NVD).
Then, we combine text pre-processing and vocabulary addition to improve the model accuracy and interpret
its prediction reasoning by assessing word importance, via Shapley values. Experiments show that the
combination of Lemmatization and 5,000-word addition is optimal for DistilBERT, the best-performing
of the NLP models in our experiments, achieving state-of-the-art results. Furthermore, specific events (such
as an attack on known software) tend to influence model prediction, which may hinder CVSS prediction.
Combining Lemmatization with vocabulary addition mitigates this effect, contributing to increased accuracy.
Finally, binary classes benefit the most from pre-processing techniques, particularly when one class is much
more prominent than the other. Our work shows that DistilBERT is a state-of-the-art model for CVSS
prediction, demonstrating the applicability of deep learning approaches to aid in vulnerability handling. The
code and data are available at https://github.com/Joana-Cabral/CVSS_Prediction.
INDEX TERMS Common vulnerability scoring system, deep learning, interpretability, natural language
processing, security.
(CVE) [6], with a unique identifier, description, and CVSS Base score metrics, the latter specified by the National Vulnerability Database (NVD).

The score metric assignment is performed manually from vulnerability description analysis, for which vendors do not always provide enough detail [7] for experts to accurately create these scores. Furthermore, some CVSS metrics are subjective [8], relying heavily on previous experience in assigning CVSS metrics. The inherent problems of this process are exacerbated by the temporal mismatch between CVSS metric assignment and vulnerability finding: 19 days to populate a vulnerability with the respective CVSS versus six days to find a new one [9]. Therefore, to reduce the time/cost spent while also mitigating the subjective aspect of score assignment, we explore the use of a deep learning approach to predict the CVSS metrics based on the vulnerability description.

We start by obtaining the vulnerability descriptions and respective CVSS metrics using the NVD Application Programming Interface (API). The collected data is processed for the most recent version of CVSS (version 3) and serves as input for the deep learning approach. We select DistilBERT for sequence classification, given that it outperforms other state-of-the-art Natural Language Processing (NLP) models on the created dataset. Since the vulnerability descriptions contain technical expressions and are short, we assess the effect of text pre-processing techniques and vocabulary addition. Our results show that text pre-processing improves the baseline model accuracy, with incremental gains from vocabulary addition.

One drawback of deep learning approaches is that the reasoning behind their outputs is not easily disclosed. To overcome this limitation, we use the Shapley value [10], a game-theoretic approach to explaining machine learning outputs, to perceive the correlation between description words and the predicted CVSS metric. This process allows us to understand the importance of each word towards the CVSS metric prediction and to assess how that importance varies with text pre-processing and vocabulary addition.

The main contributions of our work are summarized as follows:
• We present a vulnerability dataset, derived from NVD data, with vulnerability descriptions and CVSS (version 3) metrics;
• We demonstrate the applicability of deep learning approaches to predict CVSS metrics, in combination with text pre-processing and vocabulary addition, achieving state-of-the-art results;
• We confer interpretability to model predictions by analyzing the importance of description words, via the Shapley value.

The remainder of this paper is organized as follows: Section II summarizes the most relevant CVSS-based works; Section III describes the methodologies used; Section IV describes the vulnerability dataset; and Section V discusses the results obtained. Finally, the main conclusions and future work are presented in Section VI.

II. RELATED WORK

A. CVSS APPLICABILITY
CVSS has been extensively analyzed and applied to multiple domains to prioritize or estimate security risks. Younis and Malaiya [11] compared the CVSS base metrics and the Microsoft rating system, finding that both measures have a very high false-positive rate, with CVSS significantly affected by the software type. Joh [12], analyzing the CVSS base scores for vulnerabilities of currently supported Windows operating systems, concluded that most vulnerabilities are exploited because the affected systems require no authentication, suggesting the addition of an authentication process to every system. CVSS base metrics have been used to assess cybersecurity risks in IT systems [4], using the risk formula and calculating risk probability and impact. The same study reported that identifying security properties in the early stages of development positively impacts the security of the systems. In the same context, Wirtz and Heisel [13] proposed a semi-automatic method to estimate security risks in the early stages of software development, using CVSS formulas to assess the threat severity. Since CVSS has already demonstrated its validity in typical IT systems, it was also adapted to accurately score vulnerabilities in hybrid IT and IoT systems [14], [15]. Following this idea, Mishra and Singh [16] proposed a taxonomy for Cloud-specific vulnerabilities, using the CVSS score to represent the severity of each major Cloud vulnerability. Finally, a guide for applying CVSS to medical devices was also proposed [17], consisting of questions that identify a value for a specific CVSS metric.

B. CVSS AND ARTIFICIAL INTELLIGENCE
The combination of Artificial Intelligence techniques and CVSS scores of individual vulnerabilities has also been reported. Sheehan et al. [18] proposed using Bayesian Networks to identify connected and autonomous vehicle cyber risks, using CVSS scores to predict knowledge gaps or potential new cyber vulnerabilities. Furthermore, Frigault et al. [19] employed Bayesian Networks and Attack Graphs to measure network security, using the CVSS scores as probabilities and considering the metric values of each vulnerability to be independent. However, applying Bayesian Networks to assess CVSS scores has limitations [20], leading to the proposal of an approach that considers the dependency relationships between the CVSS base metrics, combining scores into three aspects: probability, effort, and skill. Allouzi and Khan [21] proposed using the Markov Chain to compute the probability distribution of Internet of Medical Things security threats, using CVSS scores to assign severity to the acknowledged vulnerabilities. A first attempt to predict CVSS final scores was made through the employment of fuzzy systems [22], outperforming Support Vector Machine (SVM) and Random Forest. In this context,
FIGURE 1. Overview of the methodology used to assess DistilBERT performance in vulnerability detection, using CVSS data descriptions and categories. We evaluate the model performance by varying two key aspects: 1) text pre-processing approaches; and 2) vocabulary addition. Furthermore, we evaluate the correlation of tokens and category, via Shapley value, to identify the tokens most influential towards each category prediction.
fuzzy CVSS [23] was used to calculate the final severity score for vulnerabilities, employing fuzzy theory to reduce the error rate. To predict CVSS values for the base metrics, Elbaz et al. [24] propose a linear regression model, using a bag-of-words approach with the removal of irrelevant words.

C. CVSS AND DEEP LEARNING
Deep learning is also known for its effectiveness in solving complex problems, with the drawback of time-costly training. Therefore, to resemble security experts' decision-making [25], the usage of Neural Networks was proposed, automatically providing a vulnerability report through CVSS metrics. Deep reinforcement learning was also used to assess the cyber-physical security of electric power systems [26], adapting CVSS to estimate the complexity of attack paths. As a result, CVSS base metrics have been adopted as the guide for identifying and prioritizing threats among multiple systems. This indicates that correctly and swiftly predicting the metrics for CVSS is a valuable effort.

Sahin and Tosun [27] concluded that Long Short-Term Memory (LSTM) was the most accurate model to predict CVSS final scores, when compared with Convolutional Neural Networks (CNN) and XGBoost. The two previously presented approaches gathered data from the Open Source Vulnerability Database (OSVDB) and NVD, respectively, to train their models. Alternatively, Twitter discussions [28], with NVD as ground truth for CVSS scores, were fed to a Graph Convolutional Network with Attention-based input Embedding to predict the CVE severity scores. However, predicting CVSS final scores does not provide the experts with any insight into the values of the individual CVSS metrics.

D. VULNERABILITY INTERPRETABILITY
The analysis and interpretation of vulnerability descriptions is also reported in the literature. An empirical study based on the NVD vulnerability descriptions [29] concluded that information about the asset, attack, and vulnerability type is relevant to increase vulnerability scoring accuracy. Another work used the Local Interpretable Model-Agnostic Explanations (LIME) framework to explain the vulnerability descriptions [30], providing relevant words for a small number of vulnerabilities.

To the best of our knowledge, the work presented herein is the first to combine Deep Learning and NLP approaches to extract information from vulnerability descriptions and output CVSS metrics, while using interpretability to assess model predictions.

III. METHODOLOGY
The methodology used in our experiments is displayed in Fig. 1. We start by creating a CVSS dataset, using information from the NVD. Then, we vary two major performance-related aspects: 1) text pre-processing; and 2) vocabulary addition. Finally, we evaluate model accuracy and assess token correlation with category prediction, using the Shapley value.

A. MODEL DETAILS AND EVALUATION METRICS
We used the following models in our experiments: BERT [31], DistilBERT [32], RoBERTa [33], ALBERT [34], and DeBERTa [35]. Our reasoning for model choice is linked to the importance of BERT for the NLP area: it is one of the most used models in NLP, in a variety of tasks, with proven quality. We then opted for other variations of BERT to assess which model is better for CVSS metric prediction. Specifically, we chose ALBERT and DistilBERT for having fewer parameters than BERT, and RoBERTa and DeBERTa for having more parameters than BERT. The chosen models belong to the BERT family while having specific characteristics, such as the number of parameters. As such, our work focused on finding the best-performing state-of-the-art NLP models for CVSS metric prediction.
FIGURE 2. Bar plots of the eight CVSS version 3 categories analyzed, linked to vulnerability assessment, for the vulnerability dataset used. Each category displays the associated classes and the respective class priors.
We fine-tune each model following the respective authors' methodology. Regarding the learning rate, RoBERTa was set to 1.5 × 10^-5 and DistilBERT to 5 × 10^-5, while BERT, ALBERT, and DeBERTa were all set to 3 × 10^-5. For the number of training epochs, RoBERTa was trained for 2 epochs, DeBERTa for 10, and BERT, ALBERT, and DistilBERT for 3. Regarding batch size, we used 8 for ALBERT and DistilBERT, and 4 for BERT, RoBERTa, and DeBERTa. Finally, RoBERTa has a weight decay of 0.01, while the remaining models have the default value (0). We use the default losses and architectures of each model, from Hugging Face [36]. To obtain category classification, we use a PyTorch Softmax layer [37] on the model output.

To compare the performance of each model, we use the accuracy, F1 score, and balanced accuracy from the scikit-learn library [38]. To compare our results with the state-of-the-art for CVSS metric inference, we use the accuracy metric.
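To make this setup concrete, the following minimal sketch fine-tunes DistilBERT for one category with the hyperparameters stated above (learning rate 5 × 10^-5, 3 epochs, batch size 8, default weight decay), using the Hugging Face Trainer and scikit-learn metrics. It is an illustration under our stated assumptions, not the exact code released with the paper; the toy dataset, column names, and F1 averaging mode are placeholders.

```python
# Hedged sketch: fine-tuning DistilBERT for one CVSS category (e.g., Attack
# Vector, 4 classes) with the hyperparameters reported in the text.
import numpy as np
from datasets import Dataset
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)

# Toy placeholder data; the real inputs are the NVD descriptions and the
# class of the chosen category (see Section IV).
toy = Dataset.from_dict({
    "text": ["A remote attacker can execute arbitrary code via crafted packets."],
    "label": [0],
}).map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

def compute_metrics(eval_pred):
    # Accuracy, F1, and balanced accuracy, as in Table 2 (the macro
    # averaging mode for F1 is our assumption).
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"acc": accuracy_score(labels, preds),
            "f1": f1_score(labels, preds, average="macro"),
            "bal_acc": balanced_accuracy_score(labels, preds)}

args = TrainingArguments(output_dir="cvss_distilbert",
                         learning_rate=5e-5,             # DistilBERT setting
                         num_train_epochs=3,             # 3 epochs
                         per_device_train_batch_size=8,  # batch size 8
                         weight_decay=0.0)               # default value

trainer = Trainer(model=model, args=args, train_dataset=toy,
                  eval_dataset=toy, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())
```

At inference time, a softmax over the output logits yields the per-class probabilities, matching the PyTorch Softmax layer mentioned above.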
B. TEXT PROCESSING AND VOCABULARY SELECTION
To assess the contribution of each word to the classification of the considered categories (discussed in Section IV), we start by processing the vulnerability descriptions. We use two pre-processing methods, namely Lemmatization and Stemming. Finally, we tokenize the text to input to the model, evaluating its accuracy based on the pre-processing approach. Both text pre-processing approaches use Natural Language Toolkit (NLTK) methods [39], while tokenization is achieved using the Transformers library, from Hugging Face [36]. We choose Lemmatization and Stemming given their wide use as text pre-processing approaches in the NLP area. By using Lemmatization and Stemming, we intend to process the text so as to maintain as much relevant data as possible while ignoring noisy data. This is achieved by collapsing variants of words that share the same "base": the same stem in the case of Stemming, and the same lemma in the case of Lemmatization.

In our experiments, we also evaluate the effect of vocabulary addition. Moreover, we assess this effect in conjunction with the best-performing text pre-processing approach. We evaluate the accuracy of the used model when adding 5,000, 10,000, and 25,000 words to the default vocabulary of the tokenizer. To select the added words, we order them by frequency of appearance in the descriptions, choosing the top n words. To avoid redundancy, we only consider words that appear exclusively in the descriptions and not in the default vocabulary.

Given the existence of software versions and code snippets in some data descriptions, we use regular expressions to filter digits and special characters. This approach reduces the "noise" of vocabulary addition, since this filtered data is not relevant to category classification and could potentially dissipate the importance of relevant added words.
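As an illustration of this pipeline, the sketch below applies NLTK Lemmatization or Stemming, filters digits and special characters with a regular expression, and adds the top-n most frequent unseen words to the tokenizer. The regular expression and helper names are our assumptions, not the paper's exact rules.

```python
# Hedged sketch of the pre-processing and vocabulary-addition steps; the
# regex and helper names are illustrative assumptions.
import re
from collections import Counter

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer, stemmer = WordNetLemmatizer(), PorterStemmer()

def preprocess(description: str, mode: str = "lemma") -> str:
    # Collapse word variants to a shared "base": lemma or stem.
    words = nltk.word_tokenize(description.lower())
    if mode == "lemma":
        words = [lemmatizer.lemmatize(w) for w in words]
    elif mode == "stem":
        words = [stemmer.stem(w) for w in words]
    return " ".join(words)

def top_new_words(descriptions, tokenizer, n=5000):
    # Drop digits/special characters (software versions, code snippets),
    # count word frequency, and keep the n most frequent words that are
    # absent from the tokenizer's default vocabulary.
    counts = Counter()
    for d in descriptions:
        counts.update(re.findall(r"[a-z][a-z-]+", d.lower()))
    known = tokenizer.get_vocab()
    return [w for w, _ in counts.most_common() if w not in known][:n]

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)

descriptions = ["A man-in-the-middle attack against the TLS handshake ..."]
tokenizer.add_tokens(top_new_words(descriptions, tokenizer, n=5000))
model.resize_token_embeddings(len(tokenizer))  # room for the added words
```

Resizing the embedding matrix is required after adding tokens, so that each new word receives its own trainable embedding instead of being split into sub-word pieces.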
C. SHAPLEY VALUE
Deep learning models have shown high performance in multiple tasks while providing little to no explanation of the reasoning behind their predictions. To tackle this issue, we use the Shapley value, an interpretability technique that allows us to interpret the reasoning of the model when providing predictions. The Shapley value, coined by Shapley in 1953 [40], is a cooperative game theory-based method used for assigning payouts to players depending on their contribution towards the total payout. In the machine-learning context, the Shapley value is used to evaluate how each feature (player) of a given instance contributed (assigned payout) towards the model prediction of the instance (total payout).

The use of the Shapley value in our experiments is linked to our interest in analyzing how each word contributed to category classification. For categories with n classes, where n is higher than 2, we perform n Shapley value analyses, each considering one class versus the remaining classes of the category. The considered class is given the value 1, with the remaining classes receiving the value 0. If a word contributes positively, it influences the considered class, and the higher the absolute Shapley value, the higher the feature influence. We use the SHapley Additive exPlanations (SHAP) framework [41] and the Explainer model, from the publicly available implementation in [42].
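The sketch below illustrates this one-versus-rest analysis for a single class, wrapping the fine-tuned model in a prediction function that returns the probability of the considered class. The wrapper and its names are ours, not the released code; `model` and `tokenizer` are assumed to be the fine-tuned DistilBERT and its tokenizer from the sketches above.

```python
# Hedged sketch of a one-vs-rest Shapley analysis for one class of a
# category; `model` and `tokenizer` come from the previous sketches.
import shap
import torch

def class_probability(texts, target_class=0):
    # Probability of the considered class (value 1) versus the remaining
    # classes of the category (jointly the value 0).
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    return probs[:, target_class].numpy()

# Passing the tokenizer as the masker makes SHAP perturb the text at the
# token level, yielding one Shapley value per token.
explainer = shap.Explainer(class_probability, tokenizer)
explanation = explainer(
    ["A remote attacker can execute arbitrary code via crafted packets."])

# Positive values push the prediction towards the considered class;
# negative values push towards the remaining classes. Larger absolute
# values indicate more influential tokens.
print(explanation.values[0], explanation.data[0])
```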
TABLE 1. Percentage of class prior of the eight vulnerability-related categories for the full dataset, the train set, and the test set. The class prior displays the likelihood of an outcome in the dataset and each subset.

IV. VULNERABILITY DATASET
The vulnerability dataset is based on NVD information, a United States government repository of standards-based vulnerability management data. We obtain the information through their API, starting from index 0 up to 152,000, representing data collected until April 2021. Finally, we process the collected data to retrieve the vulnerability descriptions and the classes for each of the eight categories analyzed: Attack Vector, Attack Complexity, Privileges Required, User Interaction, Scope, Confidentiality, Integrity, and Availability. Based on the CVSS documentation, these categories are grouped into Exploitability metrics (Attack Vector, Attack Complexity, Privileges Required, and User Interaction), Scope, and Impact metrics (Confidentiality, Integrity, and Availability). Tables and figures throughout this paper consider this grouping. A visual representation of the class proportions of our dataset, for each category, is displayed in Fig. 2.

Though the collected data corresponds to 152,000 vulnerability descriptions and categories, we only consider descriptions related to version 3 of CVSS in this work. For this reason, the total number of instances in our dataset is 79,810. We divide them into train and test sets, composed of 63,848 and 15,962 instances, respectively, corresponding to a 0.2 test ratio. The average description length is 43.85 and 44.55 words for the train and test splits, respectively. Each set follows a similar proportion of classes, exhibited in Fig. 2, whose analytical values are shown in Table 1. The dataset is publicly available for repeatability purposes, and it serves as a basis for other models to evaluate their performance and compare with the proposed methodology.
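A minimal sketch of this collection step is shown below, assuming the NVD REST API and JSON schema (version 1.0) that were current when the data was gathered; the paging size and field handling are illustrative, not the paper's exact script.

```python
# Hedged sketch of the NVD collection step, assuming the version 1.0 JSON
# schema; paging values and error handling are illustrative.
import requests

URL = "https://services.nvd.nist.gov/rest/json/cves/1.0"
METRICS = ["attackVector", "attackComplexity", "privilegesRequired",
           "userInteraction", "scope", "confidentialityImpact",
           "integrityImpact", "availabilityImpact"]

def fetch_page(start_index, page_size=2000):
    resp = requests.get(URL, params={"startIndex": start_index,
                                     "resultsPerPage": page_size})
    resp.raise_for_status()
    rows = []
    for item in resp.json()["result"]["CVE_Items"]:
        impact = item.get("impact", {})
        if "baseMetricV3" not in impact:  # keep CVSS version 3 entries only
            continue
        cvss = impact["baseMetricV3"]["cvssV3"]
        rows.append({
            "id": item["cve"]["CVE_data_meta"]["ID"],
            "description":
                item["cve"]["description"]["description_data"][0]["value"],
            **{m: cvss[m] for m in METRICS},
        })
    return rows

dataset = []
for start in range(0, 152000, 2000):  # indexes 0 to 152,000, as stated
    dataset.extend(fetch_page(start))
```

Filtering out entries without the baseMetricV3 block is what reduces the 152,000 collected records to the 79,810 CVSS version 3 instances used in this work.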
V. EXPERIMENTS

A. MODEL COMPARISON
We start by comparing the performance of five different NLP methods on the proposed dataset. The accuracy, F1 score, and balanced accuracy for each of the eight categories are presented in Table 2. The results suggest that DistilBERT is the outperforming model for all the categories, in all the considered metrics. The method with the worst performance is ALBERT, which has the smallest number of parameters (11M), while DeBERTa, BERT, and RoBERTa, with over 100M parameters, also perform worse than DistilBERT (65M). Since we intend to assess class inference given a vulnerability description, the number of parameters may be linked to the performance variance: too few parameters (ALBERT) are insufficient for the model to learn, while too many lead to poorer fine-tuning. The similarity of various accuracy values between BERT, ALBERT, and DeBERTa, for different categories, can be explained by dataset imbalance. In these cases, the values displayed represent a scenario where the models opted to achieve higher accuracy by outputting the same value for every instance. Thus, in cases of dataset imbalance, the use of accuracy can be deceptive, justifying the use of other metrics such as balanced accuracy.

In this experiment, we use the default pretraining weights (provided by Hugging Face [36]) and training parameters of every model. The models used are typically applied/evaluated in tasks where the association of two sentences is analyzed (e.g., GLUE [43]) or the aim is finding answers in a text, given a question (e.g., SQuAD [44]). These types of tasks differ from predicting a category given a vulnerability description (the aim of this work), which may justify the underperformance of state-of-the-art methods in our experiments. Based on the obtained results, we selected DistilBERT for continuing the experiments involving the usage of Deep Learning.

B. TEXT PRE-PROCESSING
We assess the performance of DistilBERT, for all eight considered categories, regarding different text pre-processing approaches. We present our results, using balanced accuracy, in Table 3, with Baseline referring to the condition where no pre-processing approach is used.
TABLE 2. Model accuracy (Acc), F1 score, and Balanced Accuracy (BA) for each of the eight categories analyzed. The outperforming model for each metric
and category is shown in bold.
TABLE 3. Category balanced accuracy of DistilBERT for the baseline condition (Tokenization) and using the text pre-processing approaches (Lemmatization and Stemming). The outperforming approach for each category is shown in bold.

When comparing category-related performance variance, we observe that all categories benefit from pre-processing. Regarding processing-related performance variance, Lemmatization promotes better results than Stemming for all categories. Stemming truncates words by chopping off letters until the stem is reached, a cruder approach than Lemmatization, which justifies its underperformance. Given the superiority displayed by Lemmatization over Stemming, Lemmatization is the pre-processing approach chosen for the remaining experiments.

C. VOCABULARY ADDITION
We also evaluate the effect of vocabulary addition on prediction accuracy. Furthermore, we compare vocabulary addition alone with its combination with a pre-processing approach. We display our results in Table 4.

Relative to the baseline, most variations of vocabulary addition translate into a performance increase, for all categories. Regarding the vocabulary variations, 5,000-word addition was the condition with the best results overall. This suggests that adding more words is beneficial to model accuracy improvement. However, further vocabulary addition (10,000 and 25,000 words) does not promote an incremental performance increase. Given that vocabulary addition is linked to word frequency in the descriptions, adding more words may disperse the model attention towards less relevant words, hindering its performance. This aspect is most noticeable when 25,000-word addition performs worse than the baseline (e.g., Attack Vector, Attack Complexity). For all categories, 25,000-word addition does not generally translate into a performance improvement relative to 5,000-word addition, suggesting the existence of word importance redundancy with this approach.

Regarding the combination of vocabulary addition with Lemmatization, we observe that this approach generally improves the balanced accuracy, relative to vocabulary addition alone, for most vocabulary variations. This suggests that word importance may vary with the processing approach, which corroborates the importance of text pre-processing, even in the context of vocabulary addition. The results suggest that 5,000-word addition with Lemmatization is the best approach for overall category prediction, exhibiting the importance of text processing and pertinent word addition in description-based classification.

D. STATE-OF-THE-ART COMPARISON
We compare DistilBERT, and its combination with pre-processing and vocabulary addition, with the state-of-the-art. To the best of our knowledge, only Elbaz et al. [24] evaluate class prediction accuracy on version 3 of CVSS. To compare our results with theirs, we also display the accuracy of the Baseline and of 5,000-word addition with Lemmatization, whose balanced accuracy is presented in Table 4. Since the authors presented their results in a bar plot, without displaying the analytical values, we register the rounded values observed in said plot. We display the state-of-the-art comparison in Table 5.

Elbaz et al. use a bag-of-words approach, with the removal of irrelevant words, as input to a regression model. Using DistilBERT, a deep learning approach, in conjunction with text pre-processing and vocabulary addition, we obtain substantial accuracy improvements in the majority of categories. The categories where Elbaz's approach was closest to ours were Attack Complexity, User Interaction, and Scope, which could be linked to these categories being two-classed. In these cases, the regression model used by Elbaz et al. can compete with deep learning approaches. However, for the remaining categories, with over two classes, the performance disparity is substantially larger, with up to a 28% accuracy increase. Furthermore, using the text pre-processing approach and adding vocabulary promotes an accuracy increase for DistilBERT, further enhancing its performance. The results suggest that DistilBERT is a state-of-the-art approach for vulnerability category prediction, particularly for multi-class categories.
TABLE 4. Category balanced accuracy of DistilBERT for the baseline condition (Tokenization) and with different vocabulary additions, assessing the effect of Lemmatization. Base, in each vocabulary column, refers to vocabulary addition with Tokenization, without text pre-processing. The expression w/ Lemm refers to the combination of Lemmatization with vocabulary addition. The outperforming approach for each category is shown in bold.
FIGURE 4. Boxplots summarizing the effect of applying text pre-processing techniques (Lemmatization) and vocabulary addition (5,000-word addition) for (a) binary-class problems (Attack Complexity, User Interaction, and Scope) and (b) multi-class problems.
reported out-of-bound reading/writing errors, while man-in-the-middle is a type of network attack. The importance of these tokens is linked to specific network-related events (attacks, errors), which was also observed in the baseline. Protocols also increases in importance towards Network classification, which could be linked to its association with the added vocabulary. This shows that vocabulary addition shifts the focus of token importance heavily towards specific events for Network classification. Network-adjacent (added by vocabulary addition) also gains importance in classifying other classes, given its relevance in dissociating Network from Adjacent. Complementing vocabulary addition with Lemmatization (5k Vocabulary & Lemmatization) diminishes the importance of tokens closely linked to Network (positive Shapley value), resurfacing the tendency observed with Lemmatization alone. The reduced importance of specific network-related events also greatly decreased the token importance associated with them (protocols). Furthermore, the influence of the added vocabulary was enhanced in logon (closely related to classes other than Network) and network-adjacent, while keeping high importance for the tokens associated with the definition of other classes (physical and local). This result suggests that Lemmatization is necessary to obtain more coherent/explainable token importance, which ultimately translates into better model performance (as shown in Table 4).

The second considered scenario relates token importance when considering binary (Attack Complexity, User Interaction, and Scope) and multi-class categories, for the same processing approaches of the first scenario. For all categories, the highest-proportion class per category was associated with the value 1, with the remaining classes being associated with 0. Fig. 4 displays the boxplots for the two cases considered, showing the data distribution (ignoring wildcard cases).

The analysis of the binary boxplots indicates that using Lemmatization and vocabulary addition promotes a decrease in token importance variance both towards (positive Shapley value) and against (negative Shapley value) the highest class. However, combining vocabulary addition with Lemmatization increases token importance variance, particularly for negative values. This translates into an increased importance of tokens for categorizing the least represented class. If almost all descriptions relate to a specific class, it may be more beneficial/discriminative to focus on tokens linked to the underrepresented class, which is the approach of the model in this case.

Analyzing the multi-class boxplots shows that the variance of negative Shapley values remains nearly constant throughout the various text pre-processing methods. In comparison to the binary classes, the negative Shapley value refers to various classes and not simply one, which justifies the (low) variance observed for these cases. Relative to positive Shapley values, using vocabulary addition and its combination with Lemmatization tends to reduce the variance of token importance, achieving a variance similar to that of negative Shapley value tokens. In multi-class prediction, even when one class is more prevalent than others, the existence of tokens closely linked to specific categories is not as likely as in binary-class prediction. For this reason, reducing the overall importance of tokens tied to specific classes translates into better results.

VI. CONCLUSION
The increasing number of threats and vulnerabilities in IT systems surpasses the capability of professionals to handle them, potentially leading to losses for companies. This raises the need to prioritize vulnerabilities, typically achieved through CVSS metrics, via manual vulnerability description analysis. In this paper, we present a vulnerability dataset, from NVD data, and analyze the applicability of deep learning approaches, namely NLP methods, to aid in CVSS metric prediction via description interpretation.
In our experiments, we also assess the importance of text processing and vocabulary addition in metric prediction while interpreting it via the Shapley value. Our results show that DistilBERT is a state-of-the-art model for CVSS metric prediction, with increased performance when combined with Lemmatization (text pre-processing) and 5,000-word addition. Furthermore, this combination mitigates the effect of specific events in category prediction and leads to weighted word importance, particularly for binary categories, contributing to increased model accuracy. The presented dataset and model experiments serve as a comparable basis for future works in CVSS metric prediction, applicable to vulnerability handling/prioritization, which leads to increased usefulness and accuracy of the metric, benefiting system security and operational effectiveness.

REFERENCES
[1] P. Boden. (2016). The Emerging Era of Cyber Defense and Cybercrime. Accessed: Jul. 29, 2021. [Online]. Available: https://www.microsoft.com/security/blog/2016/01/27/the-emerging-era-of-cyber-defense-and-cybercrime/
[2] S. Morgan. (2020). Cybercrime to Cost the World $10.5 Trillion Annually by 2025. Accessed: Jul. 29, 2021. [Online]. Available: https://cybersecurityventures.com/hackerpocalypse-cybercrime-report-2016/
[3] VulDB. (2021). Vulnerability Database. Accessed: Jul. 28, 2021. [Online]. Available: https://vuldb.com/?
[4] M. U. Aksu, M. H. Dilek, E. I. Tatli, K. Bicakci, H. I. Dirik, M. U. Demirezen, and T. Aykir, "A quantitative CVSS-based cyber security risk assessment methodology for IT systems," in Proc. Int. Carnahan Conf. Secur. Technol. (ICCST), Oct. 2017, pp. 1–8.
[5] P. Mell, K. Scarfone, and S. Romanosky, "Common vulnerability scoring system," IEEE Secur. Privacy, vol. 4, no. 6, pp. 85–89, Nov./Dec. 2006.
[6] D. E. Mann and S. M. Christey, "Towards a common enumeration of vulnerabilities," in Proc. 2nd Workshop Res. Secur. Vulnerability Databases, West Lafayette, IN, USA: Purdue Univ., 1999, pp. 1–13.
[7] National Institute of Standards and Technology. (2021). NVD—Vulnerability Metrics. Accessed: Jul. 28, 2021. [Online]. Available: https://nvd.nist.gov/vuln-metrics/cvss
[8] P. Johnson, R. Lagerstrom, M. Ekstedt, and U. Franke, "Can the common vulnerability scoring system be trusted? A Bayesian analysis," IEEE Trans. Dependable Secure Comput., vol. 15, no. 6, pp. 1002–1015, Dec. 2018.
[9] A. Feutrill, D. Ranathunga, Y. Yarom, and M. Roughan, "The effect of common vulnerability scoring system metrics on vulnerability exploit delay," in Proc. 6th Int. Symp. Comput. Netw. (CANDAR), Nov. 2018, pp. 1–10.
[10] L. S. Shapley, A Value for N-Person Games. Princeton, NJ, USA: Princeton Univ. Press, 2016, ch. 17.
[11] A. A. Younis and Y. K. Malaiya, "Comparing and evaluating CVSS base metrics and Microsoft rating system," in Proc. IEEE Int. Conf. Softw. Qual., Rel. Secur., Aug. 2015, pp. 252–261.
[12] H. Joh, "Software risk assessment for Windows operating systems with respect to CVSS," Eur. J. Eng. Technol. Res., vol. 4, no. 11, pp. 41–45, Nov. 2019.
[13] R. Wirtz and M. Heisel, "CVSS-based estimation and prioritization for security risks," in Proc. 14th Int. Conf. Eval. Novel Approaches Softw. Eng., 2019, pp. 297–306.
[14] A. Ur-Rehman, I. Gondal, J. Kamruzzaman, and A. Jolfaei, "Vulnerability modelling for hybrid IT systems," in Proc. IEEE Int. Conf. Ind. Technol. (ICIT), Feb. 2019, pp. 1186–1191.
[15] A. Ur-Rehman, I. Gondal, J. Kamruzzaman, and A. Jolfaei, "Vulnerability modelling for hybrid industrial control system networks," J. Grid Comput., vol. 18, no. 4, pp. 863–878, Dec. 2020.
[16] N. Mishra and R. Singh, "Taxonomy & analysis of cloud computing vulnerabilities through attack vector, CVSS and complexity parameter," in Proc. Int. Conf. Issues Challenges Intell. Comput. Techn. (ICICT), vol. 1, Sep. 2019, pp. 1–8.
[17] M. P. Chase and S. M. C. Coley, "Rubric for applying CVSS to medical devices," MITRE Corp., McLean, VA, USA, Tech. Rep. HHSM-500-2012-00008I, Oct. 2020.
[18] B. Sheehan, F. Murphy, M. Mullins, and C. Ryan, "Connected and autonomous vehicles: A cyber-risk classification framework," Transp. Res. A, Policy Pract., vol. 124, pp. 523–536, Jun. 2019.
[19] M. Frigault, L. Wang, S. Jajodia, and A. Singhal, "Measuring the overall network security by combining CVSS scores based on attack graphs and Bayesian networks," in Network Security Metrics. Cham, Switzerland: Springer, 2017, pp. 1–23.
[20] P. Cheng, L. Wang, S. Jajodia, and A. Singhal, "Refining CVSS-based network security metrics by examining the base scores," in Network Security Metrics. Cham, Switzerland: Springer, 2017, pp. 25–52.
[21] M. A. Allouzi and J. I. Khan, "Identifying and modeling security threats for IoMT edge network using Markov chain and common vulnerability scoring system (CVSS)," 2021, arXiv:2104.11580.
[22] A. Khazaei, M. Ghasemzadeh, and V. Derhami, "An automatic method for CVSS score prediction using vulnerabilities description," J. Intell. Fuzzy Syst., vol. 30, no. 1, pp. 89–96, Aug. 2015.
[23] K. Gencer and F. Başçiftçi, "The fuzzy common vulnerability scoring system (F-CVSS) based on a least squares approach with fuzzy logistic regression," Egyptian Informat. J., vol. 22, no. 2, pp. 145–153, Jul. 2021.
[24] C. Elbaz, L. Rilling, and C. Morin, "Fighting N-day vulnerabilities with automated CVSS vector prediction at disclosure," in Proc. 15th Int. Conf. Availability, Rel. Secur., Aug. 2020, pp. 1–10.
[25] A. Beck and S. Rass, "Using neural networks to aid CVSS risk aggregation—An empirically validated approach," J. Innov. Digit. Ecosyst., vol. 3, no. 2, pp. 148–154, 2016.
[26] X. Liu, J. Ospina, and C. Konstantinou, "Deep reinforcement learning for cybersecurity assessment of wind integrated power systems," IEEE Access, vol. 8, pp. 208378–208394, 2020.
[27] S. E. Sahin and A. Tosun, "A conceptual replication on predicting the severity of software vulnerabilities," in Proc. Eval. Assessment Softw. Eng., Apr. 2019, pp. 244–250.
[28] H. Chen, J. Liu, R. Liu, N. Park, and V. S. Subrahmanian, "VASE: A Twitter-based vulnerability analysis and score engine," in Proc. IEEE Int. Conf. Data Mining (ICDM), Nov. 2019, pp. 976–981.
[29] L. Allodi, S. Banescu, H. Femmer, and K. Beckers, "Identifying relevant information cues for vulnerability assessment using CVSS," in Proc. 8th ACM Conf. Data Appl. Secur. Privacy, Mar. 2018, pp. 119–126.
[30] K. B. Alperin, A. B. Wollaber, and S. R. Gomez, "Improving interpretability for cyber vulnerability assessment using focus and context visualizations," in Proc. IEEE Symp. Vis. Cyber Secur. (VizSec), Oct. 2020, pp. 30–39.
[31] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol. (NAACL-HLT), Minneapolis, MN, USA, vol. 1, Jun. 2019, pp. 4171–4186.
[32] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter," 2019, arXiv:1910.01108.
[33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692.
[34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, "ALBERT: A lite BERT for self-supervised learning of language representations," in Proc. 8th Int. Conf. Learn. Represent. (ICLR), Addis Ababa, Ethiopia, Apr. 2020.
[35] P. He, X. Liu, J. Gao, and W. Chen, "DeBERTa: Decoding-enhanced BERT with disentangled attention," in Proc. 9th Int. Conf. Learn. Represent. (ICLR), Vienna, Austria, May 2021.
[36] T. Wolf, L. Debut, V. Sanh, and J. Chaumond, "Transformers: State-of-the-art natural language processing," in Proc. Conf. Empirical Methods Natural Lang. Process., Syst. Demonstrations, Oct. 2020, pp. 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
[37] A. Paszke, S. Gross, F. Massa, and A. Lerer, "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, 2019, pp. 8026–8037.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., vol. 12, no. 10, pp. 2825–2830, 2012.
[39] S. Bird, E. Klein, and E. Loper, Natural Language Processing With Python: Analyzing Text With the Natural Language Toolkit. Sebastopol, CA, USA: O'Reilly Media, 2009.
[40] L. S. Shapley, "A value for n-person games," Contrib. Theory Games, vol. 2, no. 28, pp. 307–317, 1953.
[41] S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 4765–4774.
[42] S. Lundberg. (2021). SHAP (SHapley Additive exPlanations). Accessed: Jul. 2, 2021. [Online]. Available: https://github.com/slundberg/shap
[43] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, "GLUE: A multi-task benchmark and analysis platform for natural language understanding," in Proc. EMNLP Workshop BlackboxNLP: Analyzing Interpreting Neural Netw. NLP, 2018, pp. 353–355.
[44] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2016, pp. 2383–2392.

HUGO PROENÇA (Senior Member, IEEE) received the B.Sc., M.Sc., and Ph.D. degrees in 2001, 2004, and 2007, respectively. He is currently an Associate Professor with the Department of Computer Science, University of Beira Interior, and has been researching mainly about biometrics and visual surveillance. He is a member of the Editorial Boards of Image and Vision Computing, IEEE ACCESS, and the International Journal of Biometrics. He served as a Guest Editor for special issues of the Pattern Recognition Letters, Image and Vision Computing, and Signal, Image and Video Processing journals. He was the Coordinating Editor of the IEEE Biometrics Council Newsletter and the Area Editor (Ocular Biometrics) of the IEEE BIOMETRICS COMPENDIUM journal.