
Received May 13, 2022, accepted May 27, 2022, date of publication June 2, 2022, date of current version June 8, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3179692

Predicting CVSS Metric via Description Interpretation
JOANA CABRAL COSTA, TIAGO ROXO, (Member, IEEE), JOÃO B. F. SEQUEIROS,
HUGO PROENÇA, (Senior Member, IEEE), AND
PEDRO R. M. INÁCIO, (Senior Member, IEEE)
Department of Computer Science, Instituto de Telecomunicações, University of Beira Interior, 6201-001 Covilhã, Portugal
Corresponding author: Joana Cabral Costa ([email protected])
This work was supported in part by the Fundação para a Ciência e a Tecnologia (FCT)/Programa Operacional Temático Competitividade e
Internacionalização (COMPETE)/Fundo Europeu de Desenvolvimento Regional (FEDER) under the scope of Project SECURIoTESIGN
under Grant POCI-01-0145-FEDER-030657; in part by the Portuguese FCT/Ministério da Ciência, Tecnologia e Ensino Superior
(MCTES) through National Funds and, when applicable, co-funded by EU funds under Project UIDB/50008/2020; in part by the FCT
Doctoral Grant SFRH/BD/133838/2017, Grant 2020.09847.BD, and Grant 2021.04905.BD; in part by the C4—Competence Center in
Cloud Computing co-financed by the European Regional Development Fund (ERDF) through the Programa Operacional Regional do
Centro (Centro 2020), in the scope of the Sistema de Apoio à Investigação Científica e Tecnológica, Programas Integrados de Investigação
Científica e Desenvolvimento Tecnológico (IC&DT) under Project CENTRO-01-0145-FEDER-000019.

ABSTRACT Cybercrime affects companies worldwide, costing millions of dollars annually. The constant
increase of threats and vulnerabilities raises the need to handle vulnerabilities in a prioritized manner. This
prioritization can be achieved through the Common Vulnerability Scoring System (CVSS), typically used to
assign a score to a vulnerability. However, there is a temporal mismatch between the vulnerability finding
and score assignment, which motivates the development of approaches to aid in this aspect. We explore
the use of Natural Language Processing (NLP) models in CVSS score prediction given vulnerability
descriptions. We start by creating a vulnerability dataset from the National Vulnerability Database (NVD).
Then, we combine text pre-processing and vocabulary addition to improve the model accuracy and interpret
its prediction reasoning by assessing word importance, via Shapley values. Experiments show that the
combination of Lemmatization and 5,000-word addition is optimal for DistilBERT, the best-performing
of the NLP models in our experiments, achieving state-of-the-art results. Furthermore, specific events (such
as an attack on a known software) tend to influence model prediction, which may hinder CVSS prediction.
Combining Lemmatization with vocabulary addition mitigates this effect, contributing to increased accuracy.
Finally, binary classes benefit the most from pre-processing techniques, particularly when one class is much
more prominent than the other. Our work demonstrates that DistilBERT is a state-of-the-art model for CVSS
prediction, demonstrating the applicability of deep learning approaches to aid in vulnerability handling. The
code and data are available at https://github.com/Joana-Cabral/CVSS_Prediction.

INDEX TERMS Common vulnerability scoring system, deep learning, interpretability, natural language
processing, security.

The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.

I. INTRODUCTION
Cyber threats force companies to increase their investments in security, which resulted in a $170 billion security-related market in 2015 [1]. These threats impact 556 million people annually, costing $3 trillion worldwide, with an expected increase to $10.5 trillion by 2025 [2]. Additionally, there was an increase of vulnerability entries in VulDB, with 61 new daily entries in 2021, relative to the 41 reported in 2016 [3]. This tendency provides a clear picture of the increased risk of threats and cybercrime, raising concern among Information Technology (IT) administrators, who often lack the resources to handle all incoming threats [4]. Given this context, there is an inherent need to define which vulnerabilities should be tackled first.

To aid in the prioritization of vulnerability handling, experts typically use the Common Vulnerability Scoring System (CVSS) [5], a de facto standard, to accurately assign a score to a vulnerability. New vulnerability entries are enumerated via Common Vulnerabilities and Exposures

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 10, 2022 59125
J. C. Costa et al.: Predicting CVSS Metric via Description Interpretation

(CVE) [6], with a unique identifier, description, and CVSS Base score metrics, the latter specified by the National Vulnerability Database (NVD).

The score metric assignment is performed manually from vulnerability description analysis, for which vendors do not always provide enough detail [7] for experts to accurately create these scores. Furthermore, some CVSS metrics are subjective [8], relying heavily on previous experience in assigning CVSS metrics. The inherent problems of this process are exacerbated by the temporal mismatch between CVSS metric assignment and vulnerability finding: 19 days to populate a vulnerability with the respective CVSS and six days to find a new one [9]. Therefore, to reduce the time/cost spent while also mitigating the subjective aspect of score assignment, we explore the use of a deep learning approach to predict the CVSS metrics based on the vulnerability description.

We start by obtaining the vulnerability descriptions and respective CVSS metrics using the NVD Application Programming Interface (API). The collected data is processed for the most recent version of CVSS (version 3) and serves as input for the deep learning approach. We select DistilBERT for sequence classification given that it outperforms, on the created dataset, other state-of-the-art Natural Language Processing (NLP) models. Since the vulnerability descriptions contain technical expressions and are short, we assess the effect of text pre-processing techniques and vocabulary addition. Our results show that text pre-processing improves the baseline model accuracy, with further gains from vocabulary addition.

One drawback of using a deep learning approach is that the reasoning behind its outputs is not easily disclosed. To overcome this limitation, we use the Shapley value [10], a game-theoretic approach to explain machine learning outputs, to perceive the correlation between description words and the predicted CVSS metric. This process allows us to understand the importance of each word towards the CVSS metric prediction, assessing how this importance varies with text pre-processing and vocabulary addition.

The main contributions of our work are summarized as follows:
• We present a vulnerability dataset, derived from NVD data, with vulnerability descriptions and CVSS (version 3) metrics;
• We demonstrate the applicability of deep learning approaches to predict CVSS metrics, in combination with text pre-processing and vocabulary addition, achieving state-of-the-art results;
• We confer interpretability to model prediction by analyzing the importance of description words, via Shapley value.

The remainder of this paper is organized as follows: Section II summarizes the most relevant CVSS-based works; Section III describes the methodologies used; Section IV describes the vulnerability dataset; and Section V discusses the results obtained. Finally, the main conclusions and future work are presented in Section VI.

II. RELATED WORK
A. CVSS APPLICABILITY
CVSS has been extensively analyzed and applied to multiple domains to prioritize or estimate security risks. Younis and Malaiya [11] compared the CVSS base metrics and the Microsoft rating system, declaring that both measures have a very high false-positive rate, with CVSS significantly affected by the software type. Joh [12] analyzed the CVSS base scores for vulnerabilities of currently supported Windows operating systems and concluded that most vulnerabilities are compromised in systems that require no authentication, suggesting the addition of an authentication process in every system. CVSS base metrics have been used to assess cybersecurity risks in IT systems [4], using the risk formula and calculating risk probability and impact. The same study reported that identifying security properties in the early stages of development positively impacts the security of the systems. In the same context, Wirtz and Heisel [13] proposed a semi-automatic method to estimate security risks in the early stages of software development, using CVSS formulas to assess the threat severity. Since CVSS has already demonstrated its validity in typical IT systems, it was also adapted to accurately calculate vulnerabilities regarding hybrid IT and IoT systems [14], [15]. Following this idea, Mishra and Singh [16] proposed a taxonomy for Cloud-specific vulnerabilities, using the CVSS score to represent the severity of each major Cloud vulnerability. Finally, a guide for applying CVSS to medical devices was also proposed [17], consisting of questions that identify a value for a specific CVSS metric.

B. CVSS AND ARTIFICIAL INTELLIGENCE
The combination of Artificial Intelligence techniques and CVSS scores of individual vulnerabilities has also been reported. Sheehan et al. [18] proposed using Bayesian Networks to identify connected and autonomous vehicle cyber risks, using CVSS scores to predict knowledge gaps or potential new cyber vulnerabilities. Furthermore, Frigault et al. [19] employed Bayesian Networks and Attack Graphs to measure network security, using the CVSS scores as probabilities and considering metric values of each vulnerability to be independent. However, applying Bayesian Networks to assess CVSS scores has limitations [20], leading to the proposal of an approach that considers the dependency relationships between the CVSS base metrics, combining scores into three aspects: probability, effort, and skill. Allouzi and Khan [21] proposed using the Markov Chain to compute the probability distribution of Internet of Medical Things security threats, using CVSS scores to assign severity to the acknowledged vulnerabilities. One first attempt to predict CVSS final scores was made through the employment of fuzzy systems [22], outperforming Support Vector Machine (SVM) and Random-Forest. In this context,


FIGURE 1. Overview of the methodology used to assess DistilBERT performance in vulnerability detection, using CVSS data descriptions and categories.
We evaluate the model performance by varying two key aspects: 1) text pre-processing approaches; and 2) vocabulary addition. Furthermore, we evaluate
the correlation of tokens and category, via Shapley value, to assess the tokens more influential towards each category prediction.

fuzzy CVSS [23] was used to calculate the final severity score for vulnerabilities, employing fuzzy theory to reduce the error rate. To predict CVSS values for base metrics, Elbaz et al. [24] propose a linear regression model, using a bag of words approach, with the removal of irrelevant words.

C. CVSS AND DEEP LEARNING
Deep learning is also known for its effectiveness in solving complex problems, with the drawback of time-costly training. Therefore, to resemble security experts' decision-making [25], the usage of Neural Networks was proposed, automatically providing a vulnerability report through CVSS metrics. Deep reinforcement learning was also used to assess the cyber-physical security of electric power systems [26], which adapted CVSS to estimate the complexity of attack paths. As a result, CVSS base metrics have been adopted as the guide for identifying and prioritizing threats among multiple systems. This indicates that correctly and swiftly predicting the metrics for CVSS is a valuable effort.

Sahin and Tosun [27] concluded that Long Short-Term Memory (LSTM) was the most accurate model to predict CVSS final scores, when compared with Convolutional Neural Networks (CNN) and XGBoost. The two previously presented approaches gathered data from Open Source Vulnerability Database (OSVDB) and NVD, respectively, to train their models. Alternatively, Twitter discussions [28], with NVD as ground truth for CVSS scores, were fed to a Graph Convolutional Network with Attention-based input Embedding to predict the CVE severity scores. However, predicting CVSS final scores does not provide any insight to the experts about the values for the CVSS metrics.

D. VULNERABILITY INTERPRETABILITY
The analysis and interpretation of vulnerability descriptions is also reported in the literature. An empirical study based on the NVD vulnerability descriptions [29] concluded that information about the asset, attack, and vulnerability type is relevant to increase vulnerability scoring accuracy. Another work used the Local Interpretable Model-Agnostic Explanations (LIME) framework to explain the vulnerability descriptions [30], providing relevant words for a small number of vulnerabilities. To the best of our knowledge, the work presented herein is the first to combine Deep Learning and NLP approaches to extract information from vulnerability descriptions and output CVSS metrics, while using interpretability to assess model predictions.

III. METHODOLOGY
The methodology used in our experiments is displayed in Fig. 1. We start by creating a CVSS dataset, using information from the NVD. Then, we vary two major performance-related aspects: 1) text pre-processing; and 2) vocabulary addition. Finally, we evaluate model accuracy and assess token correlation with category prediction, using Shapley value.

A. MODEL DETAILS AND EVALUATION METRICS
We used the following models in our experiments: BERT [31], DistilBERT [32], RoBERTa [33], ALBERT [34], and DeBERTa [35]. Our reasoning for model choice is linked to the importance of BERT for the NLP area. It is one of the most used models in NLP, in a variety of tasks, with proven quality. Then, we opted to choose other variations of BERT to assess which is the better model for CVSS metric prediction. Specifically, we choose ALBERT and DistilBERT for having fewer parameters than BERT, and RoBERTa and DeBERTa for having more parameters than BERT. The chosen models belong to the BERT family while having specific characteristics, such as the number of parameters. As such, our work focused on finding the best performing state-of-the-art NLP models for CVSS metrics prediction.

We finetune each model following the authors' methodology: regarding the learning rate, RoBERTa was set to 1.5 × 10⁻⁵, DistilBERT was set to 5 × 10⁻⁵, while BERT,


FIGURE 2. Bar plots of the eight categories analyzed of CVSS version 3, linked to vulnerability assessment, of the vulnerability dataset used. Each
category displays the associated classes and respective class prior.

ALBERT, and DeBERTa have all been set to 3 × 10⁻⁵; for the number of training epochs, RoBERTa was trained for 2 epochs, DeBERTa for 10, and BERT, ALBERT, and DistilBERT for 3; regarding batch size, we used 8 for ALBERT and DistilBERT, and 4 for BERT, RoBERTa, and DeBERTa; finally, RoBERTa has a weight decay of 0.01, while the remaining models have the default value (0). We use the default losses and architectures of each model, from Hugging Face [36]. To obtain category classification, we use a PyTorch Softmax layer [37] on the model output.

To compare the performance of each model, we use the accuracy, F1 score, and balanced accuracy from the scikit-learn library [38]. To compare our results with state-of-the-art for CVSS metric inference, we use the accuracy metric.

B. TEXT PROCESSING AND VOCABULARY SELECTION
To assess the contribution of each word to the classification of the considered categories (discussed in Section IV), we start by processing vulnerability descriptions. We use two pre-processing methods, namely, Lemmatization and Stemming. Finally, we tokenize the text to input to the model, evaluating its accuracy based on the pre-processing approach. Both text pre-processing approaches use Natural Language Toolkit (NLTK) methods [39], while tokenization is achieved using the Transformers library, from Hugging Face [36]. We choose Lemmatization and Stemming, given their wide use as text pre-processing approaches in the NLP area. By using Lemmatization and Stemming, we intend to process text to maintain as much relevant data as possible while ignoring noisy data. This is achieved by ignoring variants of words that have the same "base". In Stemming, this base is the stem, while in Lemmatization it is the lemma.

In our experiments, we also evaluate the effect of vocabulary addition. Moreover, we also assess this effect in conjunction with the best performing text pre-processing approach. We evaluate the accuracy of the used model when adding 5,000, 10,000, and 25,000 words to the default vocabulary of the tokenizer. To select the added words, we order them by frequency of appearance in the descriptions, choosing the top n words. To avoid redundancy, we only consider words that appear exclusively in the description and not in the default vocabulary.

Given the existence of software versions and code snippets in some data descriptions, we use regular expressions to filter digits and special characters. This approach reduces the "noise" of vocabulary addition, since this filtered data is not relevant to category classification and could potentially dissipate the importance of relevant added words.

C. SHAPLEY VALUE
Deep learning models have shown high performance in multiple tasks while providing little to no explanation for the reasoning behind model prediction. To tackle this issue, we use Shapley value, an interpretability technique that allows us to interpret the reasoning of the model when providing predictions. The Shapley value, coined by Shapley in 1953 [40], is a cooperative game theory-based method used for assigning payouts to players, depending on their contribution towards the total payout. In the machine-learning context, the Shapley value is used to evaluate how each feature (player) of a given instance contributed (assigning payout) towards the model prediction of the instance (total payout).

The use of Shapley value in our experiments is linked to our interest in analyzing how each word contributed to category classification. For categories with n classes, where n is higher than 2, we perform n Shapley value analyses, each considering one class versus the remaining classes of the category. The considered class is given the value 1, with the remaining receiving the value 0. If a word contributes positively, it means that it influences the prediction towards the considered class. The higher the absolute Shapley value is, the higher the feature influence. We use the SHapley Additive exPlanations (SHAP) framework [41] and the Explainer model, from a publicly available implementation in [42].
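To make the player/payout analogy above concrete, the following is a minimal sketch that computes exact Shapley values for a toy one-versus-rest score. The three tokens, their weights, and the function `toy_network_score` are hypothetical stand-ins for a model output for the Network class; they are not taken from the paper. The SHAP framework estimates these quantities for real models, whereas this brute-force version simply averages marginal contributions over every ordering of the tokens.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical per-token weights for the class Network (1) versus the rest (0)
TOKEN_WEIGHTS = {"remote": 0.3, "protocols": 0.2, "physical": -0.4}


def toy_network_score(tokens):
    """Toy stand-in for a model score; NOT the model used in the paper."""
    score = sum(TOKEN_WEIGHTS[t] for t in tokens)
    if "remote" in tokens and "protocols" in tokens:
        score += 0.1  # illustrative interaction between two tokens
    return score


phi = shapley_values(list(TOKEN_WEIGHTS), toy_network_score)
for token, contribution in phi.items():
    print(f"{token}: {contribution:+.2f}")
```

In this toy game, the interaction bonus is shared equally between the two tokens that produce it, and the contributions sum to the score of the full description, which is the property that lets positive and negative values be read as pushing towards or away from the considered class.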


IV. VULNERABILITY DATASET
The vulnerability dataset is based on NVD information, a United States government repository of standards-based vulnerability management data. We obtain the information through their API, starting from index 0 to 152,000, representing data collected until April 2021. Finally, we process the collected data to retrieve vulnerability descriptions and the classes for each of the eight categories analyzed: Attack Vector, Attack Complexity, Privileges Required, User Interaction, Scope, Confidentiality, Integrity, and Availability. Based on the CVSS documentation, these categories are grouped into Exploitability metrics (Attack Vector, Attack Complexity, Privileges Required, and User Interaction), Scope, and Impact metrics (Confidentiality, Integrity, and Availability). Tables and Figures throughout this paper consider this grouping. A visual representation of class proportions, for each category, of our dataset is displayed in Fig. 2.

Though the collected data corresponds to 152,000 vulnerability descriptions and categories, we only consider descriptions related to version 3 of CVSS in this work. For this reason, the total number of instances in our dataset is 79,810. We divide them into train and test sets, composed of 63,848 and 15,962 instances, respectively, corresponding to a 0.2 test ratio. The average description length is 43.85 and 44.55 words for train and test split, respectively. Each set follows a similar proportion of classes, exhibited in Fig. 2, whose analytical values are shown in Table 1. The dataset is publicly available for repeatability purposes, and it serves as a basis for other models to evaluate their performance and compare with the proposed methodology.

TABLE 1. Percentage of class prior of the eight vulnerability related categories for all dataset, train, and test set. Class prior displays the likelihood of an outcome in the dataset and each subset.

V. EXPERIMENTS
A. MODEL COMPARISON
We start by comparing the performance of five different NLP methods in the proposed dataset. The accuracy, F1 score, and balanced accuracy for each of the eight categories are presented in Table 2. The results suggest that DistilBERT is the best-performing model for all the categories, in all the considered metrics. The method with the worst performance is ALBERT, which has the fewest parameters (11M), while DeBERTa, BERT, and RoBERTa, with over 100M parameters, also perform worse than DistilBERT (65M). Since we intend to assess the class inference, given a vulnerability description, the number of parameters may be linked to the performance variance. In this case, too few parameters (ALBERT) are insufficient for the model to learn, and too many lead to poorer fine-tuning. The similarity of various accuracy values between BERT, ALBERT, and DeBERTa, for different categories, can be explained by dataset imbalance. In these cases, the values displayed represent a scenario where the models opted to achieve higher accuracy by outputting the same value in every instance. Thus, in cases of dataset imbalance, the use of accuracy can be deceptive, justifying the use of other metrics such as balanced accuracy.

In this experiment, we use the default pretraining weights (provided by Hugging Face [36]) and training parameters of every model. The models used are typically applied/evaluated in tasks where the association of two sentences is analyzed (e.g., GLUE [43]) or the aim is finding answers in a text, given a question (e.g., SQuAD [44]). These types of tasks differ from predicting a category given a vulnerability description (the aim of this work), which may justify the underperformance of state-of-the-art methods in our experiments. Based on the obtained results, we selected DistilBERT for continuing the experiments involving the usage of Deep Learning.

B. TEXT PRE-PROCESSING
We assess the performance of DistilBERT, for all eight considered categories, regarding different text pre-processing approaches. We present our results, using balanced accuracy, in Table 3, with Baseline referring to the condition where no pre-processing approach is used.

When comparing category-related performance variance, we observe that all categories benefit from pre-processing. Regarding processing-related performance variance, Lemmatization promotes better results than Stemming, for all categories. Stemming truncates words by chopping off letters from the end until the stem is reached. This is a more


TABLE 2. Model accuracy (Acc), F1 score, and Balanced Accuracy (BA) for each of the eight categories analyzed. The outperforming model for each metric
and category is shown in bold.

crude approach than Lemmatization, which justifies the underperformance using this approach. Given the superiority displayed by Lemmatization over Stemming, this is the chosen pre-processing approach to use in the remaining experiments.

TABLE 3. Category balanced accuracy of DistilBERT for baseline conditions (Tokenization), and using text pre-processing approaches (Lemmatization and Stemming). The outperforming approach for each category is shown in bold.

C. VOCABULARY ADDITION
We also evaluate the effect of vocabulary addition on prediction accuracy. Furthermore, we compare the vocabulary addition with its combination with a pre-processing approach. We display our results in Table 4.

Relative to the baseline, most variations of vocabulary addition translate into performance increase, for all categories. Regarding the vocabulary variations, 5,000-word addition was the condition with better results overall. This suggests that adding more words is beneficial to model accuracy improvement. However, subsequent vocabulary addition (10,000 and 25,000-word addition) does not promote incremental performance increase. Given that vocabulary addition is linked to word frequency in the description, adding more words may disperse the model attention towards less relevant words, hindering its performance. This aspect is more noticeable when 25,000-word addition has worse performance than baseline (e.g., Attack Vector, Attack Complexity). For all categories, 25,000-word addition does not generally translate into performance improvement relative to 5,000-word addition, suggesting the existence of word importance redundancy with this approach.

Regarding the combination of vocabulary addition with Lemmatization, we observe that this approach generally improves the balanced accuracy, relative to vocabulary addition alone, for most vocabulary variations. This suggests that word importance may vary with processing approaches, which corroborates the importance of text pre-processing, even in the context of vocabulary addition. The results suggest that 5,000-word addition with Lemmatization is the best approach for overall category prediction, exhibiting the importance of text processing and pertinent word addition in description-based classification.

D. STATE-OF-THE-ART COMPARISON
We compare DistilBERT, and its combination with pre-processing and vocabulary addition, with the state-of-the-art. To the best of our knowledge, only Elbaz et al. [24] evaluate class prediction accuracy in version 3 of CVSS. To compare our results with theirs, we also display the accuracy of Baseline and 5,000-word addition with Lemmatization, whose balanced accuracy is presented in Table 4. Since the authors presented their results in a bar plot, not displaying the analytical values, we register the rounded values observed in said plot. We display the state-of-the-art comparison in Table 5.

Elbaz et al. use a bag of words approach, with the removal of irrelevant words, as input to a regression model. Using DistilBERT, a deep learning approach, in conjunction with text pre-processing and vocabulary addition, we obtain substantial accuracy improvements in the majority of categories. The categories where Elbaz's approach was closer to ours were Attack Complexity, User Interaction, and Scope, which could be linked to these categories having two classes. In these cases, the regression model used by Elbaz et al. can compete with deep learning approaches. However, for the remaining categories, with over two classes, the performance disparity is substantially larger, with up to a 28% accuracy increase. Furthermore, using the text pre-processing approach and adding vocabulary promotes an accuracy increase of DistilBERT, further enhancing its performance. The results suggest that DistilBERT is a state-of-the-art approach for vulnerability category prediction, particularly for multi-class categories.
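The word selection behind the vocabulary addition evaluated in Section V-C (frequency-ranked words that are absent from the default tokenizer vocabulary, with digits and special characters filtered out) could be sketched roughly as follows. This is an illustration under stated assumptions, not the released implementation: the function name `select_new_words`, the regular expression (which keeps hyphenated words such as man-in-the-middle), the three example descriptions, and the tiny `default_vocab` set are all hypothetical.

```python
import re
from collections import Counter

# Assumed filter: keep alphabetic words, optionally hyphenated; this drops
# digits (e.g., version numbers) and special characters.
WORD_PATTERN = re.compile(r"[a-z]+(?:-[a-z]+)*")


def select_new_words(descriptions, default_vocab, top_n):
    """Return the top_n most frequent words that are not in default_vocab."""
    counts = Counter()
    for text in descriptions:
        counts.update(WORD_PATTERN.findall(text.lower()))
    candidates = [(w, c) for w, c in counts.items() if w not in default_vocab]
    # Most frequent first; ties broken alphabetically for determinism
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in candidates[:top_n]]


# Hypothetical descriptions and a toy stand-in for the tokenizer vocabulary
descriptions = [
    "Mattermost Server 5.2 allows remote attackers to bypass checks.",
    "A man-in-the-middle attacker can read Mattermost traffic.",
    "libxaac has an out-of-bounds write; remote code execution is possible.",
]
default_vocab = {
    "a", "an", "allows", "remote", "attackers", "attacker", "to", "bypass",
    "checks", "can", "read", "traffic", "has", "write", "code", "execution",
    "is", "possible", "server",
}

new_words = select_new_words(descriptions, default_vocab, top_n=2)
print(new_words)
```

With the Hugging Face Transformers library, one plausible way to use such a list is to pass it to the tokenizer's `add_tokens` method and then call `resize_token_embeddings(len(tokenizer))` on the model, so that the embedding matrix covers the new entries.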


TABLE 4. Category balanced accuracy of DistilBERT for baseline conditions (Tokenization), and with different vocabulary addition, assessing the effect of
Lemmatization. Base, in each vocabulary column, refers to the vocabulary addition with Tokenization, without text pre-processing. The expression w/
Lemm refers to Lemmatization combination with vocabulary addition. The outperforming approach for each category is shown in bold.

TABLE 5. Category accuracy of DistilBERT, DistilBERT-Enhanced (DistilBERT-E) and Elbaz's work [24]. DistilBERT-Enhanced refers to DistilBERT using Lemmatization and 5,000-word addition. The outperforming approach for each category is shown in bold.

FIGURE 3. Bar plots of the SHAP value for Baseline, Lemmatization, 5,000-word addition (5k Vocabulary), and 5,000-word addition with Lemmatization (5k Vocabulary & Lemmatization) for the category Attack Vector, regarding Network class.

E. INTERPRETING CATEGORY CLASSIFICATION
We assess word importance in two distinct scenarios: 1) comparing the most relevant words, using different processing techniques, for a given category; and 2) assessing the variance of word importance towards/against binary and multi-class category prediction, given different processing techniques.

In the first scenario, we compare word importance variance with text pre-processing and vocabulary addition in DistilBERT. Given the overall superiority of Lemmatization and 5,000-word addition (Table 4), these are the chosen approaches. We consider four stages for comparison: 1) Baseline; 2) Lemmatization; 3) 5,000-word addition; and 4) 5,000-word addition with Lemmatization. For the remaining experiments, we will refer to each word of a description as a token to accurately represent the word translated into the tokenizer vocabulary. We evaluate token importance for the category Attack Vector, regarding the Network class. In this case, Network has a value of 1, and the remaining three classes have the value 0. Tokens with positive Shapley value influence Network classification, while negative ones are more relevant to the other three classes.

The results show a variance in token importance with text pre-processing and vocabulary addition. Starting in the Baseline, with no processing or vocabulary addition, protocols, Matter, and remote are tokens that, when in a description, influence the classification of the category towards Network. There is some logic behind said importance, given that remote and protocols are linked to network-related activities. The influence of Matter is linked to Mattermost, an open-source chat service, which was the target of multiple attacks. This shows that token importance might be influenced by specific network-related events. When we analyze tokens more associated with other classes (negative Shapley value), we observe that these are closely related to class definition (Local, Physical, and Adjacent) or associated with it (infrastructure). Adding Lemmatization, we observe the same tendency for tokens influential towards other classes but with increased importance. Furthermore, tokens linked to Network classification lose importance, aside from the specific network-related event of baseline (Matter). This suggests that description tokens are more interpretably linked to not classifying a description as Network than to classifying it as such, which could be due to class imbalance. Network accounts for over 70% of Attack Vector instances, making it harder to distinguish tokens clearly associated with it, thus justifying the Lemmatization results. The addition of vocabulary (5k Vocabulary) heavily influences category classification, with new tokens being associated with Network classification: libxaac, Mattermost, and man-in-the-middle. Libxaac is an Android library with

VOLUME 10, 2022 59131


J. C. Costa et al.: Predicting CVSS Metric via Description Interpretation

FIGURE 4. Boxplots summarizing the effect of applying text pre-processing techniques (Lemmatization) and vocabulary addition (5,000-word addition)
for (a) binary class (Attack Complexity, User Interaction, and Scope) and (b) multiple class problems.

reported out-of-bound reading/writing errors, while man- Shapley value) and against (negative Shapley value) the
in-the-middle is a type of network attack. The importance highest class. However, combining vocabulary addition
of these tokens is linked to specific network-related events with Lemmatization increases token importance variance,
(attacks, errors), which was also observed in the baseline. particularly for negative values. This translates into increased
Protocols also increases in importance towards Network importance of tokens to categorize the least represented class.
classification, which could be linked to their association If almost all descriptions relate to a specific class, it may
with the added vocabulary. This shows that vocabulary be more beneficial/discriminative to focus on tokens linked
addition shifts the focus of token importance heavily towards to the underrepresented class, which is the approach of the
specific events, for Network classification. Network-adjacent model in this case.
(added by vocabulary addition) also gains importance in Analyzing multi-class boxplots shows that the variance of
classifying other classes, given its relevance to dissociate Net- negative Shapley value remains nearly constant throughout
work from Adjacent. Complementing vocabulary addition the various text pre-processing methods. Comparatively to
with Lemmatization (5k Vocabulary & Lemmatization) the binary classes, negative Shapley value refers to various
diminishes the importance of tokens closely linked to classes and not simply one, which justifies the (low)
Network (positive Shapley value), resurging the tendency variance observed for these cases. Relative to positive
observed with Lemmatization alone. The reduced importance Shapley value, using vocabulary addition and its combination
of specific network-related events also greatly decreased with Lemmatization tends to reduce the variance of token
token importance associated with it (protocols). Furthermore, importance, achieving a similar variance to negative Shapley
the influence of added vocabulary was enhanced in logon value tokens. In multi-class prediction, even when one class
(closely related to classes other than Network) and network- is more prevalent than others, the existence of tokens closely
adjacent, while keeping high importance of tokens associated linked to specific categories is not as likely as in binary class
with other classes definition (physical and local). This result prediction. For this reason, reducing the overall importance
suggests that Lemmatization is necessary to obtain more towards specific token importance classification translates
coherent/explainable token importance, which ultimately into better results.
translates into better model performance (as shown in
Table 4). VI. CONCLUSION
The second considered scenario relates token importance The increasing number of threats and vulnerabilities in IT
when considering binary (Attack Complexity, User Interac- systems surpass the capability of professionals to handle
tion, and Scope) and multi-class categories, for the same them, potentially leading to company prejudice. This raises
processing approaches of the first scenario. For all categories, the need to prioritize vulnerabilities, typically achieved
the highest proportion class per category was associated with through CVSS metrics, via manual vulnerability description
the value 1, with the remaining being associated with 0. Fig. 4 analysis. In this paper, we present a vulnerability dataset,
displays the boxplots for the two cases considered, showing from NVD data, and analyze the applicability of deep
the data distribution (ignoring wildcard cases). learning approaches, namely NLP methods, to aid in CVSS
The analysis of binary boxplots indicates that using metric prediction via description interpretation. In our exper-
Lemmatization and vocabulary addition promotes a decrease iments, we also assess the importance of text processing and
in token importance variance in both towards (positive vocabulary addition in metric prediction while interpreting

it via Shapley value. Our results show that DistilBERT is a state-of-the-art model for CVSS metric prediction, with increased performance when combined with Lemmatization (text pre-processing) and 5,000-word addition. Furthermore, this combination mitigates the effect of specific events in category prediction and leads to weighted word importance, particularly for binary categories, contributing to increased model accuracy. The presented dataset and model experiments serve as a comparable basis for future works in CVSS metric prediction, applicable for vulnerability handling/prioritization, which leads to increased usefulness and accuracy of the metric, benefiting system security and operational effectiveness.
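As a reading aid for the token-importance interpretation used in this work, the following is a minimal sketch of how a Shapley value can be computed exactly for each token under a one-vs-rest labeling (the target class has value 1, the remaining classes 0). It is not the pipeline used in the experiments: the scoring function `toy_network_score`, the weights in `TOY_WEIGHTS`, and the four-token description are hypothetical and chosen only so that the signs are easy to read. Positive values push towards the target class, negative values towards the other classes, and the spread of each group is what a boxplot such as Fig. 4 summarizes.

```python
from itertools import combinations
from math import factorial
from statistics import pvariance


def shapley_values(tokens, value_fn):
    """Exact Shapley value of each (unique) token for a set-valued score."""
    n = len(tokens)
    values = {}
    for i, token in enumerate(tokens):
        others = tokens[:i] + tokens[i + 1:]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                gain = value_fn(coalition | {token}) - value_fn(coalition)
                total += weight * gain
        values[token] = total
    return values


# Hypothetical evidence for the target class (1) versus the rest (0).
# These numbers are made up for illustration only.
TOY_WEIGHTS = {"remote": 0.3, "protocols": 0.2, "local": -0.4, "physical": -0.1}


def toy_network_score(present_tokens):
    """Toy one-vs-rest score: sum of the weights of the tokens present."""
    return sum(TOY_WEIGHTS.get(t, 0.0) for t in present_tokens)


def split_by_sign(values):
    """Separate token importances towards and against the target class."""
    positive = [v for v in values.values() if v > 0]
    negative = [v for v in values.values() if v < 0]
    return positive, negative


description = ["remote", "protocols", "local", "physical"]
phi = shapley_values(description, toy_network_score)
positive, negative = split_by_sign(phi)

print(phi)
print("variance towards the target class:", pvariance(positive))
print("variance towards the other classes:", pvariance(negative))
```

Because the toy score is additive, each token's Shapley value equals its weight, which makes the sketch easy to check by hand. A real classifier is not additive, and exact enumeration grows exponentially with the number of tokens, so practical analyses rely on approximations.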
REFERENCES [24] C. Elbaz, L. Rilling, and C. Morin, ‘‘Fighting N-day vulnerabilities with
[1] P. Boden. (2016). The Emerging Era of Cyber Defense and Cybercrime. automated CVSS vector prediction at disclosure,’’ in Proc. 15th Int. Conf.
Accessed: Jul. 29, 2021. [Online]. Available: https://www.microsoft. Availability, Rel. Secur., Aug. 2020, pp. 1–10.
com/security/blog/2016/01/27/the-emerging-era-of-cyber-defense-and- [25] A. Beck and S. Rass, ‘‘Using neural networks to aid CVSS risk
cybercrime/ aggregation—An empirically validated approach,’’ J. Innov. Digit.
[2] S. Morgan. (2020). Cybercrime to Cost the World $10.5 Trillion Ecosyst., vol. 3, no. 2, pp. 148–154, 2016.
Annually by 2025. Accessed: Jul. 29, 2021. [Online]. Available: [26] X. Liu, J. Ospina, and C. Konstantinou, ‘‘Deep reinforcement learning
https://cybersecurityventures.com/hackerpocalypse-cybercrime-report- for cybersecurity assessment of wind integrated power systems,’’ IEEE
2016/ Access, vol. 8, pp. 208378–208394, 2020.
[3] vuldb.Com. (2021). Vulnerability Database. Accessed: Jul. 28, 2021. [27] S. E. Sahin and A. Tosun, ‘‘A conceptual replication on predicting the
[Online]. Available: https://vuldb.com/? severity of software vulnerabilities,’’ in Proc. Eval. Assessment Softw. Eng.,
[4] M. U. Aksu, M. H. Dilek, E. I. Tatli, K. Bicakci, H. I. Dirik, Apr. 2019, pp. 244–250.
M. U. Demirezen, and T. Aykir, ‘‘A quantitative CVSS-based cyber [28] H. Chen, J. Liu, R. Liu, N. Park, and V. S. Subrahmanian, ‘‘VASE: A
security risk assessment methodology for IT systems,’’ in Proc. Int. Twitter-based vulnerability analysis and score engine,’’ in Proc. IEEE Int.
Carnahan Conf. Secur. Technol. (ICCST), Oct. 2017, pp. 1–8. Conf. Data Mining (ICDM), Nov. 2019, pp. 976–981.
[5] P. Mell, K. Scarfone, and S. Romanosky, ‘‘Common vulnerability scoring [29] L. Allodi, S. Banescu, H. Femmer, and K. Beckers, ‘‘Identifying relevant
system,’’ IEEE Secur. Privacy, vol. 4, no. 6, pp. 85–89, Nov./Dec. 2006. information cues for vulnerability assessment using CVSS,’’ in Proc. 8th
[6] D. E. Mann and S. M. Christey, ‘‘Towards a common enumeration ACM Conf. Data Appl. Secur. Privacy, Mar. 2018, pp. 119–126.
of vulnerabilities,’’ in Proc. 2nd Workshop Res. Secur. Vulnerability [30] K. B. Alperin, A. B. Wollaber, and S. R. Gomez, ‘‘Improving inter-
Databases, West Lafayette, IN, USA: Purdue Univ., 1999, pp. 1–13. pretability for cyber vulnerability assessment using focus and context
[7] N. I. of Standards and Technology. (2021). NVD—Vulnerability Metrics. visualizations,’’ in Proc. IEEE Symp. Vis. Cyber Secur. (VizSec), Oct. 2020,
Accessed: Jul. 28, 2021. [Online]. Available: https://nvd.nist.gov/vuln- pp. 30–39.
metrics/cvss [31] J. Devlin, M. Chang, K. Lee, and K. Toutanova, ‘‘BERT: Pre-training
[8] P. Johnson, R. Lagerstrom, M. Ekstedt, and U. Franke, ‘‘Can the common of deep bidirectional transformers for language understanding,’’ in Proc.
vulnerability scoring system be trusted? A Bayesian analysis,’’ IEEE Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang.
Trans. Dependable Secure Comput., vol. 15, no. 6, pp. 1002–1015, Technol. (NAACL-HLT), Minneapolis, MN, USA, vol. 1, Jun. 2019,
Dec. 2018. pp. 4171–4186.
[9] A. Feutrill, D. Ranathunga, Y. Yarom, and M. Roughan, ‘‘The effect [32] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, ‘‘DistilBERT, a
of common vulnerability scoring system metrics on vulnerability exploit distilled version of BERT: Smaller, faster, cheaper and lighter,’’ 2019,
delay,’’ in Proc. 6th Int. Symp. Comput. Netw. (CANDAR), Nov. 2018, arXiv:1910.01108.
pp. 1–10. [33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
[10] L. S. Shapley, A Value for N-Person Games. Princeton, NJ, USA: Princeton L. Zettlemoyer, and V. Stoyanov, ‘‘RoBERTa: A robustly optimized BERT
Univ. Press, 2016, ch. 17. pretraining approach,’’ 2019, arXiv:1907.11692.
[11] A. A. Younis and Y. K. Malaiya, ‘‘Comparing and evaluating CVSS base [34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut,
metrics and Microsoft rating system,’’ in Proc. IEEE Int. Conf. Softw. ‘‘ALBERT: A lite BERT for self-supervised learning of language
Qual., Rel. Secur., Aug. 2015, pp. 252–261. representations,’’ in 8th Int. Conf. Learn. Represent. (ICLR), Addis Ababa,
[12] H. Joh, ‘‘Software risk assessment for windows operating systems with Ethiopia, Apr. 2020.
respect to CVSS,’’ Eur. J. Eng. Technol. Res., vol. 4, no. 11, pp. 41–45, [35] P. He, X. Liu, J. Gao, and W. Chen, ‘‘DeBERTa: Decoding-enhanced
Nov. 2019. bert with disentangled attention,’’ in Proc. 9th Int. Conf. Learn. Repre-
[13] R. Wirtz and M. Heisel, ‘‘CVSS-based estimation and prioritization for sent. (ICLR), Vienna, Austria, May 2021.
security risks,’’ in Proc. 14th Int. Conf. Eval. Novel Approaches Softw. [36] T. Wolf, L. Debut, V. Sanh, and J. Chaumond, ‘‘Transformers: State-of-the-
Eng., 2019, pp. 297–306. art natural language processing,’’ in Proc. Conf. Empirical Methods Natu-
[14] A. Ur-Rehman, I. Gondal, J. Kamruzzuman, and A. Jolfaei, ‘‘Vulnerability ral Lang. Process., Syst. Demonstrations, Oct. 2020, pp. 38–45. [Online].
modelling for hybrid IT systems,’’ in Proc. IEEE Int. Conf. Ind. Technol. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
(ICIT), Feb. 2019, pp. 1186–1191. [37] A. Paszke, S. Gross, F. Massa, and A. Lerer, ‘‘Pytorch: An imperative style,
[15] A. Ur-Rehman, I. Gondal, J. Kamruzzaman, and A. Jolfaei, ‘‘Vulnerability high-performance deep learning library,’’ in Proc. Adv. Neural Inf. Process.
modelling for hybrid industrial control system networks,’’ J. Grid Comput., Syst., vol. 32, 2019, pp. 8026–8037.
vol. 18, no. 4, pp. 863–878, Dec. 2020. [38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
[16] N. Mishra and R. Singh, ‘‘Taxonomy & analysis of cloud computing M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas,
vulnerabilities through attack vector, CVSS and complexity parameter,’’ in A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay,
Proc. Int. Conf. Issues Challenges Intell. Comput. Techn. (ICICT), vol. 1, ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. Learn. Res., vol. 12
Sep. 2019, pp. 1–8. no. 10, pp. 2825–2830, 2012.
[17] M. P. Chase and S. M. C. Coley, ‘‘Rubric for applying CVSS to medical [39] S. Bird, E. Klein, and E. Loper, Natural Language Processing With Python:
devices,’’ MITRE Corp., McLean, VA, USA, Tech. Rep. HHSM-500- Analyzing Text With the Natural Language Toolkit. Sebastopol, CA, USA:
2012-00008I, Oct. 2020. O’Reilly Media, 2009.

VOLUME 10, 2022 59133


J. C. Costa et al.: Predicting CVSS Metric via Description Interpretation

JOANA CABRAL COSTA received the bachelor's and master's degrees in computer science and engineering from the Universidade da Beira Interior (UBI), in 2019 and 2021, respectively, where she is currently pursuing the Ph.D. degree with a Fundação para a Ciência e a Tecnologia (FCT) Scholarship, in the field of computer vision and adversarial attacks.

TIAGO ROXO (Member, IEEE) received the bachelor's degree in computer science and engineering from the Universidade da Beira Interior (UBI), in 2019, where he is currently pursuing the Ph.D. degree with a Fundação para a Ciência e a Tecnologia (FCT) Scholarship, in the field of computer vision and artificial intelligence.

JOÃO B. F. SEQUEIROS received the bachelor's and master's degrees in computer science and engineering from the Universidade da Beira Interior, in 2014 and 2016, respectively, where he is currently pursuing the Ph.D. degree under the title "Towards a Framework for System and Attack Modeling and Mapping of Security Requirements for the Internet of Things." His dissertation focused on the development of a box for automated network-based security assessments. He has authored or coauthored several journals and conference papers mainly in the security field. His main research interests include network and application security, cryptography, cybersecurity, and the IoT security.

HUGO PROENÇA (Senior Member, IEEE) received the B.Sc., M.Sc., and Ph.D. degrees, in 2001, 2004, and 2007, respectively. He is currently an Associate Professor with the Department of Computer Science, University of Beira Interior, and has been researching mainly about biometrics and visual-surveillance. He is a member of the Editorial Boards of the Image and Vision Computing, IEEE ACCESS, and International Journal of Biometrics. He served as a Guest Editor for special issues of the Pattern Recognition Letters, Image and Vision Computing, and Signal, Image and Video Processing journals. He was the Co-ordinating Editor of the IEEE Biometrics Council Newsletter and the Area Editor (Ocular Biometrics) of the IEEE BIOMETRICS COMPENDIUM journal.

PEDRO R. M. INÁCIO (Senior Member, IEEE) was born in Covilhã, Portugal, in 1982. He received the B.Sc. degree in mathematics/computer science and the Ph.D. degree in computer science and engineering from the Universidade da Beira Interior (UBI), Portugal, in 2005 and 2009, respectively. The Ph.D. work was performed in the enterprise environment of Nokia Siemens Networks Portugal S.A., through a Ph.D. Grant from the Portuguese Foundation for Science and Technology.
He has been a Professor of computer science at the UBI, since 2010, where he lectures subjects related with information assurance and security, programming of mobile devices and computer based simulation, to graduate and undergraduate courses, namely to the B.Sc., M.Sc. and Ph.D. programs in computer science and engineering. He is currently the Head of the Department of Computer Science, UBI. He is an Instructor of the UBI Cisco Academy. He is a Researcher of the Instituto de Telecomunicações (IT). He has about 40 publications in the form of book chapters and papers in international peer-reviewed books, conferences, and journals. He frequently reviews articles for IEEE, Springer, Wiley, and Elsevier journals. His main research interests include information assurance and security, computer-based simulation, and network traffic monitoring, analysis, and classification.
Dr. Inácio has been a member of the Technical Program Committee of International Conferences, such as the ACM Symposium on Applied Computing, Track on Networking. He was one of the chairs of WISARC 2016.
