Automated Software Vulnerability Assessment With Concept Drift

Abstract— Software Engineering researchers are increasingly using Natural Language Processing (NLP) techniques to automate Software Vulnerabilities (SVs) assessment using the descriptions in public repositories. However, the existing NLP-based approaches suffer from concept drift. This problem is caused by a lack of proper treatment of new (out-of-vocabulary) terms for the evaluation of unseen SVs over time. To perform automated SVs assessment with concept drift using SVs' descriptions, we propose a systematic approach that combines both character and word features. The proposed approach is used to predict seven Vulnerability Characteristics (VCs). The optimal model of each VC is selected using our customized time-based cross-validation method from a list of eight NLP representations and six well-known Machine Learning models. We have used the proposed approach to conduct large-scale experiments on more than 100,000 SVs in the National Vulnerability Database (NVD). The results show that our approach can effectively tackle the concept drift issue of the SVs' descriptions reported from 2000 to 2018 in NVD even without retraining the model. In addition, our approach performs competitively compared to the existing word-only method. We also investigate how to build compact concept-drift-aware models with much fewer features and give some recommendations on the choice of classifiers and NLP representations for SVs assessment.

Keywords—software vulnerability, machine learning, multi-class classification, natural language processing, mining software repositories

I. INTRODUCTION

A Software Vulnerability (SV) is usually defined as a flaw or weakness in software code that can potentially result in a cybersecurity attack [1]. Cybersecurity attacks reportedly led to a loss of more than 50 billion dollars to the U.S. economy in 2016 [2]. Different types of SVs pose different levels of security threats to software-intensive systems [3]. It is important to assess SVs for prioritizing actions so that more severe SVs are patched before exploitation [4, 5]. Automation of SVs analysis and assessment has become an important area of research efforts. Identification of SVs' characteristics is a critical task for automation. It is asserted that SVs' public repositories, such as the National Vulnerability Database (NVD), can help identify SVs' characteristics by analyzing their descriptions using Natural Language Processing (NLP) [6-8].

However, the vulnerability data have a temporal property since many new terms appear in the descriptions of SVs. Such terms are a result of the release of new technologies/products or the discovery of a zero-day attack or SV; for example, the NVD received more than 13,000 new SVs in 2017 [9]. The appearance of new concepts makes the vulnerability data and patterns change over time [10-12], which is known as concept drift [13]. For example, the keyword Android has only started appearing in NVD since 2008, the year when Google released Android. It is being recognized that such new SV terms cause problems for building vulnerability assessment models.

Some previous studies [6, 14, 15] have suffered from concept drift by mixing the new and old SVs in the model validation step, which can lead to biased results as such an approach accidentally merges the new information with the existing one. Moreover, the previous work on SVs analysis [6, 7, 14-18] used predictive models with only word features without reporting how to handle the new or extended concepts (e.g., new versions of the same software) in the new SVs' descriptions. The research on machine translation [19-22] has shown that unseen (Out-of-Vocabulary (OoV)) terms can make existing word-only models less robust to future prediction due to their missing information. For vulnerability prediction, Han et al. [8] did use random embedding vectors to represent the OoV words, which still discards the relationship between the new and old concepts. Such observations motivated us to tackle the research problem "How to handle the concept drift issue of the vulnerability descriptions in public repositories to improve the robustness of automated SVs assessment?" It appears to us that it is important to address the issue of concept drift to enable practical applicability of automated vulnerability assessment tools. To the best of our knowledge, there has been no existing work to systematically address the concept drift issue in SVs assessment.

To perform SVs assessment with concept drift using the vulnerability descriptions in public repositories, we present a Machine Learning (ML) model that utilizes both character-level and word-level features. We also propose a customized time-based version of the cross-validation method for model selection and validation. Our cross-validation method splits the data by year to embrace the temporal relationship of SVs. We evaluate the proposed model on the prediction of seven Vulnerability Characteristics (VCs), i.e., Confidentiality, Integrity, Availability, Access Vector, Access Complexity, Authentication, and Severity. Our key contributions are:

1. We demonstrate the concept drift issue of SVs using concrete examples from NVD.
2. We investigate a customized time-based cross-validation method to select the optimal ML models for SVs assessment. Our method can help prevent future vulnerability information from being leaked into the past in model selection and validation steps.
3. We propose and extensively evaluate an effective Character-Word Model (CWM) to assess SVs using the descriptions with concept drift. We also investigate the performance of low-dimensional CWM models. We provide a GitHub repository (https://github.com/lhmtriet/MSR2019) containing our models and associated source code.
Paper structure. Section II introduces the vulnerability description and VCs. Section III describes our proposed approach. Section IV presents the experimental design of this work. Section V analyzes the experimental results and discusses the findings. Section VI identifies the threats to validity. Section VII covers the related works. Section VIII concludes and suggests some future directions.

II. BACKGROUND

The National Vulnerability Database (NVD) [9] is a well-known public repository that contains a huge amount of SVs information that is considered trustworthy as NVD is maintained by a governmental body (the National Cyber Security Division of the United States Department of Homeland Security). NVD inherits the unique vulnerability identifiers and descriptions from Common Vulnerabilities and Exposures (CVE) [1]. NVD also adds an evaluation to each SV using the Common Vulnerability Scoring System (CVSS) [23, 24]. Currently, there are three versions of CVSS, in which the latest version (i.e., the third version) was introduced in 2015. The second CVSS version (CVSS 2) is also maintained.

In CVSS 2, an SV is evaluated based on three main criteria: Impact, Exploitability and Severity. Impact and Exploitability refer to the threats and exploitation procedures of each SV. Severity determines the level of severity of an SV based on Impact and Exploitability. The first two criteria can be further decomposed into (Confidentiality, Integrity, Availability) and (Access Vector, Access Complexity, Authentication), respectively (cf. Fig. 1). There are three separate values for each of the seven VCs. From the perspective of ML, this is a multi-class classification problem, which can be solved readily using ML algorithms. It is noted that the Access Vector, Access Complexity and Authentication characteristics suffer the most from the issue of imbalanced data, in which the number of elements in the minority class is much smaller compared to those of the other classes.

III. THE PROPOSED APPROACH

[…] validation. The text preprocessing step (cf. section III.B) is necessary to reduce the noise in the text to build a better assessment model. Next, the preprocessed text enters the time-based k-fold cross-validation step to select the optimal classifier and NLP representations for each VC. It should be noted that this step only tunes the word-level models instead of the combined models of both word and character features. One reason is that the search space of the combined model is much larger than that of the word-only model since we at least have to consider different NLP representations for character-level features. The computational resource to extract character-level n-grams is also more than that of word-level counterparts. Section III.C provides more details about the time-based k-fold cross-validation method.

Next comes the model building process with four main steps: (i) word n-grams generation, (ii) character n-grams generation, (iii) feature aggregation and (iv) character-word model building. Steps (i) and (ii) use the preprocessed text in the previous process to generate word and character n-grams based on the identified optimal NLP representations of each VC. The word n-grams generation step (i) here is the same as the one in the time-based k-fold cross-validation of the previous process. An example of the word and character n-grams in our approach is given in Table 1. Such character n-grams increase the probability of capturing parts of OoV terms due to concept drift in vulnerability descriptions. Subsequently, both levels of the n-grams and the optimal NLP representations are input into the feature aggregation step (iii) to extract the features from the preprocessed text using our proposed algorithm in section III.D. This step also combines the aggregated character and word vocabularies with the optimal NLP representations of each VC to create the feature models. We save such models to transform the data of future prediction. In the last step (iv), the extracted features are trained with the optimal classifiers found in the model selection process to build the complete character-word models for each VC to perform automated vulnerability assessment with concept drift.

In the prediction process, a new vulnerability description is first preprocessed using the same text preprocessing step. Then, the preprocessed text is transformed to create the features by the feature models saved in the model building process. Finally, the trained character-word models use such features to determine each VC.
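This prediction flow can be summarized in a short sketch; the helper names (preprocess, horizontal_append) and the dictionary layout are hypothetical stand-ins for the saved artifacts described above, not the authors' exact implementation:

```python
def assess_new_sv(description, feature_models, classifiers):
    """Predict the seven VCs for one new vulnerability description."""
    text = preprocess(description)             # same text preprocessing step
    predictions = {}
    for vc, clf in classifiers.items():
        model_w, model_c = feature_models[vc]  # saved word/char feature models
        # Transform with both saved models and append the matrices by columns.
        x = horizontal_append(model_w.transform([text]),
                              model_c.transform([text]))
        predictions[vc] = clf.predict(x)[0]
    return predictions
```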
Fig. 2. Main workflow of our proposed model for vulnerability assessment with concept drift.
Subsequently, the stemming step is done using the Porter Stemmer algorithm [28] in the nltk library. Stemming is needed to avoid two or more words with the same meaning but in different forms (e.g., "allow" vs. "allows"). The main goal of stemming is to retrieve consistent features (words), thus any algorithm that can return each word's root should work. Researchers may use lemmatization, which is relatively slower as it also considers the surrounding context.
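As an illustration, here is a minimal sketch of this step with nltk's PorterStemmer; the tokenization is simplified (whitespace split) and the full preprocessing pipeline may differ:

```python
from nltk.stem import PorterStemmer

def stem_description(description: str) -> str:
    """Lowercase and Porter-stem an NVD description (simplified sketch)."""
    stemmer = PorterStemmer()
    tokens = description.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens)

print(stem_description("Buffer overflow allows remote attackers to execute arbitrary code"))
# -> "buffer overflow allow remot attack to execut arbitrari code"
```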
C. Model selection with time-based k-fold cross-validation

We propose a time-based cross-validation method (cf. Fig. 3) to select the best classifiers and NLP representations for each VC. The idea has been inspired by the time-series domain [29]. As shown in Fig. 3, our method has four steps: (i) data splitting, (ii) word n-grams generation, (iii) feature transformation, and (iv) model training and evaluation. Data splitting explicitly considers the time order of SVs to ensure that in each pass/fold, the new information of the validation set does not exist in the training set, which maintains the temporal property of SVs. The new terms can appear at different times during a year; thus, the preprocessed text in each fold is split by the year explicitly, not by equal sample size, e.g., SVs from 1999 to 2010 are for training and those of 2011 are for validation in a pass/fold.

Fig. 3. Our proposed time-based cross-validation method. Note: x is the final year in the original training set, k is the number of cross-validation folds.

After data splitting in each fold, we use the training set to generate the word n-grams. Subsequently, with each of the eight NLP configurations in Table 2, the feature transformation step uses the word n-grams as the vocabulary to transform the preprocessed text of both training and validation sets into the features for building a model. We create the NLP configurations from various values of n-grams combined with either the term frequency or tf-idf measure. Uni-gram with term frequency is also called Bag-of-Words (BoW). These NLP representations have been selected since they are popular and have performed well for SVs analysis [6, 14, 17].

TABLE 2. EIGHT CONFIGURATIONS OF NLP REPRESENTATIONS USED FOR MODEL SELECTION. NOTE: '✓' IS SELECTED, '-' IS NON-SELECTED.

Configuration | Word n-grams | tf-idf
1 | 1 | -
2 | 1 | ✓
3 | 1-2 | -
4 | 1-3 | -
5 | 1-4 | -
6 | 1-2 | ✓
7 | 1-3 | ✓
8 | 1-4 | ✓

For each NLP configuration, the model training and evaluation step trains six classifiers (cf. section IV.C) on the training set and then evaluates the models on the validation set using different evaluation metrics (cf. section IV.D). The model with the highest average cross-validated performance is selected for the current VC. The process is repeated for every VC, then the optimal classifiers and NLP representations are returned for all seven VCs.
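A minimal sketch of the year-based data splitting is given below; it assumes a list of records with a year field, and the concrete fold years are illustrative:

```python
def time_based_folds(records, first_val_year, k):
    """Yield (train, val) pairs where fold i trains on all SVs before
    year first_val_year + i and validates on that year only, so no
    future terms can leak into the training vocabulary."""
    for i in range(k):
        val_year = first_val_year + i
        train = [r for r in records if r["year"] < val_year]
        val = [r for r in records if r["year"] == val_year]
        yield train, val

# e.g., five folds validating on the years 2012 through 2016:
# folds = time_based_folds(svs, first_val_year=2012, k=5)
```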
D. Feature aggregation algorithm

We propose Algorithm 1 to combine word and character n-grams in the model building process to create the features for our character-word model. The six inputs of the algorithm are (i) the input descriptions, (ii) the word n-grams, (iii) the character n-grams, (iv) the minimum and (v) the maximum number of character n-grams, and (vi) the optimal NLP configuration of the current VC. The main output is a feature matrix containing the term weights of the documents transformed by the aggregated character and word vocabularies to build the character-word models. We also output the character and word feature models for future prediction.
Algorithm 1. Feature aggregation algorithm to transform the documents with the aggregated word and character-level features.

Input:  List of vulnerability descriptions: Din
        Set of word-level n-grams: Fw = {f1w, f2w, ..., fnw}
        Set of character-level n-grams: Fc = {f1c, f2c, ..., fmc}
        The minimum and maximum character n-grams: minngram and maxngram
        The optimal NLP configuration of the current VC: config
Output: The aggregated data matrix: Xagg
        The word and character feature models: modelw, modelc

1  slt_chars ← empty set
2  foreach fi ∈ Fc do
3    tokens ← fi trimmed and split by space
4    if (size of tokens = 1) and ((length of the first element in tokens) > 1) then
5      slt_chars ← slt_chars + {tokens}
6    end if
7  end foreach
8  diff_words ← Fw \ slt_chars
9  modelw ← Feature_transformation(diff_words, config)
10 modelc ← Feature_transformation(slt_chars, minngram − 1, maxngram, config)
11 Xword ← Din transformed with modelw
12 Xchar ← Din transformed with modelc
13 Xagg ← horizontal_append(Xword, Xchar)
14 return Xagg, modelw, modelc
Steps 2-7 of the algorithm filter the character features. More specifically, step 3 removes (trims) the spaces from both ends of each feature. Then, we split such a feature by space(s) to determine how many words the character(s) belong to. Subsequently, steps 4-6 retain only the character features that are parts of single words (size of tokens = 1), except the single characters such as x, y, z ((length of the first element in tokens) > 1). The n-gram characters with space(s) in between represent more than one word, which can make the classifier more prone to overfitting. Similarly, single characters are too short and they can belong to too many words, which is likely to make the model hardly generalizable. In fact, 'a' is a meaningful single character, but it has already been removed as a stop word. The characters can even represent a whole word (e.g., the "attack" token with maxngram ≥ 6). In such cases, step 8 removes the duplicated word-level features (Fw \ slt_chars). Based on the assumption that unseen or misspelled terms can share common characters with existing words, such a choice can enhance the probability of the model capturing the OoV words in the new descriptions. Retaining only the character features also helps reduce the number of features and the model overfitting. After that, steps 9-10 define the feature models modelw and modelc using the word (diff_words) and character (slt_chars) vocabularies, respectively, along with the NLP configurations to transform the input documents into feature matrices for building the model. Steps 11-12 then use the two defined word and character models to actually transform the input documents Din into feature matrices Xword and Xchar, respectively. Step 13 concatenates the two feature matrices by columns. Finally, step 14 returns the final aggregated feature matrix Xagg along with both word and character feature models, namely modelw and modelc.
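A condensed Python sketch of this aggregation with scikit-learn follows; the function name and the exact vectorizer settings (e.g., counting vs. tf-idf, word n-gram range) are illustrative rather than the authors' exact implementation:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

def aggregate_features(docs, word_ngrams, char_ngrams, min_n, max_n):
    # Steps 2-7: keep char n-grams that sit inside a single word
    # and are longer than one character.
    slt_chars = {f for f in char_ngrams
                 if len(f.strip().split()) == 1 and len(f.strip()) > 1}
    # Step 8: drop word features duplicated by whole-word char n-grams.
    diff_words = set(word_ngrams) - slt_chars
    # Steps 9-10: feature models restricted to the two vocabularies.
    model_w = CountVectorizer(vocabulary=sorted(diff_words))
    model_c = CountVectorizer(analyzer="char",
                              ngram_range=(min_n, max_n),
                              vocabulary=sorted(slt_chars))
    # Steps 11-13: transform and append the two matrices by columns.
    X_word = model_w.fit_transform(docs)
    X_char = model_c.fit_transform(docs)
    return hstack([X_word, X_char]), model_w, model_c
```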
IV. EXPERIMENT DESIGN

All the classifiers and NLP representations (n-grams, term frequency, tf-idf) in this work have been implemented with the scikit-learn [26] and nltk [27] libraries in Python. Our code ran on a fourth-generation Intel Core i7-4200HQ CPU (four cores) running at 2.6 GHz with 16 GB of RAM.

A. Research Questions

Our research aims at addressing the concept drift issue in SVs' descriptions to improve the robustness of both the model selection and prediction steps. We have evaluated our two-phase character-word model. The first phase selects the optimal models for each VC. The second phase incorporates the character features to build character-word models. We raise and answer four Research Questions (RQs):

• RQ1: Is our time-based cross-validation more effective than a non-temporal method to handle concept drift in the model selection step for vulnerability assessment? To answer RQ1, we first identify the new terms in the vulnerability descriptions. We associate such terms with their release or discovery years. We then use qualitative examples to demonstrate the information leakage in the non-temporal model selection step. We also quantitatively compare the effectiveness of the proposed time-based cross-validation method with a traditional non-temporal one to address the temporal relationship in the context of vulnerability assessment.
• RQ2: Which are the optimal models for multi-classification of each vulnerability characteristic? To answer RQ2, we present the optimal models (i.e., classifiers and NLP representations) using word features for each VC selected by a five-fold time-based cross-validation method (cf. section III.C). We also compare the performance of different classes of models (single vs. ensemble) and NLP representations to give recommendations for future use.
• RQ3: How effective is our character-word model to perform automated vulnerability assessment with concept drift? For RQ3, we first demonstrate how the OoV phrases identified in RQ1 can affect the performance of the existing word-only models. We then highlight the ability of the character features to handle the concept drift issue of SVs. We also compare the performance of our character-word model with those of the word-only (without handling concept drift) and character-only models.
• RQ4: To what extent can a low-dimensional model retain the original performance? The features of our proposed model in RQ3 are high-dimensional and sparse. Hence, we evaluate a dimensionality reduction technique (i.e., Latent Semantic Analysis [30]) and sub-word embeddings (i.e., fastText [31, 32]) to show how much information of the original model is approximated in lower dimensions. The work done for answering RQ4 facilitates the building of more efficient concept-drift-aware predictive models.

B. Dataset

We retrieved 113,292 SVs from NVD in JSON format. The dataset contains the SVs from 1988 to 2018. We discarded 5,926 SVs that contain "** REJECT **" in their descriptions since they have been confirmed duplicated or incorrect by experts.
Seven VCs of CVSS 2 (cf. section II) were used as our SVs assessment metrics. It turned out that there are 2,242 SVs without any value of CVSS 2. Therefore, we also removed such SVs from our dataset. Finally, we obtained a dataset containing 105,124 SVs along with their descriptions and the values of the seven VCs indicated previously. For evaluation purposes, we followed the work in [6] to use the year of 2016 to divide our dataset into training and testing sets with the sizes of 76,241 and 28,883, respectively. The primary reason for splitting the dataset based on the time order is to consider the temporal relationship of SVs.
C. Vulnerability classification machine learning models

To solve our multi-class classification problem, we used six well-known ML models. These classifiers have achieved great results in recent data science competitions such as Kaggle [33]. We provide brief descriptions and the hyperparameters of each classifier below.

• Naïve Bayes (NB) [34] is a simple probabilistic model that is based on Bayes' theorem. This model assumes that all the features are conditionally independent with respect to each other. In this study, NB has no tuning hyperparameter during the validation step.
• Logistic Regression (LR) [35] is a linear classifier in which the logistic function is used to convert the linear output into a probability. The one-vs-rest scheme is applied to split the multi-class problem into multiple binary classification problems. In this work, we select the optimal value of the regularization parameter for LR from the list of values: 0.01, 0.1, 1, 10, 100.
• Support Vector Machine (SVM) [36] is a classification model in which a maximum margin is determined to separate the classes. For NLP, the linear kernel is preferred because of its more scalable computation and sparsity handling [37]. The tuning regularization values of SVM are the same as those of LR.
• Random Forest (RF) [38] is a bagging model in which multiple decision trees are combined to reduce the variance and sensitivity to noise. The complexity of RF is mainly controlled by (i) the number of trees, (ii) the maximum depth, and (iii) the maximum number of leaves. The tuning values for (i) are 100, 300 and 500. We set (ii) to unlimited, which gives the model the highest degree of flexibility and makes it easier to adapt to new data. For (iii), the tuning values are 100, 200, 300 and unlimited.
• XGBoost - Extreme Gradient Boosting (XGB) [39] is a variant of the Gradient Boosting Tree Model (GBTM) in which multiple weak tree-based classifiers are combined and regularized to enhance the robustness of the overall model. The three hyperparameters of XGB that require tuning are the same as those of RF. It should be noted that the unlimited value of the maximum number of leaves is not applicable to XGB.
• Light Gradient Boosting Machine (LGBM) [40] is a light-weight version of GBTM. Its main advantage is scalability since the sub-trees are grown in a leaf-wise manner rather than the depth-wise manner of other GBTM algorithms. The three hyperparameters of LGBM that require tuning are the same as those of XGB.

In this work, we consider NB, LR and SVM as single models, and RF, XGB and LGBM as ensemble models.

D. Evaluation Metrics

Our multi-class classification problem can be decomposed into multiple binary classification problems. To define the standard evaluation metrics for a binary problem [6-8], we first describe four possibilities as follows.

• True positive (TP): The classifier correctly predicts that an SV has a particular characteristic.
• False positive (FP): The classifier incorrectly predicts that an SV has a particular characteristic.
• True negative (TN): The classifier correctly predicts that an SV does not have a particular characteristic.
• False negative (FN): The classifier incorrectly predicts that an SV does not have a particular characteristic.

Based on TP, FP, TN and FN, Accuracy, Precision, Recall and F-Score can be defined accordingly in (1), (2), (3), (4):

Accuracy = (TP + TN) / (TP + FP + TN + FN)   (1)
Precision = TP / (TP + FP)   (2)
Recall = TP / (TP + FN)   (3)
F-Score = (2 × Precision × Recall) / (Precision + Recall)   (4)

Whilst Accuracy measures the global performance over all classes, F-Score (a harmonic mean of Precision and Recall) evaluates each class separately. Such a local estimate like F-Score is more favorable for the imbalanced VCs such as Access Vector, Access Complexity, Authentication and Severity (cf. Fig. 1). In fact, there are several variants of F-Score for the multi-class classification problem, namely Micro, Macro and Weighted F-Scores. In the case of multi-class classification, Micro F-Score is actually the same as Accuracy. For Macro and Weighted F-Scores, the former does not consider the class distribution (the number of elements in each class) for computing the F-Score of each class, whereas the latter does. To account for the balanced and imbalanced VCs globally and locally, we use Accuracy, Macro and Weighted F-Scores to evaluate our models. For model selection, if there is a performance tie among models regarding Accuracy and/or Macro F-Score, Weighted F-Score is chosen as the discriminant criterion. The reason is that Weighted F-Score can be considered a compromise between Macro F-Score and Accuracy. If the tie still exists, the less complex model with the smaller number of hyperparameters is selected as per the Occam's razor principle [41]. In the last tie scenario, the model with shorter training time is chosen.
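These metrics map directly onto scikit-learn's scorers; a small sketch with illustrative label arrays:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = ["LOW", "HIGH", "MEDIUM", "HIGH", "LOW"]
y_pred = ["LOW", "MEDIUM", "MEDIUM", "HIGH", "HIGH"]

accuracy = accuracy_score(y_true, y_pred)
# Macro F-Score: unweighted mean of the per-class F-Scores.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Weighted F-Score: per-class F-Scores weighted by class support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```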
V. EXPERIMENTAL RESULTS AND DISCUSSION

A. RQ1: Is our time-based cross-validation more effective than a non-temporal method to handle concept drift in the model selection step for vulnerability assessment?

We performed both qualitative and quantitative analyses to demonstrate the relationship between concept drift and the model selection step of vulnerability assessment. Firstly, it is intuitive that the data of SVs intrinsically changes over time because of new products, software and attack vectors. The number of new terms appearing in the NVD descriptions each year during the period from 2000 to 2017 is given in Fig. 4.
Fig. 4. The number of new terms from 2000 to 2017 in the vulnerability descriptions in NVD.

On average, there are 7,345 new terms added to the vocabulary each year. Moreover, from 2015 to 2017, the number of new terms has been consistently increasing and achieved an all-time high value of 14,684 in 2017. In Fig. 5, we also highlight some concrete examples of terms appearing in the database after a particular technology, product or attack was released. There seems to be a strong correlation between the time of appearance of some new terms in the descriptions and their years of release or discovery.

Fig. 5. Examples of new terms in NVD corresponding to new products, software and cyber-attacks from 2000 to 2018. Note: The year of release/discovery is put in parentheses.

Such unseen terms contain many concepts about new products (e.g., Firefox, Skype, and iPhone), operating systems (e.g., Android, Windows Vista/7/8/10), technologies (e.g., Ajax, jQuery, and Node.js), and attacks (e.g., Code Red, Slammer, and Stuxnet worms). There are also extended forms of existing ones, such as the updated versions of Java Standard Edition (Java SE) each year. These qualitative results depict that if the time property of SVs is not considered in the model selection step, then the future terms can be mixed with past ones. Such information leakage can result in a discrepancy in the real model performance.

In fact, the main goal of the validation step is to select the optimal models that can exhibit similar behavior on unseen data. Next, we quantitatively compared the degree of model overfitting using our time-based cross-validation method with a stratified non-temporal one used in [6, 7]. For each method, we computed the Weighted F-Score difference between the cross-validated and testing results of the optimal models found in the validation step (cf. Fig. 6). The model selection procedures and selection criteria of the normal cross-validation method are the same as those of our temporal one.

Fig. 6. Difference between the validated and testing Weighted F-Scores of our time-based and a normal cross-validation method.

Fig. 6 shows that the traditional non-temporal cross-validation was overfitted in four out of seven cases (i.e., Availability, Access Vector, Access Complexity, and Authentication). Especially, the degrees of overfitting of the non-temporal validation method were 1.8, 4.7 and 1.8 times higher than those of the time-based version for Availability, Access Vector, and Access Complexity, respectively. For the other three VCs, both methods were similar, in which the differences were within 0.02. Moreover, on average, the Weighted F-Scores on the testing set of the non-temporal cross-validation method were only 0.002 higher than those of our approach. This value is negligible compared to the difference of 0.02 (ten times more) in the validation step. It is noted that a similar comparison also holds for non-stratified non-temporal cross-validation. Overall, both qualitative and quantitative findings suggest that the time-based cross-validation method should be preferred to lower the performance overestimation and mis-selection of predictive models due to the effect of concept drift in the model selection step of SVs.

The summary answer to RQ1: The qualitative results show that many new terms are regularly added to NVD after the release or discovery of the corresponding software products or cyber-attacks. Normal methods mixing these new terms can inflate the cross-validated model performance. Quantitatively, the optimal models found by our time-based cross-validation are also less overfitted, especially two to five times less for Availability, Access Vector and Access Complexity. It is recommended that the time-based cross-validation should be adopted in the model selection step for vulnerability assessment.

B. RQ2: Which are the optimal models for multi-classification of each vulnerability characteristic?

The answer to RQ1 has shown that the temporal cross-validation should be used for selecting the optimal models in the context of vulnerability assessment. The work to answer RQ2 presents the detailed results of the first phase of our model. To be more specific, we have used our five-fold time-based cross-validation to select the optimal word-only model for each of the seven VCs from six classifiers (cf. section IV.C) and eight NLP representations (cf. Table 2). More specifically, we have followed the guidelines of the previous work [6] to extract only the words appearing in more than 0.1% of all descriptions as features for RQ2.
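A sketch of this frequency cut-off using scikit-learn (min_df given as a proportion of documents; train_descriptions and test_descriptions are assumed to be lists of preprocessed description strings):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Keep only words appearing in more than 0.1% of all descriptions.
vectorizer = CountVectorizer(min_df=0.001)
X_train = vectorizer.fit_transform(train_descriptions)
# The same fitted vocabulary is reused for the validation/testing sets.
X_test = vectorizer.transform(test_descriptions)
```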
Firstly, each classifier was tuned using random VCs to select its optimal set of hyperparameters. The selected hyperparameters are reported in Table 3.

TABLE 3. OPTIMAL HYPERPARAMETERS FOUND FOR EACH CLASSIFIER.

Classifier | Hyperparameters
NB | None
LR | Regularization value: 0.1 for term frequency, 10 for tf-idf
SVM | Kernel: linear; Regularization value: 0.1
RF | Number of trees: 100; Max. depth: unlimited; Max. number of leaf nodes: unlimited
XGB | Number of trees: 100; Max. depth: unlimited; Max. number of leaf nodes: 100
LGBM | Number of trees: 100; Max. depth: unlimited; Max. number of leaf nodes: 100

It is worth noting that we have utilized local optimization as a filter to reduce the search space. We found that 0.1 was a consistently good value of the regularization coefficient for SVM. Unlike SVM, for LR, 0.1 was suitable for the term frequency representation, whereas 10 performed better for the case of tf-idf. One possible explanation is that LR provides a decision boundary that is more sensitive to the hyperparameter. Additionally, although tf-idf with l2-normalization helps the model converge faster, it usually requires more regularization to avoid overfitting [42]. For ensemble models, more hyperparameters need tuning as mentioned in section IV.C. Regarding the maximum number of leaves, the optimal value for RF was unlimited, which is expected since it gives the model more flexibility. However, for XGB and LGBM, the unlimited value was not available. In fact, a higher value did not improve the performance, but significantly increased the computational time. As a result, we chose 100 to be the number of leaves for XGB and LGBM. Similarly, we obtained 100 as a good value for the number of trees of each ensemble model. We noticed that the maximum depth of the ensemble methods was the hyperparameter that affected the validation result the most; the others did not change the performance dramatically. Finally, we got a search space of size 336 in the cross-validation step ((six classifiers) × (eight NLP configurations) × (seven characteristics)). After using our five-fold time-based cross-validation method in section III.C, the optimal validation results are given in Table 4.
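For reference, the six classifiers and their tuning grids can be laid out as follows; this is a hedged sketch, assuming xgboost and lightgbm are installed alongside scikit-learn, and parameter names can vary across library versions:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Candidate models and hyperparameter grids (cf. section IV.C);
# None stands for the "unlimited" number of leaf nodes.
search_space = {
    "NB": (MultinomialNB(), {}),
    "LR": (LogisticRegression(multi_class="ovr"),
           {"C": [0.01, 0.1, 1, 10, 100]}),
    "SVM": (LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}),
    "RF": (RandomForestClassifier(),
           {"n_estimators": [100, 300, 500],
            "max_leaf_nodes": [100, 200, 300, None]}),
    "XGB": (XGBClassifier(), {"n_estimators": [100, 300, 500],
                              "max_leaves": [100, 200, 300]}),
    "LGBM": (LGBMClassifier(), {"n_estimators": [100, 300, 500],
                                "num_leaves": [100, 200, 300]}),
}
```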
Besides each output, we also examined the validated results across different types of classifiers (single vs. ensemble models) and NLP representations (n-grams and tf-idf vs. term frequency).

TABLE 5. AVERAGE CROSS-VALIDATED WEIGHTED F-SCORES OF TERM FREQUENCY VS. TF-IDF GROUPED BY SIX CLASSIFIERS.

Representation | NB | LR | SVM | RF | XGB | LGBM
Term frequency | 0.781 | 0.833 | 0.835 | 0.843 | 0.846 | 0.846
tf-idf | 0.786 | 0.832 | 0.831 | 0.836 | 0.843 | 0.844

Since the NLP representations mostly affect the classifiers, their validated results are grouped by the six classifiers in Table 5 and Table 6. The result shows that tf-idf did not outperform term frequency for five out of six classifiers. This result agrees with the existing work [6, 7]. It seemed that n-grams with n > 1 improved the result. We used a right-tailed unpaired two-sample t-test to check the significance of such improvement of n-grams (n > 1). The P-value was 0.169, which was larger than the significance level of 0.05. Thus, we were unable to accept the improvement of n-grams over uni-gram. Furthermore, there was no performance improvement after increasing the number of n-grams. The above three observations implied that the more complex NLP representations did not provide a statistically significant improvement over the simplest BoW (configuration one in Table 2). This argument helped explain why three out of seven optimal models in Table 4 were BoW.

TABLE 6. AVERAGE CROSS-VALIDATED WEIGHTED F-SCORES OF UNI-GRAM VS. N-GRAMS (2 ≤ N ≤ 4) GROUPED BY SIX CLASSIFIERS.

Classifier | 1-gram | 2-grams | 3-grams | 4-grams
NB | 0.756 | 0.778 | 0.784 | 0.785
LR | 0.821 | 0.835 | 0.836 | 0.836
SVM | 0.823 | 0.835 | 0.836 | 0.837
RF | 0.838 | 0.840 | 0.838 | 0.838
XGB | 0.844 | 0.845 | 0.846 | 0.846
LGBM | 0.845 | 0.845 | 0.845 | 0.845

Along with the NLP representations, we also investigated the performance difference between single (NB, LR, and SVM) and ensemble (RF, XGB, and LGBM) models. The average Weighted F-Scores grouped by VCs for single and ensemble models are illustrated in Fig. 7. Ensemble models seemed to consistently demonstrate superior performance compared to their single counterparts. It was also observed that the ensemble methods produced mostly consistent results (i.e., small variance) for the Access Vector and Authentication properties.

We performed right-tailed unpaired two-sample t-tests to check the significance of the better performance of ensemble over single models. Table 7 reports the P-values of the results from the hypothesis testing.
TABLE 7. P-VALUES OF H0: ENSEMBLE MODELS ≤ SINGLE MODELS FOR EACH VC.

Vulnerability characteristic | P-value
Confidentiality | 3.261 × 10^-5
Integrity | 9.719 × 10^-5
Availability | 3.855 × 10^-5
Access Vector | 2.320 × 10^-3
Access Complexity | 1.430 × 10^-5
Authentication | 1.670 × 10^-3
Severity | 1.060 × 10^-7

The hypothesis testing confirmed that the superiority of the ensemble methods was significant since all P-values were smaller than the significance level of 0.05. The validated results in Table 4 also affirmed that six out of seven optimal classifiers were ensemble (i.e., LGBM and XGB). It is noted that the XGB model usually takes more time to train than the LGBM model, especially for the tf-idf representation. Our findings suggest that LGBM, XGB and BoW should be considered as baseline classifiers and NLP representations for future vulnerability-related research.

The summary answer to RQ2: LGBM and BoW are the most frequent optimal classifiers and NLP representations. Overall, the more complex NLP representations such as n-grams and tf-idf do not provide a statistically significant performance improvement over BoW. The ensemble models perform statistically better than single ones. It is recommended that the ensemble classifiers (e.g., XGB and LGBM) and BoW should be used as baseline models for vulnerability analytics.
C. RQ3: How effective is our character-word model to perform automated vulnerability assessment with concept drift?

The OoV terms presented in RQ1 actually have a direct impact on the word-only models. Such missing features can make the model unable to produce reliable results. Especially when no existing term is found (i.e., all features are zero), the model would have the same output regardless of the context. To answer RQ3, we first tried to identify such all-zero cases in the vulnerability descriptions from 2000 to 2018. For each year from 2000 to 2018, we split the dataset into (i) a training set (data from the previous year backward) for building the vocabulary, and (ii) a testing set (data from the current year to 2018) for checking the vocabulary existence. We found 64 cases from 2000 to 2018 in the testing data in which all the features were missing (cf. Appendix). We used the terms appearing in at least 0.1% of all descriptions. It should be noted that the number of all-zero cases may be reduced using a larger vocabulary with the trade-off of a larger computational time. We also investigated the descriptions of these vulnerabilities and found several interesting patterns. The average length of these abnormal descriptions was only 7.98 words compared to 39.17 words for all descriptions. It turned out that the information about the threats and sources of such SVs was limited. Most of them just included the assets and attack/vulnerability types. For example, the vulnerabilities with IDs of the form CVE-2016-10001xx had nearly the same format "Reflected XSS in WordPress plugin", where the only differences were the name and version of the plugin. This format made it hard for the model to evaluate the impact of each vulnerability separately. Another issue was due to specialized or abbreviated terms such as /redirect?url=, XSS, SEGV and CSRF without proper explanation. The above issues suggest that vulnerability descriptions should be written with sufficient information to enhance the comprehensibility of SVs.

For RQ3, the solution to the issue of the word-only model using character-level features is evaluated. We considered the non-stop-words with high frequency (i.e., appearing in more than 10% of all descriptions) to generate the character features. Using the same 0.1% value as in RQ2 increased the dimensions more than 30 times, but the performance only changed within 0.02. According to Algorithm 1, the output minimum number of character n-grams was chosen to be two. We first tested the robustness of the character-only models by setting the maximum number of characters to only three. For each year y from 1999 to 2017, we used such a character model to generate the characters from the data of the considered year y backward. We then verified the existence of such features using the descriptions of the other part of the data (i.e., from year y + 1 towards 2018). Surprisingly, the model using only two-to-three-character n-grams could produce at least one non-zero feature for all the descriptions even when using only the training data in 1999 (i.e., the first year in our dataset based on the vulnerability identification). Such a finding shows that our approach is stable to vulnerability data changes (concept drift) in the testing data from 2000 to 2018 even with a limited amount of data and without retraining.
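A compact sketch of this coverage check follows (set-based rather than the authors' exact matrix-based implementation; assumes records with year and preprocessed description fields):

```python
def char_ngrams(text, n_min=2, n_max=3):
    """All character n-grams of each word in the given text."""
    grams = set()
    for word in text.split():
        for n in range(n_min, n_max + 1):
            grams.update(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

def coverage_gaps(records, train_up_to_year):
    """Descriptions after the training years with no known feature."""
    vocab = set()
    for r in records:
        if r["year"] <= train_up_to_year:
            vocab |= char_ngrams(r["description"])
    return [r for r in records
            if r["year"] > train_up_to_year
            and not (char_ngrams(r["description"]) & vocab)]

# Per the result above, coverage_gaps(svs, 1999) came back empty,
# i.e., every later description shared at least one character n-gram.
```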
Next, to increase the generalizability of our approach, values from three to ten were considered for selecting the maximum number of character n-grams based on their corresponding vocabulary sizes (cf. Fig. 8). Using the elbow method in cluster analysis, six was selected since the vocabulary size did not increase dramatically after this point. The selected minimum and maximum values of character n-grams turned out to match the minimum and average word lengths of all NVD descriptions in our dataset, respectively.

Fig. 8. The relationship between the size of vocabulary and the maximum number of character n-grams.

We then used the feature aggregation algorithm (cf. section III.D) to create the aggregated features from the character n-grams (2 ≤ n ≤ 6) and word n-grams to build the final model set and compared it with two baselines: the Word-only Model (WoM) and the Character-only Model (CoM). It should be noted that WoM is the model in which concept drift is not handled. Unfortunately, a direct comparison with the existing WoM [6] was not possible since they used an older NVD dataset and did not release their source code for reproduction. However, we tried to set up the experiments based on the guidelines and results in the previous paper.
TABLE 8. PERFORMANCE (ACCURACY / MACRO F-SCORE / WEIGHTED F-SCORE) OF OUR CHARACTER-WORD, WORD-ONLY AND CHARACTER-ONLY MODELS.

Vulnerability characteristic | Our optimal model (CWM) | Word-only model (WoM) | Character-only model (CoM)
Confidentiality | 0.727 / 0.717 / 0.728 | 0.722 / 0.708 / 0.723 | 0.694 / 0.683 / 0.698
Integrity | 0.763 / 0.749 / 0.764 | 0.763 / 0.744 / 0.764 | 0.731 / 0.718 / 0.734
Availability | 0.712 / 0.711 / 0.711 | 0.700 / 0.696 / 0.702 | 0.660 / 0.657 / 0.660
Access Vector | 0.914 / 0.540 / 0.901 | 0.904 / 0.533 / 0.894 | 0.910 / 0.538 / 0.899
Access Complexity | 0.703 / 0.468 / 0.673 | 0.718 / 0.476 / 0.691 | 0.700 / 0.457 / 0.668
Authentication | 0.875 / 0.442 / 0.844 | 0.864 / 0.425 / 0.832 | 0.866 / 0.441 / 0.840
Severity | 0.668 / 0.575 / 0.663 | 0.686 / 0.569 / 0.675 | 0.661 / 0.549 / 0.652
To be more specific, we used BoW predictors and random forest (the best of their three models) with the following hyperparameters: the number of trees was 100 and the number of features for splitting was 40. For CoM, we used the same optimal classifiers for each VC. The comparison results are given in Table 8. CWM performed slightly better than WoM for four out of seven VCs regarding all evaluation metrics. Also, 4.98% of the features of CWM were non-zero, which was nearly five times denser than the 1.03% of WoM. CoM was the worst model among the three, which had been expected since it contained the least information (smallest number of features). Although CWM does not significantly outperform WoM, its main advantage is to effectively handle the OoV terms (concept drift), except for new terms without any matching parts. We hope that our solution to concept drift will be integrated into practitioners' existing frameworks and future research work to perform more robust SVs analytics.

The summary answer to RQ3: The WoM does not handle the new cases well, especially those with all zero-value features. Without retraining, the tri-gram character features can still handle the OoV words effectively with no all-zero features for all testing data from 2000 to 2018. Our CWM performs comparably well with the existing WoM and provides nearly five-time richer information. Hence, our CWM is better for automated vulnerability assessment with concept drift.

D. RQ4: To what extent can a low-dimensional model retain the original performance?
The n-gram NLP models usually have an issue with high-dimensional and sparse feature vectors [25]. The feature sizes of our CWMs in Table 8 were large: 1,649 for Confidentiality, Availability and Access Complexity; 4,154 for Integrity and Access Vector; 3,062 for Authentication; and 5,104 for Severity. To address such a challenge in RQ4, we investigated a dimensionality reduction method (i.e., Latent Semantic Analysis (LSA) [30]) and recent sub-word embeddings (i.e., fastText [31, 32]) for vulnerability classification. fastText is an extension of Word2Vec [43] word embeddings, in which the character-level features are also considered. fastText is different from traditional n-grams in the sense that it determines the meaning of a word/subword based on the surrounding context. Here, we computed the sentence representation as the average fastText embedding of its constituent words and characters. We implemented fastText using the Gensim [44] library in Python. For LSA, using the elbow method and the total explained variance of the principal components, we selected 300 dimensions and called this model LSA-300.
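A hedged sketch of the two compact variants is shown below; X_agg stands for the sparse character-word matrix from Algorithm 1, tokenized_descriptions is a list of token lists, and parameter names may vary across Gensim versions:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from gensim.models import FastText

# LSA-300: project the sparse character-word features onto
# their top 300 latent dimensions.
lsa = TruncatedSVD(n_components=300)
X_lsa = lsa.fit_transform(X_agg)

# fastText-300: train 300-dimension sub-word embeddings on the
# tokenized vulnerability descriptions, then average per sentence.
ft = FastText(sentences=tokenized_descriptions, vector_size=300)
X_ft = np.array([
    np.mean([ft.wv[tok] for tok in doc], axis=0) if doc else np.zeros(300)
    for doc in tokenized_descriptions
])
```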
TABLE 9. WEIGHTED F-SCORES OF OUR ORIGINAL CWM (BASELINE), 300-DIMENSION LATENT SEMANTIC ANALYSIS (LSA-300), FASTTEXT TRAINED ON VULNERABILITY DESCRIPTIONS (FASTTEXT-300) AND FASTTEXT TRAINED ON ENGLISH WIKIPEDIA PAGES (FASTTEXT-300W).

Vulnerability characteristic | Our CWM | LSA-300 | fastText-300 | fastText-300W
Confidentiality | 0.728 | 0.656 | 0.679 | 0.648
Integrity | 0.764 | 0.695 | 0.719 | 0.672
Availability | 0.711 | 0.656 | 0.687 | 0.669
Access Vector | 0.901 | 0.892 | 0.893 | 0.866
Access Complexity | 0.673 | 0.611 | 0.679 | 0.678
Authentication | 0.844 | 0.842 | 0.815 | 0.765
Severity | 0.663 | 0.656 | 0.654 | 0.635

Table 9 shows that LSA-300 retained from 90% to 99% of the performance of the original model, but used only 300 dimensions (6-18% of the original model sizes). More remarkably, with the same 300 dimensions, the fastText model trained on the vulnerability descriptions was on average better than LSA-300 (97% vs. 94.5%). The fastText model even slightly outperformed our original CWM for Access Complexity. Moreover, for all seven cases, the fastText model using vulnerability knowledge (fastText-300) had higher Weighted F-Scores than the one trained on English Wikipedia pages (fastText-300W) [45]. The result implied that vulnerability descriptions contain specific terms that do not frequently appear in general domains. The domain relevance turns out to be applicable not only to word embeddings [8], but also to character/sub-word embeddings for vulnerability analysis and assessment. Overall, our findings show that LSA and fastText are capable of building efficient models without too much performance trade-off.

The summary answer to RQ4: The LSA model with 300 dimensions (6-18% of the original size) retains from 90% up to 99% of the performance of the original model. With the same dimensions, the model with fastText sub-word embeddings provides even more promising results. The fastText model with the vulnerability knowledge outperforms that trained on the general context (e.g., Wikipedia). LSA and fastText can help build efficient models for vulnerability assessment.

VI. THREATS TO VALIDITY

Internal validity. We used well-known tools such as the scikit-learn [26] and nltk [27] libraries for ML and NLP. Our optimal models may not guarantee the highest performance for every SV since there are infinite values of hyperparameters to tune. However, even when the optimal values change, a time-based cross-validation method should still be preferred since we have considered the general trend of all SVs. Our model may not provide the state-of-the-art results, but at least it gives the baseline performance for handling the concept drift of SVs.
External validity. Our work used NVD – one of the most comprehensive public repositories of SVs. The size of our dataset is more than 100,000 SVs, including the latest vulnerabilities in 2018. Our character-word model has also been demonstrated to consistently handle the OoV words well even with very limited data for all years in the dataset. It is recognized that the model may not work for extremely rare terms for which no existing parts can be found. However, our model is totally re-trainable to deal with such cases or to incorporate more sources of SVs' descriptions.

Conclusion validity. We mitigated the randomness of the results by taking the average value of five-fold cross-validation. The performance comparison of different types of classifiers and NLP representations was also confirmed using statistical hypothesis tests with P-values that were much lower than the significance level of 5%.

VII. RELATED WORKS

A. Software Vulnerability Analytics

It is important to patch the most critical vulnerabilities first [46]. Thus, besides CVSS, there have been many effective evaluation schemes for SVs [47-49]. Recently, there has been a detailed Bayesian analysis of various vulnerability scoring systems [50], which highlights the good overall performance of CVSS. Therefore, we used the well-known CVSS as the ground truth for our approach. We assert that our approach can be generalizable to other vulnerability rating systems following the same scheme of multi-class classification.

Regarding the predictive analytics of SVs, Bozorgi et al. [15] pioneered the use of ML models for SVs analysis. Their paper used an SVM model and various features (e.g., NVD description, CVSS, published and modified dates) to estimate the likelihood of exploitation and time-to-exploit of SVs. Other works analyzed the VCs and trends of SVs by incorporating different vulnerability information from multiple vulnerability repositories [11, 14], security advisories [18, 51], darkweb/deepnet [14, 52] and a social network (Twitter) [53]. These efforts assumed that all VCs have been available at the time of analysis. However, our work relaxes this assumption by using only the vulnerability description – one of the first pieces of information about new SVs. Our model can be used for both new and old SVs.

Actually, the descriptions have also been utilized for vulnerability analysis and prediction. Yamamoto et al. [17] used Linear Discriminant Analysis, Naïve Bayes and Latent Semantic Indexing combined with annual effect estimation to determine the VCs of more than 60,000 SVs in NVD. The annual effect focused on the recent SVs, but still could not explicitly handle the OoV terms in the descriptions. Spanos and Angelis [6] worked on the same task using a multi-target framework with Decision Tree, Random Forest and Gradient Boosting Tree. Our approach also contains the word-only model, but we select the optimal models using our time-based cross-validation to better address the concept drift issue. The vulnerability descriptions were also used to evaluate the vulnerability severity [7], associate the frequent terms with each VC [16], determine the type of each SV using topic modeling [12] and show vulnerability trends [11]. Recently, Han et al. [8] have applied deep learning to predict vulnerability severity. The existing literature has demonstrated the usefulness of descriptions for vulnerability analysis and assessment, but has not mentioned how to overcome the concept drift challenge. Our work is the first of its kind to provide a robust treatment for SVs' concept drift.

B. Temporal modeling of software vulnerabilities

Regarding the temporal relationship of SVs, Roumani [54] proposed a time-series approach using autoregressive integrated moving average and exponential smoothing methods to predict the number of vulnerabilities in the future. Another time-series work [55] was described to model the trend in disclosing SVs. A group of researchers led by Tsokos published a series of works [56-58] on stochastic models such as Hidden Markov Models, Artificial Neural Networks, and Support Vector Machines to estimate the occurrence and exploitability of vulnerabilities. The focus of the above studies was on the determination of the occurrence of SVs over time. In contrast, our work aims to handle the temporal relationship to build more robust predictive models for SVs assessment.

VIII. CONCLUSION AND FUTURE WORK

We observe that the existing works suffer from concept drift in the vulnerability descriptions, which affects both the traditional model selection and prediction steps of SVs assessment. We assert that concept drift can degrade the robustness of existing predictive models. We show that time-based cross-validation should be used for vulnerability analysis to better capture the temporal relationship of SVs. Then, our main contribution is the Character-Word Models (CWMs) to improve the robustness of automated SVs assessment with concept drift. CWMs have been demonstrated to handle the concept drift of SVs effectively for all the testing data from 2000 to 2018 in NVD even in the case of data scarcity. Our approach has also performed comparably well with the existing word-only models. Our CWMs are also much less sparse and thus less prone to overfitting. We have also found that Latent Semantic Analysis and sub-word embeddings like fastText help build compact and efficient CWM models (up to 94% reduction in dimension) with the ability to retain at least 90% of the performance for all VCs. Besides the good performance, implications on the use of different models are also given to support practitioners and researchers with vulnerability analytics. Hopefully, this work can open up various research avenues to develop more sophisticated concept-drift-aware models in SVs and related areas.

In the future, we plan to investigate the performance of deep learning models to embed the dependency of both character and word features in a low-dimensional space for vulnerability prediction. Alongside concept drift, handling imbalanced data is also a concern for future SVs research.

APPENDIX

64 vulnerabilities (CVE-ID) with all-zero features of the word-only model (cf. section V.C) from 2000 to 2018:
CVE-2013-6647, CVE-2015-1000004, CVE-2016-1000113, CVE-2016-1000114, CVE-2016-1000117, CVE-2016-1000118, CVE-2016-1000126, CVE-2016-1000127, CVE-2016-1000128, CVE-2016-1000129, CVE-2016-1000130, CVE-2016-1000131, CVE-2016-1000132, CVE-2016-1000133, CVE-2016-1000134, CVE-2016-1000135, CVE-2016-1000136, CVE-2016-1000137, CVE-2016-1000138, CVE-2016-1000139, CVE-2016-1000140, CVE-2016-1000141, CVE-2016-1000142, CVE-2016-1000143, CVE-2016-1000144, CVE-2016-1000145, CVE-2016-1000146, CVE-2016-1000147, CVE-2016-1000148, CVE-2016-1000149, CVE-2016-1000150, CVE-2016-1000151, CVE-2016-1000152, CVE-2016-1000153, CVE-2016-1000154, CVE-2016-1000155, CVE-2016-1000217, CVE-2017-10798, CVE-2017-10801, CVE-2017-14036, CVE-2017-14536, CVE-2017-15808, CVE-2017-16760, CVE-2017-16785, CVE-2017-17499, CVE-2017-17703, CVE-2017-17774, CVE-2017-6102, CVE-2017-7276, CVE-2017-8783, CVE-2018-10030, CVE-2018-10031, CVE-2018-10382, CVE-2018-11120, CVE-2018-11405, CVE-2018-12501, CVE-2018-13997, CVE-2018-14382, CVE-2018-5285, CVE-2018-5361, CVE-2018-6467, CVE-2018-6834, CVE-2018-8817, CVE-2018-9130
REFERENCES
[1] The MITRE Corporation. Common Vulnerabilities and Exposures [Online]. Available: https://cve.mitre.org/ [Accessed: 25/12/2018].
[2] CEA Report: The Cost of Malicious Cyber Activity to the U.S. Economy [Online]. Available: https://www.whitehouse.gov/articles/cea-report-cost-malicious-cyber-activity-u-s-economy/ [Accessed: 25/12/2018].
[3] K. Nayak, D. Marino, P. Efstathopoulos, and T. Dumitraş, "Some Vulnerabilities Are Different Than Others," Cham, 2014, pp. 426-446.
[4] S. Khan and S. Parkinson, "Review into State of the Art of Vulnerability Assessment using Artificial Intelligence," in Guide to Vulnerability Analysis for Computer Networks and Systems: An Artificial Intelligence Approach, S. Parkinson, A. Crampton, and R. Hill, Eds. Cham: Springer International Publishing, 2018, pp. 3-32.
[5] V. Smyth, "Software vulnerability management: how intelligence helps reduce the risk," Network Security, vol. 2017, pp. 10-12, 2017.
[6] G. Spanos and L. Angelis, "A multi-target approach to estimate software vulnerability characteristics and severity scores," Journal of Systems and Software, vol. 146, pp. 152-166, 2018.
[7] G. Spanos, L. Angelis, and D. Toloudis, "Assessment of Vulnerability Severity using Text Mining," in 21st Pan-Hellenic Conference on Informatics, Larissa, Greece, 2017, pp. 1-6.
[8] Z. Han, X. Li, Z. Xing, H. Liu, and Z. Feng, "Learning to Predict Severity of Software Vulnerability Using Only Vulnerability Description," in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2017, pp. 125-136.
[9] National Vulnerability Database [Online]. Available: https://nvd.nist.gov/ [Accessed: 25/12/2018].
[10] M. A. Williams, S. Dey, R. C. Barranco, S. M. Naim, M. S. Hossain, and M. Akbar, "Analyzing Evolving Trends of Vulnerabilities in National Vulnerability Database," in 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3011-3020.
[11] S. S. Murtaza, W. Khreich, A. Hamou-Lhadj, and A. B. Bener, "Mining trends and patterns of software vulnerabilities," Journal of Systems and Software, vol. 117, pp. 218-228, 2016.
[12] S. Neuhaus and T. Zimmermann, "Security Trend Analysis with CVE Topic Models," in 2010 IEEE 21st International Symposium on Software Reliability Engineering, 2010, pp. 111-120.
[13] B. L. Bullough, A. K. Yanchenko, C. L. Smith, and J. R. Zipkin, "Predicting Exploitation of Disclosed Software Vulnerabilities Using Open-source Data," in 3rd ACM on International Workshop on Security And Privacy Analytics, Scottsdale, Arizona, USA, 2017, pp. 45-53.
[14] M. Almukaynizi, E. Nunes, K. Dharaiya, M. Senguttuvan, J. Shakarian, and P. Shakarian, "Patch Before Exploited: An Approach to Identify Targeted Software Vulnerabilities," in AI in Cybersecurity, L. F. Sikos, Ed. Cham: Springer International Publishing, 2019, pp. 81-113.
[15] M. Bozorgi, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond heuristics: learning to classify vulnerabilities and predict exploits," in 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2010, pp. 105-114.
[16] D. Toloudis, G. Spanos, and L. Angelis, "Associating the Severity of Vulnerabilities with their Description," Cham, 2016, pp. 231-242.
[17] Y. Yamamoto, D. Miyamoto, and M. Nakayama, "Text-Mining Approach for Estimating Vulnerability Score," in 2015 4th International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), 2015, pp. 67-73.
[18] M. Edkrantz, S. Truvé, and A. Said, "Predicting Vulnerability Exploits in the Wild," in 2015 IEEE 2nd International Conference on Cyber Security and Cloud Computing, 2015, pp. 513-514.
[19] M.-T. Luong, I. Sutskever, Q. V. Le, O. Vinyals, and W. Zaremba, "Addressing the rare word problem in neural machine translation," arXiv preprint arXiv:1410.8206, 2014.
[20] C.-C. Huang, H.-C. Yen, P.-C. Yang, S.-T. Huang, and J. S. Chang, "Using sublexical translations to handle the OOV problem in machine translation," ACM Transactions on Asian Language Information Processing (TALIP), vol. 10, p. 16, 2011.
[21] A. Liu and K. Kirchhoff, "Context Models for OOV Word Translation in Low-Resource Languages," arXiv preprint arXiv:1801.08660, 2018.
[22] M. Razmara, M. Siahbani, R. Haffari, and A. Sarkar, "Graph propagation for paraphrasing out-of-vocabulary words in statistical machine translation," in 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2013, pp. 1105-1115.
[23] FIRST. Common Vulnerability Scoring System v3.0: Specification Document [Online]. Available: https://www.first.org/cvss/specification-document [Accessed: 25/12/2018].
[24] FIRST. Common Vulnerability Scoring System [Online]. Available: https://www.first.org/cvss/ [Accessed: 25/12/2018].
[25] A. Kao and S. R. Poteet, Natural Language Processing and Text Mining. Springer Science & Business Media, 2007.
[26] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[27] S. Bird and E. Loper, "NLTK: the natural language toolkit," in ACL 2004 on Interactive Poster and Demonstration Sessions, 2004, p. 31.
[28] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, pp. 130-137, 1980.
[29] C. Bergmeir and J. M. Benítez, "On the use of cross-validation for time series predictor evaluation," Information Sciences, vol. 191, pp. 192-213, 2012.
[30] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
[31] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," arXiv preprint arXiv:1607.04606, 2016.
[32] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," arXiv preprint arXiv:1607.01759, 2016.
[33] Kaggle [Online]. Available: https://www.kaggle.com/ [Accessed: 25/12/2018].
[34] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.
[35] S. H. Walker and D. B. Duncan, "Estimation of the probability of an event as a function of several independent variables," Biometrika, vol. 54, pp. 167-179, 1967.
[36] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.
[37] A. Basu, C. Walters, and M. Shepherd, "Support vector machines for text categorization," in 2003 36th Annual Hawaii International Conference on System Sciences, 2003, 7 pp.
[38] T. K. Ho, "Random decision forests," in 1995 3rd International Conference on Document Analysis and Recognition, 1995, pp. 278-282.
[39] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785-794.
[40] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, et al., "LightGBM: A highly efficient gradient boosting decision tree," in Advances in Neural Information Processing Systems, 2017, pp. 3146-3154.
[41] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth, "Occam's razor," Information Processing Letters, vol. 24, pp. 377-380, 1987.
[42] T. H. M. Le, T. T. Tran, and L. K. Huynh, "Identification of hindered internal rotational mode for complex chemical species: A data mining approach with multivariate logistic regression model," Chemometrics and Intelligent Laboratory Systems, vol. 172, pp. 10-16, 2018.
[43] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, 2013, pp. 3111-3119.
[44] R. Rehurek and P. Sojka, "Software framework for topic modelling with large corpora," in LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010.
[45] Pre-trained word vectors of fastText [Online]. Available: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md [Accessed: 25/12/2018].
[46] Y. Hou, X. Ren, Y. Hao, T. Mo, and W. Li, "A Security Vulnerability Threat Classification Method," in Advances on Broad-Band Wireless Computing, Communication and Applications, Cham, 2018, pp. 414-426.
[47] Q. Liu and Y. Zhang, "VRSS: A new system for rating and scoring vulnerabilities," Computer Communications, vol. 34, pp. 264-273, 2011.
[48] G. Spanos, A. Sioziou, and L. Angelis, "WIVSS: a new methodology for scoring information systems vulnerabilities," in 17th Panhellenic Conference on Informatics, Thessaloniki, Greece, 2013, pp. 83-90.
[49] R. Sharma and R. K. Singh, "An Improved Scoring System for Software Vulnerability Prioritization," in Quality, IT and Business Operations: Modeling and Optimization, P. K. Kapur, U. Kumar, and A. K. Verma, Eds. Singapore: Springer Singapore, 2018, pp. 33-43.
[50] P. Johnson, R. Lagerstrom, M. Ekstedt, and U. Franke, "Can the Common Vulnerability Scoring System be Trusted? A Bayesian Analysis," IEEE Transactions on Dependable and Secure Computing, pp. 1-1, 2018.
[51] C.-C. Huang, F.-Y. Lin, F. Y.-S. Lin, and Y. S. Sun, "A novel approach to evaluate software vulnerability prioritization," Journal of Systems and Software, vol. 86, pp. 2822-2840, 2013.
[52] E. Nunes, A. Diab, A. Gunn, E. Marin, V. Mishra, V. Paliath, et al., "Darknet and deepnet mining for proactive cybersecurity threat intelligence," in 2016 IEEE Conference on Intelligence and Security Informatics (ISI), 2016, pp. 7-12.
[53] C. Sabottke, O. Suciu, and T. Dumitras, "Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits," in USENIX Security Symposium, 2015, pp. 1041-1056.
[54] Y. Roumani, J. K. Nwankpa, and Y. F. Roumani, "Time series modeling of vulnerabilities," Computers & Security, vol. 51, pp. 32-40, 2015.
[55] M. Tang, M. Alazab, and Y. Luo, "Exploiting Vulnerability Disclosures: Statistical Framework and Case Study," in 2016 Cybersecurity and Cyberforensics Conference (CCC), 2016, pp. 117-122.
[56] S. M. Rajasooriya, C. P. Tsokos, and P. K. Kaluarachchi, "Cyber Security: Nonlinear Stochastic Models for Predicting the Exploitability," Journal of Information Security, vol. 8, p. 125, 2017.
[57] P. K. Kaluarachchi, C. P. Tsokos, and S. M. Rajasooriya, "Non-Homogeneous Stochastic Model for Cyber Security Predictions," Journal of Information Security, vol. 9, p. 12, 2017.
[58] N. R. Pokhrel, H. Rodrigo, and C. P. Tsokos, "Cybersecurity: Time Series Predictive Modeling of Vulnerabilities of Desktop Operating System Using Linear and Non-Linear Approach," Journal of Information Security, vol. 8, p. 362, 2017.