Towards An Improved Understanding of Software Vulnerability Assessment Using Data-Driven Approaches (Le, 2022, PhD Thesis)
Contents
List of Tables ix
Abstract xi
Acknowledgements xv
Dedication xvii
1 Introduction 1
1.1 Problem Statement and Research Objectives . . . . . . . . . . . . . . . . . . 2
1.2 Thesis Overview and Contributions . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Related Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
References 130
List of Figures
4.1 A vulnerable function extracted from the fixing commit b38a1b3 of a soft-
ware vulnerability (CVE-2017-1000487) in the Plexus-utils project. . . . . . 61
4.2 Methodology used to answer the research questions. . . . . . . . . . . . . . 63
4.3 Class distributions of the seven CVSS metrics. . . . . . . . . . . . . . . . . . 66
4.4 Proportions of different types of lines in a function. . . . . . . . . . . . . . . 71
4.5 Differences in testing software vulnerability assessment performance (F1-
Score and MCC) between models using different types of lines/context and
those using only vulnerable statements. . . . . . . . . . . . . . . . . . . . . . 73
4.6 Average performance (MCC) of six classifiers and five features for software
vulnerability assessment in functions. . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Exemplary software vulnerability fixing commit for the XML external entity
injection (XXE) (CVE-2016-3674) and its respective software vulnerability
contributing commit in the xstream project. . . . . . . . . . . . . . . . . . . 82
5.2 Workflow of DeepCVA for automated commit-level software vulnerability
assessment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Code changes outside of a method from the commit 4b9fb37 in the Apache
qpid-broker-j project. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 Class distributions of seven software vulnerability assessment tasks. . . . . . 89
5.5 Time-based splits for training, validating & testing. . . . . . . . . . . . . . . 90
5.6 Differences of testing MCC of the model variants compared to the proposed
DeepCVA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
List of Tables
2.1 Comparison of contributions between our review and the existing related
surveys/reviews. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Inclusion and exclusion criteria for study selection. . . . . . . . . . . . . . . 10
2.3 List of the reviewed papers in the Exploit Likelihood sub-theme of the Ex-
ploitation theme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 List of the reviewed papers in the Exploit Time sub-theme of the Exploitation
theme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 List of the reviewed papers in the Exploit Characteristics sub-theme of the
Exploitation theme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 List of the reviewed papers in the Impact theme. . . . . . . . . . . . . . . . 18
2.7 List of the reviewed papers in the Severe vs. Non-Severe sub-theme of the
Severity theme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.8 List of the reviewed papers in the Severity Levels sub-theme of the Severity
theme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.9 List of the reviewed papers in the Severity Score sub-theme of the Severity
theme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.10 List of the reviewed papers in the Type theme. . . . . . . . . . . . . . . . . 24
2.11 List of the reviewed papers in the Miscellaneous Tasks theme. . . . . . . . . 27
2.12 The frequent data sources, features, models, evaluation techniques and eval-
uation metrics used for the five identified SV assessment themes. . . . . . . 30
2.13 The mapping between the themes/tasks and the respective studies collected
from May 2021 to February 2022. . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1 Word and character n-grams extracted from the sentence “Hello World”. ‘_’
represents a space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 The eight configurations of Natural Language Processing representations
used for model selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.3 Optimal hyperparameters found for each classifier. . . . . . . . . . . . . . . 50
3.4 Optimal models and results after the validation step. . . . . . . . . . . . . . 50
3.5 Average cross-validated Weighted F-scores of term frequency vs. tf-idf grouped
by six classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.6 Average cross-validated Weighted F-scores of uni-gram vs. n-grams (2 ≤ n
≤ 4) grouped by six classifiers. . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.7 P -values of H0 : Ensemble models ≤ Single models for each vulnerability
characteristic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Performance (Accuracy, Macro F1-Score, Weighted F1-Score) of our character-
word vs. word-only and character-only models. . . . . . . . . . . . . . . . . 54
3.9 Weighted F1-Scores of our original Character-Word Model, 300-dimension
Latent Semantic Analysis (LSA-300), fastText trained on SV descriptions
(fastText-300) and fastText trained on English Wikipedia pages (fastText-
300W). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.1 The number of commits and projects after each filtering step. . . . . . . . . 90
5.2 Testing performance of DeepCVA and baseline models. . . . . . . . . . . . . 93
5.3 Testing performance (MCC) of optimal baselines using oversampling tech-
niques and multi-task DeepCVA. . . . . . . . . . . . . . . . . . . . . . . . . 97
6.1 Content-based thresholds (aSO/SSE & bSO/SSE ) for the two steps of the
content-based filtering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.2 The obtained software vulnerability posts using our tag-based and content-
based filtering heuristics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.3 Top-5 tags of software vulnerability, security and general posts on Stack
Overflow and Security StackExchange. . . . . . . . . . . . . . . . . . . . . . 109
6.4 Software vulnerability topics on Stack Overflow and Security StackExchange
identified by Latent Dirichlet Allocation along with their proportions and
trends over time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.5 General expertise in terms of average reputation of each topic on Stack
Overflow and Security StackExchange. . . . . . . . . . . . . . . . . . . . . . 115
6.6 Answer types of software vulnerability discussions identified on Q&A websites. . 117
6.7 Top-1 answer types of 13 software vulnerability topics on Stack Overflow &
Security StackExchange. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.8 The mapping between 13 software vulnerability topics and their Common
Weakness Enumeration (CWE) values. . . . . . . . . . . . . . . . . . . . . . 119
Abstract
Software Vulnerabilities (SVs) can expose software systems to cyber-attacks, potentially
causing enormous financial and reputational damage for organizations. There have been
significant research efforts to detect these SVs so that developers can promptly fix them.
However, fixing SVs is complex and time-consuming in practice, and thus developers usually
do not have sufficient time and resources to fix all SVs at once. As a result, developers
often need SV information, such as exploitability, impact, and overall severity, to prioritize
fixing more critical SVs. Such information required for fixing planning and prioritization
is typically provided in the SV assessment step of the SV lifecycle. Recently, data-driven
methods have been increasingly proposed to automate SV assessment tasks. However, there
are still numerous shortcomings with the existing studies on data-driven SV assessment
that would hinder their application in practice.
This PhD thesis aims to contribute to the growing literature in data-driven SV as-
sessment by investigating and addressing the constant changes in SV data as well as the lack of consideration of source code and developers’ needs for SV assessment, both of which impede the practical applicability of the field. In particular, we have made the following five con-
tributions in this thesis. (1) We systematize the knowledge of data-driven SV assessment
to reveal the best practices of the field and the main challenges affecting its application in
practice. Subsequently, we propose various solutions to tackle these challenges to better
support the real-world applications of data-driven SV assessment. (2) We first demon-
strate the existence of the concept drift (changing data) issue in descriptions of SV reports
that current studies have mostly used for predicting the Common Vulnerability Scoring
System (CVSS) metrics. We augment report-level SV assessment models with subwords
of terms extracted from SV descriptions to help the models more effectively capture the
semantics of ever-increasing SVs. (3) We also identify that SV reports are usually released
after SV fixing. Thus, we propose using vulnerable code to enable earlier SV assessment
without waiting for SV reports. We are the first to use Machine Learning techniques to
predict CVSS metrics at the function level, leveraging vulnerable statements that directly cause SVs and their surrounding context in code functions. The performance of our function-level SV
assessment models is promising, opening up research opportunities in this new direction.
(4) To support the modern practice of continuously integrating software code, we present a novel deep
multi-task learning model, DeepCVA, to simultaneously and efficiently predict multiple
CVSS assessment metrics on the commit level, specifically using vulnerability-contributing
commits. DeepCVA is the first work that enables practitioners to perform SV assessment
as soon as vulnerable changes are added to a codebase, supporting just-in-time prioritiza-
tion of SV fixing. (5) Besides code artifacts produced from a software project of interest,
SV assessment tasks can also benefit from crowdsourced SV information on developer
Question and Answer (Q&A) websites. We automatically retrieve large-scale security/SV-
related posts from these Q&A websites. We then apply a topic modeling technique on
these posts to distill developers’ real-world SV concerns that can be used for data-driven
SV assessment. Overall, we believe that this thesis has provided evidence-based knowledge
and useful guidelines for researchers and practitioners to automate SV assessment using
data-driven approaches.
Declaration of Authorship
I certify that this work contains no material which has been accepted for the award of
any other degree or diploma in my name, in any university or other tertiary institution
and, to the best of my knowledge and belief, contains no material previously published
or written by another person, except where due reference has been made in the text. In
addition, I certify that no part of this work will, in the future, be used in a submission in
my name, for any other degree or diploma in any university or other tertiary institution
without the prior approval of the University of Adelaide and where applicable, any partner
institution responsible for the joint-award of this degree.
I acknowledge that copyright of published works contained within this thesis resides
with the copyright holder(s) of those works.
I also give permission for the digital version of my thesis to be made available on the
web, via the University’s digital research repository, the Library Search and also through
web search engines, unless permission has been granted by the University to restrict access
for a period of time.
I acknowledge the support I have received for my research through the provision of
the University of Adelaide International Wildcard Scholarship.
March 2022
Acknowledgements
This thesis would not have been possible without continuous support, guidance, and
encouragement from many people and entities. I would like to hereby acknowledge them.
Firstly, I express my deepest gratitude to my principal supervisor, Professor M. Ali
Babar, for giving me a valuable opportunity to conduct PhD research under his supervision.
His constructive feedback has motivated me to continuously reflect and improve myself to
become a better researcher and a more-rounded person in life. With his kind patience
and persistent guidance, I have also managed to navigate myself through the challenging
COVID-19 pandemic and complete my PhD research to the best of my ability. Besides
research, he has also given me great opportunities to engage in numerous teaching and
supervision activities that have tremendously helped me to enhance my communication and
interpersonal skills. All in all, working under his mentorship has profoundly transformed
me and enabled me to go beyond my limits and better prepare myself for my future career.
Secondly, I sincerely thank my co-supervisor, Professor Cheng-Chew Lim for providing
insightful comments on the research carried out in this thesis.
Thirdly, I am extremely grateful to many current/former members in the Centre for
Research on Engineering Software Technologies (CREST) at the University of Adelaide.
Special thanks to Faheem Ullah, Chadni Islam, Bushra Sabir, Aufeef Chauhan, Huaming
Chen, Bakheet Aljedaani, Mansooreh Zahedi, Hao Chen, Roland Croft, David Hin, and
Mubin Ul Haque for not only academic contributions and feedback on the research papers
related to this thesis, but also for being wonderful colleagues from whom I have learned
a lot. Specifically, I cannot give enough appreciation to Roland Croft and David Hin for
their great technical insights and contributions to improve the quality of many research
endeavours I have pursued during my PhD. In addition, I am happy to be accompanied by
Faheem Ullah and Aufeef Chauhan during weekly Friday dinners, which has helped me to
relax and recharge each week. I also appreciate Nguyen Khoi Tran for introducing me to
the CREST family. Thank you all for making my PhD journey memorable.
Fourthly, I have also had the chance to collaborate with and learn from many world-
class researchers outside of CREST such as Xuanyu Duan, Mengmeng Ge, Shang Gao,
and Xuequan Lu. I am also thankful for all the constructive feedback from the paper and
thesis reviewers that helped significantly improve the research conducted in the thesis.
Fifthly, I fully acknowledge the University of Adelaide for providing me with the Uni-
versity of Adelaide International Wildcard Scholarship and world-class facilities that have
supported me to pursue my doctoral research and activities.
Sixthly, I highly appreciate the Urbanest at the University of Adelaide for providing
me with the best-conditioned accommodation that I can ever ask for so that I can enjoy
my personal life and recharge after working hours during my PhD. I am also extremely
fortunate that Urbanest has also given me sufficient facilities to work effectively from home
during the pandemic. I also want to deeply thank my roommates, especially Zach Li, for
cheering me up during my down days.
Seventhly, I am greatly appreciative of my ASUS laptop for always being my reliable
companion and working restlessly to enable me to obtain experimental results as well as
write research papers and this thesis in a timely manner.
Finally and most importantly, I am immensely and eternally indebted to my family,
especially my grandmother, mother, and father, who always stand by my side during the
ups and downs of my PhD. Without their constant support and unconditional caring, I
would not have been able to pursue my dreams and be where I am now. I love all of you
from the bottom of my heart.
Chapter 1
Introduction
Software has become an integral part of the modern world [1]. Software systems are
rapidly increasing in size and complexity. For example, the whole software ecosystem at Google, which hosts many popular applications/services such as Google Search, YouTube, and Google Maps, contains more than two billion lines of code [2]. Quality assurance
of such large systems is a focal point for both researchers and practitioners to minimize
disruptions to millions of people around the world.
Software Vulnerabilities (SVs)1 have been long-standing issues that negatively affect
software quality [3]. SVs are security bugs that are detrimental to the confidentiality,
integrity and availability of software systems, potentially resulting in catastrophic cyber-
security attacks [4]. The exploitation of these SVs such as the Heartbleed [5] or Log4j [6]
attacks can damage the operations and reputation of millions of software systems and or-
ganizations globally. These cyber-attacks caused by SVs have led to huge financial losses as
well. According to the Australian Cyber Security Centre, losses of more than 30 billion dollars due to cyber-attacks were reported worldwide from 2020 to 2021 [7]. Therefore,
it is important to remediate critical SVs as promptly as possible.
In practice, different types of SVs pose varying levels of security threat to software-intensive systems [8]. However, fixing all SVs at the same time is not always practical due
to limited resources and time [9]. A common practice in this situation is to prioritize fixing
SVs posing imminent and serious threats to a system of interest. Such fixing prioritization
usually requires inputs from SV assessment [10, 11].
The SV assessment phase is between the SV discovery/detection and SV remedia-
tion/mitigation/fixing/patching phases in the SV management lifecycle [12], as shown in
Fig. 1.1. The assessment phase first unveils the characteristics of the SVs found in the
discovery phase to locate “hot spots” that contain many highly critical/severe SVs and re-
quire higher attention in a system. Practitioners then use the assessment outputs to devise
an optimal remediation plan, i.e., the order/priority of fixing each SV, based on available
human and technological resources. For example, an identified cross-site scripting (XSS) or
SQL injection vulnerability in a web application will likely require an urgent remediation
plan. These two types of SVs are well-known and can be easily exploited by attackers to
gain unauthorized access and compromise sensitive data/information. On the other hand,
an SV that requires admin access or happens only in a local network will probably have a
lower priority since only a few people can initiate an attack. According to the plan devised
in the assessment phase, SVs would be prioritized for fixing in the remediation phase. In
practice, the tasks in the SV assessment phase for ever-increasing SVs are repetitive and
time-consuming, and thus require automation to save time and effort for practitioners.
Given the increasing size and complexity of software systems nowadays, automation
of SV assessment tasks has attracted significant attention in the Software Engineering
community. Traditionally, static analysis tools have been the de-facto approach to SV
assessment [13]. These tools rely on pre-defined rules to determine SV characteristics.
1 In this thesis, “software vulnerability” and “security vulnerability” are used interchangeably.
Figure 1.1: Phases in an SV lifecycle. Note: The main focus of this thesis
is SV assessment.
However, these pre-defined rules require significant expertise and effort to define and are often error-prone [14]. In addition, these rules need manual modifications and extensions
to adapt to ever-changing patterns of new SVs [15]. Such manual changes do not scale well
with the fast-paced growth of SVs, potentially leading to delays in SV assessment and in
turn untimely SV mitigation. Hence, there is an apparent need for automated techniques
that can perform SV assessment without using manually-defined rules.
In the last decade, data-driven techniques have emerged as a promising alternative to
static analysis counterparts for SV assessment, as indicated in our extensive review [11]
(to be discussed in-depth in Chapter 2). The emergence of data-driven SV assessment is
mainly because of the rapid increase in size of SV data in the wild (e.g., more than 170,000
SVs were reported on the National Vulnerability Database (NVD) [16] from 2002 to 2021 [17]).
These approaches are underpinned by Machine Learning (ML), Deep Learning (DL) and
Natural Language Processing (NLP) models that are capable of automatically extracting
complex patterns and rules from large-scale SV data, reducing the reliance on experts’
knowledge. Overall, these data-driven models have opened up new opportunities for the
field of automated SV assessment.
Figure 1.2: Overview of the thesis. Note: SV stands for Software Vulnerability. (The figure depicts the thesis chapters: Chapter 2: Literature Review on Data-Driven SV Assessment; Chapter 3: Automated Report-Level SV Assessment with Concept Drift; Chapter 4: Automated Function-Level SV Assessment; Chapter 5: Automated Commit-Level SV Assessment; Chapter 6: Collection and Analysis of Developers’ SV Concerns on Q&A Websites.)
The objectives of this thesis are to present and evaluate solutions to address
the lack of (1) treatment for changing SV data, (2) usage of SV-related code,
and (3) consideration of developers’ real-world SV concerns to improve the
practicality of data-driven SV assessment.
The descriptions of newly reported SVs often contain terms unseen in older data; such changes in SV data are referred to as concept drift [19], which can degrade the performance of SV assessment models over
time. However, most of the existing report-level SV assessment models have not accounted
for this concept drift issue, potentially affecting their performance when deployed in the
wild. Using more than 100,000 SV reports from NVD, Chapter 3 performs a large-scale
investigation of the prevalence and impacts of the concept drift issue on report-level SV
assessment models. Moreover, we present a novel SV assessment model that combines
characters and words extracted from SV descriptions to better capture the semantics of
new terms, increasing the model robustness against concept drift.
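
To illustrate the character-word combination described above, the following sketch pairs word-level and character-level n-gram features of SV descriptions before training a classifier for one CVSS metric. It is only a minimal approximation of such a model, assuming scikit-learn is available; the vectorizer settings and the sv_descriptions/cvss_labels placeholders are illustrative assumptions rather than the exact configuration used in Chapter 3.

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Word n-grams capture known terms, while character n-grams (subwords) can still
# represent new or rare terms that appear as SV descriptions change over time.
char_word_model = Pipeline([
    ("features", FeatureUnion([
        ("words", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("chars", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ])),
    ("clf", LinearSVC()),  # one such model would be trained per CVSS metric
])

# sv_descriptions: list of SV report descriptions; cvss_labels: one CVSS metric to predict
# char_word_model.fit(sv_descriptions, cvss_labels)
# predictions = char_word_model.predict(new_descriptions)

The character n-grams act as subwords, so a previously unseen term can still be partially represented by the character pieces it shares with known terms.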
The key contributions of this thesis (the green boxes in Fig. 1.2) from the six afore-
mentioned chapters are summarized as follows.
3. Early and just-in-time SV assessment using source code (Chapters 4 and 5).
(i) Practices of developing effective data-driven models using vulnerable code state-
ments and context in functions for early SV assessment without delays caused by
missing SV reports. (ii) Just-in-time and efficient SV assessment using code com-
mits (where SVs are first added) with deep multi-task learning.
1 Triet Huynh Minh Le, Huaming Chen, and Muhammad Ali Babar, “A Survey
on Data-Driven Software Vulnerability Assessment and Prioritization,” ACM Computing
Surveys (CSUR), 2021. [CORE ranking: rank A*, Impact factor (2020): 10.282, SJR
rating: Q1] (Chapter 2)
2 Triet Huynh Minh Le and Muhammad Ali Babar, “On the Use of Fine-Grained
Vulnerable Code Statements for Software Vulnerability Assessment Models”, in Proceedings
of the 19th International Conference on Mining Software Repositories (MSR). ACM, 2022.
[CORE ranking: rank A, Acceptance rate: 34%] (Chapter 4)
3 Triet Huynh Minh Le, Bushra Sabir, and Muhammad Ali Babar, “Automated
Software Vulnerability Assessment with Concept Drift,” in Proceedings of the 16th Inter-
national Conference on Mining Software Repositories (MSR). IEEE, 2019, pp. 371–382.
[CORE ranking: rank A, Acceptance rate: 25%] (Chapter 3)
4 Triet Huynh Minh Le, David Hin, Roland Croft, and Muhammad Ali Babar, “Deep-
CVA: Automated Commit-Level Vulnerability Assessment with Deep Multi-Task Learn-
ing,” in Proceedings of the 36th IEEE/ACM International Conference on Automated Soft-
ware Engineering (ASE). IEEE, 2021, pp. 717–729. [CORE ranking: rank A*, Acceptance
rate: 19%] (Chapter 5)
5 Triet Huynh Minh Le, David Hin, Roland Croft, and Muhammad Ali Babar,
“PUMiner: Mining Security Posts from Developer Question and Answer Websites with
PU Learning,” in Proceedings of the 17th International Conference on Mining Software
Repositories (MSR). ACM, 2020, pp. 350–361. [CORE ranking: rank A, Acceptance
rate: 25.7%] (Chapter 6)
6 Triet Huynh Minh Le, Roland Croft, David Hin, and Muhammad Ali Babar, “A
Large-Scale Study of Security Vulnerability Support on Developer Q&A Websites,” in
Proceedings of the 25th Evaluation and Assessment in Software Engineering (EASE). ACM,
2021, pp. 109–118. [CORE ranking: rank A, Acceptance rate: 27%, Nominated for
the Best Paper Award] (Chapter 6)
7 Triet Huynh Minh Le, Hao Chen, and Muhammad Ali Babar, “Deep Learning
for Source Code Modeling and Generation: Models, Applications, and Challenges,” ACM
Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–38, 2020. [CORE ranking: rank A*,
Impact factor (2020): 10.282, SJR rating: Q1, High-impact research work selected by
Faculty of Engineering, Computer & Mathematical Sciences at the University of Adelaide.]
8 Xuanyu Duan, Mengmeng Ge, Triet Huynh Minh Le, Faheem Ullah, Shang Gao,
Xuequan Lu, and Muhammad Ali Babar, “Automated Security Assessment for the Inter-
net of Things,” in 2021 IEEE 26th Pacific Rim International Symposium on Dependable
Computing (PRDC). IEEE, 2021, pp. 47–56. [CORE ranking: rank B]
Chapter 2
Literature Review on Data-Driven Software Vulnerability Assessment

1 In the original review paper [11], we used the term SV assessment and prioritization instead of SV assessment. However, in Chapter 2, the term SV assessment is used to ensure consistency with the other parts of the thesis. It is important to note that the two terms can be used interchangeably, as most SV assessment tasks can be used for prioritizing SV fixing.
2.1 Introduction
As discussed in Chapter 1, Software Vulnerability (SV) assessment is required to prior-
itize the remediation of critical SVs (i.e., the ones that can lead to devastating cyber-
attacks) [10]. SV assessment includes tasks that determine various characteristics such as
the types, exploitability, impact and severity levels of SVs [31]. Such characteristics help
understand and select high-priority SVs to resolve early given the limited effort and re-
sources. For example, SVs with simple exploitation and severe impacts likely require high
fixing priority.
There is an active research area on assessing and prioritizing SVs using increasingly large data from multiple sources. Many studies in this area have proposed different
Natural Language Processing (NLP), Machine Learning (ML) and Deep Learning (DL)
techniques to leverage such data to automate various tasks such as predicting the Com-
mon Vulnerability Scoring System [29] (CVSS) metrics (e.g., [23, 22, 21]) or public exploits
(e.g., [32, 33, 34]). These prediction models can learn the patterns automatically from vast
SV data, which would be otherwise impossible to do manually. Such patterns are utilized
to speed up the assessment processes of ever-increasing and more complex SVs, signifi-
cantly reducing practitioners’ effort. Despite the rising research interest in data-driven SV
assessment, to the best of our knowledge, there has been no comprehensive review on the
state-of-the-art methods and existing challenges in this area. To bridge this gap, we are
the first to review in-depth the research studies that automate data-driven SV assessment
tasks leveraging SV data and NLP/ML/DL techniques.
The contributions of our review are summarized as follows:
2. We synthesize and discuss the pros and cons of data, features, models, evaluation
methods and metrics commonly used in the reviewed studies.
We believe that our findings can provide useful guidelines for researchers and practitioners
to effectively utilize data to perform SV assessment.
Related Work. There have been several existing surveys/reviews on SV analysis and
prediction, but they are fundamentally different from ours (see Table 2.1). Ghaffarian et
al. [3] conducted a seminal survey on ML-based SV analysis and discovery. Subsequently,
several studies [35, 36, 37, 38] reviewed DL techniques for detecting vulnerable code. How-
ever, these prior reviews did not describe how ML/DL techniques can be used to assess
and prioritize the detected SVs. There have been other relevant reviews on using Open
Source Intelligence (OSINT) (e.g., phishing or malicious emails/URLs/IPs) to make in-
formed security decisions [39, 40, 41]. However, these OSINT reviews did not explicitly
discuss the use of SV data and how such data can be leveraged to automate the assessment
processes. Moreover, most of the reviews on SV assessment have focused on either static
analysis tools [13] or rule-based approaches (e.g., expert systems or ontologies) [9]. These
methods rely on pre-defined patterns and struggle to work with new types and different
data sources of SVs compared to contemporary ML or DL approaches presented in this
chapter [14, 15, 42]. Recently, Dissanayake et al. [43] reviewed the socio-technical chal-
lenges and solutions for security patch management that involves SV assessment after SV
patches are identified. Unlike [43], we focus on the challenges, solutions and practices of
automating various SV assessment tasks with data-driven techniques. We also consider all
types of SV assessment regardless of the patch availability.
Table 2.1: Comparison of contributions between our review and the existing related surveys/reviews. (The table compares each study along three dimensions: focus on SV assessment, analysis of SV data sources, and analysis of data-driven approaches.)
2.2.2 Methodology
Study selection. Our study selection was inspired by the Systematic Literature Re-
view guidelines [44]. We first designed the search string: “‘software’ AND vulner* AND
(learn* OR data* OR predict*) AND (priority* OR assess* OR impact* OR exploit* OR
severity*) AND NOT (fuzz* OR dynamic* OR intrusion OR adversari* OR malware* OR
‘vulnerability detection’ OR ‘vulnerability discovery’ OR ‘vulnerability identification’ OR
‘vulnerability prediction’)”. This search string covered the key papers (i.e., with more than 50 citations) in the area and excluded many papers on general security and SV detection.
Table 2.2: Inclusion and exclusion criteria for study selection. Notes:
We did not limit the selection to only peer-reviewed papers as this is an
emerging field with many (high-quality) papers on Arxiv and most of them
are under-submission. However, to ensure the quality of the papers on
Arxiv, we only selected the ones with at least one citation.
Inclusion criteria
• I1. Studies that focused on SVs rather than hardware or human vulnerabilities
• I2. Studies that focused on assessment task(s) of SVs
• I3. Studies that used data-driven approaches (e.g., ML/DL/NLP techniques) for SV
assessment
Exclusion criteria
• E1. Studies that were not written in English
• E2. Studies that we could not retrieve their full texts
• E3. Studies that were not related to Computer Science
• E4. Studies that were literature review or survey
• E5. Studies that only performed statistical analysis of SV assessment metrics
• E6. Studies that only focused on (automated) collection of SV data
We then adapted this search string² to retrieve an initial list of 1,765 papers up to April 2021³ from various commonly used databases such as IEEE Xplore, ACM Digital
Library, Scopus, SpringerLink and Wiley. We also defined the inclusion/exclusion crite-
ria (see Table 2.2) to filter out irrelevant/low-quality studies with respect to our scope in
section 2.2.1. Based on these criteria and the titles, abstracts, and keywords of the 1,765 initial papers, we removed 1,550 papers. After reading the full texts and applying the criteria to the remaining 215 papers, we obtained 70 papers directly related to data-driven
SV assessment. To further increase the coverage of studies, we performed backward and
forward snowballing [45] on these 70 papers (using the above sources and Google Scholar)
and identified 14 more papers that satisfied the inclusion/exclusion criteria. In total, we
included 84 studies in our review. We do not claim that we have collected all the papers
in this area, but we believe that our selection covered most of the key studies to unveil the
practices of data-driven SV assessment.
Data extraction and synthesis of the selected studies. We followed the steps of
thematic analysis [18] to identify the taxonomy of data-driven SV assessment tasks in
sections 2.3, 2.4, 2.5, 2.6 and 2.7 as well as the key practices of data-driven model building
for automating these tasks in section 2.8. We first conducted a pilot study of 20 papers
to familiarize ourselves with data to be extracted from the primary studies. After that,
we generated initial codes and then merged them iteratively in several rounds to create
themes. Two of the authors performed the analysis independently, in which each author
analyzed half of the selected papers and then reviewed the analysis output of the other
author. Any disagreements were resolved through discussions.
Figure 2.1: Taxonomy of studies on data-driven SV assessment. (The taxonomy groups the reviewed studies into themes and sub-themes, including 1. Exploitation, with the sub-themes Exploit Likelihood, Exploit Time and Exploit Characteristics, and 2. Impact, with the sub-themes Confidentiality, Integrity, Availability, Scope and Custom Vulnerability Consequences.)

Specifically, we extracted the themes by grouping related SV assessment tasks
that the reviewed studies aim to automate/predict using data-driven models. Note that a
paper is categorized into more than one theme if that paper develops models for multiple
cross-theme tasks.
We acknowledge that there can be other ways to categorize the studies. However, we
assert the reliability of our taxonomy as all of our themes (except theme 5) align with the
security standards used in practice. For example, Common Vulnerability Scoring System
(CVSS) [29] provides a framework to characterize exploitability, impact and severity of SVs
(themes 1-3), while Common Weakness Enumeration (CWE) [28] includes many vulnera-
bility types (theme 4). Hence, we believe our taxonomy can help identify and bridge the
knowledge gap between the academic literature and industrial practices, making it relevant
and potentially beneficial for both researchers and practitioners. Details of each theme in
our taxonomy are covered in subsequent sections.
Table 2.3: List of the reviewed papers in the Exploit Likelihood sub-theme
of the Exploitation theme. Note: The nature of task of this sub-theme is
binary classification of existence/possibility of proof-of-concept and/or real-
world exploits.
Table 2.4: List of the reviewed papers in the Exploit Time sub-theme of
the Exploitation theme.
these authors used an SVM model with features extracted from the dangerous system
calls [84] in entry points/functions [85] and the reachability from any of these entry points
to vulnerable functions [86]. Moving from high-level to binary code, Yan et al. [60] first
used a Decision tree to obtain prior beliefs about SV types in 100 Linux applications using
static features (e.g., hexdump) extracted from executables. Subsequently, they applied
various fuzzing tools (i.e., Basic Fuzzing Framework [87] and OFuzz [88]) to detect SVs with
the ML-predicted types. They finally updated the posterior beliefs about the exploitability
based on the outputs of the ML model and fuzzers using a Bayesian network. The proposed
method outperformed !exploitable,⁸ a static crash analyzer provided by Microsoft. Tripathi
et al. [61] also predicted SV exploitability from crashes (i.e., VDiscovery [62, 63] and
LAVA [64] datasets) using an SVM model and static features from core dumps and dynamic
features generated by the Last Branch Record hardware debugging utility. Zhang et al. [65]
proposed two improvements to Tripathi et al. [61]’s approach. These authors first replaced
the hardware utility in [61] that may not be available for resource-constrained devices (e.g.,
IoT) with sequence/n-grams of system calls extracted from execution traces. They also
used an online passive-aggressive classifier [89] to enable online/incremental learning of
exploitability for new crash batches on-the-fly.
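
To illustrate the online/incremental learning setup described above, the sketch below incrementally updates a passive-aggressive classifier on hashed n-grams of system calls as new crash batches arrive. It is a simplified approximation assuming scikit-learn; the feature settings and the update_on_crash_batch helper are hypothetical and do not reproduce the exact pipeline of Zhang et al. [65].

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Stateless hashing of system-call n-grams, so no vocabulary has to be rebuilt
# when new crash batches arrive on-the-fly.
vectorizer = HashingVectorizer(analyzer="word", ngram_range=(1, 3), n_features=2**18)
classifier = PassiveAggressiveClassifier()

def update_on_crash_batch(traces, labels):
    """Incrementally update the exploitability model on one batch of crash traces.

    traces: list of space-separated system-call sequences (one string per crash)
    labels: list of 0/1 labels (1 = exploitable)
    """
    X = vectorizer.transform(traces)
    classifier.partial_fit(X, labels, classes=[0, 1])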
capturing the content and relationships among tweets, respective tweets’ authors and SVs.
The proposed model outperformed many baselines and was integrated into the VEST sys-
tem [93] to provide timely SV assessment information for practitioners. To the best of our
knowledge, at the time of writing, Chen et al. [92, 93] have been the only ones pinpointing
the exact exploit time of SVs rather than large/uncertain time-frames (e.g., months) in
other studies, helping practitioners to devise much more fine-grained remediation plans.
Table 2.5: List of the reviewed papers in the Exploit Characteristics sub-
theme of the Exploitation theme.
specific prediction head/layer for each metric/task. This model outperformed single-task
counterparts while requiring much less time to (re-)train.
Although CVSS exploitability metrics were most commonly used, several studies used
other schemes for characterizing exploitation. Chen et al. [104] used Linear SVM and
SV descriptions to predict multiple SV characteristics, including three SV locations (i.e.,
Local, LAN and Remote) on SecurityFocus [72] and Secunia [118] databases as well as 11
SV causes¹⁰ on SecurityFocus. Regarding the exploit types, Ruohonen et al. [105] used
LDA [30] and Random forest to classify whether an exploit would affect a web application.
This study can help find relevant exploits in components/sub-systems of a large system.
For privileges, Aksu et al. [106] extended the Privileges Required metric of CVSS by
incorporating the context (i.e., Operating system or Application) to which privileges are
applied (see Table 2.5). They found MLP [119] to be the best-performing model for
obtaining these privileges from SV descriptions. They also utilized the predicted privileges
to generate attack graphs (sequence of attacks from source to sink nodes). Liu et al. [108]
advanced this task by combining information gain for feature selection and Convolutional
Neural Network (CNN) [120] for feature extraction. Regarding attack patterns, Kanakogi
et al. [109] found Doc2vec [121] to be more effective than term-frequency inverse document
frequency (tf-idf) when combined with cosine similarity to find the most relevant Common
Attack Pattern Enumeration and Classification (CAPEC) [122] for a given SV on NVD.
Such attack patterns can manifest how identified SVs can be exploited by adversaries,
assisting the selection of suitable countermeasures.
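
As a rough illustration of such similarity-based retrieval, the sketch below ranks attack-pattern descriptions by cosine similarity to an SV description. It assumes scikit-learn and uses tf-idf features for brevity (rather than Doc2vec); the rank_attack_patterns function and the capec_texts input are hypothetical placeholders, not the setup of Kanakogi et al. [109].

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_attack_patterns(sv_description, capec_texts):
    """Return indices of CAPEC descriptions, sorted from most to least similar."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(capec_texts + [sv_description])
    similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return similarities.argsort()[::-1]

# capec_texts: list of CAPEC attack-pattern descriptions
# best_match = rank_attack_patterns("SQL injection in the login form ...", capec_texts)[0]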
Table 2.6: List of the reviewed papers in the Impact theme. Note: We
grouped the first four sub-themes as they were mostly predicted together.
ones using source code for exploit prediction, but their approach still requires manual
identification of dangerous function calls in C/C++. More work is required to employ
data-driven approaches to alleviate the need for manually defined rules to improve the
effectiveness and generalizability of code-based exploit prediction.
an exploited SV would affect only the system that contains the SV. For example, Scope
changes when an SV occurring in a virtual machine affects the whole host machine, in turn
increasing individual impacts.
The studies that predicted the CVSS impact metrics are mostly the same as the ones
predicting the CVSS exploitability metrics in section 2.3. Given the overlap, we hereby
only describe the main directions and techniques of the Impact-related tasks rather than
iterating the details of each study. Overall, a majority of the work has focused on clas-
sifying CVSS impact metrics (versions 2 and 3) using three main learning paradigms:
single-task [95, 96, 23, 98, 93, 100, 101], multi-target [102, 22] and multi-task [103] learn-
ing. Instead of developing a separate prediction model for each metric like the single-
task approach, multi-target and multi-task approaches only need a single model for all
tasks. Multi-target learning predicts concatenated output; whereas, multi-task learning
uses shared feature extraction for all tasks and task-specific softmax layers to determine
the output of each task. These three learning paradigms were powered by applying and/or
customizing a wide range of data-driven methods. The first method was to use single
ML classifiers like supervised Latent Dirichlet Allocation [95], Principal component analy-
sis [97], Naïve Bayes [23, 98, 102], Logistic regression [101], Kernel-based SVM [96], Linear
SVM [23], KNN [23] and Decision tree [22]. Other studies employed ensemble models
combining the strength of multiple single models such as Random forest [23, 98], boosting
model [22] and XGBoost/LGBM [23]. Recently, more studies moved towards more sophis-
ticated DL architectures such as MLP [102], attention-based (Bi-)LSTM [103] and graph
neural network [93]. Ensemble and DL models usually beat the single ones, but there is a
lack of direct comparisons between these two emerging model types.
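
To make the multi-task paradigm concrete, the following sketch shows a shared feature extractor with one task-specific classification head per CVSS metric. It is a minimal, hypothetical PyTorch illustration (the bag-of-embeddings encoder, layer sizes and task names are assumptions), not the architecture of any specific reviewed study.

import torch.nn as nn

class MultiTaskCVSSNet(nn.Module):
    """Shared feature extractor with one classification head (softmax layer) per CVSS task."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, task_classes):
        super().__init__()
        # Simple bag-of-embeddings encoder shared by all tasks
        self.encoder = nn.EmbeddingBag(vocab_size, emb_dim)
        self.shared = nn.Sequential(nn.Linear(emb_dim, hidden_dim), nn.ReLU())
        # task_classes, e.g., {"confidentiality": 3, "integrity": 3, "availability": 3}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_dim, n) for task, n in task_classes.items()}
        )

    def forward(self, token_ids, offsets):
        features = self.shared(self.encoder(token_ids, offsets))
        # One set of logits per CVSS metric
        return {task: head(features) for task, head in self.heads.items()}

During training, the per-task cross-entropy losses would typically be summed so that all heads update the shared layers jointly.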
Table 2.7: List of the reviewed papers in the Severe vs. Non-Severe sub-
theme of the Severity theme. Note: The nature of task here is binary
classification of severe SVs with High/Critical CVSS v2/v3 severity levels.
Table 2.8: List of the reviewed papers in the Severity Levels sub-theme
of the Severity theme.
Table 2.9: List of the reviewed papers in the Severity Score sub-theme of
the Severity theme. Notes: † denotes that the severity score is computed
from ML-predicted base metrics using the formula provided by an assess-
ment framework (CVSS and/or WIVSS).
1-layer CNN) also achieved promising results for predicting a different severity categoriza-
tion, namely Atlassian’s levels.¹² Such findings were successfully replicated by Sahin et
al. [132]. Nakagawa et al. [133] further enhanced the DL model performance for the same
task by incorporating the character-level features into a CNN model [139]. Complementary
to performance enhancement, Gong et al. [103] proposed to predict these severity levels
concurrently with other CVSS metrics in a single model using multi-task learning [116]
powered by an attention-based Bi-LSTM shared feature extraction model. The unified
model was demonstrated to increase both the prediction effectiveness and efficiency. Be-
sides Atlassian’s categories, several studies applied ML models to predict severity levels on
other platforms such as Secunia [104] and China National Vulnerability Database¹³ [134].
Instead of using textual categories, Khazaei et al. [135] divided the CVSS severity score
into 10 bins with 10 increments each (e.g., values of 0 – 0.9 are in one bin) and obtained
decent results (86-88% Accuracy) using Linear SVM, Random forest and Fuzzy system.
did not directly predict the severity score from SV descriptions; instead, they aggregated the
predicted values of the CVSS Exploitability (see section 2.3) and Impact metrics (see
section 2.4) using the formulas of CVSS version 2 [96, 22, 100, 101], version 3 [98, 100, 101]
and WIVSS [22]. We noticed the papers predicting both versions (e.g., CVSS versions 2
vs. 3 or CVSS version 2 vs. WIVSS) usually obtained better performance for version 3
and WIVSS than version 2 [100, 101]. These findings may suggest that the improvements
made by experts in version 3 and WIVSS compared to version 2 help make the patterns
in severity score clearer and easier for ML models to capture. In addition to predicting
severity score, Toloudis et al. [97] examined the correlation between words in descriptions
of SVs and the severity values of such SVs, aiming to shed light on words that increase or
decrease the severity score of SVs.
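
For reference, the aggregation described above follows the published scoring equations. The snippet below transcribes the CVSS version 2 base-score formula, with the metric weights taken from the CVSS v2 specification, so that a base score can be computed from already-predicted Exploitability and Impact sub-metrics; it is a transcription of the standard formula rather than code from the reviewed studies.

def cvss_v2_base_score(av, ac, au, c, i, a):
    """Compute the CVSS v2 base score from the six (predicted) base metric weights.

    av: Access Vector      {"Local": 0.395, "Adjacent": 0.646, "Network": 1.0}
    ac: Access Complexity  {"High": 0.35, "Medium": 0.61, "Low": 0.71}
    au: Authentication     {"Multiple": 0.45, "Single": 0.56, "None": 0.704}
    c, i, a: Confidentiality/Integrity/Availability impact
             {"None": 0.0, "Partial": 0.275, "Complete": 0.66}
    """
    impact = 10.41 * (1 - (1 - c) * (1 - i) * (1 - a))
    exploitability = 20 * av * ac * au
    f_impact = 0 if impact == 0 else 1.176
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f_impact, 1)

# Example: AV:N/AC:L/Au:N/C:C/I:C/A:C
# cvss_v2_base_score(1.0, 0.71, 0.704, 0.66, 0.66, 0.66)  # -> 10.0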
without clear patterns/keywords in SV descriptions. Aota et al. [148] utilized the Boruta
feature selection algorithm [163] and Random forest to improve the performance of base
CWE classification. Base CWEs give more fine-grained information for SV remediation
than categorical CWEs used in [145].
There has been a recent rise in using neural network/DL based models for CWE classi-
fication. Huang et al. [147] implemented a deep neural network with tf-idf and information
gain for the task and obtained better performance than SVM, Naïve Bayes and KNN.
Aghaei et al. [149] improved upon [148] for both categorical (coarse-grained) and base
(fine-grained) CWE classification with an adaptive hierarchical neural network to deter-
mine sequences of less to more fine-grained CWEs. To capture the hierarchical structure
and rare classes of CWEs, Das et al. [150] matched SV and CWE descriptions instead
of predicting CWEs directly. They presented a deep Siamese network with a BERT-
based [80] shared feature extractor that outperformed many baselines even for rare/unseen
CWE classes. Recently, Zou et al. [151] pioneered the multi-class classification of CWE
in vulnerable functions curated from Software Assurance Reference Dataset (SARD) [164]
and NVD. They achieved high performance (∼95% F1-Score) with DL (Bi-LSTM) models.
The strength of their model came from combining global (semantically related statements)
and local (variables/statements affecting function calls) features. Note that this model
currently only works for functions in C/C++ and 40 selected classes of CWE.
Another group of studies has considered unsupervised learning methods to extract
CWE sequences, patterns and relationships. Sequences of SV types over time were iden-
tified by Murtaza et al. [152] using an n-gram model. This model sheds light on both
co-occurring and upcoming CWEs (grams), raising awareness of potential cascading at-
tacks. Lin et al. [153] applied an association rule mining algorithm, FP-growth [165], to
extract the rules/patterns of various CWEs aspects including types, programming lan-
guage, time of introduction and consequence scope. For example, buffer overflow (CWE
type) usually appears during the implementation phase (time of introduction) in C/C++
(programming language) and affects the availability (consequence scope). Lately, Han et
al. [154] developed a deep knowledge graph embedding technique to mine the relationships
among CWE types, assisting in finding relevant SV types with similar properties.
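
As a lightweight illustration of such sequence-based analysis, the snippet below counts bigrams of consecutively reported CWE types to surface co-occurring and likely upcoming weaknesses. It is a simplified, pure-Python stand-in with a hypothetical cwe_sequence input, not the n-gram or FP-growth implementations used in the cited studies.

from collections import Counter

def top_cwe_bigrams(cwe_sequence, k=5):
    """Return the k most frequent pairs of consecutive CWE types in a time-ordered sequence."""
    bigrams = zip(cwe_sequence, cwe_sequence[1:])
    return Counter(bigrams).most_common(k)

# cwe_sequence: CWE IDs of reported SVs ordered by time, e.g.:
# top_cwe_bigrams(["CWE-79", "CWE-89", "CWE-79", "CWE-89", "CWE-119"])
# -> [(("CWE-79", "CWE-89"), 2), ...]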
features that best described the 10 pre-defined SV types¹⁶ prevalent in the wild. These
authors then used a diffusion-based storytelling technique [169] to show the evolution of
a particular topic of SVs over time; e.g., increasing API-related SVs requires hardening
the APIs used in a product. To support user-friendly SV assessment using ever-increasing
unstructured SV data, Russo et al. [162] used a Bayesian network to predict 10 pre-defined SV types.¹⁷ Besides predicting manually defined SV types using SV natural language
descriptions, Yan et al. [60] used a decision tree to predict 22 SV types prevalent in the
executables of Linux applications. The predicted type was then combined with fuzzers’
outputs to predict SV exploitability (see section 2.3.1.1). Besides author-defined types,
custom SV types also come from specific SV platforms. Zhang et al. [134] designed an
ML-based framework to predict the SV types collected from China National Vulnerability
Database. Ensemble models (bagging and boosting models) achieved, on average, the
highest performance for this task.
Table 2.11: List of the reviewed papers in the Miscellaneous Tasks theme.
directly from CVE-2020-28022,¹⁸ but not from CVE-2021-2122.¹⁹ The latter case requires
techniques from section 2.6.1.1.
Most of the retrieval methods in this sub-theme have been formulated under the multi-
class classification setting. One of the earliest works was conducted by Weerawardhana et
al. [170]. This study extracted software names/versions, impacts and attacker’s/user’s ac-
tion from SV descriptions on NVD using Stanford Named Entity Recognition (NER) tech-
nique, a.k.a. CRF classifier [186]. Later, Dong et al. [171] proposed to use a word/character-
level Bi-LSTM to improve the performance of extracting vulnerable software names and
versions from SV descriptions available on NVD and other SV databases/advisories (e.g.,
CVE Details [187], ExploitDB [67], SecurityFocus [72], SecurityTracker [188] and Open-
wall [189]). Based on the extracted entities, these authors also highlighted the inconsisten-
cies in vulnerable software names and versions across different SV sources. Besides version
products/names of SVs, Gonzalez et al. [172] used a majority vote of different ML mod-
els (e.g., SVM and Random forest) to extract the 19 entities of Vulnerability Description
Ontology (VDO) [173] from SV descriptions to check the consistency of these descriptions
based on the guidelines of VDO. Since 2020, there has been a trend in using DL models
(e.g., Bi-LSTM, CNNs or BERT [80]/ELMo [190]) to extract different information from SV
descriptions including required elements for generating MulVal [176] attack rules [174, 175]
or SV types/root cause, attack type/vector [177, 178], Common Product Enumeration
(CPE) [191] for standardizing names of vulnerable vendors/products/versions [179], part-
of-speech [180] and relevant entities (e.g., vulnerable products, attack type, root cause)
from ExploitDB to generate SV descriptions [182]. BERT models [80], pre-trained on gen-
eral text (e.g., Wikipedia pages [82] or PTB corpus [181]) and fine-tuned on SV text, have
also been increasingly used to address the data scarcity/imbalance for the retrieval tasks.
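
To give a flavour of the fine-tuning setup mentioned above, the following sketch loads a pre-trained BERT model and prepares it for token-level extraction of SV entities (e.g., vulnerable product and version). It assumes the Hugging Face transformers library; the label set, model choice and example description are hypothetical and do not correspond to a specific reviewed study.

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO labels for extracting vulnerable product/version entities
labels = ["O", "B-PRODUCT", "I-PRODUCT", "B-VERSION", "I-VERSION"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels)
)

# The pre-trained encoder outputs one label score per (sub)token; the classification
# head would then be fine-tuned on SV descriptions annotated with the labels above.
inputs = tokenizer("Buffer overflow in FooServer 2.1 allows remote attackers ...",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, number_of_subtokens, number_of_labels)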
group of authors [185] leveraged the identified factors in their prior qualitative study to
develop various regression models, such as linear, tree-based and neural network regressors, to predict the time to fix SVs using the data collected at SAP. These
authors found that code components containing detected SVs are more important for the
prediction than SV types.
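
For illustration, the sketch below compares two of the regression families mentioned above (linear and tree-based) for predicting the time to fix SVs in days. It assumes scikit-learn; the compare_fix_time_models helper and its feature/target inputs are hypothetical placeholders, not the SAP dataset used in [185].

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def compare_fix_time_models(X, y):
    """Report cross-validated mean absolute errors (in days) of two regressor families.

    X: numeric/encoded features of each SV (e.g., affected code component, SV type)
    y: observed time to fix each SV, in days
    """
    models = {
        "linear regression": LinearRegression(),
        "tree-based regression": RandomForestRegressor(n_estimators=100),
    }
    for name, model in models.items():
        mae = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_absolute_error").mean()
        print(f"{name}: mean absolute error = {mae:.1f} days")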
Table 2.12: The frequent data sources, features, models, evaluation tech-
niques and evaluation metrics used for the five identified SV assessment
themes. Notes: The values are organized based on their overall frequency
across the five themes. For the Prediction Model and Evaluation Metric
elements, the values are first organized by categories (ML then DL for
Prediction Model and classification then regression for Evaluation Met-
ric) and then by frequencies. k-CV stands for k-fold cross-validation.
The full list of values and their appearance frequencies for the five el-
ements in the five themes can be found at https://figshare.com/s/
da4d238ecdf9123dc0b8.
Element: Data Source

• NVD/CVE/CVE Details (deprecated OSVDB)
  Strengths: report expert-verified information (with CVE-ID); contain CWE and CVSS entries for each SV; link to external sources (e.g., official fixes).
  Weaknesses: missing/incomplete links to vulnerable code/fixes; inconsistencies due to human errors; delayed SV reporting and assignment of CVSS metrics.

• ExploitDB
  Strengths: report PoC exploits of SVs (with links to CVE-ID).
  Weaknesses: may not lead to real exploits in the wild.

• Other security advisories (e.g., SecurityFocus, Symantec or X-Force)
  Strengths: report real-world exploits of SVs; cover diverse SVs (including ones w/o CVE-ID).
  Weaknesses: some exploits may not have links to CVE entries for mapping with other assessment metrics.

• Informal sources (e.g., Twitter and darkweb)
  Strengths: early reporting of SVs (maybe earlier than NVD); contain non-technical SV information (e.g., financial damage or socio-technical challenges in addressing SVs).
  Weaknesses: contain non-verified and even misleading information; may cause adversarial attacks to assessment models.

Element: Model Feature

• BoW/tf-idf/n-grams
  Strengths: simple to implement; strong baseline for text-based inputs (e.g., SV descriptions in security databases/advisories).
  Weaknesses: may suffer from vocabulary explosion (e.g., many new description words for new SVs); no consideration of word context/order (maybe needed for code-based SV analysis); cannot handle Out-of-Vocabulary (OoV) words (can resolve with subwords [23]).

• Word2vec
  Strengths: capture nearby context of each word; can reuse existing pre-trained model(s).
  Weaknesses: cannot handle OoV words (can resolve with fastText [196]); no consideration of word order.

• DL model end-to-end trainable features
  Strengths: produce SV task-specific features.
  Weaknesses: may not produce high-quality representation for tasks with limited data (e.g., real-world exploit prediction).

• Bidirectional Encoder Representations from Transformers (BERT)
  Strengths: capture contextual representation of text (i.e., the feature vector of a word is specific to each input); capture word order in an input; can handle OoV words.
  Weaknesses: may require GPU to speed up feature inference; may be too computationally expensive and require too much data to train a strong model from scratch; may require fine-tuning to work well for a source task.

• Source/expert-defined metadata features
  Strengths: lightweight; human interpretable.
  Weaknesses: require SV expertise to define relevant features for a task of interest; hard to generalize to new tasks.

Element: Prediction Model

• Single ML models (e.g., Linear SVM, Logistic regression, Naïve Bayes)
  Strengths: simple to implement; efficient to (re-)train on large data (e.g., using the entire NVD database).
  Weaknesses: may be prone to overfitting; usually do not perform as well as ensemble/DL models.

• Ensemble ML models (e.g., Random forest, XGBoost, LGBM)
  Strengths: strong baseline (usually stronger than single models); less prone to overfitting.
  Weaknesses: take longer to train than single models.

• Latent Dirichlet Allocation (LDA – topic modeling)
  Strengths: require no labeled data for training; can provide features for supervised learning models.
  Weaknesses: require SV expertise to manually label generated topics; may generate human non-interpretable topics.

• Deep Multi-Layer Perceptron (MLP)
  Strengths: work readily with tabular data (e.g., manually defined features or BoW/tf-idf/n-grams).
  Weaknesses: perform comparably yet are more costly compared to ensemble ML models; less effective for unstructured data (e.g., SV descriptions).

• Deep Convolutional Neural Networks (CNN)
  Strengths: capture local and hierarchical patterns of inputs; usually perform better than MLP for text-based data.
  Weaknesses: cannot effectively capture sequential order of inputs (maybe needed for code-based SV analysis).

• Deep recurrent neural networks (e.g., LSTM or Bi-LSTM)
  Strengths: capture short-/long-term dependencies from inputs; usually perform better than MLP for text-based data.
  Weaknesses: may suffer from the information bottleneck issue (can resolve with attention mechanism [117]); usually take longer to train than CNNs.

• Deep graph neural networks (e.g., Graph convolutional network)
  Strengths: capture directed relationships among multiple SV entities and sources.
  Weaknesses: require graph-structured inputs to work; more computationally expensive than other DL models.

• Deep transfer learning with fine-tuning (e.g., BERT with task-specific classification layer(s))
  Strengths: can improve the performance for tasks with small data (e.g., real-world exploit prediction).
  Weaknesses: require target task to have similar nature as source task.

• Deep contrastive learning (e.g., Siamese neural networks)
  Strengths: can improve performance for tasks with small data; robust to class imbalance (e.g., CWE classes).
  Weaknesses: computationally expensive (two inputs instead of one); do not directly produce class-wise probabilities.

• Deep multi-task learning
  Strengths: can share features for predicting multiple tasks (e.g., CVSS metrics) simultaneously; reduce training/maintenance cost.
  Weaknesses: require predicted tasks to be related; hard to tune the performance of individual tasks.

Element: Evaluation Technique

• Single k-CV without test
  Strengths: easy to implement; reduce the randomness of results with multiple folds.
  Weaknesses: no separate test set for validating optimized models (can resolve with separate test set(s)); maybe infeasible for expensive DL models; use future data/SVs for training, which may bias results.

• Single/multiple random train/test with/without val (using val to tune hyperparameters)
  Strengths: easy to implement; reduce the randomness of results (the multiple version).
  Weaknesses: may produce unstable results (the single version); maybe infeasible for expensive DL models (the multiple version); use future data/SVs for training, which may bias results.

• Single/multiple time-based train/test with/without val (using val to tune hyperparameters)
  Strengths: consider the temporal properties of SVs, simulating the realistic evaluation of ever-increasing SVs in practice; reduce the randomness of results (the multiple version).
  Weaknesses: similar drawbacks for the single & multiple versions as the random counterparts; may result in uneven/small splits (e.g., many SVs in a year).

Element: Evaluation Metric

• F1-Score/Precision/Recall (classification)
  Strengths: suitable for imbalanced data (common in SV assessment tasks).
  Weaknesses: do not consider True Negatives in a confusion matrix (can resolve with Matthews Correlation Coefficient (MCC)).

• Accuracy (classification)
  Strengths: consider all the cells in a confusion matrix.
  Weaknesses: unsuitable for imbalanced data (can resolve with MCC).

• Area Under the Curve (AUC) (classification)
  Strengths: independent of prediction thresholds.
  Weaknesses: may not represent real-world settings (i.e., as models in practice mostly use fixed classification thresholds); ROC-AUC may not be suitable for imbalanced data (can resolve with Precision-Recall-AUC).

• Mean absolute (percentage) error/Root mean squared error (regression)
  Strengths: show absolute performance of models.
  Weaknesses: maybe hard to interpret a value on its own without domain knowledge (i.e., whether an error of x is sufficiently effective).

• Correlation coefficient (r)/Coefficient of determination (R²) (regression)
  Strengths: show relative performance of models (0 – 1), where 0 is worst & 1 is best.
  Weaknesses: R² always increases when adding any new feature (can resolve with adjusted R²).
SVs (e.g., [154, 184]). However, NVD/CVE still suffer from information inconsisten-
cies [171, 141, 142] and missing relevant external sources (e.g., SV fixing code) [198].
Such issues motivate future work to validate/clean NVD data and utilize more sources for
code-based SV assessment (see section 2.9).
To enrich the SV information on NVD/CVE, many other security advisories and SV
databases have been commonly leveraged by the reviewed studies, notably ExploitDB [67],
Symantec [68, 199], SecurityFocus [72], CVE Details [187] and OSVDB. Most of these
sources disclose PoC (ExploitDB and OSVDB) and/or real-world (Symantec and SecurityFocus) exploits. However, real-world exploits are much rarer than, and different from, PoC ones [33, 55]. It is recommended that future work should explore more data sources (other
than the ones in Table 2.3) and better methods to retrieve real-world exploits, e.g., using
semi-supervised learning [200] to maximize the data efficiency for exploit retrieval and/or
few-shot learning for tackling the extreme exploit data imbalance issue [201]. Additionally,
CVE Details and OSVDB are SV databases like NVD yet with a few key differences. CVE
Details explicitly monitors Exploit-DB entries that may be missed on NVD and provides
a more user-friendly interface to view/search SVs. OSVDB also reports SVs that do not
appear on NVD (without CVE-ID), but this site was discontinued in 2016.
Besides official/expert-verified data sources, we have seen an increasing interest in min-
ing SV information from informal sources that also contain non-expert generated content
such as social media (e.g., Twitter) and darkweb. In particular, Twitter has been widely used
for predicting exploits as this platform has been shown to contain many SV disclosures
even before official databases like NVD [33, 140]. Recently, darkweb forums/sites/markets
have also gained traction as SV mentions on these sites have a strong correlation with their
exploits in the wild [48, 49]. However, SV-related data on these informal sources are much noisier because they neither follow any pre-defined structure nor undergo any verification, and they are even prone to fake news [33]. Thus, the integrity of data from these sources should be checked, e.g., by assessing the reputation of posters, to avoid feeding unreliable
data to prediction models and potentially producing misleading findings.
than a vocabulary size. Moreover, these techniques capture the sequential order and con-
text (nearby words) to enable better SV-related text comprehension (e.g., SV vs. general
exploit). Importantly, these NN/DL learned features can be first trained in a non-SV do-
main with abundant data (e.g., Wikipedia pages [82]) and then transferred/fine-tuned in
the SV domain to address limited/imbalanced SV data [56]. The main concern with these
sophisticated NN/DL features is their limited interpretability, which is an exciting future
research area [204].
The metadata about SVs can also complement the missing information in descriptions
or code for SV assessment. For example, the prediction of exploits and their characteristics has been enhanced using CVSS metrics [49], CPE [106] and SV types [57] on NVD. Additionally, Twitter-related statistics (e.g., numbers of followers, likes and retweets) have been shown to increase the performance of predicting SV exploitation, impact and severity [33, 93]. Recently, alongside features extracted from vulnerable code, information about the software development process and the involved developers has also been extracted to predict SV fixing effort [185]. Currently, metadata-based and text-based features have been
mainly integrated by concatenating their respective feature vectors (e.g., [140, 92, 48, 49]).
An alternative yet unexplored way is to build separate models for each feature type and
then combine these models using meta-learning (e.g., model stacking [205]).
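To make the suggestion concrete, the sketch below shows one way such a stacked combination could look with scikit-learn; the column names ("description", "n_followers", "n_retweets") and the choice of base/meta learners are illustrative assumptions rather than the setup of any reviewed study.

    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # One base model per feature type: text-based (SV description) vs. metadata-based.
    text_model = Pipeline([
        ("features", ColumnTransformer([("tfidf", TfidfVectorizer(), "description")])),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    metadata_model = Pipeline([
        ("features", ColumnTransformer([("meta", "passthrough", ["n_followers", "n_retweets"])])),
        ("clf", RandomForestClassifier(n_estimators=100)),
    ])

    # The meta-learner combines the class probabilities of the per-feature-type models.
    stacked_model = StackingClassifier(
        estimators=[("text", text_model), ("metadata", metadata_model)],
        final_estimator=LogisticRegression(),
    )
    # stacked_model.fit(X_df, y)  # X_df: a DataFrame holding both text and metadata columns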
topic model to identify latent topics/types of SVs without relying on a labeled taxonomy.
The identified topics were mapped to the existing SV taxonomies such as CWE [156] and
OWASP [157, 158]. The topics generated by topic models like LDA can also be used as
features for classification/regression models [105] or building topic-wise models to capture
local SV patterns [210]. However, definite interpretations for unsupervised outputs are
challenging to obtain as they usually rely on human judgement [211].
metric to be considered as it explicitly captures all values in a confusion matrix, and thus
has less bias in results.
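As a small illustration of why MCC complements F1-Score (this example is ours, not taken from the reviewed studies), a classifier that always predicts the majority class on imbalanced data can still obtain a high F1-Score while its MCC collapses to zero because it never produces a True Negative:

    from sklearn.metrics import f1_score, matthews_corrcoef

    y_true = [1] * 90 + [0] * 10   # imbalanced ground truth
    y_pred = [1] * 100             # always predicts the majority (positive) class

    print(f1_score(y_true, y_pred))           # ~0.947: looks deceptively good
    print(matthews_corrcoef(y_true, y_pred))  # 0.0: no True Negative is ever produced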
For regression tasks, various metrics have been used such as Mean absolute error, Mean
absolute percentage error, Root mean squared error [22] as well as Correlation coefficient
(r) and Coefficient of determination (R2 ) [185]. Note that adjusted R2 should be preferred
over R2 as R2 would always increase when adding a new (even irrelevant) feature.
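For completeness, the standard adjusted R2 formula (not specific to this review) penalizes the number of features p relative to the number of samples n, so the same R2 is worth less when obtained with many features:

    def adjusted_r2(r2: float, n: int, p: int) -> float:
        # Standard definition: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
        return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

    print(adjusted_r2(0.80, n=1000, p=10))   # ~0.798
    print(adjusted_r2(0.80, n=1000, p=500))  # ~0.600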
A model can have a higher value of one metric yet lower values of others.23 Therefore,
we suggest using a combination of suitable metrics for a task of interest to avoid result
bias towards a specific metric. Currently, most studies have focused on evaluating model
effectiveness, i.e., how well the predicted outputs match the ground-truth values. Besides
effectiveness, other aspects (e.g., efficiency in training/deployment and robustness to in-
put changes) of models should also be evaluated to provide a complete picture of model
applicability in practice.
Table 2.13: The mapping between the themes/tasks and the respective
studies collected from May 2021 to February 2022.
Theme/Task Studies
Exploitation Prediction [217], [218], [219], [220], [221], [222], [223], [224], [225]
Impact Prediction [218], [219], [220], [222], [225]
Severity Prediction [218], [219], [220], [222], [225]
Type Prediction [226], [227], [228], [229], [230]
Miscellaneous Tasks [218], [231], [232], [233], [234]
• All the tasks tackled by the studies have aligned with the ones presented in this
literature review, reinforcing the robustness of our taxonomy presented in Fig. 2.1.
26 https://stackoverflow.com
27 https://security.stackexchange.com
28 We do not claim that this list of updated papers is complete, yet we believe that we have covered the representative ones. It should also be noted that we did not include our own papers as these papers would be presented in subsequent chapters of this thesis.
• Among the themes, Exploitation has attracted the most contributions from these
new studies, similar to that of the reviewed papers prior to May 2021. In addition,
prediction of the CVSS exploitability, impact, and severity metrics is still the most
common task. However, it is encouraging to see that more studies have worked on
CVSS version 3.1, which is closer to the current industry standard.
• Regarding data sources, NVD/CVE is still most prevalently used. Besides CAPEC [122],
MITRE ATT&CK Framework29 is a new source of attack patterns being utilized [217].
• Regarding model features, BERT [80] has been commonly used to extract contextual
feature embeddings from SV descriptions for various tasks.
• Regarding prediction models, DL techniques, which are mainly based on CNN and/or
(Bi-)LSTM with attention, have been increasingly used to improve the performance
of the tasks.
• Regarding evaluation techniques and evaluation metrics, since most of the tasks are
the same as before, evaluation practices have largely stayed the same as the ones
presented in sections 2.8.4 and 2.8.5.
Overall, the three key practical challenges presented in section 2.9 are still not addressed
by the new studies. Thus, we believe that our contributions/solutions to address these
challenges in this thesis would set the foundation for future research to further improve
the practical applicability of SV assessment using data-driven approaches.
29 https://attack.mitre.org/
Chapter 3

Automated Report-Level Software Vulnerability Assessment with Concept Drift
Related publication: This chapter is based on our paper titled “Automated Soft-
ware Vulnerability Assessment with Concept Drift”, published in the 16th Interna-
tional Conference on Mining Software Repositories (MSR), 2019 (CORE A) [23].
3.1 Introduction
Software Vulnerability (SV) reports in public repositories, such as National Vulnerability
Database (NVD) [16], have been widely leveraged to automatically predict SV character-
istics using data-driven approaches (see Chapter 2). However, the data in these SV reports have a temporal property since new terms keep appearing in the descriptions of SVs. Such terms result from the release of new technologies/products or the discovery of a zero-day attack or SV; for example, NVD received more than 13,000 new SVs in 2017 [16]. The appearance of new concepts makes SV data and patterns change over time [160, 152, 156], which is known as concept drift [34]. For example, the keyword Android only started appearing in NVD in 2008, the year when Google released Android. We assert that
such new SV terms can cause problems for building report-level SV assessment models.
Previous studies [22, 49, 32] have suffered from concept drift as they have usually
mixed new and old SVs in the model validation step. Such an approach inadvertently merges new SV information with existing information, which can lead to biased results. Moreover, previous work on SV analysis [22, 20, 49, 32, 97, 95, 47] used predictive models with
only word features without reporting how to handle novel or extended concepts (e.g., new
versions of the same software) in new SVs’ descriptions. Research on machine transla-
tion [235, 236, 237, 238] has shown that unseen (Out-of-Vocabulary (OoV)) terms can
make existing word-only models less robust to future prediction due to their missing infor-
mation. For SV prediction, Han et al. [21] did use random embedding vectors to represent
the OoV words, which still discards the relationship between new and old concepts. Such
observations motivated us to tackle the research problem “How to handle the con-
cept drift issue of the SV descriptions in public repositories to improve the
robustness of automated report-level SV assessment?” We argue that addressing concept drift is key to enabling the practical applicability of automated SV assessment tools. To the best of our knowledge, there has been no existing
work to systematically address the concept drift issue in report-level SV assessment.
To perform report-level SV assessment with concept drift using SV descriptions in
public repositories, we present a Machine Learning (ML) model that utilizes both character-
level and word-level features. We also propose a customized time-based cross-validation method for model selection and validation. Our cross-validation method splits the data by year to preserve the temporal relationship of SVs. We evaluate the proposed
model on the prediction of seven Vulnerability Characteristics (VCs), i.e., Confidentiality,
Integrity, Availability, Access Vector, Access Complexity, Authentication, and Severity.
Our key contributions are:
1. We demonstrate the concept drift issue of SVs using concrete examples from NVD.
2. We investigate a customized time-based cross-validation method to select the optimal
ML models for SV assessment. Our method can help prevent future SV information
from being leaked into the past in model selection and validation steps.
3. We propose and extensively evaluate an effective Character-Word Model (CWM) to
assess SVs using the descriptions with concept drift. We also investigate the per-
formance of low-dimensional CWM models. We provide our models and associated
source code for future research at https://github.com/lhmtriet/MSR2019.
Chapter organization. Section 3.2 introduces SV descriptions and VCs. Section 3.3
describes our proposed approach. Section 3.4 presents the experimental design of this
work. Section 3.5 analyzes the experimental results and discusses the findings. Section 3.6
identifies the threats to validity. Section 3.7 covers the related work. Section 3.8 concludes
and suggests some future directions.
[Figure: Frequency distributions of the classes of the seven VCs in NVD — Confidentiality, Integrity, and Availability (None, Partial, Complete); Access Vector (Local, Adjacent Network, Network); Access Complexity (Low, Medium, High); Authentication (None, Single, Multiple); and Severity (Low, Medium, High).]
3.2 Background
Software Vulnerability (SV) assessment is an important step in the SV lifecycle, which
determines various characteristics of detected SVs [10]. Such characteristics support de-
velopers to understand the nature of SVs, which can inform prioritization and remediation
strategies. For example, if an SV can severely damage the confidentiality of a system, e.g.,
allowing attackers to access/steal sensitive information, this SV should have a high fixing
priority. A fixing protocol to ensure confidentiality can then be followed, e.g., checking/en-
forcing privileges to access the affected component/data.
National Vulnerability Database [16] (NVD) is one of the most popular and trustworthy
sources for SV assessment. NVD is maintained by a governmental body (the National Cyber Security Division of the United States Department of Homeland Security). This site
inherits unique SV identifiers and descriptions from Common Vulnerabilities and Exposures
(CVE) [66]. For SV assessment, NVD provides expert-verified assessment metrics, namely
Common Vulnerability Scoring System (CVSS) [29], for each reported SV.
CVSS is one of the most commonly used frameworks by both researchers and practi-
tioners to perform SV assessment. There are two main versions of CVSS, namely versions 2
and 3, in which version 3 only came into effect in 2015. CVSS version 2 is still widely used as many SVs reported prior to 2015 can still pose threats to contemporary systems. For instance, the
SV with CVE-2004-0113 first found in 2004 was exploited in 2018 [239]. Hence, we adopt
the assessment metrics of CVSS version 2 as the outputs for the SV assessment models in
this study.
CVSS version 2 provides metrics to quantify the three main aspects of SVs, namely
exploitability, impact, and severity. We focus on the base metrics because the temporal
metrics (e.g., exploit availability in the wild) and environmental metrics (e.g., potential impact outside of a system) are unlikely to be obtainable from project artifacts (e.g., SV code/reports) alone. Specifically, the base Exploitability metrics examine the technique (Access
Vector) and complexity to initiate an exploit (Access Complexity) as well as the authen-
tication requirement (Authentication). The base Impact metrics of CVSS focus on the
system Confidentiality, Integrity, and Availability. The Exploitability and Impact metrics are used to compute the Severity of an SV, which approximates its criticality.
Table 3.1: Word and character n-grams extracted from the sentence “Hello
World”. ‘_’ represents a space.
character-level n-grams is also more than that of word-level counterparts. Section 3.3.3
provides more details about the time-based k-fold cross-validation method.
Next comes the model building process with four main steps: (i) word n-grams gener-
ation, (ii) character n-grams generation, (iii) feature aggregation and (iv ) character-word
model building. Steps (i) and (ii) use the preprocessed text in the previous process to
generate word and character n-grams based on the identified optimal NLP representations
of each VC. The word n-grams generation step (i) here is the same as the one in the
time-based k-fold cross-validation of the previous process. An example of the word and
character n-grams in our approach is given in Table 3.1. Such character n-grams increase
the probability of capturing parts of OoV terms due to concept drift in SV descriptions.
Subsequently, both levels of the n-grams and the optimal NLP representations are input
into the feature aggregation step (iii) to extract features from the preprocessed text using
our proposed algorithm in section 3.3.4. This step also combines the aggregated character
and word vocabularies with the optimal NLP representations of each VC to create the
feature models. We save such models to transform data of future prediction. In the last
step (iv ), the extracted features are trained with the optimal classifiers found in the model
selection process to build the complete character-word models for each VC to perform
automated report-level SV assessment with concept drift.
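The sketch below illustrates steps (i) and (ii) above with scikit-learn; it is a simplified stand-in for our implementation, and the exact tokenization shown in Table 3.1 may differ (e.g., using '_' to mark spaces).

    from sklearn.feature_extraction.text import CountVectorizer

    text = ["hello world"]

    # Step (i): word n-grams (here uni-grams and bi-grams).
    word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 2)).fit(text)
    print(word_ngrams.get_feature_names_out())   # ['hello' 'hello world' 'world']

    # Step (ii): character n-grams; 'char_wb' keeps n-grams within word boundaries.
    char_ngrams = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(text)
    print(char_ngrams.get_feature_names_out())   # e.g., ' h', ' he', 'el', 'ell', ..., 'ld'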
In the prediction process, a new SV description is first preprocessed using the same text
preprocessing step. Then, the preprocessed text is transformed to create features by the
feature models saved in the model building process. Finally, the trained character-word
models use such features to determine each VC.
[Figure: Time-based k-fold cross-validation. In each pass, SVs up to a given year form the training data, those of the following year form the validation data, and SVs of later years are left unused.]
Table 3.2: The eight configurations of NLP representations used for model
selection. Note: ‘✓’ is selected, ‘-’ is non-selected.
n-grams generation, (iii) feature transformation, and (iv ) model training and evaluation.
Data splitting explicitly considers the time order of SVs to ensure that in each pass/fold,
the new information of the validation set does not exist in the training set, which maintains
the temporal property of SVs. New terms can appear at different times during a year; thus, the preprocessed text in each fold is split explicitly by year, not by equal sample size as in traditional time-series splits;1 e.g., SVs from 1999 to 2010 are for training and those in
2011 are for validation in a pass/fold.
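A minimal sketch of this year-based splitting is shown below (assuming the preprocessed data carry a "year" column); it is an illustration rather than the exact implementation we released.

    import pandas as pd

    def time_based_folds(df: pd.DataFrame, k: int, last_val_year: int):
        """Yield (training, validation) frames for k passes, validating on one year per pass."""
        for i in range(k):
            val_year = last_val_year - k + 1 + i
            train = df[df["year"] < val_year]   # all SVs before the validation year
            val = df[df["year"] == val_year]    # SVs of the validation year only
            yield train, val

    # Example: five passes validating on 2007-2011; the last pass trains on 1999-2010
    # and validates on 2011, so no future SV information leaks into training.
    # for train, val in time_based_folds(nvd_df, k=5, last_val_year=2011): ...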
After data splitting in each fold, we use the training set to generate the word n-grams.
Subsequently, with each of the eight NLP configurations in Table 3.2, the feature transfor-
mation step uses the word n-grams as the vocabulary to transform the preprocessed text
of both training and validation sets into the features for building a model. We create the
NLP configurations from various values of n-grams combined with either term frequency or
tf-idf measure. Uni-gram with term frequency is also called Bag-of-Words (BoW). These
NLP representations have been selected since they are popular and have performed well for
SV analysis [22, 49, 95]. For each NLP configuration, the model training and evaluation
1 https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
step trains six classifiers (see section 3.4.3) on the training set and then evaluates the mod-
els on the validation set using different evaluation metrics (see section 3.4.4). The model
with the highest average cross-validated performance is selected for a VC. The process is
repeated for every VC, then the optimal classifiers and NLP representations are returned
for all seven VCs.
words, such choice can enhance the probability of a model capturing the OoV words in new
descriptions. Retaining only the character features also helps reduce the number of features
and model overfitting. After that, steps 9-10 define the feature models modelw and modelc
using the word (diff_words) and character (slt_chars) vocabularies, respectively, along with the NLP configurations to transform the input documents into feature matrices for building an assessment model. Steps 11-12 then use the two defined word and character models to transform the input documents into the feature matrices Xword and Xchar, respectively. Step 13 concatenates the two feature matrices by columns. Finally,
step 14 returns the final aggregated feature matrix Xagg along with the word and character
feature models modelw and modelc .
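A condensed sketch of this aggregation step is given below; it mirrors the algorithm's intent with scikit-learn and SciPy but omits the vocabulary-construction details (diff_words and slt_chars are assumed to be precomputed).

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer

    def aggregate_features(docs, diff_words, slt_chars, word_ngrams=(1, 1), char_ngrams=(2, 6)):
        # Steps 9-10: define the word and character feature models from the vocabularies.
        model_w = CountVectorizer(analyzer="word", ngram_range=word_ngrams, vocabulary=diff_words)
        model_c = CountVectorizer(analyzer="char_wb", ngram_range=char_ngrams, vocabulary=slt_chars)
        # Steps 11-12: transform the documents into the word and character feature matrices.
        X_word = model_w.fit_transform(docs)
        X_char = model_c.fit_transform(docs)
        # Steps 13-14: concatenate the two matrices by columns and return everything.
        X_agg = hstack([X_word, X_char])
        return X_agg, model_w, model_c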
• RQ2: Which are the optimal models for multi-classification of each SV character-
istic? To answer RQ2, we present the optimal models (i.e., classifiers and NLP
representations) using word features for each VC selected by a five-fold time-based
cross-validation method (see section 3.3.3). We also compare the performance of
different classes of models (single vs. ensemble) and NLP representations to give
recommendations for future use.
• RQ4: To what extent can low-dimensional models retain the original performance? The features of our proposed model in RQ3 are high-dimensional and sparse. Hence,
we evaluate a dimensionality reduction technique (i.e., Latent Semantic Analysis [245])
and the sub-word embeddings (i.e., fastText [196, 246]) to show how much informa-
tion of the original model is approximated in lower dimensions. RQ4 findings can
facilitate the building of more efficient concept-drift-aware predictive models.
3.4.2 Dataset
We retrieved 113,292 SVs from NVD in JSON format. The dataset contains the SVs from
1988 to 2018. We discarded 5,926 SVs that contain “** REJECT **” in their descriptions since experts had confirmed them to be duplicated or incorrect. Seven VCs of CVSS 2 (see section 3.2) were used as our SV assessment metrics. There were also 2,242 SVs without any CVSS 2 value, which we removed from our dataset. Finally, we obtained a dataset containing 105,124 SVs along with their descriptions and the values of the seven VCs indicated previously. For evaluation purposes, we followed the work in [22] and used the year 2016 to divide our dataset into training and testing sets with sizes of 76,241 and 28,883, respectively. The primary reason for splitting the
dataset based on the time order is to consider the temporal relationship of SVs.
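The preparation steps above roughly correspond to the following pandas sketch; the file path, column names, and the exact placement of the 2016 boundary are assumptions for illustration.

    import pandas as pd

    CVSS2_COLUMNS = ["confidentiality", "integrity", "availability", "access_vector",
                     "access_complexity", "authentication", "severity"]  # hypothetical names

    nvd = pd.read_json("nvd_all_years.json")                                   # hypothetical path
    nvd = nvd[~nvd["description"].str.contains("** REJECT **", regex=False)]   # drop rejected SVs
    nvd = nvd.dropna(subset=CVSS2_COLUMNS)                                     # drop SVs lacking CVSS 2 values

    train = nvd[nvd["year"] <= 2016]   # time-based split around 2016, following [22]
    test = nvd[nvd["year"] > 2016]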
• Naïve Bayes (NB) [248] is a simple probabilistic model that is based on Bayes’ the-
orem. This model assumes that all the features are conditionally independent of one another. In this study, NB had no tuning hyperparameter during the
validation step.
• Logistic Regression (LR) [249] is a linear classifier in which the logistic function is used
to convert a linear output into a respective probability. The one-vs-rest scheme was
applied to split the multi-class problem into multiple binary classification problems.
In this work, we selected the optimal value of the regularization parameter for LR
from the list of values: 0.01, 0.1, 1, 10, 100.
• Random Forest (RF) [251] is a bagging model in which multiple decision trees are
combined to reduce the variance and sensitivity to noise. The complexity of RF is
mainly controlled by (i) the number of trees, (ii) maximum depth, and (iii) maximum
number of leaves. (i) tuning values were: 100, 300, 500. We set (ii) to unlimited,
which makes the model the highest degree of flexibility and easier to adapt to new
data. For (iii), the tuning values were 100, 200, 300 and unlimited.
Figure 3.4: The number of new terms in SV descriptions in NVD from 2000 to 2017.
Examples of new terms and the years they first appeared in NVD: Bugzilla (1998); Ascii (1999); JSP (1999); Tomcat (1999); OpenWave (2000); Code Red worm (2001); Slammer worm (2002); Firefox (2002); asp.net (2002); Welchia worm (2003); Wordpress (2003); Skype (2003); Witty worm (2004); Ubuntu (2004); Zotob worm (2005); Ajax (2005).
non-temporal one used in [22, 20]. For each method, we computed the Weighted F1-Score difference between the cross-validated and testing results of the optimal models found in the validation step (see Fig. 3.6). The model selection procedure and selection criteria of the normal cross-validation method were the same as those of our temporal one. Fig. 3.6 shows
that traditional non-temporal cross-validation was overfitted in four out of seven cases
(i.e., Availability, Access Vector, Access Complexity, and Authentication). Notably, the degrees of overfitting of the non-temporal validation method were 1.8, 4.7 and 1.8 times higher than those of the time-based version for Availability, Access Vector, and Access Complexity,
respectively. For the other three VCs, both methods performed similarly, with differences within 0.02. Moreover, on average, the Weighted F1-Scores on the testing set of the non-temporal cross-validation method were only 0.002 higher than those of our approach. This
value is negligible compared to the difference of 0.02 (ten times more) in the validation
step. It is worth noting that a similar comparison also held for non-stratified non-temporal
cross-validation. Overall, both qualitative and quantitative findings suggest that the time-
based cross-validation method should be preferred to lower the performance overestimation
and mis-selection of report-level SV assessment models due to the effect of concept drift in
the model selection step.
The summary answer to RQ1: The qualitative results show that many new terms
are regularly added to NVD, after the release or discovery of the corresponding soft-
ware products or cyber-attacks. Normal random-based evaluation methods mixing
these new terms can inflate the cross-validated model performance. Quantitatively,
the optimal models found by our time-based cross-validation are also less overfitted,
especially two to five times for Availability, Access Vector and Access Complexity.
It is recommended that the time-based cross-validation should be adopted in the
model selection step for report-level SV assessment.
[Figure 3.6: Differences between the cross-validated and testing Weighted F1-Scores of the optimal models selected with normal (non-temporal) vs. time-based cross-validation for the seven VCs (Confidentiality, Integrity, Availability, Access Vector, Access Complexity, Authentication, and Severity).]
Classifier   Hyperparameters
NB           None
LR           Regularization value: 0.1 for term frequency; 10 for tf-idf
SVM          Kernel: linear; Regularization value: 0.1
RF           Number of trees: 100; Max. depth: unlimited; Max. number of leaf nodes: unlimited
XGB          Number of trees: 100; Max. depth: unlimited; Max. number of leaf nodes: 100
LGBM         Number of trees: 100; Max. depth: unlimited; Max. number of leaf nodes: 100
Table 3.4: Optimal models and results after the validation step. Note:
The NLP configuration number is put in parentheses.
that the maximum depth of ensemble methods was the hyperparameter that affected the
validation result the most; the others did not change the performance dramatically. Finally, we obtained a search space of size 336 in the cross-validation step ((six classifiers) × (eight NLP configurations) × (seven characteristics)). The optimal validation results after using
our five-fold time-based cross-validation method in section 3.3.3 are given in Table 3.4.
Besides each output, we also examined the validated results across different types of
classifiers (single vs. ensemble models) and NLP representations (n-grams and tf-idf vs.
term frequency). Since the NLP representations mostly affect the classifiers, their validated
results are grouped by six classifiers in Tables 3.5 and 3.6. The result shows that tf-idf did
not outperform term frequency for five out of six classifiers. This result agrees with the
existing work [22, 20]. It seemed that n-grams with n > 1 improved the result. We used a
one-sided non-parametric Wilcoxon signed rank test [254] to check the significance of such
improvement of n-grams (n > 1). The p-value was 0.169, which was larger than the significance level of 0.01. Thus, we could not confirm a significant improvement of n-grams over uni-grams. Furthermore, there was no performance improvement when further increasing n. These three observations implied that the more complex NLP
representations did not provide a statistically significant improvement over the simplest
BoW (configuration 1 in Table 3.2). This argument helped explain why three out of seven
optimal models in Table 3.4 were BoW.
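For reference, the statistical check can be reproduced with SciPy as sketched below; the paired scores are hypothetical stand-ins for the cross-validated Weighted F1-Scores of the two representations.

    from scipy.stats import wilcoxon

    ngram_scores   = [0.843, 0.851, 0.832, 0.864, 0.846, 0.854, 0.843]  # hypothetical values
    unigram_scores = [0.840, 0.846, 0.834, 0.858, 0.845, 0.850, 0.836]  # hypothetical values

    # One-sided test: are the n-gram (n > 1) scores significantly higher than uni-gram (BoW)?
    stat, p_value = wilcoxon(ngram_scores, unigram_scores, alternative="greater")
    print(p_value)  # the improvement is accepted only if p_value < 0.01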
NLP representation   NB      LR      SVM     RF      XGB     LGBM
Term frequency       0.781   0.833   0.835   0.843   0.846   0.846
tf-idf               0.786   0.832   0.831   0.836   0.843   0.844
Along with the NLP representations, we also investigated the performance difference
between single (NB, LR, and SVM) and ensemble (RF, XGB, and LGBM) models. The
average Weighted F1-Scores grouped by VCs for single and ensemble models are illustrated
in Fig. 3.7. The ensemble models seemed to consistently demonstrate superior performance compared to their single counterparts. We also observed that the ensemble methods
produced mostly consistent results (i.e., small variances) for Access Vector and Authenti-
cation characteristics. We performed the one-sided non-parametric Wilcoxon signed rank
tests [254] to check the significance of the better performance of the ensemble over the
single models. Table 3.7 reports the p-values of the results from the hypothesis testing.
The tests confirmed that the superiority of the ensemble methods was significant since
all p-values were smaller than the significance level of 0.01. The validated results in Ta-
ble 3.4 also affirmed that six out of seven optimal classifiers were ensemble (i.e., LGBM and
XGB). It is noted that the XGB model usually took more time to train than the LGBM
model, especially for tf-idf representation. Our findings suggest that LGBM, XGB and
BoW should be considered as baseline classifiers and NLP representations for report-level
SV assessment.
The summary answer to RQ2: LGBM and BoW are the most frequent optimal
classifiers and NLP representations. Overall, the more complex NLP representations such as n-grams and tf-idf do not provide a statistically significant performance improvement over BoW. The ensemble models perform statistically better than sin-
gle ones. It is recommended that the ensemble classifiers (e.g., XGB and LGBM)
and BoW should be used as baseline models for report-level SV assessment.
[Figure 3.7: Average Weighted F1-Scores of the single vs. ensemble models for each of the seven VCs.]
SV characteristic p-value
Confidentiality 3.261 × 10−5
Integrity 9.719 × 10−5
Availability 3.855 × 10−5
Access Vector 2.320 × 10−3
Access Complexity 1.430 × 10−5
Authentication 1.670 × 10−3
Severity 1.060 × 10−7
cases in the SV descriptions from 2000 to 2018. For each year from 2000 to 2018, we
split the dataset into (i) training set (data from the previous year backward) for building
the vocabulary, and (ii) testing set (data from the current year to 2018) for checking
the vocabulary existence. We found 64 cases from 2000 to 2018 in the testing data in which all the features were missing (see Appendix 3.9). We used the terms appearing in at least 0.1% of all descriptions. It should be noted that the number of all-zero cases may be reduced by using a larger vocabulary, with the trade-off of longer computation time. We also investigated the descriptions of these vulnerabilities and found several interesting patterns. The average length of these abnormal descriptions was only 7.98 words, compared to 39.17 words for all descriptions. It turned out that the information about the threats and sources of such SVs was limited; most of them included only the assets and attack/SV types. For example, the vulnerabilities with IDs of the form CVE-2016-10001xx had nearly the same format, “Reflected XSS in WordPress plugin”, with the only differences being the name and version of the plugin. This format made it hard for a model to evaluate the impact of each SV separately. Another issue was due to specialized or abbreviated terms such as /redirect?url = XSS, SEGV, CSRF without proper explanation. The above issues suggest that SV descriptions should be written with sufficient information to enhance the comprehensibility of SVs.

Figure 3.8: The relationship between the size of the vocabulary and the maximum number of character n-grams.
For RQ3, we evaluated character-level features as a solution to the issues of the word-only models. We considered the non-stop-words with high frequency (i.e., appearing in more than 10% of all descriptions) to generate the character features. Using the same 0.1% value as in RQ2 increased the dimensions by more than 30 times, but the performance only changed within 0.02. According to Algorithm 1, the output minimum number of character n-grams was chosen to be two. We first tested the robustness of the character-only models by setting the maximum number of character n-grams to only three. For each year y from 1999 to 2017, we used such a character model to generate the character n-grams from the data up to and including year y. We then verified the existence of such features using the descriptions of the remaining data (i.e., from year y + 1 to 2018).
Surprisingly, the model using only two-to-three-character n-grams could produce at least
one non-zero feature for all the descriptions even using only training data in 1999 (i.e., the
first year in our dataset based on CVE-ID). Such a finding shows that our approach is robust to SV data changes (concept drift) in the testing data from 2000 to 2018, even with this limited amount of training data and without retraining.
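The robustness check can be approximated by the following sketch (assuming a DataFrame with "year" and "description" columns; our released code may implement it differently).

    from sklearn.feature_extraction.text import CountVectorizer

    def count_all_zero_descriptions(df, year):
        """Count later-year descriptions whose 2-3 character n-gram features are all zero."""
        past = df[df["year"] <= year]["description"]
        future = df[df["year"] > year]["description"]
        vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit(past)
        X_future = vectorizer.transform(future)
        return int((X_future.sum(axis=1) == 0).sum())   # fully out-of-vocabulary descriptions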
Next, to increase the generalizability of our approach, values of 3-10 were considered
for selecting the maximum number of character n-grams based on their corresponding
vocabulary sizes (see Fig. 3.8). Using the elbow method in cluster analysis [255], six
was selected since the vocabulary size did not increase dramatically after this point. The
selected minimum and maximum values of character n-grams matched the minimum and
average word lengths of all NVD descriptions in our dataset, respectively.
We then used the feature aggregation algorithm (see section 3.3.4) to create the ag-
gregated features from the character n-grams (2 ≤ n ≤ 6) and word n-grams to build
the final model set and compared it with two baselines: Word-only Model (WoM) and
Character-only Model (CoM).
Table 3.8: Performance (Accuracy, Macro F1-Score, Weighted F1-Score) of our character-word vs. word-only and character-only models.

                    Our optimal model (CWM)            Word-only model (WoM)              Character-only model (CoM)
SV characteristic   Acc.    Macro F1   Weighted F1     Acc.    Macro F1   Weighted F1     Acc.    Macro F1   Weighted F1
Confidentiality     0.727   0.717      0.728           0.722   0.708      0.723           0.694   0.683      0.698
Integrity           0.763   0.749      0.764           0.763   0.744      0.764           0.731   0.718      0.734
Availability        0.712   0.711      0.711           0.700   0.696      0.702           0.660   0.657      0.660
Access Vector       0.914   0.540      0.901           0.904   0.533      0.894           0.910   0.538      0.899
Access Complexity   0.703   0.468      0.673           0.718   0.476      0.691           0.700   0.457      0.668
Authentication      0.875   0.442      0.844           0.864   0.425      0.832           0.866   0.441      0.840
Severity            0.668   0.575      0.663           0.686   0.569      0.675           0.661   0.549      0.652
It should be noted that WoM is the model in which concept drift is not handled. Unfor-
tunately, a direct comparison with the existing WoM [22] was not possible since they used
an older NVD dataset and more importantly, they did not release their source code for
reproduction. However, we tried to set up the experiments based on the guidelines and
results presented in the previous paper [22].
To be more specific, we used BoW predictors and random forest (the best of the three models they used) with the following hyperparameters: 100 trees and 40 features for splitting. For CoM, we used the same optimal classifier for each VC. The comparison results are given in Table 3.8. CWM performed slightly better than WoM for four out of seven VCs regarding all evaluation metrics. Also, 4.98% of CWM's features were non-zero, which is nearly five times denser than the 1.03% of WoM. In addition, CoM was the worst model among the three, which had been expected since
it contained the least information (smallest number of features). Although CWM did not
significantly outperform WoM, its main advantage is to effectively handle the OoV terms
(concept drift), except for new terms without any matching parts. We hope that our solution to concept drift will be integrated into practitioners’ existing frameworks/workflows and future research to enable more robust report-level SV assessment.
The summary answer to RQ3: The WoM does not handle the new cases well,
especially those with all zero-value features. Without retraining, the tri-gram char-
acter features can still handle the OoV words effectively with no all-zero features
for all testing data from 2000 to 2018. Our CWM performs comparably with the existing WoM and provides nearly five times richer SV information. Hence, our
CWM is better for automated report-level SV assessment with concept drift.
embeddings for SV analysis and assessment. Overall, our findings show that LSA and
fastText are capable of building efficient report-level SV assessment models without too
much performance trade-off.
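As an indication of how such a low-dimensional model can be assembled (standard scikit-learn usage rather than the exact RQ4 pipeline), the sparse character-word features can be projected onto 300 latent dimensions with truncated SVD before training any of the classifiers from section 3.4.3.

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    low_dim_model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(2, 6)),   # sparse n-gram features
        TruncatedSVD(n_components=300, random_state=42),           # LSA: keep 300 dimensions
        LogisticRegression(max_iter=1000),                         # any classifier would do here
    )
    # low_dim_model.fit(train_descriptions, train_labels)
    # low_dim_model.predict(new_descriptions)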
The summary answer to RQ4: The LSA model with 300 dimensions (6-18% of
the original size) retains from 90% up to 99% performance of the original model.
With the same feature dimensions, the model with fastText sub-word embeddings provides even more promising results. The fastText model trained with SV knowledge
outperforms that trained on a general context (e.g., Wikipedia). LSA and fastText
can help build efficient models for report-level SV assessment.
Chapter 4

Automated Function-Level Software Vulnerability Assessment
Related publication: This chapter is based on our paper titled “On the Use
of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment
Models”, published in the 19th International Conference on Mining Software Repos-
itories (MSR), 2022 (CORE A) [268].
The proposed approach in Chapter 3 has improved the robustness of report-level Soft-
ware Vulnerability (SV) assessment models against changing data of SVs in the wild.
Nevertheless, these report-level models still rely on SV descriptions that mostly require
significant expertise and manual effort to create, which may cause delays for SV fixing. On
the other hand, respective vulnerable code is always available before SVs are fixed. Many
studies have developed Machine Learning (ML) approaches to detect SVs in functions and
fine-grained code statements that cause such SVs. However, as shown in Chapter 2, there
is little work on leveraging such detection outputs for data-driven SV assessment to give in-
formation about the exploitability, impact, and severity of SVs. Such information is important for understanding SVs and prioritizing their fixing. Using large-scale data from 1,782 functions of
429 SVs in 200 real-world projects, in Chapter 4, we investigate ML models for automating
function-level SV assessment tasks, i.e., predicting seven Common Vulnerability Scoring
System (CVSS) metrics. We particularly study the value and use of vulnerable statements
as inputs for developing the assessment models because SVs in functions originate from these statements. We show that vulnerable statements are 5.8 times smaller in size,
yet exhibit 7.5-114.5% stronger assessment performance (Matthews Correlation Coefficient
(MCC)) than non-vulnerable statements. Incorporating context of vulnerable statements
further increases the performance by up to 8.9% (0.64 MCC and 0.75 F1-Score). Overall,
we provide the initial yet promising ML-based baselines for function-level SV assessment,
paving the way for further research in this direction.
4.1 Introduction
As shown in Chapter 2, previous studies (e.g., [95, 21, 22, 23, 11]) have mostly used SV
reports to develop data-driven models for assigning the Common Vulnerability Scoring
System (CVSS) [29] metrics to Software Vulnerabilities (SVs). Among sources of SV re-
ports, National Vulnerability Database (NVD) [16] has been most commonly used for
building SV assessment models [11]. The popularity of NVD is mainly because it has
SV-specific information (e.g., CVSS metrics) and less noise in SV descriptions than Issue Tracking Systems (ITSs) like Bugzilla [269]. The discrepancy is because NVD reports
are vetted by security experts, while ITS reports may be contributed by users/developers
with limited security knowledge [270]. However, NVD reports are mostly released long
after SVs have been fixed. Our analysis revealed that less than 3% of the SV reports with
the CVSS metrics on NVD had been published before SVs were fixed; on average, these
reports appeared 146 days later than the fixes. Note that our findings accord with the
previous studies [215, 216]. This delay renders the CVSS metrics required for SV assess-
ment unavailable at fixing time, limiting the adoption of report-level SV assessment for
understanding SVs and prioritizing their fixes.
Instead of using SV reports, an alternative and more straightforward way is to directly
take (vulnerable) code as input to enable SV assessment prior to fixing. Once a code
function is confirmed vulnerable, SV assessment models can assign it the CVSS metrics
before the vulnerable code gets fixed, even when its report is not (yet) available. Note
that it is non-trivial to use static application security testing tools to automatically create
bug/SV reports from vulnerable functions for current SV assessment techniques as these
tools often have too many false positives [271, 272]. To develop function-level assessment
models, it is important to obtain input information about SVs in functions detected by
manual debugging or automatic means like data-driven approaches (e.g., [273, 274, 275]).
Notably, recent studies (e.g., [276, 277]) have shown that an SV in a function usually
stems from a very small number of code statements/lines, namely vulnerable statements.
Intuitively, these vulnerable statements potentially provide highly relevant information
(e.g., causes) for SV assessment models. Nevertheless, a large number of other (non-vulnerable) lines in functions, though they do not directly contribute to SVs, can still be useful for SV assessment, e.g., by indicating the impacts of an SV on nearby code. However, little is known about function-level SV assessment models as well as the extent to which vulnerable and non-vulnerable statements are useful as inputs for these models.
We conduct a large-scale study to fill this research gap. We investigate the useful-
ness of integrating fine-grained vulnerable statements and different types of code context
(relevant/surrounding code) into learning-based SV assessment models. The assessment
models employ various feature extraction methods and Machine Learning (ML) classifiers
to predict the seven CVSS metrics (Access Vector, Access Complexity, Authentication,
Confidentiality, Integrity, Availability, and Severity) for SVs in code functions.
Using 1,782 functions from 429 SVs of 200 real-world projects, we evaluate the use
of vulnerable statements and other lines in functions for developing SV assessment mod-
els. Despite being up to 5.8 times smaller in size (lines of code), vulnerable statements
are more effective for function-level SV assessment, i.e., 7.4-114.5% higher Matthews Cor-
relation Coefficient (MCC) and 5.5-43.6% stronger F1-Score, than non-vulnerable lines.
Moreover, vulnerable statements with context perform better than vulnerable lines alone.
Particularly, using vulnerable and all the other lines in each function achieves the best
performance of 0.64 MCC (8.9% better) and 0.75 F1-Score (8.5% better) compared to
using only vulnerable statements. We obtain such improvements when combining vulner-
able statements and context as a single input based on their code order, as well as when
treating them as two separate inputs. Having two inputs explicitly provides models with
1   protected String getExecutionPreamble()
2   {
3       if ( getWorkingDirectoryAsString() == null )
4       { return null; }
5       String dir = getWorkingDirectoryAsString();
6       StringBuilder sb = new StringBuilder();
7       sb.append("cd");
8  -    sb.append(unifyQuotes(dir));
9  +    sb.append(quoteOneItem(dir, false));
10      sb.append("&&");
11      return sb.toString();
12  }
the location of vulnerable statements and context for the assessment tasks, while single
input does not. Surprisingly, we do not obtain any significant improvement of the double-
input models over the single-input counterparts. These results show that function-level SV
assessment models can still be effective even without knowing exactly which statements
are vulnerable. Overall, our findings can inform the practice of building function-level SV
assessment models.
Our key contributions are summarized as follows:
1. To the best of our knowledge, we are the first to leverage data-driven models for au-
tomating function-level SV assessment tasks that enable SV prioritization/planning
prior to fixing.
2. We study the value of using fine-grained vulnerable statements in functions for build-
ing SV assessment models.
Commits (VFCs) as these lines are presumably removed to fix SVs. The functions con-
taining such identified statements are considered vulnerable. Note that VFCs are used
as they can be relatively easy to retrieve from various sources like National Vulnerability
Database (NVD) [16]. An exemplary function and its vulnerable statement are in Fig. 4.1.
Line 8 “sb.append(unifyQuotes(dir));” is the vulnerable statement; this line was re-
placed with a non-vulnerable counterpart “sb.append(quoteOneItem(dir, false));” in
the VFC. The replacement was made to properly sanitize the input (dir), preventing OS
command injection.
Despite active research in SV detection, there is little work on utilizing the output of
such detection for SV assessment. Previous studies (e.g., [95, 20, 22, 23, 11]) have mostly
leveraged SV reports, mainly on NVD, to develop SV assessment models that alleviate
the need for manually defining complex rules for assessing ever-increasing SVs. However,
these SV reports usually appear long after SV fixing time. For example, the SV fix in
Fig. 4.1 was done 1,533 days before the date it was reported on NVD. In fact, such a
delay, i.e., disclosing SVs after they are fixed, is a recommended practice so that attackers
cannot exploit unpatched SVs to compromise systems [281]. One may argue that internal
bug/SV reports in Issue Tracking Systems (ITS) such as JIRA [282] or Bugzilla [269] can
be released before SV fixing and have severity levels. However, ITS severity levels are
often for all bug types, not only SVs. These ITSs also do not readily provide exploitability
and impact metrics like CVSS [29] for SVs, limiting assessment information required for
fixing prioritization. Moreover, SVs are mostly rooted in source code; thus, it is natural to
perform code-based SV assessment. We propose predicting seven base CVSS metrics (i.e.,
Access Vector, Access Complexity, Authentication, Confidentiality, Integrity, Availability,
and Severity)1 after SVs are detected in code functions to enable thorough and prior-fixing
SV assessment. We do not perform SV assessment for individual lines because, like Li et al. [283], we observed that a given function can have more than one vulnerable line and nearly all of these lines are strongly related and contribute to the same SV (i.e., having the same CVSS metrics).
Vulnerable statements represent the core parts of SVs, but we posit that other (non-
vulnerable) parts of a function may also be usable for SV assessment. Specifically, non-
vulnerable statements in a vulnerable function are either directly or indirectly related to the
current SV. We use program slicing [284] to define directly SV-related statements as the lines that affect or are affected by the variables in vulnerable statements. For example, the blue
lines in Fig. 4.1 are directly related to the SV as they define, change, or use the sb and dir
variables in vulnerable line 8. These SV-related statements can reveal the context/usage of
affected variables for analyzing SV exploitability, impact, and severity. For instance, lines
5-6 denote that dir is a directory and sb is a string (StringBuilder object), respectively;
line 7 then indicates that a directory change is performed, i.e., the cd command. This
sequence of statements suggests that sb contains a command changing directory. Line 11
returns the vulnerable command, probably affecting other components. Besides, indirectly
SV-related statements, e.g., the black lines in Fig. 4.1, are remaining lines in a function
excluding vulnerable and directly SV-related statements. These indirectly SV-related lines
may still provide information about SVs. For example, lines 3-4 in Fig. 4.1 imply that there is only a null check for the directory, without imposing any privilege requirement to perform the command, potentially reducing the complexity of exploiting the SV. It
remains unclear to what extent different types of statements are useful for SV assessment
tasks. Therefore, this study aims to unveil the contributions of these statement types to
function-level SV assessment models.
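To make the idea of directly SV-related lines more tangible, the toy sketch below lexically approximates them for the function in Fig. 4.1 by keeping lines that mention a variable from the vulnerable statement; the actual study uses proper program slicing [284] rather than this simplification.

    import re

    function_lines = {
        3: 'if (getWorkingDirectoryAsString() == null)',
        5: 'String dir = getWorkingDirectoryAsString();',
        6: 'StringBuilder sb = new StringBuilder();',
        7: 'sb.append("cd");',
        8: 'sb.append(unifyQuotes(dir));',   # the vulnerable statement
        11: 'return sb.toString();',
    }
    vuln_vars = {"sb", "dir"}  # variables used in the vulnerable statement (line 8)

    directly_related = {
        number: line for number, line in function_lines.items()
        if number != 8 and vuln_vars & set(re.findall(r"[A-Za-z_]\w*", line))
    }
    print(sorted(directly_related))  # [5, 6, 7, 11]: the directly SV-related (blue) lines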
1 These metrics are from CVSS version 2 and were selected based on the same reasons presented in the study in Chapter 3. More details can be found in section 3.2.
Figure 4.2: Methodology used to answer the research questions. Note: The vulnerable function is the one described in Fig. 4.1.
RQ-wise method. The methods to collect data, extract features as well as develop and
evaluate models in Fig. 4.2 were utilized for answering all the Research Questions (RQs)
in section 4.3. RQ1 developed and compared two types of SV assessment models, namely
models using only vulnerable statements and those using only non-vulnerable statements.
In RQ2, for each of the program slicing, surrounding, and function context types, we cre-
ated a single feature vector by combining the current context and corresponding vulnerable
statements, based on their appearance order in the original functions, for model building
and performance comparison. In RQ3, for each context type in RQ2, we extracted two
separate feature vectors, one from vulnerable statements and another one from the con-
text, and then fed these vectors into SV assessment models. We compared the two-input
approach in RQ3 with the single-input counterpart in RQ2.
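The difference between the single-input (RQ2) and double-input (RQ3) settings can be sketched as follows with a simple Bag-of-Tokens feature extractor; the code snippets used here are placeholders, and the real pipeline uses the feature methods and classifiers described in Fig. 4.2.

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import CountVectorizer

    vuln_code = ["sb.append(unifyQuotes(dir));"]                 # vulnerable statements of a function
    context_code = ['String dir = getWorkingDirectoryAsString(); sb.append("cd");']  # its context lines

    # Single input (RQ2): one feature vector over the merged code
    # (the study merges the lines in their original order within the function).
    single_vectorizer = CountVectorizer(token_pattern=r"\w+")
    X_single = single_vectorizer.fit_transform([v + " " + c for v, c in zip(vuln_code, context_code)])

    # Double input (RQ3): two separate feature vectors, concatenated column-wise so the
    # model explicitly knows which tokens come from vulnerable statements vs. context.
    vuln_vectorizer = CountVectorizer(token_pattern=r"\w+")
    context_vectorizer = CountVectorizer(token_pattern=r"\w+")
    X_double = hstack([vuln_vectorizer.fit_transform(vuln_code),
                       context_vectorizer.fit_transform(context_code)])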
[Figure: Class distributions of the CVSS metrics (excerpt) — Access Vector: Local 4.7%, Network 95.3%; Access Complexity: High 2.1%, Medium 23.4%, Low 74.5%.]
Bag-of-Tokens, Bag-of-Subtokens, Word2vec, fastText, and CodeBERT can still work with
these cases as these methods operate directly on code tokens.
4.5 Results
4.5.1 RQ1: Are Vulnerable Code Statements More Useful Than Non-
Vulnerable Counterparts for SV Assessment Models?
Based on the extraction process in section 4.4.1, we collected 1,782 vulnerable functions
containing 5,179 vulnerable and 57,633 non-vulnerable statements. The proportions of
these two types of statements are given in the first and second boxplots, respectively, in
Fig. 4.4. On average, 14.7% of the lines in the selected functions were vulnerable, a proportion 5.8 times smaller than that of non-vulnerable lines. Interestingly, we also observed that 55% of the
functions contained only a single vulnerable statement. These values show that vulnerable
statements constitute a very small proportion of functions.
Despite the small size (no. of lines), vulnerable statements contributed more
to the predictive performance of the seven assessment tasks than non-vulnerable
statements (see Table 4.2). We considered two variants of non-vulnerable statements
for comparison. The first variant, Non-vuln (random), randomly selected the same num-
ber of lines as vulnerable statements from non-vulnerable statements in each function.
The second variant, Non-vuln (all) aka. Non-vuln (All - Vuln) in Fig. 4.4, considered
all non-vulnerable statements. Compared to same-sized non-vulnerable statements (Non-
vuln (random)), Vuln-only (using vulnerable statements solely) produced 116.9%, 126.6%,
98.7%, 90.7%, 147.9%, 111.2%, 116.7% higher MCC for Access Vector, Access Complex-
ity, Authentication, Confidentiality, Integrity, Availability, and Severity tasks, respectively.
On average, Vuln-only was 114.5% and 43.6% better than Non-vuln (random) in MCC
5 The macro version of F1-Score was used for multi-class classification.
6 r ≤ 0.1: negligible, 0.1 < r ≤ 0.3: small, 0.3 < r ≤ 0.5: medium, r > 0.5: large [309]
[Figure 4.4: Boxplots of the proportions (%) of different types of lines in a function, e.g., Vuln-only, Surrounding context (n = 6), and All − Vuln − PS (indirectly SV-related).]
and F1-Score, respectively. We obtained similar results of Non-vuln (random) when re-
peating the experiment with differently randomized lines. When using all non-vulnerable
statements (Non-vuln (all)), the assessment performance increased significantly, yet was
still lower than that of vulnerable statements. Average MCC and F1-Score of Vuln-only
were 7.4% and 5.5% higher than Non-vuln (all), respectively. The improvements of Vuln-
only over the two variants of non-vulnerable statements were statistically significant across
features/classifiers with p-values < 0.01 (p-valueN on−vuln(random) = 1.7 × 10−36 and p-
valueN on−vuln(all) = 7.2 × 10−11 ) and non-negligible effect sizes (rN on−vuln(random) = 0.62
and rN on−vuln(all) = 0.32). The low performance of Non-vuln (random) implies that SV
assessment models likely perform worse if vulnerable statements are incorrectly identi-
fied. Moreover, the decent performance of Non-vuln (all) shows that some non-vulnerable
statements are potentially helpful for SV assessment, which are studied in detail in RQ2.
Table 4.2: Testing performance (F1-Score and MCC) of SV assessment models using vulnerable vs. non-vulnerable statements as input.

CVSS metric         Evaluation metric   Vuln-only   Non-vuln (random)   Non-vuln (all)
Access Vector       F1-Score            0.820       0.650               0.786
                    MCC                 0.681       0.314               0.605
Access Complexity   F1-Score            0.622       0.458               0.592
                    MCC                 0.510       0.225               0.467
Authentication      F1-Score            0.791       0.602               0.765
                    MCC                 0.630       0.317               0.614
Confidentiality     F1-Score            0.645       0.411               0.625
                    MCC                 0.574       0.301               0.561
Integrity           F1-Score            0.650       0.384               0.616
                    MCC                 0.585       0.236               0.534
Availability        F1-Score            0.647       0.417               0.624
                    MCC                 0.583       0.276               0.551
Severity            F1-Score            0.695       0.414               0.610
                    MCC                 0.583       0.269               0.523
Average             F1-Score            0.695       0.484               0.659
                    MCC                 0.592       0.276               0.551
vulnerable statements, while surrounding context is predefined. The roughly similar size
helps test whether directly SV-related lines in PS context would be better than pre-defined
surrounding lines for SV assessment. The training and evaluation processes on the dataset
in RQ2 were the same as in RQ1.8
Adding context to vulnerable statements led to better SV assessment per-
formance than using vulnerable statements only (see Fig. 4.5). Among the
three, function context was the best, followed by PS and then surrounding con-
text. In terms of MCC, function context working together with vulnerable statements beat
Vuln-only by 6.4%, 6.5%, 9%, 8.2%, 11%, 11.4%, 9.7% for Access Vector, Access Complex-
ity, Authentication, Confidentiality, Integrity, Availability, and Severity tasks, respectively.
The higher F1-Score values when incorporating function context to vulnerable statements
are also evident in Fig. 4.5. On average, combining function context and vulnerable state-
ments attained 0.64 MCC and 0.75 F1-Score, surpassing using vulnerable lines solely by
8.9% in MCC and 8.5% in F1-Score. Although PS context + Vuln performed slightly worse
than function context + Vuln, MCC and F1-Score of PS context + Vuln were still 6.7% and
7.5% ahead of Vuln-only, respectively. The improvements of function and PS context +
Vuln over Vuln-only were significant across features/classifiers, i.e., p-values of 1.2 × 10−17
and 2.1 × 10−13 and medium effect sizes of 0.42 and 0.36, respectively. Compared to func-
tion/PS context + Vuln, surrounding context + Vuln outperformed Vuln-only by smaller
margins, i.e., 3% for MCC and 5.2% for F1-Score (p-value = 3.7 × 10−8 < 0.01 with a
small effect size (r = 0.27)). These findings show the usefulness of directly SV-related
(PS) lines for SV assessment models, while six lines surrounding vulnerable statements
seemingly contain less related information for the SV assessment tasks.
8 The RQ1 findings still hold when using the new dataset in RQ2, yet with a slight (≈2%) decrease in absolute model performance.
Figure 4.5: Differences in testing SV assessment performance (F1-Score and MCC) between models using different types of lines/context and those using only vulnerable statements. Note: The differences were multiplied by 100 to improve readability. The recoverable F1-Score differences (×100) with respect to Vuln-only are:
| Type of lines in a vulnerable function | Access Vector | Access Complexity | Authentication | Confidentiality | Integrity | Availability | Severity | Average |
| Program Slicing (PS) context + Vuln | 3.3 | 7.8 | 6.0 | 0.6 | 6.4 | 8.3 | 4.1 | 5.2 |
| Surrounding context + Vuln | 0.6 | 5.0 | 3.1 | 5.0 | 1.7 | 7.6 | 2.2 | 3.6 |
| Function context + Vuln | 4.6 | 6.4 | 5.2 | 4.4 | 5.6 | 10.5 | 4.7 | 5.9 |
| PS context | -3.2 | -8.5 | -1.8 | -0.9 | -3.8 | -3.1 | -5.7 | -3.8 |
| Surrounding context | -2.4 | -8.8 | -2.7 | -7.2 | -7.4 | -8.2 | -2.2 | -5.5 |
| All - PS - Vuln | -3.6 | -12.4 | -5.6 | -12.1 | -10.6 | -6.2 | -17.0 | -9.6 |
Further investigation revealed that only 49% of lines in PS context overlapped with those
in surrounding context (n = 6). Note that the performance of surrounding context tended
to approach that of function context as the surrounding context size increased. Using
the dataset in RQ1, we also obtained the same patterns, i.e., function context + Vuln
> surrounding context + Vuln > Vuln-only. This result shows that function context is
generally better than the other context types, indicating the plausibility of building effective
SV assessment models using only the output of function-level SV detection (i.e., requiring
no knowledge about which statements are vulnerable in each function).
Although the three context types were useful for SV assessment when com-
bined with vulnerable statements, using these context types alone significantly
reduced the performance. As shown from RQ1, using only function context (i.e., Non-
vuln (all) in Table 4.2) was 6.9% inferior in MCC and 5.2% lower in F1-Score than Vuln-
only. Using the new dataset in RQ2, we obtained similar reductions in MCC and F1-Score
values. Fig. 4.5 also indicates that using only PS and surrounding context decreased MCC
and F1-Score of all the tasks. Particularly, using PS context alone reduced MCC and
F1-Score by 7.8% and 5.5%, respectively; whereas, such reductions in values for using
only surrounding context were 9.8% and 8%. These performance drops were confirmed
significant with p-values < 0.01 and non-negligible effect sizes. Overall, the performance
rankings of the context types with and without vulnerable statements were the same, i.e.,
function > PS > surrounding context. We also observed that all context types were better
(increasing 20.1-28.2% in MCC and 6.9-17.5% in F1-Score) than non-directly SV-related
lines (i.e., All - PS - Vuln in Fig. 4.5). These findings highlight the need for using context
together with vulnerable statements rather than using each of them alone for function-level
SV assessment tasks.
In summary, RQ2 shows that SV assessment models benefit from vulnerable statements along with (in-)directly SV-related lines in functions, yet not necessarily from knowing where these lines are located.
4.6 Discussion
4.6.1 Function-Level SV Assessment: Baseline Models and Beyond
From RQ1-RQ3, we have shown that vulnerable statements and their context are useful for
SV assessment tasks. In this section, we discuss the performance of various features and
classifiers used to develop SV assessment models on the function level. We also explore
the patterns of false positives of the models used in this work. Through these discussions,
we aim to provide recommendations on building strong baseline models and inspire future
data-driven advances in function-level SV assessment.
Practices of building baselines. Among the investigated features and classifiers,
a combination of LGBM classifier and Bag-of-Subtokens features produced the
best overall performance for the seven SV assessment tasks (see Fig. 4.6). In
addition, LGBM outperformed the other classifiers, and Bag-of-Subtokens was better than
the other features. However, we did not find a single set of hyperparameters that was
consistently better than the others, emphasizing the need for hyperparameter tuning for
function-level SV assessment tasks, as generally recommended in the literature [312, 313].
Regarding the classifiers, the ensemble ones (LGBM, RF, and XGB) were significantly
better than their single-model counterparts (SVM, LR, and KNN) when averaging across all feature
types, aligning with the previous findings for SV assessment using SV reports [23, 22].
Figure 4.6: Average performance (MCC) of six classifiers and five features
for SV assessment in functions. Notes: BoT and BoST are Bag-of-Tokens
and Bag-of-Subtokens, respectively.
third-party libraries [315]. The second type of false positives involved vulnerable
variables with obscure context. For instance, a function used a potentially vulnerable
variable containing malicious inputs from users, but the affected function alone did not
contain sufficient information/context about the origin of the variable.10 Without such
variable context, a model would struggle to assess the exploitability of an SV; i.e., through
which components attackers can penetrate into a system and whether any authentica-
tion is required during the penetration. Future work can explore taint analysis [316] to
supplement function-level SV assessment models with features about variable origin/flow.
Chapter 5
Automated Commit-Level Software Vulnerability Assessment
Related publication: This chapter is based on our paper titled “DeepCVA: Auto-
mated Commit-level Vulnerability Assessment with Deep Multi-task Learning” pub-
lished in the 36th IEEE/ACM International Conference on Automated Software
Engineering (ASE), 2021 (CORE A*) [289].
5.1 Introduction
As reviewed in Chapter 2, existing techniques (e.g., [328, 21, 22, 23, 11]) to automate bug/Software Vulnerability (SV) assessment have mainly operated on bug/SV reports, but these
reports may be only available long after SVs appeared in practice. Our motivating analysis
revealed that there were 1,165 days, on average, from when an SV was injected in a code-
base until its report was published on National Vulnerability Database (NVD) [16]. Our
analysis agreed with the findings of Meneely et al. [25]. To tackle late-detected bugs/SVs,
recently, Just-in-Time (commit-level) approaches (e.g., [287, 26, 329, 330]) have been pro-
posed to rely on the changes in code commits to detect bugs/SVs right after bugs/SVs are
added to a codebase. Such early commit-level SV detection can also help reduce the delay
in SV assessment.
Even when SVs are detected early in commits, we argue that existing automated tech-
niques relying on bug/SV reports still struggle to perform just-in-time SV assessment.
Firstly, there are significant delays in the availability of SV reports, which render the ex-
isting SV assessment techniques unusable. Specifically, SV reports on NVD generally only
appear seven days after the SVs were found/disclosed [331]. Some of the detected SVs may
not even be reported on NVD [332], e.g., because of no disclosure policy. User-submitted
bug/SV reports are also only available post-release and more than 82% of the reports are
filed more than 30 days after developers detected the bugs/SVs [333]. Secondly, code re-
view can provide faster SV assessment, but there are still unavoidable delays (from several
hours to even days) [334]. Delays usually come from code reviewers’ late responses and
manual analyses depending on the reviewers’ workload and code change complexity [335].
Thirdly, it is non-trivial to automatically generate bug/SV reports from vulnerable com-
mits as it would require non-code artifacts (e.g., stack traces or program crashes) that are
mostly unavailable when commits are submitted [328, 336].
Performing commit-level SV assessment provides a possibility to inform committers
about the exploitability, impact and severity of SVs in code changes and prioritize fixing
earlier than current report-level SV assessment approaches. However, to the best of our
knowledge, there is no existing work on automating SV assessment in commits. Prior SV
assessment techniques that analyze text in SV databases (e.g., [21, 22, 23]) also cannot be
directly adapted to the commit level. Contrary to text, commits contain deletions and
additions of code with specific structure and semantics [287, 337]. Additionally, we spec-
ulate that the expert-based Common Vulnerability Scoring System (CVSS) metrics [29],
which are commonly used to quantify the exploitability, impact and severity level of SVs
for SV assessment, can be related to one another. For example, an SQL injection is likely to be highly
severe since attackers can exploit it easily via crafted input and compromise data con-
fidentiality and integrity. We posit that these metrics would have common patterns in
commits that can be potentially shared between SV assessment models. Predicting re-
lated tasks in a shared model has been successfully utilized for various applications [27].
For instance, an autonomous car is driven with simultaneous detection of vehicles, lanes,
signs and pavement [338]. These observations motivated us to tackle a new and important
research challenge, “How can we leverage the common attributes of assessment
tasks to perform effective and efficient commit-level SV assessment?”
We present DeepCVA, a novel Deep multi-task learning model, to automate Commit-
level Vulnerability Assessment. DeepCVA first uses attention-based convolutional gated
recurrent units to extract features of code and surrounding context from vulnerability-
contributing commits (i.e., commits with vulnerable changes). The model uses these fea-
tures to predict seven CVSS assessment metrics (i.e., Confidentiality, Integrity, Availability,
Access Vector, Access Complexity, Authentication, and Severity) simultaneously using the
multi-task learning paradigm. The predicted CVSS metrics can guide SV management
and remediation processes.
Our key contributions are summarized as follows:
1. We are the first to tackle the commit-level SV assessment tasks that enable early
security risk estimation and planning for SV remediation.
5. We release our source code, models and datasets for future research at https://
github.com/lhmtriet/DeepCVA.
Chapter organization. Section 5.2 introduces preliminaries and motivation. Section 5.3
proposes the DeepCVA model for commit-level SV assessment. Section 5.4 describes our
experimental design and setup. Section 5.5 presents the experimental results. Section 5.6
discusses our findings and threats to validity. Section 5.7 covers the related work. Sec-
tion 5.8 concludes the work and proposes future directions.
Figure 5.1: Exemplary SV fixing commit (right) for the XML external
entity injection (XXE) (CVE-2016-3674) and its respective SV contributing
commit (left) in the xstream project.
Vector, Access Complexity, Authentication, and Severity) to assess SVs in this study be-
cause of their popularity in practice. Based on CVSS version 2, the VCC (CVE-2016-3674)
in Fig. 5.1 has a considerable impact on Confidentiality. This SV can be exploited with low
(Access) complexity with no authentication via a public network (Access Vector), making
it an attractive target for attackers.
Despite the criticality of these SVs, there have been delays in reporting, assessing and
fixing them. Concretely, the VCC in Fig. 5.1 required 1,439 and 1,469 days to be reported1
and fixed (in VFC), respectively. Existing SV assessment methods based on bug/SV reports
(e.g., [21, 22, 23]) would need to wait more than 1,000 days for the report of this SV.
However, performing SV assessment right after this commit was submitted can bypass
the waiting time for SV reports, enabling developers to realize the exploitability/impacts
of this SV and plan to fix it much sooner. To the best of our knowledge, there has not
been any study on automated commit-level SV assessment, i.e., assigning seven CVSS base
metrics to a VCC. Our work identifies and aims to bridge this important research gap.
[Figure 5.2 depicts the DeepCVA workflow on the VCC of Fig. 5.1: (1) Commit Preprocessing, Context Extraction & Code-aware Tokenization of the pre-/post-change hunks and their contexts; (2) a Shared Commit Feature Extractor with Attention-based Convolutional Gated Recurrent Units (shared input embedding, filters, feature maps, attention-based GRUs); (3) Multi-task Learning with Task-specific Blocks + Softmax Layers producing the seven CVSS outputs.]
Figure 5.2: Workflow of DeepCVA for automated commit-level SV assessment. Note: The VCC is the one described in Fig. 5.1.
Figure 5.3: Code changes outside of a method from the commit 4b9fb37
in the Apache qpid-broker-j project.
of an AST to the list of potential scopes (potential_scopes) of the current hunk. The
first (root) AST is always valid since it encompasses the whole file. Line 6 then checks
whether each node (sub-tree) of the current AST has one of the following types: class,
interface, enum, method, if/else, switch, for/while/do, try/catch, and is surrounding
the current hunk. If the conditions are satisfied, the extract_scope function would be
called recursively in line 7 until a leaf of the AST is reached. The main code starts to
extract the modified files of the current commit in line 9. For each file, we extract code
hunks (code deletions/additions) in line 12 and then obtain the AST of the current file
using an AST parser in line 13. Line 16 calls the defined extract_scope function to
generate the potential scopes for each hunk. Among the identified scopes, line 17 adds
the one with the smallest size (i.e., the number of code lines excluding empty lines and
comments) to the list of CESs (all_ces). Finally, line 18 of Algorithm 2 returns all the
CESs for the current commit.
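To make the above procedure concrete, the following Python sketch mirrors Algorithm 2 under assumed interfaces: a generic AST node exposing type, children, start_line, end_line and size (the number of non-empty, non-comment lines), plus hypothetical parse_ast and get_hunks helpers. It is only an illustration, not the exact implementation used in this thesis.

```python
# Sketch of the CES extraction (Algorithm 2) under an assumed, generic AST interface.

SCOPE_TYPES = {"class", "interface", "enum", "method", "if", "else", "switch",
               "for", "while", "do", "try", "catch"}

def extract_scope(ast, hunk, potential_scopes):
    """Recursively collect AST (sub-)trees that enclose the given hunk."""
    potential_scopes.append(ast)  # the root AST (whole file) is always a valid scope
    for node in ast.children:
        if node.type in SCOPE_TYPES and node.start_line <= hunk.start <= hunk.end <= node.end_line:
            extract_scope(node, hunk, potential_scopes)  # recurse until a leaf is reached

def extract_ces(commit, parse_ast, get_hunks):
    """Return the smallest enclosing scope (CES) of each hunk in a commit."""
    all_ces = []
    for changed_file in commit.modified_files:
        hunks = get_hunks(changed_file)   # code deletions/additions of the file
        ast = parse_ast(changed_file)     # AST of the current file
        for hunk in hunks:
            potential_scopes = []
            extract_scope(ast, hunk, potential_scopes)
            # keep the scope with the smallest size (non-empty, non-comment lines)
            all_ces.append(min(potential_scopes, key=lambda scope: scope.size))
    return all_ces
```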
We treat deleted (pre-change), added (post-change) code changes and their CESs as
four separate inputs to be vectorized by the shared input embedding, as illustrated in
Fig. 5.2. For each input, we concatenate all the hunks/CESs in all the affected files of a
commit to explicitly capture their interactions.
Code-aware tokenization. The four inputs extracted from a commit are then tokenized
with a code-aware tokenizer to preserve code semantics and help prediction models be more
generalizable. For example, a++ and b++ are tokenized as a, b and ++, explicitly giving a model the information about the increment operator (++). Tokenized code is fed into
a shared Deep Learning model, namely Attention-based Convolutional Gated Recurrent
Unit (AC-GRU), to extract commit features.
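As an illustration of such code-aware tokenization, the regex-based sketch below separates identifiers and literals from operators so that, e.g., a++ yields a and ++. The token pattern is an assumption for demonstration, not the exact tokenizer used by DeepCVA.

```python
import re

# Identifiers/keywords, numeric literals, then multi- and single-character operators.
TOKEN_PATTERN = re.compile(
    r"[A-Za-z_][A-Za-z_0-9]*"
    r"|\d+(?:\.\d+)?"
    r"|\+\+|--|==|!=|<=|>=|&&|\|\||[+\-*/%<>=!&|^~?:;,.(){}\[\]]"
)

def tokenize(code: str):
    """Split code into code-aware tokens, keeping operators as separate tokens."""
    return TOKEN_PATTERN.findall(code)

print(tokenize("a++; b++;"))  # ['a', '++', ';', 'b', '++', ';']
```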
common code patterns, e.g., public class Integer. The filters are randomly initialized
and jointly learned with the other components of DeepCVA. We did not include 2-grams
and 4-grams to reduce the required computational resources without compromising the
model performance, which has been empirically demonstrated in section 5.5.2. To generate
code features of different window sizes with the three-way CNN, we multiply each filter
with the corresponding input rows and apply non-linear ReLU activation function [341],
i.e., ReLU(x) = max(0, x). We repeat the same convolutional process from the start to
the end of an input vector by moving the filters down sequentially with a stride of one.
This stride value is the smallest and helps capture the most fine-grained information from
input code as compared to larger values. Each filter size returns feature maps of the size
(N − K + 1) × F , where K is the filter size (one, three or five) and F is the number of
filters. Multiple filters are used to capture different semantics of commit data.
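For illustration, the following Keras sketch wires up three parallel Conv1D branches with window sizes 1, 3 and 5 over a shared token embedding, as described above. The sequence length, vocabulary size, embedding size and number of filters are placeholder values, not the exact DeepCVA settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

seq_len, vocab_size, embed_dim, n_filters = 512, 10000, 128, 64  # illustrative sizes

tokens = layers.Input(shape=(seq_len,), dtype="int32")
embedded = layers.Embedding(vocab_size, embed_dim)(tokens)  # shared input embedding

# Three-way CNN: 1/3/5-gram filters with ReLU and a stride of one,
# each branch producing (seq_len - k + 1) x n_filters feature maps.
feature_maps = [
    layers.Conv1D(filters=n_filters, kernel_size=k, strides=1, activation="relu")(embedded)
    for k in (1, 3, 5)
]

three_way_cnn = tf.keras.Model(tokens, feature_maps)
three_way_cnn.summary()
```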
Attention-based Gated Recurrent Unit. The feature maps generated by the three-way
CNN sequentially enter a Gated Recurrent Unit (GRU) [31]. GRU, defined in Eq. (5.1), is
an efficient version of Recurrent Neural Networks and used to explicitly capture the order
and dependencies between code blocks. For example, the return statement comes after
the function declarations of the VCC in Fig. 5.2.
z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
ĥ_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)    (5.1)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ ĥ_t
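To make Eq. (5.1) concrete, below is a minimal NumPy sketch of a single GRU step; the dictionary-based parameter layout and shapes are illustrative assumptions rather than DeepCVA’s actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step following Eq. (5.1); W, U, b hold the z/r/h parameters."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])            # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])            # reset gate
    h_hat = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])  # candidate state
    return (1 - z_t) * h_prev + z_t * h_hat                           # new hidden state
```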
where x_commit is the commit feature vector from AC-GRU; W_t is learnable weights and b_t is learnable bias.
Each task-specific vector goes through the respective softmax layer to determine the
output of each task with the highest predicted probability. The prediction output (pred_i) of task i is obtained as in Eq. (5.4):
pred_i = argmax(prob_i)
prob_i = softmax(W_p task_i + b_p)    (5.4)
softmax(z_j) = exp(z_j) / Σ_{c=1}^{nlabels_i} exp(z_c)
where prob_i contains the predicted probabilities of the nlabels_i possible outputs of task i; W_p is learnable weights and b_p is learnable bias.
Training DeepCVA. To compare DeepCVA’s outputs with ground-truth CVSS labels,
we define a multi-task loss that averages the cross-entropy losses of seven tasks in Eq. (5.5).
loss_DeepCVA = Σ_{i=1}^{7} loss_i    (5.5)
loss_i = − Σ_{c=1}^{nlabels_i} y_i^c log(prob_i^c),  where y_i^c = 1 if c is the true class and 0 otherwise
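For illustration, a minimal Keras sketch of the multi-task prediction layers and loss in Eqs. (5.4)-(5.5) is shown below. The feature dimension, the ReLU task-specific Dense block, and the use of sparse categorical cross-entropy are assumptions for demonstration, not DeepCVA’s exact configuration; the class counts follow Fig. 5.4.

```python
import tensorflow as tf
from tensorflow.keras import layers

n_classes = {"confidentiality": 3, "integrity": 3, "availability": 3,
             "access_vector": 2, "access_complexity": 3,
             "authentication": 2, "severity": 3}

commit_features = layers.Input(shape=(256,), name="commit_features")  # from the shared AC-GRU
outputs = {}
for task, k in n_classes.items():
    task_block = layers.Dense(128, activation="relu")(commit_features)      # task-specific block
    outputs[task] = layers.Dense(k, activation="softmax", name=task)(task_block)  # per-task softmax

model = tf.keras.Model(commit_features, outputs)
# Summing the per-task cross-entropy losses mirrors the multi-task loss of Eq. (5.5).
model.compile(optimizer="adam",
              loss={task: "sparse_categorical_crossentropy" for task in n_classes},
              loss_weights={task: 1.0 for task in n_classes})
```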
5.4.1 Datasets
To develop commit-level SV assessment models, we built a dataset of Vulnerability-Contributing
Commits (VCCs) and their CVSS metrics. We used Vulnerability-Fixing Commits (VFCs)
to retrieve VCCs, as discussed in section 5.2.1.
VFC identification. We first obtained VFCs from three public sources: NVD [16],
GitHub and its Advisory Database2 as well as a manually curated/verified VFC dataset
(VulasDB) [285]. In total, we gathered 13,310 VFCs that had dates ranging from July
2000 to October 2020. We selected VFCs in Java projects as Java has been commonly
investigated in the literature (e.g., [287, 288, 286]) and is also among the top five most popular languages in practice.3 Following the practice of [286], we discarded VFCs that had more
than 100 files and 10,000 lines of code to reduce noise in the data.
VCC identification with the SZZ algorithm. After the filtering steps, we had 1,602
remaining unique VFCs to identify VCCs using the SZZ algorithm [344]. This algorithm
selects commits that last modified the source code lines deleted or modified to address
an SV in a VFC as the respective VCCs of the same SV (see Fig. 5.1). As in [344],
we first discarded commits with timestamps after the published dates of the respective
SVs on NVD since SVs can only be reported after they were injected in a codebase. We
2 https://github.com/advisories
3 https://insights.stackoverflow.com/survey/2020#technology-most-loved-dreaded-and-wanted-languages-loved
[Figure 5.4: Class distributions (%) of the seven SV assessment (CVSS) tasks in the curated VCC dataset.]
| CVSS metric | Class distribution (%) |
| Confidentiality | None 28.5, Partial 66.8, Complete 4.7 |
| Integrity | None 32.7, Partial 62.9, Complete 4.4 |
| Availability | None 51.8, Partial 43.2, Complete 5.0 |
| Access Vector | Local 3.1, Network 96.9 |
| Access Complexity | Low 65.0, Medium 33.9, High 1.1 |
| Authentication | None 79.2, Single 20.8 |
| Severity | Low 5.4, Medium 70.9, High 23.7 |
then removed cosmetic changes (e.g., newlines and white spaces) and single-line/multi-line
comments in VFCs since these elements do not change code functionality [286]. Like [286],
we also considered copied or renamed files while tracing VCCs. We obtained 1,229 unique
VCCs4 of 542 SVs in 246 real-world Java projects and their corresponding expert-verified
CVSS metrics on NVD. Distributions of curated CVSS metrics are illustrated in Fig. 5.4.
The details of the number of commits and projects retained in each filtering step are also
given in Table 5.1. Note that some commits and projects were removed during the tracing
of VCCs from VFCs due to the issues coined as ghost commits studied by Rezk et al. [290].
We did not remove large VCCs (with more than 100 files and 10k lines) as we found
several VCCs were large initial/first commits. Our observations agreed with the findings
of Meneely et al. [25].
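The core SZZ tracing step can be sketched with git blame as follows. This simplified version omits the additional filters described above (SV report dates, cosmetic changes, comments, copied/renamed files), and the function and argument names are illustrative.

```python
import subprocess

def szz_candidate_vccs(repo_path, vfc_hash, changed_file, deleted_lines):
    """Blame the lines deleted/modified by a VFC at the VFC's parent commit to
    find the commits that last touched them (the candidate VCCs)."""
    candidates = set()
    for line_no in deleted_lines:  # line numbers in the parent (pre-fix) version
        out = subprocess.check_output(
            ["git", "-C", repo_path, "blame", "--porcelain",
             "-L", f"{line_no},{line_no}", f"{vfc_hash}^", "--", changed_file],
            text=True,
        )
        candidates.add(out.split()[0])  # first token of porcelain output is the blamed commit hash
    return candidates
```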
Manual VCC validation. To validate our curated VCCs, we randomly selected 293
samples, i.e., 95% confidence level and 5% error [291], for two researchers (i.e., the au-
thor of this thesis and a PhD student with three-year experience in Software Engineering
and Cybersecurity) to independently examine. The manual VCC validation was consid-
erably labor-intensive, which took approximately 120 man-hours. The Cohen’s kappa (κ)
inter-rater reliability score [345] was 0.83, i.e., “almost perfect” agreement [346]. We also
involved another PhD student having two years of experience in Software Engineering and
Cybersecurity in the discussion to resolve disagreements. Our validation found that 85%
of the VCCs were valid. In fact, the SZZ algorithm is imperfect [347], but we assert that it
is nearly impossible to obtain near 100% accuracy without exhaustive manual validation.
Specifically, the main source of incorrectly identified VCCs in our dataset was that some
4 The SV reports of all curated VCCs were not available at commit time.
Table 5.1: The number of commits and projects after each filtering step.
[Figure 5.5: Time-based splits of the data into training, validation and testing folds across Rounds 1-10.]
files in VFCs were used to update version/documentation or address another issue instead
of fixing an SV. One such false positive VCC was the commit 87c89f0 in the jspwiki project
that last modified the build version in the corresponding VFC.
Data splitting. We adopted time-based splits [348] for training, validating and testing
the models to closely represent real-world scenarios where incoming/future unseen data
is not present during training [286, 349]. We trained, validated and tested the models in
10 rounds using 12 equal folds split based on commit dates (see Fig. 5.5). Specifically,
in round i, folds 1 → i, i + 1 and i + 2 were used for training, validation and testing,
respectively. We chose an optimal model with the highest average validation performance
and then reported its respective average testing performance over 10 rounds, which helped
avoid unstable results of a single testing set [212].
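A minimal sketch of these time-based splits, assuming each commit carries a date attribute, is given below.

```python
def time_based_rounds(commits, n_folds=12):
    """Sort commits by date, cut them into 12 equal folds, and in round i use
    folds 1..i for training, fold i+1 for validation and fold i+2 for testing,
    yielding 10 rounds in total."""
    commits = sorted(commits, key=lambda c: c.date)
    fold_size = len(commits) // n_folds
    folds = [commits[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]
    for i in range(1, n_folds - 1):                      # rounds 1..10
        train = [c for fold in folds[:i] for c in fold]  # folds 1..i
        val, test = folds[i], folds[i + 1]               # folds i+1 and i+2
        yield train, val, test
```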
X-CVA worked by concatenating all seven CVSS metrics into a single label. To extract the
results of the individual tasks for X-CVA, we checked whether the ground-truth label of
each task was in the concatenated model output. For S-CVA and X-CVA, we applied six
popular classifiers: Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest
Neighbors (KNN), Random Forest (RF), XGBoost (XGB) [78] and Light Gradient Boosting
Machine (LGBM) [79]. These classifiers have been used for SV assessment based on SV
reports [22, 23]. The hyperparameters for tuning these classifiers were regularization: {l1,
l2}; regularization coefficient: {0.01, 0.1, 1, 10, 100} for LR and {0.01, 0.1, 1, 10, 100,
1,000, 10,000} for SVM; no. of neighbors: {11, 31, 51}, distance norm: {1, 2} and distance
weight: {uniform, distance} for KNN; no. of estimators: {100, 300, 500}, max. depth: {3,
5, 7, 9, unlimited}, max. no. of leaf nodes: {100, 200, 300, unlimited} for RF, XGB and
LGBM. These hyperparameters have been adapted from relevant studies [22, 23, 304].
Unlike S-CVA and X-CVA, U-CVA did not require CVSS labels to operate; therefore,
U-CVA required less human effort than S-CVA and X-CVA. We tuned U-CVA for each
task with the following no. of clusters (k): {2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35,
40, 45, 50}. To assess a new commit with U-CVA, we found the cluster with the smallest
Euclidean distance to that commit and assigned it the most frequent class of each task in
the selected cluster.
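For illustration, the following scikit-learn sketch shows the U-CVA procedure of clustering commit feature vectors and assigning a new commit the most frequent class of its nearest cluster; the feature matrices and variable names are assumptions.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def fit_ucva(X_train, train_labels, k):
    """Cluster commit features (no CVSS labels needed for clustering itself) and
    record the majority class of each cluster for one assessment task."""
    kmeans = KMeans(n_clusters=k, random_state=42).fit(X_train)
    cluster_ids = kmeans.predict(X_train)
    labels = np.asarray(train_labels)
    majority = {c: Counter(labels[cluster_ids == c]).most_common(1)[0][0] for c in range(k)}
    return kmeans, majority

def assess_ucva(kmeans, majority, X_new):
    # Assign each new commit the majority class of its closest (Euclidean) centroid.
    return [majority[c] for c in kmeans.predict(X_new)]
```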
5 https://keras.io/getting_started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development
6 MCC values of random and most-frequent-class baselines were all < 0.01.
7 Precision (0.533) and Recall (0.445) of DeepCVA were higher than those of all baselines.
Table 5.2: Testing performance of DeepCVA and baseline models. Notes: Optimal classifiers of S-CVA/X-CVA and optimal cluster no. (k) of U-CVA are in parentheses. BoW, W2V and SM are Bag-of-Words, Word2vec and Software Metrics, respectively. The best performance of DeepCVA is from the run with the highest MCC in each round.
F1-Score:
| CVSS metric | S-CVA BoW | S-CVA W2V | S-CVA SM | X-CVA BoW | X-CVA W2V | X-CVA SM | U-CVA BoW | U-CVA W2V | U-CVA SM | DeepCVA (Best) |
| Confidentiality | 0.416 | 0.406 | 0.423 | 0.420 | 0.434 | 0.429 | 0.292 | 0.332 | 0.313 | 0.436 (0.475) |
| Integrity | 0.373 | 0.369 | 0.352 | 0.391 | 0.415 | 0.407 | 0.284 | 0.305 | 0.330 | 0.430 (0.458) |
| Availability | 0.381 | 0.389 | 0.384 | 0.424 | 0.422 | 0.406 | 0.254 | 0.332 | 0.238 | 0.432 (0.475) |
| Access Vector | 0.511 | 0.487 | 0.440 | 0.499 | 0.532 | 0.487 | 0.477 | 0.477 | 0.477 | 0.554 (0.578) |
| Access Complexity | 0.437 | 0.448 | 0.417 | 0.412 | 0.445 | 0.361 | 0.315 | 0.365 | 0.385 | 0.464 (0.475) |
| Authentication | 0.601 | 0.584 | 0.593 | 0.541 | 0.618 | 0.586 | 0.458 | 0.526 | 0.492 | 0.657 (0.677) |
| Severity | 0.407 | 0.357 | 0.345 | 0.382 | 0.381 | 0.358 | 0.283 | 0.288 | 0.287 | 0.424 (0.460) |
| Average | 0.447 | 0.434 | 0.422 | 0.438 | 0.464 | 0.433 | 0.338 | 0.375 | 0.360 | 0.485 (0.514) |
MCC (optimal classifier/cluster no. in parentheses):
| CVSS metric | S-CVA BoW | S-CVA W2V | S-CVA SM | X-CVA BoW | X-CVA W2V | X-CVA SM | U-CVA BoW | U-CVA W2V | U-CVA SM | DeepCVA (Best) |
| Confidentiality | 0.174 (LR) | 0.239 (LGBM) | 0.232 (XGB) | 0.188 (LR) | 0.241 (LR) | 0.203 (XGB) | 0.003 (50) | 0.092 (45) | 0.017 (50) | 0.268 (0.299) |
| Integrity | 0.127 (LGBM) | 0.176 (LGBM) | 0.146 (RF) | 0.114 (LGBM) | 0.160 (LR) | 0.128 (LGBM) | -0.005 (25) | 0.091 (30) | 0.084 (25) | 0.250 (0.295) |
| Availability | 0.182 (RF) | 0.173 (LGBM) | 0.126 (XGB) | 0.187 (LR) | 0.192 (LR) | 0.123 (XGB) | 0.064 (10) | 0.092 (45) | 0.016 (3) | 0.273 (0.303) |
| Access Vector | 0.07 (XGB) | 0.051 (LR) | 0.018 (LR) | 0.044 (LGBM) | 0.107 (LR) | 0.012 (LGBM) | 0.000 (9) | 0.000 (40) | 0.000 (6) | 0.129 (0.178) |
| Access Complexity | 0.119 (LR) | 0.143 (XGB) | 0.111 (LGBM) | 0.131 (LR) | 0.121 (XGB) | 0.088 (SVM) | 0.000 (4) | 0.022 (30) | 0.119 (15) | 0.242 (0.261) |
| Authentication | 0.258 (SVM) | 0.264 (XGB) | 0.268 (LGBM) | 0.212 (RF) | 0.282 (SVM) | 0.208 (XGB) | 0.062 (50) | 0.162 (30) | 0.089 (50) | 0.352 (0.388) |
| Severity | 0.144 (LR) | 0.153 (XGB) | 0.057 (XGB) | 0.130 (LR) | 0.149 (LGBM) | 0.058 (XGB) | -0.018 (4) | 0.010 (15) | 0.026 (4) | 0.213 (0.277) |
| Average | 0.153 | 0.171 | 0.137 | 0.144 | 0.179 | 0.117 | 0.015 | 0.067 | 0.050 | 0.247 (0.286) |
The average and task-wise F1-Score values of DeepCVA also beat those of the best base-
line (X-CVA with Word2vec features) by substantial margins. We found that DeepCVA
significantly outperformed the best baseline models in terms of both MCC and F1-score
averaging across all seven tasks, confirmed with p-values < 0.01 using the non-parametric
Wilcoxon signed-rank tests [254]. These results show the effectiveness of the novel design
of DeepCVA.
An example to qualitatively demonstrate the effectiveness of DeepCVA is the VCC
ff655ba in the Apache xerces2-j project, in which a hashing algorithm was added. This
algorithm was later found vulnerable to hashing collision that could be exploited with
timing attacks in the fixing commit 992b5d9. This SV was caused by the order of items
being added to the hash table in the put(String key, int value) function. Such an
order could not be easily captured by baseline models whose features did not consider the
sequential nature of code (i.e., BoW, Word2vec and software metrics) [203]. More details
about the contributions of different components to the overall performance of DeepCVA
are covered in section 5.5.2.
Regarding the baselines, the average MCC value (0.147) of X-CVA was on par with that
(0.154) of S-CVA. This result reinforces the benefits of leveraging the common attributes
among seven CVSS metrics to develop effective commit-level SV assessment models. How-
ever, X-CVA was still not as strong as DeepCVA mainly because of its much lower training
data utilization per output. For X-CVA, there was an average of 39 output combinations
of CVSS metrics in the training folds, i.e., 31 commits per output. In contrast, DeepCVA
had 13.2 times more data per output as there were at most three classes for each task (see
Fig. 5.4). Finally, we found supervised learning (S-CVA, X-CVA and DeepCVA) to be at
least 74.6% more effective than the unsupervised approach (U-CVA). This result shows the
usefulness of using CVSS metrics to guide the extraction of commit features.
[Figure 5.6 shows, for each of the seven tasks, the per-variant testing MCC differences relative to DeepCVA as bars; individual bar values are omitted here. Model variants (average MCC in parentheses): No context (0.215), 1-grams only (0.218), No three-way CNN (0.189), No task-specific blocks (0.227), AST inputs (0.227), No attention-based GRU (0.196), No attention mechanism (0.102), No multi-task learning (0.198).]
Figure 5.6: Differences of testing MCC (multiplied by 100 for readability) of the model variants compared to the proposed DeepCVA in section 5.3. Note: The average MCC values (without multiplying by 100) of the model variants are in parentheses.
Results. As depicted in Fig. 5.6, the main components 8 uplifted the average
MCC of DeepCVA by 25.9% for seven tasks. Note that 7/8 model variants (except
the model with no attention mechanism) outperformed the best baseline model from RQ1.
These results were confirmed with p-values < 0.01 using Wilcoxon signed-rank tests [254].
Specifically, the components8 of DeepCVA increased the MCC values by 25.3%, 20.8%,
21.5%, 35.8%, 35.5%, 18.9% and 23.6% for Confidentiality, Integrity, Availability, Access
Vector, Access Complexity, Authentication and Severity, respectively.
For the inputs, using the Smallest Enclosing Scope (CES) of code changes resulted in
a 14.8% increase in MCC compared to using hunks only, while using AST inputs had 8.8%
lower performance. This finding suggests that code context is important for assessing SVs
in commits. In contrast, syntactical information is not as necessary since code structure
can be implicitly captured by code tokens and their sequential order using our AC-GRU.
The key components of the AC-GRU feature extractor boosted the performance by
13.2% (3-grams vs. 1-grams), 25.6% (Attention-based GRU), 30.2% (Three-way CNN)
and 142% (Attention). Note that DeepCVA surpassed the state-of-the-art 3-gram [21]
and 1-gram [287] CNN-only architectures for (commit-level) SV/defect prediction. These
results show the importance of combining the (1,3,5)-gram three-way CNN with attention-
based GRUs rather than using them individually. We also found that 1-5 grams did
not significantly increase the performance (p-value = 0.186), confirming our decision in
section 5.3.2 to only use 1,3,5-sized filters.
For the prediction layers, the task-specific blocks and multi-task learning raised the MCC of DeepCVA by 8.8% and 24.4%, respectively. Multi-task DeepCVA took 8,988 s
(2.5 hours) and 25.7 s to train/validate and test in 10 rounds × 10 runs, which were 6.3
and 6.2 times faster compared to those of seven single-task DeepCVA models, respectively.
DeepCVA was only 11.3% and 12.7% slower in training/validating and testing than one
single-task model on average, respectively. These values highlight the efficiency of training
and maintaining the multi-task DeepCVA model. Finally, obtaining Severity using the
CVSS formula [22] from the predicted values of the other six metrics dropped MCC by
17.4% for this task. This result supports predicting Severity directly from commit data.
Table 5.3: Testing MCC of the oversampling-augmented baselines and the multi-task DeepCVA for each CVSS task.
| CVSS Task | S-CVA (ROS) | S-CVA (SMOTE) | X-CVA (ROS) | Single-task DeepCVA (ROS) | Multi-task DeepCVA |
| Confidentiality | 0.220 | 0.203 | 0.185 | 0.250† | 0.268 |
| Integrity | 0.174 | 0.168 | 0.179† | 0.206† | 0.250 |
| Availability | 0.195† | 0.187† | 0.182 | 0.209† | 0.273 |
| Access Vector | 0.115† | 0.110† | 0.092 | 0.156† | 0.129 |
| Access Comp. | 0.172† | 0.186† | 0.144† | 0.190† | 0.242 |
| Authentication | 0.325† | 0.340† | 0.299† | 0.318 | 0.352 |
| Severity | 0.132 | 0.124 | 0.141 | 0.186† | 0.213 |
| Average | 0.190† | 0.188† | 0.175 | 0.216† | 0.247 |
20}. We could not apply SMOTE to single-task DeepCVA as features were trained end-to-end and thus unavailable prior to training for finding nearest neighbors. We also did not apply
SMOTE to X-CVA as there was always a single-sample class in each round, producing no
nearest neighbor.
Results. ROS and SMOTE increased the average performance (MCC) of 3/4
baselines except X-CVA (see Table 5.3). However, the average MCC of our
multi-task DeepCVA was still 14.4% higher than that of the best oversampling-
augmented baseline (single-task DeepCVA with ROS). Overall, MCC increased by
8%, 6.9% and 9.1% for S-CVA (ROS), S-CVA (SMOTE) and single-task DeepCVA (ROS),
respectively. These improvements were confirmed significant with p-values < 0.01 using
Wilcoxon signed-rank tests [254]. We did not report oversampling results of U-CVA as they
were still much worse compared to others. We found single-task DeepCVA benefited the
most from oversampling, probably since Deep Learning usually performs better with more
data [274]. In contrast, oversampling did not improve X-CVA as oversampling did not
generate as many samples for X-CVA per class as for S-CVA (i.e., X-CVA had 13 times, on
average, more classes than S-CVA). These results further strengthen the effectiveness and
efficiency of multi-task learning of DeepCVA for commit-level SV assessment even without
the overheads of rebalancing/oversampling data.
5.6 Discussion
5.6.1 DeepCVA and Beyond
DeepCVA has been shown to be effective for commit-level SV assessment in the three
RQs, but our model still has false positives. We analyze several representative patterns
of such false positives to help further advance this task and solutions for researchers and
practitioners in the future.
Some commits were too complex and large (tangled) to be assessed correctly. For
example, the VCC 015f7ef in the Apache Spark project contained 1,820 additions and 146 deletions across 29 files, whereas the untrusted deserialization SV occurred in just one line (line 56) of LauncherConnection.java. Recent techniques (e.g., [279, 360]) pinpoint more
precise locations (e.g., individual files or lines in commits) of defects, especially in tangled
changes. Such techniques can be adapted to remove irrelevant code in VCCs (i.e., changes
that do not introduce or contain SVs). More relevant code potentially gives more fine-
grained information for the SV assessment tasks. Note that DeepCVA provides a strong
baseline for comparing against fine-grained approaches.
DeepCVA also struggled to accurately predict assessment metrics for SVs related to
external libraries. For instance, the SV in the commit 015f7ef above involves the ObjectInputStream class from the java.io package; such reliance on external classes sometimes prevented DeepCVA from correctly assessing an SV. If an SV happens frequently with a package in the
training set, (e.g., the XML library of the VCC bba4bc2 in Fig. 5.1), DeepCVA still can infer
correct CVSS metrics. Pre-trained code models on large corpora [288, 80, 302] along with
methods to search/generate code [361] and documentation [362] as well as (SV-related)
information from developer Q&A forums [315] can be investigated to provide enriched
context of external libraries, which would in turn support more reliable commit-level SV
assessment with DeepCVA.
We also observed that DeepCVA, alongside the considered baseline models, performed
significantly worse, in terms of MCC, for Access Vector compared to the remaining tasks
(see Table 5.2). We speculate that such low performance is mainly because Access Vec-
tor contains the most significant class imbalance among the tasks, as shown in Fig. 5.4.
For single-task models, we found that using class rebalancing techniques such as ROS or
SMOTE can help improve the performance, as demonstrated in RQ3 (see section 5.5.3).
However, it is still unclear how to apply the current class rebalancing techniques for multi-
task learning models such as DeepCVA. Thus, we suggest that more future work should
investigate specific class rebalancing and/or data augmentation to address such imbalanced
data in the context of multi-task learning.
in fixing commits of third-party libraries to assess SVs in such libraries. Our work is
fundamentally different from these previous studies since we are the first to investigate the
potential of performing assessment of all SV types (not only vulnerable libraries) using
commit changes rather than bug/SV reports/fixes. Our approach allows practitioners to
realize the exploitability/impacts of SVs in their systems much earlier, e.g., up to 1,000
days before (see section 5.2.2), as compared to using bug/SV reports/fixes. Less delay
in SV assessment helps practitioners plan/prioritize SV fixing while the design and implementation are still fresh in their minds. Moreover, we have shown that multi-task learning, i.e.,
predicting all CVSS metrics simultaneously, can significantly increase the effectiveness and
reduce the model development and maintenance efforts in commit-level SV assessment. It
should be noted that report-level prediction is still necessary for assessing SVs in third-
party libraries/software, especially the ones without available code (commits), to prioritize
vendor-provided patch application, as well as SVs missed by commit-level detection.
Chapter 6
Collection and Analysis of Developers’ Software Vulnerability Concerns on Question and Answer Websites
Related publications: This chapter is based on two of our papers: (1) “PUMiner:
Mining Security Posts from Developer Question and Answer Websites with PU
Learning” published in the 17th International Conference on Mining Software Repos-
itories (MSR), 2020 (CORE A) [304], and (2) “A Large-scale Study of Security Vul-
nerability Support on Developer Q&A Websites” published in the 25th International
Conference on Evaluation and Assessment in Software Engineering (EASE), 2021
(CORE A) [315].
6.1 Introduction
It is important to constantly track and resolve Software Vulnerabilities (SVs) to ensure
the availability, confidentiality and integrity of software systems [3]. Developers can seek
assessment information for resolving SVs from sources verified by security experts such as
Common Weakness Enumeration (CWE) [28], National Vulnerability Database (NVD) [16]
and Open Web Application Security Project (OWASP) [167]. However, these expert-based
SV sources do not provide any mechanisms for developers to promptly ask and answer
questions about issues in implementing/understanding the reported SV solutions/concepts.
On the other hand, developer Questions and Answer (Q&A) websites contain a plethora
of such SV-related discussions. Stack Overflow (SO)1 and Security StackExchange (SSE)2
contain some of the largest numbers of SV-related discussions among developer Q&A sites
with contributions from millions of users [304].
The literature has analyzed different aspects of discussions on Q&A sites, but there is
still no investigation of how SO and SSE are supporting SV-related discussions. Specifically,
the main concepts [370], the top languages/technologies and user demographics [371], as
well as user perceptions and interactions [372] of general security discussions on SO have
been studied. However, from our analysis (see section 6.3.2), only about 20% of the
available SV posts on SO were investigated in the previous studies, limiting a thorough
understanding of SV topics (developers’ concerns when tackling SVs in practice) on Q&A
sites. Moreover, the prior studies only focused on SO, and little insight has been given into
the support of SV discussions on different Q&A sites. Such insight would potentially affect
the use of a suitable site (e.g., SO vs. SSE) to obtain necessary SV assessment information
for SV prioritization and fixing.
To fill these gaps, we conduct a large-scale empirical study using 71,329 SV posts
curated from SO and SSE. Specifically, we use Latent Dirichlet Allocation (LDA) [30]
topic modeling and qualitative analysis to answer the following four Research Questions
(RQs) that measure the support of Q&A sites for different SV discussion topics:
RQ1: What are SV discussion topics on Q&A sites?
RQ2: What are the popular and difficult SV topics?
RQ3: What is the level of expertise for supporting SV questions?
RQ4: What types of answers are given to SV questions?
Our findings for these RQs can help raise developers’ awareness of common SVs and enable
them to seek solutions to such SVs more effectively on Q&A sites. We also identify the
areas to which experts can contribute to assist the secure software engineering community.
Moreover, these common developers’ SV concerns and their characteristics can be lever-
aged for making data-driven SV assessment models more practical (closer to developers’
real-world needs) and enabling more effective understanding and fixing prioritization of
commonly encountered SVs. Furthermore, we release one of the largest datasets of SV
discussions that we have carefully curated from Q&A sites for replication and future work
at https://github.com/lhmtriet/SV_Empirical_Study.
Chapter organization. Section 6.2 covers the related work. Section 6.3 describes the
four research questions along with the methods and data used in this chapter. Section 6.4 presents the results of each research question. Section 6.5 discusses the findings includ-
ing how they can be used for SV assessment, and then mentions the threats to validity.
Section 6.6 concludes and suggests several future directions.
1 https://stackoverflow.com/
2 https://security.stackexchange.com/
select SV discussion topics based on the titles, questions and answers of SV posts on both
SO and SSE. LDA is commonly used since it can produce topic distribution (assigning
multiple topics with varying relevance) for a post, providing more flexibility/scalability
than manual coding. We also used the topic share metric [373] in Eq. (6.1) to compute the
proportion (share_i) of each SV topic and their trends over time.
share_i = (1/N) Σ_{p∈D} LDA(p, T_i)    (6.1)
where p, D and N are a single SV post, the list of all SV posts and the number of such posts, respectively; T_i is the i-th topic and LDA is the trained LDA model.
RQ2: What are the popular and difficult SV topics?
Motivation: After the SV topics were identified, RQ2 identified the popular and difficult
topics on Q&A websites. The results of RQ2 can aid the selection of a suitable (i.e., more
popular and less difficult) Q&A site for respective SV topics.
Method : To quantify the topic popularity, we used four metrics from [375, 370, 376, 374],
namely the average values of (i) views, (ii) scores (upvotes minus downvotes), (iii) favorites
and (iv ) comments. Intuitively, a more popular topic would attract more attention (views),
interest (scores/favorites) and activities (comments) per post from users. We also obtained
the geometric mean of the popularity metrics to produce a more consistent result across
different topics. Geometric mean was used instead of arithmetic mean here since the
metrics could have different units/scales. To measure the topic difficulty, we used the
three metrics from [375, 370, 376, 374]: (i) percentage of getting accepted answers, (ii)
median time (hours) to receive an accepted answer since posted, and (iii) average ratio
of answers to views. A more difficult topic would, on average, have a lower number of
accepted answers and ratio of answers to views, but a higher amount of time to obtain
accepted answers. To achieve this, we took reciprocals of the difficulty metrics (i) and (iii)
so that a more difficult topic had a higher geometric mean of the metrics.
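A small sketch of how such popularity and difficulty scores could be computed is shown below; the post field names are illustrative assumptions, and the sketch presumes positive metric values (and at least one accepted answer per topic) so that the geometric mean is well defined.

```python
from scipy.stats import gmean

def topic_popularity(posts):
    """Geometric mean of the average views, scores, favorites and comments."""
    metrics = [
        sum(p["views"] for p in posts) / len(posts),
        sum(p["score"] for p in posts) / len(posts),
        sum(p["favorites"] for p in posts) / len(posts),
        sum(p["comments"] for p in posts) / len(posts),
    ]
    return gmean(metrics)

def topic_difficulty(posts):
    """Combine reciprocals of the accepted-answer rate and the answers-to-views
    ratio with the median hours to acceptance, so higher means more difficult."""
    accepted = [p for p in posts if p.get("accepted_hours") is not None]
    pct_accepted = len(accepted) / len(posts)
    median_hours = sorted(p["accepted_hours"] for p in accepted)[len(accepted) // 2]
    ans_view_ratio = sum(p["answers"] / p["views"] for p in posts) / len(posts)
    return gmean([1 / pct_accepted, median_hours, 1 / ans_view_ratio])
```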
RQ3: What is the level of expertise to answer SV questions?
Motivation: RQ3 checked the expertise level available on Q&A websites to answer SV
questions, especially the ones of difficult topics. The findings of RQ3 can shed light on the
amount of support each topic receives from experienced users/experts on Q&A sites and
which topic may require more attention from experts. Note that experts here are users
who frequently contribute helpful (accepted) answers/knowledge.
Method : We measured both users’ general and specific expertise for SV topics on Q&A
sites. For the general expertise, we leveraged the commonly used metric, the reputation
points [382, 380, 381], of users who got accepted answers since reputation is gained through
one’s active participation and appreciation from the Q&A community in different topics.
A higher reputation received for a topic usually implies that the questions of that topic are
of more interest to experts. Similar to [380], we did not normalize the reputation by user’s
participation time since reputation may not increase linearly, e.g., due to users leaving the
sites. However, reputation is not specific to any topic; thus, it does not reflect whether
a user is experienced with a topic. Hence, we represented developers’ specific expertise
with the SV content in their answers on Q&A sites. This was inspired by Dey et al.’s
findings that developers’ expertise/knowledge could be expressed through their generated
content [383]. We determined a user’s expertise in SV topics using the topic distribution
generated by LDA applied to the concatenation of all answers to SV questions given by that
user. The specific expertise of an SV topic (see Eq. (6.2)) was then the total correlation
between LDA outputs of the current topic in SV questions and the specific expertise of
users who got the respective accepted answers. The correlation of LDA values could reveal
the knowledge (SV topics) commonly used to answer questions of a certain (SV) topic [373].
Specific_Expertise_i = Σ_{p∈D} LDA(Q(p), T_i) ⊙ LDA(K(U_Accept.))    (6.2)
K(U_Accept.) = A^1_{U_Accept.} + A^2_{U_Accept.} + ... + A^k_{U_Accept.},  k = |A_{U_Accept.}|
where D is the list of SV posts and T_i is the i-th topic, while Q(p) and K(U_Accept.) are the question content and the SV knowledge of the user U_Accept. who gave the accepted answer of the post p, respectively; ⊙ is the topic-wise multiplication, and A_{U_Accept.} is the set of all SV-related answers given by user U_Accept.. Note that we only considered posts with accepted answers
to make it consistent with the general expertise.
Specifically, for each question, we first extracted the user that gave the accepted answer
(UAccept. ). We then gathered all answers, not necessarily accepted, of that user in SV posts
( AUAccept. ). Such answer list was the SV knowledge of UAccept. (K(UAccept. )). Finally, we
computed the LDA topic-wise correlation between the topic Ti in the current SV question
(LDA(Q(p), Ti )) and the user knowledge (LDA(K(UAccept. ))) to determine the specific
expertise for post p.
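Under one possible reading of Eq. (6.2), the computation can be sketched as follows; the data structures (post dictionaries and per-user LDA topic vectors) are assumptions for illustration only.

```python
def specific_expertise(posts, topic_i, lda_q, lda_k):
    """For each post with an accepted answer, multiply the topic-i weight of the
    question (LDA(Q(p), T_i)) by the topic-i weight of the accepted answerer's
    concatenated SV answers (LDA(K(U_Accept.))) and sum over all such posts."""
    return sum(
        lda_q[p["id"]][topic_i] * lda_k[p["accepted_answerer"]][topic_i]
        for p in posts if p.get("accepted_answerer") is not None
    )
```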
RQ4: What types of answers are given to SV questions?
Motivation: RQ4 extended RQ2 in terms of the solution types given if an SV question is
satisfactorily answered. We do not aim to provide solutions for every single SV. Rather,
we analyze and compare the types of support for different SV topics on SO and SSE, which
can guide developers to a suitable site depending on their needs (e.g., looking for certain
artefacts). To the best of our knowledge, we are the first to study answer types of SVs on
Q&A sites.
Method : We employed an open coding procedure [384] to inductively identify answer types.
LDA is not suitable for this purpose since it relies on word co-occurrences to determine
categories. In contrast, the same type of solutions may not share any similar words. In
RQ4, we only considered the posts with accepted answer to ensure the high quality and
relevance of the answers. We then used stratified sampling to randomly select 385 posts
(95% confidence level with 5% margin error [291]) each from SO and SSE to categorize
the answer types. Stratification ensured the proportion of each topic was maintained.
Following [385], the author of this thesis and a PhD student with three years of experience in
Software Engineering and Cybersecurity first conducted a pilot study to assign initial codes
to 30% of the selected posts and grouped similar codes into answer types. For example,
the accepted answers of SO posts 32603582 (PostgreSQL code), 20763476 (MySQL code)
and 12437165 (Android/Java code) were grouped into the Code Sample category. Similarly
to [386], we also allowed one post to have more than one answer type. The two same people
then independently assigned the identified categories to the remaining 70% of the posts.
The Kappa inter-rater score (κ) [345] was 0.801 (strong agreement), showing the reliability
of our coding. Another PhD student with two years of experience in Software Engineering and Cybersecurity was involved to discuss and resolve the disagreements. We also correlated
the answer types with the question types on Q&A sites [386].
for the non-security (negative) class to predict security posts on Q&A sites based on a two-
stage PU learning framework [387]. Retrieving non-security posts in practice is challenging
since these posts should not contain any security context, which requires significant human
effort to define and verify. It is also worth noting that manual selection of security/non-
security posts also does not scale to millions of posts on Q&A sites. PUMiner has been
demonstrated to be more effective in retrieving security posts on SO than many learning-
based baselines such as one-class SVM [388, 389] and positive-similarity filtering [390] on
unseen posts. PUMiner can also successfully predict the cases where keyword matching
totally missed with an MCC of 0.745. Notably, with only 1% labelled positive posts,
PUMiner is still 160% better than fully-supervised learning. More details of PUMiner
can be found in Appendix 6.7. Using the curated security posts on SO and SSE, we then
employed tag-based and content-based filtering to retrieve SV posts based on their tags and
content of other parts (i.e., title, body and answers), respectively. We considered a post to
be related to SV when it mainly discussed a security flaw and/or exploitation/testing/fixing
of such flaw to compromise a software system (e.g., SO post 290981423 ). A post was not
SV-related if it just asked how to implement/use a security feature (e.g., SO post 685855)
without any explicit mention of a flaw. All the tags, keywords and posts collected were
released at https://github.com/lhmtriet/SV_Empirical_Study.
Tag-based filtering. SSE has a vulnerability tag that we could use to obtain SV-related posts, but SO does not, and the security tag on SO used by [370] was too coarse-grained for the SV domain. Many posts with the security tag did not explicitly mention SVs (e.g., SO
post 65983245 about privacy or SO post 66066267 about how to obtain security-relevant
commits). Therefore, we used Common Weakness Enumeration (CWE), which contains
various SV-related terms, to define relevant SV tags. However, the full CWE titles were
usually long and uncommonly used in Q&A discussions. For example, the fully-qualified
CWE name of SQL-injection (CWE-89) is “Improper Neutralization of Special Elements
used in an SQL Command (‘SQL Injection’)”, which appeared only nine times on SO
and SSE. Therefore, we needed to extract shorter and more common terms from the full
CWE titles. We adopted Part-of-Speech (POS) tagging for this purpose, in which we
only considered consecutive (n-grams of) verbs, nouns and adjectives since most of them
conveyed the main meaning of a title. For instance, we obtained the following 2-grams
for CWE-89: improper neutralization, special elements, elements used, sql command, sql
injection. We obtained 2,591 n-gram (1 ≤ n ≤ 3) terms that appeared at least once
on either SO or SSE. To ensure the relevance of these terms, we manually removed the
irrelevant terms without any specific SV context (e.g., special elements, elements used and
sql command in the above example). We found 60 and 63 SV-related tags on SO and SSE
that matched the above n-grams, respectively. We then obtained the initial settag of SV
posts that had at least one of these selected tags.
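The n-gram candidate extraction could be sketched as follows with NLTK’s POS tagger (an assumed tool for illustration): runs of consecutive nouns/adjectives/verbs are kept, and 1- to 3-grams are generated within each run.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are available

KEEP_TAGS = ("NN", "JJ", "VB")  # noun, adjective and verb tags (and their subtags)

def cwe_ngram_candidates(cwe_title, max_n=3):
    """POS-tag a CWE title, keep runs of consecutive nouns/adjectives/verbs,
    and emit 1- to 3-gram candidate terms within each run."""
    tagged = nltk.pos_tag(nltk.word_tokenize(cwe_title.lower()))
    runs, current = [], []
    for word, tag in tagged:
        if word.isalpha() and tag.startswith(KEEP_TAGS):
            current.append(word)
        elif current:
            runs.append(current)
            current = []
    if current:
        runs.append(current)
    candidates = set()
    for run in runs:
        for n in range(1, max_n + 1):
            for i in range(len(run) - n + 1):
                candidates.add(" ".join(run[i:i + n]))
    return candidates
```

For the CWE-89 title, this kind of run-based extraction yields 2-grams such as "improper neutralization", "special elements", "elements used", "sql command" and "sql injection", matching the example above.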
Content-based filtering. As recommended by some recent studies [391, 304], tag-based
filtering was not sufficient for selecting posts due to wrong tags (e.g., non SV-post 38539393
on SO with stack-overflow tag) or general tags (e.g., SV post 15029849 on SO with only
php tag). Therefore, as depicted in Fig. 6.1, we customized content-based filtering, which
was based on keyword matching, to refine the settag obtained from the tag-based filtering
step and select missing SV posts that were not associated with SV tags. First, we presented
the up-to-date list of 643 SV keywords for matching at https://github.com/lhmtriet/
SV_Empirical_Study. These keywords were preprocessed with stemming and augmented
with American/British spellings, space/hyphen to better handle various types of (mis-
)spellings/plurality.
3 stackoverflow.com/questions/29098142 (postid: 29098142). SSE format is security.stackexchange.com/questions/postid. Posts in our paper follow these formats.
Figure 6.1: Workflow of retrieving posts related to SV on Q&A websites using tag-based and content-based filtering heuristics.
Table 6.1: Content-based thresholds (a_SO/SSE & b_SO/SSE) for the two steps of the content-based filtering as shown in Fig. 6.1.
| Threshold | SO Step 1 | SO Step 2 | SSE Step 1 | SSE Step 2 |
| a | 1 | 3 | 2 | 3 |
| b | 0.011 | 0.017 | 0.017 | 0.025 |
Table 6.2: The obtained SV posts using our tag-based and content-based
filtering heuristics.
kw_count(p) = |SV_KWs_p|,  kw_ratio(p) = |SV_KWs_p| / |Words_p|
where |SV_KWs_p| and |Words_p| are the numbers of SV keywords and total number of words in post p, respectively.
Based on the post content and human inspection, the thresholds aSO/SSE and bSO/SSE for
filtering settag (step 1) as well as selecting extra posts based on their content (step 2) were
found, as given in Table 6.1. Using these thresholds, we obtained settag and setcontent of
SV posts on SO and SSE, respectively, as shown in Fig. 6.1.
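A simple sketch of the resulting content-based check is given below, assuming both thresholds must be met for a post to be kept; the matching logic is simplified compared to the stemmed/augmented keyword handling described earlier.

```python
def content_filter(post_text, sv_keywords, a, b):
    """Keep a post if it contains at least `a` SV keywords (kw_count) and its
    keyword-to-word ratio (kw_ratio) is at least `b` (thresholds as in Table 6.1)."""
    text = post_text.lower()
    words = text.split()
    kw_count = sum(1 for kw in sv_keywords if kw in text)
    kw_ratio = kw_count / max(len(words), 1)
    return kw_count >= a and kw_ratio >= b
```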
SV datasets and validation. As of June 2020, we had retrieved 285,720 security posts from
SO using PUMiner [304] and 58,912 security posts from SSE using the Stack Exchange Data
Explorer. We then applied the tag-based and content-based filtering steps in Fig. 6.1
and obtained 71,329 SV posts in total (see Table 6.2), comprising 55,883 posts in settag
and 15,436 posts in setcontent. We manually validated the four resulting sets of SV posts,
i.e., settag and setcontent for each of SO and SSE. Specifically, we randomly sampled 385
posts (a statistically significant sample size [291]) from each set for two researchers (i.e.,
the author of this thesis and a PhD student with three years of experience in Software
Engineering and Cybersecurity) to examine independently.
For settag, the two researchers disagreed on only 7/770 cases, and only two posts turned out
to be unrelated to SVs. The main issue was still incorrect tag assignment (e.g., SSE post
175264 was about dll injection but was tagged with malware4), although this issue had been
significantly reduced by the content-based filtering. For setcontent, the relevance of the
posts was very high, as there was no discrepant case.
4 This post was short yet contained many SV keywords (e.g., “injection” and “hijack”), resulting in a high kw_count and kw_ratio in the content-based filtering.
Table 6.3: Top-5 tags of SV, security and general posts on SO and SSE
(in parentheses).
Our SV dataset overlapped with the existing security dataset [370] by only 20%,
implying that there were significant differences in the nature of the two studies. Note that
we followed the settings in [370] to retrieve the updated security posts from the same SO
data we used in our study. We also reported the top tags of SV posts (see Table 6.3) and
compared them with those of security posts [370] and of a subset of all posts containing
the same number of posts as the SV posts on SO and SSE. SV posts were associated with
many SV-related tags (e.g., memory-leaks, malware, segmentation-fault, xss, exploit and
penetration-test). Conversely, security posts were tagged with general terms, such as
encryption, authentication and passwords, that do not necessarily refer to security flaws. The tags
of general posts were mostly programming languages on SO and general security terms on
SSE. These findings highlight the importance of obtaining SV-specific posts instead of
reusing the security posts to study the support of Q&A sites for SV-related discussions.
Table 6.4: SV topics on SO and SSE identified by LDA along with their
proportions and trends over time. Notes: The topic proportions on SSE
are in parentheses. The trends of SO are the top solid sparklines, while the
trends of SSE are the bottom dashed sparklines. Unit of proportion: %.
within the same topic. To avoid insignificant topics, as in [373], we only considered topics
with a probability of at least 0.1 in a post. To label each topic, we manually read the
top-20 most frequent words and 15 random posts of that topic per site (SO/SSE) obtained by
the trained LDA models, as done in [376, 374]. The LDA model with the most relevant set of
topics was then used for answering the four RQs.
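For readers unfamiliar with this modelling step, the following sketch shows how an LDA model of the kind used here could be trained and queried with gensim. The toy corpus, pass count and random seed are placeholders; only the number of topics, α, β and the 0.1 probability cut-off mirror values stated in this chapter.

# Illustrative LDA sketch with gensim; in the study the model is trained on
# tens of thousands of preprocessed SV posts rather than this toy corpus.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

tokenized_posts = [
    ["sql", "injection", "prepared", "statement"],
    ["xss", "script", "escape", "html"],
    ["buffer", "overflow", "segmentation", "fault"],
]

dictionary = Dictionary(tokenized_posts)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_posts]

# 13 topics with alpha = beta (eta) = 0.08, as reported for the optimal model.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=13,
               alpha=0.08, eta=0.08, passes=10, random_state=42)

# Coherence is used to compare candidate numbers of topics (11-17 in the text).
coherence = CoherenceModel(model=lda, texts=tokenized_posts,
                           dictionary=dictionary, coherence="c_v").get_coherence()
print("coherence:", coherence)

# Keep only topics with probability >= 0.1 when assigning topics to a post.
print(lda.get_document_topics(corpus[0], minimum_probability=0.1))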
6.4 Results
6.4.1 RQ1: What are SV Discussion Topics on Q&A Sites?
Following the procedure in Section 6.3.3, we identified 13 SV topics (see Table 6.4) on SO
and SSE using the optimal LDA model with α = β = 0.08. We found that LDA models with
11 to 17 topics produced similar coherence values, so three of the authors manually examined
these candidate models, as in [393]. Duplicate and/or platform-specific topics (e.g., web
and mobile) started to appear with 14 or more topics, making the taxonomy less generalizable,
while models with 11 and 12 topics produced overly high-level topics (e.g., combining XSS
and CSRF). Thus, 13 was chosen as
the optimal number of SV topics. All the terms/posts of each SV topic can be found at
https://github.com/lhmtriet/SV_Empirical_Study. We describe each topic hereafter
with example SO/SSE posts. We examined 15 random posts per topic per site. If we
identified some common patterns of discussions (e.g., attack vectors or assets) on a site, we
would extract another 15 random posts of the respective site to confirm our observations.
If a pattern was no longer evident in the latter 15 posts, we would not report it.
Malwares (T1). This topic referred to the detection and removal of malicious soft-
ware. T1 posts on SO were usually about malware in content management systems such as
WordPress or Joomla (e.g., post 16397854: “How to remove wp-stats malware in wordpress”
or post 11464297: “How to remove .htaccess virus”). In contrast, SSE often discussed
malware/viruses coming from storage devices such as SSDs (e.g., post 227115: “Can viruses of
one ssd transfer to another ssd? ”) or USB (e.g., post 173804: “Can Windows 10 bootable
USB drive get infected while trying to reinstall Windows? ”).
SQL Injection (T2). This topic concerned tactics to properly sanitize malicious
inputs that could modify SQL commands and pose threats (e.g., stealing or changing
data) to databases in various programming languages (e.g., PHP, Java, C#). A commonly
discussed tactic was to use prepared statements, which also helped increase the efficiency
of query processing. For example, developers asked questions like “How to parameterize
complex oledb queries? ” (SO post 9650292) or “How to make this code safe from SQL
injection and use bind parameters” (SSE post 138385).
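As a generic illustration of the prepared-statement tactic (the posts above concern OLE DB and other stacks; this sketch simply uses Python's built-in sqlite3 driver), binding parameters keeps user input out of the SQL text:

# Illustrative only: a parameterized (prepared) query with sqlite3. The
# placeholder "?" makes the driver treat user input strictly as data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', '[email protected]')")

user_input = "alice' OR '1'='1"  # a classic injection attempt

# Vulnerable pattern (string concatenation) would be:
#   "SELECT email FROM users WHERE name = '" + user_input + "'"
# Safe pattern: bind the value instead of splicing it into the command.
rows = conn.execute("SELECT email FROM users WHERE name = ?", (user_input,)).fetchall()
print(rows)  # [] -- the payload matches no real user name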
Vulnerability Scanning Tools (T3). This topic was about issues related to tools
for automated detection/assessment of potential SVs in an application. Discussions of T3
mentioned different tools, and OWASP ZAP was a commonly discussed one. For example,
post 62570277 on SO discussed “Jenkins-zap installation failed ”, while post 126851 on SSE
asked “How do I turn off automated testing in OWASP ZAP? ” One possible explanation
for its popularity is that OWASP ZAP is a free and easy-to-use tool for detecting and
assessing SVs in the well-known OWASP Top-10 list for web applications.
Cross-site Request Forgery (CSRF) (T4). This topic contained discussions on
proper setup and configuration of web application frameworks to prevent CSRF SVs. These
SVs could be exploited to trick an end-user whom a web application trusts into sending
requests that perform unauthorized actions. Discussions covered various issues in implementing
different CSRF prevention techniques recommended by OWASP.5 Some commonly discussed
techniques were anti-CSRF token (e.g., SO post 59664094: “Why Laravel 4 CSRF token
is not working? ”), double submit cookie (e.g., SSE post 203996: “What is double submit
cookie? And how it is used in the prevention of CSRF attack? ”), and SameSite cookie
attribute (e.g., SO post 41841880: “What is the benefit of blocking cookie for clicked link?
(SameSite=strict)”).
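The techniques above can be hard to picture from post titles alone, so here is a heavily simplified, framework-agnostic sketch of two of them (a per-session anti-CSRF token and the SameSite cookie attribute). The secret key, function names and cookie values are hypothetical; real frameworks such as Laravel or Django ship hardened versions of these mechanisms.

# Simplified sketch of two CSRF defences: an anti-CSRF token derived from the
# session, and a session cookie restricted with SameSite. Not production code.
import hashlib
import hmac
import secrets
from http.cookies import SimpleCookie

SECRET_KEY = secrets.token_bytes(32)  # hypothetical server-side secret

def csrf_token(session_id: str) -> str:
    """Derive a per-session token to embed in forms for state-changing requests."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def is_valid_csrf(session_id: str, submitted_token: str) -> bool:
    """Check the token echoed back by the browser in constant time."""
    return hmac.compare_digest(csrf_token(session_id), submitted_token)

# SameSite=Strict stops the browser attaching the session cookie to cross-site
# requests, cutting off the usual CSRF delivery vector.
cookie = SimpleCookie()
cookie["session"] = "abc123"
cookie["session"]["samesite"] = "Strict"
cookie["session"]["httponly"] = True
cookie["session"]["secure"] = True

print(is_valid_csrf("abc123", csrf_token("abc123")))  # True
print(cookie.output())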
File-related Vulnerabilities (T5). Discussions of this topic were about SVs in
files that could be exploited to gain unauthorized access. The common SV types were
Path/Directory Traversal via Symlink (e.g., SSE post 165860: “Symlink file name - possi-
ble exploit? ”), XML External Entity (XXE) Injection (e.g., SO post 51860873: “Is SAX-
ParserFactory susceptible to XXE attacks? ”), and Unrestricted File Upload (e.g., SSE post
111935: “Exploiting a PHP server with a .jpg file upload”). These SVs were usually reported
on Linux-based systems, likely because Linux is widely used for servers.
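As a small illustration of the path/directory traversal issue in this topic (XXE and upload handling require library-specific fixes, e.g., hardened XML parsers), the following hedged sketch confines user-supplied file names to an assumed upload directory:

# Illustrative path-traversal guard: resolve the requested path (which also
# follows symlinks) and refuse anything that escapes the assumed base folder.
from pathlib import Path

BASE_DIR = Path("/var/www/uploads").resolve()  # hypothetical upload directory

def safe_resolve(user_supplied_name: str) -> Path:
    candidate = (BASE_DIR / user_supplied_name).resolve()
    if candidate != BASE_DIR and BASE_DIR not in candidate.parents:
        # e.g. "../../etc/passwd" or a symlink pointing outside the folder
        raise PermissionError(f"path escapes upload directory: {user_supplied_name}")
    return candidate

print(safe_resolve("report.pdf"))
# safe_resolve("../../etc/passwd")  # would raise PermissionError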
Synchronization Errors (T6). This topic involved SVs caused by errors in
synchronization logic (usually related to threads), which could slow down system perfor-
mance. Some common SV types being discussed were deadlocks (e.g., SO post 38960765:
“How to avoid dead lock due to multiple oledb command for same table in ssis”) and race
conditions (e.g., SSE post 163209: “What’s the meaning of ‘the some sort of race condition’
here? ”).
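To make the deadlock scenario concrete, here is a minimal, hypothetical sketch (not taken from any of the cited posts) in which two workers touch the same two locks; acquiring them in one global order is what prevents the circular wait:

# Minimal deadlock-avoidance sketch: both workers acquire lock_a before lock_b.
# If one of them took lock_b first, the two threads could block each other forever.
import threading

lock_a, lock_b = threading.Lock(), threading.Lock()

def worker(name: str) -> None:
    with lock_a:          # always the same global order: lock_a, then lock_b
        with lock_b:
            print(f"{name} updated both shared resources")

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()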
Encryption Errors (T7). This topic included cryptographic issues leading to falsified
authentication or retrieval of sensitive data, e.g., via Man-in-the-Middle (MITM) attacks. Many
posts discussed public/private keys for encryption/decryption, especially using SSL/TLS
certificates to defend against MITM attacks (attempts to steal information sent between
browsers and servers). Some example discussions are post 23406005 on SO (“Man In
Middle Attack for HTTPS ”) or post 105773 on SSE (“How is it that SSL/TLS is so secure
against password stealing? ”). This may imply that many developers are still not familiar
with these certificates in practice.
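As a small, hedged illustration of what correct certificate handling looks like (the host name below is only a placeholder and the snippet needs network access), Python's standard ssl module verifies both the certificate chain and the host name when its default context is used:

# Illustrative TLS client: the default context loads trusted CAs, requires a
# valid certificate and checks that it matches the host name, which is what
# defeats the basic MITM scenario discussed in these posts.
import socket
import ssl

hostname = "example.com"  # placeholder host
context = ssl.create_default_context()

with socket.create_connection((hostname, 443), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print("negotiated:", tls.version())
        print("certificate subject:", tls.getpeercert()["subject"])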
5 https://cheatsheetseries.owasp.org/cheatsheets/Cross-Site_Request_Forgery_Prevention_Cheat_Sheet.html
Resource Leaks (T8). This topic considered SVs arising from improper release of
unused memory, which could deplete resources and decrease system performance. Many
discussions of T8 were about memory leaks in mobile app development. Issues were usually
related to Android (e.g., SO post 58180755: “Deal with Activity Destroying and Memory
leaks in Android”) or iOS (e.g., SO post 47564784: “iOS dismissing a view controller doesn’t
release memory”).
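The Activity and view-controller leaks above are platform-specific, but the underlying pattern, a long-lived holder keeping strong references to objects past their useful lifetime, can be sketched generically; the class and variable names below are purely illustrative:

# Generic resource-leak illustration: a long-lived listener registry with
# strong references keeps "closed" screens alive, whereas weak references let
# them be reclaimed (relies on CPython's immediate reference counting).
import weakref

class Screen:  # stand-in for an Android Activity or iOS view controller
    pass

leaky_listeners = []                 # strong references: screens never freed
safe_listeners = weakref.WeakSet()   # weak references: freed once unused

screen_a, screen_b = Screen(), Screen()
leaky_listeners.append(screen_a)
safe_listeners.add(screen_b)

del screen_a, screen_b               # the screens are "closed"
print(len(leaky_listeners))          # 1 -> screen_a is still held (leak)
print(len(safe_listeners))           # 0 -> screen_b was reclaimed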
Network Attacks (T9). This topic discussed attacks carried out over an online
computer network, e.g., Denial of Service (DoS) and IP/ARP Spoofing, and potential
mitigations. These network attacks directly affected the availability of a system. For
instance, SSE post 86440 discussed “VPN protection against DDoS ” or SO post 31659468
asked “How to prevent ARP spoofing attack in college? ”.
Memory Allocation Errors (T10). T10 and T8 were both related to memory issues,
but T10 did not concern memory release. Rather, this topic focused more on SVs caused
by accessing or using memory outside of what was allocated, which could be exploited to
access restricted memory locations or crash an application. In this topic, segmentation faults (e.g.,
SO post 31260018: “Segmentation fault removal duplicate elements in unsorted linked list”)
and buffer overflows (e.g., SSE post 190714: “buffer overflow 64 bit issue”) were commonly
discussed by developers.
Cross-site Scripting (XSS) (T11). This topic mentioned tactics to properly neu-
tralize user inputs to a web page to prevent XSS attacks. These attacks could exploit users’
trust in web servers/pages to trick them into executing malicious scripts and performing unwanted
actions. XSS (T11) and CSRF (T4) are both client-side SVs, but XSS is more dangerous
since it can bypass all countermeasures of T4.5 On SO and SSE, discussions covered all
three types of XSS: (i) reflected XSS (e.g., SSE post 57268: “How does the anchor tag
(<a>) let you do an Reflected XSS? ”), (ii) stored/persistent XSS (e.g., SO post 54771897:
“How to defend against stored XSS inside a JSP attribute value in a form”), and (iii)
DOM-based XSS (e.g., SO post 44673283: “DOM XSS detection using javascript(source
and sink detection)”).
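As a minimal illustration of the neutralization tactic behind these posts (production code should rely on a context-aware templating engine rather than manual escaping), HTML-escaping untrusted input turns an injected script into inert text:

# Illustrative XSS neutralization: escape untrusted input before embedding it
# in HTML so an injected <script> payload is displayed instead of executed.
import html

user_comment = '<script>new Image().src="https://evil.example/?c=" + document.cookie</script>'

unsafe_html = "<p>" + user_comment + "</p>"              # reflected/stored XSS
safe_html = "<p>" + html.escape(user_comment) + "</p>"   # payload neutralized

print(safe_html)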
Vulnerability Theory (T12). This topic focused on theoretical/social aspects and
best practices in the SV life cycle. Many posts compared different SV-related terminolo-
gies, e.g., SSE post 103018 asked about “In CIA triad of information security, what’s the
difference between confidentiality and availability? ” or SO post 402936 discussed “Bugs
versus vulnerabilities? ”. Several other posts asked about the internal SV reporting process
(e.g., SO post 3018198: “How best to present a security vulnerability to a web development
team in your own company? ”) or public SV disclosure policy (e.g., SSE post: “How to
properly disclose a security vulnerability anonymously? ”).
Brute-force/Timing Attacks (T13). T13 and T7 both concerned the exploitation of
cryptographic flaws, but the two topics involved different attack vectors/methods. T7 focused on MITM
attacks, while T13 was about attacks making excessive attempts or capturing the timing
of a process to gain unauthorized access. Some example posts of T13 are SO post 3009988
(“What’s the big deal with brute force on hashes like MD5 ”) or SSE post 9192 (“Timing
attacks on password hashes”).
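To connect the two concerns of this topic, the hedged sketch below slows brute-force guessing with a salted, deliberately expensive password hash and uses a constant-time comparison so that the check itself leaks no timing information; the iteration count and function names are illustrative choices only:

# Illustrative defences for T13: PBKDF2 with a random salt makes brute-forcing
# each guess expensive, and hmac.compare_digest avoids the timing side channel
# of an early-exit string comparison.
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes = None):
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)  # constant-time comparison

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("123456", salt, stored))                        # False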
Proportion and Evolution of SV Topics. We analyzed the proportion (share metric
in Eq. (6.1)) and the evolution trend of SV topics from their inception on SO (2008) and
SSE (2010) to 2020 (see Table 6.4). The topic patterns and dynamics of SO were different
from those of SSE. Specifically, Memory Allocation Errors (T10) had the greatest number
of posts on SO, while Vulnerability Theory (T12) had the largest proportion on SSE. Apart
from XSS (T11) and Brute-force/Timing Attacks (T13), topics with many posts in one
source were not common in the other source. Moreover, we discovered consistent topic trends
on both SO and SSE: Malwares (T1), CSRF (T4) and File-related SVs (T5) were rising (↗),
while Vulnerability Theory (T12) was declining (↘). Among them, CSRF had the fastest changing