Yin et al. - 2024 - Multitask-based Evaluation of Open-source LLM on Software Vulnerability
arXiv:2404.02056v3 [cs.SE] 6 Jul 2024
Abstract— This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks. This evaluation assesses the multi-tasking capabilities of LLMs based on this dataset. We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) have demonstrated superior performance compared to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, though LLMs perform well in some aspects, they still need improvement in understanding the subtle differences in code vulnerabilities and the ability to describe vulnerabilities to fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.
1 INTRODUCTION
Software Vulnerabilities (SVs) can expose software systems to risk situations and eventually cause huge economic losses or even threaten people's lives. Therefore, addressing software vulnerabilities is an important task for software quality assurance (SQA). Generally, there are many important software quality activities for software vulnerabilities, such as SV detection, SV assessment, SV location, and SV description. The relationship among these SQA activities is intricate and interdependent and can be illustrated in Fig. 1. SV detection serves as the initial phase, employing various tools and techniques to identify potential vulnerabilities within the software. Once detected, the focus shifts to SV assessment, where the severity and potential impact of each vulnerability are meticulously evaluated. This critical evaluation informs the subsequent steps in the process. SV location follows the assessment, pinpointing the exact areas within the software's code or architecture where vulnerabilities exist. This step is crucial for precise remediation efforts and to prevent the recurrence of similar vulnerabilities in the future. The intricacies of SV location feed into the comprehensive SV description, which encapsulates detailed information about each vulnerability, including its origin, characteristics, and potential exploits. In essence, the synergy among SV detection, SV assessment, SV location, and SV description creates a robust pipeline for addressing software vulnerabilities comprehensively. This systematic approach not only enhances the overall quality of the software but also fortifies it against potential threats, thereby safeguarding against economic losses and potential harm to individuals. As a cornerstone of software quality assurance, the seamless integration of these activities underscores the importance of a proactive and thorough approach to managing software vulnerabilities in today's dynamic and interconnected digital landscape.

Fig. 1: The relationship among software vulnerability analysis activities, covering Source Code, Vulnerability Detection (RQ-1), Vulnerability Assessment (RQ-2), Vulnerability Location (RQ-3), Vulnerability Description (RQ-4), and LLM Output.

Recently, Large Language Models (LLMs) [1] have been widely adopted since the advances in Natural Language Processing (NLP), which enable LLMs to be well-trained with both billions of parameters and billions of training samples, consequently bringing a large performance improvement on the tasks they are applied to. LLMs can be easily used for a downstream task by being fine-tuned [2] or being prompted [3], since they are trained to be general and can capture different knowledge from various domain data. Fine-tuning is used to update model parameters for a particular downstream task by iterating the model on a specific dataset, while prompting can be directly used by providing natural language descriptions or a few examples of the downstream task. Compared to prompting, fine-tuning is expensive since it requires additional model training and has limited usage scenarios, especially in cases where sufficient training datasets are unavailable.

Both Xin Yin and Chao Ni are with the State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China. Chao Ni is also with Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute, Hangzhou, China. E-mail: {xyin, chaoni}@zju.edu.cn. Shaohua Wang is with Central University of Finance and Economics, China. E-mail: [email protected]. Chao Ni is the corresponding author.
Fig. 2: The capability comparison of LLMs with different parameter sizes on different software vulnerability tasks. (a) LLMs' performance on different software vulnerability tasks (∗ refers to the results under the fine-tuning setting); (b) The impacts of parameter sizes on LLMs' performance across different software vulnerability tasks. Compared models include DeepSeek-Coder 6.7B/33B, CodeLlama 7B/34B, StarCoder 7B/15.5B, WizardCoder 7B/34B, Mistral 7B, and Phi-2 2.7B.
LLMs have demonstrated remarkable language comprehension and generation capabilities, and have been able to perform well on a variety of natural language processing tasks, such as text summarization [4]. Given the outstanding performance of LLMs, there is a growing focus on exploring their potential in software engineering tasks and seeking new opportunities to address them. Currently, as more and more LLMs designed for software engineering tasks are deployed [5]–[11], many research works have focused on the application of LLMs in the software engineering domain [12]–[16]. However, in the existing literature, adequate systematic reviews and surveys have been conducted on LLMs in areas such as generating high-quality code and high-coverage test cases [17], [18], but a systematic review and evaluation of open-source LLMs in the field of software vulnerability is still missing.

In this paper, we focus on evaluating LLMs' performance in various software vulnerability (SV)-related tasks in few-shot and fine-tuning settings to obtain a basic, comprehensive, and better understanding of their multi-task ability, and we aim to answer the following research questions.

• RQ-1: How do LLMs perform on vulnerability detection? Software Vulnerabilities (SVs) can expose software systems to risk situations and consequently cause software function failures. Therefore, detecting these SVs is an important task for software quality assurance. We aim to explore the ability of LLMs in vulnerability detection as well as the performance difference compared with state-of-the-art approaches and pre-trained Language Models (LMs).

• RQ-2: How do LLMs perform on vulnerability assessment? In practice, due to the limitation of SQA resources [19], it is impossible to treat all detected SVs equally and fix all SVs simultaneously. Thus, it is necessary to prioritize these detected software vulnerabilities for better treatment. An effective solution to prioritize those SVs is to use one of the most widely known SV assessment frameworks, CVSS (Common Vulnerability Scoring System) [20], which characterizes SVs by considering three metric groups: Base, Temporal, and Environmental. The metrics in these groups can be further used as the criterion for selecting serious SVs to fix early. Therefore, we aim to explore the ability of LLMs to assess vulnerabilities and compare their performance with pre-trained LMs.

• RQ-3: How do LLMs perform on vulnerability location? Identifying the precise location of vulnerabilities in software systems is of critical importance for mitigating risks and improving software quality. The vulnerability location task involves pinpointing these weaknesses accurately and helps to narrow the scope for developers to fix problems. Therefore, we aim to investigate LLMs' capability in effectively identifying the precise location of vulnerabilities in software systems, alongside evaluating their performance against state-of-the-art approaches and pre-trained LMs.

• RQ-4: How do LLMs perform on vulnerability description? Understanding the intricacies of vulnerabilities in software systems plays a pivotal role in alleviating risks and bolstering software quality. The vulnerability description task focuses on conveying a detailed explanation of these identified issues in the source code and helps participants to better understand the risk as well as its impacts. Our goal is to evaluate LLMs' ability to effectively generate vulnerability descriptions within software systems and compare their performance with that of pre-trained LMs.
To extensively and comprehensively analyze the LLMs' ability, we use a large-scale dataset containing real-world project vulnerabilities (named Big-Vul [21]). We carefully design experiments to discover the findings by answering four RQs. The main contribution of our work is summarized as follows, and takeaway findings are shown in Table 1. Eventually, we present the comparison of LLMs across four software vulnerability tasks under different settings, as well as the impact of varying model sizes on performance, as depicted in Fig. 2(a) and Fig. 2(b). In summary, the key contributions of this paper include:

• We extensively evaluate the performance of LLMs on different software vulnerability tasks and conduct an extensive comparison among LLMs and learning-based approaches to software vulnerability.
• We design four RQs to comprehensively understand LLMs from different dimensions, and provide detailed results with examples.
• We release our replication package for further study [22].

2 BACKGROUND AND RELATED WORK

2.1 Large Language Model

Since the advancements in Natural Language Processing, Large Language Models (LLMs) [1] have seen widespread adoption due to their capacity to be effectively trained with billions of parameters and training samples, resulting in significant performance enhancements. LLMs can readily be applied to downstream tasks through either fine-tuning [2] or prompting [3]. Their versatility stems from being trained to possess a broad understanding, enabling them to capture diverse knowledge across various domains. Fine-tuning involves updating the model parameters specifically for a given downstream task through iterative training on a specific dataset. In contrast, prompting allows for direct utilization by providing natural language descriptions or a few examples of the downstream task. Compared to prompting, fine-tuning is resource-intensive as it necessitates additional model training, and it is applicable in limited scenarios, particularly when adequate training datasets are unavailable.

LLMs are usually built on the transformer architecture [23] and can be classified into three types of architectures: encoder-only, encoder-decoder, and decoder-only. Encoder-only models (e.g., CodeBERT [24], GraphCodeBERT [25], and UniXcoder [26]) and encoder-decoder models (e.g., PLBART [27], CodeT5 [7], and CodeT5+ [8]) are trained using the Masked Language Modeling (MLM) or Masked Span Prediction (MSP) objective, respectively, where a small portion (e.g., 15%) of the tokens are replaced with either masked tokens or masked span tokens, and the models are trained to recover them. These models are trained as general ones on code-related data and then are fine-tuned for the downstream tasks to achieve superior performance. Decoder-only models have also attracted attention; they are trained using a Causal Language Modeling objective to predict the probability of the next token given all previous tokens. GPT [2] and its variants are the most representative models, which bring large language models into practical usage.
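To make the two pre-training objectives above concrete, the following toy Python sketch builds training pairs for masked language modeling and for causal language modeling from the same token sequence. The token list, mask rate, and helper names are illustrative assumptions, not the actual pre-training code of the cited models.

import random

MASK = "<mask>"

def mlm_example(tokens, mask_rate=0.15, seed=0):
    # Masked Language Modeling: hide roughly 15% of the tokens; the model must recover them.
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)
            targets.append(tok)    # only masked positions contribute to the loss
        else:
            inputs.append(tok)
            targets.append(None)   # ignored by the loss
    return inputs, targets

def causal_lm_example(tokens):
    # Causal Language Modeling: predict each token from all previous tokens.
    return tokens[:-1], tokens[1:]

code_tokens = "int main ( ) { return 0 ; }".split()
print(mlm_example(code_tokens))
print(causal_lm_example(code_tokens))

In the paper's terminology, the encoder-only and encoder-decoder models use the first kind of objective, while the decoder-only LLMs studied here use the second.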
Recently, the ChatGPT model has attracted the widest attention from the world. It is the successor of the large language model InstructGPT [28], with a dialog interface, and is fine-tuned using the Reinforcement Learning with Human Feedback (RLHF) approach [28]–[30]. RLHF initially fine-tunes the base model using a small dataset of prompts as input and the desired output, typically human-written, to refine its performance. Subsequently, a reward model is trained on a larger set of prompts by sampling outputs generated by the fine-tuned model. These outputs are then reordered by human labelers to provide feedback for training the reward model. Reinforcement learning [31] is then used to calculate rewards for each output generated based on the reward model, updating LLM parameters accordingly. With fine-tuning and alignment with human preferences, LLMs better understand input prompts and instructions, enhancing performance across various tasks [28], [32].

The application of LLMs in software engineering has seen a surge, with models like ChatGPT being employed for various tasks (e.g., code review, code generation, and vulnerability detection). Although some works use LLMs for vulnerability tasks [33], [34], our work differs from these previous studies in the following aspects. (1) Closed-source ChatGPT vs. Open-source LLMs: They only explore the capabilities of the closed-source ChatGPT in vulnerability tasks, whereas we investigate the abilities of both open-source code-related LLMs and general LLMs in these tasks. (2) Prompts vs. Few-shot and Fine-tuning Settings: They focus solely on the performance of LLMs using prompts, which introduces randomness and hinders the reproducibility of their findings. In contrast, we examine the capabilities of LLMs under both few-shot and fine-tuning settings, providing the source code and corresponding model files to ensure the reproducibility of our experimental results.

2.2 Software Vulnerability

Software Vulnerabilities (SVs) can expose software systems to risk situations and consequently put the software under cyber-attacks, eventually causing huge economic losses and even threatening people's lives. Therefore, vulnerability databases have been created to document and analyze publicly known security vulnerabilities. For example, Common Vulnerabilities and Exposures (CVE) [35], [36] and SecurityFocus [37] are two well-known vulnerability databases. Besides, Common Weakness Enumeration (CWE) defines the common software weaknesses of individual vulnerabilities, which are often referred to as the vulnerability types of CVEs. To better address these vulnerabilities, researchers have proposed many approaches for understanding the effects of software vulnerabilities, including SV detection [38]–[50], SV assessment [20], [51]–[54], SV location [55]–[57], SV repair [58]–[61] as well as SV description [62]–[65]. Many novel technologies are adopted to promote the progress of software vulnerability management, including software analysis [66], [67], machine learning [38], [45], and deep learning [51], [56], especially LLMs [63], [64].

3 EXPERIMENTAL DESIGN

In this section, we present our studied dataset, our studied LLMs, the techniques for fine-tuning, the prompt engineering, the baseline approaches, the evaluation metrics, and the experiment settings.
TABLE 1: Takeaway findings

Vulnerability Detection:
1. LLMs can detect vulnerabilities, but fine-tuned LLMs perform weaker than transformer-based approaches. Considering the computational resources and time costs of deploying LLMs, transformer-based approaches for vulnerability detection are a more efficient choice.
2. After fine-tuning, the detection capability of LLMs has improved. Larger models usually perform better, but performance can also be influenced by model design and pre-training data. Therefore, fine-tuning the LLM on domain-specific data before using it as a vulnerability detector is necessary.
3. In general, different LLMs complement each other, while CodeLlama obtains better performance in terms of F1-score, Precision, and Recall.

Vulnerability Location:
6. The few-shot setting exposes LLMs' limitations, and fine-tuning can greatly enhance the vulnerability location capabilities of LLMs.
7. Fine-tuning code-related LLMs as vulnerability locators is beneficial, as they can outperform pre-trained language models in terms of F1-score, Precision, and FPR.
TABLE 3: Overview of the studied LLMs

Code-related LLMs: DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder; General LLMs: Mistral, Phi-2
Fine-Tuning: DeepSeek-Coder 6.7B; CodeLlama 7B; StarCoder 7B; WizardCoder 7B; Mistral 7B; Phi-2 2.7B
Few-Shot: DeepSeek-Coder 6.7B & 33B; CodeLlama 7B & 34B; StarCoder 7B & 15.5B; WizardCoder 7B & 34B; Mistral 7B; Phi-2 2.7B
Release Date: DeepSeek-Coder Nov'23; CodeLlama Aug'23; StarCoder May'23; WizardCoder June'23; Mistral Sep'23; Phi-2 Dec'23

English and Chinese. They provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, DeepSeek-Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

CodeLlama, proposed by Rozière et al. [11], is a set of large pre-trained language models for code built on Llama 2. They achieve state-of-the-art performance among open models on code tasks, provide infilling capabilities, support large input contexts, and demonstrate zero-shot instruction following for programming problems. CodeLlama is created by further training Llama 2 using increased sampling of code data. As with Llama 2, the authors applied extensive safety mitigations to the fine-tuned CodeLlama versions.

StarCoder, proposed by Li et al. [10], is a large pre-trained language model specifically designed for code. It

things like writing, summarizing texts, and coding, but with better common sense and understanding than its earlier version, Phi-1.5. Phi-2's evaluation demonstrates its proficiency over larger models in aggregated benchmarks, emphasizing the potential of smaller models to achieve comparable or superior performance to their larger counterparts. This is particularly evident in its comparison with Google Gemini Nano 2, where Phi-2 outshines despite its smaller size.

3.3 Model Fine-Tuning

The four software vulnerability tasks can be categorized into two types: discriminative tasks (i.e., software vulnerability detection, software vulnerability assessment, and software vulnerability location) and a generative task (i.e., software vulnerability description). Therefore, fine-tuning LLMs for software vulnerability tasks can be undertaken through both discriminative and generative methods, each specifically designed to align LLMs with the task. In particular, we treat the discriminative tasks as binary classification, while treating the generative task as a generation one. The architectures for the two paradigms are presented in Fig. 3.
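As a rough illustration of the two fine-tuning paradigms described above, the sketch below formats one training sample for the discriminative (binary classification) setting and one for the generative (description) setting. The field names and prompt wording are assumptions for illustration, not the authors' exact implementation.

def discriminative_sample(func_source, is_vulnerable):
    # Binary classification: the model predicts a 0/1 label for the whole function.
    return {"text": func_source, "label": 1 if is_vulnerable else 0}

def generative_sample(func_source, cve_description):
    # Generation: the model learns to produce the vulnerability description.
    prompt = ("Describe the vulnerability in the following C function:\n"
              + func_source + "\nDescription:")
    return {"prompt": prompt, "completion": " " + cve_description}

vulnerable_func = "void copy(char *dst, char *src) { strcpy(dst, src); }"
print(discriminative_sample(vulnerable_func, True))
print(generative_sample(vulnerable_func, "A buffer overflow caused by an unbounded strcpy."))

Which format applies follows the task type above: the three discriminative tasks use the first format, and vulnerability description uses the second.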
TABLE 4: The task descriptions and indicators for different software vulnerability tasks
TABLE 7: The software vulnerability detection comparison on Top-10 CWEs among fine-tuned LLMs (RQ1)

CWE Type, # Total, # Vul., then F1-score and Precision for DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, and Phi-2 (six values each, in that order):
CWE-119 1549 128 0.321 0.309 0.316 0.269 0.258 0.281 0.223 0.197 0.212 0.215 0.181 0.197
CWE-20 1082 80 0.269 0.273 0.229 0.216 0.145 0.269 0.173 0.163 0.141 0.165 0.096 0.175
CWE-264 800 64 0.486 0.468 0.337 0.348 0.356 0.477 0.357 0.316 0.232 0.308 0.269 0.361
CWE-399 697 35 0.355 0.286 0.209 0.274 0.227 0.306 0.227 0.169 0.125 0.191 0.143 0.196
CWE-125 582 29 0.233 0.267 0.213 0.195 0.179 0.180 0.145 0.156 0.129 0.128 0.108 0.109
CWE-200 573 27 0.269 0.261 0.241 0.180 0.162 0.229 0.182 0.159 0.151 0.132 0.106 0.152
CWE-189 442 21 0.235 0.208 0.255 0.273 0.178 0.293 0.145 0.119 0.151 0.180 0.108 0.182
CWE-362 413 16 0.031 0.086 0.075 0.050 0.026 0.032 0.017 0.045 0.040 0.029 0.014 0.018
CWE-416 406 12 0.193 0.178 0.148 0.145 0.146 0.141 0.113 0.101 0.083 0.093 0.086 0.082
CWE-476 367 11 0.091 0.109 0.053 0.057 0.037 0.019 0.049 0.057 0.028 0.032 0.020 0.010
CWE Type, # Total, # Vul., then Recall and Accuracy for DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, and Phi-2 (six values each, in that order):
CWE-119 1549 128 0.570 0.719 0.625 0.359 0.453 0.492 0.801 0.735 0.777 0.839 0.785 0.792
CWE-20 1082 80 0.609 0.844 0.609 0.313 0.297 0.578 0.804 0.735 0.758 0.866 0.793 0.814
CWE-264 800 64 0.763 0.900 0.613 0.400 0.525 0.700 0.839 0.795 0.759 0.850 0.810 0.846
CWE-399 697 35 0.815 0.926 0.630 0.481 0.556 0.704 0.885 0.821 0.815 0.901 0.854 0.877
CWE-125 582 29 0.586 0.931 0.621 0.414 0.517 0.517 0.808 0.746 0.771 0.830 0.763 0.765
CWE-200 573 27 0.514 0.743 0.600 0.286 0.343 0.457 0.829 0.743 0.770 0.841 0.784 0.812
CWE-189 442 21 0.625 0.813 0.813 0.563 0.500 0.750 0.853 0.776 0.828 0.891 0.833 0.869
CWE-362 413 16 0.200 0.800 0.600 0.200 0.200 0.200 0.847 0.794 0.821 0.908 0.821 0.855
CWE-416 406 12 0.667 0.750 0.667 0.333 0.500 0.500 0.835 0.796 0.773 0.884 0.828 0.820
CWE-476 367 11 0.571 1.000 0.429 0.286 0.286 0.143 0.782 0.687 0.706 0.820 0.719 0.725
Task Description: If this C code snippet has vulnerabilities, output Yes; otherwise, output No.
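For the few-shot setting, a detection prompt can be assembled from this task description plus a handful of labeled demonstrations followed by the query function. The sketch below is one such assembly; the formatting, the number of shots, and the example snippets are assumptions for illustration rather than the exact prompt used in this study.

TASK = ("If this C code snippet has vulnerabilities, output Yes; "
        "otherwise, output No.")

# Hypothetical labeled demonstrations (e.g., drawn from the training split).
SHOTS = [
    ("void f(char *s) { char buf[8]; strcpy(buf, s); }", "Yes"),
    ("int add(int a, int b) { return a + b; }", "No"),
]

def few_shot_prompt(query_code, shots=SHOTS, task=TASK):
    lines = [task, ""]
    for code, label in shots:
        lines += ["Code:", code, "Answer: " + label, ""]
    lines += ["Code:", query_code, "Answer:"]
    return "\n".join(lines)

print(few_shot_prompt("char *p = NULL; *p = 1;"))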
In addition to pre-trained LMs, we also consider the following five SOTA baselines: Devign [38], Reveal [47], IVDetect [56], LineVul [39], and SVulD [48]. These baselines can be divided into two groups: graph-based (i.e., Devign, Reveal, and IVDetect) and transformer-based (i.e., pre-trained LMs, LineVul, and SVulD). Besides, in order to comprehensively compare the performance among baselines and LLMs, we consider four widely used performance measures (i.e., Precision, Recall, F1-score, and Accuracy) and conduct experiments on the popular dataset. Since graph-based approaches need to obtain the structure information (e.g., control flow graph (CFG), data flow graph (DFG)) of the studied functions, we adopt the same toolkit, Joern, to transform functions. The functions are dropped directly if they cannot be transformed by Joern successfully. Finally, the filtered dataset (shown in Table 2) is used for evaluation. We follow the same strategy as previous work does [39], [58] to build the training data, validation data, and testing data from the original dataset. Specifically, 80% of functions are treated as training data, 10% of functions are treated as validation data, and the remaining 10% of functions are treated as testing data. We also keep the distribution the same as the original one in the training, validation, and testing data. We undersample the non-vulnerable functions to produce approximately balanced training data at the function level, while the validation and testing data remain in the original imbalanced ratio. Apart from presenting the overall performance comparison, we also give the detailed performance of LLMs on the Top-10 CWE types for a better analysis.
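A minimal sketch of the data preparation described above: an 80%/10%/10% split that keeps the original label distribution, followed by undersampling of non-vulnerable functions in the training set only. The function and field names are assumptions for illustration.

import random

def split_and_undersample(functions, seed=42):
    # `functions` is a list of dicts with a boolean "vulnerable" field.
    rng = random.Random(seed)
    train, valid, test = [], [], []
    # Stratify by label so validation and testing keep the original imbalanced ratio.
    for label in (True, False):
        group = [f for f in functions if f["vulnerable"] == label]
        rng.shuffle(group)
        n = len(group)
        train += group[:int(0.8 * n)]
        valid += group[int(0.8 * n):int(0.9 * n)]
        test += group[int(0.9 * n):]
    # Undersample the non-vulnerable majority class in the training data only.
    vul = [f for f in train if f["vulnerable"]]
    non_vul = [f for f in train if not f["vulnerable"]]
    rng.shuffle(non_vul)
    balanced_train = vul + non_vul[:len(vul)]
    rng.shuffle(balanced_train)
    return balanced_train, valid, test

funcs = [{"id": i, "vulnerable": i % 20 == 0} for i in range(1000)]  # toy data
train, valid, test = split_and_undersample(funcs)
print(len(train), len(valid), len(test))  # 80 100 100 with the toy data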
Results. [A] LLMs vs. Baselines. Table 6 shows the overall performance measures between LLMs and eleven baselines, and the best performances are highlighted in bold. According to the results in Table 6, we can make the following observations:

(1) Fine-tuned LLMs have poor performance compared with transformer-based approaches when considering F1-score, Precision, and Accuracy. In particular, SVulD obtains 0.336, 0.282, and 0.915 in terms of F1-score, Precision, and Accuracy, surpassing the fine-tuned LLMs by 24.4% to 57.0%, 64.0% to 108.9%, and 6.3% to 20.2%, respectively. Notably, the F1-score performance of LineVul is significantly lower (0.272) than that reported in the original paper (0.910). We further analyze this discrepancy in Section 5.1.

(2) The performance of fine-tuned LLMs is comparable to graph-based approaches. For example, in terms of F1-score, fine-tuned LLMs achieve a range of 0.214 to 0.270. In comparison, graph-based approaches achieve a range of 0.200 to 0.232.

(3) LLMs under the few-shot setting have poor performance compared with baselines. LLMs ranging from 2.7B to 34B parameters perform less favorably than baselines in terms of F1-score and Precision. However, as for Accuracy, SVulD (transformer-based) obtains the best performance (0.915), and DeepSeek-Coder 6.7B under the few-shot setting achieves a performance of 0.823, which is better than the three graph-based approaches.

Finding-1. LLMs can detect vulnerabilities, but fine-tuned LLMs perform weaker than transformer-based approaches. Considering the computational resources and time costs of deploying LLMs, transformer-based approaches for vulnerability detection are a more efficient choice.

[B] Fine-Tuning vs. Few-Shot. The experimental results are presented in Table 6. Based on these experimental findings, we can draw the following observations: (1) LLMs fine-tuned for vulnerability detection demonstrate superior performance on the task compared to LLMs in the few-shot setting. The average F1-score and average Precision have doubled, while the average Recall has also shown improvement. (2) LLMs with more parameters typically exhibit better performance. For example, CodeLlama 34B improves upon CodeLlama 7B by 19.4%, 34.5%, and 37.0% in terms of F1-score, Precision, and Accuracy, respectively. However, different LLMs may exhibit performance variations due to differences in model design and the quality of pre-training data. (3) Phi-2 achieves performance approximating that of other LLMs with 7 billion parameters, even with a parameter size of 2.7 billion. This may be attributed to the higher quality of its pre-training data.
Finding-2. After fine-tuning, the detection capability of LLMs has improved. Larger models usually perform better, but performance can also be influenced by model design and pre-training data. Therefore, fine-tuning the LLM on domain-specific data before using it as a vulnerability detector is necessary.

[C] The comparisons of Top-10 CWE types between LLMs. Table 7 shows the detailed comparisons of Top-10 CWE types between fine-tuned LLMs. In this table, we highlight the best performance for each performance metric in bold. According to the results, we can make the following observations: (1) In most cases, CodeLlama obtains better performance than other LLMs in terms of F1-score, Precision, and Recall. Different LLMs have certain advantages in different CWE types, complementing each other. (2) Considering the performance of F1-score, Precision, and Recall, CodeLlama achieves the best performances on CWE-125 ("Out-of-bounds Read"), CWE-362 ("Concurrent Execution using Shared Resource with Improper Synchronization ('Race Condition')"), and CWE-476 ("NULL Pointer Dereference"), which indicates CodeLlama is exceptionally skilled at detecting and mitigating vulnerabilities related to memory handling and synchronization issues.

Finding-3. In general, different LLMs complement each other, while CodeLlama obtains better performance in terms of F1-score, Precision, and Recall.

4.2 RQ-2: Evaluating Vulnerability Assessment of LLMs

In this RQ, we delineate two task descriptions for vulnerability assessment: (1) code-based and (2) code-based with additional key information. We compare the performance of LLMs in both task descriptions for vulnerability assessment and concurrently conduct a case study to illustrate the effectiveness of incorporating key important information.

Experimental Setting. We instruct the LLM with the following task descriptions (i.e., Task Description 1 and Task Description 2), telling it to act as a vulnerability assessor. We first provide the LLM with the vulnerable code to explore its performance (Task Description 1). Moreover, we provide the LLM with some key important information, including the CVE description, the project, the commit message, as well as the file name where the vulnerable code exists, to investigate the performance differences (Task Description 2).

Task Description 1: Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet.

Task Description 2: Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet (with additional information).
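The sketch below shows one way the two assessment prompts could be assembled. The exact wording and field layout are assumptions for illustration; only the kinds of context (project, file name, CVE description, and commit message) follow the experimental setting above.

def assessment_prompt(code, context=None):
    # Task Description 1 uses only the code; Task Description 2 adds key information.
    header = ("Provide a qualitative severity rating of CVSS v2.0 "
              "for the vulnerable C code snippet.")
    parts = [header]
    if context is not None:
        parts += [
            "Project: " + context["project"],
            "File Name: " + context["file_name"],
            "CVE Description: " + context["cve_description"],
            "Commit Message: " + context["commit_message"],
        ]
    parts += ["Code:", code]
    return "\n".join(parts)

# Hypothetical example mirroring the case study in Table 9.
snippet = "static int nl80211_trigger_scan(...) { ... }"
print(assessment_prompt(snippet))  # Task Description 1
print(assessment_prompt(snippet, {
    "project": "Linux",
    "file_name": "net/wireless/nl80211.c",
    "cve_description": "Multiple buffer overflows in net/wireless/nl80211.c ...",
    "commit_message": "nl80211: fix check for valid SSID size in scan operations ...",
}))  # Task Description 2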
Results. Table 8 shows the detailed results of LLMs and six baselines on vulnerability assessment. Based on these experimental results, we can observe a significant improvement in the vulnerability assessment capability of LLMs after fine-tuning. Specifically, accuracy has increased from 0.282-0.424 (few-shot) to a range of 0.759 to 0.860, while precision has improved from 0.296-0.355 to a range of 0.512 to 0.854. This underscores the necessity of fine-tuning in the vulnerability assessment task. Overall, fine-tuned code-related LLMs outperform pre-trained LMs in vulnerability assessment. It is worth noting that DeepSeek-Coder, after fine-tuning, achieves the best performance compared to other LLMs and pre-trained LMs. If researchers need to perform tasks such as vulnerability assessment with an LLM, fine-tuning DeepSeek-Coder is a more efficient choice. We also find that Mistral exhibits a relatively smaller improvement after fine-tuning, which aligns with our expectations, as it is a general LLM.

TABLE 8: The comparison between LLMs and six baselines on software vulnerability assessment (RQ2)

Methods F1-score Recall Precision Accuracy
CodeBERT 0.753 0.730 0.788 0.828
GraphCodeBERT 0.701 0.666 0.772 0.802
UniXcoder 0.745 0.761 0.734 0.817
PLBART 0.735 0.741 0.731 0.789
CodeT5 0.743 0.750 0.741 0.817
CodeT5+ 0.706 0.677 0.755 0.789
Fine-Tuning Setting
DeepSeek-Coder 6.7B 0.814 0.785 0.854 0.860
CodeLlama 7B 0.768 0.749 0.794 0.827
StarCoder 7B 0.671 0.677 0.666 0.764
WizardCoder 7B 0.793 0.778 0.813 0.842
Mistral 0.525 0.539 0.512 0.759
Phi-2 0.747 0.732 0.767 0.802
Few-Shot Setting
DeepSeek-Coder 6.7B 0.229 0.339 0.310 0.262
DeepSeek-Coder 33B 0.290 0.323 0.336 0.335
CodeLlama 7B 0.310 0.331 0.334 0.373
CodeLlama 34B 0.265 0.323 0.327 0.294
StarCoder 7B 0.265 0.342 0.333 0.330
StarCoder 15.5B 0.285 0.315 0.329 0.326
WizardCoder 7B 0.244 0.351 0.336 0.250
WizardCoder 34B 0.306 0.330 0.325 0.379
Mistral 0.283 0.308 0.296 0.424
Phi-2 0.269 0.359 0.355 0.282

Finding-4. Overall, fine-tuned code-related LLMs outperform pre-trained LMs in vulnerability assessment. When resources permit, fine-tuning DeepSeek-Coder 6.7B for vulnerability assessment is optimal, as it outperforms the pre-trained LMs across four metrics.

Case Study. To illustrate the effectiveness of key important information, we present an instance of a vulnerability (CWE-119) in Big-Vul that is exclusively assessed by CodeLlama, as depicted in Table 9. This example is a vulnerability in the Linux project, categorized under CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer). In an initial assessment without critical information, CodeLlama did not fully grasp the severity of this vulnerability and labeled it as "Medium". However, with the provision of crucial details, CodeLlama can more accurately evaluate the risk level of this vulnerability. The CVE description for this vulnerability highlights multiple buffer overflows in the net/wireless/nl80211.c file of the Linux kernel prior to version 2.6.39.2. These vulnerabilities allow local users to gain elevated privileges by leveraging the CAP_NET_ADMIN capability during scan operations with an excessively long SSID value.
TABLE 9: A vulnerable code for CodeLlama to assess with different prompts (RQ2)
Improper Restriction of Operations within the Bounds of a Memory Buffer Vulnerability (CWE-119) in Linux
Task Description 1 Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet.
Input 1 An example of a C code snippet with vulnerabilities (CVE-2011-2517).
Response 1 Severity: Medium
Task Description 2 Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet (with additional
information).
Input 2 Project: Linux
File Name: net/wireless/nl80211.c
CVE Description: Multiple buffer overflows in net/wireless/nl80211.c in the Linux kernel before 2.6.39.2
allow local users to gain privileges by leveraging the CAP_NET_ADMIN capability during scan operations
with a long SSID value.
Commit Message: nl80211: fix check for valid SSID size in scan operations. In both trigger scan and
sched scan operations, we were checking for the SSID length before assigning the value correctly. Since the
memory was just kzalloc’ed, the check was always failing and SSID with over 32 characters were allowed to
go through. This was causing a buffer overflow when copying the actual SSID to the proper place. This bug
has been there since 2.6.29-rc4.
Response 2 Severity: High
Analysis The true Severity is High. After providing additional key information, CodeLlama output for the Severity
changed from Medium to High.
In this scenario, the lack of proper validation of the SSID length leads to buffer overflows, enabling attackers to exploit the vulnerability, escalate privileges, and execute malicious code. The commit message describes that this bug has existed since version 2.6.29-rc4 of the Linux kernel. Given this information, CodeLlama reassesses the risk level of this vulnerability as "High". This is because it allows attackers to escalate privileges and execute malicious code, and it has persisted for a considerable period of time. It is crucial to address and patch this vulnerability promptly by updating the operating system or kernel to ensure security.

To compare the vulnerability assessment capabilities of LLMs after providing key information, we have created a performance comparison bar chart, as shown in Fig. 5. LLMs have limited capacity for assessing vulnerability severity based solely on source code. However, when provided with key important information, most LLMs (i.e., DeepSeek-Coder, CodeLlama, WizardCoder, and Mistral) exhibit significantly improved vulnerability assessment capabilities, particularly in terms of the Accuracy metric. The Accuracy has increased from the range of 0.26-0.42 to the range of 0.27-0.56. StarCoder and Phi-2 show a declining trend, and we believe this may be attributed to the addition of key information, resulting in an increase in the number of input tokens. These LLMs may not excel in handling excessively long text sequences, and we analyze this further in Section 5.2. In contrast, DeepSeek-Coder and Mistral exhibit significant improvements, possibly due to their proficiency in handling long sequential text.

Finding-5. LLMs have the capacity to assess vulnerability severity based on source code, and this capacity can be improved by providing more context information.

4.3 RQ-3: Evaluating Vulnerability Location of LLMs

In this RQ, we first outline how to assess the vulnerability location capabilities of LLMs. Then, we proceed to compare the vulnerability location abilities of LLMs across different settings, both at a general level and in detail, and analyze the reasons behind the observed differences.

Experimental Setting. We select the vulnerable functions with information on vulnerable lines from the testing set for the evaluation and instruct the LLM with the following task description to explore its vulnerability location performance.

Task Description: Provide a vulnerability location result for the vulnerable C code snippet.

For the fine-tuning setting of LLMs and pre-trained LMs, we treat the vulnerability location task as a binary classification problem, determining whether each line of code is vulnerable or not. For the few-shot setting, a specific vulnerable function may contain one or several vulnerable lines, and the LLM may also predict one or several potential vulnerable lines (Lines_predict). We convert Lines_predict into a binary classification format. For example, if a given vulnerable function consists of five lines and contains two vulnerable lines [2, 3], and the LLM predicts one potential vulnerable line [2], we convert this to a binary classification format as [0, 0, 1, 0, 0] for ease of computation. To better evaluate the vulnerability location performance of an LLM on a specific vulnerable function, we consider five widely used performance measures (i.e., Precision, Recall, F1-score, Accuracy, and FPR).
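A minimal sketch of the conversion and line-level metrics described above, assuming 0-based line indices purely for illustration (consistent with the [0, 0, 1, 0, 0] example); the paper's own tooling may index lines differently.

def to_binary(line_indices, num_lines):
    # Convert a set of line indices into a 0/1 vector over all lines of the function.
    marked = set(line_indices)
    return [1 if i in marked else 0 for i in range(num_lines)]

def line_metrics(true_lines, pred_lines, num_lines):
    y_true = to_binary(true_lines, num_lines)
    y_pred = to_binary(pred_lines, num_lines)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / num_lines
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "fpr": fpr}

# The example from the text: a five-line function with vulnerable lines [2, 3]
# and a single predicted line [2].
print(line_metrics([2, 3], [2], 5))

For the example above, this yields a Precision of 1.0, a Recall of 0.5, an F1-score of about 0.667, an Accuracy of 0.8, and an FPR of 0.0.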
In addition to pre-trained LMs, we also consider the following four SOTA baselines: Devign [38], Reveal [47], IVDetect [56], and LineVul [39]. For the graph-based approaches (i.e., Devign, Reveal, and IVDetect), we use GNNExplainer [78], [79] for vulnerability location. We compare the performance of LLMs and these baselines using Top-k Accuracy, as employed in previous works [39], [79].

Results. Table 10 presents the overall performance of vulnerability location between LLMs and seven baselines. Based on this table, we can make the following observations: (1) Fine-tuning can greatly enhance the vulnerability location capabilities of LLMs. For example, after fine-tuning, CodeLlama 7B's F1-score increases from 0.082 to 0.504, recall increases from 0.063 to 0.396, precision increases from 0.116 to 0.691, accuracy increases from 0.882 to 0.919, and FPR decreases from 0.043 to 0.021. (2) Code-related LLMs often outperform pre-trained LMs in terms of F1-