
Multitask-based Evaluation of Open-Source LLM on Software Vulnerability

Xin Yin, Chao Ni⋆, and Shaohua Wang

Both Xin Yin and Chao Ni are with the State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China. Chao Ni is also with Hangzhou High-Tech Zone (Binjiang) Blockchain and Data Security Research Institute, Hangzhou, China. E-mail: {xyin, chaoni}@zju.edu.cn. Shaohua Wang is with Central University of Finance and Economics, China. E-mail: [email protected]. Chao Ni is the corresponding author.

arXiv:2404.02056v3 [cs.SE] 6 Jul 2024

Abstract—This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets. We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks. This evaluation assesses the multi-tasking capabilities of LLMs based on this dataset. We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection. However, in software vulnerability assessment and location, certain LLMs (e.g., CodeLlama and WizardCoder) have demonstrated superior performance compared to pre-trained LMs, and providing more contextual information can enhance the vulnerability assessment capabilities of LLMs. Moreover, LLMs exhibit strong vulnerability description capabilities, but their tendency to produce excessive output significantly weakens their performance compared to pre-trained LMs. Overall, though LLMs perform well in some aspects, they still need improvement in understanding the subtle differences in code vulnerabilities and the ability to describe vulnerabilities to fully realize their potential. Our evaluation pipeline provides valuable insights into the capabilities of LLMs in handling software vulnerabilities.

Index Terms—Software Vulnerability Analysis, Large Language Model.

1 INTRODUCTION

Software Vulnerabilities (SVs) can expose software systems to risk situations and eventually cause huge economic losses or even threaten people's lives. Therefore, addressing software vulnerabilities is an important task for software quality assurance (SQA). Generally, there are many important software quality activities for software vulnerabilities, such as SV detection, SV assessment, SV location, and SV description. The relationship among these SQA activities is intricate and interdependent and is illustrated in Fig. 1. SV detection serves as the initial phase, employing various tools and techniques to identify potential vulnerabilities within the software. Once detected, the focus shifts to SV assessment, where the severity and potential impact of each vulnerability are meticulously evaluated. This critical evaluation informs the subsequent steps in the process. SV location follows the assessment, pinpointing the exact areas within the software's code or architecture where vulnerabilities exist. This step is crucial for precise remediation efforts and to prevent the recurrence of similar vulnerabilities in the future. The intricacies of SV location feed into the comprehensive SV description, which encapsulates detailed information about each vulnerability, including its origin, characteristics, and potential exploits. In essence, the synergy among SV detection, SV assessment, SV location, and SV description creates a robust pipeline for addressing software vulnerabilities comprehensively. This systematic approach not only enhances the overall quality of the software but also fortifies it against potential threats, thereby safeguarding against economic losses and potential harm to individuals. As a cornerstone of software quality assurance, the seamless integration of these activities underscores the importance of a proactive and thorough approach to managing software vulnerabilities in today's dynamic and interconnected digital landscape.

[Figure: a pipeline from Source Code through Vulnerability Detection (RQ-1), Vulnerability Location (RQ-3), and Vulnerability Description (RQ-4) to the LLM Output, with Vulnerability Assessment (RQ-2) branching off after detection.] Fig. 1: The relationship among software vulnerability analysis activities.

Recently, Large Language Models (LLMs) [1] have been widely adopted since the advances in Natural Language Processing (NLP) enable LLMs to be well-trained with both billions of parameters and billions of training samples, consequently bringing a large performance improvement on the tasks they are applied to. LLMs can be easily used for a downstream task by being fine-tuned [2] or being prompted [3], since they are trained to be general and can capture different knowledge from various domain data. Fine-tuning is used to update model parameters for a particular downstream task by iterating the model on a specific dataset, while prompting can be used directly by providing natural language descriptions or a few examples of the downstream task. Compared to prompting, fine-tuning is expensive since it requires additional model training and has limited usage scenarios, especially in cases where sufficient training datasets are unavailable.

LLMs have demonstrated remarkable language comprehension and generation capabilities, and have been able to perform well on a variety of natural language processing tasks, such as text summarization [4]. Given the outstanding performance of LLMs, there is a growing focus on exploring their potential in software engineering tasks and seeking new opportunities to address them. Currently, as more and more LLMs designed for software engineering tasks are deployed [5]–[11], many research works have focused on the application of LLMs in the software engineering domain [12]–[16]. However, in the existing literature, adequate systematic reviews and surveys have been conducted on LLMs in areas such as generating high-quality code and high-coverage test cases [17], [18], but a systematic review and evaluation of open-source LLMs in the field of software vulnerability is still missing.

[Figure: radar charts comparing Detection, Assessment, Location, and Description performance for DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, and Phi-2 at different parameter sizes.] Fig. 2: The capability comparison of LLMs with different parameter sizes on different software vulnerability tasks: (a) LLMs' performance on different software vulnerability tasks (∗ refers to the results under the fine-tuning setting); (b) the impacts of parameter sizes on LLMs' performance across different software vulnerability tasks.

In this paper, we focus on evaluating LLMs' performance in various software vulnerability (SV)-related tasks in few-shot and fine-tuning settings to obtain a basic, comprehensive, and better understanding of their multi-task ability, and we aim to answer the following research questions.

• RQ-1: How do LLMs perform on vulnerability detection? Software Vulnerabilities (SVs) can expose software systems to risk situations and consequently cause software function failure. Therefore, detecting these SVs is an important task for software quality assurance. We aim to explore the ability of LLMs on vulnerability detection as well as the performance difference compared with state-of-the-art approaches and pre-trained Language Models (LMs).

• RQ-2: How do LLMs perform on vulnerability assessment? In practice, due to the limitation of SQA resources [19], it is impossible to treat all detected SVs equally and fix all SVs simultaneously. Thus, it is necessary to prioritize these detected software vulnerabilities for better treatment. An effective solution to prioritize those SVs is to use one of the most widely known SV assessment frameworks, CVSS (Common Vulnerability Scoring System) [20], which characterizes SVs by considering three metric groups: Base, Temporal, and Environmental. The metrics in these groups can be further used as the criterion for selecting serious SVs to fix early. Therefore, we aim to explore the ability of LLMs to assess vulnerabilities and compare their performance with pre-trained LMs (the qualitative CVSS v2.0 severity bands are sketched right after this list).

• RQ-3: How do LLMs perform on vulnerability location? Identifying the precise location of vulnerabilities in software systems is of critical importance for mitigating risks and improving software quality. The vulnerability location task involves pinpointing these weaknesses accurately and helps to narrow the scope for developers to fix problems. Therefore, we aim to investigate LLMs' capability in effectively identifying the precise location of vulnerabilities in software systems, alongside evaluating their performance against state-of-the-art approaches and pre-trained LMs.

• RQ-4: How do LLMs perform on vulnerability description? The vulnerability description task focuses on conveying a detailed explanation of the identified issues in the source code and helps participants to better understand the risk as well as its impacts. Understanding the intricacies of vulnerabilities in software systems plays a pivotal role in alleviating risks and bolstering software quality. Our goal is to evaluate LLMs' ability to effectively generate vulnerability descriptions within software systems and compare their performance with that of pre-trained LMs.
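For readers unfamiliar with the rating scheme referenced in RQ-2, CVSS v2.0 produces a numeric base score from 0.0 to 10.0, and the qualitative labels used throughout this paper (Low, Medium, High) follow the commonly used NVD bands. The helper below is a small illustrative sketch of that mapping; it is not part of the paper's pipeline.

def cvss_v2_qualitative(base_score: float) -> str:
    """Map a CVSS v2.0 base score to the NVD qualitative severity rating."""
    if not 0.0 <= base_score <= 10.0:
        raise ValueError("CVSS v2.0 base scores range from 0.0 to 10.0")
    if base_score <= 3.9:
        return "Low"
    if base_score <= 6.9:
        return "Medium"
    return "High"

print(cvss_v2_qualitative(7.2))  # -> "High"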
To extensively and comprehensively analyze the LLMs' ability, we use a large-scale dataset containing real-world project vulnerabilities (named Big-Vul [21]). We carefully design experiments to discover the findings by answering four RQs. The main contributions of our work are summarized below, and the takeaway findings are shown in Table 1. We also present the comparison of LLMs across the four software vulnerability tasks under different settings, as well as the impact of varying model sizes on performance, as depicted in Fig. 2(a) and Fig. 2(b). In summary, the key contributions of this paper include:
• We extensively evaluate the performance of LLMs on different software vulnerability tasks and conduct an extensive comparison among LLMs and learning-based approaches to software vulnerability.
• We design four RQs to comprehensively understand LLMs from different dimensions, and provide detailed results with examples.
• We release our replication package for further study [22].

2 BACKGROUND AND RELATED WORK

2.1 Large Language Model

Since the advancements in Natural Language Processing, Large Language Models (LLMs) [1] have seen widespread adoption due to their capacity to be effectively trained with billions of parameters and training samples, resulting in significant performance enhancements. LLMs can readily be applied to downstream tasks through either fine-tuning [2] or prompting [3]. Their versatility stems from being trained to possess a broad understanding, enabling them to capture diverse knowledge across various domains. Fine-tuning involves updating the model parameters specifically for a given downstream task through iterative training on a specific dataset. In contrast, prompting allows for direct utilization by providing natural language descriptions or a few examples of the downstream task. Compared to prompting, fine-tuning is resource-intensive as it necessitates additional model training and is applicable in limited scenarios, particularly when adequate training datasets are unavailable.

LLMs are usually built on the transformer architecture [23] and can be classified into three types of architectures: encoder-only, encoder-decoder, and decoder-only. Encoder-only models (e.g., CodeBERT [24], GraphCodeBERT [25], and UniXcoder [26]) and encoder-decoder models (e.g., PLBART [27], CodeT5 [7], and CodeT5+ [8]) are trained using the Masked Language Modeling (MLM) or Masked Span Prediction (MSP) objective, respectively, where a small portion (e.g., 15%) of the tokens is replaced with either masked tokens or masked span tokens, and the models are trained to recover the masked tokens. These models are trained as general ones on code-related data and are then fine-tuned for the downstream tasks to achieve superior performance. Decoder-only models have also attracted a portion of attention; they are trained with the Causal Language Modeling objective to predict the probability of the next token given all previous tokens. GPT [2] and its variants are the most representative models, which bring large language models into practical usage.

Recently, the ChatGPT model has attracted the widest attention worldwide. It is the successor of the large language model InstructGPT [28], with a dialog interface that is fine-tuned using the Reinforcement Learning from Human Feedback (RLHF) approach [28]–[30]. RLHF initially fine-tunes the base model using a small dataset of prompts as input and the desired output, typically human-written, to refine its performance. Subsequently, a reward model is trained on a larger set of prompts by sampling outputs generated by the fine-tuned model. These outputs are then re-ordered by human labelers to provide feedback for training the reward model. Reinforcement learning [31] is then used to calculate rewards for each generated output based on the reward model, updating the LLM parameters accordingly. With fine-tuning and alignment with human preferences, LLMs better understand input prompts and instructions, enhancing performance across various tasks [28], [32].

The application of LLMs in software engineering has seen a surge, with models like ChatGPT being employed for various tasks (e.g., code review, code generation, and vulnerability detection). Although some works use LLMs for vulnerability tasks [33], [34], our work differs from these previous studies in the following aspects. (1) Closed-source ChatGPT vs. Open-source LLMs: they only explore the capabilities of the closed-source ChatGPT in vulnerability tasks, whereas we investigate the abilities of both open-source code-related LLMs and general LLMs in these tasks. (2) Prompts vs. Few-shot and Fine-tuning Settings: they focus solely on the performance of LLMs using prompts, which introduces randomness and hinders the reproducibility of their findings. In contrast, we examine the capabilities of LLMs under both few-shot and fine-tuning settings, providing the source code and corresponding model files to ensure the reproducibility of our experimental results.

2.2 Software Vulnerability

Software Vulnerabilities (SVs) can expose software systems to risk situations and consequently leave the software open to cyber-attacks, eventually causing huge economic losses and even threatening people's lives. Therefore, vulnerability databases have been created to document and analyze publicly known security vulnerabilities. For example, Common Vulnerabilities and Exposures (CVE) [35], [36] and SecurityFocus [37] are two well-known vulnerability databases. Besides, the Common Weakness Enumeration (CWE) defines the common software weaknesses of individual vulnerabilities, which are often referred to as the vulnerability types of CVEs. To better address these vulnerabilities, researchers have proposed many approaches for understanding the effects of software vulnerabilities, including SV detection [38]–[50], SV assessment [20], [51]–[54], SV location [55]–[57], SV repair [58]–[61], as well as SV description [62]–[65]. Many novel technologies are adopted to promote the progress of software vulnerability management, including software analysis [66], [67], machine learning [38], [45], and deep learning [51], [56], especially LLMs [63], [64].

3 EXPERIMENTAL DESIGN

In this section, we present our studied dataset, our studied LLMs, the techniques for fine-tuning, the prompt engineering, the baseline approaches, the evaluation metrics, and the experiment settings.
TABLE 1: Takeaways: Evaluating LLMs on Software Vulnerability

Vulnerability Detection:
1. LLMs can detect vulnerabilities, but fine-tuned LLMs perform weaker than transformer-based approaches. Considering the computational resources and time costs of deploying LLMs, transformer-based approaches for vulnerability detection are a more efficient choice.
2. After fine-tuning, the detection capability of LLMs has improved. Larger models usually perform better, but performance can also be influenced by model design and pre-training data. Therefore, fine-tuning the LLM on domain-specific data before using it as a vulnerability detector is necessary.
3. In general, different LLMs complement each other, while CodeLlama obtains better performance in terms of F1-score, Precision, and Recall.

Vulnerability Assessment:
4. Overall, fine-tuned code-related LLMs outperform pre-trained language models in vulnerability assessment. When resources permit, fine-tuning DeepSeek-Coder 6.7B for vulnerability assessment is optimal, as it outperforms the pre-trained language models across four metrics.
5. LLMs have the capacity to assess vulnerability severity based on source code, and this can be improved by providing more context information.

Vulnerability Location:
6. The few-shot setting exposes LLMs' limitations, and fine-tuning can greatly enhance the vulnerability location capabilities of LLMs.
7. Fine-tuning code-related LLMs as vulnerability locators is beneficial, as they can outperform pre-trained language models in terms of F1-score, precision, and FPR.

Vulnerability Description:
8. LLMs exhibit significantly weaker performance in generating vulnerability descriptions compared to pre-trained language models. Therefore, fine-tuning pre-trained language models for vulnerability description is recommended.

3.1 Studied Dataset

We adopt the widely used dataset (named Big-Vul) provided by Fan et al. [21] for the following reasons. The most important one is to satisfy the distinct characteristics of the real world as well as the diversity in the dataset, which is suggested by previous works [45], [47]. Big-Vul, to the best of our knowledge, is the largest-scale vulnerability dataset with diverse information about the vulnerabilities, which are collected from practical projects, and these vulnerabilities are recorded in the Common Vulnerabilities and Exposures (CVE)¹. The second one is to compare fairly with existing state-of-the-art (SOTA) approaches (e.g., LineVul, Devign, and SVulD).

Big-Vul contains 3,754 code vulnerabilities collected from 348 open-source projects spanning 91 different vulnerability types from 2002 to 2019. It has 188,636 C/C++ functions with a vulnerable ratio of 5.7% (i.e., 10,900 vulnerable functions). The authors linked the code changes with CVEs and their descriptive information to enable a deeper analysis of the vulnerabilities.

We follow the same strategy as previous work [39], [48], [58] to build the training data, validation data, and testing data from the original dataset. Specifically, 80% of the functions are treated as training data, 10% as validation data, and the remaining 10% as testing data. We also keep the same distribution as the original dataset in the training, validation, and testing data. Notice that we undersample the non-vulnerable functions to produce approximately balanced training data at the function level, while the validation and testing data remain at the original imbalanced ratio. To clean and normalize the dataset, we remove empty lines, leading and trailing spaces in each line, as well as comments from the source code. Finally, the split dataset is used for evaluation and the statistics are shown in Table 2.

1. https://cve.mitre.org/

TABLE 2: Statistics of the studied dataset

Datasets           # Vul.    # Non-Vul.   # Total    Ratio (Vul. : Non-Vul.)
Original Big-Vul   10,900    177,736      188,636    0.061
Filtered Big-Vul    5,260     96,308      101,568    0.055
Training            8,720      8,720       17,440    1
Validation          1,090     17,774       18,864    0.061
Testing             1,090     17,774       18,864    0.061
∗ We undersample the non-vulnerable functions to produce approximately balanced training data.

3.2 Studied LLMs

The general LLMs are pre-trained on textual data, including natural language and code, and can be used for a variety of tasks. In contrast, code-related LLMs are specifically pre-trained to automate code-related tasks. Due to the empirical nature of this work, we are interested in assessing the effectiveness of both LLM categories on vulnerability tasks. For the code-related LLMs, we select the top four models released recently (in 2023), namely DeepSeek-Coder [9], CodeLlama [11], StarCoder [10], and WizardCoder [68]. For the general LLMs, we select the top two models, resulting in the selection of Mistral [69] and Phi-2 [70]. For the few-shot setting, we select the models with no more than 34B parameters from the Hugging Face Open LLM Leaderboard [71]; for the fine-tuning setting, we select the models with 7B parameters or less. The constraint on the number of parameters is imposed by our computing resources (i.e., 192GB RAM, 10 × NVIDIA RTX 3090 GPUs). Table 3 summarizes the characteristics of the studied LLMs; we briefly introduce these LLMs to make our paper self-contained.
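As a concrete illustration of how the studied checkpoints can be driven in the few-shot setting, the sketch below loads one of them through the Hugging Face libraries (cf. Section 3.7) and generates a response for a single prompt. The checkpoint id and generation parameters are illustrative assumptions, not the exact configuration used in the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the paper studies DeepSeek-Coder, CodeLlama, StarCoder,
# WizardCoder, Mistral, and Phi-2 in comparable sizes.
MODEL_NAME = "deepseek-ai/deepseek-coder-6.7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto")

def few_shot_answer(prompt: str, max_new_tokens: int = 64) -> str:
    """Generate a completion for a detection/assessment/location/description prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)
    # Return only the newly generated tokens, stripping the echoed prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

In this sketch, greedy decoding (do_sample=False) is chosen to keep outputs deterministic, which mirrors the paper's emphasis on reproducibility of the few-shot results.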
TABLE 3: Overview of the studied LLMs

Models         DeepSeek-Coder   CodeLlama   StarCoder   WizardCoder   Mistral   Phi-2
Category       Code-related     Code-related Code-related Code-related General  General
Fine-Tuning    6.7B             7B          7B          7B            7B        2.7B
Few-Shot       6.7B & 33B       7B & 34B    7B & 34B    7B & 15.5B    7B        2.7B
Release Date   Nov'23           Aug'23      May'23      June'23       Sep'23    Dec'23

Group 1: Code-related LLMs. DeepSeek-Coder, developed by DeepSeek AI [9], is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. They provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, DeepSeek-Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

CodeLlama, proposed by Rozière et al. [11], is a set of large pre-trained language models for code built on Llama 2. They achieve state-of-the-art performance among open models on code tasks, provide infilling capabilities, support large input contexts, and demonstrate zero-shot instruction following for programming problems. CodeLlama is created by further training Llama 2 with increased sampling of code data. As with Llama 2, the authors applied extensive safety mitigations to the fine-tuned CodeLlama versions.

StarCoder, proposed by Li et al. [10], is a large pre-trained language model specifically designed for code. It was pre-trained on a large amount of code data to acquire programming knowledge and trained on permissive data from GitHub, including over 80 programming languages, Git commits, GitHub issues, and Jupyter notebooks. StarCoder can perform code editing tasks, understand natural language prompts, and generate code that conforms to APIs. StarCoder represents the advancement of applying large language models in programming.

WizardCoder, proposed by Luo et al. [68], is a large pre-trained language model that empowers Code LLMs with complex instruction fine-tuning by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, the authors unveil the exceptional capabilities of their model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, WizardCoder even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+.

Group 2: General LLMs. Mistral is a 7-billion-parameter language model released by Mistral AI [69]. Mistral 7B is a carefully designed language model that provides both efficiency and high performance to enable real-world applications. Due to its efficiency improvements, the model is suitable for real-time applications where quick responses are essential. At the time of its release, Mistral 7B outperformed the best open-source 13B model (Llama 2) on all evaluated benchmarks.

Phi-2, proposed by Microsoft [70], is packed with 2.7 billion parameters. It is designed to make machines think more like humans and do so safely. Phi-2 is not just about numbers; it is about a smarter, safer way for computers to understand and interact with the world. Phi-2 stands out because it has been taught with a mix of new language data and careful checks to make sure it acts right. It is built to do many things like writing, summarizing texts, and coding, but with better common sense and understanding than its earlier version, Phi-1.5. Phi-2's evaluation demonstrates its proficiency over larger models in aggregated benchmarks, emphasizing the potential of smaller models to achieve comparable or superior performance to their larger counterparts. This is particularly evident in its comparison with Google Gemini Nano 2, where Phi-2 outshines despite its smaller size.

3.3 Model Fine-Tuning

The four software vulnerability tasks can be categorized into two types: discriminative tasks (i.e., software vulnerability detection, software vulnerability assessment, and software vulnerability location) and a generative task (i.e., software vulnerability description). Therefore, fine-tuning LLMs for software vulnerability tasks can be undertaken through both discriminative and generative methods, each specifically designed to align LLMs with the task. In particular, we treat the discriminative tasks as binary classification, while treating the generative task as a generation one. The architectures for the two paradigms are presented in Fig. 3.

[Figure: (a) Discriminative Fine-Tuning attaches a classifier to the encoder/decoder outputs of the model and predicts 0/1 labels for the input function; (b) Generative Fine-Tuning feeds the same function through the encoder/decoder and decodes a CVE description (e.g., "CVE description: A remote code ...").] Fig. 3: Fine-tuning LLMs for software vulnerability tasks.

Discriminative Fine-Tuning. For vulnerability detection and vulnerability assessment, we utilize the "AutoModelForSequenceClassification" class provided by the Transformers library to implement discriminative fine-tuning. "AutoModelForSequenceClassification" is a generic model class that will be instantiated as one of the sequence classification model classes of the library when created with the "AutoModelForSequenceClassification.from_pretrained(model_name_or_path)" class method. For vulnerability location, we follow previous works [72], [73] that use LLMs to classify individual code lines as either vulnerable or non-vulnerable. For a token sequence T = {t_1, t_2, ..., t_n} of the function, the model's decoder component, denoted as M, processes T to yield a sequence of output vectors: O = M(T) = {o_1, o_2, ..., o_L}, where O represents the output tensor with dimensions L × H, L signifies the sequence length, and H denotes the hidden dimension size. During this process, the contextual information is captured by the masked self-attention mechanisms in the decoder of the LLMs, where masked self-attention limits the sight to the preceding tokens. Each output vector o_i that represents the last token of one line is subsequently associated with a label (i.e., 0 or 1). The optimization process employs binary cross-entropy as the loss function.
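To make the discriminative setup concrete, the following is a minimal sketch of function-level fine-tuning with the class named above, assuming a Hugging Face checkpoint name and a pre-tokenized batch (both illustrative; this is not the authors' exact training script). The generative variant for vulnerability description would instead load a causal-LM head and compute token-level cross-entropy against the reference CVE description.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint name; any of the studied 7B-or-smaller LLMs could be used here.
MODEL_NAME = "deepseek-ai/deepseek-coder-6.7b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=2 gives the binary head used for detection (vulnerable / non-vulnerable).
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Decoder-only models often have no pad token; reuse EOS so that batching works.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.eos_token_id

def training_step(functions, labels, optimizer):
    """One fine-tuning step on a batch of C/C++ functions with 0/1 labels."""
    batch = tokenizer(functions, truncation=True, padding=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()          # cross-entropy over the two classes
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

For the line-level location variant described above, the same backbone is kept, but the hidden state of the last token of each line is passed through a binary head and trained with binary cross-entropy instead of a single function-level label.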
TABLE 4: The task descriptions and indicators for different software vulnerability tasks

Dimension                   Task Description                                                                        Indicator
Vulnerability Detection     If this C code snippet has vulnerabilities, output Yes; otherwise, output No.           // Detection
Vulnerability Assessment    Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet.   // Assessment
Vulnerability Location      Provide a vulnerability location result for the vulnerable C code snippet.              // Location
Vulnerability Description   Provide a CVE description for the vulnerable C code snippet.                            // Description

Generative Fine-Tuning. Generative fine-tuning aims to equip LLMs with the ability to perform Sequence-to-Sequence (Seq2Seq) tasks. Specifically, this involves inputting vulnerable code and generating the corresponding CVE descriptions related to the vulnerabilities. To calculate the loss during fine-tuning, we utilize the cross-entropy loss function, which is commonly used for Seq2Seq tasks. In this context, the loss measures the difference between the generated output sequence and the target sequence.

3.4 Prompt Engineering

For the few-shot setting, we follow prompts similar to those used in the artifacts, papers, or technical reports associated with each corresponding model [5], [10], [11], where each prompt contains three pieces of information: (1) task description, (2) source code, and (3) indicator. Using the software vulnerability detection task as an example, the prompt utilized for the LLM consists of three crucial components, as depicted in Fig. 4:
• Task Description (marked as ①). We provide the LLM with the description constructed as ''If this C code snippet has vulnerabilities, output Yes; otherwise, output No''. The task description used in the SV detection task varies based on the source programming language we employ.
• Source Code (marked as ②). We provide the LLM with the code wrapped in ''// Code Start'' and ''// Code End''. Since we illustrate an example in C, we use the C comment format ''//'' as a prefix for the description. We also employ different comment prefixes based on the programming language of the code.
• Indicator (marked as ③). We instruct the LLM to think about the results. In this paper, we follow the best practice in previous work [12] and adopt the same prompt, named ''// Detection''.

Prompt
① Task Description:
If this C code snippet has vulnerabilities, output Yes; otherwise, output No.
② Source Code:
// Code Start
void SendStatus(struct mg_connection* connection, const struct
    mg_request_info* request_info, void* user_data) {
  std::string response = "HTTP/1.1 200 OK\r\n"
      "Content-Length:2\r\n\r\n"
      "ok";
  mg_write(connection, response.data(),
      response.length());
}
// Code End
③ Indicator:
// Detection

Fig. 4: The prompt contains three pieces of information: (1) task description, (2) source code, and (3) indicator.

Depending on the specific software vulnerability task, the task descriptions and indicators in the prompts may vary. The task descriptions and indicators for the different software vulnerability tasks are presented in Table 4.

3.5 Baselines

To comprehensively compare the performance of LLMs with existing approaches, in this study, we consider various pre-trained Language Models (LMs). As shown in Table 5, these models have fewer than 220 million parameters and can be categorized into two categories: encoder-only LMs and encoder-decoder LMs. Encoder-only LMs (i.e., CodeBERT [24], GraphCodeBERT [25], and UniXcoder [26]) contain only the encoder component of a Transformer. They are designed for learning data representations and trained using the Masked Language Modeling (MLM) objective. Encoder-decoder LMs (i.e., PLBART [27], CodeT5 [7], and CodeT5+ [8]) have been proposed for sequence-to-sequence tasks. They are trained to recover the correct output sequence given the original input, often through span prediction tasks where random spans are replaced with artificial tokens. Recently, researchers have combined MLM with generative models for bidirectional and autoregressive text generation or infilling [74]. All these LMs can potentially be used for our tasks, so we evaluate them.

TABLE 5: Overview of the studied LMs

Models          # Para.   Model Type           Models    # Para.   Model Type
CodeBERT        125M      Encoder-only LM      PLBART    140M      Encoder-decoder LM
GraphCodeBERT   125M      Encoder-only LM      CodeT5    220M      Encoder-decoder LM
UniXcoder       125M      Encoder-only LM      CodeT5+   220M      Encoder-decoder LM
∗ For UniXcoder, we use encoder-only mode.

For vulnerability location, we also consider Devign [38], Reveal [47], IVDetect [56], and LineVul [39] as baselines. In addressing vulnerability detection, we also include SVulD [48] in addition to the aforementioned approaches. We briefly introduce them as follows.

Devign, proposed by Zhou et al. [38], is a general graph neural network-based model for graph-level classification through learning on a rich set of code semantic representations including AST, CFG, DFG, and code sequences. It uses a novel Conv module to efficiently extract useful features from the learned rich node representations for graph-level classification.
Reveal, proposed by Chakraborty et al. [47], contains two main phases. In the feature extraction phase, it translates code into a graph embedding, and in the training phase, it trains a representation learner on the extracted features to obtain a model that can distinguish vulnerable functions from non-vulnerable ones.

IVDetect, proposed by Li et al. [56], contains a coarse-grained vulnerability detection component and a fine-grained interpretation component. In particular, IVDetect represents source code in the form of a program dependence graph (PDG) and treats the vulnerability detection problem as graph-based classification via a graph convolution network with feature attention. As for interpretation, IVDetect adopts a GNNExplainer to provide fine-grained interpretations that include the sub-graph in the PDG with crucial statements that are relevant to the detected vulnerability.

LineVul, proposed by Fu et al. [39], is a Transformer-based line-level vulnerability prediction approach. LineVul leverages the BERT architecture with self-attention layers which can capture long-term dependencies within a long sequence. Besides, benefiting from the large-scale pre-trained model, LineVul can intrinsically capture more lexical and logical semantics for the given code input. Moreover, LineVul adopts the attention mechanism of the BERT architecture to locate the vulnerable lines for finer-grained detection.

SVulD, proposed by Ni et al. [48], is a function-level subtle semantic embedding approach for vulnerability detection along with heuristic explanations. Particularly, SVulD adopts contrastive learning to train the UniXcoder semantic embedding model for learning distinguishing semantic representations of functions regardless of their lexically similar information.

3.6 Evaluation Metrics

For the considered software vulnerability-related tasks, we perform evaluations using widely adopted performance metrics. More precisely, to evaluate the effectiveness of LLMs on vulnerability detection and vulnerability assessment, we consider the following four metrics: F1-score, Recall, Precision, and Accuracy. Additionally, for vulnerability location, besides the four aforementioned metrics, we also consider the Top-k Accuracy and FPR metrics. For vulnerability description, we use the Rouge-1, Rouge-2, and Rouge-L metrics.

3.7 Implementation

We develop the generation pipeline in Python, utilizing PyTorch [75] implementations of DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, and Phi-2. We use Huggingface [76] to load the model weights and generate outputs. We also adhere to the best-practice guide [77] for each prompt. For the fine-tuning setting, we select the models with 7B parameters or less, and for the few-shot setting, we use models with no more than 34B parameters. To directly compare the fine-tuning setting with the few-shot setting, we employ models with the same parameter sizes in both settings (i.e., DeepSeek-Coder 6.7B, CodeLlama 7B, StarCoder 7B, WizardCoder 7B, Mistral 7B, and Phi-2 2.7B). The constraint on the number of parameters is imposed by our computing resources. Table 3 summarizes the characteristics of the studied LLMs. Furthermore, considering the limited context windows of the LLMs, we manually select three examples for the few-shot setting from the training data. Regarding baselines (i.e., pre-trained LMs, Reveal, IVDetect, Devign, LineVul, and SVulD), we utilize their publicly available source code and perform fine-tuning with the default parameters provided in their original code. Considering that Devign's code is not publicly available, we make every effort to replicate its functionality and achieve similar results on the original paper's dataset. All these models are implemented using the PyTorch [75] framework. The evaluation is conducted on a 16-core workstation equipped with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 192GB RAM, and 10 × NVIDIA RTX 3090 GPUs, running Ubuntu 20.04.1 LTS.

4 EXPERIMENTAL RESULTS

This section presents the experimental results by evaluating LLMs' performance on the widely used comprehensive dataset (i.e., Big-Vul [21]) covering four SV-related tasks.

TABLE 6: The comparison between LLMs and eleven baselines on software vulnerability detection (RQ1)

Methods                 F1-score   Recall   Precision   Accuracy
Devign                  0.200      0.660    0.118       0.726
Reveal                  0.232      0.354    0.172       0.811
IVDetect                0.231      0.540    0.148       0.815
LineVul                 0.272      0.620    0.174       0.828
SVulD                   0.336      0.414    0.282       0.915
CodeBERT                0.270      0.608    0.173       0.830
GraphCodeBERT           0.246      0.721    0.148       0.771
UniXcoder               0.256      0.787    0.153       0.764
PLBART                  0.255      0.692    0.157       0.791
CodeT5                  0.237      0.759    0.141       0.748
CodeT5+                 0.218      0.508    0.139       0.812
Fine-Tuning Setting
DeepSeek-Coder 6.7B     0.270      0.627    0.172       0.824
CodeLlama 7B            0.259      0.806    0.154       0.761
StarCoder 7B            0.220      0.607    0.135       0.778
WizardCoder 7B          0.214      0.365    0.151       0.861
Mistral                 0.220      0.607    0.135       0.778
Phi-2                   0.241      0.557    0.154       0.818
Few-Shot Setting
DeepSeek-Coder 6.7B     0.084      0.156    0.057       0.823
DeepSeek-Coder 33B      0.107      0.688    0.058       0.404
CodeLlama 7B            0.098      0.449    0.055       0.570
CodeLlama 34B           0.117      0.281    0.074       0.781
StarCoder 7B            0.094      0.443    0.053       0.560
StarCoder 15.5B         0.097      0.557    0.053       0.463
WizardCoder 7B          0.086      0.380    0.049       0.583
WizardCoder 34B         0.128      0.559    0.072       0.607
Mistral                 0.126      0.401    0.074       0.711
Phi-2                   0.099      0.563    0.054       0.471

4.1 RQ-1: Evaluating Vulnerability Detection of LLMs

In this RQ, we first investigate the vulnerability detection performance of LLMs and make a comparison with the existing state-of-the-art (SOTA) approaches. Then, we conduct a more detailed analysis of the results, comparing the detection performance of LLMs on the Top-10 CWE types.

Experimental Setting. We instruct LLMs with the following task description to tell them to act as a vulnerability detector.
TABLE 7: The software vulnerability detection comparison on Top-10 CWEs among fine-tuned LLMs (RQ1)
(Model order in each metric group: DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2)

CWE Type (# Total, # Vul.)   F1-score                                    Precision
CWE-119 (1549, 128)          0.321 0.309 0.316 0.269 0.258 0.281         0.223 0.197 0.212 0.215 0.181 0.197
CWE-20  (1082,  80)          0.269 0.273 0.229 0.216 0.145 0.269         0.173 0.163 0.141 0.165 0.096 0.175
CWE-264 ( 800,  64)          0.486 0.468 0.337 0.348 0.356 0.477         0.357 0.316 0.232 0.308 0.269 0.361
CWE-399 ( 697,  35)          0.355 0.286 0.209 0.274 0.227 0.306         0.227 0.169 0.125 0.191 0.143 0.196
CWE-125 ( 582,  29)          0.233 0.267 0.213 0.195 0.179 0.180         0.145 0.156 0.129 0.128 0.108 0.109
CWE-200 ( 573,  27)          0.269 0.261 0.241 0.180 0.162 0.229         0.182 0.159 0.151 0.132 0.106 0.152
CWE-189 ( 442,  21)          0.235 0.208 0.255 0.273 0.178 0.293         0.145 0.119 0.151 0.180 0.108 0.182
CWE-362 ( 413,  16)          0.031 0.086 0.075 0.050 0.026 0.032         0.017 0.045 0.040 0.029 0.014 0.018
CWE-416 ( 406,  12)          0.193 0.178 0.148 0.145 0.146 0.141         0.113 0.101 0.083 0.093 0.086 0.082
CWE-476 ( 367,  11)          0.091 0.109 0.053 0.057 0.037 0.019         0.049 0.057 0.028 0.032 0.020 0.010

CWE Type (# Total, # Vul.)   Recall                                      Accuracy
CWE-119 (1549, 128)          0.570 0.719 0.625 0.359 0.453 0.492         0.801 0.735 0.777 0.839 0.785 0.792
CWE-20  (1082,  80)          0.609 0.844 0.609 0.313 0.297 0.578         0.804 0.735 0.758 0.866 0.793 0.814
CWE-264 ( 800,  64)          0.763 0.900 0.613 0.400 0.525 0.700         0.839 0.795 0.759 0.850 0.810 0.846
CWE-399 ( 697,  35)          0.815 0.926 0.630 0.481 0.556 0.704         0.885 0.821 0.815 0.901 0.854 0.877
CWE-125 ( 582,  29)          0.586 0.931 0.621 0.414 0.517 0.517         0.808 0.746 0.771 0.830 0.763 0.765
CWE-200 ( 573,  27)          0.514 0.743 0.600 0.286 0.343 0.457         0.829 0.743 0.770 0.841 0.784 0.812
CWE-189 ( 442,  21)          0.625 0.813 0.813 0.563 0.500 0.750         0.853 0.776 0.828 0.891 0.833 0.869
CWE-362 ( 413,  16)          0.200 0.800 0.600 0.200 0.200 0.200         0.847 0.794 0.821 0.908 0.821 0.855
CWE-416 ( 406,  12)          0.667 0.750 0.667 0.333 0.500 0.500         0.835 0.796 0.773 0.884 0.828 0.820
CWE-476 ( 367,  11)          0.571 1.000 0.429 0.286 0.286 0.143         0.782 0.687 0.706 0.820 0.719 0.725

Task Description: If this C code snippet has vulnerabilities, output Yes; otherwise, output No.

In addition to pre-trained LMs, we also consider the following five SOTA baselines: Devign [38], Reveal [47], IVDetect [56], LineVul [39], and SVulD [48]. These baselines can be divided into two groups: graph-based (i.e., Devign, Reveal, and IVDetect) and transformer-based (i.e., pre-trained LMs, LineVul, and SVulD). Besides, in order to comprehensively compare the performance among baselines and LLMs, we consider four widely used performance measures (i.e., Precision, Recall, F1-score, and Accuracy) and conduct experiments on the popular dataset. Since graph-based approaches need to obtain the structure information (e.g., control flow graph (CFG), data flow graph (DFG)) of the studied functions, we adopt the same toolkit (Joern) to transform functions. Functions are dropped directly if they cannot be transformed by Joern successfully. Finally, the filtered dataset (shown in Table 2) is used for evaluation. We follow the same strategy as previous work [39], [58] to build the training data, validation data, and testing data from the original dataset. Specifically, 80% of the functions are treated as training data, 10% as validation data, and the remaining 10% as testing data. We also keep the same distribution as the original dataset in the training, validation, and testing data. We undersample the non-vulnerable functions to produce approximately balanced training data at the function level, while the validation and testing data remain at the original imbalanced ratio. Apart from presenting the overall performance comparison, we also give the detailed performance of LLMs on the Top-10 CWE types for a better analysis.

Results. [A] LLMs vs. Baselines. Table 6 shows the overall performance measures between LLMs and eleven baselines, with the best performances highlighted in bold. According to the results in Table 6, we can obtain the following observations:

(1) Fine-tuned LLMs have poor performance compared with transformer-based approaches when considering F1-score, Precision, and Accuracy. In particular, SVulD obtains 0.336, 0.282, and 0.915 in terms of F1-score, Precision, and Accuracy, which surpass the fine-tuned LLMs by 24.4% to 57.0%, 64.0% to 108.9%, and 6.3% to 20.2% in terms of F1-score, Precision, and Accuracy, respectively. Notably, the F1-score of LineVul is significantly lower (0.272) than that reported in the original paper (0.910). We further analyze this discrepancy in Section 5.1.

(2) The performance of fine-tuned LLMs is comparable to graph-based approaches. For example, in terms of F1-score, fine-tuned LLMs achieve a range of 0.214 to 0.270. In comparison, graph-based approaches achieve a range of 0.200 to 0.232.

(3) LLMs under the few-shot setting have poor performance compared with baselines. LLMs ranging from 2.7B to 34B parameters perform less favorably than baselines in terms of F1-score and Precision. However, as for Accuracy, SVulD (transformer-based) obtains the best performance (0.915), and DeepSeek-Coder 6.7B under the few-shot setting achieves a performance of 0.823, which is better than the three graph-based approaches.

Finding-1. LLMs can detect vulnerabilities, but fine-tuned LLMs perform weaker than transformer-based approaches. Considering the computational resources and time costs of deploying LLMs, transformer-based approaches for vulnerability detection are a more efficient choice.

[B] Fine-Tuning vs. Few-Shot. The experimental results are presented in Table 6. Based on these experimental findings, we can draw the following observations: (1) LLMs fine-tuned for vulnerability detection demonstrate superior performance on the task compared to LLMs in the few-shot setting. The average F1-score and average Precision have doubled, while the average Recall has also shown improvement. (2) LLMs with more parameters typically exhibit better performance. For example, CodeLlama 34B improves upon CodeLlama 7B by 19.4%, 34.5%, and 37.0% in terms of F1-score, Precision, and Accuracy, respectively. However, different LLMs may exhibit performance variations due to differences in model design and the quality of pre-training data. (3) Phi-2 achieves performance approximating that of other LLMs with 7 billion parameters, even with a parameter size of 2.7 billion. This may be attributed to the higher quality of its pre-training data.
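The detection metrics discussed in this RQ (F1-score, Recall, Precision, Accuracy, plus the FPR used later for location) can be computed directly from binary predictions. The snippet below is a small illustrative helper using scikit-learn rather than the authors' evaluation code.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def detection_metrics(y_true, y_pred):
    """Compute the metrics reported in Tables 6-10 for binary labels (1 = vulnerable)."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "F1-score": f1,
        "Recall": recall,
        "Precision": precision,
        "Accuracy": accuracy_score(y_true, y_pred),
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,  # false positive rate
    }

# Toy example: one missed vulnerability and one false alarm.
print(detection_metrics([1, 0, 0, 1, 0], [1, 0, 1, 0, 0]))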
Finding-2. After fine-tuning, the detection capability of LLMs from 0.512 to 0.854. This underscores the necessity of fine-
has improved. Larger models usually perform better, but perfor- tuning in vulnerability assessment task. Overall, fine-tuned
mance can also be influenced by model design and pre-training code-related LLMs outperform pre-trained LMs in vulner-
data. Therefore, fine-tuning the LLM on domain-specific data ability assessment. It is worth noting that DeepSeek-Coder,
before using it as a vulnerability detector is necessary. after fine-tuning, achieves the best performance compared
to other LLMs and pre-trained LMs. If researchers need to
perform tasks such as vulnerability assessment with LLM,
[C] The comparisons of Top-10 CWE types between
fine-tuning DeepSeek-Coder is a more efficient choice. We
LLMs. Table 7 shows the detailed comparisons of Top-
also find that Mistral exhibits a relatively smaller improve-
10 CWE types between fine-tuned LLMs. In this table,
ment after fine-tuning, which aligns with our expectations,
we highlight the best performance for each performance
as it is a general LLM.
metric in bold. According to the results, we can achieve
the following observations: (1) In most cases, CodeLlama
TABLE 8: The comparison between LLMs and six baselines
obtains better performance than other LLMs in terms of F1-
on software vulnerability assessment (RQ2)
score, Precision, and Recall. Different LLMs have certain
advantages in different CWE types, complementing each Methods F1-score Recall Precision Accuracy
other. (2) Considering the performance of F1-score, Preci-
CodeBERT 0.753 0.730 0.788 0.828
sion, and Recall, CodeLlama achieves the best performances GraphCodeBERT 0.701 0.666 0.772 0.802
on CWE-125 (“Out-of-bounds Read”), CWE-362 (“Concurrent UniXcoder 0.745 0.761 0.734 0.817
Execution using Shared Resource with Improper Synchronization PLBART 0.735 0.741 0.731 0.789
CodeT5 0.743 0.750 0.741 0.817
(’Race Condition’)”), and CWE-476 (“NULL Pointer Derefer- CodeT5+ 0.706 0.677 0.755 0.789
ence”), which indicates CodeLlama is exceptionally skilled at
Fine-Tuning Setting
detecting and mitigating vulnerabilities related to memory DeepSeek-Coder 6.7B 0.814 0.785 0.854 0.860
handling and synchronization issues. CodeLlama 7B 0.768 0.749 0.794 0.827
StarCoder 7B 0.671 0.677 0.666 0.764
Finding-3. In general, different LLMs complementing each WizardCoder 7B 0.793 0.778 0.813 0.842
Mistral 0.525 0.539 0.512 0.759
other, while CodeLlama obtains better performance in terms of Phi-2 0.747 0.732 0.767 0.802
F1-score, Precision, and Recall. Few-Shot Setting
DeepSeek-Coder 6.7B 0.229 0.339 0.310 0.262
DeepSeek-Coder 33B 0.290 0.323 0.336 0.335
4.2 RQ-2: Evaluating Vulnerability Assessment of CodeLlama 7B 0.310 0.331 0.334 0.373
CodeLlama 34B 0.265 0.323 0.327 0.294
LLMs StarCoder 7B 0.265 0.342 0.333 0.330
StarCoder 15.5B 0.285 0.315 0.329 0.326
In this RQ, we delineate two task descriptions for vulner- WizardCoder 7B 0.244 0.351 0.336 0.250
ability assessment: (1) code-based and (2) code-based with WizardCoder 34B 0.306 0.330 0.325 0.379
additional key information. We compare the performance of Mistral 0.283 0.308 0.296 0.424
Phi-2 0.269 0.359 0.355 0.282
LLMs in both task descriptions for vulnerability assessment
and concurrently conduct a case study to illustrate the
effectiveness of incorporating key important information.
Experimental Setting. We instruct LLM with the following Finding-4. Overall, fine-tuned code-related LLMs outperform
task descriptions (i.e., Task Description 1 and Task De- pre-trained LMs in vulnerability assessment. When resources
scription 2) to tell it to act as a vulnerability assessor. We permit, fine-tuning DeepSeek-Coder 6.7B for vulnerability
first provide LLM with the vulnerable codes to explore its assessment is optimal, as it outperforms the pre-trained LMs
performance (Task Description 1). Moreover, we provide across four metrics.
LLM with some key important information, including the
CVE description, the project, the commit message as well as Case Study. To illustrate the effectiveness of key impor-
the file name when the vulnerable code exists to investigate tant information, we present an instance of a vulnerability
the performance differences (Task Description 2). (CWE-119) in Big-Vul that is exclusively assess by CodeL-
lama, as depicted in Table 9. This example is a vulnerability
Task Description 1: Provide a qualitative severity rating of in the Linux project, categorized under CWE-119 (Improper
CVSS v2.0 for the vulnerable C code snippet. Restriction of Operations within the Bounds of a Memory
Task Description 2: Provide a qualitative severity rating of Buffer Vulnerability). In an initial assessment without criti-
CVSS v2.0 for the vulnerable C code snippet (with additional cal information, CodeLlama did not fully grasp the severity
information). of this vulnerability and labeled it as “Medium”. However,
with the provision of crucial details, CodeLlama can more
Results. Table 8 shows the detailed results of LLMs and accurately evaluate the risk level of this vulnerability. The
six baselines on vulnerability assessment. Based on these CVE description for this vulnerability highlights multiple
experimental results, we can observe a significant improve- buffer overflows in the net/wireless/nl80211.c file of the
ment in the vulnerability assessment capability of LLMs Linux kernel prior to version 2.6.39.2. These vulnerabilities
after fine-tuning. Specifically, the accuracy has increased allow local users to gain elevated privileges by leveraging
from 0.282 to 0.424, reaching a range of 0.759 to 0.860, while the CAP NET ADMIN capability during scan operations
precision has improved from 0.296 to 0.355, now ranging with an excessively long SSID value. In this scenario, the
10

TABLE 9: A vulnerable code for CodeLlama to assess with different prompts (RQ2)

Improper Restriction of Operations within the Bounds of a Memory Buffer Vulnerability (CWE-119) in Linux
Task Description 1 Provide a qualitative severity ratings of CVSS v2.0 for the vulnerable C code snippet.
Input 1 An example of a C code snippet with vulnerabilities (CVE-2011-2517).
Response 1 Severity: Medium
Task Description 2 Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet (with additional
information).
Input 2 Project: Linux
File Name: net/wireless/nl80211.c
CVE Description: Multiple buffer overflows in net/wireless/nl80211.c in the Linux kernel before 2.6.39.2
allow local users to gain privileges by leveraging the CAP NET ADMIN capability during scan operations
with a long SSID value.
Commit Message: nl80211: fix check for valid SSID size in scan operations. In both trigger scan and
sched scan operations, we were checking for the SSID length before assigning the value correctly. Since the
memory was just kzalloc’ed, the check was always failing and SSID with over 32 characters were allowed to
go through. This was causing a buffer overflow when copying the actual SSID to the proper place. This bug
has been there since 2.6.29-rc4.
Response 2 Severity: High
Analysis The true Severity is High. After providing additional key information, CodeLlama output for the Severity
changed from Medium to High.

lack of proper validation of the SSID length leads to buffer overflows, enabling attackers to exploit the vulnerability, escalate privileges, and execute malicious code. The commit message notes that this bug has existed since version 2.6.29-rc4 of the Linux kernel. Given this information, CodeLlama reassesses the risk level of this vulnerability as "High". This is because it allows attackers to escalate privileges and execute malicious code, and it has persisted for a considerable period of time. It is crucial to address and patch this vulnerability promptly by updating the operating system or kernel to ensure security.

To compare the vulnerability assessment capabilities of LLMs after providing key information, we have created a performance comparison bar chart, as shown in Fig. 5. LLMs have limited capacity for assessing vulnerability severity based solely on source code. However, when provided with key information, most LLMs (i.e., DeepSeek-Coder, CodeLlama, WizardCoder, and Mistral) exhibit significantly improved vulnerability assessment capabilities, particularly in terms of the Accuracy metric: Accuracy increases from the 0.26-0.42 range to the 0.27-0.56 range. StarCoder and Phi-2 show a declining trend, and we believe this may be attributed to the added key information, which increases the number of input tokens. These LLMs may not excel in handling excessively long text sequences, and we analyze this further in Section 5.2. In contrast, DeepSeek-Coder and Mistral exhibit significant improvements, possibly due to their proficiency in handling long sequential text.

[Figure: grouped bar charts of F1-score, Recall, Accuracy, and Precision for DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, and Phi-2, each with ("w/ Info") and without ("w/o Info") the key information.]
Fig. 5: The impact of key information on LLM Vulnerability Assessment (RQ2)

Finding-5. LLMs have the capacity to assess vulnerability severity based on source code, and this capacity can be improved by providing more context information.
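For reference, the two prompt variants compared above can be assembled from the fields shown in the example with a simple string template. The sketch below is only an illustration of that packaging: the helper name, dictionary keys, and any wording beyond the task description and field labels are assumptions, not the exact prompts used in the experiments.

def build_assessment_prompt(code, context=None):
    # Task description used for severity assessment (RQ2).
    prompt = ("Provide a qualitative severity rating of CVSS v2.0 "
              "for the vulnerable C code snippet.\n")
    if context is not None:
        # Key information added in the second variant: project, file name,
        # CVE description, and commit message.
        prompt += ("Project: " + context["project"] + "\n"
                   "File Name: " + context["file_name"] + "\n"
                   "CVE Description: " + context["cve_description"] + "\n"
                   "Commit Message: " + context["commit_message"] + "\n")
    return prompt + "Code:\n" + code + "\n"

Calling build_assessment_prompt(code) corresponds to Task Description 1, while passing the four fields corresponds to Task Description 2.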
4.3 RQ-3: Evaluating Vulnerability Location of LLMs
In this RQ, we first outline how to assess the vulnerability location capabilities of LLMs. Then, we proceed to compare the vulnerability location abilities of LLMs across different settings, both at a general level and in detail, and analyze the reasons behind the observed differences.

Experimental Setting. We select the vulnerable functions with information on vulnerable lines from the testing set for the evaluation and instruct the LLMs with the following task description to explore their vulnerability location performance.

Task Description: Provide a vulnerability location result for the vulnerable C code snippet.

For the fine-tuning setting of LLMs and pre-trained LMs, we treat the vulnerability location task as a binary classification problem, determining whether each line of code is vulnerable or not. For the few-shot setting, a specific vulnerable function may contain one or several vulnerable lines, and the LLM may also predict one or several potential vulnerable lines (Lines_predict). We convert Lines_predict into a binary classification format. For example, if a given vulnerable function consists of five lines and contains two vulnerable lines [2, 3], and the LLM predicts one potential vulnerable line [2], we convert this prediction to [0, 1, 0, 0, 0] for ease of computation. To better evaluate the vulnerability location performance of an LLM on a specific vulnerable function, we consider five widely used performance measures (i.e., Precision, Recall, F1-score, Accuracy, and FPR).
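To make the line-to-vector conversion and the five line-level measures concrete, the following minimal sketch reproduces the bookkeeping described above (the function names are illustrative, and per-function scores still need to be aggregated over the testing set):

def lines_to_vector(lines, num_lines):
    # e.g., predicted line [2] in a five-line function -> [0, 1, 0, 0, 0]
    marked = set(lines)
    return [1 if i in marked else 0 for i in range(1, num_lines + 1)]

def location_scores(true_lines, predicted_lines, num_lines):
    y_true = lines_to_vector(true_lines, num_lines)
    y_pred = lines_to_vector(predicted_lines, num_lines)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = num_lines - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / num_lines
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return precision, recall, f1, accuracy, fpr

For the example above, location_scores([2, 3], [2], 5) yields a precision of 1.0, a recall of 0.5, an accuracy of 0.8, and an FPR of 0.0.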
In addition to pre-trained LMs, we also consider the following four SOTA baselines: Devign [38], Reveal [47], IVDetect [56], and LineVul [39]. For the graph-based approaches (i.e., Devign, Reveal, and IVDetect), we use GNNExplainer [78], [79] for vulnerability location. We compare the performance of LLMs and these baselines using Top-k Accuracy, as employed in previous works [39], [79].

TABLE 10: The comparison between LLMs and six baselines on software vulnerability location (RQ3)

Methods                F1-score  Recall  Precision  Accuracy  FPR
CodeBERT               0.470     0.514   0.433      0.879     0.078
GraphCodeBERT          0.483     0.477   0.489      0.893     0.058
UniXcoder              0.460     0.384   0.575      0.908     0.032
PLBART                 0.436     0.416   0.458      0.886     0.058
CodeT5                 0.493     0.408   0.623      0.914     0.028
CodeT5+                0.303     0.207   0.565      0.902     0.018
Fine-Tuning Setting
DeepSeek-Coder 6.7B    0.437     0.332   0.640      0.912     0.021
CodeLlama 7B           0.504     0.396   0.691      0.919     0.021
StarCoder 7B           0.245     0.169   0.443      0.893     0.024
WizardCoder 7B         0.520     0.427   0.664      0.918     0.025
Mistral                0.314     0.384   0.266      0.827     0.122
Phi-2                  0.458     0.361   0.629      0.912     0.025
Few-Shot Setting
DeepSeek-Coder 6.7B    0.111     0.111   0.112      0.852     0.081
DeepSeek-Coder 33B     0.110     0.112   0.108      0.849     0.084
CodeLlama 7B           0.082     0.063   0.116      0.882     0.043
CodeLlama 34B          0.115     0.090   0.158      0.884     0.044
StarCoder 7B           0.088     0.066   0.134      0.887     0.039
StarCoder 15.5B        0.095     0.078   0.120      0.876     0.052
WizardCoder 7B         0.082     0.063   0.120      0.884     0.042
WizardCoder 34B        0.096     0.072   0.145      0.887     0.039
Mistral                0.086     0.065   0.127      0.885     0.040
Phi-2                  0.073     0.053   0.116      0.885     0.037

Results. Table 10 presents the overall performance of vulnerability location for LLMs and the six baselines. Based on this table, we can make the following observations: (1) Fine-tuning can greatly enhance the vulnerability location capabilities of LLMs. For example, after fine-tuning, CodeLlama 7B's F1-score increases from 0.082 to 0.504, recall increases from 0.063 to 0.396, precision increases from 0.116 to 0.691, accuracy increases from 0.882 to 0.919, and FPR decreases from 0.043 to 0.021. (2) Code-related LLMs often outperform pre-trained LMs in terms of F1-score, precision, and FPR. For example, CodeLlama 7B outperforms the pre-trained LMs, averaged over the six different models, by 14.3%, 31.9%, and 54.3% in terms of F1-score, precision, and FPR, respectively. The Top-k Accuracy results of the interpreters are shown in Fig. 6, where the x-axis represents k and the y-axis represents Top-k Accuracy (%). For comparison, we average the results of all LLMs and pre-trained LMs. We can observe that the performance of the graph-based methods does not show significant differences but is considerably weaker than that of pre-trained LMs, LLMs, and LineVul. Although pre-trained LMs achieve the highest accuracy at k=20, the difference between them and LLMs is not substantial.

[Figure: Top-k Accuracy (%) curves of LLMs (averaged), pre-trained LMs (averaged), Devign, IVDetect, Reveal, and LineVul as k increases from 2 to 20.]
Fig. 6: The interpretation results for different vulnerability locators with k from 2 to 20 (RQ3)
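A minimal sketch of the Top-k Accuracy computation used in Fig. 6, assuming the common definition from line-level localization work [39], [79] in which a function counts as a hit if at least one truly vulnerable line appears among the k lines ranked as most suspicious (the data layout below is illustrative):

def top_k_accuracy(samples, k):
    # samples: list of (ranked_lines, true_lines) pairs, where ranked_lines
    # holds line numbers sorted from most to least suspicious.
    hits = sum(1 for ranked_lines, true_lines in samples
               if set(ranked_lines[:k]) & set(true_lines))
    return hits / len(samples)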
TABLE 11: An example of generating vulnerability description for vulnerable code (RQ4)

Describe vulnerable code (CWE-119, Improper Restriction of Operations within the Bounds of a Memory Buffer)
Task Description: Provide a CVE description for the vulnerable C code snippet.
Source Code: An example of a C code snippet with vulnerabilities. CVE ID: CVE-2016-0841.
Response From CodeT5+: libmedia in mediaserver in Android 4.x before 4.4.4, 5.0.x before 5.0.2, 5.1.x before 5.1.1, and 6.x before 2016-08-01 has certain incorrect declarations, which allows remote attackers to execute arbitrary code or cause a denial of service (NULL pointer dereference or memory corruption) via a crafted media file, aka internal bug 28166152.
Response From CodeLlama: media/libmedia/MediaMetadataRetriever.cpp in mediaserver in Android 4.x before 4.4.4, 5.0.x before 5.0.2, 5.1.x before 5.1.1, and 6.x before 2016-04-01 does not validate the data size, which allows attackers to execute arbitrary code or cause a denial of service (memory corruption) via a crafted media file, aka internal bug 28799341. Specifically, the constructor calls the native function createMetadataRetriever in libstagefright, and later calls the getFrameAtTime function, aka internal bug 29244311. The getFrameAtTime call triggers a memory corruption warning, which allows an attacker to execute arbitrary code via a crafted video file, aka internal ...
CVE Description: media/libmedia/mediametadataretriever.cpp in mediaserver in Android 4.x before 4.4.4, 5.0.x before 5.0.2, 5.1.x before 5.1.1, and 6.x before 2016-04-01 mishandles cleared service binders, which allows remote attackers to execute arbitrary code or cause a denial of service (memory corruption) via a crafted media file, aka internal bug 26040840.
Case Study. We find that LLMs under the few-shot setting have limitations, and in some exceptional cases, they tend to output more vulnerable lines, even if these lines do not contain vulnerabilities. We take StarCoder as an example: Fig. 7 depicts a vulnerable code snippet from the Big-Vul dataset, with the vulnerable behavior occurring in lines 3 and 4. However, StarCoder tends to output more vulnerability lines, such as "[1, 2, 3, 4, 5, 6, 7, 8, 9]", whereas after fine-tuning, StarCoder becomes more cautious and only predicts "[4]". Note that we convert the model's predictions into a specific format, i.e., we transform "[0, 0, 0, 1, 0, 0, 0, 0, 0]" to "[4]".

Task Description:
Provide a vulnerability location result for the vulnerable C code snippet.

StarCoder: [1, 2, 3, 4, 5, 6, 7, 8, 9]    StarCoder (Fine-Tuning): [4]

01 standard_info_part2(standard_display *dp,..., int nImages)
02 {
03   dp->pixel_size = bit_size(pp,..., png_get_bit_depth(pp, pi));
04   dp->bit_width = png_get_image_width(pp, pi) * dp->pixel_size;
05   dp->cbRow = png_get_rowbytes(pp, pi);
06   if (dp->cbRow != (dp->bit_width+7)/8)
07     png_error(pp, "bad png_get_rowbytes calculation");
08   store_ensure_image(dp->ps, pp, nImages, dp->cbRow, dp->h);
09 }

Fig. 7: An example to demonstrate the limitations of StarCoder in vulnerability location (RQ3)

Finding-6. The few-shot setting exposes LLMs' limitations, and fine-tuning can greatly enhance the vulnerability location capabilities of LLMs.
Finding-7. Fine-tuning code-related LLMs as vulnerability locators is beneficial, as they can outperform pre-trained LMs in terms of F1-score, precision, and FPR.

4.4 RQ-4: Evaluating Vulnerability Description of LLMs
In this RQ, we employ the ROUGE metric to evaluate the LLMs' vulnerability description capabilities. We conduct a detailed statistical analysis of LLMs' abilities and also perform a case study to provide a comprehensive assessment of their performance in describing vulnerabilities.

Experimental Setting. We instruct LLMs with a designated task description, guiding them to perform the role of a vulnerability descriptor. Table 11 illustrates an example of our approach to evaluating LLMs' proficiency in conducting vulnerability descriptions.

Task Description: Provide a CVE description for the vulnerable C code snippet.

To evaluate the precision of the generated CVE descriptions, we adopt the widely used ROUGE metrics [80], which are used in natural language processing to evaluate automatic summarization and machine translation. The metrics compare an automatically produced summary or translation against a reference or a set of references produced by humans. Here, we consider three variants: ROUGE-1, ROUGE-2, and ROUGE-L.
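A minimal sketch of this metric computation, assuming the rouge-score package and F-measure values (the text above does not fix a particular implementation):

from rouge_score import rouge_scorer

def rouge_f1(reference, generated):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, generated)  # reference first, prediction second
    return {name: score.fmeasure for name, score in scores.items()}

For example, rouge_f1(cve_description, llm_description) returns three scores of the kind reported in Table 12.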
Results. Table 12 presents the vulnerability description capabilities of LLMs and the six baselines. According to the results, we can make the following observations: (1) LLMs exhibit significantly weaker performance in generating vulnerability descriptions compared to pre-trained LMs. For instance, pre-trained LMs achieve an average performance of 0.600, 0.487, and 0.591 on ROUGE-1, ROUGE-2, and ROUGE-L, respectively, whereas fine-tuned LLMs only achieve an average of 0.406, 0.301, and 0.400 on the same metrics. (2) Fine-tuning can significantly enhance the performance of LLMs in vulnerability description. After fine-tuning, there is a several-fold improvement in ROUGE-1, ROUGE-2, and ROUGE-L. This suggests that these LLMs possess strong learning capabilities and can extract more gains from historical data. (3) The low ROUGE-2 scores indicate that Phi-2 has a limited ability to generate accurate and relevant higher-order n-grams (pairs of consecutive words) in vulnerability descriptions, suggesting potential issues in capturing specific and detailed information.

TABLE 12: The comparison of LLMs on software vulnerability description (RQ4)

Methods                ROUGE-1  ROUGE-2  ROUGE-L
CodeBERT               0.511    0.376    0.501
GraphCodeBERT          0.538    0.406    0.528
UniXcoder              0.658    0.558    0.650
PLBART                 0.447    0.313    0.437
CodeT5                 0.700    0.604    0.693
CodeT5+                0.747    0.668    0.740
Fine-Tuning Setting
DeepSeek-Coder 6.7B    0.434    0.325    0.425
CodeLlama 7B           0.392    0.292    0.387
StarCoder 7B           0.420    0.321    0.416
WizardCoder 7B         0.425    0.327    0.419
Mistral                0.453    0.347    0.448
Phi-2                  0.313    0.196    0.305
Few-Shot Setting
DeepSeek-Coder 6.7B    0.230    0.073    0.215
DeepSeek-Coder 33B     0.219    0.066    0.203
CodeLlama 7B           0.221    0.070    0.205
CodeLlama 34B          0.258    0.094    0.242
StarCoder 7B           0.243    0.084    0.229
StarCoder 15.5B        0.255    0.089    0.241
WizardCoder 7B         0.230    0.066    0.211
WizardCoder 34B        0.276    0.111    0.261
Mistral                0.290    0.095    0.267
Phi-2                  0.210    0.056    0.194

Case Study. To demonstrate the capability of pre-trained LMs and LLMs in generating vulnerability descriptions, we present an example of a vulnerability (CWE-119) described by CodeT5+ and CodeLlama, as shown in Table 11. This example represents a vulnerability within the Android project, categorized as CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer). It is noteworthy that even when provided with only the code of the vulnerability, CodeT5+ produces text highly similar to the CVE description (highlighted in orange), indicating that pre-trained LMs are capable of comprehending the essence and crucial features of vulnerabilities and expressing this information in natural language.

Additionally, we find that CodeLlama's response is very similar to the CVE description, but with many additional details. We hypothesize that the poor performance of LLMs is not due to their inability to generate appropriate vulnerability descriptions, but rather because they tend to output tokens endlessly, even when they should stop. In contrast, pre-trained LMs typically stop at the appropriate points. To further analyze this, we investigate the vulnerability description capabilities of LLMs after mitigating this issue. Using CodeLlama as an example, we randomly select 100 examples from the testing set, manually determine where the descriptions should terminate, and trim CodeLlama's output accordingly. We then calculate the ROUGE metrics for the trimmed outputs and compare them with the original results and those of CodeT5+. The final results are presented in Table 13. We find that after trimming, the ROUGE-1, ROUGE-2, and ROUGE-L scores for CodeLlama improve significantly, even nearing those of CodeT5+. This confirms our hypothesis that LLMs actually possess strong vulnerability description capabilities, but their performance is hindered by the tendency to output excessively.

TABLE 13: The comparison of CodeT5+, CodeLlama, and CodeLlama-Trim on selected examples (RQ4)

Methods          ROUGE-1  ROUGE-2  ROUGE-L
CodeT5+          0.730    0.644    0.722
CodeLlama        0.366    0.266    0.360
CodeLlama-Trim   0.625    0.523    0.616

Finding-8. LLMs exhibit significantly weaker performance in generating vulnerability descriptions compared to pre-trained LMs. Therefore, fine-tuning pre-trained LMs for vulnerability description is recommended.

5 Discussion
This section discusses open questions regarding the observed performance differences, the impact of input sequence length, and potential threats to the validity of our results.

5.1 Analysis of Performance Difference
In RQ1, for LineVul, there is a huge difference between the results obtained in this paper (i.e., an F1-score of 0.272) and the ones reported in the original work (i.e., an F1-score of 0.910). To ensure a fair comparison, we first check the correctness of our LineVul reproduction by re-conducting the corresponding experiments using the original dataset provided by LineVul's official source, and we obtain similar results. Then, we inspect each step of the data preprocessing process, as outlined in Section 3.1. In particular, this process involves three pre-processing steps in total: removing blank lines, removing comments, and trimming leading and trailing spaces from lines. We pre-process the original dataset of LineVul, re-train, and test the model under the same parameter settings. The results are shown in Table 14, and we obtain the following conclusions:
• Our reproduced LineVul performs closely to the original one.
• Removing blank lines and comments does not significantly affect LineVul's results.
• Trimming leading and trailing spaces from lines causes a drastic decrease in LineVul's performance.
Generally, for C/C++ source code, we know that removing leading and trailing spaces does not affect the code's semantics. Thus, to verify whether this effect generalizes to other transformer-based models, we conduct another experiment on UniXcoder (another famous and widely used transformer-based pre-trained model) by adopting the same filtering operations. The results are presented in the right part of Table 14, which shows that UniXcoder's performance closely resembles LineVul's before the third pre-processing step. However, after trimming spaces, UniXcoder's performance similarly plummets. Thus, we believe that such operations have side effects on transformer-based models, since these models attend to every token, even tokens that have no semantic meaning in the context of source code. Based on this observation, we believe that the vulnerability detection effectiveness of LineVul after space removal is correct and the performance results are reasonable.
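A minimal sketch of the three pre-processing operations examined above, assuming simple regular expressions (this version ignores corner cases such as comment markers appearing inside string literals):

import re

def preprocess(code, remove_blank=True, remove_comments=True, trim_spaces=True):
    if remove_comments:
        code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)  # block comments
        code = re.sub(r"//[^\n]*", "", code)               # line comments
    lines = code.split("\n")
    if trim_spaces:
        lines = [line.strip() for line in lines]           # leading/trailing spaces
    if remove_blank:
        lines = [line for line in lines if line.strip()]   # blank lines
    return "\n".join(lines)

Applying only the first two operations leaves the token sequence largely intact, whereas stripping the indentation removes many whitespace tokens that the transformer-based models appear to rely on.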
TABLE 14: The reproduced results for LineVul and UniXcoder

                      LineVul                                  UniXcoder
Datasets              F1-score  Accuracy  Recall  Precision    F1-score  Accuracy  Recall  Precision
Original dataset      0.90      0.95      0.86    0.95         0.86      0.98      0.82    0.90
Remove empty lines    0.85      0.98      0.79    0.93         0.85      0.98      0.82    0.88
Remove comments       0.86      0.99      0.81    0.93         0.85      0.98      0.81    0.90
Remove spaces         0.40      0.94      0.37    0.45         0.26      0.95      0.16    0.68
5.2 Analysis of Input Sequence Length
In RQ2, we find that after adding key information, the performance of StarCoder and Phi-2 in vulnerability assessment actually weakened. We hypothesize that these LLMs may not excel in handling excessively long text sequences; therefore, adding key information, which increases the number of input tokens, leads to a decline in performance. In this section, we analyze the performance of StarCoder and Phi-2 with respect to input sequence length to determine whether there is a performance decline as the input length increases. As shown in Fig. 8, the horizontal axis represents the token length of the input sequence, and the vertical axis represents the F1-score of vulnerability assessment. We categorize the input token lengths into 0-128, 128-256, 256-512, 512-1024, and 1024+ (e.g., an input token length of 64 falls into the 0-128 range), and evaluate the vulnerability assessment performance of StarCoder and Phi-2 for each category. According to Fig. 8, we observe that as the input length increases, the F1-scores of both LLMs gradually decrease, revealing their significant limitations in assessing long sequences of vulnerable code. Therefore, in practical applications requiring the assessment of long sequences of vulnerable code, we may need to consider alternative optimization strategies or model choices to ensure accuracy and reliability.

[Figure: F1-scores of StarCoder and Phi-2 for vulnerability assessment across the input-length ranges 0-128, 128-256, 256-512, 512-1024, and 1024+ tokens.]
Fig. 8: The variation of F1-scores for vulnerability assessment with respect to input sequence length
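The length-bucketed evaluation above can be sketched as follows, assuming a macro-averaged F1 over the severity labels (scikit-learn is used here purely for illustration):

from collections import defaultdict
from sklearn.metrics import f1_score

BUCKETS = ["0-128", "128-256", "256-512", "512-1024", "1024+"]

def bucket_of(num_tokens):
    # e.g., an input of 64 tokens falls into the "0-128" bucket
    for label, edge in zip(BUCKETS, [128, 256, 512, 1024]):
        if num_tokens < edge:
            return label
    return BUCKETS[-1]

def f1_per_bucket(samples):
    # samples: iterable of (num_tokens, true_severity, predicted_severity)
    grouped = defaultdict(lambda: ([], []))
    for num_tokens, truth, prediction in samples:
        truths, predictions = grouped[bucket_of(num_tokens)]
        truths.append(truth)
        predictions.append(prediction)
    return {label: f1_score(truths, predictions, average="macro")
            for label, (truths, predictions) in grouped.items()}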
5.3 Threats to Validity
Threats to Internal Validity mainly lie in two aspects. The first is the design of the prompts used to instruct LLMs to give out responses. We design our prompts according to practical advice [77] that has been verified by many users online and can obtain good responses from LLMs. Furthermore, LLMs will generate responses with some randomness even given the same prompt. Therefore, we set the "temperature" to 0, which reduces the randomness as much as possible, and we try our best to collect all these results within two days to avoid the model being upgraded. The second is about potential mistakes in the implementation of the studied baselines. To minimize such threats, we directly use the original source code shared by the corresponding authors.
Threats to External Validity may correspond to the generalization of the studied dataset. To mitigate this threat, we adopt the largest-scale vulnerability dataset with diverse information about the vulnerabilities, which are collected from practical projects, and these vulnerabilities are recorded in the Common Vulnerabilities and Exposures (CVE). However, we do not consider vulnerabilities found recently. Besides, we do not adopt another large-scale vulnerability dataset named SARD, since it is built manually and cannot satisfy the distinct characteristics of the real world [45], [47].
Threats to Construct Validity mainly correspond to the performance metrics in our evaluations. To minimize such threats, we consider several widely used performance metrics to evaluate the performance of LLMs on different types of tasks, e.g., Recall, Precision, and ROUGE.

6 Conclusion
This paper aims to comprehensively investigate the capabilities of LLMs for software vulnerability tasks as well as their impact. To achieve that, we adopt a large-scale vulnerability dataset (named Big-Vul) and then conduct several experiments focusing on four dimensions: (1) Vulnerability Detection, (2) Vulnerability Assessment, (3) Vulnerability Location, and (4) Vulnerability Description. Overall, although LLMs show some ability in certain areas, they still need further improvement to be competent in software vulnerability-related tasks. Our research conducts a comprehensive survey of LLMs' capabilities and provides a reference for enhancing their understanding of software vulnerabilities in the future.

Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 62202419 and No. 62172214), the Ningbo Natural Science Foundation (No. 2022J184), and the Key Research and Development Program of Zhejiang Province (No. 2021C01105).
References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., "Improving language understanding by generative pre-training," 2018.
[3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
[4] R. Tang, Y.-N. Chuang, and X. Hu, "The science of detecting llm-generated texts," arXiv preprint arXiv:2303.07205, 2023.
[5] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "Codegen: An open large language model for code with multi-turn program synthesis," arXiv preprint arXiv:2203.13474, 2022.
[6] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li et al., "Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x," arXiv preprint arXiv:2303.17568, 2023.
[7] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, "Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation," arXiv preprint arXiv:2109.00859, 2021.
[8] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi, "Codet5+: Open code large language models for code understanding and generation," arXiv preprint arXiv:2305.07922, 2023.
[9] D. AI, "Deepseek coder: Let the code write itself," https://github.com/deepseek-ai/DeepSeek-Coder, 2023.
[10] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim et al., "Starcoder: may the source be with you!" arXiv preprint arXiv:2305.06161, 2023.
[11] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., "Code llama: Open foundation models for code," arXiv preprint arXiv:2308.12950, 2023.
[12] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in Proceedings of the 45th International Conference on Software Engineering (ICSE 2023). Association for Computing Machinery, 2023.
[13] C. S. Xia and L. Zhang, "Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt," arXiv preprint arXiv:2304.00385, 2023.
[14] ——, "Less training, more repairing please: revisiting automated program repair via zero-shot learning," in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 959–971.
[15] R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, "Understanding the effectiveness of large language models in code translation," arXiv preprint arXiv:2308.03109, 2023.
[16] S. Kang, J. Yoon, and S. Yoo, "Large language models are few-shot testers: Exploring llm-based general bug reproduction," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2312–2323.
[17] D. Zan, B. Chen, F. Zhang, D. Lu, B. Wu, B. Guan, W. Yongji, and J.-G. Lou, "Large language models meet nl2code: A survey," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 7443–7464.
[18] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, "Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models," in International Conference on Software Engineering (ICSE), 2023.
[19] S. Khan and S. Parkinson, "Review into state of the art of vulnerability assessment using artificial intelligence," Guide to Vulnerability Analysis for Computer Networks and Systems, pp. 3–32, 2018.
[20] T. H. Le, H. Chen, and M. A. Babar, "A survey on data-driven software vulnerability assessment and prioritization," ACM Computing Surveys (CSUR), 2021.
[21] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, "A c/c++ code vulnerability dataset with code changes and cve summaries," in Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 508–512.
[22] "Replication," 2024. [Online]. Available: https://github.com/vinci-grape/VulEmpirical
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in neural information processing systems, vol. 30, 2017.
[24] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., "Codebert: A pre-trained model for programming and natural languages," arXiv preprint arXiv:2002.08155, 2020.
[25] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al., "Graphcodebert: Pre-training code representations with data flow," arXiv preprint arXiv:2009.08366, 2020.
[26] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, "Unixcoder: Unified cross-modal pre-training for code representation," arXiv preprint arXiv:2203.03850, 2022.
[27] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, "Unified pre-training for program understanding and generation," arXiv preprint arXiv:2103.06333, 2021.
[28] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[29] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," Advances in neural information processing systems, vol. 30, 2017.
[30] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, "Fine-tuning language models from human preferences," arXiv preprint arXiv:1909.08593, 2019.
[31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[32] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., "A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity," arXiv preprint arXiv:2302.04023, 2023.
[33] X. Zhou, T. Zhang, and D. Lo, "Large language model for vulnerability detection: Emerging results and future directions," in Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results, 2024, pp. 47–51.
[34] M. Fu, C. K. Tantithamthavorn, V. Nguyen, and T. Le, "Chatgpt for vulnerability detection, classification, and repair: How far are we?" in 2023 30th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 2023, pp. 632–636.
[35] C. MITRE, "Common vulnerabilities and exposures (cve)," 2023. [Online]. Available: https://cve.mitre.org/
[36] G. Bhandari, A. Naseer, and L. Moonen, "Cvefixes: automated collection of vulnerabilities and their fixes from open-source software," in Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, 2021, pp. 30–39.
[37] Symantec, "securityfocus," 2023. [Online]. Available: https://www.securityfocus.com/
[38] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 10197–10207.
[39] M. Fu and C. Tantithamthavorn, "Linevul: A transformer-based line-level vulnerability prediction," 2022.
[40] S. Cao, X. Sun, L. Bo, R. Wu, B. Li, and C. Tao, "Mvd: Memory-related vulnerability detection based on flow-sensitive graph neural networks," arXiv preprint arXiv:2203.02660, 2022.
[41] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, "Sysevr: A framework for using deep learning to detect software vulnerabilities," IEEE Transactions on Dependable and Secure Computing, 2021.
[42] X. Cheng, G. Zhang, H. Wang, and Y. Sui, "Path-sensitive code embedding via contrastive learning for software vulnerability detection," in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 519–531.
[43] Y. Wu, D. Zou, S. Dou, W. Yang, D. Xu, and H. Jin, "Vulcnn: An image-inspired scalable vulnerability detection system," 2022.
[44] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, "Vuldeepecker: A deep learning-based system for vulnerability detection," in Proceedings of the 25th Annual Network and Distributed System Security Symposium, 2018.
[45] D. Hin, A. Kan, H. Chen, and M. A. Babar, "Linevd: Statement-level vulnerability detection using graph neural networks," arXiv preprint arXiv:2203.05181, 2022.
[46] X. Zhan, L. Fan, S. Chen, F. We, T. Liu, X. Luo, and Y. Liu, "Atvhunter: Reliable version detection of third-party libraries for vulnerability identification in android applications," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 1695–1707.
[47] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, "Deep learning based vulnerability detection: Are we there yet," IEEE Transactions on Software Engineering, 2021.
[48] C. Ni, X. Yin, K. Yang, D. Zhao, Z. Xing, and X. Xia, "Distinguishing look-alike innocent and vulnerable code by subtle semantic representation learning and explanation," in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1611–1622.
[49] B. Steenhoek, M. M. Rahman, R. Jiles, and W. Le, "An empirical study of deep learning models for vulnerability detection," in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2237–2248.
[50] B. Steenhoek, H. Gao, and W. Le, "Dataflow analysis-inspired deep learning for efficient vulnerability detection," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 1–13.
[51] C. Ni, L. Shen, W. Wang, X. Chen, X. Yin, and L. Zhang, "Fva: Assessing function-level vulnerability by integrating flow-sensitive structure and code statement semantic," in 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC). IEEE, 2023, pp. 339–350.
[52] A. Feutrill, D. Ranathunga, Y. Yarom, and M. Roughan, "The effect of common vulnerability scoring system metrics on vulnerability exploit delay," in 2018 Sixth International Symposium on Computing and Networking (CANDAR). IEEE, 2018, pp. 1–10.
[53] G. Spanos and L. Angelis, "A multi-target approach to estimate software vulnerability characteristics and severity scores," Journal of Systems and Software, vol. 146, pp. 152–166, 2018.
[54] T. H. M. Le, D. Hin, R. Croft, and M. A. Babar, "Deepcva: Automated commit-level vulnerability assessment with deep multi-task learning," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 717–729.
[55] Z. Li, D. Zou, S. Xu, Z. Chen, Y. Zhu, and H. Jin, "Vuldeelocator: a deep learning-based fine-grained vulnerability detector," IEEE Transactions on Dependable and Secure Computing, 2021.
[56] Y. Li, S. Wang, and T. N. Nguyen, "Vulnerability detection with fine-grained interpretations," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 292–303.
[57] C. Ni, W. Wang, K. Yang, X. Xia, K. Liu, and D. Lo, "The Best of Both Worlds: Integrating Semantic Features with Expert Features for Defect Prediction and Localization," in Proceedings of the 2022 30th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2022, pp. 672–683.
[58] C. Ni, K. Yang, X. Xia, D. Lo, X. Chen, and X. Yang, "Defect identification, categorization, and repair: Better together," arXiv preprint arXiv:2204.04856, 2022.
[59] Q. Zhang, Y. Zhao, W. Sun, C. Fang, Z. Wang, and L. Zhang, "Program repair: Automated vs. manual," arXiv preprint arXiv:2203.05166, 2022.
[60] Z. Chen, S. J. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus, "Sequencer: Sequence-to-sequence learning for end-to-end program repair," IEEE Transactions on Software Engineering, 2019.
[61] Q. Zhu, Z. Sun, Y.-a. Xiao, W. Zhang, K. Yuan, Y. Xiong, and L. Zhang, "A syntax-guided edit decoder for neural program repair," in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 341–353.
[62] J. Sun, Z. Xing, H. Guo, D. Ye, X. Li, X. Xu, and L. Zhu, "Generating informative cve description from exploitdb posts by extractive summarization," ACM Transactions on Software Engineering and Methodology (TOSEM), 2022.
[63] H. Guo, S. Chen, Z. Xing, X. Li, Y. Bai, and J. Sun, "Detecting and augmenting missing key aspects in vulnerability descriptions," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 3, pp. 1–27, 2022.
[64] H. Guo, Z. Xing, S. Chen, X. Li, Y. Bai, and H. Zhang, "Key aspects augmentation of vulnerability description based on multiple security databases," in 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2021, pp. 1020–1025.
[65] H. Guo, Z. Xing, and X. Li, "Predicting missing information of key aspects in vulnerability reports," arXiv preprint arXiv:2008.02456, 2020.
[66] G. Fan, R. Wu, Q. Shi, X. Xiao, J. Zhou, and C. Zhang, "Smoke: scalable path-sensitive memory leak detection for millions of lines of code," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 72–82.
[67] W. Li, H. Cai, Y. Sui, and D. Manz, "Pca: memory leak detection using partial call-path analysis," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1621–1625.
[68] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, "Wizardcoder: Empowering code large language models with evol-instruct," arXiv preprint arXiv:2306.08568, 2023.
[69] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., "Mistral 7b," arXiv preprint arXiv:2310.06825, 2023.
[70] "Phi-2: The surprising power of small language models," 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
[71] "Hugging face open llm leaderboard," 2023. [Online]. Available: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
[72] A. Z. Yang, C. Le Goues, R. Martins, and V. Hellendoorn, "Large language models for test-free fault localization," in Proceedings of the 46th IEEE/ACM International Conference on Software Engineering, 2024, pp. 1–12.
[73] J. Zhang, C. Wang, A. Li, W. Sun, C. Zhang, W. Ma, and Y. Liu, "An empirical study of automated vulnerability localization with large language models," arXiv preprint arXiv:2404.00287, 2024.
[74] A. Aghajanyan, B. Huang, C. Ross, V. Karpukhin, H. Xu, N. Goyal, D. Okhonko, M. Joshi, G. Ghosh, M. Lewis et al., "Cm3: A causal masked multimodal model of the internet," arXiv preprint arXiv:2201.07520, 2022.
[75] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "Pytorch: An imperative style, high-performance deep learning library," Advances in neural information processing systems, vol. 32, 2019.
[76] "Hugging face," 2023. [Online]. Available: https://huggingface.co
[77] J. Shieh, "Best practices for prompt engineering with openai api," OpenAI, February 2023. [Online]. Available: https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
[78] Z. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec, "Gnnexplainer: Generating explanations for graph neural networks," Advances in neural information processing systems, vol. 32, 2019.
[79] Y. Hu, S. Wang, W. Li, J. Peng, Y. Wu, D. Zou, and H. Jin, "Interpreters for gnn-based vulnerability detection: Are we there yet?" in Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis, 2023, pp. 1407–1419.
[80] C.-Y. Lin, "Rouge: A package for automatic evaluation of summaries," in Text summarization branches out, 2004, pp. 74–81.