
Literature review on vulnerability detection using NLP technology

1st Jiajie Wu
School of Computer
Hangzhou University of Electronic Science and Technology
Hangzhou, Zhejiang, China
[email protected]

arXiv:2104.11230v1 [cs.CR] 23 Apr 2021

Abstract—Vulnerability detection has always been one of the most important tasks in the field of software security. With the development of technology, and in the face of massive amounts of source code, automated analysis and detection of vulnerabilities has become a current research hotspot. For special text files such as source code, using some of the most popular NLP technologies to build models and realize automatic analysis and detection of source code has become one of the most anticipated directions in the field of vulnerability detection. This article gives a brief survey of some recent documents and technologies, such as CodeBERT, and summarizes the earlier technologies.

Index Terms—vulnerability detection, code intelligence, deep learning, CodeBERT, NLP

I. INTRODUCTION

In recent years, with the continuous maturing of software technology, more and more software has been developed. While people enjoy the convenience brought by software, they are also threatened by software vulnerabilities. It can be said that software vulnerabilities are one of the biggest problems threatening the normal operation of software. For software users, the direct and indirect economic losses caused by software vulnerabilities worldwide have exceeded tens of billions of dollars. It is an indisputable fact that most software contains various vulnerabilities. There are many types of software vulnerabilities, such as CVE-2015-8558 [1], which are explained in detail on CVE [2]. The longer a vulnerability exists, the easier it is for attackers to exploit it and the greater the damage to the affected company or organization [3], so the ability to automatically detect vulnerabilities in software within a certain time frame has become one of the hottest research topics at the moment.

How can the automatic detection of vulnerabilities be made more accurate? Deep learning technology gives us this possibility. With the continuous development of deep learning in recent years, great progress has been made in the field of natural language processing (NLP). In particular, the series of models such as GPT [4] and BERT [5] have taken NLP technology a big step forward. Source code is essentially text in a special format, so it is logically feasible to use NLP technology to process code. In fact, in modern code intelligence, models such as CodeBERT [6] have already been proposed, some code-level tasks have been solved, and certain results have been achieved. These results show that using NLP technology to study automatic vulnerability detection (one of the code intelligence tasks) still has a lot of room for development. We give a detailed introduction in Section III.

The chapters are arranged as follows: Section II introduces the development of NLP, and Section III introduces the latest developments of NLP technology in vulnerability detection.

II. THE DEVELOPMENT OF NLP TECHNOLOGY

A. Natural Language Processing

Natural language processing (NLP) is the use of computers to model human natural language in order to solve related application problems. In NLP, the problems that need to be solved can be divided into two categories:

• One is natural language understanding (NLU) problems, including text classification [7], named entity recognition [8], [9], relation extraction [10], reading comprehension [11]–[13], etc.;
• The second is natural language generation (NLG) problems, including machine translation [14]–[16], text summary generation [17], [18], automatic question answering systems [19], [20], image caption generation [21]–[23], etc.

When NLP researchers studied and solved these two types of problems, they found that the underlying sub-problems are basically the same, such as the embedding representation of vocabulary. Researchers are now more inclined to use a unified model for modeling (the pre-training stage), and then adjust the model according to the specific problem (the fine-tuning stage). Research at this stage has made great progress, and it is believed that in the near future machines will be able to truly understand human language and even human thinking.

Since the 1980s, traditional NLP has increasingly relied on statistics, probability and shallow learning (traditional machine learning) [24], such as naive Bayes, hidden Markov models, conditional random fields, support vector machines and k-nearest neighbor algorithms; these algorithms are still widely used in NLP today. But with the development of deep learning (DL), people are paying more and more attention to how to use DL models to solve problems in NLP [25].
B. DL in NLP

The main goal of DL is to learn a deep neural network model [26]. A neural network model is composed of neurons and the edges connecting them; each neuron takes inputs, applies a nonlinear transformation internally, and produces outputs [27]. Following the timeline, we use the point at which the Transformer [28] was proposed as the dividing point: the methods before its appearance are called basic model methods, and the later ones are called modern model methods (or attention model methods). We introduce them separately below.

Basic model methods:

1) Convolutional Neural Network (CNN) [29]: Due to the excellent abstract feature extraction ability of the convolution kernel, CNNs have achieved great success in the field of computer vision (CV). In the field of NLP, CNN-based algorithms have also appeared one after another, such as [30]–[34]. In research related to vulnerability detection, some scholars have used CNNs to mine vulnerabilities [35], as shown in Figure 1.

Fig. 1. Using CNN to classify source code [35]

Although these models use a CNN as a feature extractor over text data, text data does not have many feature dimensions; what matters more is the close connection between contexts, which requires the model to have a "memory" function. As a result, CNNs do not perform very impressively on NLP tasks. However, recent research shows that with the development of multimodal technology [36]–[38], CNN-based models have achieved good results in some code generation tasks, such as generating instructions from images [39].

2) Recurrent Neural Network (RNN) [40]: One of the characteristics of the RNN is its "memory". An RNN can take serialized data as input or produce serialized data as output, so it has a natural advantage when processing sequential data such as text. The output of an RNN at the current token can include information from the preceding sequence, which gives the RNN its "memory" function. In practice, people often use a bidirectional RNN, that is, the preceding and following information of the current token are processed separately so that the token representation contains both sides of its context, which is very important for the model to understand the meaning of a sentence. However, models with an RNN structure cannot be processed in parallel, which, in today's setting of massive data, greatly limits the use of RNNs in engineering applications. In NLP, CNNs and RNNs are also used to extract character-level representations of words, as shown in Figure 2.

Fig. 2. CNN & RNN for extracting the character-level representation of a word [9]

3) Long Short-Term Memory Networks (LSTM) [41]: In addition to its structural limitations, the RNN cannot capture long-range information in text due to the vanishing gradient problem [42], so scholars modified the RNN and proposed the LSTM model to alleviate this defect. LSTM is one of the models with the strongest "memory" ability in NLP so far, and it is also one of the most widely used. However, because the LSTM has complex gating logic, it consumes a lot of space and time during training. The Gated Recurrent Unit (GRU) [43] is similar in structure to the LSTM but more lightweight, and its performance in training is not worse than the LSTM. For a comparison between the three basic models of CNN, GRU, and LSTM in NLP applications, please refer to [44]. Since the LSTM is a one-way model, in order to obtain the full context of a token, people often stack LSTM/GRU models in two directions to obtain a bidirectional LSTM model (Bi-LSTM) [45]. In practical applications, the Bi-LSTM model is often used to extract sentence features, and a CRF algorithm is then used to handle the downstream task [46].

4) Embedding [47]–[50]: Embedding technology converts tokens into vectors in a vector space. The earliest embedding technology can be traced back to the distributional hypothesis of words [51]; it represents a sequence of tokens in vector form as the input of a deep learning model. With the continuous development of the technology, embedding can be divided into two types:

• One is the classic non-contextual embedding technology, also called context-independent embedding in some literature, which refers to embeddings that are independent of context, such as Word2Vec [52], GloVe [53] and other models. When embedding, the contextual semantic relationship of words in the sentence is not considered. Simply put, these models only learn a mapping of words into the vector space: each word has a fixed representation, so they cannot handle problems such as representing polysemous words in context (a minimal sketch follows after this list). It is worth mentioning that embedding often runs into the out-of-vocabulary (OOV) problem; the common solution is to further segment words into subwords, such as BPE [54], [55]. For the semantic analysis performance of classic models such as Word2Vec and GloVe, please refer to [56];
• The other is contextual embedding technology, also called context-dependent embedding, such as the famous ELMo model [57], [58] and later models such as BERT [5]; the word embeddings they learn are all contextual embeddings. Contextual embedding comprehensively considers the context information in the sentence when learning the vectorized representation of a word and integrates the context of a single token into its representation. In this way, when dealing with issues such as polysemy, syntactic structure, and semantic roles, words can be represented differently according to the current semantic environment. For a more detailed analysis and comparison of the two technologies, please refer to [59].
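As a minimal illustration of the non-contextual case described above, the sketch below (not from the paper; the gensim library, the toy corpus and all hyperparameters are illustrative assumptions) trains a Word2Vec model so that every token receives one fixed vector, regardless of its surrounding context.

```python
# A minimal sketch of non-contextual embedding with gensim's Word2Vec:
# every token gets one fixed vector, independent of its context.
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens (here, tokenized code-like text).
corpus = [
    ["char", "*", "buf", "=", "malloc", "(", "size", ")", ";"],
    ["strcpy", "(", "buf", ",", "input", ")", ";"],
    ["free", "(", "buf", ")", ";"],
]

# vector_size, window and min_count are illustrative hyperparameters.
model = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, sg=1)

vec = model.wv["buf"]                        # fixed 64-dimensional vector for "buf"
print(vec.shape)                             # (64,)
print(model.wv.most_similar("buf", topn=3))  # nearest tokens in the embedding space
```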
Attention model methods:

1) Attention mechanism: The attention mechanism imitates the instinctive way people focus when observing objects. In computational terms, the attention mechanism essentially calculates a weight for each item and then takes a weighted combination of all items, so that the more important items contribute more information. The attention mechanism was first applied in machine translation [60]; due to its excellent performance, it was then widely used in other NLP tasks. It has now become so popular that most NLP models integrate it in some form. In the encoder-decoder architecture in particular, attention can be used in the encoder alone, in the decoder alone, or in both, as shown in Figure 3. In summary, the attention mechanism can be divided into six categories, as shown in Figure 4, of which the most common are self-attention and multi-dimensional attention. All models that apply the attention mechanism can be collectively referred to as attention models (AM). In addition to its applications in NLP, AM has also received extensive attention in the fields of computer vision (CV), multi-modal tasks, graph-based systems and recommender systems (RS) [61].

Fig. 3. A comparison between the traditional encoder-decoder architecture (left) and the attention-based architecture (right) [61]

Fig. 4. All attention types [62]
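The weighting idea described above can be made concrete with a short sketch. The NumPy code below (not from the paper; shapes, random weights and inputs are illustrative) implements scaled dot-product self-attention, the most common of the attention types just mentioned.

```python
# A minimal sketch of scaled dot-product self-attention: each token's output is a
# weighted sum of all value vectors, with weights derived from query-key similarity.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query with every key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V, weights          # weighted sum of the values

# Illustrative toy input: 4 tokens, model dimension 8; in self-attention Q, K and V
# are linear projections of the same token representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```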
2) Transformer [28]: With the birth of the Transformer architecture, the architecture with the strongest feature extraction capability so far appeared. In addition to the NLP field, the Transformer has also made great progress in the CV field [63]. The architecture of the Transformer is shown in Figure 5.

Fig. 5. Architecture of the Transformer model [63]

As can be seen from the figure, the Transformer is built around the attention mechanism. From a structural point of view, the Transformer is a typical encoder-decoder architecture, and its general processing is as follows:

• On the encoder side, after the serialized tokens undergo input embedding and positional embedding, the Q, K and V matrices are generated using three weight matrices, the attention matrix is obtained using the multi-head attention mechanism, and the result then passes through the usual add & norm and fully connected layers. The whole block is stacked N times; in [28], N = 6;
• On the decoder side, the process is roughly the same as on the encoder side. The main difference is that an additional multi-head attention layer is added, that is, a second attention step is performed, whose key and value inputs come from the output of the encoder side.

For more details about the Transformer and the attention mechanism, please see [64]. In [65], the authors classify Transformer variants according to technique and main purpose; visualization research on the Transformer is surveyed in [66], and we will not go into it here.
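A minimal PyTorch sketch of the encoder side just described is given below; it is not from the paper, and the vocabulary size, model dimensions and random inputs are illustrative assumptions. It stacks N = 6 identical multi-head attention blocks on top of token plus positional embeddings.

```python
# A minimal sketch of the Transformer encoder stack: embedding + positional
# encoding feeding N = 6 identical multi-head attention blocks.
import math
import torch
import torch.nn as nn

vocab_size, d_model, n_heads, num_layers = 1000, 512, 8, 6

embed = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

def positional_encoding(seq_len, dim):
    """Sinusoidal positional embedding as in the original Transformer."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angles = pos / torch.pow(10000, i / dim)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

tokens = torch.randint(0, vocab_size, (2, 16))   # batch of 2 sequences, 16 tokens each
x = embed(tokens) * math.sqrt(d_model) + positional_encoding(16, d_model)
contextual = encoder(x)                           # (2, 16, 512) contextual vectors
print(contextual.shape)
```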
3) GPT [4]: The Generative Pre-trained Transformer (GPT) is a Transformer-based pre-training model developed by OpenAI. Its purpose is to learn the dependencies between sentences and words in long text. Over time, GPT has evolved from GPT-1 to GPT-3 [67]. The biggest difference between GPT-1 and BERT is that the GPT-1 model scans text from left to right, so a token embedding can only take into account the information before the current token and not the information after it, while BERT uses bidirectional training. GPT-1 therefore only integrates the information preceding a token; the text after the token is used, after prediction, as new input to the model during training. GPT can be trained in an unsupervised manner; GPT-3 realizes unsupervised training on web text data. Its number of parameters has reached 175 billion, more than one hundred times the number of GPT-2 [68] parameters (1.5 billion), and GPT-3 can be said to be the largest and most advanced pre-training model so far.

4) BERT [5]: Bidirectional Encoder Representations from Transformers (BERT) is one of the best NLP models so far. BERT uses bidirectional Transformer blocks for training, taking into account the context information surrounding each word. After BERT, many excellent models (such as XLNet [69]) have been proposed, but the huge influence and excellent performance of BERT have not been replaced by other models. The training process of BERT is shown in Figure 6.

Fig. 6. Overall pre-training and fine-tuning procedures for BERT [5]

BERT uses the Masked Language Modeling (MLM) [70] technique, which is a fill-in-the-blank technique: during pre-training, the model predicts hidden information in the original text and thereby obtains contextual embeddings of the input tokens. The general process is that approximately 15% of the tokens in the BERT input are randomly selected and masked, and BERT is then pre-trained to predict the masked tokens. One disadvantage of this technique is that the masked token information is not encoded into the contextual embedding, so in downstream tasks an information deviation problem occurs because of the missing information of the previously masked words. The solution is to process the tokens selected for masking at a random ratio of 8/1/1: 80% of the selected tokens are replaced with the mask token, 10% keep the original token, and 10% are randomly replaced with other tokens. For a detailed summary of the applications of the BERT model in NLP, please see [71].
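The 80/10/10 rule described above is easy to state in code. The sketch below is not from the paper; the token IDs, the mask ID and the vocabulary size are illustrative assumptions.

```python
# A minimal sketch of MLM masking: select ~15% of the positions, then mask 80% of
# them, keep 10% unchanged and replace 10% with random tokens.
import random

MASK_ID, VOCAB_SIZE = 103, 30000

def apply_mlm_masking(token_ids, select_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if random.random() >= select_prob:
            continue
        labels[i] = tok                               # model must recover the original
        r = random.random()
        if r < 0.8:
            inputs[i] = MASK_ID                       # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: random token
        # else 10%: keep the original token unchanged
    return inputs, labels

tokens = [2023, 2003, 1037, 7099, 6251, 1012]
masked, targets = apply_mlm_masking(tokens)
print(masked, targets)
```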
5) BART [72]: BART is a denoising seq2seq model that can be regarded as an extension of BERT. From an architectural point of view, it can be seen as a "combination" of the BERT and GPT frameworks, using the encoder-decoder architecture, as shown in Figure 7. In the pre-training stage of BART, five noising input transformations are used: token masking, token deletion, text infilling, sentence permutation, and document rotation. In the fine-tuning stage, the authors train on four tasks: sequence classification, token classification, sequence generation, and machine translation. The results are shown in Figure 8: this extended model performs better than the BERT model on the reported data.

Fig. 7. A schematic comparison of BART with BERT and GPT [72]

Fig. 8. Comparison of pre-training objectives [72]

C. The pre-training model

Currently, the mainstream way of approaching NLP problems tends to be completed in two stages. The first stage is to build pre-trained models (PTM) based on contextual embedding; the second stage is to fine-tune the PTM for specific tasks. According to the classification in [73], PTMs can be divided into three categories by model structure: sequence models, recursive models and self-attention models. Using the pre-training mechanism can improve the generalization performance of the model, allowing researchers and engineers to devote more energy to the specific downstream tasks. It is worth mentioning that the bias problem in NLP becomes more prominent as models grow larger. For example, GPT-3 has 175 billion parameters; although it is by far the largest and most advanced NLP pre-training model, it also exhibits the most bias [74]. In addition, most pre-training models have a very large overhead (time, memory) during training, and on some simple tasks context-independent embedding methods even outperform context-dependent embeddings (Arora et al., 2020). This shows that there is no best model, only the most suitable model. Using a pre-trained model usually takes two steps: the first step is to download the pre-trained model, for example through the third-party Transformers package [75]; the second step is to fine-tune the model for the specific downstream task. Generally, transfer learning [76] is used to adapt the knowledge in the pre-trained model to the downstream task. There are many transfer learning methods in NLP, the most widely used being domain adaptation [77]; the article [78] provides a more detailed classification.
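The two-step recipe just described can be sketched with the Transformers package [75]. In the sketch below the checkpoint name "bert-base-uncased", the toy texts and the training hyperparameters are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of the two-step recipe: (1) download a pre-trained model,
# (2) fine-tune it on a labeled downstream task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")            # step 1
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

texts = ["free of known flaws", "obvious buffer overflow"]                # toy data
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)                # step 2
model.train()
for _ in range(3):                                 # a few toy fine-tuning steps
    outputs = model(**batch, labels=labels)        # returns loss and logits
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(outputs.logits.softmax(dim=-1))
```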
III. VULNERABILITY DETECTION USING NEURAL NETWORKS

Vulnerability detection has always been a top priority in the field of software security. With the development of deep learning technology in CV, NLP and other fields, the use of deep learning methods to understand and detect vulnerabilities in source code, thereby replacing manual detection, has become the focus and hotspot of current research [79]. Although more and more detection methods have been proposed, the number of vulnerabilities reported on CVE [2] and NVD [80] is increasing day by day. Besides the large-scale increase in the amount of software, another important reason is that root vulnerabilities are not easy to detect: if a root vulnerability is not detected, repairing the shallow vulnerabilities it causes will not help; conversely, if the fundamental vulnerability is detected and fixed, the other repetitive vulnerabilities disappear. This requires vulnerability detection and mining tools to deeply understand the semantic information related to the vulnerability, so as to detect the root vulnerability at its source. For this, deep NLP technology provides great possibilities.

A. Vulnerability introduction

Software vulnerabilities are defined as follows [81]: a software vulnerability is an instance of a flaw, caused by a mistake in the design, development, or configuration of software, such that it can be exploited to violate some explicit or implicit security policy.

Vulnerability detection and analysis methods are divided into three types according to whether the detected code is executed or not:

• Static analysis refers to the use of additional detection programs to examine programs that are suspected of containing vulnerabilities. During the analysis, the program under test does not need to be executed; only its source code is required;
• Dynamic analysis refers to reproducing the execution environment of the software under test, selecting the test cases required for its execution, executing the program, monitoring the execution process and the changes of variables, and finding the vulnerabilities exposed during execution in time;
• Hybrid analysis, as the name suggests, refers to using dynamic and static analysis together. However, this does not essentially improve the accuracy of the analysis, because while combining the strengths of static and dynamic analysis it also inherits their respective shortcomings.

Dynamic analysis techniques such as fuzzing or taint analysis [82]–[86] are not within the scope of this article; we only discuss static analysis techniques.
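To make the notion of a source-level vulnerability concrete, the illustrative snippet below (invented for this survey, not taken from the paper or from any CVE entry) shows a flaw of the kind a static analyzer can flag from the source code alone, together with a safer rewrite.

```python
# Illustrative example: a command-injection flaw visible from source code alone.
import subprocess

def ping_unsafe(host: str) -> int:
    # VULNERABLE: untrusted input is concatenated into a shell command, so an
    # attacker can pass e.g. "8.8.8.8; rm -rf /" and violate the security policy.
    return subprocess.call("ping -c 1 " + host, shell=True)

def ping_safe(host: str) -> int:
    # SAFER: the argument is passed as data and never interpreted by a shell.
    return subprocess.call(["ping", "-c", "1", host])
```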
According to whether the detection technique uses the Transformer architecture, we artificially divide the approaches into two categories: one is detection models based on traditional DL technology, and the other is NLP pre-training detection models based on the Transformer architecture (Section III-B). Detection models based on traditional DL technology, such as LSTM/GRU/Bi-LSTM models, generally perform source code vulnerability detection in two stages:

1) The first stage is to segment the source code and extract its features. There are two ways to store the segmentation results:

• One is storage based on abstract syntax trees (AST): code attributes and AST analysis tools are used to decompose the source code into AST form, and vulnerability analysis [87] or other tasks are then performed on the AST. For example, Alon uses path-AST (pAST) representations to complete the code completion task [88].
• The other is storage based on graphs. Most graph-based segmentation results are saved as Code Property Graphs (CPG) [89]. A CPG integrates the AST, the control flow graph (CFG) and the program dependence graph (PDG), so more code feature information is extracted and the final vulnerability detection results are relatively better, because the vulnerable code in the CPG provides more vulnerability information for the model, as shown in Figure 9. Most of the literature now uses the CPG to extract code features. For example, in [90] the CPG is called an augmented AST; in Devign [91], the sequential logic relationship between source code statements (Natural Code Sequence, NCS) is effectively another form of the AST within the CPG. It is worth mentioning that the data set used in Devign is widely used by many researchers and is open source. There are many tools to generate a CPG; you can directly use Joern, DG [92] and other tools, whose use is closely tied to LLVM. As for AST generation, the https://github.com/Kolkir/code2seq library provides AST generation tools for programming languages such as Java, C++, C, C# and Python (a minimal sketch of AST extraction follows below).

Fig. 9. By graph splitting, the red-shaded code elements are the most contributing for vulnerability detection [93]
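As a minimal illustration of this first stage (it uses Python's built-in ast module rather than any of the tools named above, and the parsed snippet is invented), the sketch below decomposes a function into an AST and a flat node/identifier sequence that could be fed to a model.

```python
# A minimal illustration of stage 1: decompose source code into an abstract syntax
# tree and derive a simple sequence of node types and identifiers from it.
import ast

source = """
def copy_input(dst, src):
    for i in range(len(src)):
        dst[i] = src[i]      # no bounds check on dst
"""

tree = ast.parse(source)                 # build the AST
print(ast.dump(tree)[:200], "...")       # raw tree structure (truncated)

def ast_sequence(node):
    # Pre-order walk keeping node types plus identifier names; real systems use
    # richer representations such as the path-contexts of code2seq [104].
    seq = [type(node).__name__]
    if isinstance(node, ast.Name):
        seq.append(node.id)
    for child in ast.iter_child_nodes(node):
        seq.extend(ast_sequence(child))
    return seq

print(ast_sequence(tree))
```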

2) The second stage is modeling and training, where the input is the output of the previous stage. According to whether a graph neural network (GNN) model is used, the approaches can be divided into two categories (a Bi-LSTM sketch for the non-GNN path is given after this list):

• Using a non-GNN model: Generally, the output of the first stage is in graph form, so the graph needs to be encoded and converted into vectors before being fed to the model. In a series of works such as SySeVR [94] and VulDeeLocator [95]–[98], the code is segmented at the token level into slices with relatively strong internal semantic information, and Word2Vec [52] is then used to convert the slice code into a vectorized representation. In the modeling phase, these algorithms use LSTM or GRU and their variants Bi-LSTM, Bi-GRU and other models for training, and the final results perform well on their respective artificially synthesized data sets. But new research [93] shows that, when tested on real data, the accuracy of the VulDeePecker [95] model drops to 11.12%. This result is both unexpected and reasonable: the LSTM or Bi-LSTM model itself is not very good at processing vulnerability information, its ability to extract the relevant vulnerability features is limited (that is, the generalization ability of the model is not strong), and the data in real data sets is unbalanced (there is far more non-vulnerable data than vulnerable data), which causes the accuracy of the VulDeePecker model to drop by more than 50%.
• Using a GNN model: Since the source code sliced in the first stage is generally saved as a graph, it is natural, continuing this logic, to use a GNN for modeling and training. In Devign [91], the gated graph neural network (GGNN) [99], [100] model is used. The advantage is that the information in the entire graph structure can be fully considered and no information about adjacent nodes is lost, which is more suitable for representing the semantic graph structure in vulnerability detection tasks. When dealing with real data sets, the performance of existing detection models based on traditional DL technology is not very good, due to problems such as data imbalance and data duplication in real data sets. ReVeal [93] can be used as a configurable vulnerability prediction tool; it focuses on solving the problem of data imbalance in real data sets and uses representation learning to address problems such as the model's insufficient recognition of the boundary between vulnerable and non-vulnerable code, as shown in Figure 10, which plots that boundary under different models. In addition, Wang [90] uses transfer learning in the model to deal with the problem of insufficient data.

Fig. 10. t-SNE plots illustrating the separation between vulnerable (denoted by +) and non-vulnerable (denoted by ◦) examples [93]
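The sketch below illustrates the non-GNN path described in the first bullet: Word2Vec-style slice embeddings fed to a Bi-LSTM whose pooled states drive a binary vulnerable/not-vulnerable classifier. It is not the actual VulDeePecker or SySeVR implementation, and all dimensions are illustrative.

```python
# A minimal Bi-LSTM classifier over embedded code slices (illustrative dimensions).
import torch
import torch.nn as nn

class BiLSTMVulnClassifier(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 2)   # vulnerable / not vulnerable

    def forward(self, slice_vectors):
        # slice_vectors: (batch, tokens, embed_dim), e.g. Word2Vec vectors of a slice
        outputs, _ = self.lstm(slice_vectors)
        pooled = outputs.mean(dim=1)                     # average over token positions
        return self.classifier(pooled)                   # logits for the two classes

model = BiLSTMVulnClassifier()
fake_slices = torch.randn(8, 50, 64)                     # 8 slices, 50 tokens each
logits = model(fake_slices)
print(logits.shape)                                      # torch.Size([8, 2])
```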
B. New era of vulnerability detection

In the NLP field, the best models so far are models such as BERT [5], GPT [4] and their extensions, which all use the Transformer as the feature extractor. Since code is also a special kind of text data, it is natural to think of using these excellent models, such as BERT, for vulnerability detection. Listed below are some of the latest models that apply NLP technology to code intelligence (CI) tasks. These models share a common characteristic: the training process is divided into two stages, pre-training and fine-tuning, and the specific vulnerability detection task is generally completed in the fine-tuning stage.

1) CodeBERT [6]: CodeBERT is a model developed by Microsoft for code intelligence tasks. CodeBERT is trained on bimodal data [101]–[103], where the two modalities are natural language (NL) and programming language (PL); here NL refers to the natural language annotations of the program code. In addition to pre-training on the NL-PL bimodal data, CodeBERT is also trained on unimodal (code-only) data from six programming languages. To better adapt to this setting, the standard masked language modeling (MLM) and replaced token detection (RTD) objectives are used for training, as shown in Figure 11. It is worth mentioning that no model is a panacea: when CodeBERT is used for code generation tasks, it does not perform as well as the code2seq [104] model. In code2seq, Alon uses the concept of path-contexts to extract more relevant semantic information from the code than CodeBERT, which trains on the source code alone. Later, GraphCodeBERT [105], an extended version of CodeBERT, takes the internal structure of the code into account: in the pre-training stage, the semantic-level structure of data flow is used to make the model more effective. On the four downstream tasks of code search, clone detection, code translation and code refinement, GraphCodeBERT achieved the best performance.

Fig. 11. CodeBERT training model [6]
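The sketch below shows one way such a pre-trained code model could be used for vulnerability (defect) classification through the Transformers package. The publicly released "microsoft/codebert-base" checkpoint name and the code snippet are assumptions for illustration; the classification head is untrained until fine-tuned on labeled functions.

```python
# A minimal sketch: CodeBERT as the encoder of a binary vulnerable / not-vulnerable
# classifier, to be fine-tuned on labeled functions before its scores mean anything.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("microsoft/codebert-base",
                                                           num_labels=2)

code = "void copy(char *dst, char *src) { strcpy(dst, src); }"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():                      # untrained head: scores are placeholders
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```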
2) CodeXGLUE [106]: In code intelligence research, results are more convincing when a benchmark data set is provided. CodeXGLUE provides three types of model architecture (CodeBERT, CodeGPT, and a code encoder-decoder) to help researchers quickly tackle problems in code intelligence. The code intelligence problems that CodeXGLUE covers are broken down into the following four categories of sub-tasks:

• code-code: clone detection, defect detection, cloze test, code completion, code repair, code-to-code translation;
• text-code: natural language code search, text-to-code generation;
• code-text: code summarization;
• text-text: documentation translation.

For detailed descriptions of these sub-tasks, see [106]. With the latest pre-trained CodeBERT model, accuracy on the downstream task of insecure code detection has been increased to 65.3% (previously 62.08%).

3) PLBART [107]: PLBART applies the BART [72] framework to code intelligence, where PL refers to programming language. PLBART continues the denoising auto-encoding strategy of BART, using three ways to add noise: token masking, token deletion, and token infilling. In the fine-tuning stage, the authors use the four major tasks of code summarization, code generation, code translation, and code classification as the downstream tasks of PLBART.

Code intelligence (CI) tasks refer to a series of tasks involving operations on source code that are solved using artificial intelligence methods. Common code intelligence tasks are divided into four categories of sub-tasks; this classification follows the same rules as the classification of problems in NLP, except that NLP is oriented to the macro concept of "text", and code is simply a special kind of "text". We can therefore regard vulnerability detection as a sub-task of code intelligence. The advantage of doing so is that more training samples and more general pre-training models become available. Applying advanced NLP models to CI, and using the powerful feature extraction capabilities of deep learning to extract the relevant semantic information from code, has gradually become a research hotspot. Research at this stage is mainly focused on the representation of vulnerability information; in other words, if deeper vulnerability information can be excavated, the ability to identify, judge and repair vulnerabilities will be greatly improved.

REFERENCES

[1] Junaid Akram and Ping Luo. Sqvdt: A scalable quantitative vulnerability detection technique for source code security assessment. Software: Practice and Experience, 51(2):294–318, 2021.
[2] Common Vulnerabilities and Exposures (CVE). Available at http://cve.mitre.org.
[3] Yonghee Shin, Andrew Meneely, Laurie Williams, and Jason A. Osborne. Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities. IEEE Transactions on Software Engineering, 37(6):772–787, 2010.
[4] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[6] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.
[7] Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. Text classification algorithms: A survey. Information, 10(4):150, 2019.
[8] Vikas Yadav and Steven Bethard. A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470, 2019.
[9] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 2020.
[10] Shantanu Kumar. A survey of deep learning methods for relation extraction. arXiv preprint arXiv:1705.03645, 2017.
[11] Daria Dzendzik, Carl Vogel, and Jennifer Foster. English machine reading comprehension datasets: A survey, 2021.
[12] Razieh Baradaran, Razieh Ghiasi, and Hossein Amirkhani. A survey on machine reading comprehension systems. arXiv preprint arXiv:2001.01582, 2020.
[13] Changchang Zeng, Shaobo Li, Qin Li, Jie Hu, and Jianjun Hu. A survey on machine reading comprehension—tasks, evaluation metrics and benchmark datasets. Applied Sciences, 10(21):7640, 2020.
[14] Chenhui Chu and Rui Wang. A survey of domain adaptation for neural machine translation. arXiv preprint arXiv:1806.00258, 2018.
[15] Shuoheng Yang, Yuxin Wang, and Xiaowen Chu. A survey of deep learning techniques for neural machine translation. arXiv preprint arXiv:2002.07526, 2020.
[16] Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. A comprehensive survey of multilingual neural machine translation, 2020.
[17] Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, and Xuanjing Huang. Heterogeneous graph neural networks for extractive document summarization. arXiv preprint arXiv:2004.12393, 2020.
[18] Mudasir Mohd, Rafiya Jan, and Muzaffar Shah. Text document summarization using word embedding. Expert Systems with Applications, 143:112958, 2020.
[19] Tahseen Sultana and Srinivasu Badugu. A review on different question answering system approaches. pages 579–586, 2020.
[20] Zahra Abbasiyantaeb and Saeedeh Momtazi. Text-based question answering from information retrieval and deep neural network perspectives: A survey. arXiv preprint arXiv:2002.06612, 2020.
[21] Sulabh Katiyar and Samir Kumar Borgohain. Comparative evaluation of cnn architectures for image caption generation. arXiv preprint arXiv:2102.11506, 2021.
[22] Harshit Parikh, Harsh Sawant, Bhautik Parmar, Rahul Shah, Santosh Chapaneri, and Deepak Jayaswal. Encoder-decoder architecture for image caption generation. In 2020 3rd International Conference on Communication System, Computing and IT Applications (CSCITA), pages 174–179. IEEE, 2020.
[23] Saloni Kalra and Alka Leekha. Survey of convolutional neural networks for image captioning. Journal of Information and Optimization Sciences, 41(1):239–260, 2020.
[24] Gobinda G Chowdhury. Natural language processing. Annual Review of Information Science and Technology, 37(1):51–89, 2003.
[25] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 2020.
[26] Md Zahangir Alom, Tarek M Taha, Chris Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Mahmudul Hasan, Brian C Van Essen, Abdul AS Awwal, and Vijayan K Asari. A state-of-the-art survey on deep learning theory and architectures. Electronics, 8(3):292, 2019.
[27] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[29] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[30] Yoon Kim. Convolutional neural networks for sentence classification, 2014.
[31] Rie Johnson and Tong Zhang. Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058, 2014.
[32] Rie Johnson and Tong Zhang. Semi-supervised convolutional neural networks for text categorization via region embedding. Advances in Neural Information Processing Systems, 28:919, 2015.
[33] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820, 2015.
[34] Thien Huu Nguyen and Ralph Grishman. Relation extraction: Perspective from convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 39–48, 2015.
[35] Rebecca Russell, Louis Kim, Lei Hamilton, Tomo Lazovich, Jacob Harer, Onur Ozdemir, Paul Ellingwood, and Marc McConley. Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 757–762. IEEE, 2018.
[36] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
[37] Umut Sulubacak, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. Multimodal machine translation through visuals and speech, 2019.
[38] Chao Zhang, Zichao Yang, Xiaodong He, and Li Deng. Multimodal intelligence: Representation learning, information fusion, and applications. IEEE Journal of Selected Topics in Signal Processing, 14(3):478–493, 2020.
[39] Sulabh Katiyar and Samir Kumar. Comparative evaluation of cnn architectures for image caption generation. International Journal of Advanced Computer Science and Applications, 11(12), 2020.
[40] Mitsuo Kawato, Kazunori Furukawa, and Ryoji Suzuki. A hierarchical neural-network model for control and learning of voluntary movement. Biological Cybernetics, 57(3):169–185, 1987.
[41] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[42] Sepp Hochreiter. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[43] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
[44] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923, 2017.
[45] Zhenjin Dai, Xutao Wang, Pin Ni, Yuming Li, Gangmin Li, and Xuming Bai. Named entity recognition using bert bilstm crf for chinese electronic health records. In 2019 12th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pages 1–5. IEEE, 2019.
[46] Rabah Alzaidy, Cornelia Caragea, and C Lee Giles. Bi-lstm-crf sequence labeling for keyphrase extraction from scholarly documents. In The World Wide Web Conference, pages 2551–2557, 2019.
[47] Amir Bakarov. A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536, 2018.
[48] Soubraylu Sivakumar, Lakshmi Sarvani Videla, T Rajesh Kumar, J Nagaraj, Shilpa Itnal, and D Haritha. Review on word2vec word embedding neural net. In 2020 International Conference on Smart Electronics and Communication (ICOSEC), pages 282–290. IEEE, 2020.
[49] Tomasz Limisiewicz and David Mareček. Syntax representation in word embeddings and neural networks–a survey. arXiv preprint arXiv:2010.01063, 2020.
[50] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631, 2019.
[51] Zellig S Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
[52] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[53] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[54] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[55] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation, 2016.
[56] Erion Çano and Maurizio Morisio. Word embeddings for sentiment analysis: a comprehensive empirical survey. arXiv preprint arXiv:1902.00753, 2019.
[57] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
[58] Qi Liu, Matt J Kusner, and Phil Blunsom. A survey on contextual embeddings. arXiv preprint arXiv:2003.07278, 2020.
[59] Alessio Miaschi and Felice Dell'Orletta. Contextual and non-contextual word embeddings: an in-depth linguistic investigation. In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 110–119, 2020.
[60] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[61] Sneha Chaudhari, Gungor Polatkan, Rohan Ramanath, and Varun Mithal. An attentive survey of attention models. arXiv preprint arXiv:1904.02874, 2019.
[62] Dichao Hu. An introductory survey on attention mechanisms in nlp problems. In Proceedings of SAI Intelligent Systems Conference, pages 432–448. Springer, 2019.
[63] Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
[64] Benyamin Ghojogh and Ali Ghodsi. Attention mechanism, transformers, bert, and gpt: Tutorial and survey. 2020.
[65] Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.
[66] Adrian MP Braşoveanu and Răzvan Andonie. Visualizing transformers for nlp: A brief survey. In 2020 24th International Conference Information Visualisation (IV), pages 270–279. IEEE, 2020.
[67] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[68] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[69] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
[70] Wilson L Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[71] MV Koroteev. Bert: A review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943, 2021.
[72] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[73] Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, pages 1–26, 2020.
[74] Ismael Garrido-Muñoz, Arturo Montejo-Ráez, Fernando Martínez-Santiago, and L Alfonso Ureña-López. A survey on bias in deep nlp. Applied Sciences, 11(7):3184, 2021.
[75] Thomas Wolf, Julien Chaumond, Lysandre Debut, Victor Sanh, Clement Delangue, Anthony Moi, Pierric Cistac, Morgan Funtowicz, Joe Davison, Sam Shleifer, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, 2020.
[76] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
[77] Alan Ramponi and Barbara Plank. Neural unsupervised domain adaptation in nlp—a survey. arXiv preprint arXiv:2006.00632, 2020.
[78] Zaid Alyafeai, Maged Saeed AlShaibani, and Irfan Ahmad. A survey on transfer learning in natural language processing. arXiv preprint arXiv:2007.04239, 2020.
[79] Guanjun Lin, Sheng Wen, Qing-Long Han, Jun Zhang, and Yang Xiang. Software vulnerability detection using deep neural networks: a survey. Proceedings of the IEEE, 108(10):1825–1848, 2020.
[80] National Vulnerability Database (NVD). Available at https://nvd.nist.gov.
[81] Seyed Mohammad Ghaffarian and Hamid Reza Shahriari. Software vulnerability analysis and discovery using machine-learning and data-mining techniques: A survey. ACM Computing Surveys, 50(4), 2017.
[82] Zaoyu Wei, Jiaqi Wang, Xueqi Shen, and Qun Luo. Smart contract fuzzing based on taint analysis and genetic algorithms. Journal of Quantum Computing, 2(1):11, 2020.
[83] Heribertus Yulianton, Agung Trisetyarso, Wayan Suparta, Bahtiar Saleh Abbas, and Chul Ho Kang. Web application vulnerability detection using taint analysis and black-box testing. In IOP Conference Series: Materials Science and Engineering, volume 879, page 012031. IOP Publishing, 2020.
[84] James Fell. A review of fuzzing tools and methods. Technical Report, https://dl.packetstormsecurity.net/papers/general/a..., 2017.
[85] Valentin Jean Marie Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering, 2019.
[86] Yan Wang, Peng Jia, Luping Liu, Cheng Huang, and Zhonglin Liu. A systematic review of fuzzing based on machine learning techniques. PLoS ONE, 15(8):e0237749, 2020.
[87] Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. Generalized vulnerability extrapolation using abstract syntax trees. pages 359–368, 2012.
[88] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[89] Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. Modeling and discovering vulnerabilities with code property graphs. In 2014 IEEE Symposium on Security and Privacy, pages 590–604. IEEE, 2014.
[90] Huanting Wang, Guixin Ye, Zhanyong Tang, Shin Hwei Tan, Songfang Huang, Dingyi Fang, Yansong Feng, Lizhong Bian, and Zheng Wang. Combining graph-based learning with automated data collection for code vulnerability detection. IEEE Transactions on Information Forensics and Security, 2020.
[91] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks, 2019.
[92] Marek Chalupa. Dg: Analysis and slicing of llvm bitcode. In International Symposium on Automated Technology for Verification and Analysis, pages 557–563. Springer, 2020.
[93] Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. Deep learning based vulnerability detection: Are we there yet? arXiv preprint arXiv:2009.07235, 2020.
[94] Yi Hu. A framework for using deep learning to detect software vulnerabilities, 2019.
[95] Deqing Zou, Sujuan Wang, Shouhuai Xu, Zhen Li, and Hai Jin. Vuldeepecker: A deep learning-based system for multiclass vulnerability detection. IEEE Transactions on Dependable and Secure Computing, page 1–1, 2019.
[96] Zhen Li, Deqing Zou, Shouhuai Xu, Zhaoxuan Chen, Yawei Zhu, and Hai Jin. Vuldeelocator: a deep learning-based fine-grained vulnerability detector. arXiv preprint arXiv:2001.02350, 2020.
[97] Deqing Zou, Yawei Zhu, Shouhuai Xu, Zhen Li, Hai Jin, and Hengkai Ye. Interpreting deep learning-based vulnerability detector predictions based on heuristic searching. ACM Transactions on Software Engineering and Methodology (TOSEM), 30(2):1–31, 2021.
[98] Changming Liu, Deqing Zou, Peng Luo, Bin B. Zhu, and Hai Jin. A heuristic framework to detect concurrency vulnerabilities. In Proceedings of the 34th Annual Computer Security Applications Conference, ACSAC '18, pages 529–541, New York, NY, USA, 2018. Association for Computing Machinery.
[99] Daniel Beck, Gholamreza Haffari, and Trevor Cohn. Graph-to-sequence learning using gated graph neural networks. arXiv preprint arXiv:1806.09835, 2018.
[100] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks, 2017.
[101] Dhanesh Ramachandram and Graham W Taylor. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Processing Magazine, 34(6):96–108, 2017.
[102] Wei Chen, Weiping Wang, Li Liu, and Michael S. Lew. New ideas and trends in deep multimodal content understanding: A review, 2020.
[103] Tariq Habib Afridi, Aftab Alam, Muhammad Numan Khan, Jawad Khan, and Young-Koo Lee. A multimodal memes classification: A survey and open research issues. arXiv preprint arXiv:2009.08395, 2020.
[104] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400, 2018.
[105] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Jian Yin, Daxin Jiang, et al. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366, 2020.
[106] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.
[107] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333, 2021.
