

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2821768, IEEE Transactions on Industrial Informatics

Cross-Project Transfer Representation Learning for Vulnerable Function Discovery
Guanjun Lin, Jun Zhang, Member, IEEE, Wei Luo, Lei Pan, Member, IEEE, Yang Xiang, Senior Member, IEEE,
Olivier De Vel and Paul Montague

Abstract—Machine learning is now widely used to detect security vulnerabilities in software, even before the software is released. But its potential is often severely compromised at the early stage of a software project, when we face a shortage of high-quality training data and have to rely on overly generic hand-crafted features. This paper addresses this cold-start problem of machine learning, by learning rich features that generalize across similar projects. To reach an optimal balance between feature richness and generalisability, we devise a data-driven method including the following innovative ideas. First, the code semantics are revealed through serialized abstract syntax trees (ASTs), with tokens encoded by Continuous Bag-of-Words neural embeddings. Next, the serialized ASTs are fed to a sequential deep learning classifier (bidirectional LSTM) to obtain a representation indicative of software vulnerability. Finally, the neural representation obtained from existing software projects is then transferred to the new project to enable early vulnerability detection even with a small set of training labels. To validate this vulnerability detection approach, we manually labeled 457 vulnerable and collected 30,000+ non-vulnerable functions from six open-source projects. The empirical results confirmed that the trained model is capable of generating representations which are indicative of program vulnerability and is adaptable across multiple projects. Compared with traditional code metrics, our transfer-learned representations are more effective for predicting vulnerable functions, both within a project and across multiple projects.

Index Terms—Cross-project, Vulnerability discovery, Representation learning, Transfer learning, Abstract syntax tree.

Guanjun Lin, Wei Luo, Lei Pan are with School of Information Technology, Deakin University, Geelong, VIC 3216, Australia (e-mail: {lingu,wei.luo,l.pan}@deakin.edu.au).
Jun Zhang is with School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia (e-mail: [email protected])
Yang Xiang is with Digital Research & Innovation Capability Platform, Swinburne University of Technology, Melbourne, VIC 3122, Australia (e-mail: [email protected])
Olivier De Vel and Paul Montague are with Defence Science & Technology Group (DSTG), Department of Defence, Australia (e-mail: {Olivier.DeVel, Paul.Montague}@dst.defence.gov.au)

I. INTRODUCTION

Vulnerabilities in software critically undermine the security of computer systems and threaten the IT infrastructure of many government sectors and organizations. For instance, the recently disclosed “Heartbleed” and “Shellshock” vulnerabilities, and a vulnerability in the Server Message Block (SMB) protocol exploited by the WannaCry ransomware have affected a wide range of systems and millions of users worldwide. According to [4] and [26], one of the major causes of security incidents and breaches can be attributed to exploitable vulnerabilities in software. Once a vulnerability is exploited by attackers, companies and organizations may suffer from significant financial loss as well as irreparable damage to reputation [22].

The early detection of vulnerabilities in applications is vital for implementing cost-effective attack mitigation solutions. From the perspective of code execution, techniques for identifying vulnerabilities can be categorized into static, dynamic and hybrid approaches. Static techniques such as rule-based analysis [6], code similarity detection, i.e. code clone detection [8], [9], and symbolic execution [2] mainly rely on the analysis of source code, but often struggle to reveal bugs and vulnerabilities occurring at runtime. Dynamic analysis includes fuzz testing [23] and taint analysis [17], and focuses on detecting vulnerabilities manifested during program execution, but in general has low code coverage. The hybrid approaches combining static and dynamic analysis techniques aim to overcome the aforementioned weaknesses. However, all of these approaches rely on a limited set of known syntactic or behavioral patterns of vulnerabilities, and this deficiency makes it challenging to detect previously unseen vulnerabilities.

Data-driven vulnerability discovery using machine learning (ML) provides a new opportunity for intelligent, effective and efficient vulnerability detection. Existing ML-based approaches primarily operate on source code, which offers better human readability. Researchers have applied source-code-based features such as imports (i.e. header files), function calls [16], software complexity metrics, and code changes [22] as indicators for identifying potentially vulnerable files or code fragments. Moreover, features and information obtained from version control systems such as developer activities [12] and code commits [20] were also adopted for predicting vulnerabilities. Most recently, two studies, VUDDY [9] and VulPecker [10], focused on detecting vulnerable functions and code fragments based on code clone/similarity analysis; nevertheless, both approaches incur a high false negative rate.

However, most existing ML-based approaches focus on software component- or file-level vulnerability detection, which relies on the manual effort and expertise of the code auditor to inspect the code base to accurately pinpoint the exact location of vulnerabilities. Because of the relative scarcity of vulnerabilities, there is insufficient historical vulnerability data for training and validating a statistical model, especially on inactive open-source projects. In this paper, we aim to explore a fine-grained vulnerability detection approach targeting multiple software projects. To overcome the challenges, we propose a framework which solves this problem in three stages (Figure 1). Firstly, we create vulnerability ground truth data

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

[Figure 1 diagram labels: (1) Open source projects, Source Projects with labelled data; (2) Target Projects with limited labelled data, Transferable Representations; (3) Classifiers, Potential Vulnerable Function Ranking List.]

Fig. 1: The proposed framework for vulnerability discovery. It contains 3 stages: the first stage is to pre-train a Bi-LSTM network
using source code projects; the second stage is to feed the trained network with the target project to obtain representations as
features; the last stage is to train an ML classifier with the learned features.

at the function-level. Secondly, we extract features from the abstract syntax trees (ASTs) of each function. Specifically, we use a parser to obtain ASTs in a serialized form by using Depth-first Traversal (DFT). Then, we convert serialized ASTs to equal-length sequences while preserving the structural and semantic features. To further refine these features, we apply a long short-term memory (LSTM) [7] recurrent neural network with Word2vec [13] embeddings for learning a higher level of representations. We hypothesize that the algorithm has the capacity of automatically extracting deep vulnerable programming features that contain much richer information than the shallow features driven by domain knowledge. We also hypothesize that the learned low-level representations are transferable and are independent of software projects using the same programming language. Lastly, given a target project with insufficient labeled data, we apply the same feature extraction process and feed the data to the pre-trained network for learning a subset of representations. Subsequently, the learned representations are used to train a classifier for vulnerability prediction. The empirical study shows that the features extracted using our method are significantly more effective than software code metrics (CMs) in detecting vulnerabilities. Despite the small number of instances labeled in the projects, our algorithm is capable of effectively utilizing available data from other projects for pre-training a basic network, which can then be used for extracting deep AST representations for the projects with insufficient data. Empirical results demonstrate the effectiveness of learned representations which contribute to better detection accuracy than using traditional CMs.

In summary, our contributions are three-fold:
• We propose a framework for function-level vulnerability discovery, which offers a fine-grained detection capability, facilitating quick location of vulnerabilities.
• We develop an approach to extract the sequential features of ASTs which capture the structural and semantic information of functions. Such information reflects the vulnerable programming patterns.
• We construct a bi-directional LSTM (Bi-LSTM) network for effectively extracting deep AST representations that support transferability across software projects. Empirical studies show that the deep AST representations provide the precise identification of vulnerable functions (80% precision is achieved when retrieving the 10 most probable vulnerable functions).

The rest of this paper is organized as follows: Section II presents how features are extracted from ASTs derived from source code functions. Section III describes how to leverage LSTM for obtaining the sequential patterns in ASTs for vulnerability detection. Then, Section IV evaluates the detection performance of the proposed approach using two sets of experiments for the evaluation of the effectiveness of our deep AST representations and transfer representation learning. Section VI concludes this paper.

II. FUNCTION LEVEL AST ENGINEERING

We believe that software vulnerabilities are often reflected in the syntactical structure of source code, particularly at the function-level. To capture such features and code properties, we follow the early work of Yamaguchi et al. [25]. The authors assumed that vulnerable programming patterns are associated with many vulnerabilities, and these patterns can be revealed by analyzing the program’s ASTs. An AST is a syntactical structure of source code (for instance, a function), depicting the relationships among the components of the code in a hierarchical tree view, and faithfully representing the function-level control flow (see Figure 2(a) and 2(b)). Compared with the control flow graphs (CFGs), ASTs provide a natural program representation at the function level and preserve more information of the source code, while CFGs usually do not include variable declarations. Therefore, in this paper, we choose ASTs for extracting the latent programming patterns. To achieve this, an AST needs to be serialized for converting to a vector while preserving its structure and semantics. The vector that holds the structural and semantic information can then be leveraged by our proposed Bi-LSTM network for obtaining deep representations capable of distinguishing vulnerable functions from non-vulnerable ones.

A. Robust AST Parsing

Prior to extracting features from ASTs, we have to obtain the ASTs from the source code. ASTs are usually generated by a compiler during the code parsing stage. However, without a working build environment, obtaining ASTs from C/C++ code is non-trivial. With “CodeSensor”, which is a robust parser implemented by [25] based on the concept of island grammars [15], we can extract ASTs from individual source files or even from fragments of function code without the presence of dependent libraries. By feeding CodeSensor with the source code files, the parsed ASTs can be generated in a serialized format, ready for subsequent processing.
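To make the idea of a depth-first serialized AST concrete, the following is a minimal sketch in Python. It does not reproduce CodeSensor or its output format: the `Node` class, the `serialize_dft` helper, and the shape of the toy tree for a function `foo` are all assumptions made for illustration. The sketch only shows how a pre-order depth-first traversal turns a tree into an ordered list of (depth, label) rows, from which a flat sequence of node labels can be read off.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A toy AST node (hypothetical); `label` is the text kept in the sequence."""
    label: str
    children: List["Node"] = field(default_factory=list)


def serialize_dft(node: Node, depth: int = 0) -> List[Tuple[int, str]]:
    """Pre-order depth-first traversal producing (depth, label) rows."""
    rows = [(depth, node.label)]
    for child in node.children:
        rows.extend(serialize_dft(child, depth + 1))
    return rows


# Assumed, simplified tree for something like: int foo(int x) { int y = bar(x); ... }
foo_ast = Node("foo", [
    Node("int"),                                   # return type
    Node("params", [Node("int", [Node("x")])]),    # one int parameter named x
    Node("stmnts", [
        Node("decl", [Node("int"), Node("y")]),    # local declaration
        Node("=", [Node("call", [Node("bar")])]),  # assignment from a call
    ]),
])

rows = serialize_dft(foo_ast)
sequence = [label for _, label in rows]
print(sequence)
```

Because the traversal visits a node before its children and the children from left to right, reordering any subtree changes the resulting sequence, which is why the order of elements carries structural information.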


In Figure 2(c), a serialized AST is organized to fit in a table, presenting a more user-friendly view than the original tree view. The first column is the type of all the nodes; the second column records the depth/layer of each type and the last two columns present the actual names and values of the types in the original function, respectively.

(a) Two examples of C functions
(b) An AST of function foo, with placeholder nodes (thick line), API nodes (dashed line) and syntax node (dotted line) [25]. Node labels in the figure: Func: foo (int); params; stmnts; param: int; decl: int; op: =; call: bar; arg: x; for: (int i=0;….); forinit: int i=0; cond: i<5; forexpr: i++; op: <; op: ++; If: (i==2); cond: i==2; op: ==; return: y; return: (x+y); op: +.
(c) A serialized AST of function foo
Fig. 2: Motivating examples of how function foo is converted to an AST in serialized format.

When referring to the nodes of ASTs, we follow [25] for the naming conventions. An AST, as shown in Figure 2(b), consists of three types of nodes: placeholder nodes, API nodes, and syntax nodes. The placeholder nodes are not actual components of a function, but they link the function components together to form a tree. All ASTs have placeholder nodes such as “params” signifying that its leaf nodes are function parameters, or “stmnts” denoting its leaf being statements of various types. According to [25], the API nodes refer to the types of function return values and function parameters. They also can be variable declarations and function calls. For instance, a function which has a “void” return type will have a “void” node. A function taking an int parameter will have an “int” node. The syntax nodes are syntactic elements that include control flow elements and operators. The control flow elements are derived from while/do while statements, if/else statements, etc. If a function contains an if/else statement, it will have “if” and “else” nodes in its AST. As for operators such as “+”, “-” or “=”, they will remain unchanged.

B. AST Refining

1) Code Structure Preserving: With the parsed ASTs in a serialized format, they need to be transformed to vectors so that they can be processed by ML algorithms. To preserve the structural information of ASTs, we applied the same method addressed in our previous work [11]. Firstly, the ASTs have to be traversed, allowing their components to be assembled in a uniform sequence to form vectors. In this paper, we use a DFT to map the AST elements to vectors. In future work, we will examine whether breadth-first traversal yields better results in the subsequent classification process.

With serialized ASTs, using DFT is straightforward, as the serialized format (Figure 2(c)) is written in a depth-first search sequence from top to bottom. For every AST, we map its nodes to a vector so that each node becomes an element in the vector. To preserve the structural information of ASTs, the sequence of each element matters. Take function foo shown in Figure 2(a) for example: the root of its AST (in Figure 2(b)), being a function’s name: foo, will be mapped to the first element of the vector. The second and third layer of the AST, which usually contains the “params” node, return type and the parameter type, will be the second, third and fourth element of the vector respectively, and likewise for the subsequent nodes. After mapping the AST of function foo to a vector, it will be of the form like [foo, int, params, int, x, stmnts, decl, int, y, =, call, bar,...]. For this textual vector, we treat it as a “sentence” with semantic meanings. The semantic meaning is formed by the elements of the vector and their sequence. For instance, function baz shown in Figure 2(a) is similar to function foo, both of which take a parameter x of type int and return an int value. They also share the same names of local variables and have for, if statements and call the same function. However, they are different in terms of behavior. When mapping function baz to a vector: [baz, int, params, int, x, stmnts, decl, int, y, =, decl, int,...], we can immediately recognize that the sequence of elements of vector baz differs from that of vector foo, in spite of them sharing the same textual content. Therefore, the converted vector should uniquely identify a function. By doing so, we can preserve the structural and contextual information to a large extent.

2) Tokenization & Padding: Since the subsequent ML algorithms take numeric vectors as inputs, textual elements of vectors are mapped to numbers. To achieve this, a mapping is built to link each textual element of vectors to an integer. These integers act as “tokens” which uniquely identify each textual element. For instance, we map type “int” to “1” and keyword “static” to “2”, and so on. By replacing textual elements of vectors with numeric tokens, their sequence remains intact.

Padding and truncation are standard practices to handle vectors of various length and provide a unified length of input vectors. Among the vectors resulting from the previous step,


our longest vector contains more than 2200 elements and the shortest one is less than 5. To balance the length and the over sparsity of vectors, a suitable length should be chosen for padding/truncation. For short vectors, we use zeros for end-of-sequence padding.

3) Code Semantics Preserving: Aside from the structural information, the code semantics also needs to be preserved through word embedding. The word embedding represents each “word” of the input sequence as a dense vector of a fixed dimension. Traditional embedding methods may convert words, in our context, the AST nodes such as type “int”, or operators like “+” and “-” to arbitrary dimensional vectors while ignoring the relationships that may exist between the nodes. To allow the algorithm to leverage the information that is held between the nodes (namely, the elements of vectors) and be more expressive, we apply Word2vec [13] with the Continuous Bag-of-Words (CBOW) model to convert each element of the vectors to word embeddings of 100 dimensions (which is the default setting). We use Word2vec directly at the output of CodeSensor and include all of the code base as well as the type of AST nodes as the corpus of text content for the algorithm to learn from. We visualize the embeddings learned using Word2vec and project the embeddings with Principal Component Analysis (PCA) to a 2-D plane. Figure 3 shows that different types of elements are grouped into separable clusters. Therefore, with Word2vec, the elements of vectors can be represented with semantically meaningful vector representations so that the elements which share similar contexts in the code are located in close proximity in the vector space.

[Figure 3: 2-D scatter plot of AST node embeddings. Labelled points include char, const, int, signed, unsigned, double, float, void, static, param, params, func_name and the operators ==, <, >, <<, >>, ++, --, +, -, +=. Legend: Operators; Variable Types; Function Return Types & AST Nodes.]
Fig. 3: The plot of AST node embeddings learned through Word2vec. PCA was used to project 100-dimensional vectors learned with Word2vec to a 2-D plane. The blue dots represent operators in ASTs; the red dots denote variable types and the green ones are function return types and placeholder nodes. The figure depicts that AST nodes with similar code semantics (e.g., the operators) are grouped together, showing that applying Word2vec produces more meaningful embeddings.

III. TRANSFERABLE REPRESENTATION LEARNING

Traditionally, the tree structure such as an AST can be processed as shallow graph theoretical features. Such features are still not sufficiently robust across different software projects. In this work, we leverage recent advances in Recurrent Neural Networks (RNNs) [14] to reveal deep sequential features that support transferability across different software projects. Based on RNNs, we proposed a Bi-LSTM network, using the LSTM cell as the building block, plus a global max pooling layer to accommodate large ASTs, for extracting the latent sequential features in ASTs.

A. RNN and LSTM

RNNs are capable of extracting complex sequential interactions that are crucial for some prediction tasks. They treat inputs as sequentially dependent data. For a given time step t, an RNN’s output y_t not only depends on the current input x_t, but also on the accumulated information up to the time step t−1. This feature offers RNNs the capability of retaining useful information from the past for the current prediction [18] so that they become powerful tools for solving NLP problems. There is a strong resemblance between ASTs and “sentences” in natural languages, which motivates us to apply RNNs to our scenario. Since the vectors containing components of ASTs in a sequence reflect the structural information, altering the sequence of any element changes the structural and semantic meanings. For example, vector [main, int, decl, int, =, ..., return] means that the main function returns an int type and contains a declaration of a variable of int type. If we change the sequence of “main” and “decl”, the vector [decl, int, main, int, op, ..., return] becomes semantically meaningless because a local variable declaration should be inside of the main function for C/C++. So, we hypothesize that a function with vulnerability will display certain “linguistic” oddities discoverable by RNNs.

Usually, the vulnerable programming patterns in a function can be associated with many lines of code. When we map functions to vectors, patterns linked to vulnerabilities are related to multiple elements of the vector. The standard RNNs are able to handle short-term dependencies such as the element “main” which should be followed by type name “int” or “void”, but they have a problem when dealing with long-term dependencies such as capturing vulnerable programming patterns that are related to many continuous or intermittent elements. Therefore, we use an RNN with LSTM cells to capture long-term dependencies [18] for learning high-level representations of vulnerabilities.

B. Network Architecture

The architecture of our LSTM-based network is illustrated in Figure 4. The configuration of the network (i.e. the choice of activation function, the number of LSTM cells) is fine-tuned based on the experiments. The network takes a “tokens” sequence as input. The first layer is the Word2vec embedding layer mentioned in II-B3 which maps each element of the sequence to a vector in a semantic space where similar elements are close to each other. The second layer is an LSTM layer which contains 64 LSTM units in a bi-directional form (a total of 128 LSTM units). The third layer is a global max pooling layer. Since in our data sets, each function sample contains at most one vulnerability, applying global max pooling to


select the maximum value can help to strengthen potentially vulnerable signals. During the training phase, two dense layers, one of which uses the sigmoid activation function, are added for converging the learned representations to a probability, indicating how likely this sequence is vulnerable. We acquire the output (activations) of the third layer as the learned representative features that are highly abstractive.

C. The Bi-LSTM

RNNs treat inputs dependently. They perform the current task such as predicting a word by examining dependencies across recent information. However, they can only capture the dependencies of the ith word x_i given the previous words such as x_{i−3} : x_{i−1}. For many NLP tasks, checking the previous information is generally insufficient to perform an accurate prediction; the subsequent words such as x_{i+1} : x_{i+3} of word x_i can also be useful. To obtain the dependencies of surrounding words of word x_i, the bi-directional RNN, also known as Bi-RNN [21], is designed to serve this purpose. The Bi-LSTM, which is a variation of Bi-RNN, is similar in essence but captures longer dependencies.

Contextual information is crucial for vulnerability detection. Because the source code is a logical and semantic structure, it is closely connected and tightly coupled. Hence, the occurrence of a vulnerable code fragment is usually related to either previous or subsequent code, or even to both. A vulnerable code fragment usually contains multiple lines of code which can be distributed across a function block. In many cases, it is difficult to exactly pinpoint which line of code actually causes the vulnerability. Hence, the bi-directional implementation of LSTM can help to detect a long-term dependency of both forwards and backwards, which can effectively capture the vulnerable programming patterns.

D. Pre-training for Obtaining Representations

The pre-training phase trains a basic network using historical vulnerability data of several different software projects. These projects (a.k.a. source projects) initialize the network parameters for learning low-level features of vulnerabilities. As depicted in Figure 4, two dense layers are added after the global max pooling layer to form a complete network. The training inputs are AST-based features from both vulnerable and non-vulnerable functions. This makes sure that the hidden nodes capture the sequential interactions that are discriminative of vulnerable programs. The input data are divided into training and validation sets to build and evaluate the model and guide the model tuning processes to maximize the performance. Once the model is trained and the performance is satisfactory, we feed the trained networks with the processed AST-based features of a target project with limited labels and obtain the learned representations from the third layer of the networks. Given a sequence of an arbitrary length as an input fed to the networks, the learned representation is a 128 dimensional vector.

IV. PERFORMANCE EVALUATION

A. Experiment Datasets

To the best of our knowledge, there is no publicly-available, function-level vulnerability ground truth dataset. Although previous studies [25], [24] and [9] have focused on vulnerability detection at the function-level granularity, their data is not publicly accessible. Therefore, we constructed a function-level vulnerability dataset from scratch and have made our data and our code publicly available on GitHub¹. We manually labeled 457 vulnerable functions and collected 32,531 non-vulnerable functions from 6 open-source projects across 1000+ popular releases. We obtained the vulnerability data from the National Vulnerability Database (NVD) and the Common Vulnerabilities and Exposures (CVE). Both NVD and CVE repositories use CVE IDs as unique identifiers for vulnerabilities. CVE IDs are assigned to vulnerabilities, allowing a security professional to quickly access the technical information of known vulnerabilities across multiple CVE-compatible sources. NVD offers a convenient option for searching the known vulnerabilities of a software project. Using the NVD description, we can download the corresponding version of a project and locate each vulnerable function in the software project’s source code, and label it accordingly.

¹ https://github.com/DanielLin1986/TransferRepresentationLearning

Our experiments included six open-source projects: LibTIFF, LibPNG, FFmpeg, Pidgin, VLC Media Player, and Asterisk. Their source code can be obtained from GitHub or their public code base. For each of these projects, we manually labeled the vulnerable functions recorded on CVE and NVD until 1st October 2017. Then, excluding the identified vulnerable functions, we selected the remaining functions as non-vulnerable ones. Table I provides a summary of the related projects and the number of functions used in our experiments.

TABLE I: Source code projects involved in experiments
Project             # Vulnerable Functions Labeled   # Non-vulnerable Functions Used
LibTIFF             96                               777
LibPNG              43                               499
FFmpeg              191                              4921
Pidgin              29                               8050
VLC Media Player    42                               3636
Asterisk            56                               14648

B. Environment and Parameters

The implementation of the Bi-LSTM network used Keras (version 2.0.8) [3] with TensorFlow (version 1.3.0) [1] backend. The random forest algorithm was provided by the scikit-learn package (version 0.19.0) [19]. The Word2vec embedding was provided by the gensim package (version 3.0.1) with all default settings. The computational system is a server running CentOS Linux 7 with two Physical Intel(R) Xeon(R) E5-2690 v3 2.60GHz CPUs and 96GB of RAM.

When mapping ASTs to vectors, we made a trade-off between the tree complexity and a shallow representation. Since a complex AST results in a long vector containing thousands of elements, we need to truncate over-complicated


[Figure 4 diagram, left to right: Tokenized sequence s1 (e.g., 27, 16, 2, 31, 5, …, 62); Word Embedding Layer (em1, em2, em3, …, emn); Bi-directional LSTM Layer (LSTMf and LSTMb cells); Global Max Pooling Layer; Learned Representations (r1, r2, r3, …, r128); Dense Layers.]

Fig. 4: The 5-layer architecture of the proposed LSTM network for learning deep AST representations. During the pre-training phase, the network takes a tokenized sequence converted from an AST as an input. In the representation learning phase, the last two dense layers are removed and the output of the global max pooling layer is used as the learned deep AST representation, which serves as the feature vector for subsequent processing.
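The pooling step named in the caption of Fig. 4 can be sketched in plain Python. This is only an illustration of global max pooling over time steps, not the authors' Keras code: the function name `global_max_pool` and the toy vectors are hypothetical, and the toy width of 4 stands in for the 128 entries of the real representation (presumably 64 units in each of the two LSTM directions, which is an assumption based on Table II).

```python
from typing import List, Sequence


def global_max_pool(step_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Collapse per-token output vectors into one fixed-length vector.

    step_outputs holds one vector per token (the Bi-LSTM output at that
    time step); every vector must have the same width.
    """
    if not step_outputs:
        raise ValueError("need at least one time step")
    width = len(step_outputs[0])
    if any(len(step) != width for step in step_outputs):
        raise ValueError("all time steps must have the same width")
    # Take the maximum over time independently for each feature position.
    return [max(step[i] for step in step_outputs) for i in range(width)]


# Hypothetical toy outputs with width 4 (not values from the paper).
short_fn = [[0.1, 0.5, 0.3, 0.6], [0.2, 0.3, 0.1, 0.9]]
long_fn = [[0.6, 0.3, 0.1, 0.4], [0.2, 0.6, 0.1, 0.9], [0.6, 0.8, 0.1, 0.1]]

short_repr = global_max_pool(short_fn)
long_repr = global_max_pool(long_fn)
```

Both toy functions end up with a vector of the same width even though they contain different numbers of tokens, which is why the pooled output can be handed to a classifier that expects fixed-length features.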

ASTs when converting them to vectors of the same length, balancing information loss against excessively long vectors. We observed that approximately 93% of AST samples are within 1000 elements in length, so we truncated the vectors that have more than 1000 elements and padded those having fewer elements with 0s.

Table II shows the parameters we tuned for pre-training the Bi-LSTM network with source projects. Then, the trained network was fed with the target project to generate representations, which were used as features for training a random forest classifier to acquire a list of functions ranked by their probabilities of being vulnerable.

TABLE II: Tuned parameters for training Bi-LSTM networks

Parameter | Description (value^1)
Embedding dim | The dimensionality of embedding vectors that the elements of AST vectors will be converted to (100).
Data dim | The dimensionality of the input vector (1000).
Bi-LSTM Units | The number of the Bi-LSTM units per layer (64).
Batch Size | The number of training samples that are propagated through the network at a time (32).
Epoch | One forward/backward pass of all the training samples (150).
Monitor | The training will stop if the validation accuracy ceases to increase (val_acc).
Loss function | The choice of loss function to minimize (binary_crossentropy).
Optimizer | The RMSprop with default parameters was used.
Activation | The tanh function is used for hidden layers and the sigmoid function is used for the last layer (tanh, sigmoid).

^1 These values were used in Keras for implementing the Bi-LSTM networks.

C. Evaluation Metrics and Baseline

The performance of our method is measured by the proportion of vulnerable functions returned in a function list. Hence, the metric that we apply for performance evaluation is the top-k precision (denoted as P@K). The metric is usually used in the context of information retrieval systems such as search engines for measuring how many relevant documents are acquired in all the top-k retrieved documents [5]. In our context, the P@K refers to the proportion of vulnerable functions in the top-k retrieved functions. In our experiments, k ranged from 10 to 200 to simulate a practical case where not all code is audited due to time and resource limitations.

We compared our method with the approach of applying traditional CMs, since CMs are quality measures for quantifying programs' complexity in software tests. They are therefore important indicators of software faults and vulnerabilities, as complex code is difficult to comprehend and hence hard to maintain and test, which might introduce faults and vulnerabilities to software systems [4], [22]. Using Understand, a commercial code enhancement tool, we were able to collect function-level CMs from source code. We randomly selected 23 CMs (e.g. Lines of Code, Cyclomatic complexity, Essentials and so on) as features for vulnerability detection to compare with our method using the AST representations as features. Each feature set was used separately to train a random forest classifier for comparison.

Other metrics adopted for measuring the effectiveness of a prediction model in code auditing include the reduced amount of code or files to inspect compared to not using the model. Such metrics were used in [22] as estimators of cost reduction. In our scenario, to quantify the inspection reduction in terms of functions, we define the Function Inspection Reduction Rate (FIRR) for measuring how much effort can be saved with our method compared with the method using CMs as features. The FIRR is the ratio of the reduced number of functions needed for inspection to the number of functions that are randomly selected. Given a recall value achieved by our method, e.g. 60%, one needs to randomly select 60% of the total number of functions in order to achieve the same recall. Therefore, with a recall value achieved by a given method, the number of functions that a random selection needs to select (denoted as $N_{\mathrm{random}}$) can be calculated as:

$$N_{\mathrm{random}} = \mathit{recall} \times N_{\mathrm{total}} \tag{1}$$

where $N_{\mathrm{total}}$ is the total number of functions used for testing.

Given a recall achieved by using CMs, to acquire the same number of vulnerable functions, the number of functions (denoted by $N_{\mathrm{CM}}$) that random selection needs to select can be calculated as

$$N_{\mathrm{CM}} = \mathit{recall}_{\mathrm{CM}} \times N_{\mathrm{total}} \tag{2}$$

Given a recall achieved by applying deep AST representations, to acquire the same number of vulnerable functions, the number


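of functions (denoted by $N_{\mathrm{AST}}$) that random selection needs to select can be calculated as follows:

$$N_{\mathrm{AST}} = \mathit{recall}_{\mathrm{AST}} \times N_{\mathrm{total}} \tag{3}$$

Therefore, the FIRR of using AST-based representations relative to CM-based features can be defined as:

$$\mathit{FIRR}_{\mathrm{AST\text{-}to\text{-}CM}} = \frac{N_{\mathrm{AST}} - N_{\mathrm{CM}}}{N_{\mathrm{AST}}} = 1 - \frac{N_{\mathrm{CM}}}{N_{\mathrm{AST}}} \tag{4}$$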
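The two metrics of this section, P@K and FIRR, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' evaluation script: the function names `precision_at_k` and `firr` and the toy ranking are hypothetical, while the recall counts in the usage lines (24 and 18 of 25 vulnerable functions found in the top 200 of a 741-function FFmpeg testing set) are the ones reported in Section D.

```python
from typing import Sequence


def precision_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """P@K: share of vulnerable functions (label 1) among the top-k entries.

    ranked_labels is ordered from most to least probable vulnerable.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k


def firr(recall_ast: float, recall_cm: float, n_total: int) -> float:
    """Function Inspection Reduction Rate of AST-based over CM-based features."""
    if recall_ast <= 0:
        raise ValueError("recall_ast must be positive")
    n_cm = recall_cm * n_total      # Eq. (2)
    n_ast = recall_ast * n_total    # Eq. (3)
    return (n_ast - n_cm) / n_ast   # Eq. (4), equal to 1 - n_cm / n_ast


# Hypothetical ranking: 8 vulnerable functions among the first 10 returned.
toy_ranking = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
p_at_10 = precision_at_k(toy_ranking, 10)

# Recalls at top 200 from the counts in Section D: 24/25 (AST) and 18/25 (CMs).
firr_top200 = firr(recall_ast=24 / 25, recall_cm=18 / 25, n_total=741)
```

Because $N_{\mathrm{total}}$ appears in both the numerator and the denominator of Eq. (4), it cancels, so the rate depends only on the ratio of the two recalls; with the counts above it comes to 1 - 18/24 = 0.25.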

D. Experiment Settings and Outcomes


1) Deep AST Representations versus CMs, and random selection: We first used the FFmpeg project to evaluate the effectiveness of our AST-based features generated using the proposed approach. We collected 4,921 functions across multiple releases of the FFmpeg project, of which 191 are vulnerable. We used our proposed feature processing technique to process the collected samples and divided them into three data sets: training, validation, and testing, with a ratio of 13:4:3. The training and validation sets were used to train the proposed Bi-LSTM network, and the performance was then evaluated on the testing set. For comparison, we combined the previously divided training and validation sets into a new training set and used the 23 CMs as features for training a random forest classifier. Subsequently, we validated the trained classifier on the same testing set.

The testing set contains 741 functions, among which there are 25 vulnerable samples. Figure 5 shows that the green line, which is the precision achieved using deep AST representations, lies above the blue line, which is the precision obtained using CMs as features. When retrieving the 10 most probable vulnerable functions, 8 vulnerable functions were identified using our method. With CMs, only 5 were found. Although both methods identified 9 functions when searching for the 20 most probable vulnerable functions, our method achieved higher precision when retrieving more functions. The reason was that using CMs as features helped to pick the vulnerable functions that were long and complex. Within the top 20 functions, there are some long vulnerable functions that were found by both approaches. As more functions were examined, the CM-based features generated more false positives, as almost all of the long functions were recognized as vulnerable by CMs. When examining the top 200 most probable vulnerable functions, our method found 24 out of 25 actual vulnerable functions, while with CMs only 18 were identified, and with random selection one could only find 6 vulnerable functions by chance. When the top 500 most probable vulnerable functions were returned, both the AST-based and the CM-based methods achieved 100% recall.

Fig. 5: Precision comparison between deep AST representations (AST-based), CMs and random selection on FFmpeg.

2) Transfer-Learned Representations versus CMs, and random selection: We conducted three experiments to demonstrate the effectiveness of the proposed approach in applying knowledge learned from the source projects to the target project for vulnerability detection. Among the six projects listed in Table I, we chose one project as the target project and the other five as the source projects. We used the functions from the source projects to train a network, and we then fed the trained network with the target project code to obtain the deep AST representations. Then, we further divided the target project into training and testing parts with a 1-to-3 ratio to simulate the scenario in which there is usually insufficient labeled data for vulnerability discovery. Lastly, on these training and testing parts, we trained separate random forest classifiers with the obtained representations and with the CMs.

Figure 6(a) shows the comparison results of our method, CM-based features and random selection on project LibTIFF. The testing set contains 583 functions, among which there are 71 vulnerable ones. When examining the top 10 functions, our approach achieved 100% precision. With the 20 most probable vulnerable functions retrieved, 17 vulnerable functions were found. Figure 6(b) illustrates the comparison results on LibPNG with 347 non-vulnerable functions and 28 vulnerable functions. With our approach, the 28% precision at the top 100 means that checking the 100 most probable vulnerable functions identified all of the vulnerable functions. However, with CM-based features, one needs to examine the top 200 function list to achieve the same goal. Similar to the results on LibTIFF, the test on project FFmpeg shows that 100% precision was achieved on the top 10 function list, as shown in Figure 6(c). In the testing dataset for FFmpeg, there are 143 vulnerable functions out of 3691 functions in total. Despite the imbalance between vulnerable and non-vulnerable samples, checking around 5% of the total number of functions can discover 43% of vulnerable samples with our approach, while using CM-based features one can only discover 36% of vulnerable functions. Hence, the results showed that the transfer-learned deep AST representations were more effective than the human-defined CMs on our datasets. However, we also observed that when retrieving 500 functions, both the AST-based and the CM-based methods can identify all the vulnerable functions.

3) Measuring Function Inspection Reduction: The FIRR provides an estimator of cost and effort in terms of the reduction in the number of functions needed for code inspection, since the purpose of adopting the ML technique is to minimize cost and human effort. Figure 7 shows the overall trend of the FIRR of using deep AST representations relative to applying CMs as features on three projects. Generally, compared with the method using CMs as features, our approach can significantly reduce the number of functions for inspection when retrieving fewer than 100 functions. Among

three projects, our approach performed well on FFmpeg, as a maximal reduction of 57% of functions for inspection was observed. On LibPNG, however, the number of reduced functions was not as large as that on the other projects. Our future work will further investigate how project differences affect the performance of our approach.

[Figure 6 plots: precision (0% to 100%) at Top-10, Top-20, Top-50, Top-100, Top-150 and Top-200 for AST-based, Code Metrics and Random Selection. Panels: (a) Testing on LibTIFF, (b) Testing on LibPNG, (c) Testing on FFmpeg.]

Fig. 6: Precision comparison among the transfer-learned representations (AST-based), CMs, and random selection on different projects. The general trend in the 3 figures shows that the proposed method using transfer-learned representations as features gained the best precision.

Fig. 7: FIRR of transfer-learned AST representations to CMs as features.

E. Discussion

Our proposed approach takes source code functions as inputs, which facilitates the filtering of potentially vulnerable functions during a development process. However, for vulnerabilities that involve multiple functions or multiple files, our method is not directly applicable. This will be addressed in our future research.

When applying the LSTM network for representation learning, the training process on 23,000+ samples took up to 6 hours on our server. In practice, this problem can be mitigated by training the network offline so that the trained network is ready for extracting the learned representations.

Another challenge arises from the severe imbalance between the two classes, which affects the detection performance. To overcome this, we apply a random forest classifier, which is an ensemble classifier, for training on the learned representations. We believe that the detection performance can be further improved by addressing the imbalance issues with techniques such as oversampling or undersampling. Additionally, the empirical results showed that some vulnerable functions are in the top 50 retrieved list when using CMs as features but not in the top 50 list when using AST representations. It will be an interesting research direction to combine the two methods for better identification of vulnerable functions.

V. CONCLUSIONS

Our framework leverages ASTs and their deep representations to convert project-specific source code to project-agnostic features capturing deeper structures and semantics of functions. This allows vulnerable programming patterns learned from software source projects to facilitate the representation generation on a target project for better vulnerability prediction. To learn the deep AST representations, we first process the ASTs by converting them to sequences of elements. Following this, the ASTs are tokenized and mapped to vectors that preserve code semantics. To concisely capture the rather complex sequential interactions among different program segments, we propose a deep-learning network architecture to learn high-level abstractions, which we refer to as deep AST representations. Trained with historical software vulnerability data, these representations become refined features reflecting the intrinsic patterns indicative of a software vulnerability. Finally, these learned features are used to train a random forest algorithm and to obtain a ranking list showing the functions that are most probably vulnerable.

REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow:


A system for large-scale machine learning,” in OSDI, vol. 16, 2016, pp. 265–283.
[2] C. Cadar, D. Dunbar, D. R. Engler et al., “Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in OSDI, vol. 8, 2008, pp. 209–224.
[3] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
[4] I. Chowdhury and M. Zulkernine, “Using complexity, coupling, and cohesion metrics as early indicators of vulnerabilities,” JSA, vol. 57, no. 3, pp. 294–313, 2011.
[5] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University: Cambridge University Press, 2009.
[6] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: A general approach to inferring errors in systems code,” in SIGOPS Operating Systems Review, vol. 35, no. 5. ACM, 2001, pp. 57–72.
[7] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[8] J. Jang, A. Agrawal, and D. Brumley, “Redebug: finding unpatched code clones in entire os distributions,” in S&P, 2012 Symposium on. IEEE, 2012, pp. 48–62.
[9] S. Kim, S. Woo, H. Lee, and H. Oh, “Vuddy: A scalable approach for vulnerable code clone discovery,” in S&P, 2017 Symposium on. IEEE, 2017, pp. 595–614.
[10] Z. Li, D. Zou, S. Xu, H. Jin, H. Qi, and J. Hu, “Vulpecker: an automated vulnerability detection system based on code similarity analysis,” in Proceedings of the 32nd ACSAC. ACM, 2016, pp. 201–213.
[11] G. Lin, J. Zhang, W. Luo, L. Pan, and Y. Xiang, “Poster: Vulnerability discovery with function representation learning from unlabeled projects,” in Proceedings of the 2017 SIGSAC Conference on CCS. ACM, 2017, pp. 2539–2541.
[12] A. Meneely and L. Williams, “Secure open source collaboration: an empirical study of linus’ law,” in Proceedings of the 16th Conference on CCS. ACM, 2009, pp. 453–462.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[14] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[15] L. Moonen, “Generating robust parsers using island grammars,” in Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 2001, pp. 13–22.
[16] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting vulnerable software components,” in Proceedings of the 14th Conference on CCS. ACM, 2007, pp. 529–540.
[17] J. Newsome and D. Song, “Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software,” 2005.
[18] C. Olah, “Understanding LSTM networks,” GitHub blog, posted on August 27, 2015.
[19] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” JMLR, vol. 12, pp. 2825–2830, 2011.
[20] H. Perl, S. Dechand, M. Smith, D. Arp, F. Yamaguchi, K. Rieck, S. Fahl, and Y. Acar, “Vccfinder: Finding potential vulnerabilities in open-source projects to assist code audits,” in Proceedings of the 22nd SIGSAC Conference on CCS. ACM, 2015, pp. 426–437.
[21] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” TSP, vol. 45, no. 11, pp. 2673–2681, 1997.
[22] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities,” TSE, vol. 37, no. 6, pp. 772–787, 2011.
[23] M. Sutton, A. Greene, and P. Amini, Fuzzing: brute force vulnerability discovery. Pearson Education, 2007.
[24] F. Yamaguchi, F. Lindner, and K. Rieck, “Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning,” in Proceedings of the 5th USENIX conference on Offensive technologies. USENIX Association, 2011.
[25] F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized vulnerability extrapolation using abstract syntax trees,” in Proceedings of the 28th ACSAC. ACM, 2012, pp. 359–368.
[26] F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, “Chucky: Exposing missing checks in source code for vulnerability discovery,” in Proceedings of the 2013 SIGSAC CCS. ACM, 2013, pp. 499–510.
