Automated Vulnerability Detection Using Deep Representation Learning

¹Draper, ²Boston University

arXiv:1807.04320v2 [cs.LG] 28 Nov 2018
17th IEEE International Conference on Machine Learning and Applications (IEEE ICMLA 2018), Orlando, Florida, USA
Abstract—Increasing numbers of software vulnerabilities are discovered every year whether they are reported publicly or discovered internally in proprietary code. These vulnerabilities can pose serious risk of exploit and result in system compromise, information leaks, or denial of service. We leveraged the wealth of C and C++ open-source code available to develop a large-scale function-level vulnerability detection system using machine learning. To supplement existing labeled vulnerability datasets, we compiled a vast dataset of millions of open-source functions and labeled it with carefully-selected findings from three different static analyzers that indicate potential exploits. The labeled dataset is available at: https://osf.io/d45bw/. Using these datasets, we developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. We evaluated our tool on code from both real software packages and the NIST SATE IV benchmark dataset. Our results demonstrate that deep feature representation learning on source code is a promising approach for automated software vulnerability detection.

Index Terms—artificial neural networks, computer security, data mining, machine learning

I. INTRODUCTION

Hidden flaws in software can result in security vulnerabilities that potentially allow attackers to compromise systems and applications. Thousands of such vulnerabilities are reported publicly to the Common Vulnerabilities and Exposures database [1] each year and many more are discovered internally in proprietary code and patched. Recent high-profile exploits have shown that these security holes can have disastrous effects, both financially and societally [5]. These vulnerabilities are often caused by subtle errors made by programmers and can propagate quickly due to the prevalence of open-source software and code reuse.

While there are existing tools for program analysis, these tools typically only detect a limited subset of possible errors based on pre-defined rules. With the recent widespread availability of open-source repositories, it has become possible to use data-driven techniques to discover vulnerability patterns. We present machine learning (ML) techniques for the automated detection of vulnerabilities in C/C++¹ source code learned from real world code examples.

* Correspondence to: [email protected]
† Tomo Lazovich now works at Lightmatter.
¹ While our work focuses on C/C++, the techniques are applicable to any programming language.

II. RELATED WORK

There currently exist a wide variety of analysis tools that attempt to uncover common vulnerabilities in software. Static analyzers, such as Clang [7], do so without needing to execute programs. Dynamic analyzers repeatedly execute programs with many test inputs on real or virtual processors to identify weaknesses. Both static and dynamic analyzers are rule-based tools and thus limited to their hand-engineered rules and not able to guarantee full test coverage of codebases. Symbolic execution [8] replaces input data with symbolic values and analyzes their use over the control flow graph of a program. While it can probe all feasible program paths, symbolic execution is expensive and does not scale well to large programs.

Beyond these traditional tools, there has been significant recent work on the usage of machine learning for program analysis. The availability of large amounts of open-source code opens the opportunity to learn the patterns of software vulnerabilities directly from mined data. For a comprehensive review of learning from "Big Code", including that not directly related to our work, see Allamanis et al. [9].

In the area of vulnerability detection, Hovsepyan et al. [10] used a support vector machine (SVM) on a bag-of-words (BOW) representation of a simple tokenization of Java source code to predict static analyzer labels. However, their work was limited to training and evaluating on a single software repository. Pang et al. [11] expanded on this work by including n-grams in the feature vectors used with the SVM classifier. Mou et al. [12] explored the potential of deep learning for program analysis by embedding the nodes of the abstract syntax tree representations of source code and training a tree-based convolutional neural network for simple supervised classification problems. Li et al. [13] used a recurrent neural network (RNN) trained on code snippets related to library/API function calls to detect two types of vulnerabilities related to the improper usage of those calls. Harer et al. [14] trained an RNN to detect vulnerabilities in the lexed representations of functions in a synthetic codebase, as part of a generative adversarial approach to code repair.
To our knowledge, no work has been done on using deep learning to learn features directly from source code in a large natural codebase to detect a variety of vulnerabilities. The limited datasets (in both size and variety) used by most of the previous works limit the usefulness of the results and prevent them from taking full advantage of the power of deep learning.

III. DATA

Given the complexity and variety of programs, a large number of training examples are required to train machine learning models that can effectively learn the patterns of security vulnerabilities directly from code. We chose to analyze software packages at the function level because it is the lowest level of granularity capturing the overall flow of a subroutine. We compiled a vast dataset of millions of function-level examples of C and C++ code from the SATE IV Juliet Test Suite [6], the Debian Linux distribution [15], and public Git repositories on GitHub [16]. Table I summarizes the number of functions we collected and used from each source in our dataset of over 12 million functions.

TABLE I: Total number of functions obtained from each data source, the number of valid functions remaining after removing duplicates and applying cuts, and the number of functions without and with detected vulnerabilities.

                      SATE IV        GitHub           Debian
  Total               121,353        9,706,269        3,046,758
  Passing curation    11,896         782,493          491,873
  'Not vulnerable'    6,503 (55%)    730,160 (93%)    461,795 (94%)
  'Vulnerable'        5,393 (45%)    52,333 (7%)      30,078 (6%)

The SATE IV Juliet Test Suite contains synthetic code examples with vulnerabilities from 118 different Common Weakness Enumeration (CWE) [1] classes and was originally designed to explore the performance of static and dynamic analyzers. While the SATE IV dataset provides labeled examples of many types of vulnerabilities, it is made up of synthetic code snippets that do not sufficiently cover the space of natural code to provide an appropriate training set alone. To provide a vast dataset of natural code to augment the SATE IV data, we mined large numbers of functions from Debian packages and public Git repositories. The Debian package releases provide a selection of very well-managed and curated code which is in use on many systems today. The GitHub dataset provides a larger quantity and wider variety of (often lower-quality) code. Since the open-source functions from Debian and GitHub are not labeled, we used a suite of static analysis tools to generate the labels. Details of the label generation are explained in Subsection III-C.

A. Source lexing

To generate useful features from the raw source code of each function, we created a custom C/C++ lexer designed to capture the relevant meaning of critical tokens while keeping the representation generic and minimizing the total token vocabulary size. Making our lexed representation of code from different software repositories as standardized as possible empowers transfer learning across the full dataset. Standard lexers, designed for actually compiling code, capture far too much detail, which can lead to overfitting in ML approaches. Our lexer was able to reduce C/C++ code to representations using a total vocabulary size of only 156 tokens. All base C/C++ keywords, operators, and separators are included in the vocabulary. Code that does not affect compilation, such as comments, is stripped out. String, character, and float literals are lexed to type-specific placeholder tokens, as are all identifiers. Integer literals are tokenized digit-by-digit, as these values are frequently relevant to vulnerabilities. Types and function calls from common libraries that are likely to have relevance to vulnerabilities are mapped to generic versions. For example, u32, uint32_t, UINT32, uint32, and DWORD are all lexed as the same generic token representing 32-bit unsigned data types. Learned embeddings of these individual tokens would likely distinguish them based on the kind of code they are commonly used in, so care was taken to build in the desired invariance.
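The lexer itself is not reproduced in the paper, so the following is a minimal Python sketch of the normalization rules described above. The placeholder token names and the tiny keyword and type-alias sets are illustrative assumptions; the actual lexer covers all base C/C++ keywords, operators, separators, and common library names within its 156-token vocabulary.

```python
import re

# Tiny illustrative subsets; the real lexer covers all C/C++ keywords,
# operators, separators, and common library names (156 tokens in total).
KEYWORDS = {"int", "char", "if", "else", "for", "while", "return", "new", "delete"}
UINT32_ALIASES = {"u32", "uint32_t", "UINT32", "uint32", "DWORD"}

TOKEN_RE = re.compile(r"""
      //[^\n]*                              # line comment
    | /\*.*?\*/                             # block comment
    | "(?:\\.|[^"\\])*"                     # string literal
    | '(?:\\.|[^'\\])*'                     # character literal
    | \d+\.\d+                              # float literal
    | \d+                                   # integer literal
    | [A-Za-z_]\w*                          # identifier or keyword
    | ->|\+\+|--|<<|>>|<=|>=|==|!=|&&|\|\|  # multi-character operators
    | [{}()\[\];,.<>=+\-*/%&|^!~?:]         # single-character operators/separators
""", re.VERBOSE | re.DOTALL)

def lex_function(source):
    """Map raw C/C++ source to a generic, small-vocabulary token sequence."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok.startswith(("//", "/*")):
            continue                          # comments do not affect compilation
        elif tok.startswith('"'):
            tokens.append("<str>")            # string literal placeholder
        elif tok.startswith("'"):
            tokens.append("<char>")           # character literal placeholder
        elif re.fullmatch(r"\d+\.\d+", tok):
            tokens.append("<float>")          # float literal placeholder
        elif tok.isdigit():
            tokens.extend(tok)                # integer literals kept digit by digit
        elif tok in UINT32_ALIASES:
            tokens.append("<uint32>")         # generic 32-bit unsigned type token
        elif tok in KEYWORDS:
            tokens.append(tok)                # base keywords kept verbatim
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            tokens.append("<id>")             # all other identifiers
        else:
            tokens.append(tok)                # operators and separators
    return tokens

print(lex_function("int * data = new int[10]; // allocate\nuint32_t n = 42;"))
# ['int', '*', '<id>', '=', 'new', 'int', '[', '1', '0', ']', ';',
#  '<uint32>', '<id>', '=', '4', '2', ';']
```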
B. Data curation

One very important step of our data preparation was the removal of potential duplicate functions. Open-source repositories often have functions duplicated across different packages. Such duplication can artificially inflate performance metrics and conceal overfitting, as training data can leak into test sets. Likewise, there are many functions that are near duplicates, containing trivial changes in source code that do not significantly affect the execution of the function. These near duplicates are challenging to remove, as they can often appear in very different code repositories and can look quite different at the raw source level.

To protect against these issues, we performed an extremely strict duplicate removal process. We removed any function with a duplicated lexed representation of its source code or a duplicated compile-level feature vector. This compile-level feature vector was created by extracting the control flow graph of the function as well as the operations happening in each basic block (opcode vector, or op-vec) and the definition and use of variables (use-def matrix)². Two functions with identical instruction-level behaviors or functionality are likely to have both similar lexed representations and highly correlated vulnerability status.

The "Passing curation" row of Table I reflects the number of functions remaining after the duplicate removal process, about 10.8% of the total number of functions pulled. Although our strict duplicate removal process filters out a significant amount of data, this approach provides the most conservative performance results, closely estimating how well our tool will perform against code it has never seen before.

² Our compile-level feature extraction framework incorporated modified variants of strace and buildbot as well as a custom Clang plugin; we omit the details to focus on the ML aspects of our work.
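As a concrete illustration of this curation step, here is a minimal Python sketch that drops any function whose lexed token sequence or compile-level feature vector has already been seen. The dictionary keys and hashing scheme are assumptions; the actual compile-level feature extraction (control flow graph, op-vec, use-def matrix) is not reproduced.

```python
import hashlib

def fingerprint(obj) -> str:
    """Stable hash of a token sequence or feature vector."""
    return hashlib.sha256(repr(obj).encode("utf-8")).hexdigest()

def remove_duplicates(functions):
    """Keep the first occurrence of each distinct lexed representation and
    each distinct compile-level feature vector.

    `functions` is an iterable of dicts with (hypothetical) keys:
      'tokens'   - lexed token sequence (list of str)
      'features' - compile-level feature vector (CFG / op-vec / use-def summary)
    """
    seen_tokens, seen_features = set(), set()
    unique = []
    for fn in functions:
        tok_key = fingerprint(fn["tokens"])
        feat_key = fingerprint(fn["features"])
        # Drop the function if either view of it has already been seen.
        if tok_key in seen_tokens or feat_key in seen_features:
            continue
        seen_tokens.add(tok_key)
        seen_features.add(feat_key)
        unique.append(fn)
    return unique

corpus = [
    {"tokens": ["int", "<id>", "=", "0", ";"], "features": (3, 1, 0)},
    {"tokens": ["int", "<id>", "=", "0", ";"], "features": (3, 1, 1)},  # duplicate lexed form
]
print(len(remove_duplicates(corpus)))  # -> 1
```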
C. Labels

Labeling code vulnerability at the function level was a significant challenge. The bulk of our dataset was made up of mined open-source code without known ground truth. In order to generate labels, we pursued three approaches: static analysis, dynamic analysis, and commit-message/bug-report tagging.

While dynamic analysis is capable of exposing subtle flaws by executing functions with a wide range of possible inputs, it is extremely resource intensive. Performing a dynamic analysis on the roughly 400 functions in a single module of the LibTIFF 3.8.2 package from the ManyBugs dataset [17] took nearly a day of effort. Therefore, this approach was not realistic for our extremely large dataset.

Commit-message based labeling turned out to be very challenging, providing low-quality labels. In our tests, both humans and ML algorithms were poor at using commit messages to predict corresponding Travis CI [18] build failures or fixes. Motivated by recent work by Zhou et al. [19], we also tried a simple keyword search looking for commit words like "buggy", "broken", "error", "fixed", etc. to label before-and-after pairs of functions, which yielded better results in terms of relevancy. However, this approach greatly reduced the number of candidate functions that we could label and still required significant manual inspection, making it inappropriate for our vast dataset.

As a result, we decided to use three open-source static analyzers, Clang, Cppcheck [20], and Flawfinder [21], to generate labels. Each static analyzer varies in its scope of search and detection. For example, Clang's scope is very broad but also picks up on syntax, programming style, and other findings which are not likely to result in a vulnerability. Flawfinder's scope is geared towards CWEs and does not focus on other aspects such as style. Therefore, we incorporated multiple static analyzers and pruned their outputs to exclude findings that are not typically associated with security vulnerabilities, in an effort to create robust labels.

We had a team of security researchers map each static analyzer's finding categories to the corresponding CWEs and identify which CWEs would likely result in potential security vulnerabilities. This process allowed us to generate binary labels of "vulnerable" and "not vulnerable", depending on the CWE. For example, Clang's "Out-of-bound array access" finding was mapped to "CWE-805: Buffer Access with Incorrect Length Value", an exploitable vulnerability that can lead to program crashes, so functions with this finding were labeled "vulnerable". On the other hand, Cppcheck's "Unused struct member" finding was mapped to "CWE-563: Assignment to Variable without Use", a poor code practice unlikely to cause a security vulnerability, so corresponding functions were labeled "not vulnerable" even though static analyzers flagged them. Of the 390 total types of findings from the static analyzers, 149 were determined to result in a potential security vulnerability. Roughly 6.8% of our curated, mined C/C++ functions triggered a vulnerability-related finding. Table II shows the statistics of frequent CWEs in these "vulnerable" functions. All open-source function source code from Debian and GitHub with corresponding CWE labels is available at: https://osf.io/d45bw/.

TABLE II: Statistics of frequent CWEs in the "vulnerable" functions.

  CWE ID              CWE Description                                                            Frequency %
  120/121/122         Buffer Overflow                                                            38.2%
  119                 Improper Restriction of Operations within the Bounds of a Memory Buffer    18.9%
  476                 NULL Pointer Dereference                                                   9.5%
  469                 Use of Pointer Subtraction to Determine Size                               2.0%
  20, 457, 805, etc.  Improper Input Validation, Use of Uninitialized Variable,
                      Buffer Access with Incorrect Length Value, etc.                            31.4%
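The following Python sketch illustrates this labeling rule using the two example findings discussed above. The actual mapping of 390 finding types (149 of them security-relevant) was hand-curated by security researchers and is not reproduced here.

```python
# Two of the 390 finding types discussed in the text; the full mapping was
# hand-built by security researchers and is not reproduced here.
FINDING_TO_CWE = {
    ("clang",    "Out-of-bound array access"): "CWE-805",
    ("cppcheck", "Unused struct member"):      "CWE-563",
}

# CWEs judged likely to yield an exploitable security vulnerability
# (149 of the mapped CWEs fall into this set in the real pipeline).
VULNERABLE_CWES = {"CWE-805"}

def label_function(findings):
    """Binary label for one function from its static-analyzer findings.

    `findings` is a list of (tool, finding_name) pairs. Returns 1
    ('vulnerable') if any finding maps to a security-relevant CWE,
    otherwise 0 ('not vulnerable').
    """
    for finding in findings:
        if FINDING_TO_CWE.get(finding) in VULNERABLE_CWES:
            return 1
    return 0

print(label_function([("clang", "Out-of-bound array access")]))  # -> 1
print(label_function([("cppcheck", "Unused struct member")]))    # -> 0
```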
IV. METHODS

Our primary machine learning approach to vulnerability detection, depicted in Figure 1, combines the neural feature representations of lexed function source code with a powerful ensemble classifier, random forest (RF).

[Figure 1: pipeline diagram. Source code is passed through the lexer to an int sequence, embedded, filtered by convolutional filters, and max-pooled; the resulting learned source features feed both dense classification layers and a random forest classifier that outputs labels.]

Fig. 1: Illustration of our convolutional neural representation-learning approach to source code classification. Input source code is lexed into a token sequence of variable length ℓ, embedded into an ℓ × k representation, filtered by n convolutions of size m × k, and maxpooled along the sequence length to a feature vector of fixed size n. The embedding and convolutional filters are learned by weighted cross entropy loss from fully-connected classification layers. The learned n-dimensional feature vector is used as input to a random forest classifier, which improves performance compared to the neural network classifier alone.

A. Neural network classification and representation learning

Since source code shares some commonalities with natural language writing, and since work on learning directly from programming languages is more limited, we build off approaches developed for natural language processing (NLP) [22]. We leverage feature-extraction approaches similar to those used for sentence sentiment classification with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for function-level source vulnerability classification.

1) Embedding: The tokens making up the lexed functions are first embedded into a fixed k-dimensional representation (limited to the range [−1, 1]) that is learned during classification training via backpropagation to a linear transformation of a one-hot embedding. Several unsupervised word2vec approaches [23] trained on a much larger unlabeled dataset were explored for seeding this embedding, but these yielded minimal improvement in classification performance over randomly-initialized learned embeddings. A fixed one-hot embedding was also tried, but gave diminished results. As our vocabulary size is much smaller than those of natural languages, we were able to use a much smaller embedding than is typical in NLP applications. Our experiments found that k = 13 performed the best for supervised embedding sizes, balancing the expressiveness of the embedding against overfitting. We found that adding a small amount of random Gaussian noise N(µ = 0, σ² = 0.01) to each embedded representation substantially improved resistance to overfitting and was much more effective than other regularization techniques such as weight decay.
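The paper does not name a deep learning framework, so the following is a minimal Keras sketch of such an embedding layer, using the 156-token vocabulary, k = 13, a [−1, 1] clipping constraint, and Gaussian noise with standard deviation 0.1 (σ² = 0.01). The padding index and the exact way the range constraint is enforced are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 156   # lexer vocabulary size
MAX_LEN = 500      # maximum padded token length
EMBED_DIM = 13     # k = 13 worked best

class ClipToUnitBox(tf.keras.constraints.Constraint):
    """Keep every embedding coordinate inside [-1, 1] after each update."""
    def __call__(self, w):
        return tf.clip_by_value(w, -1.0, 1.0)

tokens_in = layers.Input(shape=(MAX_LEN,), dtype="int32")
embedded = layers.Embedding(
    input_dim=VOCAB_SIZE + 1,                 # +1 for an assumed padding index 0
    output_dim=EMBED_DIM,
    embeddings_constraint=ClipToUnitBox(),
)(tokens_in)
# sigma^2 = 0.01 -> standard deviation 0.1; noise is applied only during training.
noisy = layers.GaussianNoise(stddev=0.1)(embedded)
print(noisy.shape)   # (None, 500, 13)
```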
2) Feature extraction: We explored both CNNs and RNNs for feature extraction from the embedded source representations. Convolutional feature extraction: We use n convolutional filters with shape m × k, so each filter spans the full space of the token embedding. The filter size m determines the number of sequential tokens that are considered together, and we found that a fairly large filter size of m = 9 worked best. A total of n = 512 filters, paired with batch normalization followed by ReLU, was most effective. Recurrent feature extraction: We also explored using recurrent neural networks for feature extraction to allow longer token dependencies to be captured. The embedded representation is fed to a multi-layer RNN, and the output at each step in the length-ℓ sequence is concatenated. We used two-layer Gated Recurrent Unit RNNs with hidden state size n′ = 256, though Long Short-Term Memory RNNs performed equally well.

3) Pooling: As the length of C/C++ functions found in the wild can vary dramatically, both the convolutional and recurrent features are maxpooled along the sequence length ℓ in order to generate a fixed-size (n or n′, respectively) representation. In this architecture, the feature extraction layers should learn to identify different signals of vulnerability, and thus the presence of any of these signals along the sequence is important.

4) Dense layers: The feature extraction layers are followed by a fully-connected classifier. During training, 50% dropout was applied to the connections from the maxpooled feature representation to the first hidden layer. We found that using two hidden layers of 64 and 16 units before the final softmax output layer gave the best classification performance.
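As an illustration of how items 2) through 4) fit together, here is a minimal Keras sketch of the convolutional variant: n = 512 filters of width m = 9 over the embedded sequence, batch normalization and ReLU, max-pooling over the sequence length, 50% dropout, and hidden layers of 64 and 16 units before a two-way softmax. The padding mode and hidden-layer activations are assumptions not specified in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 156, 500, 13
N_FILTERS, FILTER_WIDTH = 512, 9            # n = 512 filters of size m = 9

def build_cnn_classifier():
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE + 1, EMBED_DIM)(tokens)   # k = 13
    x = layers.GaussianNoise(0.1)(x)                          # sigma^2 = 0.01
    # Each 1-D filter spans the full embedding width, i.e. an m x k filter.
    x = layers.Conv1D(N_FILTERS, FILTER_WIDTH, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Max-pool along the sequence length -> fixed n-dimensional feature vector.
    features = layers.GlobalMaxPooling1D(name="features")(x)
    h = layers.Dropout(0.5)(features)                         # 50% dropout
    h = layers.Dense(64, activation="relu")(h)
    h = layers.Dense(16, activation="relu")(h)
    out = layers.Dense(2, activation="softmax")(h)
    return models.Model(tokens, out)

model = build_cnn_classifier()
model.summary()
```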
5) Training: For data batching convenience, we trained only on functions with token length 10 ≤ ℓ ≤ 500, padded to the maximum length of 500. Both the convolutional and recurrent networks were trained with batch size 128, Adam optimization (with learning rates 5 × 10⁻⁴ and 1 × 10⁻⁴, respectively), and a cross entropy loss. Since the dataset was strongly unbalanced, vulnerable functions were weighted more heavily in the loss function. This weight is one of the many hyper-parameters we tuned to get the best performance. We used an 80:10:10 split of our combined SATE IV, Debian, and GitHub dataset to train, validate, and test our models. We tuned and selected models based on the highest validation Matthews Correlation Coefficient (MCC).
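A minimal sketch of this training setup, reusing build_cnn_classifier from the sketch above and scikit-learn for the MCC. The toy arrays stand in for the real padded token sequences and labels, and the vulnerable-class weight is only a placeholder for the tuned hyper-parameter.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import matthews_corrcoef

# Stand-ins for the real data: padded token-id sequences and binary labels
# from the 80:10:10 train/validation/test split.
X_train = np.random.randint(0, 157, size=(1024, 500))
y_train = np.random.randint(0, 2, size=(1024,))
X_val = np.random.randint(0, 157, size=(256, 500))
y_val = np.random.randint(0, 2, size=(256,))

model = build_cnn_classifier()                     # from the sketch above
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),   # 1e-4 for the RNN variant
    loss="sparse_categorical_crossentropy",
)

# Up-weight the rare 'vulnerable' class; the exact weight was a tuned
# hyper-parameter, so 10.0 here is only a placeholder.
model.fit(X_train, y_train, batch_size=128, epochs=2,
          class_weight={0: 1.0, 1: 10.0},
          validation_data=(X_val, y_val))

# Model selection used the highest validation Matthews Correlation Coefficient.
val_pred = np.argmax(model.predict(X_val), axis=1)
print("validation MCC:", matthews_corrcoef(y_val, val_pred))
```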
B. Ensemble learning on neural representations

While the neural network approaches automatically build their own features, their classification performance on our full dataset was suboptimal. We found that using the neural features (outputs from the sequence-maxpooled convolution layer in the CNN and sequence-maxpooled output states in the RNN) as inputs to a powerful ensemble classifier such as random forest or extremely randomized trees yielded the best results on our full dataset. Having the features and classifier optimized separately seemed to help resist overfitting. This approach also makes it more convenient to quickly retrain a classifier on new sets of features or combinations of features.
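A sketch of this step under the same assumptions as the previous sketches: the sequence-maxpooled layer of the trained CNN is reused as a feature extractor, and a scikit-learn random forest is fit on the resulting 512-dimensional vectors. The forest size and class weighting are illustrative choices, not values from the paper.

```python
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

# Reuse everything up to the sequence-maxpooled layer of the trained CNN
# (named "features" in the earlier sketch) as a fixed feature extractor.
feature_extractor = tf.keras.Model(
    inputs=model.input,
    outputs=model.get_layer("features").output,
)

train_feats = feature_extractor.predict(X_train)   # shape (N, 512)
val_feats = feature_extractor.predict(X_val)

# Random forest on the learned representation; 300 trees is an assumption.
rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", n_jobs=-1)
rf.fit(train_feats, y_train)
print("validation accuracy:", rf.score(val_feats, y_val))
```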
V. RESULTS

To provide a strong benchmark, we trained an RF classifier on a "bag-of-words" (BOW) representation of function source code, which ignores the order of the tokens. An examination of the bag-of-words feature importances shows that the classifier exploits label correlations with (1) indicators of the source

[Figures 2 and 3: plots titled "Evaluation on packages from Debian and GitHub" and "Evaluation on SATE IV Juliet Test Suite", with curves for CNN + RF, RNN + RF, CNN, RNN, and BOW + RF (plus the Clang, Flawfinder, and Cppcheck static analyzers in Fig. 3); x-axes: recall (Fig. 2) and false positive rate (Fig. 3).]

Fig. 2: Precision versus recall of different ML approaches using our lexer representation on Debian and GitHub test data. Vulnerable functions make up 6.5% of the test data.

Fig. 3: SATE IV test data ROC, with true vulnerability labels, compared to the three static analyzers we considered. Vulnerable functions make up 43% of the test data.
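For reference, curves of the kind shown in Figs. 2 and 3, and the MCC used for model selection, can be computed with scikit-learn from predicted vulnerability probabilities; the arrays below are placeholders for real test labels and scores.

```python
import numpy as np
from sklearn.metrics import (precision_recall_curve, roc_curve, auc,
                             matthews_corrcoef)

# Placeholders: y_true holds 0/1 test labels, y_score the predicted
# probability of 'vulnerable' (e.g. rf.predict_proba(test_feats)[:, 1]).
y_true = np.random.randint(0, 2, size=1000)
y_score = np.random.rand(1000)

precision, recall, _ = precision_recall_curve(y_true, y_score)   # Fig. 2 style
fpr, tpr, _ = roc_curve(y_true, y_score)                         # Fig. 3 style
print("PR  AUC:", auc(recall, precision))
print("ROC AUC:", auc(fpr, tpr))
print("MCC at 0.5 threshold:",
      matthews_corrcoef(y_true, (y_score > 0.5).astype(int)))
```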