Automated Vulnerability Detection in Source Code Using Deep Representation Learning
limited datasets (in both size and variety) used by most of the previous works limit the usefulness of the results and prevent them from taking full advantage of the power of deep learning.

III. DATA

Given the complexity and variety of programs, a large number of training examples are required to train machine learning models that can effectively learn the patterns of security vulnerabilities directly from code. We chose to analyze software packages at the function level because it is the lowest level of granularity capturing the overall flow of a subroutine. We compiled a vast dataset of millions of function-level examples of C and C++ code from the SATE IV Juliet Test Suite [6], the Debian Linux distribution [15], and public Git repositories on GitHub [16]. Table I shows the data summary of the number of functions we collected and used from each source in our dataset of over 12 million functions.

TABLE I: Total number of functions obtained from each data source, the number of valid functions remaining after removing duplicates and applying cuts, and the number of functions without and with detected vulnerabilities.

                    SATE IV       GitHub          Debian
Total               121,353       9,706,269       3,046,758
Passing curation    11,896        782,493         491,873
'Not vulnerable'    6,503 (55%)   730,160 (93%)   461,795 (94%)
'Vulnerable'        5,393 (45%)   52,333 (7%)     30,078 (6%)

The SATE IV Juliet Test Suite contains synthetic code examples with vulnerabilities from 118 different Common Weakness Enumeration (CWE) [1] classes and was originally designed to explore the performance of static and dynamic analyzers. While the SATE IV dataset provides labeled examples of many types of vulnerabilities, it is made up of synthetic code snippets that do not sufficiently cover the space of natural code to provide an appropriate training set alone. To provide a vast dataset of natural code to augment the SATE IV data, we mined large numbers of functions from Debian packages and public Git repositories. The Debian package releases provide a selection of very well-managed and curated code which is in use on many systems today. The GitHub dataset provides a larger quantity and wider variety of (often lower-quality) code. Since the open-source functions from Debian and GitHub are not labeled, we used a suite of static analysis tools to generate the labels. Details of the label generation are explained in Subsection III-C.

A. Source lexing

To generate useful features from the raw source code of each function, we created a custom C/C++ lexer designed to capture the relevant meaning of critical tokens while keeping the representation generic and minimizing the total token vocabulary size. Making our lexed representation of code from different software repositories as standardized as possible empowers transfer learning across the full dataset. Standard lexers, designed for actually compiling code, capture far too much detail that can lead to overfitting in ML approaches. Our lexer was able to reduce C/C++ code to representations using a total vocabulary size of only 156 tokens. All base C/C++ keywords, operators, and separators are included in the vocabulary. Code that does not affect compilation, such as comments, is stripped out. String, character, and float literals are lexed to type-specific placeholder tokens, as are all identifiers. Integer literals are tokenized digit by digit, as these values are frequently relevant to vulnerabilities. Types and function calls from common libraries that are likely to have relevance to vulnerabilities are mapped to generic versions. For example, u32, uint32_t, UINT32, uint32, and DWORD are all lexed as the same generic token representing 32-bit unsigned data types. Learned embeddings of these individual tokens would likely distinguish them based on the kind of code they are commonly used in, so care was taken to build in the desired invariance.
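To make the lexing step concrete, here is a minimal Python sketch of the kind of token canonicalization described above. It is our illustration rather than the paper's lexer: the placeholder token names, the tiny keyword and type tables, and the regular expression cover only a small slice of C/C++.

    # Illustrative sketch of the token canonicalization described above
    # (hypothetical token names; not the authors' lexer).
    import re

    KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}  # subset only
    GENERIC_UINT32 = {"u32", "uint32_t", "UINT32", "uint32", "DWORD"}

    TOKEN_RE = re.compile(r"""
        //[^\n]*|/\*.*?\*/              # comments (dropped)
      | "(?:\\.|[^"\\])*"               # string literal
      | '(?:\\.|[^'\\])*'               # char literal
      | \d+\.\d+(?:[eE][+-]?\d+)?[fF]?  # float literal
      | \d+                             # integer literal
      | [A-Za-z_]\w*                    # identifier or keyword
      | ==|!=|<=|>=|->|\+\+|--|&&|\|\|  # multi-character operators
      | [{}()\[\];,<>=+\-*/%&|!^.~?:]   # single-character operators and separators
    """, re.VERBOSE | re.DOTALL)

    def lex_function(source: str) -> list[str]:
        tokens = []
        for match in TOKEN_RE.finditer(source):
            tok = match.group(0)
            if tok.startswith(("//", "/*")):
                continue                        # comments do not affect compilation
            if tok.startswith('"'):
                tokens.append("<str>")          # type-specific placeholder
            elif tok.startswith("'"):
                tokens.append("<char>")
            elif re.fullmatch(r"\d+\.\d+(?:[eE][+-]?\d+)?[fF]?", tok):
                tokens.append("<float>")
            elif tok.isdigit():
                tokens.extend(tok)              # integer literals kept digit by digit
            elif tok in GENERIC_UINT32:
                tokens.append("<uint32_type>")  # generic 32-bit unsigned type
            elif tok in KEYWORDS:
                tokens.append(tok)              # base keywords kept verbatim
            elif re.fullmatch(r"[A-Za-z_]\w*", tok):
                tokens.append("<id>")           # all other identifiers collapsed
            else:
                tokens.append(tok)              # operators and separators
        return tokens

    # Example: lex_function("uint32_t n = 42; // count")
    #   -> ['<uint32_type>', '<id>', '=', '4', '2', ';']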
B. Data curation

One very important step of our data preparation was the removal of potential duplicate functions. Open-source repositories often have functions duplicated across different packages. Such duplication can artificially inflate performance metrics and conceal overfitting, as training data can leak into test sets. Likewise, there are many functions that are near duplicates, containing trivial changes in source code that do not significantly affect the execution of the function. These near duplicates are challenging to remove, as they can often appear in very different code repositories and can look quite different at the raw source level.

To protect against these issues, we performed an extremely strict duplicate removal process. We removed any function with a duplicated lexed representation of its source code or a duplicated compile-level feature vector. This compile-level feature vector was created by extracting the control flow graph of the function as well as the operations happening in each basic block (opcode vector, or op-vec) and the definition and use of variables (use-def matrix).² Two functions with identical instruction-level behaviors or functionality are likely to have both similar lexed representations and highly correlated vulnerability status.

The "Passing curation" row of Table I reflects the number of functions remaining after the duplicate removal process, about 10.8% of the total number of functions pulled. Although our strict duplicate removal process filters out a significant amount of data, this approach provides the most conservative performance results, closely estimating how well our tool will perform against code it has never seen before.

² Our compile-level feature extraction framework incorporated modified variants of strace and buildbot as well as a custom Clang plugin; we omit the details to focus on the ML aspects of our work.
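The strict duplicate removal described in this subsection can be pictured as hashing each function's lexed token sequence and compile-level feature vector and keeping only the first occurrence of each. The sketch below is ours and assumes a simple dictionary layout for each mined function; the real pipeline is only described at a high level in the text.

    # Minimal sketch of strict duplicate removal: any function whose lexed token
    # sequence or compile-level feature vector has been seen before is discarded.
    import hashlib

    def _digest(items) -> str:
        return hashlib.sha256("\x1f".join(map(str, items)).encode("utf-8")).hexdigest()

    def deduplicate(functions):
        """functions: iterable of dicts with 'tokens' (lexed token list) and
        'compile_features' (numbers from the CFG / op-vec / use-def extraction)."""
        seen_lexed, seen_compile, kept = set(), set(), []
        for fn in functions:
            lexed_key = _digest(fn["tokens"])
            compile_key = _digest(fn["compile_features"])
            if lexed_key in seen_lexed or compile_key in seen_compile:
                continue  # duplicate at the source level or the instruction level
            seen_lexed.add(lexed_key)
            seen_compile.add(compile_key)
            kept.append(fn)
        return kept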
TABLE II: Most frequent CWEs among the curated functions labeled 'vulnerable'.

CWE ID              CWE Description                                                            Frequency %
120/121/122         Buffer Overflow                                                            38.2%
119                 Improper Restriction of Operations within the Bounds of a Memory Buffer   18.9%
476                 NULL Pointer Dereference                                                   9.5%
469                 Use of Pointer Subtraction to Determine Size                               2.0%
20, 457, 805, etc.  Improper Input Validation, Use of Uninitialized Variable, Buffer Access
                    with Incorrect Length Value, etc.                                          31.4%
C. Labels

Labeling code vulnerability at the function level was a significant challenge. The bulk of our dataset was made up of mined open-source code without known ground truth. In order to generate labels, we pursued three approaches: static analysis, dynamic analysis, and commit-message/bug-report tagging.

While dynamic analysis is capable of exposing subtle flaws by executing functions with a wide range of possible inputs, it is extremely resource intensive. Performing a dynamic analysis on the roughly 400 functions in a single module of the LibTIFF 3.8.2 package from the ManyBugs dataset [17] took nearly a day of effort. Therefore, this approach was not realistic for our extremely large dataset.

Commit-message based labeling turned out to be very challenging, providing low-quality labels. In our tests, both humans and ML algorithms were poor at using commit messages to predict corresponding Travis CI [18] build failures or fixes. Motivated by recent work by Zhou et al. [19], we also tried a simple keyword search looking for commit words like "buggy", "broken", "error", "fixed", etc. to label before-and-after pairs of functions, which yielded better results in terms of relevancy. However, this approach greatly reduced the number of candidate functions that we could label and still required significant manual inspection, making it inappropriate for our vast dataset.

As a result, we decided to use three open-source static analyzers, Clang, Cppcheck [20], and Flawfinder [21], to generate labels. Each static analyzer varies in its scope of search and detection. For example, Clang's scope is very broad but also picks up on syntax, programming style, and other findings which are not likely to result in a vulnerability. Flawfinder's scope is geared towards CWEs and does not focus on other aspects such as style. Therefore, we incorporated multiple static analyzers and pruned their outputs to exclude findings that are not typically associated with security vulnerabilities, in an effort to create robust labels.

We had a team of security researchers map each static analyzer's finding categories to the corresponding CWEs and identify which CWEs would likely result in potential security vulnerabilities. This process allowed us to generate binary labels of "vulnerable" and "not vulnerable", depending on the CWE. For example, Clang's "Out-of-bound array access" finding was mapped to "CWE-805: Buffer Access with Incorrect Length Value", an exploitable vulnerability that can lead to program crashes, so functions with this finding were labeled "vulnerable". On the other hand, Cppcheck's "Unused struct member" finding was mapped to "CWE-563: Assignment to Variable without Use", a poor code practice unlikely to cause a security vulnerability, so corresponding functions were labeled "not vulnerable" even though static analyzers flagged them. Of the 390 total types of findings from the static analyzers, 149 were determined to result in a potential security vulnerability. Roughly 6.8% of our curated, mined C/C++ functions triggered a vulnerability-related finding. Table II shows the statistics of frequent CWEs in these "vulnerable" functions.
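The finding-to-label mapping can be sketched as a small lookup table. Only the two example findings quoted above are filled in; the curated mapping of all 390 finding types (149 of them security-relevant) is not reproduced here, so this is an illustration of the scheme rather than the actual table.

    # Sketch of turning static-analyzer findings into binary function labels.
    FINDING_TO_CWE = {
        ("clang", "Out-of-bound array access"): "CWE-805",
        ("cppcheck", "Unused struct member"): "CWE-563",
        # ... remaining (tool, finding) pairs elided ...
    }

    SECURITY_RELEVANT_CWES = {"CWE-805"}  # stand-in for the 149 curated CWEs

    def label_function(findings) -> str:
        """findings: list of (tool, finding_category) pairs reported for one function."""
        for tool, category in findings:
            cwe = FINDING_TO_CWE.get((tool, category))
            if cwe in SECURITY_RELEVANT_CWES:
                return "vulnerable"
        return "not vulnerable"

    # label_function([("cppcheck", "Unused struct member")])    -> "not vulnerable"
    # label_function([("clang", "Out-of-bound array access")])  -> "vulnerable"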
IV. METHODS

Our primary machine learning approach to vulnerability detection, depicted in Figure 1, combines the neural feature representations of lexed function source code with a powerful ensemble classifier, random forest (RF).

A. Neural network classification and representation learning

Since source code shares some commonalities with natural-language writing, and since far less prior work exists on learning from programming languages, we build on approaches developed for natural language processing (NLP) [22]. We leverage feature-extraction approaches similar to those used for sentence sentiment classification with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for function-level source vulnerability classification.

1) Embedding: The tokens making up the lexed functions are first embedded into a fixed k-dimensional representation (limited to range [-1, 1]) that is learned during classification training via backpropagation to a linear transformation of a one-hot embedding. Several unsupervised word2vec approaches [23] trained on a much larger unlabeled dataset were explored for seeding this embedding, but these yielded minimal improvement in classification performance over randomly-initialized learned embeddings. A fixed one-hot embedding was also tried, but gave diminished results. As our vocabulary size is much smaller than those of natural languages, we were able to use a much smaller embedding than is typical in NLP applications. Our experiments found that k = 13 performed the best for supervised embedding sizes, balancing the expressiveness of the embedding against overfitting. We found that adding a small amount of random Gaussian noise N(μ = 0, σ² = 0.01) to each embedded representation substantially improved resistance to overfitting and was much more effective than other regularization techniques such as weight decay.
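A minimal PyTorch sketch of this embedding step, as we read the description above (not the authors' code): a learned k = 13 embedding with additive Gaussian noise at training time. Using tanh to keep values in [-1, 1] is our assumption; the paper only states that the embedding is limited to that range.

    # Sketch of the token embedding with train-time Gaussian noise (illustrative).
    import torch
    import torch.nn as nn

    class NoisyTokenEmbedding(nn.Module):
        def __init__(self, vocab_size: int = 156, k: int = 13, noise_std: float = 0.1):
            super().__init__()
            # Equivalent to a learned linear transformation of a one-hot encoding.
            self.embed = nn.Embedding(vocab_size, k)
            self.noise_std = noise_std  # sigma^2 = 0.01 -> sigma = 0.1

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            # token_ids: (batch, sequence_length) integer tensor
            x = torch.tanh(self.embed(token_ids))             # keep values in [-1, 1] (assumption)
            if self.training:
                x = x + self.noise_std * torch.randn_like(x)  # additive N(0, 0.01) noise
            return x                                          # (batch, sequence_length, k)

    # embedding = NoisyTokenEmbedding()
    # vectors = embedding(torch.randint(0, 156, (32, 500)))   # -> shape (32, 500, 13)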
2) Feature extraction: We explored both CNNs and RNNs for feature extraction from the embedded source representations.
Fig. 1: Illustration of our convolutional neural representation-learning approach to source code classification. Input source code is lexed into a token sequence of variable length ℓ, embedded into an ℓ × k representation, filtered by n convolutions of size m × k, and maxpooled along the sequence length to a feature vector of fixed size n. The embedding and convolutional filters are learned by weighted cross entropy loss from fully-connected classification layers. The learned n-dimensional feature vector is used as input to a random forest classifier, which improves performance compared to the neural network classifier alone.
Convolutional feature extraction: We use n convolutional filters with shape m × k, so each filter spans the full space of the token embedding. The filter size m determines the number of sequential tokens that are considered together, and we found that a fairly large filter size of m = 9 worked best. A total of n = 512 filters, paired with batch normalization followed by ReLU, was most effective. Recurrent feature extraction: We also explored using recurrent neural networks for feature extraction to allow longer token dependencies to be captured. The embedded representation is fed to a multi-layer RNN, and the output at each step of the length-ℓ sequence is concatenated. We used two-layer Gated Recurrent Unit RNNs with hidden state size n′ = 256, though Long Short-Term Memory RNNs performed equally well.

3) Pooling: As the length of C/C++ functions found in the wild can vary dramatically, both the convolutional and recurrent features are maxpooled along the sequence length in order to generate a fixed-size (n or n′, respectively) representation. In this architecture, the feature extraction layers should learn to identify different signals of vulnerability, and thus the presence of any of these along the sequence is important.
4) Dense layers: The feature extraction layers are followed by a fully-connected classifier. During training, 50% dropout was applied to the connections between the maxpooled feature representation and the first hidden layer. We found that using two hidden layers of 64 and 16 before the final softmax output layer gave the best classification performance.
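Putting the embedding, convolutional feature extraction, pooling, and dense layers together, one plausible PyTorch rendering of the convolutional variant looks like the sketch below (n = 512 filters of size m = 9, batch normalization and ReLU, maxpooling over the sequence, hidden layers of 64 and 16, and 50% dropout). Details the paper does not specify, such as padding, are our assumptions, and the train-time embedding noise from the earlier sketch is omitted for brevity.

    # Illustrative CNN architecture matching the description in the text
    # (our sketch; unspecified layer details are assumptions).
    import torch
    import torch.nn as nn

    class ConvVulnDetector(nn.Module):
        def __init__(self, vocab_size=156, k=13, n_filters=512, m=9, n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, k)
            # Each filter spans the full embedding width k, so a 1-D convolution over
            # the token sequence with kernel size m is equivalent to an m x k filter.
            self.conv = nn.Conv1d(in_channels=k, out_channels=n_filters, kernel_size=m)
            self.bn = nn.BatchNorm1d(n_filters)
            self.relu = nn.ReLU()
            self.dropout = nn.Dropout(p=0.5)        # 50% dropout after maxpooling
            self.classifier = nn.Sequential(
                nn.Linear(n_filters, 64), nn.ReLU(),
                nn.Linear(64, 16), nn.ReLU(),
                nn.Linear(16, n_classes),           # softmax is applied inside the loss
            )

        def features(self, token_ids):
            # token_ids: (batch, sequence_length)
            x = self.embed(token_ids).transpose(1, 2)   # (batch, k, seq_len)
            x = self.relu(self.bn(self.conv(x)))        # (batch, n_filters, seq_len - m + 1)
            return torch.max(x, dim=2).values           # maxpool over sequence -> (batch, n_filters)

        def forward(self, token_ids):
            return self.classifier(self.dropout(self.features(token_ids)))

    # model = ConvVulnDetector()
    # logits = model(torch.randint(0, 156, (128, 500)))  # -> shape (128, 2)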
5) Training: For data batching convenience, we trained only on functions with token length 10 ≤ ℓ ≤ 500, padded to the maximum length of 500. Both the convolutional and recurrent networks were trained with batch size 128, Adam optimization (with learning rates 5 × 10⁻⁴ and 1 × 10⁻⁴, respectively), and a cross entropy loss. Since the dataset was strongly unbalanced, vulnerable functions were weighted more heavily in the loss function. This weight is one of the many hyper-parameters we tuned to get the best performance. We used an 80:10:10 split of our combined SATE IV, Debian, and GitHub dataset to train, validate, and test our models. We tuned and selected models based on the highest validation Matthews Correlation Coefficient (MCC).
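The training recipe in this paragraph (class-weighted cross entropy, Adam, batch size 128) can be sketched as follows; the positive-class weight was a tuned hyper-parameter that the paper does not report, so the value shown is a placeholder.

    # Illustrative training configuration for the convolutional model sketched above.
    import torch
    import torch.nn as nn

    model = ConvVulnDetector()                      # from the sketch above
    # Weight the rare 'vulnerable' class more heavily; the actual weight was tuned.
    class_weights = torch.tensor([1.0, 10.0])       # [not vulnerable, vulnerable] (placeholder)
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # 1e-4 for the RNN variant

    def train_epoch(loader):
        model.train()
        for token_ids, labels in loader:            # batches of 128 functions, padded to length 500
            optimizer.zero_grad()
            loss = criterion(model(token_ids), labels)
            loss.backward()
            optimizer.step()

    # Model selection tracked the Matthews Correlation Coefficient on the validation
    # split, e.g. sklearn.metrics.matthews_corrcoef(val_labels, val_predictions).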
B. Ensemble learning on neural representations

While the neural network approaches automatically build their own features, their classification performance on our full dataset was suboptimal. We found that using the neural features (outputs from the sequence-maxpooled convolution layer in the CNN and sequence-maxpooled output states in the RNN) as inputs to a powerful ensemble classifier such as random forest or extremely randomized trees yielded the best results on our full dataset. Having the features and classifier optimized separately seemed to help resist overfitting. This approach also makes it more convenient to quickly retrain a classifier on new sets of features or combinations of features.
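A minimal sketch of this two-stage approach, reusing the features() method from the CNN sketch above and scikit-learn for the ensemble stage. The forest hyper-parameters and the data loaders are assumptions; the paper does not specify them.

    # Train a random forest on the fixed-size neural features (sketch).
    import numpy as np
    import torch
    from sklearn.ensemble import RandomForestClassifier

    def extract_features(model, loader):
        model.eval()
        feats, labels = [], []
        with torch.no_grad():
            for token_ids, y in loader:
                feats.append(model.features(token_ids).cpu().numpy())  # (batch, 512)
                labels.append(y.numpy())
        return np.concatenate(feats), np.concatenate(labels)

    X_train, y_train = extract_features(model, train_loader)   # loaders assumed defined
    forest = RandomForestClassifier(n_estimators=300, class_weight="balanced")
    forest.fit(X_train, y_train)

    X_test, y_test = extract_features(model, test_loader)
    vuln_probability = forest.predict_proba(X_test)[:, 1]      # tunable-threshold scores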
V. RESULTS

To provide a strong benchmark, we trained an RF classifier on a "bag-of-words" (BOW) representation of function source code, which ignores the order of the tokens. An examination of the bag-of-words feature importances shows that the classifier exploits label correlations with (1) indicators of the source length and complexity and (2) combinations of calls which are commonly misused and lead to vulnerabilities (such as memcpy and malloc). Improvements over this baseline can be interpreted as being due to more complex and specific vulnerability indication patterns.
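For comparison, a baseline of this kind can be sketched as a token-count matrix fed to a random forest. The exact BOW construction is not given in the paper, so the vocabulary and train_functions structures below are our assumptions.

    # Bag-of-words baseline: count each of the 156 lexer tokens per function,
    # ignoring token order, and classify the count vectors with a random forest.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def bow_vector(tokens, vocabulary):
        counts = np.zeros(len(vocabulary), dtype=np.float32)
        for tok in tokens:
            counts[vocabulary[tok]] += 1
        return counts

    # vocabulary: dict mapping each of the 156 lexer tokens to a column index
    X_bow = np.stack([bow_vector(fn["tokens"], vocabulary) for fn in train_functions])
    y = np.array([fn["label"] for fn in train_functions])

    bow_forest = RandomForestClassifier(n_estimators=300, class_weight="balanced")
    bow_forest.fit(X_bow, y)
    print(bow_forest.feature_importances_)   # which tokens drive the predictions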
Fig. 2: Precision versus recall of different ML approaches using our lexer representation on Debian and GitHub test data. Vulnerable functions make up 6.5% of the test data.

Fig. 3: SATE IV test data ROC, with true vulnerability labels, compared to the three static analyzers we considered. Vulnerable functions make up 43% of the test data.
Fig. 4: Performance of a multi-label CNN + RF classifier on Debian and GitHub data by vulnerability type (see Table II).

TABLE III: Results on the Debian and GitHub test data for our ML models, corresponding to Figure 2.

TABLE IV: Results on the SATE IV Juliet Suite test data for our ML models and three static analyzers, as in Figure 3.

             PR AUC   ROC AUC   MCC     F1
Clang        –        –         0.227   0.450
Flawfinder   –        –         0.079   0.365
Cppcheck     –        –         0.060   0.050
BOW + RF     0.890    0.913     0.607   0.786
RNN          0.900    0.923     0.646   0.807
CNN          0.944    0.954     0.698   0.840
RNN + RF     0.914    0.934     0.657   0.813
CNN + RF     0.916    0.936     0.672   0.824
Overall, our CNN models performed better than the RNN models as both standalone classifiers and feature generators. In addition, the CNNs were faster to train and required far fewer parameters. On our natural function dataset, the RF classifier trained on neural feature representations performed better than the standalone network for both the CNN and RNN features. Likewise, the RF classifiers trained on neural network representations performed better than the benchmark BOW classifier.

Figure 2 shows the precision-recall performance of the best versions of all of the primary ML approaches on our natural function test dataset. The area under the precision-recall curve (PR AUC) and receiver operating characteristic (ROC AUC), as well as the MCC and F1 score at the validation-optimal thresholds, are shown in Table III. Figure 4 shows the performance of our strongest classifier when trained to detect specific vulnerability types from a shared feature representation. Some CWE types are significantly more challenging than others.

We compare our ML models against our collection of SA tools on the SATE IV Juliet Suite dataset, which has true vulnerability labels. Figure 3 shows the performance of our models alongside the SA findings on this nearly label-balanced dataset. We find that our models, especially the CNN, perform much better on the SATE IV test data than on the natural functions from Debian and GitHub, likely because SATE IV has many examples for each vulnerability it contains and has fairly consistent style and structure. Among the SA tools, Clang performs the best on the SATE IV data, but still finds very few vulnerabilities compared with all of the ML methods. The full SATE IV results are shown in Table IV.
Our ML methods have some additional advantages over traditional static analysis tools. Our custom lexer and ML models can rapidly digest and score large repositories and source code without requiring that the code be compiled. Additionally, since the ML methods all output probabilities, the thresholds can be tuned to achieve the desired precision and recall. The static analyzers, on the other hand, return a fixed number of findings, which may be overwhelmingly large for huge codebases or too small for critical applications. While static analyzers are able to better localize the vulnerabilities they find, we can use visualization techniques, such as the feature activation map shown in Figure 5, to help understand why our algorithms make their decisions.

Fig. 5: Screenshot from our interactive vulnerability detection demo. The convolutional feature activation map [24] for a detected vulnerability is overlaid in red on the original code.
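The threshold tuning mentioned above can be illustrated with scikit-learn: given each function's predicted vulnerability probability, choose the operating point that meets a target precision. This is our sketch, not part of the paper's tooling.

    # Choose a decision threshold that achieves a target precision (sketch).
    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def threshold_for_precision(y_true, scores, target_precision=0.9):
        precision, recall, thresholds = precision_recall_curve(y_true, scores)
        # precision/recall have one more entry than thresholds; ignore the final point.
        ok = np.where(precision[:-1] >= target_precision)[0]
        if len(ok) == 0:
            return None, None      # target precision not reachable on this data
        i = ok[0]                  # lowest qualifying threshold keeps recall highest
        return thresholds[i], recall[i]

    # threshold, achievable_recall = threshold_for_precision(y_test, vuln_probability)
    # flagged = vuln_probability >= threshold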
VI. CONCLUSIONS

We have demonstrated the potential of using ML to detect software vulnerabilities directly from source code. To do this, we built an extensive C/C++ source code dataset mined from Debian and GitHub repositories, labeled it with curated vulnerability findings from a suite of static analysis tools, and combined it with the SATE IV dataset. We created a custom C/C++ lexer to produce a simple, generic representation of function source code ideal for ML training. We applied a variety of ML techniques inspired by classification problems in the natural language domain, fine-tuned them for our application, and achieved the best overall results using features learned via convolutional neural network and classified with an ensemble tree algorithm.

Future work should focus on improved labels, such as those from dynamic analysis tools or mined from security patches. This would allow scores produced from the ML models to be more complementary with static analysis tools. The ML techniques developed in this work for learning directly on function source code can also be applied to any code classification problem, such as detecting style violations, commit categorization, or algorithm/task classification. As larger and better-labeled datasets are developed, deep learning for source code analysis will become more practical for a wider variety of important problems.

ACKNOWLEDGMENT

The authors thank Hugh J. Enxing and Thomas Jost for their efforts creating the data ingestion pipeline. This project was sponsored by the Air Force Research Laboratory (AFRL) as part of the DARPA MUSE program.

REFERENCES

[1] MITRE, Common Weakness Enumeration. https://cwe.mitre.org/data/index.html.
[2] T. D. LaToza, G. Venolia, and R. DeLine, "Maintaining mental models: A study of developer work habits," in Proc. 28th Int. Conf. Software Engineering, ICSE '06, (New York, NY, USA), pp. 492-501, ACM, 2006.
[3] D. Yadron, "After heartbleed bug, a race to plug internet hole," Wall Street Journal, vol. 9, 2014.
[4] C. Foxx, "Cyber-attack: Europol says it was unprecedented in scale." https://www.bbc.com/news/world-europe-39907965, 2017.
[5] C. Arnold, "After Equifax hack, calls for big changes in credit reporting industry." http://www.npr.org/2017/10/18/558570686/after-equifax-hack-calls-for-big-changes-in-credit-reporting-industry, 2017.
[6] NIST, Juliet test suite v1.3, 2017. https://samate.nist.gov/SRD/testsuite.php.
[7] Z. Xu, T. Kremenek, and J. Zhang, "A memory model for static analysis of C programs," in Proc. 4th Int. Conf. Leveraging Applications of Formal Methods, Verification, and Validation, pp. 535-548, 2010.
[8] J. C. King, "Symbolic execution and program testing," Commun. ACM, vol. 19, pp. 385-394, July 1976.
[9] M. Allamanis, E. T. Barr, P. T. Devanbu, and C. A. Sutton, "A survey of machine learning for Big Code and naturalness," CoRR, vol. abs/1709.06182, 2017.
[10] A. Hovsepyan, R. Scandariato, W. Joosen, and J. Walden, "Software vulnerability prediction using text analysis techniques," in Proc. 4th Int. Workshop Security Measurements and Metrics, MetriSec '12, pp. 7-10, 2012.
[11] Y. Pang, X. Xue, and A. S. Namin, "Predicting vulnerable software components through n-gram analysis and statistical feature selection," in 2015 IEEE 14th Int. Conf. Machine Learning and Applications (ICMLA), 2015.
[12] L. Mou, G. Li, Z. Jin, L. Zhang, and T. Wang, "TBCNN: A tree-based convolutional neural network for programming language processing," CoRR, 2014.
[13] Z. Li et al., "VulDeePecker: A deep learning-based system for vulnerability detection," CoRR, vol. abs/1801.01681, 2018.
[14] J. Harer et al., "Learning to repair software vulnerabilities with generative adversarial networks," arXiv preprint arXiv:1805.07475, 2018.
[15] Debian, Debian - the universal operating system. https://www.debian.org/.
[16] GitHub. https://github.com/.
[17] C. Le Goues et al., "The ManyBugs and IntroClass benchmarks for automated repair of C programs," IEEE Transactions on Software Engineering (TSE), vol. 41, pp. 1236-1256, December 2015. http://dx.doi.org/10.1109/TSE.2015.2454513.
[18] Travis CI. https://travis-ci.org/.
[19] Y. Zhou and A. Sharma, "Automated identification of security issues from commit messages and bug reports," in Proc. 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 914-919, 2017.
[20] Cppcheck. http://cppcheck.sourceforge.net/.
[21] D. A. Wheeler, Flawfinder. https://www.dwheeler.com/flawfinder/.
[22] Y. Kim, "Convolutional neural networks for sentence classification," in Proc. 2014 Conf. Empirical Methods in Natural Language Processing (EMNLP), (Doha, Qatar), pp. 1746-1751, Association for Computational Linguistics, October 2014.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
[24] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pp. 2921-2929, IEEE, 2016.