Web Application Vulnerability Prediction Using Hybrid Program Analysis and Machine Learning
Research Collection School of Computing and Information Systems
11-2014
Lwin Khin SHAR, Lionel BRIAND, and Hee Beng Kuan TAN
Part of the Information Security Commons, and the Programming Languages and Compilers
Commons
Citation
SHAR, Lwin Khin; BRIAND, Lionel; and TAN, Hee Beng Kuan. Web application vulnerability prediction using
hybrid program analysis and machine learning. (2014). IEEE Transactions on Dependable and Secure
Computing. 12, (6), 688-707.
Available at: https://ink.library.smu.edu.sg/sis_research/4895
688 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2015
Abstract—Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical
approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of
hybrid (static + dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be
significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both
techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely
on the availability of data labeled with vulnerability information for training. For many real world applications, past vulnerability data is
often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply
both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that
semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability
prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and
semi-supervised models. When cross validated with fully available labeled data, the supervised models achieve an average of 77
percent recall and 5 percent probability of false alarm for predicting SQL injection, cross site scripting, remote code execution and file
inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model
showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting
semi-supervised learning may be a preferable solution for many real world applications where vulnerability data is missing.
Index Terms—Vulnerability prediction, security measures, input validation and sanitization, program analysis, empirical study
1 INTRODUCTION
we build vulnerability predictors that are fine-grained, accurate, and scalable. The approach is fine-grained because it identifies vulnerabilities at the program statement level. We use both static and dynamic program analysis techniques to extract IVS attributes. Static analysis can help assess general properties of a program. Dynamic analysis, in turn, can focus on more specific code characteristics that are complementary to the information obtained with static analysis. We use dynamic analysis only to infer the possible types of input validation and sanitization code, rather than to precisely prove their correctness, and apply machine learning on these inferences for vulnerability prediction. Therefore, we mitigate the scalability issue typically associated with dynamic analysis. Thus, our proposed IVS attributes reflect relevant properties of the implementations of input validation and input sanitization methods in web programs and are expected to help predict vulnerabilities in an accurate and scalable manner. Furthermore, we use both supervised learning and semi-supervised learning methods to build vulnerability predictors from IVS attributes, such that our method can also be used in contexts where there is limited vulnerability data for training.

This work is an extension of our previous work [33], which is a pattern mining approach based on static and dynamic analyses that classifies input validation and sanitization functions through the systematic extraction of their security-related properties. The extraction is based on static property inference and analysis of dynamic execution traces. The enhancements and additional contributions of this paper are as follows:

- In our previous work, which only targeted SQLI and XSS vulnerabilities, we stated that the proposed method could be adapted to other, similar types of vulnerabilities. In this paper, we address two more frequent types of vulnerabilities: remote code execution and file inclusion vulnerabilities. Hence, we propose additional attributes to mine the code patterns associated with these new types of vulnerabilities.
- We had only made use of data dependency graphs to identify input validation and sanitization methods. But some of these methods may be identified from control dependency graphs; e.g., input condition checks, which ensure valid inputs, are often implemented through predicates. Therefore, in this work, to better identify those methods, we leverage control dependency information.
- We propose static slicing and dynamic execution techniques that effectively mine both data dependency and control dependency information and describe the techniques in detail.
- We modified our prototype tool, PhpMiner, to mine the control dependency information and to extract additional attributes.
- We explore the use of semi-supervised learning schemes. To the best of our knowledge, we are the first to build vulnerability prediction models that way, which makes such models more widely applicable.
- We conducted two sets of experiments on a set of open source PHP applications of various sizes using PhpMiner. First, we evaluated supervised learning models built from IVS attributes. Based on cross validation, the model achieves 77 percent recall and 5 percent probability of false alarm, on average over 15 datasets, across SQLI, XSS, RCE, and FI vulnerabilities. From a practical standpoint, the results show that our approach detects many of the above common vulnerabilities at a very small cost (low false alarm rate), which is very promising considering that the existing approaches either report many false warnings or miss many vulnerabilities. Second, we compared supervised and semi-supervised learning models with a low sampling rate of 20 percent (i.e., only 20 percent of the available training data are labeled with vulnerability information). On average, the supervised model achieves 47 percent recall and 8 percent probability of false alarm, whereas the semi-supervised model achieves 71 percent recall and 5 percent probability of false alarm. However, when compared to the supervised model based on complete vulnerability data, on average, the semi-supervised model achieves the same probability of false alarm but a 6 percent lower recall. Therefore, our results suggest that when sufficient vulnerability data is available for training, a supervised model should be favored. On the other hand, when the available vulnerability data is limited, a semi-supervised model is probably a better alternative.

The outline of the paper is as follows. Section 2 provides background information. Section 3 presents our classification scheme that characterizes input validation and sanitization methods. Section 4 describes our vulnerability prediction framework. Section 5 evaluates our vulnerability predictors. Section 6 discusses related work. Section 7 concludes our study.

2 BACKGROUND

This paper targets SQLI, XSS, RCE, and FI vulnerabilities. These security risks, if exploited, could lead to serious issues such as disclosure of confidential, sensitive information, integrity violation, denial of service, loss of commercial confidence and customer trust, and threats to the continuity of business operations. According to CVE [6], 55,504 vulnerabilities were found in web applications within 1999-2013. Among them, 34 percent belong to RCE, 13.2 percent to XSS, 10.3 percent to SQLI, and 3.8 percent to FI. Thus, these four common vulnerabilities are responsible for 61.3 percent of the total number of vulnerabilities found. All these types of vulnerabilities are caused by potential weaknesses in web applications regarding the way they handle user inputs. They are briefly described using PHP code examples in the following.

2.1 SQL Injection

SQLI vulnerabilities occur when user input is used in database queries without proper checks. This allows attackers to trick the query interpreter into executing unintended commands or accessing unauthorized data. Consider the following code:
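The PHP listing referenced here appears in a figure not reproduced in this extract. As a stand-in, the following Python sketch (hypothetical users table, sqlite3 instead of MySQL) shows the same pattern the paper describes: concatenating user input into a query lets a classic attack string rewrite the query's logic, while a parameterized query treats the same input as plain data.

```python
import sqlite3

# Hypothetical users table for illustration; not from the paper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, pwd TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'secret')")

def login_unsafe(name, pwd):
    # Vulnerable: input is concatenated directly into the query string.
    q = f"SELECT COUNT(*) FROM users WHERE name = '{name}' AND pwd = '{pwd}'"
    return conn.execute(q).fetchone()[0] > 0

def login_safe(name, pwd):
    # Sanitized: placeholders keep the input as data, not SQL syntax.
    q = "SELECT COUNT(*) FROM users WHERE name = ? AND pwd = ?"
    return conn.execute(q, (name, pwd)).fetchone()[0] > 0

payload = "' OR '1'='1"   # classic SQLI attack string
print(login_unsafe("alice", payload))  # True: the tautology bypasses the check
print(login_safe("alice", payload))    # False: the payload is treated as data
```

The unsafe variant turns the WHERE clause into a tautology, which is exactly the kind of unintended command execution described above.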
TABLE 1
Input Validation and Sanitization Attributes
ID Name Description
Static analysis-based attributes
1 Client Input accessed from HTTP request parameters such as HTTP Get
2 File Input accessed from files such as Cookies, XML
3 Text-database Text-based input accessed from database
4 Numeric-database Numeric-based input accessed from database
5 Session Input accessed from persistent data object such as HTTP Session
6 Uninit Un-initialized program variable
7 Un-taint Function that returns predefined information or information not influenced by
external users
8 Known-vuln-user Custom function that has caused security issues in the past
9 Known-vuln-std Language built-in function that has caused security issues in the past
10 Propagate Function or operation that propagates partial or complete value of a string
Hybrid analysis-based attributes
11 Numeric Function or operation that converts a string into a numeric
12 DB-operator Function that filters query operators such as (=)
13 DB-comment-delimiter Function that filters query comment delimiters such as (--)
14 DB-special Function that filters other database special characters different from the above, such
as (\x00) and (\x1a)
15 String-delimiter Function that filters string delimiters such as (‘) and (“)
16 Lang-comment-delimiter Function that filters programming language comment delimiter characters such as (/)
17 Other-delimiter Function that filters other delimiters different from the above delimiters such as (#)
18 Script-tag Function that filters dynamic client script tags such as (<script>)
19 HTML-tag Function that filters static client script tags such as (<div>)
20 Event-handler Function that disallows the use of inputs as the values of client-side event handlers
such as (onload=)
21 Null-byte Function that filters null byte (%00)
22 Dot Function that filters dot (.)
23 DotDotSlash Function that filters dot-dot-slash (../) sequences
24 Backslash Function that filters backslash (\)
25 Slash Function that filters slash (/)
26 Newline Function that filters newline (\n)
27 Colon Function that filters colon (:) or semi-colon (;)
28 Other-special Function that filters any other special characters different from the above
special characters such as parenthesis
29 Encode Function that encodes a string into a different format
30 Canonicalize Function that converts a string into its most standard, simplest form
31 Path Function that filters directory paths or URLs
32 Limit-length Function or operation that limits a string to a specific length
Dependent attribute
33 Vuln? Indicates a class label—Vulnerable or Not-Vulnerable
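To illustrate how Table 1 is used downstream, the sketch below (a hypothetical, simplified stand-in for PhpMiner's extraction, covering only a small sample of the 32 attributes) counts the attribute classes observed along one path of a sink's slice into a vector that a classifier can consume; the dependent "Vuln?" label is attached separately from vulnerability data at training time.

```python
# Hypothetical, simplified encoder: counts occurrences of a few Table 1
# attribute classes along one path of a slice.
ATTRIBUTES = ["Client", "Propagate", "String-delimiter", "Script-tag", "Numeric"]

# Example: each operation on the path has already been mapped to an
# attribute class by static/dynamic analysis.
path_ops = ["Client", "Propagate", "String-delimiter", "Propagate"]

def encode(path_ops):
    # One counter per attribute class.
    vec = {a: 0 for a in ATTRIBUTES}
    for op in path_ops:
        vec[op] += 1
    return vec

print(encode(path_ops))
# {'Client': 1, 'Propagate': 2, 'String-delimiter': 1, 'Script-tag': 0, 'Numeric': 0}
```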
3.1 Static Analysis-Based Classification
Attributes 1-10 in Table 1 characterize the functions and program operations to be classified by static analysis only. The first six attributes in Table 1 characterize the classification of user inputs depending on the nature of their sources. The reason for including input sources in our classification scheme is that most of the common vulnerabilities arise from the misidentification of inputs. That is, developers may implement adequate input validation and sanitization methods and yet fail to recognize all the data that could be manipulated by external users, thereby missing some of the inputs for validation. Therefore, in security analysis, it is important to first identify all the input sources.

The reason for classifying the inputs into different types is that each class of inputs causes different types of vulnerabilities, and different security defense schemes may be required to secure these different classes of inputs. For example, Client inputs like HTTP GET parameters should always be sanitized before being used in sinks, whereas it may not be necessary to sanitize Database inputs if they have been sanitized prior to their storage (double sanitization might cause security problems depending on the context). Uninit variables are variables that are un-initialized at the point of their usage, which could cause security problems (e.g., an attacker could inject malicious values into HTTP parameters having the same name as un-initialized variables by enabling the register_globals parameter in PHP configuration files). The reason for having two types of Database inputs, Text-database (string-type data) and Numeric-database (numeric-type data), is to reflect the fact that string-type data retrieved from data stores can cause second-order security attacks such as second-order SQLI and stored XSS, while it is difficult to conduct those attacks with numeric-type data.
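A toy sketch of this source classification (the mapping rules below are invented for illustration and are not PhpMiner's actual rules): inputs are assigned to attributes 1-6 by their syntactic origin, with database reads split by the retrieved type because only string-typed data can carry second-order payloads.

```python
# Hypothetical source-to-attribute mapping for attributes 1-5 of Table 1.
SOURCE_CLASS = {
    "$_GET": "Client", "$_POST": "Client",   # HTTP request parameters
    "$_COOKIE": "File",                      # cookie/file-based inputs
    "$_SESSION": "Session",                  # persistent data objects
}

def classify_source(expr, db_column_type=None):
    # Database reads are split by retrieved type: string-typed data can
    # carry second-order SQLI/XSS payloads, numeric-typed data cannot.
    if db_column_type == "text":
        return "Text-database"
    if db_column_type == "numeric":
        return "Numeric-database"
    for src, cls in SOURCE_CLASS.items():
        if expr.startswith(src):
            return cls
    return "Uninit"  # simplification: not traceable to any known source

print(classify_source("$_GET['id']"))                          # Client
print(classify_source("row['name']", db_column_type="text"))   # Text-database
```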
Un-taint refers to functions or operations that return information not extracted from the input string (e.g., mysql_num_rows). It also corresponds to functions or logic operations that return a Boolean value. The reason for this attribute is that, since the outcome values are not obtained from an input, the taint information flow stops at those functions and operations and thus, a sink would not be vulnerable from using those values.

Known-vulnerable-user corresponds to a class of custom functions that have caused security issues in the past. Known-vulnerable-std characterizes a class of language built-in functions that have caused security issues in the past. For example, according to vulnerability report CVE-2013-3238 [6], the preg_replace function with the "/e" modifier enabled has caused security issues. These functions are to be predefined by users based on their experience or the information obtained from security databases (we referred to CVE [6] and PHP security [47]).

Clearly, in Sk, there would also be functions and operations that do not serve any security purpose. They may simply propagate the input. Consequently, we use the attribute Propagate to characterize functions and operations (e.g., substring, concatenation) that do not serve any security purpose and that simply propagate (part of) the input.

Since the above functions and operations either have clear definitions with respect to security requirements or are associated with known vulnerability issues, they can be predefined in a database and classifications can be made statically. This database can be expanded as and when new vulnerability analysis information is available.

3.2 Hybrid Analysis-Based Classification
Attributes 11-32 listed in Table 1 characterize the functions to be classified by either static or dynamic analysis. This hybrid analysis-based classification is applied to validation and sanitization methods implemented using both standard security functions (i.e., language built-in or custom functions with known and tested security properties) and non-standard security functions. If there are only standard security functions to be classified, we classify them based on their security-related information (static analysis); otherwise, we use dynamic analysis.

In a program, various input validation and sanitization processes may be implemented using language built-in functions and/or custom functions. Since inputs to web applications are naturally strings, string replacement/matching functions or string manipulation procedures like escaping are generally used to implement custom input validation and sanitization procedures. A good security function generally consists of a set of string functions that accept safe strings or reject unsafe strings.

These functions are clearly important indicators of vulnerabilities, but we need to analyze the purpose of each validation and sanitization function since different defense methods are generally required to prevent different types of vulnerabilities. For example, to prevent SQLI vulnerabilities, escaping characters that have special meaning to SQL parsers is required, whereas escaping characters that have special meaning to client script interpreters is needed to prevent XSS vulnerabilities. Thus, it is important to classify these methods implemented in a program path into different types because, together with their associated vulnerability data, our vulnerability predictors can learn this information and then predict future vulnerabilities.

In Table 1, the attribute Numeric relates to 1) numeric-type casting built-in functions or operations (e.g., $a = (double) $b/$c); 2) language built-in numeric type checking functions (e.g., is_numeric); and 3) custom functions that return only numeric, mathematical, and/or dash '-' characters (e.g., functions that validate inputs such as mathematical equations, postal codes, or credit card numbers). When an input to be used in a sink is supposed to be of a numeric type, the sink can be made safe from this input through such functions or operations because various alphabetic characters are typically required to conduct security attacks.

DB-operator, DB-comment-delimiter, and DB-special basically reflect functions that filter sequences of characters that have special meaning to a database query parser. For example, mysql_real_escape_string is one such built-in function provided by PHP. Clearly, these attributes could predict SQLI vulnerability.

String-delimiter reflects functions that filter single quote (') and double quote (") characters. Lang-comment-delimiter reflects functions that filter comment delimiters such as (/) that are significant to script interpreters such as JavaScript. Other-delimiter reflects functions that filter any other comment delimiters such as (#). All these attributes could be significant vulnerability indicators because the characters they filter could disrupt the syntax of intended HTML documents, SQL queries, etc.

Script-tag reflects functions that filter sequences of characters that could invoke dynamic script interpreters such as JavaScript, Flash, and Silverlight. HTML-tag reflects functions that filter sequences of special characters such as <body>, which have special meaning to the static HTML interpreter. Since Script-tag and HTML-tag filter special characters that may cause XSS, these attributes could predict XSS vulnerability. Event-handler reflects functions that disallow the use of inputs as values of event handlers (e.g., onload) or other dangerous HTML attributes (e.g., src). Inputs used as the values of event handlers can easily cause XSS. For example, consider the following code:

<img src='$user_input'>

If a malicious value, such as http://hackersite.org/xss.js, is assigned to $user_input, XSS arises. Since the exploit does not necessarily use special characters like <script, filtering special characters is insufficient to prevent XSS. Instead, in such cases, only Event-handler type functions can safely prevent XSS. Hence, the Event-handler attribute could predict XSS flaws.

Null-byte, Dot, DotDotSlash, Backslash, Slash, Newline, Colon, and Other-special reflect functions that filter different types of meta-characters. Filtering the Dot (.) character is important to handle unintended file extensions or double file extension cases, which may cause file inclusion attacks (see a real-world example at CVE-2013-3239). Null-byte (%00) characters can be used to bypass sanitization routines and trick underlying systems into interpreting a given value incorrectly. For example, a file value like script.php%00.txt can trick a PHP program into seeing it as a non-malicious text file, but the underlying web server or the
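The point that tag filtering alone cannot stop attribute-context XSS can be demonstrated with a short sketch (the naive filter, payload, and attribute check below are invented for illustration):

```python
# A naive sanitizer that only strips <script> tags (Script-tag filtering).
def strip_script_tags(value):
    return value.replace("<script", "").replace("</script", "")

# Attribute-context payload: no <script> tag, so the filter leaves it intact.
payload = "x' onerror='alert(1)"
html = "<img src='{}'>".format(strip_script_tags(payload))
print(html)
# <img src='x' onerror='alert(1)'> -- the input broke out of the src attribute

# An Event-handler style defense instead rejects inputs that could close
# the attribute or introduce an event handler (hypothetical, over-broad check).
def safe_for_attribute(value):
    return "'" not in value and '"' not in value and "on" not in value.lower()

print(safe_for_attribute(payload))  # False: the payload is rejected
```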
Fig. 3. CFG of a program slice on tainted variables (a) $id and $pwd at sink 7 and (b) $name at sink 10.

Fig. 2. Sample PHP program with custom and language built-in validation and sanitization functions.

slicing algorithm based on the SDG [32]. We first construct the PDG for the main method of a web application program and also construct PDGs for the methods called from the main method according to the algorithm given by Ferrante et al. [8]. We then construct the SDG. A PDG models a program procedure as a graph in which the nodes represent program statements and the edges represent data or control dependences between statements. The SDG extends the PDG by modeling interprocedural relations between the main program and its subprograms.

To illustrate, Fig. 2 shows an interprocedural slice of the sink at line 10 (denoted as S10) with respect to variable $name. Fig. 3a shows the CFG for the slice of the sink at line 7 (denoted as S7) and Fig. 3b shows the CFG for the slice of S10.

4.1.2 Hybrid Analysis

Typically, a web application program accesses inputs and propagates them via tainted variables for further processing of the application's logic. These processes may often include sensitive program operations such as database updates, HTML outputs, and file accesses. If the program variables propagating the input data are not properly checked before being used in those sinks, vulnerabilities arise. Therefore, to prevent web application vulnerabilities, developers typically employ input validation and input sanitization methods along the paths propagating the inputs to the sinks. By default, inputs to web application programs are strings. As such, input validation checks and sanitization operations performed in a program are mainly based on string operations. These operations typically include language built-in validation and sanitization functions (e.g., mysql_real_escape_string), string replacement and string matching functions (e.g., str_match), and regular-expression-based string replacement and matching functions (e.g., preg_replace).

Basically, our approach attempts to answer the following research question: "Given a slice of a sink, from the types and numbers of inputs, and the types and numbers of input validation and sanitization functions identified from each path in the slice, can we predict the sink's vulnerability?"

Therefore, our objective is to infer the potential effects of validation checks and sanitization operations on tainted variables using static and dynamic analyses, and to classify those operations based on these inferences. For every path in Sk that propagates the values of tainted variables into k, we carry out this analysis.

Our hybrid (static and dynamic) analysis includes the techniques proposed by Balzarotti et al. [11]. Basically, for each function in a data flow graph, Balzarotti et al. first analyze the function's static program properties in an attempt to determine the potential sanitization effect of the function on the input. If this static analysis is likely to be imprecise, then they simulate the effect of the sanitization functions on the input by executing the code with different test inputs containing various types of attack strings. The execution results are then passed to a test oracle, which evaluates the functions' sanitization effect by checking the presence of those attack strings.

Building on Balzarotti et al.'s work, we model the same information using IVS attributes to enable machine learning and vulnerability prediction. Another difference is that our analysis is performed on program slices rather than data flow graphs. As we discussed earlier, since input validation and sanitization can be performed using predicates, the analysis of data flow graphs may be insufficient. A detailed comparison of our work with Balzarotti et al.'s work is provided in the related work section. In the following, we explain how we made use of Balzarotti et al.'s analysis technique in our context.

Step 1. We first extract all possible paths from Sk. To avoid infinite paths, we use Balzarotti et al.'s solution, which is to traverse each loop only once. For example, as shown in Fig. 3, S7 has only one path and S10 has two paths.

Step 2. Each extracted path Pi is classified according to the IVS attributes (Table 1). As described next, classification is performed with compulsory static analysis first, followed by optional dynamic analysis.
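Step 1's bounded path extraction can be sketched as a depth-first traversal in which each CFG edge is taken at most once per path, so every loop is traversed only once (the graph encoding below is hypothetical, not the paper's implementation):

```python
# Hypothetical CFG with one loop, as adjacency lists.
CFG = {
    "entry": ["cond"],
    "cond": ["loop_body", "sink"],   # branch: enter the loop or exit
    "loop_body": ["cond"],           # back edge forming the loop
    "sink": [],
}

def extract_paths(cfg, node="entry", path=None, used_edges=None):
    # DFS enumerating entry-to-sink paths; each edge may be used at most
    # once per path, so the loop's back edge is followed at most one time.
    path = (path or []) + [node]
    used_edges = used_edges or set()
    if not cfg[node]:
        return [path]
    paths = []
    for succ in cfg[node]:
        edge = (node, succ)
        if edge not in used_edges:
            paths += extract_paths(cfg, succ, path, used_edges | {edge})
    return paths

for p in extract_paths(CFG):
    print(" -> ".join(p))
# entry -> cond -> loop_body -> cond -> sink
# entry -> cond -> sink
```

This mirrors the example in the text: a slice whose CFG contains a branch yields two paths, while a straight-line slice yields one.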
4) Compute the number of synthetic instances that need to be generated for each minority instance x_i: g_i = ĝ_i × G.
5) Finally, g_i instances for each minority instance x_i are generated using the formula: x_new = x_i + (x̂_i − x_i) × δ, where x̂_i is one of the K nearest neighbors of x_i and δ ∈ (0, 1] is a random number.

Hence, the idea of adaptive synthetic oversampling is to focus on generating more synthetic data for borderline minority class instances in the attribute space that have a high risk of misclassification, rather than blindly generating new minority class instances to balance the data, which, for some minority class instances, could result in over-fitting while still under-representing the borderline instances. It ensures the adequate representation of minority class data by systematically generating synthetic data where learning is expected to be more difficult.

Attribute selection. Some of the IVS attributes are only relevant for a specific type of vulnerability (for example, DotDotSlash is only relevant for detecting FI vulnerability) and some attributes may be correlated. We use an attribute selection technique called correlation-based feature subset selection with a greedy stepwise backward search algorithm [50] to filter the irrelevant or redundant attributes and thus to reduce the potential negative impact they may have on the learning process. This technique selects the best subset of attributes by performing a greedy backward search through the space of attribute subsets. It starts with a subset of attributes and deletes each attribute one by one. It then evaluates the worth of a subset of attributes by considering the individual predictive ability of each attribute along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low inter-correlation are preferred. The algorithm stops when the deletion of any remaining attribute results in a decrease in predictive accuracy.

4.2.3 Supervised Learning

Classification is a type of supervised learning method because the class label of each training instance has to be provided. In this study, we build logistic regression (LR) and RandomForest (RF) models from the proposed attributes. There are two reasons for choosing these two types of classifiers: 1) these classifiers were benchmarked as among the top classifiers in the literature [14]; 2) the LR-based predictor achieved the best result in our initial work [33] and yields results that are easy to interpret in terms of the impact of attributes on vulnerability predictions.

LR [38] is a type of statistical classification model. It can be used for predicting the outcome (class label) of a dependent attribute based on one or more predictor attributes. The probabilities describing the possible outcomes of a given instance are modeled, as a function of the predictor attributes, using a logistic function:

p(a_1, ..., a_n) = 1 / (1 + e^(−A)),

where p is a conditional probability: the probability that a sink in a path is vulnerable as a function of the path's security-related properties reflected through predictor attributes. A (= b_0 + b_1·a_1 + ... + b_n·a_n) is a linear combination of the n predictor attributes that are statistically significant in terms of their association with the dependent attribute and thus are selected by the LR modeling process. b_0 is a constant. b_i is the regression coefficient estimated using a maximum likelihood estimation method for attribute a_i.

The curve between p and any attribute a_i, assuming that all other attributes are constant, takes a flexible 'S' shape which ranges between two extreme cases:

a) When a_i is not a significant predictor of vulnerability, the curve approximates a horizontal line; that is, p does not depend on a_i.
b) When a_i strongly indicates vulnerability, the curve approximates a step function.

As such, logistic regression analysis is flexible in terms of the types of monotonic relationships it can model between the probability of vulnerability and predictor attributes.

RF [37] is an ensemble learning method for classification that consists of a collection of tree-structured classifiers. In many cases the predictive accuracy is greatly enhanced as the final prediction output comes from an ensemble of learners, rather than a single learner. Given an input sample, each tree casts a vote (classification) and the forest outputs the classification having the majority vote from the trees. At an intuitive level, the forest construction procedure is as follows:

1) Select K bootstrap samples from the training set. Bootstrapping, i.e., sampling with replacement, ensures that about one-third of the training set is left out, which can be used as a test set.
2) Fit a classification tree to each bootstrap sample, resulting in K trees. Each tree is grown to the largest extent possible without pruning.
3) Each instance i left out in the construction of the kth tree is classified by the kth tree. Due to bootstrapping, i can be classified by about one-third of the trees. Taking c to be the class that got most of the votes across these classifications, the proportion of times that c is not equal to the true class of i, averaged over all instances, is the so-called out-of-bag error estimate. This estimate can be used as an estimate of the generalization error and is used to guide the forest construction process.

4.2.4 Semi-Supervised Learning

As discussed above, for supervised learning, we use LR and RF, the latter being a type of ensemble learning method that has achieved high accuracy in the literature [14]. However, as ensemble learning works by combining individual classifiers, it typically requires significant amounts of labeled data for training. In certain industrial contexts, relevant and labeled data available for learning may be limited.

Semi-supervised methods [39] use, for training, a small amount of labeled data together with a much larger amount of unlabeled data. Such a method, which exploits unlabeled data, can enable ensemble learning when there is very little labeled data. As explained by Zhou [43], combining semi-supervised learning with ensembles has many advantages. Unlabeled data is exploited to help enrich labeled training samples, allowing ensemble learning: each individual learner
TABLE 2
Test Subjects
is improved with unlabeled data labeled by the ensemble consisting of all other learners. As listed in Lu et al. [41], a few different types of semi-supervised methods, such as EM-based, clustering-based, and disagreement-based learning, have been proposed in the literature. But none of these techniques has been explored for vulnerability prediction so far.

Hence, based on these motivations, we explore the use of an algorithm called CoForest (Co-trained Random Forest, CF), which applies semi-supervised learning on RF. It is a disagreement-based, semi-supervised learner initially proposed by Li and Zhou [42]. CF uses multiple, diverse learners, combines them to exploit unlabeled data (semi-supervised learning), and maintains a large disagreement between the learners to promote the learning process.

CF is based on RF and its procedure is as follows:

1) Construct a random forest H with K trees with the available labeled data L.
2) For each tree k in H, repeat the following steps 3-6.
3) Construct a new random forest Hk by removing k from H.
4) Use Hk to label all the unlabeled data U and esti-

the use of semi-supervised learning instead of supervised learning if there are few defects reported. But no performance comparison between semi-supervised learning and supervised learning has yet been investigated in the context of vulnerability prediction. This leads us to our next research question.

Question 2 (Q2). Even if the availability of vulnerability data is limited, can vulnerabilities be predicted using semi-supervised learning? Further, will the performance of a semi-supervised learner be superior to that of a supervised learner when the availability of vulnerability data is limited?

5.2 Experiment Subjects

To evaluate the effectiveness of our vulnerability prediction framework, we perform experiments on seven real-world PHP web applications, with known vulnerabilities and benchmarked for the evaluation of many vulnerability detection approaches [3], [4], [21], [28]. These applications can be obtained from SourceForge [5]. Table 2 shows relevant statistics for these applications. The vulnerability information can be found in security advisories such as CVE [6].
mate the labeling confidence based on the degree of Securities advisories typically report only vulnerable web
agreements on the labeling, i.e., the number of classi- pages, which is too coarse-grained for our purpose. And its
fiers that vote for the label assigned by Hk . vulnerability information can typically be traced to multiple
5) Generate a new labeled dataset L0 by combining L vulnerabilities appearing in different program statements.
with the unlabeled data labeled with the confidence Therefore, we still had to manually inspect the reported vul-
levels above a preset confidence threshold. nerable web pages and analyze the server programs to
6) Refine k with L0 . locate the vulnerable program statements.
7) Repeat the above steps 2 6 until none of the trees in For data collection, we enhanced the prototype tool
H changes. PhpMiner used in our previous work [33]. PhpMiner basically
For detail information on CF, please refer to [40] and [42]. implements the steps shown in Fig. 1. It is a fully automated
data collection tool. Given a PHP program, it generates con-
trol flow graphs, program dependence graphs, and system
5 EXPERIMENTAL EVALUATION dependence graphs of the program. It then computes back-
5.1 Research Questions ward static program slices of the sinks found in the program,
This paper aims to investigate the following two research according to the interprocedural slicing algorithm given by
questions: Horwitz et al. [32]. Then, it uses a depth-first search strategy
Question 1 (Q1). Can our proposed IVS attributes, when to extract the paths in the slices. We also implement the tech-
fed to a machine learner, accurately predict SQLI, XSS, RCE, niques discussed in Section 4 to automate the static and
and FI vulnerabilities? dynamic analysis-based classifications of the paths. For
High accuracy is expected to translate into high recall static-based classification, we classify over 330 PHP built-in
and low probability of false alarm when predicting vulner- functions and 30 PHP operators into various input valida-
abilities. Although classifiers can be effective, as discussed tion and sanitization types and store them in a database.
above, a sufficient number of instances with known vulner- As output, PhpMiner produces the attribute vectors like the
ability information is required to train a classifier (super- ones shown in Fig. 6, without the vulnerability labels, which
vised learning). As a result, in certain situations, supervised were manually tagged by us for the experiment. Our tool
learning is either infeasible or ineffective. In the context of also implements the evaluation procedures (Fig. 7) for
defect prediction, some studies [40], [41] have endorsed supervised and semi-supervised learners. For learning
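The CoForest procedure listed in Section 4.2.4 can be illustrated with a minimal, self-contained Python sketch. This is not the authors' implementation: the `StubTree` learner (a one-feature threshold classifier standing in for a real decision tree), the 0.75 confidence threshold, and all names are illustrative assumptions.

```python
class StubTree:
    """Deliberately tiny stand-in for a decision tree: it learns a single
    threshold on one feature (enough to illustrate the CoForest loop)."""

    def __init__(self, feature):
        self.feature = feature
        self.threshold = 0.5

    def fit(self, X, y):
        pos = [x[self.feature] for x, label in zip(X, y) if label == 1]
        neg = [x[self.feature] for x, label in zip(X, y) if label == 0]
        if pos and neg:
            # Midpoint between the two class means.
            self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict(self, x):
        return 1 if x[self.feature] > self.threshold else 0


def coforest(L_X, L_y, U_X, n_trees=3, confidence=0.75, max_rounds=10):
    """Steps 1-7 above: refine each tree with unlabeled samples labeled by
    the concomitant ensemble Hk of all *other* trees, keeping only samples
    whose voting agreement exceeds the confidence threshold."""
    n_features = len(L_X[0])
    trees = [StubTree(feature=k % n_features) for k in range(n_trees)]
    for tree in trees:                       # step 1: initial forest H from L
        tree.fit(L_X, L_y)
    for _ in range(max_rounds):              # step 7: loop until H is stable
        changed = False
        for k, tree in enumerate(trees):     # steps 2-3: Hk = H without tree k
            h_k = [t for i, t in enumerate(trees) if i != k]
            extra_X, extra_y = [], []
            for x in U_X:                    # step 4: label U, estimate confidence
                votes = [t.predict(x) for t in h_k]
                label = max(set(votes), key=votes.count)
                if votes.count(label) / len(votes) >= confidence:
                    extra_X.append(x)        # step 5: L' = L + confident U
                    extra_y.append(label)
            before = tree.threshold
            tree.fit(L_X + extra_X, L_y + extra_y)   # step 6: refine tree k
            changed = changed or abs(tree.threshold - before) > 1e-12
        if not changed:
            break

    def ensemble_predict(x):                 # final prediction: majority vote
        votes = [t.predict(x) for t in trees]
        return max(set(votes), key=votes.count)
    return ensemble_predict
```

For example, with four labeled sinks and four unlabeled ones, `coforest([[0.1], [0.2], [0.8], [0.9]], [0, 0, 1, 1], [[0.05], [0.15], [0.85], [0.95]])` returns an ensemble predictor whose trees have been refined on the confidently labeled pool.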
TABLE 4
Data Distributions of Attributes Across Instances Across Datasets
Fig. 10. Cross validation results of the supervised learners—Logistic Regression (LR) and RandomForest (RF).
First, we can observe, as we expected, that frequencies vary significantly across attributes, thus showing their widely different importance in terms of predicting vulnerability. The most selected attributes include Client, Uninit, Un-taint, and Propagate. We note that a few attributes, namely DB-operator, Other-delimiter, Event-handler, Dot, Colon, Other-special, and Path, were not selected at all, although those attributes reflect functions that could sanitize potentially dangerous meta-characters like (.) and (,). This is not surprising since, as observed in Table 4, the data distributions of those attributes are sparse, indicating that they are not present in most instances. Attributes like Dot and Path are not relevant for most datasets since they are designed for detecting FI and RCE vulnerabilities, and our experiment only contains two datasets that correspond to each of these vulnerabilities. We also manually checked that some of the rarely-selected attributes are actually present in some of our datasets, but they were not selected by logistic regression as they were found to be not statistically significant. Lastly, the overall key observation is that most of the proposed attributes are selected by different models with varying frequencies, suggesting that the set of proposed attributes reflects the various vulnerability patterns in the selected datasets.

5.4.3 Prediction Results
Fig. 10 shows the predictive accuracy of the LR and RF models learnt from IVS attributes, in terms of recall, precision, and probability of false alarm, based on cross validation. On average, RF performed slightly better than LR. Thus, this study allows us to recommend a better supervised learning scheme (RandomForest) than the one used (Logistic Regression) in our previous work [33] for our vulnerability prediction context. We focus our discussion below on predictive accuracy based on RF's results.

Averaging over all 15 datasets, the RF models achieved (pd = 77%, pf = 5%), which is better than the result generally benchmarked (pd > 70%, pf < 25%) by many prediction studies [23], [34]. This implies that our prediction approach detects 77 percent of the top four web application vulnerabilities at the cost of filtering a few false positives. Given that, in practice, web application projects typically have many software modules containing many sinks, and undergo many versions over a long lifespan, such models can be very useful in practice to predict vulnerabilities in new versions based on vulnerability data from past versions.

For all the datasets, the RF models achieved low pf results, and for most datasets, the RF models also achieved high pd results. But we also note that, for a few datasets, the models achieved pd results lower than our benchmark pd result (pd > 70 percent). If we take myadmin2-xss as a representative example, our model only achieved pd = 48%, but still, achieving a very low pf (1 percent) makes such a model useful in practice. Looking more closely at the numbers, myadmin2-xss contains a total of 425 instances, including 14 vulnerable instances (Table 3). Thus, the model catches nearly half of the vulnerabilities at the expense of only four false warnings, which are not costly for developers to filter.

Hence, to answer Q1, the supervised prediction models built from IVS attributes can predict SQLI, XSS, RCE, and FI vulnerabilities in most datasets, with a sufficient level of accuracy to be useful. And even in the few cases where the
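The back-of-the-envelope arithmetic behind the myadmin2-xss example can be reproduced directly. The helper below only illustrates how pd (recall) and pf (probability of false alarm) relate to raw confusion-matrix counts; it is not part of PhpMiner or the paper's tooling.

```python
def pd_pf(tp, fn, fp, tn):
    """pd (recall) = TP / (TP + FN); pf (false alarm rate) = FP / (FP + TN)."""
    return tp / (tp + fn), fp / (fp + tn)

# myadmin2-xss: 425 instances in total, 14 of them vulnerable (Table 3).
total, vulnerable = 425, 14
caught = round(vulnerable * 0.48)                  # pd = 48% -> about 7 caught
false_alarms = round((total - vulnerable) * 0.01)  # pf = 1% -> about 4 warnings

pd, pf = pd_pf(tp=caught,
               fn=vulnerable - caught,
               fp=false_alarms,
               tn=total - vulnerable - false_alarms)
```

With these counts, 7 of the 14 vulnerable sinks are caught (nearly half) at the cost of 4 false warnings, matching the discussion above.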
Fig. 11. Cross validation results of the semi-supervised learner—CoForest (CF) and the supervised learner—RandomForest (RF) at a sampling rate m = 20%.
TABLE 5
Runtime Performance of PhpMiner
Test Subject Static Analysis Time (s) Dynamic Analysis Time (s) Average Learning Time (s) Total time (s)
SchoolMate 1.5.4 8,211 792 99 9,102
FaqForge 1.3.2 6,789 511 84 7,384
Utopia News Pro 1.1.4 7,699 1,250 87 9,036
Phorum 5.2.18 10,592 2,134 132 12,858
CuteSITE 1.2.3 9,205 1,977 105 11,287
PhpMyAdmin 3.4.4 17,549 3,104 141 20,794
PhpMyAdmin 3.5.0 28,700 4,570 160 33,430
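A quick consistency check on Table 5 (values copied from the table above) confirms that the Total column is the sum of the three phases and that static analysis accounts for the bulk of the runtime, as Section 5.6 discusses.

```python
# (static, dynamic, learning, total) times in seconds, copied from Table 5.
rows = {
    "SchoolMate 1.5.4":      (8211, 792, 99, 9102),
    "FaqForge 1.3.2":        (6789, 511, 84, 7384),
    "Utopia News Pro 1.1.4": (7699, 1250, 87, 9036),
    "Phorum 5.2.18":         (10592, 2134, 132, 12858),
    "CuteSITE 1.2.3":        (9205, 1977, 105, 11287),
    "PhpMyAdmin 3.4.4":      (17549, 3104, 141, 20794),
    "PhpMyAdmin 3.5.0":      (28700, 4570, 160, 33430),
}
for name, (static, dynamic, learning, total) in rows.items():
    assert static + dynamic + learning == total  # totals are consistent
    assert static / total > 0.75                 # static analysis dominates
```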
semi-supervised learning should be favored to supervised learning.

5.6 Discussion on Data Collection and Model Learning Performance
We showed above that both static analysis- and dynamic analysis-based attributes contribute to achieving sufficient predictive accuracy for the models to be useful in practice, that is, for vulnerabilities to be detected at reasonable cost. Still, we also need to analyze the scalability of these analyses. Table 5 shows the runtime performance of PhpMiner. Since our hybrid analysis technique is based on the work of Balzarotti et al. [11], the runtime performance of our tool showed similar results. That is, PhpMiner actually spent most of the time on static analysis, extracting slices and their paths. The time spent on running the test suites (dynamic analysis) was considerably less. Although the total time taken was up to a maximum of nine hours, we believe that this is reasonable considering that some of the test subjects are widely-used, real-world applications; thus, our performance results suggest that our tool can be applicable in practice. Also, while implementing our tool, performance optimization was not as much a focus as would be expected in an industry-strength tool, and there is probably significant room for improvement. Average learning time in Table 5 refers to the time spent on training and testing a learner with one specific setting. We did not differentiate the time spent on supervised learning and semi-supervised learning because the time difference between these machine learning processes is insignificant. It took a maximum of three minutes to train and test a learner (including 50 trials for each setting).

5.7 Threats to Validity
Our current work targets PHP web applications because the vulnerabilities we address are very common and serious for PHP applications [51]. However, though this is a practical limitation, it is possible to extend the logic presented in this paper to other programming languages. For example, to adapt our approach to Java, the same classification schemes described in this work could be used. One could predefine Java built-in functions and operations to perform static analysis-based classification. And to perform static and dynamic analyses, there are readily available Java program analysis tools such as Chord [44]. Despite these necessary adaptations to other languages, it is important to note that the overall approach would be similar.

Data sampling bias is one of our threats to validity. Our results here may not generalize well to other types of applications in commercial sectors since all our test applications are open source. But it is difficult to conduct experiments on commercial applications since their vulnerability data is not publicly accessible. Also, the implicit assumption of our approach that all the application code is available for analysis is clearly a limitation in some application contexts. Some (especially commercial) applications might use plug-ins or third-party software components, which may be only known at runtime or for which the source code is unavailable. We also consider that a security-sensitive program operation is vulnerable if it uses an input read from an external environment with unknown security controls. Hence, our result would be incorrect if the application is run inside a framework that provides a layer of safeguards that properly validates all the incoming inputs.

Our data only reflect the patterns of those vulnerabilities that are reported in vulnerability databases. Hence, our vulnerability predictions may not detect vulnerabilities having different characteristics in terms of our proposed attributes. But, considering the wide variability in the characteristics of the test subjects (see Table 2), our results should be widely applicable. It is also noteworthy that our underlying hybrid analysis may produce classification errors affecting the prediction results. For example, dynamic analysis may incorrectly flag a function as a JavaScript tag filtering function. But since our predictors are learnt on past data, if the same function is causing a number of sinks to be vulnerable, machine learning algorithms learn from it, and the presence of such a function in the program slices will indicate vulnerabilities.

The use of additional or different machine learning techniques might alter our results. For data balancing, we also tried other sampling techniques like undersampling (removing majority class data) [49], but adaptive synthetic oversampling provided better results. Regarding attribute selection, we also evaluated learners without any attribute selection and with different attribute selection methods such as gain ratio [9], but the correlation-based method provided slightly better results. For supervised learning, we used two very different classification algorithms, which are statistical-based and ensemble-based, respectively. We also tried other types of classifiers like multi-layer perceptron and C4.5, which are neural network-based and tree-based, respectively, but RandomForest's results were superior. We have not tried other algorithms for semi-supervised learning. We did not focus our attention on fine-tuning the prediction models, and therefore better results might be obtained.

Like all other empirical studies, our results are limited to the applied machine learning processes, the test subjects,
and the experimental setup used. One good solution to refute, confirm, or improve our results is to replicate the experiments with new test subjects and probably with further machine learning strategies. This can be easily done since we have clearly defined our empirical methods and setup, and we also provide the data used in the experiments and the data collection tool on our web site [7].

6 RELATED WORK
Our work applies machine learning for the prediction of vulnerabilities in web applications. Hence, its related work falls into three main categories: defect prediction, vulnerability prediction, and vulnerability detection.

Defect prediction. In defect prediction studies, defect predictors are generally built from static code attributes such as object-oriented design attributes [12], LOC counts, and code complexity attributes [14], [34], [35], because static attributes can be cheaply and consistently collected across many systems [34]. However, it was quickly realized that such attributes can only provide limited accuracy [13], [15], [25]. Arisholm et al. [13] and Nagappan et al. [25] reported that process attributes (e.g., developer experience and fault history) could significantly improve prediction models. On the other hand, as process attributes are difficult to measure and measurements are often inconsistent, Menzies et al. [15] showed that static code attributes could still be effective if predictors are tuned to user-specific goals.

In many real-world applications, defect data is often limited, which makes supervised learning infeasible or ineffective. Li et al. [40] and Lu et al. [41] showed that semi-supervised learning can be used to address this problem and that semi-supervised learners could also perform well in software defect prediction. Li et al. [40] used the CoForest method, which is also used by our work.

The similarity with these defect prediction studies is that our work also uses machine learning techniques in building vulnerability predictors. However, the major difference is that our study targets security vulnerabilities in web applications. Since these studies show that the existing sets of attributes do not work everywhere, we define specific attributes targeted at predicting vulnerabilities based on automated and scalable static and dynamic analysis.

Vulnerability prediction. Shin et al. [23] used code complexity, code churn, and developer activity attributes to predict vulnerable programs. They achieved pd = 80% and pf = 25%. Their assumption was that the more complex the code, the higher the chances of vulnerability. But from our observations, many of the vulnerabilities arise from simple code; if a program does not employ any input validation and sanitization routines, it would be simpler but nevertheless contain many vulnerabilities. Walden et al. [24] investigated the correlations between a security resource indicator (SRI) and the numbers of vulnerabilities in PHP web applications. SRI is derived from publicly available security information such as past vulnerabilities, secure development guidelines, and security implications regarding system configurations. Neuhaus et al. [26] also predicted vulnerabilities in software components from the past vulnerability information and the imports and function calls attributes. Their work is based on the concept that software components similar to known vulnerable ones, in terms of imports and function calls, are likely to be vulnerable as well. They achieved pd = 45% and pr = 70%.

Yamaguchi et al. [45], [46] use natural language processing techniques to identify and extract API usage patterns from abstract syntax trees [45] or dependency graphs [46], which are then represented as attributes for machine learning. The numbers of attributes are not bounded, whereas we propose 32 code attributes, each of which is specifically designed to reflect a specific type of input validation and sanitization code pattern and thus is an important indicator of vulnerability. Also, we use program analysis techniques, both static and dynamic analyses, to accurately extract those attributes for machine learning.

The above vulnerability prediction approaches generally target software components or program functions. By contrast, our method targets specific program statements for vulnerability prediction. Another major difference is that we use code attributes that characterize input validation and sanitization routines.

Shar and Tan [2], [16] predicted vulnerabilities using static analysis. Similar to this extension work, they classify the types of validation and sanitization functions implemented for the sinks and reflect those classifications in static code attributes. Although their supervised learners built from static attributes achieved good accuracies, they observed that static analysis could not precisely classify the types of some of the validation and sanitization functions. Later, Shar et al. [33] predicted vulnerabilities using hybrid code attributes. Dynamic analysis was incorporated into static analysis to improve the classification accuracy. Although these earlier works only targeted SQLI and XSS vulnerabilities, they stressed that the work should be extended to address other types of vulnerabilities as well. This work extends the prior ones by addressing two additional types of common vulnerabilities. We propose new attributes and analyze code patterns related to these additional vulnerabilities. More importantly, this work also introduces semi-supervised learning in the domain of vulnerability prediction.

Vulnerability detection. Jovanovic et al. [3] and Xie and Aiken [4] showed that many XSS and SQLI vulnerabilities can be detected by static program analysis techniques. They identify various input sources and sensitive sinks, and determine whether any input data is used in those sinks without passing through sanity checks. Such static taint tracking approaches often generate too many false alarms, as these approaches cannot reason about the correctness and the adequacy of those sanity checks. Thus, these approaches are not precise in general.

To improve precision, Fu and Li [27] and Wassermann and Su [28] approximated the string values that may appear at sensitive sinks by using symbolic execution and string analysis techniques. More recent approaches incorporate dynamic analysis techniques such as concolic execution [21] and model checking [22]. These approaches reason about various paths in the program that lead to sensitive sinks and attempt to generate test cases that are likely to be attack vectors. All these approaches reduce false alarm rates. But symbolic, concolic, and model checking
techniques often lead to the path explosion problem [30]. It is difficult to reason about all the paths in the program when the program contains many branches and loops. Further, the performance of these approaches also depends very much on the capabilities of their underlying string constraint solvers in handling the myriad of string operations offered by programming languages. Therefore, these approaches typically suffer from scalability issues.

Our static and dynamic analysis technique builds on Balzarotti et al. [11]. But, similar to the above techniques, Balzarotti et al. apply static and dynamic analysis to determine the correctness of custom sanitization functions identified on data flow graphs, thus leading to scalability issues as well. The difference, and the contribution of our work, is that we leverage machine learning techniques to mitigate this scalability problem. That is, a predictor can learn correct and incorrect custom functions based on historical data. Though we apply Balzarotti et al.'s static and dynamic analysis technique, we do not do so to precisely compute the correctness of custom functions, but rather to infer their security purposes and apply these inferences in machine learning. As a result, our approach also does not require string solving and reasoning about (potentially infinite) program paths like concolic execution and model checking techniques.

However, symbolic, concolic, and model checking approaches could possibly yield high vulnerability detection accuracy, which may never be matched by machine learning-based methods. Thus, our objective is not to provide a replacement for such techniques but rather to provide a complementary approach to combine with them and to use when they are not applicable. One could, for example, first gather vulnerability predictions on code sections using machine learning and then focus on the code sections with predicted vulnerabilities using the more accurate techniques mentioned above. Thereafter, ideally, the confirmed vulnerabilities should be removed by manual audits or by using automated vulnerability removal techniques such as Shar and Tan [29].

7 DISCUSSIONS AND CONCLUDING REMARKS
The main goal of this paper is to achieve both high accuracy and good scalability in detecting web application vulnerabilities. In principle, our proposed approach leverages all the advantages provided by existing static and dynamic taint analysis approaches and further enhances accuracy by using prediction models developed with machine learning techniques and based on available vulnerability information. Static analysis is generally sound but tends to generate many false alarms. Dynamic analysis is precise but could miss vulnerabilities, as it is difficult or impossible to exercise every test case scenario. Our strategy consisted in building predictors using machine learners trained with the information provided by both static and dynamic analyses and available vulnerability information, in order to achieve good accuracy while meeting scalability requirements.

Our static analysis only involves computing program slices. Dynamic analysis is only used to infer the security-checking types of validation and sanitization functions, and we use this inferred information for prediction rather than correctness analysis. This approach is scalable since it does not require constraint solving and model checking to reason about correctness as in existing dynamic techniques, e.g., concolic execution. Our analysis is also fine-grained since it identifies vulnerabilities at the program statement level as opposed to the component level, as in existing vulnerability prediction approaches.

In our experiments on seven PHP web applications, we first showed that the proposed IVS attributes can be used to detect several types of vulnerabilities. On average, the RandomForest models, built on IVS attributes, achieved (pd = 92%, pf = 4%), (pd = 72%, pf = 9%), (pd = 64%, pf = 1%), and (pd = 76%, pf = 1%) when predicting SQL injection, cross site scripting, remote code execution, and file inclusion vulnerabilities, respectively. We also showed that, when a limited number of sinks with known vulnerabilities are available for training the prediction model, semi-supervised learning is a good alternative to supervised learning. We compared RandomForest (supervised) and CoForest (semi-supervised) models at a low data sampling rate of 20 percent, which determines the amount of labeled training data. The CoForest model achieved (pd = 71%, pf = 5%) on average over 15 datasets, outperforming the RandomForest model, which achieved (pd = 47%, pf = 8%).

To generalize our current results, our experiment can be easily replicated and extended, as we made our tool and data available online [7]. We also intend to conduct more experiments with industrial applications. While we believe that the proposed approach can be a useful and complementary solution to existing approaches, studies need to be carried out to determine the feasibility and usefulness of integrating multiple approaches.

ACKNOWLEDGMENTS
The authors would like to thank Hongyu Zhang [40] for providing us with the Java implementation of the CoForest algorithm. This work was partially supported by the National Research Fund, Luxembourg (FNR/P10/03). Lwin Khin Shar is the corresponding author.

REFERENCES
[1] OWASP. (2012, Jan.). The Open Web Application Security Project. [Online]. Available: http://www.owasp.org
[2] L. K. Shar and H. B. K. Tan, "Predicting SQL injection and cross site scripting vulnerabilities through mining input sanitization patterns," Inf. Softw. Technol., vol. 55, no. 10, pp. 1767-1780, 2013.
[3] N. Jovanovic, C. Kruegel, and E. Kirda, "Pixy: A static analysis tool for detecting web application vulnerabilities," in Proc. IEEE Symp. Security Privacy, 2006, pp. 258-263.
[4] Y. Xie and A. Aiken, "Static detection of security vulnerabilities in scripting languages," in Proc. USENIX Security Symp., 2006, pp. 179-192.
[5] SourceForge. (2012, Mar.). [Online]. Available: http://www.sourceforge.net
[6] CVE: Distributions of vulnerabilities by types. (2013, May). [Online]. Available: http://www.cvedetails.com/vulnerabilities-by-types.php
[7] PhpMiner. (2013). [Online]. Available: http://sharlwinkhin.com/phpminer.html
[8] J. Ferrante, K. J. Ottenstein, and J. D. Warren, "The program dependence graph and its use in optimization," ACM Trans. Programm. Languages Syst., vol. 9, pp. 319-349, 1987.
[9] I. H. Witten, E. Frank, and M. A. Hall, Data Mining, 3rd ed. San Mateo, CA, USA: Morgan Kaufmann, 2011.
[10] RSnake. (2012, Mar.). [Online]. Available: http://ha.ckers.org
[11] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna, "Saner: Composing static and dynamic analysis to validate sanitization in web applications," in Proc. IEEE Symp. Security Privacy, 2008, pp. 387-401.
[12] L. C. Briand, J. Wüst, J. W. Daly, and D. V. Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," J. Syst. Softw., vol. 51, no. 3, pp. 245-273, 2000.
[13] E. Arisholm, L. C. Briand, and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," J. Syst. Softw., vol. 83, no. 1, pp. 2-17, 2010.
[14] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, "Benchmarking classification models for software defect prediction: A proposed framework and novel findings," IEEE Trans. Softw. Eng., vol. 34, no. 4, pp. 485-496, Jul./Aug. 2008.
[15] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener, "Defect prediction from static code features: Current results, limitations, new approaches," Automated Softw. Eng., vol. 17, no. 4, pp. 375-407, 2010.
[16] L. K. Shar and H. B. K. Tan, "Predicting common web application vulnerabilities from input validation and sanitization code patterns," in Proc. Int. Conf. Automated Softw. Eng., 2012, pp. 310-313.
[17] C. Anley, Advanced SQL Injection in SQL Server Applications, Next Generation Security Software Ltd., White Paper, 2002.
[18] S. Palmer, Web Application Vulnerabilities: Detect, Exploit, Prevent. Syngress, 2007.
[19] Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, and K. Matsumoto, "The effects of over and under sampling on fault-prone module detection," in Proc. Int. Symp. Empirical Softw. Eng. Meas., 2007, pp. 196-204.
[20] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learning Res., vol. 7, pp. 1-30, 2006.
[21] A. Kiezun, P. J. Guo, K. Jayaraman, and M. D. Ernst, "Automatic creation of SQL injection and cross-site scripting attacks," in Proc. Int. Conf. Softw. Eng., 2009, pp. 199-209.
[22] M. Martin and M. S. Lam, "Automatic generation of XSS and SQL injection attacks with goal-directed model checking," in Proc. USENIX Security Symp., 2008, pp. 31-43.
[23] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, "Evaluating complexity, code churn, and developer activity metrics as indicators of software vulnerabilities," IEEE Trans. Softw. Eng., vol. 37, no. 6, pp. 772-787, Nov./Dec. 2011.
[24] J. Walden, M. Doyle, G. A. Welch, and M. Whelan, "Security of open source web applications," in Proc. Int. Symp. Empirical Softw. Eng. Meas., 2009, pp. 545-553.
[25] N. Nagappan, T. Ball, and B. Murphy, "Using historical in-process
[35] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, "A general software defect-proneness prediction framework," IEEE Trans. Softw. Eng., vol. 37, no. 3, pp. 356-370, May/Jun. 2011.
[36] D. Fisher, L. Xu, and N. Zard, "Ordering effects in clustering," in Proc. Int. Workshop Mach. Learning, 1992, pp. 163-168.
[37] L. Breiman, "Random forests," Mach. Learning, vol. 45, no. 1, pp. 5-32, 2001.
[38] D. W. Hosmer Jr., S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression, 3rd ed. New York, NY, USA: Wiley, 2013.
[39] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning. Cambridge, MA, USA: MIT Press, 2006.
[40] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou, "Sample-based software defect prediction with active and semi-supervised learning," Automated Softw. Eng., vol. 19, pp. 201-230, 2012.
[41] H. Lu, B. Cukic, and M. Culp, "Software defect prediction using semi-supervised learning with dimension reduction," in Proc. Int. Conf. Automated Softw. Eng., 2012, pp. 314-317.
[42] M. Li and Z.-H. Zhou, "Improve computer-aided diagnosis with machine learning techniques using undiagnosed samples," IEEE Trans. Syst., Man, Cybern., Part A: Syst. Humans, vol. 37, no. 6, pp. 1088-1098, Nov. 2007.
[43] Z.-H. Zhou, "When semi-supervised learning meets ensemble learning," in Proc. Int. Workshop Multiple Classifier Syst., 2009, pp. 529-538.
[44] Chord: A versatile platform for program analysis. (2011). Tutorial at ACM Conf. Programming Language Design and Implementation. [Online]. Available: http://pag.gatech.edu/chord
[45] F. Yamaguchi, M. Lottmann, and K. Rieck, "Generalized vulnerability extrapolation using abstract syntax trees," in Proc. Annu. Comput. Security Appl. Conf., 2012, pp. 359-368.
[46] F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck, "Chucky: Exposing missing checks in source code for vulnerability discovery," in Proc. ACM SIGSAC Conf. Comput. Commun. Security, 2013, pp. 499-510.
[47] PHP Security. (2013). [Online]. Available: http://www.php.net/manual/en/security.php
[48] H. He, Y. Bai, E. A. Garcia, and S. Li, "ADASYN: Adaptive synthetic sampling approach for imbalanced learning," in Proc. Int. Joint Conf. Neural Netw., 2008, pp. 1322-1328.
[49] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263-1284, Sep. 2009.
[50] M. A. Hall, "Correlation-based feature selection for machine learning," Ph.D. thesis, Dept. Comput. Sci., Univ. Waikato, Hamilton, New Zealand, 1998.
[51] PHP Top 5. (2014). [Online]. Available: https://www.owasp.org/index.php/PHP_Top_5

Lwin Khin Shar received the PhD degree in electrical and electronic engineering from the
and product metrics for early estimation of software failures,” Nanyang Technological University of Singapore.
in Proc. Int. Symp. Softw. Rel. Eng., 2006, pp. 62–74. He is a research associate in software verification
[26] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting and validation at the SnT centre for Security, Reli-
vulnerable software components,” in Proc. ACM Conf. Comput. ability, and Trust, University of Luxembourg. His
Commun. Security, 2007, pp. 529–540. research interests include software security and
[27] X. Fu and C.-C. Li, “A string constraint solver for detecting web privacy analysis using program analysis and
application vulnerability,” in Proc. Int. Conf. Softw. Eng. Knowl. machine learning techniques. He is a member of
Eng., 2010, pp. 535–542. the IEEE.
[28] G. Wassermann and Z. Su, “Sound and precise analysis of web
applications for injection vulnerabilities,” in Proc. ACM SIGPLAN
Conf. Program. Language Des. Implementation, 2007, pp. 32–41.
[29] L. K. Shar and H. B. K. Tan, “Automated removal of cross site
scripting vulnerabilities in web applications,” Inf. Softw. Technol.,
vol. 54, no. 5, pp. 467–478, 2012.
[30] K.-K. Ma, K. Y. Phang, J. S. Foster, and M. Hicks, “Directed sym-
bolic execution,” in Proc. Int. Conf. Static Anal., 2011, pp. 95–111.
[31] M. Weiser, “Program slicing,” in Proc. Int. Conf. Softw. Eng., 1981,
pp. 439–449.
[32] S. Horwitz, T. Reps, and D. Binkley, “Interprocedural slicing
using dependence graphs,” ACM Trans. Program. Languages Syst.,
vol. 12, no. 1, pp. 26–61, 1990.
[33] L. K. Shar, H. B. K. Tan, and L. C. Briand, “Mining SQL injection
and cross site scripting vulnerabilities using hybrid program ana-
lysis,” in Proc. Int. Conf. Softw. Eng., 2013, pp. 642–651.
[34] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code
attributes to learn defect predictors,” IEEE Trans. Softw. Eng.,
vol. 33, no. 1, pp. 2–13, Jan. 2007.
Lionel C. Briand is a full professor and a vice-director of the Interdisciplinary Centre for ICT Security, Reliability, and Trust (SnT), University of Luxembourg. He was granted the IEEE Computer Society Harlan Mills Award in 2012 for contributions to model-based verification and testing, and was elected Reliability Engineer of the Year (2013) by the IEEE Reliability Society. His research interests include software testing and verification, model-driven engineering, quality assurance and control, and applications of machine learning and evolutionary computation to software engineering. He is a fellow of the IEEE (2010) and a Canadian professional engineer (P. Eng.) registered in Ontario, Canada.

Hee Beng Kuan Tan received the PhD degree in computer science from the National University of Singapore. He is an associate professor in the Division of Information Engineering in the School of Electrical and Electronic Engineering, Nanyang Technological University. He had 13 years of experience in the IT industry before moving to academia. He was also a lecturer in the Department of Information Systems and Computer Science at the National University of Singapore. His research interests include software testing and analysis, software security, and software size estimation. He is a senior member of the IEEE and a member of the ACM.