
Singapore Management University

Institutional Knowledge at Singapore Management University

Research Collection School Of Computing and Information Systems
School of Computing and Information Systems

11-2014

Web application vulnerability prediction using hybrid program analysis and machine learning
Lwin Khin SHAR
Singapore Management University, [email protected]

Lionel BRIAND

Hee Beng Kuan TAN

Follow this and additional works at: https://ink.library.smu.edu.sg/sis_research

Part of the Information Security Commons, and the Programming Languages and Compilers
Commons

Citation
SHAR, Lwin Khin; BRIAND, Lionel; and TAN, Hee Beng Kuan. Web application vulnerability prediction using
hybrid program analysis and machine learning. (2014). IEEE Transactions on Dependable and Secure
Computing. 12, (6), 688-707.
Available at: https://ink.library.smu.edu.sg/sis_research/4895

This Journal Article is brought to you for free and open access by the School of Computing and Information
Systems at Institutional Knowledge at Singapore Management University. It has been accepted for inclusion in
Research Collection School Of Computing and Information Systems by an authorized administrator of Institutional
Knowledge at Singapore Management University. For more information, please email [email protected].

Web Application Vulnerability Prediction Using Hybrid Program Analysis and Machine Learning

Lwin Khin Shar, Member, IEEE, Lionel C. Briand, Fellow, IEEE, and Hee Beng Kuan Tan, Senior Member, IEEE

Abstract—Due to limited time and resources, web software engineers need support in identifying vulnerable code. A practical
approach to predicting vulnerable code would enable them to prioritize security auditing efforts. In this paper, we propose using a set of
hybrid (static + dynamic) code attributes that characterize input validation and input sanitization code patterns and are expected to be
significant indicators of web application vulnerabilities. Because static and dynamic program analyses complement each other, both
techniques are used to extract the proposed attributes in an accurate and scalable way. Current vulnerability prediction techniques rely
on the availability of data labeled with vulnerability information for training. For many real world applications, past vulnerability data is
often not available or at least not complete. Hence, to address both situations where labeled past data is fully available or not, we apply
both supervised and semi-supervised learning when building vulnerability predictors based on hybrid code attributes. Given that
semi-supervised learning is entirely unexplored in this domain, we describe how to use this learning scheme effectively for vulnerability
prediction. We performed empirical case studies on seven open source projects where we built and evaluated supervised and
semi-supervised models. When cross validated with fully available labeled data, the supervised models achieve an average of 77
percent recall and 5 percent probability of false alarm for predicting SQL injection, cross site scripting, remote code execution and file
inclusion vulnerabilities. With a low amount of labeled data, when compared to the supervised model, the semi-supervised model
showed an average improvement of 24 percent higher recall and 3 percent lower probability of false alarm, thus suggesting
semi-supervised learning may be a preferable solution for many real world applications where vulnerability data is missing.

Index Terms—Vulnerability prediction, security measures, input validation and sanitization, program analysis, empirical study

1 INTRODUCTION

Web applications play an important role in many of our daily activities such as social networking, email, banking, shopping, registrations, and so on. As web software is also highly accessible, web application vulnerabilities arguably have greater impact than vulnerabilities in other types of software. Web developers are directly responsible for the security of web applications. Unfortunately, they often have limited time to follow up with new arising security issues and are often not provided with adequate security training to become aware of state-of-the-art web security techniques.

According to OWASP's Top 10 Project [1], SQL injection (SQLI), cross site scripting (XSS), remote code execution (RCE), and file inclusion (FI) are among the most common and serious web application vulnerabilities threatening the privacy and security of both clients and applications nowadays. To address these security threats, many web vulnerability detection approaches, such as static taint analysis, dynamic taint analysis, model checking, symbolic and concolic testing, have been proposed. Static taint analysis approaches are scalable in general but are ineffective in practice due to high false positive rates [11], [21]. Dynamic taint analysis [11], model checking [22], symbolic [27] and concolic [21] testing techniques can be highly accurate as they are able to generate real attack values, but they have scalability issues for large systems due to the path explosion problem [30]. There are also scalable vulnerability prediction approaches such as Shin et al. [23]. But the granularity of current prediction approaches is coarse-grained: they identify vulnerabilities at the level of software modules or components. Hence, alternative or complementary vulnerability detection solutions that are scalable, accurate, and fine-grained would be beneficial to web developers.

From the perspective of web developers, input validation and input sanitization are two secure coding techniques that they can adopt to protect their programs from such common vulnerabilities. Input validation typically checks an input against required properties like data length, range, type, and sign. Input sanitization, in general, cleanses an input string by accepting only pre-defined characters and rejecting others, including characters with special meaning to the interpreter under consideration. Intuitively, an application is vulnerable if the developers failed to implement these techniques correctly or to a sufficient degree.

• L.K. Shar and L.C. Briand are with the Interdisciplinary Centre for ICT Security, Reliability and Trust, University of Luxembourg, 4 rue Alphonse Weicker, L-2721, Luxembourg. E-mail: {lwinkhin.shar, lionel.briand}@uni.lu.
• H.B.K. Tan is with the Department of Information Engineering, School of Electrical & Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798. E-mail: [email protected].
Manuscript received 24 Oct. 2013; revised 24 Aug. 2014; accepted 22 Oct. 2014. Date of publication 19 Nov. 2014; date of current version 13 Nov. 2015. Digital Object Identifier no. 10.1109/TDSC.2014.2373377

Hence, from the above observation, in this paper we hypothesize that code attributes that characterize validation and sanitization code implemented in the program could be used to predict web application vulnerabilities. Based on this hypothesis, we propose a set of code attributes called input validation and sanitization (IVS) attributes from which we build vulnerability predictors that are fine-grained, accurate, and scalable. The approach is fine-grained because it identifies vulnerabilities at program statement levels. We use both static and dynamic program analysis techniques to extract IVS attributes. Static analysis can help assess general properties of a program. Yet, dynamic analysis can focus on more specific code characteristics that are complementary to the information obtained with static analysis. We use dynamic analysis only to infer the possible types of input validation and sanitization code, rather than to precisely prove their correctness, and apply machine learning on these inferences for vulnerability prediction. Therefore, we mitigate the scalability issue typically associated with dynamic analysis. Thus, our proposed IVS attributes reflect relevant properties of the implementations of input validation and input sanitization methods in web programs and are expected to help predict vulnerabilities in an accurate and scalable manner. Furthermore, we use both supervised learning and semi-supervised learning methods to build vulnerability predictors from IVS attributes, such that our method can also be used in contexts where there is limited vulnerability data for training.

This work is an extension of our previous work [33], which is a pattern mining approach based on static and dynamic analyses that classifies input validation and sanitization functions through the systematic extraction of their security-related properties. The extraction is based on static property inference and analysis of dynamic execution traces. The enhancements and additional contributions of this paper are as follows:

• In our previous work, which only targeted SQLI and XSS vulnerabilities, we stated that the proposed method could be adapted to other, similar types of vulnerabilities. In this paper, we address two more frequent types of vulnerabilities, namely remote code execution and file inclusion vulnerabilities. Hence, we propose additional attributes to mine the code patterns associated with these new types of vulnerabilities.

• We had only made use of data dependency graphs to identify input validation and sanitization methods. But some of these methods, such as input condition checks that ensure inputs are valid, are often implemented through predicates and may be identified from control dependency graphs. Therefore, in this work, to better identify those methods, we leverage control dependency information.

• We propose static slicing and dynamic execution techniques that effectively mine both data dependency and control dependency information and describe the techniques in detail.

• We modified our prototype tool, PhpMiner, to mine the control dependency information and to extract additional attributes.

• We explore the use of semi-supervised learning schemes. To the best of our knowledge, we are the first to build vulnerability prediction models that way, which makes such models more widely applicable.

• We conducted two sets of experiments on a set of open source PHP applications of various sizes using PhpMiner. First, we evaluated supervised learning models built from IVS attributes. Based on cross validation, the model achieves 77 percent recall and 5 percent probability of false alarm, on average over 15 datasets, across SQLI, XSS, RCE, and FI vulnerabilities. From a practical standpoint, the results show that our approach detects many of the above common vulnerabilities at a very small cost (low false alarm rate), which is very promising considering that the existing approaches either report many false warnings or miss many vulnerabilities.

• Second, we compared supervised and semi-supervised learning models with a low sampling rate of 20 percent (i.e., only 20 percent of the available training data are labeled with vulnerability information). On average, the supervised model achieves 47 percent recall and 8 percent probability of false alarm, whereas the semi-supervised model achieves 71 percent recall and 5 percent probability of false alarm. However, when compared to the supervised model based on complete vulnerability data, on average, the semi-supervised model achieves the same probability of false alarm but a 6 percent lower recall. Therefore, our results suggest that when sufficient vulnerability data is available for training, a supervised model should be favored. On the other hand, when the available vulnerability data is limited, a semi-supervised model is probably a better alternative.

The outline of the paper is as follows. Section 2 provides background information. Section 3 presents our classification scheme that characterizes input validation and sanitization methods. Section 4 describes our vulnerability prediction framework. Section 5 evaluates our vulnerability predictors. Section 6 discusses related work. Section 7 concludes our study.

2 BACKGROUND

This paper targets SQLI, XSS, RCE, and FI vulnerabilities. These security risks, if exploited, could lead to serious issues such as disclosure of confidential, sensitive information, integrity violation, denial of service, loss of commercial confidence and customer trust, and threats to the continuity of business operations. According to CVE [6], 55,504 vulnerabilities were found in web applications within 1999-2013. Among them, 34 percent belong to RCE, 13.2 percent to XSS, 10.3 percent to SQLI, and 3.8 percent to FI. Thus, these four common vulnerabilities are responsible for 61.3 percent of the total number of vulnerabilities found. All these types of vulnerabilities are caused by potential weaknesses in web applications regarding the way they handle user inputs. They are briefly described using PHP code examples in the following.

2.1 SQL Injection
SQLI vulnerabilities occur when user input is used in database queries without proper checks. This allows attackers to trick the query interpreter into executing unintended commands or accessing unauthorized data.

Consider the following code:

mysql_query("SELECT * FROM user WHERE uid = '".$_GET['id']."'");

As the validity of the input parameter $_GET is not checked, an SQLI attack can be conducted by providing the parameter id with the following value:

/login.php?id=xxx'+OR+'1'%3D'1

The query becomes SELECT * FROM user WHERE uid = 'xxx' OR '1' = '1'. Effectively, the attack changes the semantics of the query to SELECT * FROM user, which provides the attacker with unauthorized access to the user table. Like mysql_query, any other language built-in functions such as mysql_execute that interact with the database can cause SQLI.

2.2 Cross Site Scripting
XSS flaws arise when the user input is used in HTML output statements without proper validation or escaping. It allows attackers to execute scripts in the victim's browser, which can hijack user sessions, deface web sites, or redirect the user to malicious sites. Consider the following code:

echo Welcome. $_GET['new_user'];

Similar to the above SQLI example, as the input parameter $_GET is not checked, an XSS attack can be conducted by providing the parameter new_user with the following value:

<script>alert(document.cookie);</script>

When the victim's browser executes the script sent by the server, it shows the new user's cookie values instead of the intended user information. Using a more malicious script, a redirection to the attacker's server is also possible and sensitive user information could be redirected. Like echo, any other language constructs or functions such as print that generate HTML output could cause XSS.

2.3 Remote Code Execution
RCE vulnerability refers to an attacker's ability to execute arbitrary program code on a target server. It is caused by user inputs in security sensitive functions such as file system calls (e.g., fwrite), code execution functions (e.g., eval), command execution functions (e.g., system), and directory creating functions (e.g., mkdir). It allows a remote attacker to execute arbitrary code in the system with administrator privileges. It is an extremely risky vulnerability, which can expose a web site to different attacks, ranging from malicious deletion of data to web page defacing. The following code depicts an RCE vulnerability:

$comments = $_POST['comments'];
$log = fopen('comments.php', 'a');
fwrite($log, '<br />'.'<br />'.'<center>'.'Comments::'.'<br />'.$comments);

The above code retrieves user comments and logs them without sanitization. This means that an attacker can execute malicious requests, ranging from simple information gathering using phpinfo() to complex attacks that obtain a shell on the vulnerable server using shell_exec(). Other sensitive PHP functions and operations associated with this vulnerability type include header, preg_replace() with the "/e" modifier on, fopen, $_GET['func_name'], $_GET['argument'], assert, create_function, and unserialize.

2.4 File Inclusion
FI vulnerability refers to an attacker's ability to include a file that originates from a remote (possibly an attacker's) server or to access/include a local file that is not intended to be accessed without proper authorization. It is caused by user inputs being part of filenames or by the use of un-initialized variables in file operations. Consider the following code:

include($_GET['file']);

An attacker may conduct a file inclusion attack using the following value:

/include.php?file=http://evil.com/malicious.php

This attack causes the vulnerable PHP program to include and execute a malicious PHP file that may cause dangerous program behaviors. Similar PHP commands that may cause FI vulnerability include include_once and require. Moreover, an FI vulnerability may also appear with PHP operations that involve file accesses and file operations, in which the attacker may be able to view restricted files, or even execute malicious commands on the web server that can lead to a full compromise of the system. For example, consider the following code:

$handle = fopen($_GET['newPath'], "r");

In the above case, the input newPath is received from the HTTP GET parameter. An attacker could provide a value like newPath = "../../../../../etc/passwd%00.txt" in order to access the password file from the file system. The expression 'dot-dot-slash (../)' instructs the system to go one directory up. The attacker has to guess how many directories he has to go up to find the user confidential folder on the system, but this can be easily done by trial and error. Note that this vulnerability is known as directory traversal, but we group this vulnerability together with FI as it can also be seen as a local file inclusion.
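For contrast with the vulnerable snippets above, the following sketch (ours, not taken from the paper) shows how the same kinds of sinks are commonly guarded with validation and sanitization; the database credentials and the file whitelist are placeholder values.

<?php
// Illustrative hardened counterparts of the Section 2 examples (our sketch).
$mysqli = new mysqli('localhost', 'user', 'pass', 'app');     // placeholder credentials

// SQLI: validate the numeric input and use a parameterized query.
$id = isset($_GET['id']) ? $_GET['id'] : '';
if (is_numeric($id)) {
    $stmt = $mysqli->prepare('SELECT * FROM user WHERE uid = ?');
    $stmt->bind_param('i', $id);
    $stmt->execute();
}

// XSS: escape HTML meta-characters before the value reaches the HTML output sink.
$new_user = isset($_GET['new_user']) ? $_GET['new_user'] : '';
echo 'Welcome ' . htmlspecialchars($new_user, ENT_QUOTES);

// FI/RCE: strip directory components and only include whitelisted files.
$allowed = array('home.php', 'about.php');                    // placeholder whitelist
$file = basename(isset($_GET['file']) ? $_GET['file'] : '');
if (in_array($file, $allowed, true)) {
    include $file;
}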

3 CLASSIFICATION SCHEME

Before presenting our proposed approach, in this section we first describe the IVS attributes (listed in Table 1) on which vulnerability predictors shall be built. Basically, these attributes characterize various types of program functions and operations that are commonly used (collected from various sources like [1], [17], [18]) as input validation and sanitization procedures to defend against web application vulnerabilities. Using these attributes, functions and operations are classified according to their security-related properties (i.e., the type of validation and sanitization effects these functions and operations may enforce on the inputs being processed). For example, the PHP function str_replace('<', ' ', $input) removes HTML tags from the input. Since the presence of HTML tags in $input could cause XSS, the function has a security property that filters HTML tags and prevents XSS.

In Table 1, static analysis-based attributes are attributes to be extracted using static analysis alone. Hybrid analysis-based attributes are attributes to be extracted combining static analysis and dynamic analysis. The term 'filter' in Table 1 indicates a validation or sanitization process that allows only valid strings or that performs character removal, replacement, or escaping. All these attributes are numeric (positive integers) and are presented next.

TABLE 1
Input Validation and Sanitization Attributes

ID  Name                    Description
Static analysis-based attributes
1   Client                  Input accessed from HTTP request parameters such as HTTP GET
2   File                    Input accessed from files such as Cookies, XML
3   Text-database           Text-based input accessed from database
4   Numeric-database        Numeric-based input accessed from database
5   Session                 Input accessed from persistent data objects such as HTTP Session
6   Uninit                  Un-initialized program variable
7   Un-taint                Function that returns predefined information or information not influenced by external users
8   Known-vuln-user         Custom function that has caused security issues in the past
9   Known-vuln-std          Language built-in function that has caused security issues in the past
10  Propagate               Function or operation that propagates partial or complete value of a string
Hybrid analysis-based attributes
11  Numeric                 Function or operation that converts a string into a numeric
12  DB-operator             Function that filters query operators such as (=)
13  DB-comment-delimiter    Function that filters query comment delimiters such as (--)
14  DB-special              Function that filters other database special characters different from the above, such as (\x00) and (\x1a)
15  String-delimiter        Function that filters string delimiters such as (') and (")
16  Lang-comment-delimiter  Function that filters programming language comment delimiter characters such as (/)
17  Other-delimiter         Function that filters other delimiters different from the above delimiters, such as (#)
18  Script-tag              Function that filters dynamic client script tags such as (<script>)
19  HTML-tag                Function that filters static client script tags such as (<div>)
20  Event-handler           Function that disallows the use of inputs as the values of client side event handlers such as (onload=)
21  Null-byte               Function that filters null byte (%00)
22  Dot                     Function that filters dot (.)
23  DotDotSlash             Function that filters dot-dot-slash (../) sequences
24  Backslash               Function that filters backslash (\)
25  Slash                   Function that filters slash (/)
26  Newline                 Function that filters newline (\n)
27  Colon                   Function that filters colon (:) or semi-colon (;)
28  Other-special           Function that filters any other special characters different from the above special characters, such as parenthesis
29  Encode                  Function that encodes a string into a different format
30  Canonicalize            Function that converts a string into its most standard, simplest form
31  Path                    Function that filters directory paths or URLs
32  Limit-length            Function or operation that limits a string to a specific length
Dependent attribute
33  Vuln?                   Indicates the class label: Vulnerable or Not-Vulnerable
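To make the scheme concrete, a classification database of the kind described in Section 3.1 below could be sketched as a simple map from function names to Table 1 attribute names. The mappings shown are our own illustration; PhpMiner's actual database is not reproduced in this excerpt.

<?php
// Illustrative pre-classification of a few PHP functions (our sketch).
$classification = array(
    'is_numeric'               => array('Numeric'),
    'mysql_real_escape_string' => array('String-delimiter', 'DB-special'),
    'htmlspecialchars'         => array('Script-tag', 'HTML-tag', 'String-delimiter'),
    'urlencode'                => array('Encode'),
    'realpath'                 => array('Canonicalize'),
    'substr'                   => array('Propagate'),
    'mysql_num_rows'           => array('Un-taint'),
);

// Lookup used when a node invokes a known function during static classification.
function classify_function($name, array $db) {
    return isset($db[$name]) ? $db[$name] : array();
}

print_r(classify_function('htmlspecialchars', $classification));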

3.1 Static Analysis-Based Classification
Attributes 1-10 in Table 1 characterize the functions and the program operations to be classified by static analysis only. The first six attributes in Table 1 characterize the classification of user inputs depending on the nature of their sources. The reason for including input sources in our classification scheme is that most of the common vulnerabilities arise from the misidentification of inputs. That is, developers may implement adequate input validation and sanitization methods, yet they may fail to recognize all the data that could be manipulated by external users, thereby missing some of the inputs for validation. Therefore, in security analysis, it is important to first identify all the input sources.

The reason for classifying the inputs into different types is that each class of inputs causes different types of vulnerabilities and different security defense schemes may be required to secure these different classes of inputs. For example, Client inputs like HTTP GET parameters should always be sanitized before being used in sinks, whereas it may not be necessary to sanitize Database inputs if they have been sanitized prior to their storage (double sanitization might cause security problems depending on the context). Uninit variables are variables that are un-initialized at the point of their usage, which could cause security problems (e.g., an attacker could inject malicious values in HTTP parameters having the same name as un-initialized variables by enabling the register_globals parameter in PHP configuration files). The reason for having two types of Database inputs, Text-database (string-type data) and Numeric-database (numeric-type data), is to reflect the fact that string-type data retrieved from data stores can cause second order security attacks such as second order SQLI and stored XSS, while it is difficult to conduct those attacks with numeric-type data.
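For concreteness, the fragment below (ours; all variable and key names are made up) shows typical PHP expressions that would fall into these input-source classes:

<?php
// Illustrative input sources (our sketch) mapped to the source classes of Table 1.
session_start();

$client = isset($_GET['id']) ? $_GET['id'] : '';               // Client: HTTP request parameter
$cookie = isset($_COOKIE['prefs']) ? $_COOKIE['prefs'] : '';   // File: cookie/XML-style input
$role   = isset($_SESSION['role']) ? $_SESSION['role'] : '';   // Session: persistent data object

// Text-database vs. Numeric-database: string data fetched back from a data store can
// still carry second-order attacks, while numeric data cannot.
$row     = array('comment' => 'stored text', 'age' => '42');   // stands in for a fetched row
$text_db = $row['comment'];                                    // Text-database
$num_db  = (int) $row['age'];                                  // Numeric-database

// echo $page;  // Uninit: $page would be used here without ever being initialized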

Un-taint refers to functions or operations that return information not extracted from the input string (e.g., mysql_num_rows). It also corresponds to functions or logic operations that return a Boolean value. The reason for this attribute is that, since the outcome values are not obtained from an input, the taint information flow stops at those functions and operations and thus, a sink would not be vulnerable from using those values.

Known-vulnerable-user corresponds to a class of custom functions that have caused security issues in the past. Known-vulnerable-std characterizes a class of language built-in functions that have caused security issues in the past. For example, according to vulnerability report CVE-2013-3238 [6], the preg_replace function with the "/e" modifier enabled has caused security issues. These functions are to be predefined by users based on their experiences or the information obtained from security databases (we referred to CVE [6] and PHP security [47]).

Clearly, in Sk, there would also be functions and operations that do not serve any security purpose. They may simply propagate the input. Consequently, we use the attribute Propagate to characterize functions and operations (e.g., substring, concatenation) that do not serve any security purpose and that simply propagate (part of) the input.

Since the above functions and operations either have clear definitions with respect to security requirements or are associated with known vulnerability issues, they could be predefined in a database and classifications can be made statically. This database can be expanded as and when new vulnerability analysis information is available.

3.2 Hybrid Analysis-Based Classification
Attributes 11-32 listed in Table 1 characterize the functions to be classified by either static or dynamic analysis. This hybrid analysis-based classification is applied for validation and sanitization methods implemented using both standard security functions (i.e., language built-in or custom functions with known and tested security properties) and non-standard security functions. If there are only standard security functions to be classified, we classify them based on their security-related information (static analysis); otherwise, we use dynamic analysis.

In a program, various input validation and sanitization processes may be implemented using language built-in functions and/or custom functions. Since inputs to web applications are naturally strings, string replacement/matching functions or string manipulation procedures like escaping are generally used to implement custom input validation and sanitization procedures. A good security function generally consists of a set of string functions that accept safe strings or reject unsafe strings.

These functions are clearly important indicators of vulnerabilities, but we need to analyze the purpose of each validation and sanitization function since different defense methods are generally required to prevent different types of vulnerabilities. For example, to prevent SQLI vulnerabilities, escaping characters that have special meaning to SQL parsers is required, whereas escaping characters that have special meaning to client script interpreters is needed to prevent XSS vulnerabilities. Thus, it is important to classify these methods implemented in a program path into different types because, together with their associated vulnerability data, our vulnerability predictors can learn this information and then predict future vulnerabilities.

In Table 1, the attribute Numeric relates to 1) numeric-type casting built-in functions or operations (e.g., $a = (double) $b/$c); 2) language built-in numeric type checking functions (e.g., is_numeric); and 3) custom functions that return only numeric, mathematic, and/or dash '-' characters (e.g., functions that validate inputs such as mathematic equations, postal codes, or credit card numbers). When an input to be used in a sink is supposed to be a numeric type, the sink can be made safe from this input through such functions or operations because various alphabetic characters are typically required to conduct security attacks.

DB-operator, DB-comment-delimiter, and DB-special basically reflect functions that filter sequences of characters that have special meaning to a database query parser. For example, mysql_real_escape_string is one such built-in function provided by PHP. Clearly, these attributes could predict SQLI vulnerability.

String-delimiter reflects functions that filter single quote (') and double quote (") characters. Lang-comment-delimiter reflects functions that filter comment delimiters such as (/) that are significant to script interpreters such as JavaScript. Other-delimiter reflects functions that filter any other comment delimiters such as (#). All these attributes could be significant vulnerability indicators because such characters could disrupt the syntax of intended HTML documents, SQL queries, etc.

Script-tag reflects functions that filter sequences of characters which could invoke dynamic script interpreters such as JavaScript, Flash, and Silverlight. HTML-tag reflects functions that filter sequences of special characters such as <body>, which have special meaning to the static HTML interpreter. Since Script-tag and HTML-tag filter special characters that may cause XSS, these attributes could predict XSS vulnerability. Event-handler reflects functions that disallow the use of inputs as values of event handlers (e.g., onload) or other dangerous HTML attributes (e.g., src). Inputs used as the values of event handlers can easily cause XSS. For example, consider the following code:

<img src='$user_input'>

If a malicious value, such as http://hackersite.org/xss.js, is assigned to $user_input, XSS arises. Since the exploit does not necessarily use special characters like <script, filtering special characters is insufficient to prevent XSS. Instead, in such cases, only Event-handler type functions can safely prevent XSS. Hence, the Event-handler attribute could predict XSS flaws.
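As an illustration of the kind of function this attribute counts, the sketch below (ours, not from the paper) only lets the input reach the src position when it matches a plain local file name, falling back to a trusted constant otherwise:

<?php
// Illustrative Event-handler/Path-style filter (our sketch).
function safe_img_src($input) {
    if (preg_match('/^[A-Za-z0-9_\-]+\.(png|jpg|gif)$/', $input)) {
        return $input;              // plain relative file name: acceptable as a src value
    }
    return 'default.png';           // reject URLs, javascript: values, event-handler payloads
}

$img = isset($_GET['img']) ? $_GET['img'] : '';
echo "<img src='" . safe_img_src($img) . "'>";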

Null-byte, Dot, DotDotSlash, Backslash, Slash, Newline, Colon, and Other-special reflect functions that filter different types of meta-characters. Filtering the Dot (.) character is important to handle unintended file extensions or double file extension cases, which may cause file inclusion attacks (see a real world example at CVE-2013-3239). Null-byte (%00) characters can be used to bypass sanitization routines and trick underlying systems into interpreting a given value incorrectly. For example, a file value like script.php%00.txt can trick a PHP program into seeing it as a non-malicious text file, but the underlying web server or the operating system may interpret it as a PHP program. Hence, it can be used to perform many security attacks such as file inclusion and remote code execution.

DotDotSlash includes "dot-dot-slash (../)" sequences. These sequences can be used to conduct local file inclusion attacks. Backslash (\) is typically used as an escape character in most escaping processes and therefore, if an input actually contains this character, it has to be escaped first to avoid confusion for the entire escaping process.

Like the above special characters, Slash, Newline, Colon, and Other-special characters could also force an interpreter to misinterpret the input data. Other-special characters include characters such as leading and trailing spaces, parenthesis, (|), (%), (_), (^), and ([). For example, the newline character (\n) could break a string into two parts where the second part could become unintended code. The characters (^) and ([) could cause a regular-expression function to misinterpret a regular expression. The character (%) used in an 'SQL-LIKE' clause could cause unintended database-record matches.

Hence, since the above meta-characters could cause unintended program behaviors and security issues, the presence or absence of functions that escape or remove those characters from the inputs could indicate vulnerabilities.

Encode reflects functions that encode an input string into a different format. An input may be properly sanitized using encoding functions. For example, in <a href='login.php?name='.urlencode($input)>, the variable $input is properly sanitized to be safely included in a sink that generates a URL reference. Inversely, Canonicalize reflects functions that transform an input string, which may have more than one possible representation, into a standard, normal form so that malicious data disguised in a different, possibly encoded, form can be detected. For example, given a disguised malicious input ./../../etc/passwd, PHP's realpath function returns the canonicalized path /etc/passwd, removing symbolic links and extra (/) characters from the input. Path reflects functions that filter directory paths or URLs (e.g., <a href='www.hack.com/hack.js'>). These functions can detect the inclusion of external or illegitimate URLs in sensitive program locations, preventing potential XSS, remote code execution, and file inclusion attacks. Limit-length reflects functions or operations that limit the length of an input string. Such functions can limit the possibilities of attacks to a certain extent since the number of malicious characters that can be used is limited.

We believe that the above attributes reflect the types of input validation and sanitization methods that are commonly used to prevent SQLI, XSS, RCE, and FI attacks. We note that our list of attributes may not be exhaustive. Users should refine and update them on a regular basis to reflect the latest vulnerability reports. As our vulnerability detection approach is based on machine learning, it is not difficult to re-train the vulnerability predictor to learn new vulnerability information.

4 VULNERABILITY PREDICTION FRAMEWORK

Fig. 1. Proposed vulnerability prediction framework.

Our vulnerability modeling is based on the observations from the analysis of many vulnerability reports in security databases such as CVE [6] and from the study of typical security defense methods. Our vulnerability prediction framework is depicted in Fig. 1. It comprises two main activities:

1) Hybrid program analysis. For each sink, a backward static program slice is computed with respect to the sink statement and the variables used in the sink. Each path in the slice is analyzed using hybrid (static and dynamic) analysis to extract its validation and sanitization effects on those variables. The path is then classified according to its input validation and sanitization effects inferred by the hybrid analysis. Classifications are captured with the IVS attributes described in Section 3.

2) Building vulnerability prediction models. We then build vulnerability prediction models from those attributes based on supervised or semi-supervised learning schemes and evaluate them using robust accuracy measures.

The details of these activities are described in the following sections.

4.1 Hybrid Program Analysis

4.1.1 Terms and Definitions Used
Our analysis is based on the control flow graph (CFG), the program dependence graph (PDG), and the system dependence graph (SDG) of a web application program. Each node in the graphs represents one source code statement. We may therefore use program statement and node interchangeably depending on the context.

A sink is a node in a CFG that uses variables defined from input sources and thus, may be vulnerable to input manipulation attacks. This allows us to predict vulnerabilities at statement levels. Input nodes are the nodes at which data from the external environment are accessed. A variable is tainted if it is defined from input nodes.
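A minimal illustration of these terms (our example, not from the paper):

<?php
$name = isset($_GET['name']) ? $_GET['name'] : ''; // input node: data enters from the environment
$msg  = 'Hello ' . $name;                          // $msg is tainted: defined from an input node
echo $msg;                                         // sink: uses a tainted variable, so it may be vulnerable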

As described earlier, the first step of our approach is to compute a backward static program slice for each sink k and the set of tainted variables used in k. According to the original definition given by Weiser [31], the backward static slice Sk with respect to slicing criterion <k, V> consists of all nodes (including predicates) in the CFG that may affect the values of V at node k, where V is a subset of variables used in k. We compute Sk using Horwitz et al.'s interprocedural slicing algorithm based on the SDG [32]. We first construct the PDG for the main method of a web application program and also construct PDGs for the methods called from the main method according to the algorithm given by Ferrante et al. [8]. We then construct the SDG. A PDG models a program procedure as a graph in which the nodes represent program statements and the edges represent data or control dependences between statements. The SDG extends the PDG by modeling interprocedural relations between the main program and its subprograms.

To illustrate, Fig. 2 shows an interprocedural slice of the sink at line 10 (denoted as S10) with respect to variable $name. Fig. 3a shows the CFG for the slice of the sink at line 7 (denoted as S7) and Fig. 3b shows the CFG for the slice of S10.

Fig. 2. Sample PHP program with custom and language built-in validation and sanitization functions.

Fig. 3. CFG of a program slice on tainted variables (a) $id and $pwd at sink 7 and (b) $name at sink 10.

4.1.2 Hybrid Analysis
Typically, a web application program accesses inputs and propagates them via tainted variables for further processing of the application's logics. These processes may often include sensitive program operations such as database updates, HTML outputs, and file accesses. If the program variables propagating the input data are not properly checked before being used in those sinks, vulnerabilities arise. Therefore, to prevent web application vulnerabilities, developers typically employ input validation and input sanitization methods along the paths propagating the inputs to the sinks. By default, inputs to web application programs are strings. As such, input validation checks and sanitization operations performed in a program are mainly based on string operations. These operations typically include language built-in validation and sanitization functions (e.g., mysql_real_escape_string), string replacement and string matching functions (e.g., str_match), and regular-expression-based string replacement and matching functions (e.g., preg_replace).

Basically, our approach attempts to answer the following research question: "Given a slice of a sink, from the types and numbers of inputs, and the types and numbers of input validation and sanitization functions identified from each path in the slice, can we predict the sink's vulnerability?"

Therefore, our objective is to infer the potential effects of validation checks and sanitization operations on tainted variables using static and dynamic analyses, and to classify those operations based on these inferences. For every path in Sk that propagates the values of tainted variables into k, we carry out this analysis.

Our hybrid (static and dynamic) analysis includes the techniques proposed by Balzarotti et al. [11]. Basically, for each function in a data flow graph, Balzarotti et al. first analyze the function's static program properties in an attempt to determine the potential sanitization effect of the function on the input. If this static analysis is likely to be imprecise, then they simulate the effect of the sanitization functions on the input by executing the code with different test inputs, containing various types of attack strings. The execution results are then passed to a test oracle, which evaluates the functions' sanitization effect by checking the presence of those attack strings.

Building on Balzarotti et al.'s work, we model the same information using IVS attributes to enable machine learning and vulnerability prediction. Another difference is that our analysis is performed on program slices rather than data flow graphs. As we discussed earlier, since input validation and sanitization can be performed using predicates, the analysis of data flow graphs may be insufficient. A detailed comparison of our work with Balzarotti et al.'s work is provided in the related work section. In the following, we explain how we made use of Balzarotti et al.'s analysis technique in our context.

Step 1. We first extract all possible paths from Sk. To avoid infinite paths, we use Balzarotti et al.'s solution, that is, to traverse each loop only once. For example, as shown in Fig. 3, S7 has only one path and S10 has two paths.

Step 2. Each extracted path Pi is classified according to the IVS attributes (Table 1). As described next, classification is performed with compulsory static analysis first, followed by optional dynamic analysis.
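Before detailing the two analyses, the sketch below (ours) illustrates the path enumeration of Step 1: it follows each CFG edge at most once per path, which is one simple way to realize the 'traverse each loop only once' rule. The graph shape used for S10 is assumed for illustration only.

<?php
// Sketch (ours) of Step 1: enumerate CFG paths from the entry node to the sink,
// taking each edge at most once per path so that loops are traversed only once.
function enumerate_paths(array $cfg, $node, $sink, array $path = array(), array $used = array()) {
    $path[] = $node;
    if ($node === $sink) {
        return array($path);
    }
    $paths = array();
    foreach ($cfg[$node] as $next) {
        $edge = $node . '->' . $next;
        if (empty($used[$edge])) {          // a loop edge is not re-taken on the same path
            $used[$edge] = true;
            $paths = array_merge($paths, enumerate_paths($cfg, $next, $sink, $path, $used));
            $used[$edge] = false;
        }
    }
    return $paths;
}

// Assumed shape for the slice S10 of Fig. 3b: 4 -> 5 -> {8, 9} -> 10.
$cfg = array(4 => array(5), 5 => array(8, 9), 8 => array(10), 9 => array(10), 10 => array());
print_r(enumerate_paths($cfg, 4, 10));      // prints the two paths reported for S10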

Compulsory static analysis. We classify each path Pi according to our classification scheme (Section 3) using static analysis first. Standard security functions and some of the language built-in functions/operations can be statically and precisely classified based on their known specific security requirements or their functional properties. We classify such functions and operations into different types according to their security-related properties and store that classification in a database. If a node n in Pi processes a tainted variable that is also used by sink k, we analyze its static properties such as the language parameters and operators used by n, and also the functions with known, specific security purposes that are invoked by n. Then, if there is any match to our predefined classifications, n is classified accordingly.

To illustrate, recall the code snippet in Fig. 2:

4 $name = $_POST['name'];
8 $name = 'Welcome '. $name;

In statement 4, $_POST is a language parameter which we can predefine as a type of input. In statement 8, the language operator that performs string concatenation (.) is used. As this operation only propagates the values of tainted variables to the next operation, we could classify (.) as a taint propagation type.

Likewise, as shown in Fig. 3, standard security functions can be identified from P1 of both sink 7 and sink 10. As sink 7 has only one path, it would only require static analysis for the whole classification process. For P1 of sink 7, we would identify two standard validation and sanitization functions that process $id and $pwd, which are also used in sink 7. These functions are:

5 if(is_numeric($id)) {
6 $pwd = mysql_real_escape_string($pwd);
7 $name = mysql_query("SELECT name FROM user WHERE id = $id AND pass = '$pwd'");

In statement 5, is_numeric() is used to validate that $id is a numeric. In statement 6, mysql_real_escape_string is used to escape MySQL database special characters. As these functions are language built-in validation and sanitization functions, we could classify them statically. We classify them into different validation and sanitization types according to their specifications.

Optional dynamic analysis. If Pi contains non-standard security functions or language built-in functions involving complex string manipulations such as preg_replace, the type or purpose of the function cannot be easily inferred using static analysis alone. In this case, we perform dynamic analysis on Pi.

We maintain a database containing different test suites. Test cases are made of various types of attack strings containing malicious characters and benign strings. Attack strings are derived from the security attack vectors provided by OWASP [1] and RSnake [10]. These two security specialists provide a comprehensive coverage of security attack vectors that could bypass various types of input validation and sanitization routines. Each test suite T is designed to test one hybrid analysis-based attribute (discussed in Section 3.2). For example, a test case <script>alert();</script> could discriminate functions that accept or reject JavaScript tags, and we would use it to test the attribute Script-tag. Our test suite contains such a test case and its variants generated using different combinations of special characters or different encoding schemes. We acknowledge that our test suite may be incomplete. However, our test-suite database can be updated and extended as and when new, sophisticated attack vectors are available.

Fig. 4. Code to be exercised to test for attribute Script-tag.

For dynamic execution and analysis, we first extract the code according to the sequence of instructions in Pi and then generate the test code Ci by instrumenting the extracted code. Each input source is replaced with a desired test input. For each test execution, the same test input is used for every input source in Pi. We also handle predicates like statement 5, in which the predicate checks a variable not used in sink 10. We want to ensure that the path under test is exercised until the sink is reached, in the presence of such irrelevant predicate checks. For example, for P2 of S10, predicate 5 does not validate the variable $name. Instead, it validates another variable used in another sink. To ensure that P2 is exactly followed, the standard solution is to solve a path condition involving the constraint of $id and find its appropriate value. But as this solution is not scalable, we simply set the predicate to be false (see Fig. 4). Or we set it to be true if P1 of S10 is to be tested. Note that our solution, which forces a predicate to be true or false, could cause our classification of Pi to be inaccurate if Pi is an infeasible path. But this is a necessary trade-off between scalability and accuracy.

However, we do not perform the above instrumentation if the predicate validates the variable used in the sink. Instead, we generate a piece of code as an alternative branch of the predicate that Pi follows. The code invokes the test oracle function with an empty string, indicating that the validation method successfully found and rejected the invalid test input. An oracle function accepts two arguments. The first argument is the final values at the sink and the second one specifies the type of test suite used. An example of such a case is provided in the following:

$id = "xx OR '1' = '1";
if(is_numeric($id)) {
  oracle("...id = $id...", 'String-delimiter');
} else {
  oracle('', 'String-delimiter');
  exit;
}

Finally, we instrument the sink such that the final values reaching the sink can be analyzed by a test oracle. The oracle function evaluates whether the malicious values contained in test input variables have been filtered in the final values of the variables. If so, Pi is classified according to the type of test case used.
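The paper's sample oracle is only available here as the Fig. 5 caption; the sketch below (ours) shows the kind of oracle just described, using simplified attack markers per test-suite type:

<?php
// Illustrative test oracle (our sketch): the first argument is the final value reaching the
// sink, the second is the test-suite type; an empty value means a predicate rejected the
// invalid input outright.
function oracle($final_value, $test_type) {
    $markers = array('Script-tag' => '<script', 'String-delimiter' => "'");  // simplified markers
    $marker  = $markers[$test_type];
    if ($final_value === '' || stripos($final_value, $marker) === false) {
        echo "$test_type: attack string filtered\n";     // the path can be classified with this attribute
    } else {
        echo "$test_type: attack string not filtered\n";
    }
}

oracle('Welcome alert();', 'Script-tag');   // the <script marker was removed => classified as Script-tag
oracle("SELECT name FROM user WHERE id = xx OR '1' = '1", 'String-delimiter');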

For each test suite T that is designed to test a hybrid attribute a, we execute the code Ci with a test input t1 from T and check if Pi can be classified as a from the execution result. If Pi cannot be classified as a, we choose a different test input t2 and repeat the process until it is classified as a (i.e., we increase the value of a by one) or all the test inputs from T have been used. This whole process is iterated for all test suites, excluding those that are irrelevant to the type of sink. For example, if the sink is a class of HTML outputs such as echo, the test suites for attributes such as DB-operator, DB-comment-delimiter, and DB-special (see Section 3.2) are irrelevant.

For our running example in Fig. 2, we have identified that P1 of S7 and P1 of S10 require only static analysis for classification. Only P2 of S10 needs to be classified using dynamic analysis. Fig. 4 shows the instrumented code snippet to test P2 of S10. Fig. 5 shows a sample test oracle function that evaluates if an execution result relates to attribute Script-tag. In the example, the oracle verifies if the input string contains the value <script. After executing the code in Fig. 4, P2 of S10 would be classified as Script-tag since the value <script has been filtered from $name before being used in the sink.

Fig. 5. A test oracle function.

4.2 Building Vulnerability Prediction Model
Many machine learning techniques can be used to build vulnerability predictors. Regardless of the specific technique used, the goal is to learn and generalize patterns in the data associated with sinks, which can then be efficiently used for predicting vulnerability for new sinks. As more sophisticated security attacks are being discovered, it is important for a vulnerability analysis approach to be able to adapt. With machine learning, it is possible to adapt to new vulnerability patterns via re-training.

4.2.1 Data Representation
Our unit of measurement, an instance in machine learning terminology, is a path in the slice of a sink, and we characterize each path with IVS attributes. The attribute values may range from zero to an upper bound that depends on the number of classified program operations or functions. Since we propose 33 IVS attributes (Table 1), each path would be represented by a 33-dimensional attribute vector. To illustrate, Fig. 6 shows the attributes for sink 7 and sink 10 extracted from the paths in their respective slices. The last column is the class attribute to be predicted, that is, whether a sink is vulnerable or not in a given path. In our case studies, this comes from existing vulnerability data.

Fig. 6. Attribute vectors (instances).

4.2.2 Data Preprocessing
Data balancing. As shown in Table 3, in most of our datasets, the proportion of vulnerable sinks to non-vulnerable ones is small. This is an imbalanced data problem and should be expected in many such vulnerability datasets. Prior studies have shown that imbalanced data can significantly affect the performance of machine learning classifiers [19], [49] because some of the data might go unlearned by the classifier due to their lack of representation, thus leading to induction rules which tend to explain the majority class data and favor its predictive accuracy. Since, for our problem, the minority class data capture the 'vulnerable' instances, we need a high predictive accuracy for this class, as missing a vulnerability is far more critical than reporting a false alarm. To address this problem, we use a sampling method called adaptive synthetic oversampling [48]. It balances the (unbalanced) data by generating synthetic, artificial data for the minority class instances, thus reducing the bias introduced by the class imbalance problem. It does not require modification of standard classifiers and thus, can be conveniently added as an additional data preprocessing step [49].

Given an imbalanced dataset ds with majority class data ds_maj and minority class data ds_min, the algorithm to generate synthetic data, given by He et al. [48], can be summarized as follows:

1) Compute the total number of instances to be generated for the minority class data: G = (ds_maj - ds_min) x b, where b in (0, 1] is the desired balance level after generating synthetic data. We use b = 1 to achieve a fully balanced dataset.

2) For each instance x_i in ds_min, the K nearest neighbors are searched in ds based on the Euclidean distance in the attribute space, and the ratio g_i is calculated as g_i = K_maj / K, where K_maj is the number of instances from the K neighbors that belong to the majority class. A high ratio value indicates that x_i is mostly surrounded by majority class instances and thus has a high risk of misclassification.

3) Normalize g_i according to g^_i = g_i / (sum of g_i over all ds_min instances) so that g^_i is a density distribution (sum of g^_i = 1).

4) Compute the number of synthetic instances that need to be generated for each minority instance x_i: g_i = g^_i x G.

5) Finally, g_i instances for each minority instance x_i are generated using the formula x_new = x_i + (x^_i - x_i) x d, where x^_i is one of the K nearest neighbors of x_i and d in (0, 1] is a random number.

Hence, the idea of adaptive synthetic oversampling is to focus on generating more synthetic data for borderline minority class instances in the attribute space that have a high risk of misclassification, rather than blindly generating new minority class instances to balance the data, which, for some minority class instances, could result in over-fitting while still under-representing the borderline instances. It ensures the adequate representation of minority class data by systematically generating synthetic data where learning is expected to be more difficult. (A simplified code sketch of this sampling procedure is given at the end of this excerpt.)

Attribute selection. Some of the IVS attributes are only relevant for a specific type of vulnerability (for example, DotDotSlash is only relevant for detecting FI vulnerability) and some attributes may be correlated. We use an attribute selection technique called correlation-based feature subset selection with a greedy stepwise backward search algorithm [50] to filter the irrelevant or redundant attributes and thus, to reduce the potential negative impact they may have on the learning process. This technique selects the best subset of attributes by performing a greedy backward search through the space of attribute subsets. It starts with a subset of attributes and deletes each attribute one by one. It then evaluates the worth of a subset of attributes by considering the individual predictive ability of each attribute along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low inter-correlation are preferred. The algorithm stops when the deletion of any remaining attribute results in a decrease in predictive accuracy.

4.2.3 Supervised Learning
Classification is a type of supervised learning method because the class label of each training instance has to be provided. In this study, we build logistic regression (LR) and RandomForest (RF) models from the proposed attributes. There are two reasons for choosing these two types of classifiers: 1) these classifiers were benchmarked as among the top classifiers in the literature [14], and 2) an LR-based predictor achieved the best result in our initial work [33] and yields results that are easy to interpret in terms of the impact of attributes on vulnerability predictions.

LR [38] is a type of statistical classification model. It can be used for predicting the outcome (class label) of a dependent attribute based on one or more predictor attributes. The probabilities describing the possible outcomes of a given instance are modeled, as a function of the predictor attributes, using a logistic function:

p(a_1, ..., a_n) = 1 / (1 + e^(-A)), with A = b_0 + b_1 a_1 + ... + b_n a_n,

where p is a conditional probability: the probability that a sink in a path is vulnerable as a function of the path's security attributes; a_1, ..., a_n are the predictor attributes that are statistically significant in terms of their association with the dependent attribute and thus, are selected by the LR modeling process; b_0 is a constant; b_i is the regression coefficient estimated using a maximum likelihood estimation method for attribute a_i.

The curve between p and any attribute a_i, assuming that all other attributes are constant, takes a flexible 'S' shape which ranges between two extreme cases:

a) When a_i is not a significant predictor of vulnerability, the curve approximates a horizontal line, that is, p does not depend on a_i.
b) When a_i strongly indicates vulnerability, the curve approximates a step function.

As such, logistic regression analysis is flexible in terms of the types of monotonic relationships it can model between the probability of vulnerability and predictor attributes.

RF [37] is an ensemble learning method for classification that consists of a collection of tree-structured classifiers. In many cases the predictive accuracy is greatly enhanced because the final prediction output comes from an ensemble of learners, rather than a single learner. Given an input sample, each tree casts a vote (classification) and the forest outputs the classification having the majority vote from the trees. At an intuitive level, the forest construction procedure is as follows:

1) Select K bootstrap samples from the training set. Bootstrapping, i.e., sampling with replacement, ensures that about one-third of the training set is left out, which can be used as a test set.

2) Fit a classification tree to each bootstrap sample, resulting in K trees. Each tree is grown to the largest extent possible without pruning.

3) Each instance i left out in the construction of the kth tree is classified by the kth tree. Due to bootstrapping, i can be classified by about one-third of the trees. Taking c to be the class that got most of the votes across these classifications, the proportion of times that c is not equal to the true class of i, averaged over all instances, is the so-called out-of-bag error estimate. This estimate can be used as an estimate of the generalization error and is used to guide the forest construction process.

4.2.4 Semi-Supervised Learning
As discussed above, for supervised learning, we use LR and RF, the latter being a type of ensemble learning method that has achieved high accuracy in the literature [14]. However, as ensemble learning works by combining individual classifiers, it typically requires significant amounts of labeled data for training. In certain industrial contexts, relevant and labeled data available for learning may be limited.

Semi-supervised methods [39] use, for training, a small amount of labeled data together with a much larger amount of unlabeled data. This method that exploits unlabeled data can enable ensemble learning when there are very few labeled data. As explained by Zhou [43], combining semi-supervised learning with ensembles has many advantages.
rity-related properties reflected through predictor attrib- Unlabeled data is exploited to help enrich labeled training
utes. Að¼ b0 þ bi ai þ    þ bn an Þ is a linear combination of n samples allowing ensemble learning: Each individual learner
698 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2015

TABLE 2
Test Subjects

Test Subject Description LOC Security Advisories


SchoolMate 1.5.4 School administration system 8,145 Vulnerability information in [21]
FaqForge 1.3.2 Document creation and management system 2,238 Bugtraq-43897
Utopia News Pro 1.1.4 News management system 5,737 Bugtraq-15027
Phorum 5.2.18 Message board system 12,324 CVE-2008-1486 CVE-2011-4561
CuteSITE 1.2.3 Content management system 11,441 CVE-2010-5024 CVE-2010-5025
PhpMyAdmin 3.4.4 MySQL database management system 44,628 From PMASA-2011-13 to PMASA-2013-4
PhpMyAdmin 3.5.0 MySQL database management system 102,491 From PMASA-2011-13 to PMASA-2013-4

is improved with unlabeled data labeled by the ensemble the use of semi-supervised learning instead of supervised
consisting of all other learners. As listed in Lu et al. [41], a learning if there are few defects reported. But no perfor-
few different types of semi-supervised methods, such as EM- mance comparison between semi-supervised learning
based, clustering-based, and disagreement-based learning, and supervised learning has yet been investigated in the
have been proposed in literature. But none of these techni- context of vulnerability prediction. This leads us to our next
ques has been explored for vulnerability prediction so far. research question.
Hence, based on these motivations, we explore the use of Question 2 (Q2). Even if the availability of vulnerability
an algorithm called CoForest, Co-trained Random Forest, data is limited, can vulnerabilities be predicted using
(CF), which applies semi-supervised learning on RF. It is a semi-supervised learning? Further, will the performance
disagreement-based, semi-supervised learner initially pro- of a semi-supervised learner be superior to that of a
posed by Li and Zhou [42]. CF uses multiple, diverse learn- supervised learner when the availability of vulnerability
ers, and combines them to exploit unlabeled data (semi- data is limited?
supervised learning), and maintains a large disagreement
between the learners to promote the learning process. 5.2 Experiment Subjects
CF is based on RF and its procedure is as follows: To evaluate the effectiveness of our vulnerability prediction
framework, we perform experiments on seven, real-world
1) Construct a random forest H with K trees with the PHP web applications, with known vulnerabilities and
available labeled data L. benchmarked for the evaluation of many vulnerability
2) For each tree k in H, repeat the following steps 3  6. detection approaches [3], [4], [21], [28]. These applications
3) Construct a new random forest Hk by removing k can be obtained from SourceForge [5]. Table 2 shows rele-
from H. vant statistics for these applications. The vulnerability infor-
4) Use Hk to label all the unlabeled data U and esti- mation can be found in security advisories such as CVE [6].
mate the labeling confidence based on the degree of Securities advisories typically report only vulnerable web
agreements on the labeling, i.e., the number of classi- pages, which is too coarse-grained for our purpose. And its
fiers that vote for the label assigned by Hk . vulnerability information can typically be traced to multiple
5) Generate a new labeled dataset L0 by combining L vulnerabilities appearing in different program statements.
with the unlabeled data labeled with the confidence Therefore, we still had to manually inspect the reported vul-
levels above a preset confidence threshold. nerable web pages and analyze the server programs to
6) Refine k with L0 . locate the vulnerable program statements.
7) Repeat the above steps 2  6 until none of the trees in For data collection, we enhanced the prototype tool
H changes. PhpMiner used in our previous work [33]. PhpMiner basically
For detail information on CF, please refer to [40] and [42]. implements the steps shown in Fig. 1. It is a fully automated
data collection tool. Given a PHP program, it generates con-
trol flow graphs, program dependence graphs, and system
5 EXPERIMENTAL EVALUATION dependence graphs of the program. It then computes back-
5.1 Research Questions ward static program slices of the sinks found in the program,
This paper aims to investigate the following two research according to the interprocedural slicing algorithm given by
questions: Horwitz et al. [32]. Then, it uses a depth-first search strategy
Question 1 (Q1). Can our proposed IVS attributes, when to extract the paths in the slices. We also implement the tech-
fed to a machine learner, accurately predict SQLI, XSS, RCE, niques discussed in Section 4 to automate the static and
and FI vulnerabilities? dynamic analysis-based classifications of the paths. For
High accuracy is expected to translate into high recall static-based classification, we classify over 330 PHP built-in
and low probability of false alarm when predicting vulner- functions and 30 PHP operators into various input valida-
abilities. Although classifiers can be effective, as discussed tion and sanitization types and store them in a database.
above, a sufficient number of instances with known vulner- As output, PhpMiner produces the attribute vectors like the
ability information is required to train a classifier (super- ones shown in Fig. 6, without the vulnerability labels, which
vised learning). As a result, in certain situations, supervised were manually tagged by us for the experiment. Our tool
learning is either infeasible or ineffective. In the context of also implements the evaluation procedures (Fig. 7) for
defect prediction, some studies [40], [41] have endorsed supervised and semi-supervised learners. For learning
SHAR ET AL.: WEB APPLICATION VULNERABILITY PREDICTION USING HYBRID PROGRAM ANALYSIS AND MACHINE LEARNING 699

TABLE 4
Data Distributions of Attributes Across Instances
Across Datasets

Attribute Mean StdDev Min Max


Client 0.43 0.80 0 17
File 0.11 0.21 0 3
Text-database 0.30 0.46 0 11
Numeric-database 0.02 0.08 0 3
Session 0.36 0.69 0 16
Uninit 0.10 0.23 0 5
Un-taint 1.08 1.35 0 30
Known-vuln-user 0.05 0.16 0 10
Known-vuln-std 0.08 0.14 0 3
Propagate 3.23 4.19 0 99
Numeric 0.10 0.32 0 8
DB-operator 0.00 0.01 0 1
DB-comment-delimiter 0.20 0.22 0 8
DB-special 0.20 0.23 0 8
Fig. 7. Prediction model evaluation procedure. String-delimiter 0.05 0.23 0 6
Lang-comment-delimiter 0.00 0.01 0 1
Other-delimiter 0.00 0.02 0 2
supervised learners, it relies on the Weka 3.7 Java package
Script-tag 0.03 0.14 0 6
with default options provided by Witten et al. [9]. For learn- HTML-tag 0.03 0.14 0 6
ing CoForest, we use the Java package from Li et al. [40]. Event-handler 0.00 0.01 0 2
Table 3 shows the datasets extracted from the test sub- Null-byte 0.01 0.07 0 5
jects by PhpMiner. The dataset name myadmin1 refers to Dot 0.01 0.05 0 2
PhpMyAdmin 3.4.4 and myadmin2 refers to PhpMyAdmin DotDotSlash 0.01 0.05 0 2
3.5.0. The rest is self-explanatory. Because we have the RCE Backslash 0.00 0.04 0 2
Slash 0.01 0.05 0 4
and FI vulnerability data available for only PhpMyAdmin Newline 0.01 0.04 0 2
systems, we used two versions of PhpMyAdmin to Colon 0.00 0.02 0 3
avoid having only one dataset for these two types of vulner- Other-special 0.01 0.06 0 2
abilities. As shown in Table 3, we extracted four different Encode 0.02 0.12 0 5
sets of datasets, each corresponding to a different type of Canonicalize 0.10 0.22 0 4
vulnerabilities. In total, we collected 15 datasets. Table 4 Path 0.00 0.04 0 2
Limit-length 0.02 0.10 0 2
shows descriptive statistics for the values of IVS attributes
extracted from those datasets. On our web site [7], we pro-
vide the implementation of PhpMiner and the datasets.
5.3 Accuracy
We assess the predictive accuracy of our models in terms of
TABLE 3 probability of detection or recall, probability of false alarm,
Datasets and precision. We can use the following contingency table
to define these standard measures.
Dataset #Instances #Vuln. instances
(a) Datasets with SQL injection vulnerabilities Actual
Vulnerable Non-Vulnerable
schmate-sqli 189 152
faqforge-sqli 42 17 Vulnerable True positive False positive
phorum-sqli 122 5 Prediction (tp) (fp)
cutesite-sqli 63 35 Non-Vulnerable False negative True negative
(fn) (tn)
(b) Datasets with cross-site scripting vulnerabilities
schmate-xss 172 138
faqforge-xss 115 53 Recall (pd ¼ tp =ðtp þ fnÞ ) measures how complete our
utopia-xss 86 17 model is in correctly predicting vulnerable sinks. Probabil-
phorum-xss 237 9 ity of false alarm (pf ¼ fp =ðfp þ tnÞ ) is generally used to
cutesite-xss 239 40
myadmin1-xss 305 20 measure the cost of using the model. Precision (pr ¼
myadmin2-xss 425 14 tp =
ðtp þ fpÞ ) measures the extent to which vulnerable sinks
(c) Datasets with remote code execution vulnerabilities are correctly predicted. Ideally, the model should neither
myadmin1-rce 221 3 miss actual vulnerabilities (pd  1) nor throw false alarms
myadmin2-rce 297 5 (pf  0; pr  1) As this is, however, difficult to achieve in
(d) Datasets with file inclusion vulnerabilities practice, our aim is to achieve the highest possible recall
myadmin1-fi 139 5 with a very low probability of false alarm. The model would
myadmin2-fi 121 2 then be very useful in our context as it would detect many
vulnerabilities at a very low cost. We prefer to focus on the
700 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2015

Fig. 8 Attributes (IDs) selected in logistic regression models.

probability of false alarm rather than precision because,


in security, there are typically few vulnerable sinks in a
dataset and thus, even a small number of false alarms could
result in low precision, though the model would actually
still be useful.

5.4 Supervised Learning Experiments


To investigate our first research question (Q1), supervised
learning experiments were conducted on the 15 datasets
described above. We use two types of supervised prediction
models, LR and RF (Section 4.2.3), to evaluate if our pro-
posed IVS attributes can accurately predict SQLI, XSS, RCE,
and FI vulnerabilities. Using logistic regression analysis, we
also discuss the relative importance of each proposed attri- Fig. 9. Frequencies of the attributes selected in logistic regression
bute in vulnerability prediction. models.

at identifying such relationships and assessing the effect of


5.4.1 Experimental Design attributes on vulnerability prediction.
We evaluate the supervised models using the procedure Logistic regression selects attributes for vulnerability
shown in Fig. 7. Each model is cross validated on each data- classification based on their statistical significance and uses
set. We follow a fivefold, standard cross validation proce- regression coefficient values to weigh the effect of attributes
dure, repeated ten times (i.e., training and testing 50 times on vulnerability prediction. For example, the following
for each model) [9]. As discussed in Section 4.2.2, oversam- shows a logistic regression model obtained through maxi-
pling is included in the procedure to address the imbal- mum likelihood estimation obtained during cross validation
anced data problem. Attributes selection is also included to on the phorum_xss dataset:
filter irrelevant, redundant, or correlated attributes. But, to 1
Vuln? ¼ :
prevent data sampling bias, oversampling and attribute 1 þ eð5:3þ9:6Clientþ7:7Uninit37:6Untaint40StringdelimiterÞ
selection is only applied to training instances. Repeating the
procedure ten times reduces possible sampling bias due to Such an equation can be informally interpreted as “a path
random splits in cross validation. The randomization also in a sink is highly likely to be vulnerable if it accesses
defends against ordering effects [36]. user inputs from input sources of Client and Uninit, but the
odds of being vulnerable decrease significantly if the path
also contains String-delimiter and Un-taint -type input vali-
5.4.2 Attribute Relevancy Analysis dation and sanitization functions.” Quantitatively, the
One major advantage of using machine learning approaches above logistic regression equation can account for non-lin-
is that it can select the most informative and significant ear relationships between the probability of vulnerability
attributes in such a way as to optimize prediction. It would and the predictor attributes.
not be straightforward to assess vulnerability by just In total, 750 LR models (50 cross validations  15 data-
inspecting attribute values in the presence of highly com- sets) were built. Fig. 8 shows the union of the attributes
plex relationships. It is also expected that different attrib- selected by the LR models during cross validation of each
utes will exhibit widely varying levels of importance in dataset. Fig. 9 shows the frequency with which they were
vulnerability prediction. Machine learning algorithms aim selected during cross validation.
SHAR ET AL.: WEB APPLICATION VULNERABILITY PREDICTION USING HYBRID PROGRAM ANALYSIS AND MACHINE LEARNING 701

Fig. 10. Cross validation results of the supervised learners—Logistic Regression (LR) and RandomForest (RF).

First, we can observe, as we expected, that frequencies Regression) in our previous work [33], for our vulnerability
vary significantly across attributes, thus showing their prediction context. We focus our discussion below on pre-
widely different importance in terms of predicting vulnera- dictive accuracy based on RF’s results.
bility. The most selected attributes include Client, Uninit, Averaging over all 15 datasets, the RF models achieved
Un-taint, and Propagate. We note that a few attributes, the result (pd ¼ 77%, pf ¼ 5%), which is better than the
namely DB-operator, Other-delimiter, Event-handler, Dot, result generally benchmarked (pd > 70%, pf < 25%) by
Colon, Other-special, and Path, were not selected at all, many prediction studies [23], [34]. This implies that our pre-
although those attributes reflect functions that could sani- diction approach detects 77 percent of the top four web
tize potentially dangerous meta-characters like (.) and (,). application vulnerabilities at the cost of filtering a few false
This is not surprising since, as observed in Table 4, the data positives. Given that, in practice, web application projects
distributions of those attributes are sparse, indicating that typically have many software modules containing many
they are not present in most instances. The attributes like sinks, and undergo many versions over a long lifespan,
Dot and Path are not relevant for most datasets since they such models can be very useful in practice to predict vulner-
are designed for detecting FI and RCE vulnerabilities and abilities in new versions based on vulnerability data from
our experiment only contains two datasets that correspond past versions.
to each of these vulnerabilities. We also manually checked For all the datasets, the RF models achieved low pf
that some of the rarely-selected attributes are actually pres- results. And for most datasets, the RF models also achieved
ent in some of our datasets, but they were not selected by high pd results. But we also note that, for a few datasets,
logistic regression as they were found to be not statistically the models achieved pd results lower than our benchmark
significant. Lastly, the overall key observation is that most pd result (pd > 70 percent). If we take myadmin2-xss as a
of the proposed attributes are selected by different models representative example, our model only achieved
with varying frequencies, suggesting that the set of pro- pd ¼ 48%, but still, achieving a very low pf (1 percent)
posed attributes reflects the various vulnerability patterns makes such a model useful in practice. Looking more
in the selected datasets. closely at the numbers, myadmin2-xss contains a total of 425
instances, including 14 vulnerable instances (Table 3).
5.4.3 Prediction Results Thus, the model catches nearly half of the vulnerabilities at
Fig. 10 shows the predictive accuracy of LR and RF models the expense of only four false warnings, which are not
learnt from IVS attributes, in terms of recall, precision, and costly for developers to filter.
probability of false alarm, based on cross validation. On Hence, to answer Q1, the supervised prediction models
average, RF performed slightly better than LR. Thus, this built from IVS attributes can predict SQLI, XSS, RCE, and FI
study allows us to recommend a better supervised learning vulnerabilities in most datasets, with a sufficient level of
scheme (RandomForest) than the one used (Logistic accuracy to be useful. And even in the few cases where the
702 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2015

Fig. 11. Cross validation results of the semi-supervised learner—CoForest (CF) and the supervised learner—RandomForest (RF) at a sampling rate
m ¼ 20%.

learnt classifiers cannot effectively detect vulnerabilities, 5.5.2 Prediction Results


given that our approach consistently achieved low pf results We compared the predictive accuracy of RF and CF at dif-
for all the datasets, we can confidently claim that detected ferent sampling rates and observed that CF clearly outper-
vulnerabilities always come at an acceptable cost. forms RF with data sampling rates below 40 percent in
terms of recall and precision. Here, we discuss the results
5.5 Semi-Supervised Learning Experiments based on accuracy with a sampling rate m ¼ 20%. Fig. 11
In this second case study, based on the same 15 datasets, shows the results.
we compare the accuracy of semi-supervised and super- On average, over all 15 datasets, even though only a
vised prediction models (CF and RF) in the presence of small amount of labeled data is used, the CF model
low amounts of labeled data. We wish to determine if the showed good accuracy (pd ¼ 71%, pf ¼ 5%, pr ¼ 75%),
semi-supervised model should be preferred for vulnerabil- thus outperforming the RF model (pd ¼ 47%, pf ¼ 8%,
ity prediction when there is limited vulnerability data pr ¼ 51%). To test for statistical significance of the differ-
available (Q2). ence between CF and RF, as suggested by Demsar [20],
we conducted one-tailed Wilcoxon signed-ranks tests on
the results. With a significance level equal to 0.01, the
5.5.1 Experimental Design tests show that CF performs better than RF in terms of all
The model evaluation procedure is similar to the one used the accuracy measures we used.
in the above supervised learning experiments (Fig. 7), Comparing with the average result (pd ¼ 77%, pf ¼ 5%,
except that the training data is now split into two—the pr ¼ 72%) of the RF models trained with fully available
labeled training set L and the unlabeled training set U. labeled data (Section 5.4.3), it is interesting to note that
From the training set, a small percentage (denoted as data the CF models achieved comparable predictive accuracy.
sampling rate m) of training data is randomly sampled as L. However, as can be observed in Figs. 10 and 11, the semi-
Like case study 1, adaptive synthetic sampling and correla- supervised learner shows larger variations in accuracy
tion-based feature subset selection is then applied to L across the 15 datasets than the supervised learner. Across
before training. The remaining training data is used as unla- the 15 datasets, the CF models’ accuracy (Fig. 11) shows a
beled training set U for the semi-supervised learner. For larger standard deviation of (pd ¼ 22%, pf ¼ 6%,
example, given a dataset containing 100 instances and pr ¼ 20%) than the RF models’ accuracy (Fig. 10) with a
m ¼ 20%, for each trial during fivefold cross validation, the standard deviation of (pd ¼ 17%, pf ¼ 6%, pr ¼ 19%). This
test set contains 20 instances, L contains 16 instances (20 per- implies that supervised learning with sufficient labeled
cent of available training samples), and U contains 64 data performs more consistently compared to semi-super-
instances. The supervised learner RF is trained on L and vised learning and thus, should be preferred when there
tested on T whereas the semi-supervised learner CoForest is sufficient labeled data available for training. On the
(CF) is trained on L and U, and tested on T. other hand, to address Q2, when labeled data is rare,
SHAR ET AL.: WEB APPLICATION VULNERABILITY PREDICTION USING HYBRID PROGRAM ANALYSIS AND MACHINE LEARNING 703

TABLE 5
Runtime Performance of PhpMiner

Test Subject Static Analysis Time (s) Dynamic Analysis Time (s) Average Learning Time (s) Total time (s)
SchoolMate 1.5.4 8,211 792 99 9,102
FaqForge 1.3.2 6,789 511 84 7,384
Utopia News Pro 1.1.4 7,699 1,250 87 9,036
Phorum 5.2.18 10,592 2,134 132 12,858
CuteSITE 1.2.3 9,205 1,977 105 11,287
PhpMyAdmin 3.4.4 17,549 3,104 141 20,794
PhpMyAdmin 3.5.0 28,700 4,570 160 33,430

semi-supervised learning should be favored to supervised applications in commercial sectors since all our test applica-
learning. tions are open source. But it is difficult to conduct experi-
ments on commercial applications since their vulnerability
data is not publicly accessible. Also, the implicit assumption
5.6 Discussion on Data Collection and Model
Learning Performance of our approach that all the application code is available for
analysis is clearly a limitation in some application contexts.
We showed above that both static analysis- and dynamic
Some (especially commercial) applications might use plug-
analysis-based attributes contribute to achieving sufficient
ins or third-party software components, which may be only
predictive accuracy for the models to be useful in practice,
known at runtime or for which the source code is unavail-
that is, for vulnerabilities to be detected at reasonable cost.
able. We also consider that a security-sensitive program
Still, it is required that we also analyze the scalability of these
operation is vulnerable if it uses an input read from an
analyses. Table 5 shows the runtime performance of
external environment with unknown security controls.
PhpMiner. Since our hybrid analysis technique is based on the
Hence, our result would be incorrect if the application is
work of Balzarotti et al. [11], the runtime performance of our
run inside a framework that provides a layer of safeguards
tool also showed similar results. That is, PhpMiner actually
that properly validate all the incoming inputs.
spent most of the time on static analysis in extracting slices
Our data only reflect the vulnerability patterns of those
and their paths. The time spent on running the test suites
that are reported in vulnerability databases. Hence, our
(dynamic analysis) was considerably less. Although the total
vulnerability predictions may not detect vulnerabilities
time taken was up to a maximum of nine hours, we believe
having different characteristics in terms of our proposed
that it is reasonable considering that some of the test subjects
attributes. But, considering the wide variability in charac-
are widely-used, real world applications and thus, our per-
teristics of the test subjects (see Table 2), our results
formance results suggest that our tool can be applicable in
should be widely applicable. It is also noteworthy that
practice. Also, while implementing our tool, performance
our underlying hybrid analysis may produce classifica-
optimization was not as much a focus as would be expected
tion errors affecting the prediction results. For example,
in an industry strength tool and there is probably significant
dynamic analysis may incorrectly flag a function as Java-
room for improvement. Average learning time in Table 5
Script tag filtering function. But since our predictors are
refers to time spent on training and testing a learner with one
learnt on past data, if the same function is causing a num-
specific setting. We did not differentiate the time spent on
ber of sinks to be vulnerable, machine learning algorithms
supervised learning and semi-supervised learning because
learn from it and the presence of such function in the pro-
the time difference between these machine learning processes
gram slices will indicate vulnerabilities.
is insignificant. It took a maximum of three minutes for train-
The use of additional or different machine learning tech-
ing and testing a learner (including 50 trials for each setting).
niques might alter our results. For data balancing, we
also tried other sampling techniques like undersampling
5.7 Threats to Validity (remove majority class data) [49], but adaptive synthetic
Our current work targets PHP web applications because the oversampling provided better results. Regarding attribute
vulnerabilities we address are very common and serious for selection, we also evaluated learners without any attribute
PHP applications [51]. However, though this is a practical selection and with different attribute selection methods such
limitation, it is possible to extend the logic presented in this as gain ratio [9]. But correlation-based method provided
paper to other programming languages. For example, to slightly better results. For supervised learning, we used two
adapt our approach to Java, the same classification schemes very different classification algorithms which are statistical-
described in this work could be used. One could predefine based and ensemble-based, respectively. We also tried other
Java built-in functions and operations to perform static anal- types of classifiers like multi-layer perceptron and C4.5 that
ysis-based classification. And to perform static and dynamic are neural network-based and tree-based, respectively. But
analyses, there are readily available Java program analysis RandomForest’s results were superior. We have not tried
tools such as Chord [44]. Furthermore, despite these neces- other algorithms for semi-supervised learning. We did not
sary adaptations to other languages, it is important to note focus our attention on fine-tuning the prediction models and
that the overall approach would be similar. therefore, better results might be obtained.
Data sampling bias is one of our threats to validity. Our Like all other empirical studies, our results are limited to
results here may not generalize well to other types of the applied machine learning processes, the test subjects,
704 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2015

and the experimental setup used. One good solution to attributes. Their work is based on the concept that software
refute, confirm, or improve our results is to replicate the components similar to known vulnerable ones, in terms of
experiments with new test subjects and probably with fur- imports and function calls, are likely to be vulnerable as
ther machine learning strategies. This can be easily done well. They achieved pd ¼ 45% and pr ¼ 70%.
since we have clearly defined our empirical methods and Yamaguchi et al. [45] and [46] use natural language
setup, and we also provide the data used in the experiments processing techniques to identify and extract API usage
and the data collection tool on our web site [7]. patterns from abstract syntax trees [45] or dependency
graphs [46], which are then represented as attributes for
machine learning. The numbers of attributes are not
6 RELATED WORK bounded. Whereas, we propose 32 code attributes, each
Our work applies machine learning for the prediction of of which is specifically designed to reflect a specific type
vulnerabilities in web applications. Hence, its related work of input validation and sanitization code pattern and
falls into three main categories: defect prediction, vulnera- thus, is an important indicator of vulnerability. Also, we
bility prediction, and vulnerability detection. use program analysis techniques—both static and
Defect prediction. In defect prediction studies, defect dynamic analyses to accurately extract those attributes for
predictors are generally built from static code attributes machine learning.
such as object-oriented design attributes [12], LOC counts, The above vulnerability prediction approaches generally
and code complexity attributes [14], [34], [35] because target software components or program functions. By con-
static attributes can be cheaply and consistently collected trast, our method targets specific program statements for
across many systems [34]. However, it was quickly real- vulnerability prediction. Another major difference is that
ized that such attributes can only provide limited accu- we use code attributes that characterize input validation
racy [13], [15], [25]. Arisholm et al. [13] and Nagappan and sanitization routines.
et al. [25] reported that process attributes (e.g., developer Shar and Tan [2], [16] predicted vulnerabilities using static
experience and fault history) could significantly improve analysis. Similar to this extension work, they classify the
prediction models. On the other hand, as process attrib- types of validation and sanitization functions implemented
utes are difficult to measure and measurements are often for the sinks and reflect those classifications on static code
inconsistent, Menzies et al. [15] showed that static code attributes. Although their supervised learners built from
attributes could still be effective if predictors are tuned to static attributes achieved good accuracies, they observed that
user-specific goals. static analysis could not precisely classify the types of some
In many real world applications, defect data is often lim- of the validation and sanitization functions. Later, Shar et al.
ited, which makes supervised learning infeasible or ineffec- [33] predicted vulnerabilities using hybrid code attributes.
tive. Li et al. [40] and Lu et al. [41] showed that semi- Dynamic analysis was incorporated into static analysis to
supervised learning can be used to address this problem improve the classification accuracy. Although these earlier
and that semi-supervised learners could also perform well works only targeted SQLI and XSS vulnerabilities, they
in software defect prediction. Li et al. [40] used the CoForest stressed that the work should be extended to address other
method, which is also used by our work. types of vulnerabilities as well. This work extends the prior
The similarity with these defect prediction studies is that ones by addressing two additional types of common vul-
our work also uses machine learning techniques in building nerabilities. We propose new attributes and analyze code
vulnerability predictors. However, the major difference is patterns related to these additional vulnerabilities. More
that our study targets security vulnerabilities in web appli- importantly, this work also introduces semi-supervised
cations. Since these studies show that existing set of attrib- learning in the domain of vulnerability prediction.
utes do not work everywhere, we define specific attributes Vulnerability detection. Jovanovic et al. [3] and Xie and
targeted at predicting vulnerabilities based on automated Aiken [4] showed that many XSS and SQLI vulnerabilities
and scalable static and dynamic analysis. can be detected by static program analysis techniques. They
Vulnerability prediction. Shin et al. [23] used code com- identify various input sources and sensitive sinks, and deter-
plexity, code churn, and developer activity attributes to pre- mine whether any input data is used in those sinks without
dict vulnerable programs. They achieved pd ¼ 80% and passing through sanity checks. Such static taint tracking
pf ¼ 25%. Their assumption was that, the more complex the approaches often generate too many false alarms as these
code, the higher the chances of vulnerability. But from our approaches cannot reason about the correctness and the ade-
observations, many of the vulnerabilities arise from simple quacy of those sanity checks. Thus, these approaches are not
code and, if a program does not employ any input valida- precise in general.
tion and sanitization routines, it would be simpler but nev- To improve precision, Fu and Li [27] and Wassermann
ertheless contain many vulnerabilities. Walden et al. [24] and Su [28] approximated the string values that may appear
investigated the correlations between security resource indi- at sensitive sinks by using symbolic execution and string
cator (SRI) and numbers of vulnerabilities in PHP web analysis techniques. More recent approaches incorporate
applications. SRI is derived from publicly available security dynamic analysis techniques such as concolic execution
information such as past vulnerabilities, secure develop- [21], and model checking [22]. These approaches reason
ment guidelines, and security implications regarding sys- about various paths in the program that lead to sensitive
tem configurations. Neuhaus et al. [26] also predicted sinks and attempt to generate test cases that are likely to be
vulnerabilities in software components from the past vul- attack vectors. All these approaches reduce false alarm
nerability information, and the imports and function calls rates. But symbolic, concolic, and model checking
SHAR ET AL.: WEB APPLICATION VULNERABILITY PREDICTION USING HYBRID PROGRAM ANALYSIS AND MACHINE LEARNING 705

techniques often lead to path explosion problem [30]. It is require constraint solving and model checking to reason
difficult to reason about all the paths in the program when about correctness as in existing dynamic techniques, e.g.,
the program contains many branches and loops. Further, concolic execution. Our analysis is also fine-grained since it
the performance of these approaches also depends very identifies vulnerabilities at the program statement level as
much on the capabilities of their underlying string con- opposed to the component level, as in existing vulnerability
straint solvers in handling a myriad of string operations prediction approaches.
offered by programming languages. Therefore, these In our experiments on seven PHP web applications, we
approaches typically suffer from scalability issues. first showed that the proposed IVS attributes can be used
Our static and dynamic analysis technique builds on to detect several types of vulnerabilities. On average, the
Balzarotti et al. [11]. But, similar to the above techniques, RandomForest models, built on IVS attributes, achieved
Balzarotti et al. apply static and dynamic analysis to deter- (pd ¼ 92%, pf ¼ 4%), (pd ¼ 72%, pf ¼ 9%), (pd ¼ 64%,
mine the correctness of custom sanitization functions identi- pf ¼ 1%), (pd ¼ 76%, pf ¼ 1%) when predicting SQL injec-
fied on data flow graphs, thus leading to scalability issues as tion, cross site scripting, remote code execution, and file
well. The difference or the contribution of our work is that inclusion vulnerabilities, respectively. We also showed
we leverage machine learning techniques to mitigate this that, when a limited number of sinks with known vulner-
scalability problem. That is, a predictor can learn correct and abilities are available for training the prediction model,
incorrect custom functions based on historical data. Though semi-supervised learning is a good alternative to super-
we apply Balzarotti et al.’s static and dynamic analysis tech- vised learning. We compared RandomForest (supervised)
nique, we do not do so to precisely compute the correctness and CoForest (semi-supervised) models with a low data
of custom functions, but rather to infer their security pur- sampling rate of 20 percent, that determine the amount
poses and apply these inferences in machine learning. As a of labeled training data. The CoForest model achieved
result, our approach also does not require string solving and (pd ¼ 71%, pf ¼ 5%), on average over 15 datasets, out-
reasoning of (potentially infinite) program paths like con- performing the RandomForest model that achieved (pd ¼
colic execution and model checking techniques. 47%, pf ¼ 8%).
However, symbolic, concolic, and model checking To generalize our current results, our experiment can be
approaches could possibly yield high vulnerability detec- easily replicated and extended as we made our tool and
tion accuracy, which may never be matched by machine data available online [7]. We also intend to conduct more
learning-based methods. Thus, our objective is not to pro- experiments with industrial applications. While we believe
vide a replacement for such techniques but rather to pro- that the proposed approach can be a useful and comple-
vide a complementary approach to combine with them and mentary solution to existing approaches, studies need to be
to use when they are not applicable. One could, for exam- carried out to determine the feasibility and usefulness of
ple, first gather vulnerability predictions on code sections integrating multiple approaches.
using machine learning and then focus on code sections
with predicted vulnerabilities using the more accurate tech- ACKNOWLEDGMENTS
niques mentioned above. Thereafter, ideally, the confirmed The authors would like to thank Hongyu Zhang [40] for
vulnerabilities should be removed by manual audits or by providing us with the Java implementation of CoForest
using automated vulnerability removal techniques such as algorithm. This work was partially supported by the
Shar and Tan [29]. National Research Fund, Luxembourg (FNR/P10/03). Lwin
Khin Shar is the corresponding author.
7 DISCUSSIONS AND CONCLUDING REMARKS
REFERENCES
The main goal of this paper is to achieve both high accuracy
and good scalability in detecting web application vulner- [1] OWASP. (2012, Jan.). The open web application security project
[Online]. Avaialble: http://www.owasp.org
abilities. In principle, our proposed approach leverages all [2] L. K. Shar and H. B. K. Tan, “Predicting SQL injection and
the advantages provided by existing static and dynamic cross site scripting vulnerabilities through mining input saniti-
taint analysis approaches and further enhances accuracy by zation patterns,” Inf. Softw. Technol., vol. 55, no. 10, pp. 1767–
1780, 2013.
using prediction models developed with machine learning [3] N. Jovanovic, C. Kruegel, and E. Kirda, “Pixy: A static analysis
techniques and based on available vulnerability informa- tool for detecting web application vulnerabilities,” in Proc. IEEE
tion. Static analysis is generally sound but tends to generate Symp. Security Privacy, 2006, pp. 258–263.
many false alarms. Dynamic analysis is precise but could [4] Y. Xie and A. Aiken, “Static detection of security vulnerabilities in
scripting languages,” in Proc. USENIX Security Symp., 2006,
miss vulnerabilities as it is difficult or impossible to exercise pp. 179–192.
every test case scenario. Our strategy consisted in building [5] (2012, Mar.). SourceForge. [Online]. Available: http://www.sour-
predictors using machine learners trained with the informa- ceforge.net
tion provided by both static and dynamic analyses and [6] (2013, May). CVE: Distributions of vulnerabilities by types
[Online]. Available: http://www.cvedetails.com/vulnerabilities-
available vulnerability information, in order to achieve by-types.php
good accuracy while meeting scalability requirements. [7] PhpMiner [Online]. Availble: http://sharlwinkhin.com/
Our static analysis only involves computing program sli- phpminer.html, 2013.
[8] J. Ferrante, K. J. Ottenstein, and J. D. Warren, “The program
ces. Dynamic analysis is only used to infer security-check- dependence graph and its use in optimization,” ACM Trans. Pro-
ing types of validation and sanitization functions and we gramm. Languages Syst., vol. 9, pp. 319–349, 1987.
use this inferred information for prediction rather than cor- [9] I. H. Witten, E. Frank, and M. A. Hall, Data Mining, 3rd ed. San
rectness analysis. This approach is scalable since it does not Mateo, CA, USA: Morgan Kaufmann, 2011.
706 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 12, NO. 6, NOVEMBER/DECEMBER 2015

[10] (2012, Mar.). RSnake [Online]. Available: http://ha.ckers.org [35] Q. Song, Z. Jia, M. Shepperd, S. Ying, and J. Liu, “A general soft-
[11] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, ware defect-proneness prediction framework,” IEEE Trans. Softw.
C. Kruegel, and G. Vigna, “Saner: Composing static and dynamic Eng., vol. 37, no. 3, pp. 356–370, May/Jun. 2011.
analysis to validate sanitization in web applications,” in Proc. [36] D. Fisher, L. Xu, and N. Zard, “Ordering effects in clustering,”
IEEE Symp. Security Privacy, 2008, pp. 387–401. in Proc. Int. Workshop Mach. Learning, 1992, pp. 163–168.
[12] L. C. Briand, J. W€ ust, J. W. Daly, and D. V. Porter, “Exploring the [37] L. Breiman, “Random forests,” Mach. Learning, vol. 45, no. 1,
relationships between design measures and software quality in pp. 5–32, 2001.
object-oriented systems,” J. Syst. Softw., vol. 51, no. 3, pp. 245–273, [38] D. W. Hosmer Jr, S. Lemeshow, and R. X. Sturdivant, Applied
2000. Logistic Regression, 3rd ed. New York, NY, USA: Wiley, 2013.
[13] E. Arisholm, L. C. Briand, and E. B. Johannessen, “A systematic [39] O. Chapelle, B. Sch€ olkopf, and A. Zien, Eds., Semi-Supervised
and comprehensive investigation of methods to build and evalu- Learning. Cambridge, MA, USA: MIT Press, 2006.
ate fault prediction models,” J. Syst. Softw., vol. 83, no. 1, pp. 2–17, [40] M. Li, H. Zhang, R. Wu, and Z.-H. Zhou, “Sample-based software
2010. defect prediction with active and semi-supervised learning,”
[14] S. Lessmann, B. Baesens, C. Mues, and S. Pietsch, “Benchmarking Automated Softw. Eng., vol. 19, pp. 201–230, 2012.
classification models for software defect prediction: a proposed [41] H. Lu, B. Cukic, and M. Culp, “Software defect prediction using
framework and novel findings,” IEEE Trans. Softw. Eng., vol. 34, semi-supervised learning with dimension reduction,” in Proc. Int.
no. 4, pp. 485–496, Jul./Aug. 2008. Conf. Automated Softw. Eng., 2012, pp. 314–317.
[15] T. Menzies, Z. Milton, B. Turhan, B. Cukic, Y. Jiang, and A. Bener, [42] M. Li and Z.-H. Zhou, “Improve computer-aided diagnosis with
“Defect prediction from static code features: current results, limi- machine learning techniques using undiagnosed samples,” IEEE
tations, new approaches,” Automated Softw. Eng., vol. 17, no. 4, Trans. Syst., Man Cyberne., Part A: Syst. Humans, vol. 37, no. 6,
pp. 375–407, 2010. pp. 1088–1098, Nov. 2007.
[16] L. K. Shar and H. B. K. Tan, “Predicting common web application [43] Z.-H. Zhou, “When semi-supervised learning meets ensemble
vulnerabilities from input validation and sanitization code learning,” in Proc. Int. Workshop Multiple Classifier Syst., 2009,
patterns,” in Proc. Int. Conf. Automated Softw. Eng., 2012, pp. 310– pp. 529–538.
313. [44] Chord: A versatile platform for program analysis. (2011). Proc.
[17] C. Anley, Advanced SQL Injection in SQL Server Applications, Next Tutorial ACM Conf. Program. Language Des. Implementation
Generation Security Software Ltd., White Paper, 2002. [Online]. Available: http://pag.gatech.edu/chord
[18] S. Palmer, Web application vulnerabilities: Detect, exploit, pre- [45] F. Yamaguchi, M. Lottmann, and K. Rieck, “Generalized vulnera-
vent, Syngress, 2007. bility extrapolation using abstract syntax trees,” in Proc. Annu.
[19] Y. Kamei, A. Monden, S. Matsumoto, T. Kakimoto, and K. Matsu- Comput. Security Appl. Conf., 2012, pp. 359–368.
moto, “The effects of over and under sampling on fault-prone [46] F. Yamaguchi, C. Wressnegger, H. Gascon, and K. Rieck,
module detection,” in Proc. Int. Symp. Empirical Softw. Eng. Meas., “Chucky: Exposing missing checks in source code for vulnerabil-
2007, pp. 196–204. ity discovery,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secu-
[20] J. Demsar, “Statistical comparisons of classifiers over multiple rity, 2013, pp. 499–510.
data sets,” J. Mach. Learning Res., vol. 7, pp. 1–30, 2006. [47] PHP Security [Online]. Available: http://www.php.net/manual/
_
[21] A. Kiezun, P. J. Guo, K. Jayaraman, and M. D. Ernst, “Automatic en/security.php, 2013.
creation of SQL injection and cross-site scripting attacks,” in Proc. [48] H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive syn-
Int. Conf. Softw. Eng., 2009, pp. 199–209. thetic sampling approach for imbalanced learning,” in Proc. Int.
[22] M. Martin and M. S. Lam, “Automatic generation of XSS and SQL Joint Conf. Neural Netw., 2008, pp. 1322–1328.
injection attacks with goal-directed model checking,” in Proc. [49] H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE
USENIX Security Symp., 2008, pp. 31–43. Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[23] Y. Shin, A. Meneely, L. Williams, and J. A. Osborne, “Evaluating [50] M. A. Hall, “Correlation-based feature selection for machine
complexity, code churn, and developer activity metrics as indica- learning,” Ph.D. thesis, Dept. Comput. Sci., Univ. Waikato, Hamil-
tors of software vulnerabilities,” IEEE Trans. Softw. Eng., vol. 37, ton, New Zealand, 1998.
no. 6, pp. 772–787, Nov./Dec. 2011. [51] PHP Top 5 [Online]. Available: https://www.owasp.org/index.
[24] J. Walden, M. Doyle, G. A. Welch, and M. Whelan, “Security of php/PHP_Top_5, 2014.
open source web applications,” in Proc. Int. Symp. Empirical Softw.
Eng. Meas., 2009, pp. 545–553. Lwin Khin Shar received the PhD degree in
[25] N. Nagappan, T. Ball, and B. Murphy, “Using historical in-process electrical and electronic engineering from the
and product metrics for early estimation of software failures,” Nanyang Technological University of Singapore.
in Proc. Int. Symp. Softw. Rel. Eng., 2006, pp. 62–74. He is a research associate in software verification
[26] S. Neuhaus, T. Zimmermann, C. Holler, and A. Zeller, “Predicting and validation at the SnT centre for Security, Reli-
vulnerable software components,” in Proc. ACM Conf. Comput. ability, and Trust, University of Luxembourg. His
Commun. Security, 2007, pp. 529–540. research interests include software security and
[27] X. Fu and C.-C. Li, “A string constraint solver for detecting web privacy analysis using program analysis and
application vulnerability,” in Proc. Int. Conf. Softw. Eng. Knowl. machine learning techniques. He is a member of
Eng., 2010, pp. 535–542. the IEEE.
[28] G. Wassermann and Z. Su, “Sound and precise analysis of web
applications for injection vulnerabilities,” in Proc. ACM SIGPLAN
Conf. Program. Language Des. Implementation, 2007, pp. 32–41.
[29] L. K. Shar and H. B. K. Tan, “Automated removal of cross site
scripting vulnerabilities in web applications,” Inf. Softw. Technol.,
vol. 54, no. 5, pp. 467–478, 2012.
[30] K.-K. Ma, K. Y. Phang, J. S. Foster, and M. Hicks, “Directed sym-
bolic execution,” in Proc. Int. Conf. Static Anal., 2011, pp. 95–111.
[31] M. Weiser, “Program slicing,” in Proc. Int. Conf. Softw. Eng., 1981,
pp. 439–449.
[32] S. Horwitz, T. Reps, and D. Binkley, “Interprocedural slicing
using dependence graphs,” ACM Trans. Program. Languages Syst.,
vol. 12, no. 1, pp. 26–61, 1990.
[33] L. K. Shar, H. B. K. Tan, and L. C. Briand, “Mining SQL injection
and cross site scripting vulnerabilities using hybrid program ana-
lysis,” in Proc. Int. Conf. Softw. Eng., 2013, pp. 642–651.
[34] T. Menzies, J. Greenwald, and A. Frank, “Data mining static code
attributes to learn defect predictors,” IEEE Trans. Softw. Eng.,
vol. 33, no. 1, pp. 2–13, Jan. 2007.
SHAR ET AL.: WEB APPLICATION VULNERABILITY PREDICTION USING HYBRID PROGRAM ANALYSIS AND MACHINE LEARNING 707

Lionel C. Briand is a full professor and a vice- Hee Beng Kuan Tan received the PhD degree in
director of the Interdisciplinary Centre for ICT computer science from the National University
Security, Reliability, and Trust (SnT), University of Singapore. He is an associate professor in
of Luxembourg. He was granted the IEEE Com- the Division of Information Engineering in the
puter Society Harlan Mills award in 2012 for School of Electrical and Electronic Engineering,
contributions to Model-based Verification and Nanyang Technological University. He has
Testing, and elected Reliability Engineer of the 13 years of experience in IT industry before mov-
year (2013) by the IEEE Reliability Society. His ing to academic. He was also a lecturer in the
research interests include software testing and Department of Information Systems and Com-
verification, model-driven engineering, quality puter Science in the National University of Singa-
assurance and control, and applications of pore. His research interests include software
machine learning and evolutionary computation to software engineering. testing and analysis, software security, and software size estimation. He
He is a fellow of the IEEE (2010) and a Canadian professional engineer is a senior member of IEEE and a member of the ACM.
(P. Eng.) registered in Ontario, Canada.
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.

You might also like