Web Application Vulnerability Prediction Using Machine Learning
ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 8, Issue 5, May-2017
IJSER © 2017, http://www.ijser.org

In this proposed system, hybrid (static + dynamic) program attributes are used to characterize input validation and sanitization code patterns, which act as a significant indicator of web application vulnerabilities. Current vulnerability prediction techniques rely on the availability of data labeled with vulnerability information for training. For most web applications, past vulnerability data is often not available, or at least not complete. Hence, this approach can be used both where labeled past data is fully available and where it is not. The web program is sliced into small sinks, and input validation and sanitization attributes are generated using dynamic and static program analysis.

Keywords – Input validation, input sanitization, static program analysis, dynamic program analysis, machine learning.

I INTRODUCTION

Web applications play an important role in many of our daily activities such as social networking, email, banking, shopping, registrations, and so on. Input validation and input sanitization are two secure coding techniques that developers can adopt to protect their programs from common web vulnerabilities. Input validation typically checks an input against required properties like data length, range, type, and sign. Input sanitization, in general, cleanses an input string by accepting only pre-defined characters and rejecting others, including characters with special meaning to the interpreter under consideration. Intuitively, an application is vulnerable if the developers failed to implement these techniques correctly or to a sufficient degree.

The code attributes that characterize the validation and sanitization code implemented in a program could be used to predict web application vulnerabilities. Based on this hypothesis, we propose a set of code attributes called input validation and sanitization (IVS) attributes, from which we build vulnerability predictors that are fine-grained, accurate, and scalable. The approach is fine-grained because it identifies vulnerabilities at the program statement level. We use both static and dynamic program analysis techniques to extract IVS attributes. Static analysis can help assess general properties of a program. Yet, dynamic analysis can focus on more
specific code characteristics that are complementary to the information obtained with static analysis. We use dynamic analysis only to infer the possible types of input validation and sanitization code, rather than to precisely prove their correctness, and apply machine learning on these inferences for vulnerability prediction. Therefore, we mitigate the scalability issue typically associated with dynamic analysis. Thus, our proposed IVS attributes reflect relevant properties of the implementations of input validation and input sanitization methods in web programs and are expected to help predict vulnerabilities in an accurate and scalable manner. Furthermore, both supervised learning and semi-supervised learning methods are used to build vulnerability predictors from IVS attributes, so that our method can also be used in contexts where there is limited vulnerability data for training.

II RELATED WORKS

N. Jovanovic, C. Kruegel, and E. Kirda [1] proposed “Pixy: A static analysis tool for detecting web application vulnerabilities.” Pixy is the first open-source tool for statically detecting XSS vulnerabilities in PHP code by means of data flow analysis. It implements a flow-sensitive, interprocedural, and context-sensitive data flow analysis for PHP, targeted at detecting taint-style vulnerabilities. This analysis had to overcome significant conceptual challenges due to the untyped nature of PHP. Additional literal analysis and alias analysis are the steps that lead to more comprehensive and precise results than those provided by previous approaches. Pixy, the system that implements the proposed analysis technique, is written in Java and licensed under the GPL. A straightforward approach to detecting taint-style vulnerabilities would be to immediately conduct a taint analysis on the intermediate three-address code representation generated by the front-end. This taint analysis would identify points where tainted data can enter the program, propagate taint values along assignments and similar constructs, and inform the user of every sensitive sink that receives tainted input. Pixy also performs an alias analysis to provide information about alias relationships. Moreover, it is very beneficial for the taint analysis to know about the literal values that variables and constants may hold at each program point; this task is performed by literal analysis.

Y. Xie and A. Aiken [2], in “Static detection of security vulnerabilities in scripting languages,” apply static analysis to finding security vulnerabilities in PHP. The goal is to develop a bug detection tool that automatically finds serious vulnerabilities with high confidence. An interprocedural static analysis algorithm for PHP is proposed. A language as dynamic as PHP presents unique challenges for static analysis: language constructs that allow dynamic inclusion of program code, variables whose types change during execution, operations with semantics that depend on the runtime types of the operands, and pervasive use of hash tables and regular expression matching are just some features that must be modelled well to produce useful results. The proposed static analysis algorithm is used to find SQL injection vulnerabilities. Once configured, the analysis is fully automatic. Although the system focuses on SQL injection, the same techniques can be applied to detecting other vulnerabilities such as cross-site scripting (XSS) and code injection in web applications. The PHP source code is parsed into abstract syntax trees (ASTs); the parser is based on the standard open-source implementation of PHP 5.0.5. Each PHP source file contains a main section and zero or more user-defined functions. The user-defined functions are stored in the environment and the analysis starts from the main function. For each function in the program, the analysis performs a standard
conversion from the abstract syntax tree (AST) of the function body into a control flow graph (CFG). The nodes of the CFG are basic blocks: maximal single-entry, single-exit sequences of statements. The edges of the CFG are the jump relationships between blocks. For conditional jumps, the corresponding CFG edge is labelled with the branch predicate. Each basic block is simulated using symbolic execution. The goal is to understand the collective effects of the statements in a block on the global state of the program and summarize those effects into a concise block summary. After computing a summary for each basic block, a standard reachability analysis combines the block summaries into a function summary, which describes the pre- and postconditions of the function.

D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna [3], in “Saner: Composing static and dynamic analysis to validate sanitization in web applications,” introduce a novel approach to analyzing the correctness of the sanitization process. The approach combines two complementary techniques to model the sanitization process and to verify its thoroughness. More precisely, the first technique, based on static analysis, models how an application modifies its inputs along the paths to a sink, using precise modelling of string manipulation routines; this helps to identify cases where the sanitization is incorrect or incomplete. Because it uses a conservative model of string operations, it might produce false positives. Therefore, a second technique, based on dynamic analysis, is devised. It works bottom-up from the sinks and reconstructs the code the application uses to modify its inputs; this code is then executed on a large set of malicious input values to identify exploitable flaws in the sanitization process. The two techniques are composed to leverage their advantages and mitigate their disadvantages. The authors implemented this approach and evaluated the system on a set of real-world applications, identifying a number of previously unknown vulnerabilities in the sanitization routines of the analyzed programs.

L. K. Shar and H. B. K. Tan [4], in “Predicting SQL injection and cross site scripting vulnerabilities through mining input sanitization patterns,” observe that an application that accesses a database via SQL is vulnerable if an unrestricted input is used to build the query string, because an attacker might craft the input value to gain unauthorized access to the database and perform malicious actions; this security issue is called SQLI vulnerability. An application that sends HTTP response data to a web client is vulnerable if an unrestricted input is included in the response data, because an attacker might inject malicious JavaScript code in the input value; the injected code, when executed by the client’s browser, could perform malicious actions against the client. This security issue is called XSS vulnerability. Web developers generally implement input sanitization schemes to prevent these two vulnerabilities, and the work proposes input sanitization code attributes that can be statically collected. From these attributes, the aim is to build SQLI and XSS vulnerability predictors that provide high recall and low false alarm rates, so that the predictors can be used alternatively or in combination with existing taint-based approaches. Compared to current vulnerability prediction approaches, only static code attributes are used, and vulnerable code is targeted at the statement level.
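The dynamic step described for Saner, executing reconstructed sanitization code on malicious inputs, can be illustrated with a small Python sketch. Both sanitizers and the attack strings below are invented for illustration; the oracle simply checks whether any dangerous character survives sanitization.

```python
import html
import re

def weak_sanitizer(s):
    # Hypothetical sanitizer: strips <script> tags only.
    return re.sub(r'(?i)</?script[^>]*>', '', s)

def strong_sanitizer(s):
    # Hypothetical sanitizer: HTML-encodes all special characters.
    return html.escape(s, quote=True)

ATTACKS = [
    '<script>alert(1)</script>',
    '<img src=x onerror=alert(1)>',
    '"><svg/onload=alert(1)>',
]
DANGEROUS = re.compile(r'[<>"\']')

def is_faulty(sanitizer):
    """Saner-style oracle: faulty if any attack survives with
    dangerous characters intact after sanitization."""
    return any(DANGEROUS.search(sanitizer(a)) for a in ATTACKS)

print(is_faulty(weak_sanitizer))    # True: tag stripping misses attribute payloads
print(is_faulty(strong_sanitizer))  # False
```

A tag-stripping sanitizer passes naive tests but fails on attribute-based payloads, which is exactly the kind of incomplete sanitization this dynamic testing aims to expose.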
A further related work, on how to build and evaluate fault prediction models, describes a study performed in an industrial setting that attempts to build predictive models to identify parts of a Java system with a high fault probability. The system under consideration is constantly evolving, as several releases a year are shipped to customers. Developers usually have limited resources for their testing and would like to devote extra resources to faulty system parts. The main research focus is to systematically assess three aspects of how to build and evaluate fault-proneness models in the context of this large Java legacy system development project: (1) comparing many data mining and machine learning techniques for building fault-proneness models, (2) assessing the impact of using different metric sets such as source code structural measures and change/fault history (process measures), and (3) comparing several alternative ways of assessing the performance of the models, in terms of (i) confusion matrix criteria such as accuracy and precision/recall, (ii) ranking ability, using the area under the receiver operating characteristic curve (ROC), and (iii) a proposed cost-effectiveness measure (CE).

III EXISTING SYSTEM

To address such security threats, many web vulnerability detection approaches, such as static taint analysis, dynamic taint analysis, model checking, and symbolic and concolic testing, have been proposed. Static taint analysis approaches are scalable in general, but are ineffective in practice due to high false positive rates. Dynamic taint analysis, model checking, and symbolic and concolic testing techniques can be highly accurate, as they are able to generate real attack values, but they have scalability issues for large systems due to the path explosion problem. There are also scalable vulnerability prediction methods, but the granularity of current prediction approaches is coarse: they identify vulnerabilities at the level of software modules or components.

IV PROPOSED SYSTEM

Input validation and input sanitization are two secure coding techniques that developers can adopt to protect their programs from such common vulnerabilities.
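The false-positive problem of purely static taint analysis can be made concrete with a toy sketch. The three-address-style program below is invented; because the analysis cannot reason about the sanitizer's semantics, it conservatively keeps taint through the sanitize step and flags the safe sink as well as the unsafe one.

```python
# Naive static taint analysis over a toy three-address program.
# Each statement is (dest, op, args): sources taint their results,
# and taint propagates through assignments and concatenations.
PROGRAM = [
    ("a", "source",   []),          # a = $_GET['id']   (tainted input)
    ("b", "sanitize", ["a"]),       # b = htmlspecialchars(a)
    ("c", "concat",   ["b", "a"]),  # c = b . a
    ("_", "sink",     ["b"]),       # echo b  -> flagged: false positive
    ("_", "sink",     ["c"]),       # echo c  -> flagged: true positive
]

def analyze(program):
    tainted, findings = set(), []
    for i, (dest, op, args) in enumerate(program):
        if op == "source":
            tainted.add(dest)
        elif op in ("concat", "sanitize"):
            # A purely static analysis that ignores sanitizer
            # semantics must conservatively keep the taint.
            if any(a in tainted for a in args):
                tainted.add(dest)
        elif op == "sink":
            if any(a in tainted for a in args):
                findings.append(i)
    return findings

print(analyze(PROGRAM))  # [3, 4]: the sanitized sink at index 3 is a false positive
```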
The predictors built from these attributes are fine-grained, accurate, and scalable. The approach is fine-grained because it identifies vulnerabilities at the program statement level. We use both static and dynamic program analysis techniques to extract IVS attributes. Static analysis can help assess general properties of a program. Yet, dynamic analysis can focus on more specific code characteristics that are complementary to the information obtained with static analysis. We use dynamic analysis only to infer the possible types of input validation and sanitization code, rather than to precisely prove their correctness, and apply machine learning on these inferences for vulnerability prediction. Therefore, we mitigate the scalability issue typically associated with dynamic analysis. Thus, our proposed IVS attributes reflect relevant properties of the implementations of input validation and input sanitization methods in web programs and are expected to help predict vulnerabilities in an accurate and scalable manner. Furthermore, we use both supervised learning and semi-supervised learning methods to build vulnerability predictors from IVS attributes, so that our method can also be used in contexts where there is limited vulnerability data for training.

The main modules of the proposed system are:

1. Static and dynamic program analysis
2. Backward slicing
3. Construction of PDG and SDG
4. Slicing of each sink
5. Static and dynamic analysis on each slice
6. Classification of path in each slice
7. IVS attributes
8. Building vulnerability prediction model
   A. Data representation
   B. Data processing
9. Supervised learning
10. Semi-supervised learning
11. Final predictor

1. STATIC AND DYNAMIC PROGRAM ANALYSIS

Both static and dynamic program analysis techniques are used to extract IVS attributes. Static analysis can help assess general properties of a program. Yet, dynamic analysis can focus on more specific code characteristics that are complementary to the information obtained with static analysis. The dynamic analysis is used only to infer the possible types of input validation and sanitization code, rather than to precisely prove their correctness.

2. BACKWARD SLICING

Program slicing is a program analysis and transformation technique that decomposes programs by analyzing their data and control flow.
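The backward slice computation just described can be sketched as a transitive closure over a dependence graph. The statement labels and dependence edges below are invented; a real implementation would derive them from the PDG/SDG.

```python
# Backward static slice over a toy dependence graph: deps[s] is the
# set of statements that statement s is data- or control-dependent on.
deps = {
    "s1": set(),          # x = $_GET['q']
    "s2": set(),          # y = 42
    "s3": {"s1"},         # z = sanitize(x)
    "s4": {"s2"},         # w = y + 1
    "s5": {"s3", "s1"},   # echo z . x   (the sink)
}

def backward_slice(sink, deps):
    """All statements that may affect the sink (including the sink)."""
    slice_, work = set(), [sink]
    while work:
        s = work.pop()
        if s not in slice_:
            slice_.add(s)
            work.extend(deps[s])
    return slice_

print(sorted(backward_slice("s5", deps)))  # ['s1', 's3', 's5']
```

Statements s2 and s4 cannot affect the sink, so they are excluded; only the slice is subjected to the later hybrid analysis.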
A sink is a node in a CFG that uses variables defined from input sources and thus may be vulnerable to input manipulation attacks. This allows us to predict vulnerabilities at the statement level. Input nodes are the nodes at which data from the external environment are accessed. A variable is tainted if it is defined from input nodes. As described earlier, the first step of the approach is to compute a backward static program slice for each sink and the set of tainted variables it uses. A backward static slice with respect to a slicing criterion consists of all nodes (including predicates) in the CFG that may affect the values of the subset of variables used at the criterion.

3. CONSTRUCTION OF PDG AND SDG

We first construct the PDG for the main method of a web application program, and also construct PDGs for the methods called from the main method, according to the algorithm given by Ferrante et al. We then construct the SDG. A PDG models a program procedure as a graph in which the nodes represent program statements and the edges represent data or control dependences between statements. The SDG extends the PDG by modeling interprocedural relations between the main program and its subprograms.

4. SLICING OF EACH SINK

In security analysis, it is important to first identify all the input sources. The reason for classifying the inputs into different types is that each class of inputs causes different types of vulnerabilities, and different security defense schemes may be required to secure these different classes of inputs.

6. CLASSIFICATION OF PATH IN EACH SLICE

For each sink, a backward static program slice is computed with respect to the sink statement and the variables used in the sink. Each path in the slice is analyzed using hybrid (static and dynamic) analysis to extract its validation and sanitization effects on those variables. The path is then classified according to the input validation and sanitization effects inferred by the hybrid analysis.

7. IVS ATTRIBUTES

These attributes characterize the various types of program functions and operations that are commonly used as input validation and sanitization procedures to defend against web application vulnerabilities.
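The classification of a sliced path can be pictured as mapping each function call on the path to a security-related category. The function-to-category table below is a small, hypothetical subset of the IVS scheme; in the full approach, unknown custom functions would be resolved with dynamic analysis.

```python
# Classify the calls on one sliced path into IVS-style categories.
# The table is a hypothetical subset; unknown custom functions would
# be resolved with dynamic analysis in the full approach.
CATEGORY = {
    "mysql_real_escape_string": "String-delimiter",
    "htmlspecialchars":         "Encode",
    "intval":                   "Numeric",
    "substr":                   "Propagate",
}

def classify_path(calls):
    summary = {}
    for fn in calls:
        cat = CATEGORY.get(fn, "Unknown")  # statics alone are insufficient
        summary[cat] = summary.get(cat, 0) + 1
    return summary

path = ["intval", "substr", "htmlspecialchars", "my_custom_filter"]
print(classify_path(path))
# {'Numeric': 1, 'Propagate': 1, 'Encode': 1, 'Unknown': 1}
```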
Using these attributes, functions and operations are classified according to their security-related properties. Hybrid analysis-based attributes are attributes extracted by combining static analysis and dynamic analysis. The reason for including input sources in our classification scheme is that most of the common vulnerabilities arise from the misidentification of inputs. That is, developers may implement adequate input validation and sanitization methods and yet fail to recognize all the data that could be manipulated by external users, thereby missing some of the inputs for validation. Therefore, in security analysis, it is important to first identify all the input sources.

This hybrid analysis-based classification is applied to validation and sanitization methods implemented using both standard security functions and nonstandard security functions. If there are only standard security functions to be classified, we classify them based on their security-related information; otherwise, dynamic analysis is used. Various input validation and sanitization processes may be implemented using language built-in functions and/or custom functions. Since inputs to web applications are naturally strings, string replacement/matching functions or string manipulation procedures like escaping are generally used to implement custom input validation and sanitization procedures. A good security function generally consists of a set of string functions that accept safe strings or reject unsafe strings. These functions are important indicators of vulnerabilities, but we need to analyze the purpose of each validation and sanitization function, since different defense methods are generally required to prevent different types of vulnerabilities. It is important to classify these methods implemented in a program path into different types because, together with their associated vulnerability data, our vulnerability predictors can learn this vulnerability information and then predict future vulnerabilities.

8. BUILDING VULNERABILITY PREDICTION MODEL

Many machine learning techniques can be used to build vulnerability predictors. Regardless of the specific technique used, the goal is to learn and generalize patterns in the data associated with sinks, which can then be efficiently used for predicting vulnerability for new sinks. As more sophisticated security attacks are discovered, it is important for a vulnerability analysis approach to be able to adapt. With machine learning, it is possible to adapt to new vulnerability patterns via re-training.

A. DATA REPRESENTATION

Our unit of measurement, an instance in machine learning terminology, is a path in the slice of a sink, and we characterize each path with IVS attributes. The attribute values may range from zero to an upper bound that depends on the number of program operations or functions. Since 33 IVS attributes are proposed, each path is represented by a 33-dimensional attribute vector.

B. DATA PROCESSING

In most of our datasets, the proportion of vulnerable sinks to non-vulnerable ones is small. This is an imbalanced data problem and should be expected in many such vulnerability datasets. Prior studies have clearly shown that imbalanced data can significantly affect the performance of machine learning classifiers, because some of the data might go unlearned by the classifier due to their lack of representation, leading to induction rules that tend to explain the majority class data and favour its predictive accuracy. Since, for our problem, the minority class data capture the ‘vulnerable’ instances, we need high predictive accuracy for this class, as missing a vulnerability is far more critical than
reporting a false alarm. To address this problem, we use a sampling method called adaptive synthetic oversampling. It balances the imbalanced data by generating synthetic, artificial data for the minority class instances, thus reducing the bias introduced by the class imbalance problem. It does not require modification of standard classifiers and thus can be conveniently added as an additional data pre-processing step.

9. SUPERVISED LEARNING

Classification is a type of supervised learning method, because the class label of each training instance has to be provided. In this study, we build logistic regression (LR) and Random Forest (RF) models from the proposed attributes. LR is a type of statistical classification model. It can be used to predict the outcome (class label) of a dependent attribute based on one or more predictor attributes; the probabilities describing the possible outcomes of a given instance are modelled. Logistic regression analysis is flexible in terms of the types of monotonic relationships it can model between the probability of vulnerability and the predictor attributes. RF is an ensemble learning method for classification that consists of a collection of tree-structured classifiers. In many cases the predictive accuracy is greatly enhanced because the final prediction output comes from an ensemble of learners rather than a single learner. Given an input sample, each tree casts a vote (classification) and the forest outputs the classification having the majority vote among the trees.

10. SEMI-SUPERVISED LEARNING

As ensemble learning works by combining individual classifiers, it typically requires significant amounts of labeled data for training. In certain industrial contexts, relevant labeled data available for learning may be limited. Semi-supervised methods [39] use, for training, a small amount of labeled data together with a much larger amount of unlabeled data. A method that exploits unlabeled data can enable ensemble learning even when there are very few labeled data. Combining semi-supervised learning with ensembles has many advantages: unlabeled data is exploited to help enrich the labeled training samples, allowing ensemble learning, and each individual learner is improved with unlabeled data labeled by the ensemble consisting of all the other learners. A few different types of semi-supervised methods, such as EM-based, clustering-based, and disagreement-based learning, have been proposed in the literature, but none of these techniques has been explored for vulnerability prediction so far. Hence, based on these motivations, we explore the use of an algorithm called CoForest (Co-trained Random Forest, CF), which applies semi-supervised learning to RF. It is a disagreement-based, semi-supervised learner. CF uses multiple, diverse learners and combines them to exploit unlabeled data (semi-supervised learning), while maintaining a large disagreement between the learners to promote the learning process.

11. FINAL PREDICTOR

A qualified web application vulnerability predictor can be built with the help of the input validation and sanitization attributes and the machine learning techniques. Using the above attributes, we are able to generate a web application vulnerability predictor which is highly accurate, fine-grained, and scalable.

V IMPLEMENTATION

A. DERIVATION OF IVS ATTRIBUTES

The code attributes that characterize the validation and sanitization code implemented in a program could be used to predict web application vulnerabilities. Based on this hypothesis, we propose a set of code attributes called input validation and sanitization (IVS) attributes, from which we build vulnerability predictors that are fine-grained, accurate, and scalable.
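A derived path is ultimately encoded as a fixed-length count vector over the IVS attributes. The sketch below uses a small invented function-to-attribute mapping and a five-attribute subset; the full scheme has 33 dimensions.

```python
# Encode one path as a count vector over a subset of the IVS
# attributes. The full approach uses 33 attributes; 5 shown here.
ATTRIBUTES = ["Session", "Numeric", "String-delimiter", "Encode", "Limit-length"]
FN_TO_ATTR = {                      # hypothetical mapping
    "intval":           "Numeric",
    "addslashes":       "String-delimiter",
    "htmlspecialchars": "Encode",
    "substr":           "Limit-length",
}

def to_vector(path_calls):
    index = {name: i for i, name in enumerate(ATTRIBUTES)}
    vec = [0] * len(ATTRIBUTES)
    for fn in path_calls:
        attr = FN_TO_ATTR.get(fn)
        if attr in index:
            vec[index[attr]] += 1
    return vec

print(to_vector(["intval", "htmlspecialchars", "intval"]))  # [0, 2, 0, 1, 0]
```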
3. Text-database - Text-based input accessed from a database
4. Numeric-database - Numeric-based input accessed from a database
5. Session - Input accessed from a persistent data object such as an HTTP session
6. Uninit - Un-initialized program variable
7. Un-taint - Function that returns predefined information, or information not influenced by external users
8. Known-vuln-user - Custom function that has caused security issues in the past
9. Known-vuln-std - Language built-in function that has caused security issues in the past
10. Propagate - Function or operation that propagates the partial or complete value of a string
11. Numeric - Function or operation that converts a string into a numeric value
12. DB-operator - Function that filters query operators such as ( = )
13. DB-comment-delimiter - Function that filters query comment delimiters such as (--)
14. DB-special - Function that filters other database special characters different from the above, such as (\x00) and (\x1a)
15. String-delimiter - Function that filters string delimiters such as (‘) and (“)
25. Slash - Function that filters slash (/)
26. Newline - Function that filters newline (\n)
27. Colon - Function that filters colon (:) or semi-colon (;)
28. Other-special - Function that filters any other special characters different from the above
29. Encode - Function that encodes a string into a different format
30. Canonicalize - Function that converts a string into its most standard, simplest form
31. Path - Function that filters directory paths or URLs
32. Limit-length - Function or operation that limits a string to a specific length

Fig. 5.1 Implementation of vulnerability predictor
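The final predictor aggregates individual classifiers by majority vote, as in Random Forest. The sketch below shows only the voting mechanism over invented decision stumps on IVS attribute values; a real RF would learn many randomized trees from bootstrap samples of the labeled paths.

```python
# Majority voting over an ensemble, the aggregation rule used by
# Random Forest. Each "tree" is an invented decision stump over one
# IVS attribute index; real trees would be learned from data.
def make_stump(attr_index, threshold):
    return lambda x: 1 if x[attr_index] > threshold else 0  # 1 = vulnerable

forest = [make_stump(0, 1), make_stump(1, 0), make_stump(2, 2)]

def predict(forest, x):
    votes = sum(tree(x) for tree in forest)
    return 1 if votes > len(forest) / 2 else 0

print(predict(forest, [2, 1, 0]))  # stumps vote 1, 1, 0 -> majority: vulnerable (1)
print(predict(forest, [0, 0, 0]))  # stumps vote 0, 0, 0 -> not vulnerable (0)
```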
VI CONCLUSION

The approach constructs the program dependence graph and system dependence graph of a web program. The input validation and sanitization attributes act as the building blocks for the web application vulnerability predictor.
REFERENCES
[1] N. Jovanovic, C. Kruegel, and E. Kirda, “Pixy: A static analysis tool for detecting web application vulnerabilities,” in Proc. IEEE Symp. Security and Privacy, 2006, pp. 258–263.