A System For Profiling and Monitoring Database Access Patterns by Application Programs For Anomaly Detection


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may
change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2598336, IEEE Transactions on Software Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 1

A System for Profiling and Monitoring Database Access Patterns by Application Programs for Anomaly Detection
Lorenzo Bossi, Elisa Bertino, Fellow, IEEE, and Syed Rafiul Hussain, Member, IEEE

Abstract—Database Management Systems (DBMSs) provide access control mechanisms that allow database administrators (DBAs)
to grant application programs access privileges to databases. Though such mechanisms are powerful, in practice a finer-grained access
control mechanism tailored to the semantics of the data stored in the DBMS is required as a first-class defense mechanism against
smart attackers. Hence, custom-written applications which access databases implement an additional layer of access control.
Therefore, securing a database alone is not enough for such applications, as attackers aiming at stealing data can take advantage of
vulnerabilities in the privileged applications and make these applications issue malicious database queries. An access control
mechanism can only prevent application programs from accessing the data to which the programs are not authorized, but it is unable to
prevent misuse of the data to which application programs are authorized for access. Hence, we need a mechanism able to detect
malicious behavior resulting from previously authorized applications. In this paper, we present the architecture of an anomaly detection
mechanism, DetAnom, that aims to solve this problem. Our approach is based on the analysis and profiling of the application in order to
create a succinct representation of its interaction with the database. Such a profile keeps a signature for every submitted query and
also the corresponding constraints that the application program must satisfy to submit the query. Later, in the detection phase,
whenever the application issues a query, a module captures the query before it reaches the database and verifies the corresponding
signature and constraints against the current context of the application. If there is a mismatch, the query is marked as anomalous. The
main advantage of our anomaly detection mechanism is that, in order to build the application profiles, we need neither any previous
knowledge of application vulnerabilities nor any example of possible attacks. As a result, our mechanism is able to protect the data
from attacks tailored to database applications, such as code modification attacks and SQL injections, and from other data-centric
attacks. We have implemented our mechanism with a software testing technique called concolic testing and the PostgreSQL
DBMS. Experimental results show that our profiling technique is close to accurate and requires an acceptable amount of time, and that the
detection mechanism incurs low run-time overhead.

Index Terms—Database, Insider Attacks, Anomaly Detection, Application Profile, SQL Injection

1 INTRODUCTION
Data stored in databases is often critical to the organization's operations and also sensitive, for example with
respect to privacy. Therefore, securing data stored in a database is a critical requirement. Data must be protected
not only from external attackers, but also from users within the organization [3]. A wide range of institutions, from
government agencies (e.g., military, judiciary, etc.) to commercial enterprises, are witnessing attacks by insiders at an
alarming rate. The most important objective of these insiders is either to exfiltrate sensitive data (e.g., military plans,
trade secrets, intellectual property, etc.) or to maliciously modify the data for deception purposes or for attack
preparation [1], [8], [16].

There are a number of facts that make the prevention of insider attacks more challenging compared with other
conventional (external) attacks [4]. First, insiders are allowed to access resources, such as data and computer systems,
and services inside the organization networks as they possess valid credentials. Second, the actions of insiders originate
at a trusted domain within the network, and thus are not subject to thorough security checks in the same way as
external actions are. For instance, there is often no internal firewall within the organization network. Third, insiders are
often highly trained computer experts, who have knowledge about the internal configuration of the network and the
security and auditing controls deployed. Therefore, they may be able to circumvent conventional security mechanisms.

Protecting data from insider threats requires combining different techniques. One important such technique is
represented by the access control system that is implemented as part of the database management system (DBMS) code.
An access control system allows one to specify which users/applications can access which data for which purpose.
In addition to the access control system implemented as part of the DBMS, applications may also perform their own
"application-level" access control in order to implement more complex access control policies. In such cases, accesses
by users to the data stored in a database are mediated by the application programs. However, whereas the use
of DBMS-level and application-level access control mechanisms provides a first layer of defense against insider threats,
these mechanisms are unable to protect against malicious insiders that have access to the applications and can thus
modify the code to change the queries issued to the database and also modify the logic of the application-level access

• L. Bossi, E. Bertino, and S.R. Hussain are with the Department of Computer Science, Purdue University,
West Lafayette, IN, 47907. E-mail: [email protected], [email protected], [email protected]
Manuscript received April 19, 2005; revised September 17, 2014.

0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

control. Software-based attestation [32] or simple integrity measurement by a trusted platform module [23] could be
used for detecting any unauthenticated change to the application source code by expert insiders. However, attestation
is typically executed during the loading of the application's executable and hence it cannot detect changes of program
behavior at run-time. As a result, if a program is compromised during execution by an insider using known attack
techniques, e.g., buffer overflow [9] or return-oriented programming (ROP) [26], attestation mechanisms cannot detect
such malicious changes of behavior in the program. Also, a malicious insider may be able to modify the information
used for the attestation of the target application program, thus rendering attestation useless. Apart from that, using
just a simple integrity measurement technique is not a viable solution because this technique cannot provide integrity for
self-modifying code (e.g., Java, C#) [35], which is widely used in front-end database applications.

In order to address the above problem, one possible approach is to analyze the data access patterns of the
application to create profiles of legitimate activities and then use these profiles at run-time to detect anomalous database
accesses by application programs.

The design of such an anomaly detection system is challenging, as the system should fulfill the following
requirements:
• It should require minimal modifications to the code of the application program and the DBMS.
• It should not introduce significant delays that may negatively impact the performance.
• It should have the least possible number of false positives and false negatives.

In this paper, we propose DetAnom, an anomaly detection mechanism able to identify malicious database transactions
that addresses the above requirements. DetAnom consists of two phases: the profile creation phase and the anomaly detection
phase. In the first phase, we create a profile of the application program that can succinctly represent the application's
normal behavior in terms of its interaction (i.e., submission of SQL queries) with the database. For each query, we create
a signature and also capture the corresponding preconditions that the application program must satisfy to submit
the query. Note that an application program may execute different query sequences depending on different values of
the input parameters. Hence, the profile of the application needs to consider all possible execution paths that lead to
interaction with the database. Each query in the application belongs to one of these paths and has a set of preconditions
(i.e., constraints) in order to be issued.

A major issue in our approach is that exploring all possible execution paths of an application program requires
identifying all possible combinations of program inputs, which is sometimes not feasible. As a result, the unexplored
paths introduce incompleteness in the application profile. The higher the number of paths explored, the more
complete and accurate an application profile is. Hence, to make our profiling technique close to complete and accurate,
we resort to a software testing methodology, known as concolic testing [31], [11], that ensures high coverage
of the application's code as well as of the created profile. Concolic testing works with a combination of symbolic
execution [6] and concrete execution. Symbolic execution is a classical software verification and dynamic program
analysis technique where program variables are considered as symbolic variables and an automated constraint solver
based on constraint programming logic is used to generate new concrete inputs (test cases) with the aim of maximizing
code coverage. Concrete execution is commonly used for testing applications on a particular set of inputs along an
execution path. As the program may have different behaviors for different values of input parameters, our approach with
concolic testing generates inputs automatically to explore all such program behaviors. Note that we do not use our
proposed mechanism in conjunction with concolic testing for finding bugs or verifying the correctness of the program.

Using concolic testing we leverage the advantages of dynamic program analysis over static analysis, which cannot
detect malicious changes of a program's behavior at run-time. Later, in the anomaly detection phase, whenever the
application issues a query, the corresponding query signature and constraints are checked against the current context of the
application. If there is a mismatch, the query is considered anomalous. The main advantage of our anomaly detection
mechanism is that we do not need any knowledge about possible attacks to build the application profiles.

Note that we target our approach to securing internal enterprise software, because this is the most common category
of applications which directly connect to the database. But we want to emphasize that our approach can be extended to
also protect multi-tiered applications, by creating profiles of the application layer which directly communicates with the
database, using its API calls as input, which can be generated by the concolic testing engine.

Moreover, we want to highlight that our goal is to protect the database by monitoring the queries submitted by clients.
Even if our mechanism can identify hosts compromised by viruses or Trojan horses, this is not our main goal, because
our primary goal is to detect malicious or compromised host administrators, who are supposed to access the database
only through the application but who may explicitly disable or tamper with host-based anomaly detection tools.

In this paper, in addition to providing the details of the approach, we discuss the issues that we encountered in
using the concolic testing technique and in capturing the application input at run-time to perform the anomaly
detection. We also report experimental data showing the run-time performance overhead introduced by our anomaly detection
technique. To the best of our knowledge, our approach is the first using software testing techniques for creating execution
profiles of application programs for the purpose of detecting execution anomalies at run-time. Such anomalies may be
indicative of application program tampering. Notice that our approach is complementary to techniques for static
analysis. Such techniques aim at analyzing programs to detect bugs that can be exploited by attacks at run-time, such
as buffer vulnerabilities. Our approach aims at preventing malicious changes to programs, after the completion of the
static analysis, by insiders who have the ability to modify the application source code or the application binary.

The rest of the paper is organized as follows: Section 2 presents relevant preliminary concepts. Section 3 provides
an overview of our system architecture. Section 4 describes


the adversary model and explains some common attacks that our system can block. Sections 5 and 6 describe the
profile creation phase and the anomaly detection phase, respectively. Section 7 discusses implementation details. Section 8
analyses the security of the proposed approach. Section 9 presents an experimental evaluation of DetAnom. Section 10
surveys related work. Section 11 concludes the paper with a discussion on future work.

2 PRELIMINARIES
Software Testing is the process of examining the quality of a software product. It involves monitoring the actual
program execution with the goal of observing unexpected behavior (e.g., wrong output values, program crashes or
early termination) which implies the existence of bugs. It can also give a perspective about the security and risks
in the product or service under test. One of the main challenges in software testing is the capability of testing all
possible program inputs of an application to achieve high code coverage. Concolic testing is one of the widely used
techniques addressing this challenge.

Concolic Execution is a program analysis technique [11], [19], [31] that tries to explore all possible execution paths of
a program by acting according to the following steps. The program to be tested is first concretely executed with some
initial random inputs. Then the concolic execution engine examines the branch conditions along the executed path's
control-flow and uses a decision procedure to find an input that would reverse the branch conditions from true to false
or vice-versa. This process is repeated to discover more inputs that trigger new control-flow paths, and thus more
program states are tested. This technique is particularly useful for the automatic generation of high-coverage test
inputs and for software vulnerability discovery.

3 DETANOM ARCHITECTURE
The system architecture consists of several components, supporting the two phases of DetAnom, that we describe
in what follows.

3.1 Profile Creation Component

Fig. 1. System architecture for profile creation

Figure 1 shows the modules supporting the profile creation phase and their interactions.

This phase starts by providing the application program as input to the concolic execution module, which first
instruments the application. Note that the concolic execution does not require the application source code. The bytecode is
inspected using reflection to find the branches and track the input sources to the branch conditions. Then, the application
is started inside an instrumented virtual machine which links the concolic execution engine to the channels used to
interact with the user. In this way the concolic engine can generate input to force the execution of different branches.

Therefore, the concolic execution module executes the instrumented application a number of times with the
aim of exploring as many execution paths as possible. Since there is no guarantee that the application terminates on each
input, the concolic execution uses a depth-bounded search to limit the profiling time. The depth of the search is a
configurable parameter.

Each time the application program issues a query to the database, the constraint extractor in the profile builder module
extracts the constraints that lead the application program to follow the current path. These constraints compose a part
of the application profile. In addition, each query submitted to the database is also forwarded to the profile builder
module, where the signature generator sub-module generates the signature of that query.

Since the values returned by the database may change the application control flow, these values are considered as
the database inputs to the application program. Hence, in order to automatically generate database inputs for concolic
execution, the instrumentation library hacks the standard database connection library and mocks the behavior of the
real database to let the concolic execution generate the values required to force different execution flows of the
application.

Section 5 discusses details about the constraint extractor and signature generator sub-modules. Finally, the profile
builder module binds the query signature with its corresponding constraints and inserts this record into the
application profile.

3.2 Anomaly Detection Component

[Figure 2: diagram connecting the Application (with its Instrumentation), the SQL Proxy, the Target Database, and the
Anomaly Detection Engine (Signature Generator, Signature Comparator, Application profile). Labeled flows: SQL Query (Q),
application input (I), (Q, I), Detection result, Result.]

Fig. 2. System architecture for anomaly detection

The main modules supporting the anomaly detection phase are: the anomaly detection engine (ADE), the SQL proxy, the
signature comparator, and the target database, as shown in Figure 2.

The data to protect is stored in the target database. We assume that the database server is already secured to the
best of current security technology and can be accessed only through our proxy. The monitored application interacts with


the database through SQL queries which are intercepted by the SQL proxy and forwarded to the ADE for anomaly
detection. Moreover, the instrumented environment collects the application input and adds it as meta-data to the query. The
ADE also includes the signature generator sub-module that generates the signature of the received query. Upon
receiving the query, the ADE checks whether the current program inputs satisfy the constraints of some possible execution
paths. If the constraints are satisfied, the signature comparator compares the signature of the query associated with the
satisfied constraint to that of the received query. If there is a match, the query is considered legitimate; otherwise an
anomaly is detected. This information is then sent back to the proxy, where a custom logic is used to decide the actions
to be executed in order to manage the anomaly. Examples of such actions include rejecting the query, sending an alarm to
a security administrator, revoking the application program authorizations, etc.

4 ADVERSARY MODEL
We assume that at run-time the application program can be tampered with and thus become untrusted. Therefore, we
assume that while the program is executing, the program may issue a query that:
(a) has never been encountered in the profile creation phase, i.e., the query does not belong to the application at all;
(b) belongs to the application but is not relevant to the current execution path;
(c) is relevant to the current execution path, but the program input variables do not satisfy that query's
corresponding constraints.
All of these cases can be easily mapped to well-known security attacks.

In case (a), an attacker may simply use a network sniffer or perform a man-in-the-middle attack to steal the
credentials that the application uses to connect to the database. Once the credentials are stolen, the attacker may use any
other client to connect to the database, elude all the application-level security checks, and issue queries that do not
belong to the application.

In case (b), an attacker may obtain the credentials as described in the previous case and can use a similar
technique to record the queries that the application issues. By repeating an allowed query the attacker can pass through
simpler security checks and thus can retrieve sensitive data. Let us assume that a query retrieves only a row of sensitive
data after the application has performed some sanity checks on the values used to retrieve the row. An attacker may
replay the query several times, changing only the values used to filter the result in order to retrieve all the data
he/she wants.

In case (c), the attacker compromises the application and changes its access control policy. For example, most
applications add an extra layer of security which requires the user to provide a pair of username and password.
Usually, such applications query a database table with the provided credentials to retrieve the set of permissions
granted to the user. Note that this level of security is usually implemented outside of the database. All the instances of
the same application use the same database credentials for the connection and handle the extra layer of security
internally. If an application is compromised so as to return a successful authentication, on the database side we see only
a sequence of allowed queries for which the constraints may not be satisfied with the program inputs.

We assume that every component involved in the profile creation phase and anomaly detection phase is trusted. We also
assume that profiles are stored in a secure storage and are not tampered with by an insider or database administrator.

5 PROFILE CREATION PHASE
In the profile creation phase, the application program interacts with the mock database through SQL queries. We represent
the queries internally in a specific format which we refer to as signatures. Queries' signatures and corresponding
constraints are used to build the profile of the application. For each query, we record its signature and constraints, and
refer to this pair as a query record. All query records of the program are organized in a hierarchical data structure which
represents the control-flow of the application. We refer to this data structure as the application profile.

Before explaining the application profiling technique, we discuss the model that describes the application's normal
behavior, i.e., the fingerprint with respect to the queries issued to the database. For our purpose, an application
can be ideally represented using a directed graph where the nodes represent the application states in which the
application issues queries to the database, and the edges represent the application inputs required to change the state.
We use cycles in the graph to represent the loops in the application code.

The challenge in creating such profiles is in representing correctly the dynamic behavior of the application, as the
application may change its own code, or dynamically download code from the Internet, or use reflection to dynamically
choose which code to invoke. For this reason we use a dynamic analysis technique to create the profile.

The problem, therefore, is that when we deal with complex applications it is difficult to map the actual code to
the graph representation we need. A loop in the code may dynamically create different queries, being mapped as a
sequence in the graph; while a sequence of different functions may issue the same query, being better represented using
a cycle in the graph. When we create the profiles using the concolic execution, what we do is unroll the abstract
graph, recording an execution tree. This is the reason why we need a bounded search and why our profiles may be
incomplete.

In the rest of this section, we discuss the format of the query signatures and constraints, and the detailed
procedure for building the application profile.

5.1 Query Signature Representation
In our system, we consider a subset of the SQL data manipulation statements. Specifically, we focus on the
SELECT, INSERT, UPDATE, and DELETE commands.

SQL syntax is usually represented using Backus Normal Form [20] and allows one to specify very complex queries,


typically nesting them at different levels. In order to describe our query representation techniques, we organize the
presentation in two parts. In the first part we describe how we create the signature of simple queries; in the second we
focus on how we deal with advanced queries which contain nested sub-queries, arithmetic operators and function calls.

5.1.1 Simple Queries
Consider as an example the format of a simple SELECT command:

    SELECT [DISTINCT] {TARGET-LIST}
    FROM {RELATION-LIST}
    WHERE {QUALIFICATION}

Our system internally represents an SQL query as a signature of the form ⟨c, t, r, q, n⟩. Here, c represents the
type of the SQL command, which takes one of the values 'S', 'I', 'U', and 'D' in case of SELECT, INSERT, UPDATE,
and DELETE commands, respectively. The second field, t, is a list that contains the identifiers (IDs) of the attributes
projected in the query, i.e., the attributes that appear in the query result or are modified by the query; this information
is extracted from the TARGET-LIST of the query. Attributes have a unique ID among all the tables. The third field, r, is a
list that contains the IDs of the tables being accessed in the query, i.e., the tables that appear in the RELATION-LIST.
The next field, q, is a list of IDs of attributes referenced in the QUALIFICATION in the WHERE clause of the query. The last
field, n, in the signature denotes the number of predicates in the WHERE clause.

As an example, consider the relation schema in Table 1. IDs of tables and attributes are as shown in the table. Now,
consider the query:

    SELECT employee_id, work_experience
    FROM WorkInfo
    WHERE work_experience > 10;

The signature of the above query is:

    ⟨S, {201, 202}, {200}, {202}, 1⟩

We explain this signature construction in order from left to right. The leftmost S represents the SELECT command.
201 and 202 represent the IDs of attributes employee_id and work_experience, respectively. 200 represents the ID
of the table WorkInfo. 202 represents the attribute used in the WHERE clause, i.e., work_experience. The rightmost 1
corresponds to the number of predicates in the WHERE clause.

TABLE 1
Relation schema

    Table ID   Table name     Attribute ID   Attribute name
    100        PersonalInfo   101            employee_id
                              102            employee_name
    200        WorkInfo       201            employee_id
                              202            work_experience
                              203            salary
                              204            performance
    300        JobInfo        301            base_salary
                              302            min_work_experience
                              303            max_work_experience

For completeness, we briefly describe the other commands as well and we show how they differ from the main
example.

The insert statement is in the form:

    INSERT INTO {RELATION}
    SET {TARGET-LIST}

An INSERT command can specify only one relation, that is, the table where the new values are going to be added. The
target list is a list of the form target = value, where target is a column name and value is an expression that
can be evaluated to the value to be added. A query signature of an insert statement has the form:

    ⟨I, {TARGET-COLUMNS}, {RELATION}, ∅, 0⟩

The update statement has the form:

    UPDATE {RELATION}
    SET {TARGET-LIST}
    WHERE {QUALIFICATION}

An UPDATE statement can specify only one relation, that is, the table to be updated; a target list, similar to the one of
the INSERT case, with the newer values; and a qualification list, similar to the SELECT case, which specifies which rows
are going to be updated. A query signature of an update statement is like that of a SELECT but specifies U in the first
position and has exactly one table in the relation list. As an example:

    ⟨U, {TARGET-COLUMNS}, {RELATION}, {QUALIFICATION}, {#predicates}⟩

The delete statement has the form:

    DELETE {RELATION}
    WHERE {QUALIFICATION}

A DELETE statement specifies only one relation, that is, the table whose rows must be deleted, and a qualification list
specifying the rows to delete. A query signature of a DELETE statement is like the signature of a SELECT statement but
specifies D in the first position, has exactly one table in the relation list and has an empty target list.

    ⟨D, ∅, {RELATION}, {QUALIFICATION}, {#predicates}⟩

5.1.2 Complex Queries
We focus on two different aspects: complex predicates in the WHERE clause and nested queries. Note that these two
aspects are not strictly disjoint, because a sub-query can also be nested inside the WHERE clause.

Sub-queries can appear almost everywhere a value can appear. For example, the following query returns a list of
employees with their working experience and the overall company maximum salary. This query includes a sub-query
as part of the projection clause, that is, the list of data to be returned by the query.

    SELECT employee_id, work_experience, (
        SELECT max(salary) FROM WorkInfo
    ) as maxSalary
    FROM WorkInfo
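
As a rough illustration of the ⟨c, t, r, q, n⟩ format of Section 5.1.1, the sketch below builds the signature of a simple SELECT statement against the IDs of Table 1. It is not the signature generator of DetAnom: the names `simple_select_signature`, `attribute_id`, and `SIMPLE_SELECT` are hypothetical, the regular expression only accepts the simple form (no sub-queries), and counting predicates by splitting the WHERE clause on AND/OR is an assumption made for the example.

```python
import re

# Table and attribute IDs copied from Table 1.
TABLE_IDS = {"PersonalInfo": 100, "WorkInfo": 200, "JobInfo": 300}
ATTRIBUTE_IDS = {
    ("PersonalInfo", "employee_id"): 101,
    ("PersonalInfo", "employee_name"): 102,
    ("WorkInfo", "employee_id"): 201,
    ("WorkInfo", "work_experience"): 202,
    ("WorkInfo", "salary"): 203,
    ("WorkInfo", "performance"): 204,
    ("JobInfo", "base_salary"): 301,
    ("JobInfo", "min_work_experience"): 302,
    ("JobInfo", "max_work_experience"): 303,
}

# Hypothetical pattern covering only the simple SELECT form (no sub-queries).
SIMPLE_SELECT = re.compile(
    r"^\s*SELECT\s+(?:DISTINCT\s+)?(?P<targets>.+?)\s+FROM\s+(?P<tables>.+?)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def attribute_id(column, tables):
    """Resolve a column name against the tables listed in the FROM clause."""
    for table in tables:
        if (table, column) in ATTRIBUTE_IDS:
            return ATTRIBUTE_IDS[(table, column)]
    return None


def simple_select_signature(sql):
    """Return the tuple (c, t, r, q, n) for a simple SELECT statement."""
    match = SIMPLE_SELECT.match(sql)
    if match is None:
        raise ValueError("not a simple SELECT statement")
    tables = [name.strip() for name in match.group("tables").split(",")]
    targets = [name.strip() for name in match.group("targets").split(",")]
    where = match.group("where") or ""

    t = tuple(attribute_id(column, tables) for column in targets)
    r = tuple(TABLE_IDS[table] for table in tables)
    # q: IDs of known columns mentioned in the WHERE clause, in order, no repeats.
    q = []
    for word in re.findall(r"[A-Za-z_]\w*", where):
        found = attribute_id(word, tables)
        if found is not None and found not in q:
            q.append(found)
    # n: assumption for this sketch, predicates are the pieces separated by AND / OR.
    pieces = re.split(r"\s+(?:AND|OR)\s+", where, flags=re.IGNORECASE)
    n = len([piece for piece in pieces if piece.strip()])
    return ("S", t, r, tuple(q), n)


query = """SELECT employee_id, work_experience
FROM WorkInfo
WHERE work_experience > 10;"""
print(simple_select_signature(query))  # prints ('S', (201, 202), (200,), (202,), 1)
```

The printed tuple corresponds to the signature ⟨S, {201, 202}, {200}, {202}, 1⟩ given in Section 5.1.1. A real implementation would work on the parse tree produced by an SQL parser rather than on regular expressions, which is also what handling the nested queries of this section would require.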


Sub-queries can also appear in the WHERE clause. For example, the following query uses a nested query to retrieve
the highest salary and uses this value to select the set of employees who earn it.

    SELECT employee_id
    FROM WorkInfo
    WHERE salary = (
        SELECT max(salary) FROM WorkInfo
    )

Sub-queries can appear also in the FROM clause. In this example, a virtual table is materialized that contains the
total salary paid for every performance level, and such table is used to check the quota that every employee earns
compared to his/her performance level.

    SELECT employee_id, performance,
        salary/total
    FROM WorkInfo, (
        SELECT sum(salary) as total,
            performance as per_group
        FROM WorkInfo
        GROUP BY performance
    ) as SalaryInfo
    WHERE performance = per_group

obtained by using the division operator over two different columns. In this example we can see clearly why the profile
contains both the columns used in the WHERE clause and the number of predicates. Even if such values are strictly
correlated in trivial cases, it is important to know both in order to identify more complex queries. The signature of
the query in the last example is:

    ⟨S, {201, 202, 203, 204}, {200}, {204, 202, 203}, 1⟩

5.2 Concolic Execution
This section describes the basics of concolic execution used during the profile creation phase to explore the possible
execution flows of the application.

The concolic execution takes the application as input and instruments it to log each operation that may affect a
symbolic variable value or a path condition. This module then executes the program concretely with some initial
default input. In order to explore other paths, it examines the branch conditions (i.e., constraints) along the executed
path, and uses a constraint solver to find inputs that would reverse the branch conditions. The execution is repeated for
a number of times until all the execution paths are explored or the depth search limit is reached in all the explored ones.
Considering that the goal of our system is to protect
Eventually, sub-queries may use tables and columns
a database, we expect that the instrumented application
used in the outer queries and mix query types. In the
issues queries along some of these execution paths. The
following example, the base salary is updated according to
issued queries are forwarded to both the profile builder and
the average salary of the employees.
the mocked database. Upon receiving a query, the constraint
UPDATE JobInfo SET base_salary = ( extractor sub-module in the profile builder extracts the con-
SELECT avg(salary) straints that are prerequisite to execute that query. The
FROM WorkInfo mocked database uses the concolic engine to generate the
WHERE min_work_experience < query results that are required to explore newer execution
work_experience AND
work_experience <=
paths.
max_work_experience We want to highlight that the queries are captured at
) the time when they are sent to the database; therefore we
can create correct profiles even when queries are built by
Note that the inner query accesses two columns, concatenating strings. For example, consider the following
min_work_experience and max_work_experience, of code:
the table JobInfo which is not declared in its FROM clause,
but in the parent’s one. Thus, the inner query signature 1 String query = "SELECT ";
2 switch (what) {
contains such columns ID but not their table ID, as shown
3 case 1: query += "name, surname ";
by the following signature: 4 break;
5 case 2: query += "address ";
hS, {203}, {200}, {302, 202, 303}, 2i
6 break;
To create the global query signature we nest signatures 7 default: query += "* ";
8 }
as they appear in the query. Thus, the complete signature of 9 query += " WHERE ssn=’" + ssn + "’";
the query in the last example is: 10 s.executeQuery(query);
hU, {301, hS, {203}, {200}, {302, 202, 303}, 2i}, 300, ∅, 0i
Depending on the value of the variable what, three differ-
Another way to create complex queries is to use func- ent queries can be executed. During the profile creation,
tions or operators to manipulate data, as we can see in the the constraint extractor collects the constraints until the
following example. executeQuery is executed and the signature generator
gets actual string values as passed to the same function.
SELECT * Therefore, in this example, three nodes will be added to the
FROM WorkInfo
WHERE filter(performance, profile, one for every concrete query that can be issued.
work_experience / salary) Before explaining in detail how the profiles are created,
it is important to discuss about how the query signatures are
The WHERE clause of this query contains a custom function extracted. The problem is that in order to create meaningful
which returns a Boolean value starting from the perfor- signatures it is necessary to know the database schema.
mance and the work experience over salary ratio, which is Consider the following query:


    SELECT a, b
    FROM t1 join t2
    WHERE t1.c = t2.c

Without knowing the schema it is impossible to map the columns a and b to their respective tables.
Therefore the signature generator module requires a setup which specifies the tables used by the application. For every issued query, this module parses the query to identify the composing tokens and resolves the column names to match them to their tables. Once all the columns are correctly resolved, the query signature can be created according to the approach introduced in Section 5.1.

5.3 Profile Creation
In this section, we describe in detail, with a running example, the profile creation phase: how the concolic engine executes the application; how the user input and the mocked database are used to explore newer paths; how the profile builder creates the query records and composes them to create the profile.
The definition of application profile is as follows:
Application Profile: The profile of an application program P is a directed tree T(V_P, E_P). Each node v_i ∈ V_P is a query record of query q_i represented as ⟨sig(q_i), c_i⟩, where sig(q_i) is the signature of q_i, and c_i is the set of constraints to execute q_i. An edge e_ij ∈ E_P denotes that the query q_j is executed after query q_i and hence, node v_j is a child of node v_i.
To illustrate the profile creation procedure, we continue with the examples given in Sections 5.1 and 5.2, following the code shown in Figure 3.

     1 public static void salaryAdjustment(int profit, int investment){
     2   Statement s;
     3   ...
     4   int employee_count = 0;
     5   if(profit >= 0.5 * investment){
     6     String query1 = "SELECT employee_id, work_experience FROM WorkInfo WHERE work_experience > 10";
     7     resultSet1 = s.executeQuery(query1);
     8     resultSet1.last();
     9     if(resultSet1.getRow() > 100){
    10       String query3 = "SELECT employee_id FROM WorkInfo WHERE work_experience > 10 AND performance = 'good'";
    11       resultSet3 = s.executeQuery(query3);
    12       ... // do other operations
    13     } else{
    14       String query2 = "UPDATE WorkInfo SET salary = salary * 1.2";
    15       s.executeUpdate(query2);
    16     }} else{
    17     String query4 = "SELECT p.employee_name FROM PersonalInfo p, WorkInfo w WHERE performance = 'poor' AND p.employee_id = w.employee_id";
    18     resultSet2 = s.executeQuery(query4);
    19     ... // do other operations
    20   }
    21 }

Fig. 3. An example of database application

The concolic execution takes the Java bytecode, instruments it to find the branch conditions, and executes it inside the instrumented environment. Consider the program in Figure 3 that asks the user to input the values profit and investment and passes them to the function salaryAdjustment. The concolic execution uses the environment instrumentation of this program to block the interactive requests by passing automatically generated values to the program.
The first time a numeric variable is encountered, the value returned is 0. Therefore, during the first execution, the function is called with both parameters set to 0. Following the code, at line 5, the condition will be evaluated as true and the constraint extractor will compute the constraint c1 as shown in Table 2. Once query1 is submitted, the profile graph node QR1 = ⟨sig(query1), c1⟩ is created and added as the first child of the root of the application profile as shown in Figure 4(a).
In this phase, the queries do not reach the real database, but they are blocked by the environment instrumentation which uses the concolic engine to generate the returned values. In the first execution, the instrumentation will return 0 rows; therefore the condition at line 9 will be evaluated to false, generating the constraint c2, issuing the query query2 and creating a new node in the profile as shown in Figure 4(b).
At this point nothing is left to do in the function. Assuming that the application ends too, the concolic execution backtracks the execution to the last jump encountered with an unexplored branch. In this case it is the if statement at line 9. Therefore a new value for the variable is generated in order to negate the previous result and explore the new branch. In this case it means that the concolic engine must return 101 rows as the result of query1. Therefore query3 is executed and its features together with the constraint c3 are added into the profile as shown in Figure 4(c).
Finally, the concolic execution has left unexplored only the else branch of the if at line 5. Therefore it uses a constraint solver to find values of profit and investment which negate the condition. Assuming that it sets the profit and investment variables to 49999 and 100000 respectively, the constraint c4, as shown in Table 2, is then generated; query4 is executed and the node QR4 = ⟨sig(query4), c4⟩ is added into the profile as a child of the root as shown in Figure 4(d).

Fig. 4. Steps of profile graph construction

Note that the environment instrumentation blocks the


execution of this query too and lets the concolic engine generate the returned data. The concolic execution will try to explore all the possible paths starting from the unwritten code at line 19, but since none of these will issue any new query nothing is recorded inside the application profile. At this point, if the concolic execution has to stop because the maximum search depth has been reached, the node QR3 will be marked as incomplete, otherwise it will be marked as complete.
At this point, since the concolic execution module has completed exploring the execution paths, the profile creation phase ends.
Table 3 and Table 4 show respectively the query signatures and the query records generated to represent the example application. Table 2 shows all the constraints generated by profiling the application shown in Figure 3. Note that the constraints do not contain meaningful names for the variables. This happens because the concolic execution works using the compiled application and the variable names are not stored inside the bytecode. Therefore the concolic execution identifies variables just by their order of appearance.

TABLE 2
Constraints for queries

c1   arithmetic: 1.0 x1 − 0.5 x2 ≥ 0.0
c2   database:   x3 ≤ 100.0
c3   database:   x3 > 100.0
c4   arithmetic: 1.0 x1 − 0.5 x2 ≤ −1.0

TABLE 3
Query signatures

Query    Signature
query1   {S, {201, 202}, {200}, {202}, 1}
query2   {U, {203}, {200}, ∅, 0}
query3   {S, {201}, {200}, {202, 204}, 2}
query4   {S, {102}, {100, 200}, {204, 101, 201}, 2}

TABLE 4
Query records

Query record   Contents
QR1            ⟨sig(query1), c1⟩
QR2            ⟨sig(query2), c2⟩
QR3            ⟨sig(query3), c3⟩
QR4            ⟨sig(query4), c4⟩

Note that the concolic execution uses a bounded depth-first search strategy to explore the execution paths. However, due to possible failures in solving complex constraints, the concolic execution may not be able to actually explore all execution paths of a large and complex application, thus leaving the profile incomplete. We handle this situation as follows.
During the concolic execution, we stop exploring a path when we reach either the maximum depth search limit or the end of a path whose depth is smaller than the maximum limit. If we reach the maximum limit and stop exploring that path, we mark the last state as incomplete. Note that a node can be incomplete regardless of its number of children or distance from the root, because the depth used by the concolic execution counts the number of branches, while the depth of the profile tree counts the number of queries.
Knowledge about which nodes are incomplete is necessary for the detection phase. Unfortunately, we cannot infer the minimum depth to use to completely profile an application because this can be reduced to the halting problem. Therefore, we deal with the problem of incomplete profiles as follows. If a profile contains too many incomplete nodes, the administrator may decide to create the profile again by increasing the limit of the search, or to manually compute it using the logs obtained after the program execution.

6 ANOMALY DETECTION PHASE
We now describe how application program profiles are used to distinguish between legitimate and anomalous database queries. The steps of the anomaly detection procedure are presented in Algorithm 1.

Algorithm 1 Anomaly Detection
 1: Input: Application Profile (AP)
 2: v_p = root of AP
 3: while the program is executing do
 4:   q = issued query
 5:   c_i = input constraints
 6:   signature generator generates sig(q)
 7:   found = false
 8:   for each child v_i of v_p do
 9:     if c_i is satisfied then
10:       signature comparator compares sig(q) to sig(query_i)
11:       if signatures match then
12:         response: NOT-ANOMALOUS
13:         v_p = v_i
14:       else
15:         response: ANOMALOUS
16:       end if
17:       found = true
18:       break
19:     end if
20:   end for
21:   if found == false and v_p is an incomplete node then
22:     response: WARNING
23:   end if
24: end while

6.1 Detection of Anomalous Queries
In the anomaly detection phase, whenever the application program issues a query, the proxy module intercepts and forwards it to the ADE module.
When an application program starts executing in the anomaly detection phase, the ADE module sets the root node of the application profile as the current parent node (v_p). Upon receiving the first query along an execution path of the program, the ADE considers all the children of v_p as candidate nodes. The ADE then takes the inputs from the executing application and for each candidate node it verifies whether the inputs satisfy the constraint in the query record. If the inputs satisfy constraint c_i, the program is expected to execute the query which is associated with the query record QR_i containing the satisfied c_i. As next step, the signature


generator sub-module generates the signature of the received query and the signature comparator sub-module compares it with the signature stored in QR_i, i.e., sig(query_i). For a legitimate query, the signatures match. The verification outcome is then passed to the proxy module which then sends the legitimate query to the target database for execution.
For subsequent queries issued by the program, the ADE module considers the query record of the most recently executed query as the current parent node, and verifies the signature and corresponding constraints in a similar way as described above.
As we already discussed, during the profile creation phase we use a depth-bounded search to explore the execution paths. So it is possible to have incomplete profiles. This is the reason the ADE module can return three different results: NOT-ANOMALOUS, ANOMALOUS and WARNING.
During the profile creation, we know when backtracking is not enabled because we reached the maximum search limit. Therefore we mark the last node as incomplete. When we receive a new query to analyze, if we cannot find any matching result we check whether the last state was an incomplete node. If this is true, it means we are entering an unseen state that may receive unexpected queries. In this case we return a warning, because our system is unable to decide if the query is anomalous or not. It is the duty of the SQL proxy to decide how to handle this case.
Ideally, whenever a program creates too many warnings in our system, an administrator should verify and edit the profile to fix the problem, or create a new profile using a deeper search.

6.2 Case Studies
In this section we present some case studies to illustrate how the ADE module works in the anomaly detection phase. We assume that the values of the profit and investment variables are set to 60000 and 100000, respectively. We consider the following cases.

6.2.1 Execution of query1 and query2
According to the values of the input variables, the application program is eligible to issue query1. So in the anomaly detection phase, upon receiving the issued query1, the ADE module takes the program inputs to check whether they satisfy the constraints of either QR1 or QR4. As c1 is satisfied, the signature generator sub-module generates the signature of the input query and the signature comparator sub-module compares it with the signature part of QR1. The match is positive and hence query1 is assessed as non-anomalous. Now assume that the number of records returned by query1 is less than 100. In this case, the constraint c2 is satisfied and the attempt to execute query2 is considered non-anomalous because the signature of query2 matches that of QR2.

6.2.2 Execution of query1 and query3
In this case, query1 is executed legitimately as described in the previous case. Afterwards, when the program issues query3, the signature comparator sub-module finds that the signature of query3 and that of the expected query do not match. As a result, the ADE module raises an alert indicating query3 as anomalous.

6.2.3 Execution of a query that does not belong to the program
If a query is issued in the anomaly detection phase that was never encountered in the profile creation phase, the signature of this query does not match any of the query records. In this case, if the query is issued in a state that is profiled completely, the ADE module raises an anomaly. However, if the program execution reaches a state where the profile is incomplete (because the maximum depth search was reached during the testing), the ADE module raises a warning.

6.2.4 SQL injection attacks
Halfond et al. [14] classified different types of SQL injection attacks. The tautology attacks are the easiest to detect, because they introduce tautologies in the WHERE clause that can be easily detected by a simple SQL parser. Some attacks consist of sending illegal queries, because by analyzing the resulting error messages it is possible to infer meaningful information about the schema and the type of database. Such attacks can be identified by analyzing the error logs. The remaining attacks use special encoding or unescaped parameters to alter the executed query.
It is important to note that, as these attacks typically modify the queries by adding new predicates, they can be easily detected by our anomaly detection mechanism because the query signature contains both the columns used and the number of predicates in the WHERE clause.
We illustrate the detection of SQL injections with a sample application program. This program displays the medical records of an authenticated, signed-in user. The user is authenticated by entering his username and password. The legitimate query execution would look like:

    username = readInputUser();
    password = readInputPassword();
    query = "SELECT * FROM MedicalRecords WHERE uname = '" + username + "' AND password = '" + password + "'";

If the username is John and the password is Smith, then the query would be:

    SELECT *
    FROM MedicalRecords
    WHERE uname = 'John' AND password = 'Smith';

However, such a query is vulnerable to SQL injection attacks by which the attacker can display the medical records of other users. This can be achieved if the attacker enters in the password input field the string ' OR uname = 'Carl. If so, the following query would be issued, which would display the medical records of the username Carl to the attacker.

    SELECT *
    FROM MedicalRecords
    WHERE uname = 'John' AND password = '' OR uname = 'Carl';

Such a vulnerability exists in any application that allows the user input to change the structure of an SQL query. Since


SQL injection attacks are based on re-structuring the SQL query, our mechanism, by comparing the query structure to the query signatures saved in the application profile, is able to detect changes in the query. More specifically, as we count the number of predicates of the WHERE clause as part of the query signature, we are able to detect any additional predicates introduced by SQL injection. In the example above, the number of predicates is 2 before the injection, and it becomes 3 after the injection.

6.2.5 Two-step SQL injection attacks
These attacks are also referred to as second-order injection attacks and represent a complex form of data-centric attacks. The purpose of these attacks is to create an SQL injection attack that can be processed at a later time. This is achieved by injecting malicious input that is legitimately saved into the database, but will result in an SQL injection attack at a later time when other types of queries perform actions on the maliciously inserted data. To clarify, consider an example of a web application that registers its users upon using their service. If a malicious user chooses ' OR '1' = '1 as his username, then adding this user to the database will execute the SQL query:

    INSERT INTO users VALUES ("' OR '1' = '1");

This is a legitimate query and will not result in an SQL injection attack, and thus the username ' OR '1' = '1 will be successfully created. However, if at a later time the malicious user or even the web administrator decides to delete this account, the executed SQL query is:

    DELETE FROM users
    WHERE uname='' OR '1' = '1';

This is when the attack is effective, as the query will result in deleting all the users in the database.
Our AD mechanism will be able to detect this type of attack when the SQL injection is about to perform the intended attack action on the database. Consider the example above. Our AD mechanism will find a mismatch with the DELETE SQL query signature because of the change in the number of predicates in the WHERE clause. As a result, the ADE will assess the execution of this query as anomalous. As in the case of SQL injection attacks, additional predicates will result in a mismatch of the SQL injected queries when compared to the existing query signatures and therefore will result in the query being identified as anomalous.

6.3 Simple Detection
Instrumenting all the instances of the application to be secured is not always possible. Possible reasons may be related, but not limited, to:
• time constraints, as the application may be already deployed on a large number of machines and updating all of them may not be easy;
• technical reasons, as the environment used may not expose any API for the instrumentation (e.g., JVMs for mobile devices);
• performance reasons, as in applications with high numbers of user interactions, the overhead introduced by sending the user input may introduce significant delays.
Therefore we have also developed a simple version of our anomaly detection approach which does not require the instrumentation for the anomaly detection phase. We refer to this approach as simple detection, whereas we refer to the previous approach as complete detection.
Since the profile creation phase does not change, the same profile can be used for both the simple and complete detection phases. The main difference is that, without receiving the application input during the anomaly detection, we can verify only that an allowed sequence of queries is issued but without checking the constraints.
Hence, the simple approach can still verify that only allowed queries are submitted in the right order, but can no longer enforce that the sequence of queries is consistent with the input received by the application. Therefore, for example, a control flow attack that modifies the code from

    if ( userInput() == 'y' ) {
        // delete a record
    }

to

    if ( true ) {
        // delete a record
    }

cannot be detected any longer.

7 IMPLEMENTATION
In this section, we discuss the implementation details of the proposed system. In our implementation, we consider applications in the form of Java bytecode, which is mostly produced by compiling Java source code, but can also be generated starting from other languages, most notably Scala. However, our proposed anomaly detection mechanism can be used for other kinds of application programs.

7.1 Constraint Extractor
Our implementation of the constraint extractor is built on top of the JCute concolic testing framework [30]. This framework uses Soot [37] for instrumenting Java class files and lpsolve for solving linear programs.
In the profile creation phase the concolic execution engine takes the application and the depth search limit as inputs and instruments the application using Soot for branch analysis and backtracking. We instrument the runtime environment to generate user input and database content that direct the concolic execution to visit different branches of the application. Once all the branches have been explored, or the concolic execution has reached the maximum depth search limit, the concolic execution and the profile creation phase end.
Our custom instrumentation allows one to capture all the queries issued to the target database. Every time a new query is issued, the constraint extractor first captures the constraints of the current path from the root to the intercepted query. Since the constraint extractor knows the constraints of the most recently executed query along this


path, it extracts only the constraints that are extensions of those of the most recent query and stores them into the application profile along with the signature of the intercepted query.

7.2 Signature Generator
We use PostgreSQL-9.1.8 [22] to implement the signature generator module. PostgreSQL delivers all issued queries to the parser to generate a query parse tree using the method exec_simple_query(). In this method, our customized function for the signature generator imports the necessary query information (command, target list, relation list, and qualifiers) from the parse tree and creates the query signature.
Application Profile: The profile builder module creates query records by combining the query signatures generated by the signature generator and the constraints extracted by the constraint extractor. These records are organized according to a hierarchical data structure (see Figure 5) and stored in PostgreSQL.

Fig. 5. Profile graph

7.3 SQL Proxy
The SQL Proxy module is in charge of intercepting the queries before they reach the database, forwarding them to the ADE and handling the query responses.
The proxy is written so as to offer an interface that is binary compatible to the target DBMS. Therefore a plugin must be implemented for every supported DBMS.
Since not all the database protocols support sending custom meta-data along with the query, our SQL proxy takes care of discarding the custom meta-data and the user input before forwarding the query to the target database.
It is important to note that the proxy can filter the connection in both directions; therefore it is also possible to alter the returned data in case an anomaly is found.
The proxy is also in charge of taking actions in response to an anomalous query. As the most suitable response to an anomalous query depends on many application-related factors, the proxy provides a policy language by which the system administrator can define customized actions. Since not all the anomalous queries are attempts to steal sensitive data, in order to select a response the administrator may need to consider other factors, e.g., whether the query is generated during working hours, or from a trusted computer, whether the query retrieves some low relevance data, or whether the query is issued during the weekend, or from an intranet, or whether the query is trying to retrieve sensitive data. Based on these factors, the application may be required to disconnect immediately before the data is returned.
Since the proxy filters the data in both directions, one can decide for performance reasons to let the query reach the target database as soon as it is recorded by the proxy and block the result returned by the query in case an anomaly is detected. Obviously this can be safely and easily done only for queries that do not update the database.
Our current implementation consists of a Java program that supports the latest version of Oracle. Whenever an anomaly is found, the query is put on hold and a message is prompted to an administrator asking which action to perform. The valid actions that can be chosen are: 1) log the query, 2) drop the query, 3) close the connection, and 4) redirect the program to a honeypot.

7.4 Architectural Techniques for Capturing Program Input
The system relies on the ability to instrument applications in order to inject custom code that generates the user input (during the profile creation phase) or captures it (during the anomaly detection phase).
Though our implementation focuses on Java, the same technique can also be applied to other application platforms and environments.
There are two well known approaches to instrument an application program: change the application itself or change its working environment. In our scenario we consider that we do not have the source code of the application; therefore, changing the application itself means having to change the compiled code (i.e., the binary of the application) or using some decompiling mechanism. Even if this is technically possible, it raises the following issues. The application binary may be obfuscated [2]; therefore, analyzing the control flow to find the places where the application input is collected is not a trivial task and may even introduce bugs in the software. Another issue is that, especially in high security environments, the applications may be required to be digitally signed. Since this kind of instrumentation breaks the application signature, a security check may block its execution. Taking all these issues into account, we decided to adopt the second approach: instrumenting the environment.
The Java working environment can be easily instrumented by changing the Java Virtual Machine that is in charge of interpreting and translating the bytecode into the machine language. This can be easily done, in most of the virtual machine implementations, using the non-standard option -Xbootclasspath/a:path which lets the user override the standard classes with a custom provided jar file. The main advantage of this approach is that we do not change the application itself; therefore we do not break any signature-based security check. Moreover, since OpenJDK is the reference implementation for Java SE, it is very easy to find the source code of the classes that we need to instrument and change them without the help of any decompiler. Last but not least, by instrumenting the whole environment we can rewrite the database connection library to mock the behavior of a database. Implementing a fake

0098-5589 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSE.2016.2598336, IEEE
Transactions on Software Engineering
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 12

database in software allows us to create the profiles of the application program faster and also to change the database content easily, so that the concolic execution module can explore new execution paths during the profile creation phase. We have thus created a new JDBC library which mocks the database by returning data generated by the concolic engine, and we use the instrumentation to replace, during the profile creation phase, the selected database connection library with our custom code. During the execution of the complete detection, we use the instrumentation to wrap the JDBC library in use with custom code which adds the user input as meta-data every time a query is sent to the database.

This approach is not free of drawbacks. The biggest challenge we had to solve is that, by instrumenting the whole environment, we change not just the application behavior but also the behavior of all the libraries that it uses. Most of the time this is the desired behavior, but there are a few exceptions. The problem is that Java input classes read from streams without knowing the real source of the data. When we instrument a class (e.g., BufferedReader) we do not know if the actual instance is reading from the console, from a file, from a network stream, or if it is used as a wrapper for another already instrumented class. In order to filter out the input that we do not need, an inspection of the stack is required, with a good heuristic that should be tuned according to the tested application.

Note that a complete instrumentation library must consider all the classes and functions that can be used to collect user input. As it is easy to guess, this is a very large set because it includes most of the GUI components; all the methods to read from network, files and console; and uncommon input sources, like accelerometers, joy pads or any other set of sensors. Even if providing a reasonably complete implementation of the instrumentation library is possible, it is outside the scope of our work. For this reason we limited our implementation to the input sources necessary for our tests only.

In conclusion, we would like to mention again that our approach was tested on Java, but it can also be easily implemented in other languages. For example, LD_LIBRARY_PATH can be used to instrument native Unix applications, forcing them to use custom libraries instead of the default ones.

8 SECURITY ANALYSIS

In what follows we analyze the security of our proposed system.

The attacker cannot execute queries that do not belong to the application: DetAnom enforces a policy that any query outside of the application is considered as ANOMALOUS or WARNING. Our approach checks the signature of the issued query against the signature stored in the profile. If the signatures do not match, the issued query is considered as ANOMALOUS. If the program reaches the maximum depth search limit and issues a query, the ADE generates a WARNING message and holds that query until the security administrator resolves the issue.

The attacker cannot execute a query that is irrelevant to the current execution path: If an attacker has knowledge about an allowed query, he/she may repeat that query a number of times to retrieve all the sensitive data. DetAnom detects such an attempt by following the profile graph, which maintains the order or sequence of the queries. If the issued query is out of order with respect to the current execution context, DetAnom flags the query as ANOMALOUS.

The attacker cannot execute a query that is relevant to the current execution path, but for which the program inputs do not satisfy the constraints: An attacker can exploit any vulnerabilities of the application and change the application level access control policy. In this case, the attacker may execute a query that is relevant to the current execution context but whose input values do not satisfy the constraints. Therefore, whenever a query is issued, DetAnom checks whether the constraints associated with the candidate nodes (i.e., nodes which are reachable from the current state) are satisfied by the program inputs. If the constraints are not satisfied, the query is flagged as ANOMALOUS.

The attacker cannot tamper/change the profile: Our proposed approach stores the profile in the ADE, which is outside the reach of the attackers. Only the security administrator can access the profile. However, we enforce a separation-of-duty policy to prevent any malicious security administrator from tampering with the profile.

9 EXPERIMENTAL EVALUATION

We have evaluated the performance of our proposed DetAnom mechanism. Our experiments have been performed on a virtual machine running Ubuntu 14 as operating system, with 10 GB of RAM and 4 processors.

Considering the deterministic behavior of our approach, and considering that in case of a control-flow attack we expect all the queries after the attack to be flagged as anomalous, we focused the evaluation on the performance and the overhead required to send the user input and verify the constraints.

Since to the best of our knowledge there is no publicly available dataset suitable for our needs, we generated some test applications. The goal was to test DetAnom using applications of different sizes, in order to check the behavior in case of partial profiles. The details of these applications are listed in Table 5, ordered by increasing complexity; the first two use only binary branches, while the third also contains for loops. As we can see in the second column, the profile creation time increases very fast. The reason is that in the worst case this time is exponential in the number of branches. A limitation of the concolic testing tool we use is that backtrack support is not implemented. Therefore, every time a new branch has to be explored, a new execution of the application is required. Considering that we generated the test applications by nesting binary branches evenly, profiling an application with an extra “if-else” requires twice the time. Adding loops slows down the profile creation even more because, as explained in Section 5, jCute actually unrolls loops, which can be seen as a series of nested “if”s where every “if”, except the last one, contains the loop body and the next “if”.

To test the applications, a pseudo random input generator has been used to simulate the user input. Initializing the generator with the same seed makes it possible to test the same execution flow. We analyzed 100 different execution
flows for each application. For each execution flow we recorded the execution time and the network usage of the application in both a normal execution and an execution protected by DetAnom.

TABLE 5
Test application details

Test application | Profile time | Code coverage | Profile size (nodes) | Number of “if” | Number of “for” | Number of nested blocks | Number of unique queries | Lines of code
#1               | 40 seconds   | 100%          | 103                  | 7              | 0               | 3                       | 14                       | 235
#2               | 5 minutes    | 70%           | 351                  | 15             | 0               | 4                       | 30                       | 283
#3               | 4 days       | 20%           | 319                  | 37             | 32              | 8                       | 106                      | 543

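As an illustration only, the seeded input generation used for these tests can be sketched in Java as follows. The class SeededInputDemo, the method inputs, and the input range are hypothetical and are not part of DetAnom or of the test applications; the sketch only shows that re-initializing java.util.Random with the same seed replays the same input sequence, and hence would drive the same execution flow.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical stand-in for the test harness described in the text.
final class SeededInputDemo {

    // Returns n pseudo-random integer "user inputs" in [0, 100),
    // drawn from a generator initialized with the given seed.
    static int[] inputs(long seed, int n) {
        Random rng = new Random(seed);
        int[] values = new int[n];
        for (int i = 0; i < n; i++) {
            values[i] = rng.nextInt(100);
        }
        return values;
    }

    public static void main(String[] args) {
        // Two runs with the same seed: one unprotected, one protected.
        int[] normalRun = inputs(42L, 5);
        int[] protectedRun = inputs(42L, 5);
        // Both runs receive identical inputs, so their timings are comparable.
        System.out.println(Arrays.equals(normalRun, protectedRun));
    }
}
```

Under this assumption, each of the 100 execution flows would correspond to one seed, used once with detection disabled and once with detection enabled.
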
Figure 6 shows the average execution time of the applications compared to their average execution time with our anomaly detection system enabled. As we can see, the run-time overhead is small, around 20%. We can also notice that the average execution time of the longer application is just a few milliseconds higher than the execution time of the smaller one. The reason is that the time required to start the JVM is considerably higher than the time required to send a query.
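The relative overhead quoted above is the increase in average execution time normalized by the unprotected average. A minimal sketch of this arithmetic follows; the class name, the method names, and the sample timings are hypothetical and are not measurements taken from Figure 6.

```java
import java.util.Locale;

// Hypothetical helper illustrating how a relative overhead is computed.
final class OverheadDemo {

    // Arithmetic mean of a non-empty array of timings (milliseconds).
    static double mean(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Overhead of the protected run relative to the unprotected baseline.
    static double relativeOverhead(double baselineMs, double protectedMs) {
        return (protectedMs - baselineMs) / baselineMs;
    }

    public static void main(String[] args) {
        // Made-up timings, chosen only to illustrate the calculation.
        double[] detectionDisabled = {495.0, 500.0, 505.0};
        double[] detectionEnabled = {590.0, 600.0, 610.0};
        double overhead = relativeOverhead(mean(detectionDisabled), mean(detectionEnabled));
        System.out.println(String.format(Locale.ROOT, "overhead = %.1f%%", overhead * 100.0));
    }
}
```

Because JVM start-up dominates both averages, the absolute difference behind such a percentage can remain small even when the application issues many queries.
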
Fig. 6. Execution time overhead (bar chart: execution time in msec for App #1, App #2 and App #3, with detection disabled and with detection enabled)

Figure 7 shows the network overhead introduced in order to send the application input to the anomaly detection engine for checking the path constraints. We can see that in the first two applications the overhead is between 30% and 40%, whereas in the third it reaches 60%. Such results match our expectation that larger programs have more complicated control flows and therefore require more data to be transferred to check the constraints. Even if these percentages may appear very high, we should consider that the absolute values are small. In the last test the average overhead was only 5.6 kilobytes, whereas in the first two it was 1.1 and 1.2 kilobytes, respectively. Moreover, it is important to point out that the network overhead is not related to the amount of data transferred between the database and the application, but only to the data required to transfer the application input to DetAnom. In our tests we used a very small database. We expect, however, that in real applications the actual data retrieved from the database is larger than the few rows retrieved by our tests. Therefore, the overhead introduced by the transmission of the application input to DetAnom will likely be negligible compared to the overhead incurred by the transmission of the query results from the database to the application. In case, however, of applications retrieving very small datasets and for which the transmission overhead introduced by DetAnom may be too high, the simple detection mechanism can be used at the cost of a weaker detection (see Section 6.3). The standard deviation of the last test is very high because this is the only test application that contains loops; therefore, depending on the input, the application may send more queries just because it iterates more on some code block.

Fig. 7. Network overhead (bar chart: network transfer in bytes for App #1, App #2 and App #3, with detection disabled and with detection enabled)

9.1 Accuracy Limitations

Considering the query signature used and the anomaly detection techniques implemented (described in Sections 5.1 and 6.1, respectively), we now analyze in detail what kinds of attack cannot be detected by DetAnom.

As already discussed, we expect to have false positives only in case of incomplete profiles. Therefore, in this section we discuss only the false negatives, that is, anomalies that DetAnom is unable to detect, and approaches to address these types of false negative occurrences. It is important to notice that false negatives are due to the level of detail according to which queries are represented in the profiles. In what follows we discuss the two limitations of the query signatures and techniques to address these limitations.

Consider the following fragment as an example of code to attack.

    int productivity = userInput();
    sql = "UPDATE employee SET salary = salary * 1.1 WHERE productivity > "
        + productivity + " AND work_experience > 5";

The query signature does not contain any information about the operators used in the WHERE clause. However, changing such operators changes the semantics of the query. In the following example the operators “and” and “greater than” have been replaced by “or” and “less than”, respectively.

    int productivity = userInput();
    sql = "UPDATE employee SET salary = salary * 1.1 WHERE productivity < "
        + productivity + " OR work_experience > 5";

Such a change in the query can be easily detected by extending the signature to record the type of the operators together with the relative occurrence of these operators within the query. The occurrence of an operator can be determined based on a traversal of the query parse tree. Another approach is to include information about predicate selectivities. As is well known from the large body of research on query optimization (see [29] for the pioneering work on query optimization), different operators result in different query selectivities. The selectivity of a query gives an indication of the expected cardinality of a query result. For example, a query with the logical “or” of two predicates returns more tuples than a query with the logical “and” of the same two predicates. Therefore, if the expected selectivity of the query is recorded in the query signature, changes to the operators would result in a different selectivity. As a result, an anomaly due to a mismatch in query selectivity would be detected. We have developed an initial prototype of this technique for the case in which queries are issued directly by users [28]. We plan to further tune this technique and integrate it into DetAnom as part of future work. We did not include this technique in the current release of DetAnom to keep the system stable for release to our industry collaborators.

The query signature does not contain any information regarding the parameters used to compose the query. Therefore, changing a parameter generates an anomaly that cannot be detected. In the example below a user input is ignored and replaced by a fixed value.

    int productivity = userInput();
    productivity = 0;
    sql = "UPDATE employee SET salary = salary * 1.1 WHERE productivity > "
        + productivity + " AND work_experience > 5";

A first line of defense against this attack is to add to the database some triggers which check that sensitive parameters are within a valid range. A second line of defense is to extend the instrumentation to check that the user input correctly reaches the SQL connection library; however, this in turn requires protecting the instrumentation from tampering. A third line of defense is to identify, during the profile creation phase, relationships (such as equality or some other mathematical relationship) between the input parameters of the application program and the parameters passed to the query. These relationships can be identified via some statistical analyses. At run-time, the actual values of the application input parameters and the values of the parameters passed to the query would be analyzed to determine whether the relationships still hold. When this is not the case, an anomaly would be raised. We notice that these three defense techniques could all be applied together to provide a strong defense against attacks that change the input parameters of queries.

Checking that user input is not arbitrarily changed is a general problem outside the scope of this project. We plan, however, to investigate this issue as part of future research, also to determine available software engineering techniques that can help with this problem when attacks are carried out by insiders. We emphasize that insiders may have direct access to the source and binary of applications and therefore would be able to compromise an application even when the application does not have any vulnerability (like a buffer overflow).

9.2 Technical limitations

Our experiments have shown some technical limitations of our current approach. In what follows we discuss such limitations and outline possible solutions.

The profile creation phase is very slow. This is a limit of the testing technique we use, which actually runs the program as many times as required to explore the possible execution paths. Moreover, the concolic testing tool we used, JCute [30], was developed to write small unit tests. Therefore, it does not implement any mechanism to speed up the analysis of large programs. Multiple executions could be parallelized and distributed on different machines; moreover, saving a snapshot of the execution in order to be able to backtrack without the need of restarting the application from the beginning may result in a big improvement of the profile creation time. The use of a concolic engine which supports backtracking by taking snapshots of the application execution may also be useful in supporting incremental profile creation. This can be used both to quickly deploy a partial profile to start securing the application while building a more accurate profile, and to incrementally change the profile to reflect application updates.

Our current approach to deal with application updates is that an administrator should check if such updates change the execution flow (with respect to the issued queries). We expect that most of the updates will not contain substantial changes which impact the execution flow with respect to the queries issued; in this case there is thus no need for a new profile. Whenever the update changes the flow in a minor way, the profile can be manually fixed by an administrator. In case of major updates which heavily change the execution flow, a new profile must be created from scratch.

JCute can only solve numerical constraints. Whenever the application input is in the form of strings, the solver cannot force the execution of different branches. To solve this problem a major extension of JCute is required to add a constraint solver for string values.

JCute can only analyze variables when the execution flow is inside the main code. Whenever the execution moves to some external library, the solver loses control of what happens to the values and is no longer able to generate inputs to solve the future constraints. For example, consider the code if ( a > Math.max(b, c) ){...} else {...}, where all the variables provided as input to the application are integers. The solver cannot generate values to force both branches because JCute has no knowledge about what happens inside the Math.max(int, int) function. In the case of Java libraries this problem can
be easily solved by decompressing the jar file and letting JCute analyze it. But the standard Math library is implemented mostly using native code. A solution to this problem would be to provide an instrumented version completely written in Java to be used during the profile creation phase.

The Java Language Specification states some constraints that the code must fulfill [18], and some of them are related to the maximum size of classes and methods. JCute injects at runtime some code to follow the execution flow and correctly analyze the branches. Therefore, if a class is already close to the limits, with the injected code it may exceed these limits, resulting in invalid bytecode that cannot be executed and, consequently, analyzed. Luckily these limits are very high and very difficult to reach, especially if the code is written following good coding style techniques.

10 RELATED WORK

A formal framework to categorize anomaly detection systems has been proposed by Shu et al. [34]. According to this classification, our proposed approach uses a deterministic language defined on top of the database interactions to perform the detection.

Several approaches have been proposed to protect databases against malicious application programs. DIDAFIT [17] is an intrusion detection system that works at the application level. Like our system, DIDAFIT works in two phases: a training phase and a detection phase. During the training phase, database logs are analyzed to generate fingerprints of the queries found in the log. Fingerprints are regular expressions of queries with constants in the WHERE clause replaced by place-holders that reflect the data types of the constants. During the detection phase, input queries are checked against such fingerprints. Queries that match some expression in the profiles are considered benign, and anomalous otherwise. DIDAFIT has, however, some major drawbacks. First, the system relies only on logs to create program profiles. There is therefore no guarantee that the log would contain all legitimate queries. To address this drawback, the authors propose a technique to generate new signatures from other signatures that are similar in all portions and have some predicates in common. While this solution works in some cases, the system would not be able to recognize queries that do not appear in the log. Another problem is that DIDAFIT does not take into account the control flow and data flow of the program, i.e., the algorithm checks neither the correct order of the queries nor the constraints that have to be verified for a query to be executed. The approaches proposed by Bertino et al. [5] and Valeur et al. [36] also analyze training logs for creating profiles of queries. Therefore, they have the same drawbacks mentioned earlier. These approaches focus on the detection of web-based attacks, like SQL Injection and Cross-Site Scripting (XSS) attacks, and fail to detect other attacks performed through application programs, e.g., code modification attacks.

Securing a database can be a difficult task. Paleari et al. [21] described a new category of attacks which rely on race conditions. Such attacks are easier in web applications, where the tools used (mostly PHP and MySQL) offer a poor set of synchronization primitives but provide a highly parallel environment. Therefore, when multiple simultaneous requests are executed, it is possible to interleave the SQL queries in a way that generates unexpected behavior. Such a kind of attack may be mitigated by an approach, like the one we propose in this paper, which can enforce the correct order of the queries.

Our previous poster paper [27] outlines some preliminary ideas to protect against data exfiltration through malicious modification of the application program. However, the approach proposed in this paper reduces the performance overhead by allowing the ADE to simply traverse the application profile instead of concretizing the symbolic execution tree of the application program. Such concretization in the detection engine results in extra delay when verifying a query. In addition, our preliminary approach does not cover the combination of testing-based techniques with program analysis techniques, nor does it cover the implementation and assessment of the proposed approach.

The current paper is an extended version of a conference paper [15]. Compared with this previous paper, the current paper has the following novel contributions. We have created a stronger architecture which can easily support different target databases. We have adopted the approach of instrumenting the environment of the application instead of the application itself, with the benefits described in Section 7.4. We extended the profile signature to represent sub-queries. We proposed a simpler version of the anomaly detection which does not require receiving the application input and thus does not require instrumenting either the environment or the application. Such a simple version can be used in environments where deploying a new instrumented application is difficult or impossible, and we argued that it still gives a reasonable level of safety against different kinds of attacks (see Section 6.3). We introduced the important notion of confidence for the profiles, which lets us decide whether to issue alerts or warnings according to the confidence obtained during the profile creation phase. This last extension removes the difference between the flexible and strict policies previously introduced to deal with incomplete profiles. Our previous approach was based on the idea that, after the profile creation, an administrator had to check the code coverage as reported by jCute and decide whether to use the strict or the flexible policy. If the flexible policy were chosen, the administrator also had to choose a number of “safe” anomalies with the idea that, if an anomalous query had been issued a number of times greater than a given threshold, a stronger alert had to be raised asking the administrator to revise the profile. But, considering that in high security environments even one anomaly can be a problem, the flexible policy would not be adequate. By contrast, the current approach, based on the confidence degree, allows us to explicitly mark the portions of profiles from which we expect unseen code to be executed, thus clearly differentiating anomalies and warnings even in profiles with very poor coverage and without requiring any administrator decision. We performed a new set of experiments to evaluate the network usage overhead. Finally, we have discussed in detail the limitations we have identified in the use of jCute and outlined approaches to address such limitations.

Program profiling techniques have also been proposed for many other purposes, such as debugging and collecting
usage statistics [25], monitoring system calls [13] [39] [12], and enhancing the performance of database applications. For example, the Pyxis system [7] uses static analysis of application code to partition the code into two pieces: one to be executed on the application server and the other on the database server, trying to reduce the control transfers and the amount of data exchanged between the two components. Dasgupta et al. [10] propose a static analysis of database applications that use ADO.net APIs in order to extract features of SQL queries, query parameters, and usage of query results in order to detect SQL injection attacks and potential data integrity violations. Ramachandra and Sudarshan have developed DBridge [24], a tool that optimizes the performance of database applications by prefetching query results. Control-flow and data-flow analysis are used to find locations in the program where instrumented code can be added; at program runtime this code sends requests to the database to prepare results of queries predicted to be sent by the program at later points.

Many other approaches have been proposed to detect abnormal execution behavior. Xu et al. [40] propose a tool which can detect an abnormal control flow with respect to system and library calls. They use static analysis combined with a probabilistic model to evaluate the likelihood that a sequence of calls has been issued by a compromised program. Shu et al. [33] argue that a new category of control flow attacks exists, namely aberrant path attacks, that are difficult to detect because they do not directly change the flow of the execution, but change some data which is used by the program itself to decide the execution flow. Such attacks can generate montage anomalies, when we can observe multiple legitimate control flows that are incompatible in a single execution, or frequency anomalies, that is, a legitimate code block that is called too frequently. They also argue that such attacks usually happen in a large-scale execution window, being unseen by classical detection techniques that, for performance reasons, can analyze only a small portion of the execution window. They propose a probabilistic method that does not suffer from combinatorial explosion and can scale to analyze the flow on larger execution windows. It is important to notice that, in proposing our anomaly detection technique, we considered a totally different scenario. Approaches like the one by Shu et al. [33] aim at protecting the user against exploited or compromised applications. In our scenario we aim at protecting the database also against users who may intentionally alter applications and/or disable locally installed security tools in order to steal or alter data stored in the DBMS.

Finally, we would like to point out that security must be approached by combining different techniques, each protecting against specific types of attack, and that a comprehensive intrusion detection system should aggregate mul-
protection.

11 CONCLUSION AND FUTURE WORK

Though access control mechanisms deployed in DBMSs are able to prevent application programs from accessing the data for which they are not authorized, they are unable to prevent data misuse caused by authorized application programs. In this paper, we have proposed an anomaly detection mechanism that is able to identify anomalous queries resulting from previously authorized applications. Our mechanism builds a close-to-accurate profile of the application program, without the need of its source code, and checks incoming queries against that profile at run-time. In addition to anomaly detection, our DetAnom mechanism is capable of detecting any injections or modifications to the SQL queries. We want to emphasize two benefits of our approach compared to other more conventional techniques. The first is that by using the concolic testing technique instead of static analysis techniques, we can profile the actual execution of the code, which includes queries executed by self-modifying or dynamically downloaded code. The second is that we are able to enforce the actual order of the queries sent to the database, unlike conventional SQL injection detection approaches, which are unable to determine whether a query is added to or removed from an application program.

We have implemented DetAnom with JCute and PostgreSQL, which results in low run-time overhead and high accuracy in detecting anomalous database accesses.

We are currently extending our work along several directions. Our current implementation of DetAnom exploits the constraints that JCute [30] supports, i.e., arithmetic, pointer, and thread constraints. We plan to improve our signature generation scheme by incorporating information about program constants, variables, and logical and relational operators used in the WHERE clause of a query, as this information may enhance the accuracy of detection. We also plan to enhance the completeness and accuracy of our profile creation mechanism using both static and dynamic analysis of the program. In this approach, we will first analyze the program statically to find all the execution paths that contain SQL queries and then guide the concolic execution dynamically so that it does not leave any paths unexplored.

ACKNOWLEDGMENTS

The work reported in this paper has been funded in part under subcontract to Northrop Grumman Systems Corporation in support of a contract with the Department of Homeland Security (DHS) Science and Technology Directorate, Homeland Security Advanced Research Projects Agency, Cyber Security Division. The views expressed in this work are
tiple warnings from different sources. In this respect, our those of the authors and do not necessarily reflect the official
anomaly detection system would be one of such warning policy or position of the Department of Homeland Security
sources. Vigna et al. [38], for example, have shown that it is or of Northrop Grumman Systems Corporation.
possible to significantly increase the detection accuracy by
combining a web-based and a database anomaly detection R EFERENCES
system. The idea of our approach is to provide another
[1] Cybersecurity watch survey: How bad is the insider threat? Tech-
warning source and anomaly detection tool, that can be nical report, Carnegie Mellon University, 2012. http://resources.
used together with existing tools to increase the overall sei.cmu.edu/asset files/Presentation/2013 017 101 57766.pdf.


[2] A. Balakrishnan and C. Schulze. Code obfuscation literature survey. CS701 Construction of compilers, 19, 2005.
[3] E. Bertino. Data Protection from Insider Threats. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, San Rafael, 2012.
[4] E. Bertino and G. Ghinita. Towards mechanisms for detection and prevention of data exfiltration by insiders: Keynote talk paper. In Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, ASIACCS ’11, pages 10–19, New York, NY, USA, 2011. ACM.
[5] E. Bertino, A. Kamra, and J. P. Early. Profiling database application to detect SQL injection attacks. In IEEE International Performance, Computing, and Communications Conference, IPCCC 2007, pages 449–458, April 2007.
[6] C. Cadar and K. Sen. Symbolic execution for software testing: Three decades later. Commun. ACM, 56(2):82–90, Feb. 2013.
[7] A. Cheung, S. Madden, O. Arden, and A. C. Myers. Automatic partitioning of database applications. VLDB Endow., 5(11):1471–1482, July 2012.
[8] M. Collins, D. M. Cappelli, T. Caron, R. F. Trzeciak, and A. P. Moore. Spotlight on: Programmers as malicious insiders (updated and revised). Technical report, Carnegie Mellon University, 2013. http://resources.sei.cmu.edu/asset_files/WhitePaper/2013_019_001_85232.pdf.
[9] C. Cowan, P. Wagle, C. Pu, S. Beattie, and J. Walpole. Buffer overflows: attacks and defenses for the vulnerability of the decade. In DARPA Information Survivability Conference and Exposition, 2000. DISCEX ’00. Proceedings, volume 2, pages 119–129, 2000.
[10] A. Dasgupta, V. Narasayya, and M. Syamala. A static analysis framework for database applications. In Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pages 1403–1414, Washington, DC, USA, 2009. IEEE Computer Society.
[11] M. Emmi, R. Majumdar, and K. Sen. Dynamic test input generation for database applications. In Proceedings of the 2007 International Symposium on Software Testing and Analysis, ISSTA ’07, pages 151–162, New York, NY, USA, 2007. ACM.
[12] D. Gao, M. K. Reiter, and D. Song. Gray-box extraction of execution graphs for anomaly detection. In Proceedings of the 11th ACM Conference on Computer and Communications Security, CCS ’04, pages 318–329, New York, NY, USA, 2004. ACM.
[13] J. T. Giffin, S. Jha, and B. P. Miller. Efficient context-sensitive intrusion detection. In Proceedings of the 11th Annual Network and Distributed System Security Symposium NDSS, 2004.
[14] W. G. Halfond, J. Viegas, and A. Orso. A classification of SQL-injection attacks and countermeasures. In Proceedings of the IEEE International Symposium on Secure Software Engineering, volume 1, pages 13–15. IEEE, 2006.
[15] S. R. Hussain, A. M. Sallam, and E. Bertino. DetAnom: Detecting anomalous database transactions by insiders. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy, pages 25–35. ACM, 2015.
[16] C. Huth and R. Ruefle. Components and considerations in building an insider threat program. Technical report, Carnegie Mellon University, 2013. http://resources.sei.cmu.edu/asset_files/Webinar/2013_018_101_69083.pdf.
[17] S. Y. Lee, W. L. Low, and P. Y. Wong. Learning fingerprints for a database intrusion detection system. In Proceedings of the 7th European Symposium on Research in Computer Security, ESORICS ’02, pages 264–280, London, UK, 2002. Springer-Verlag.
[18] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley. The Java Virtual Machine Specification, Java SE 7 Edition. Addison-Wesley Professional, 1st edition, 2013.
[19] R. Majumdar and K. Sen. Hybrid concolic testing. In Proceedings of the 29th International Conference on Software Engineering, ICSE 2007, pages 416–426, May 2007.
[20] J. Melton and A. R. Simon. SQL: 1999: understanding relational language components. Morgan Kaufmann, 2001.
[21] R. Paleari, D. Marrone, D. Bruschi, and M. Monga. On race vulnerabilities in web applications. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 126–142. Springer, 2008.
[22] PostgreSQL Global Development Group. PostgreSQL-9.1.8. http://www.postgresql.org/docs/9.1/static/release-9-1-8.html.
[23] PostgreSQL Global Development Group. Trusted Platform Module. http://www.trustedcomputinggroup.org/developers/trusted_platform_module.
[24] K. Ramachandra and S. Sudarshan. Holistic optimization by prefetching query results. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pages 133–144, New York, NY, USA, 2012. ACM.
[25] T. Reps, T. Ball, M. Das, and J. Larus. The use of program profiling for software maintenance with applications to the year 2000 problem. In Proceedings of the 6th European Software Engineering Conference Held Jointly with the 5th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC ’97/FSE-5, pages 432–449, New York, NY, USA, 1997. Springer-Verlag New York, Inc.
[26] R. Roemer, E. Buchanan, H. Shacham, and S. Savage. Return-oriented programming: Systems, languages, and applications. Volume 15, pages 2:1–2:34, New York, NY, USA, Mar. 2012. ACM.
[27] A. Sallam and E. Bertino. Poster: Protecting against data exfiltration insider attacks through application programs. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS ’14, pages 1493–1495, New York, NY, USA, 2014. ACM.
[28] A. Sallam, E. Bertino, S. R. Hussain, D. Landers, R. M. Lefler, and D. Steiner. DBSAFE – an anomaly detection system to protect databases from exfiltration attempts. IEEE Systems Journal, PP(99):1–11, 2015.
[29] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, pages 23–34. ACM, 1979.
[30] K. Sen and G. Agha. CUTE and jCUTE: Concolic unit testing and explicit path model-checking tools. In Proceedings of the 18th International Conference on Computer Aided Verification, CAV’06, pages 419–423, Berlin, Heidelberg, 2006. Springer-Verlag.
[31] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE-13, pages 263–272, New York, NY, USA, 2005. ACM.
[32] A. Seshadri, A. Perrig, L. van Doorn, and P. Khosla. SWATT: software-based attestation for embedded devices. In Security and Privacy, 2004. Proceedings. 2004 IEEE Symposium on, pages 272–282, May 2004.
[33] X. Shu, D. Yao, and N. Ramakrishnan. Unearthing stealthy program attacks buried in extremely long execution paths. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 401–413. ACM, 2015.
[34] X. Shu, D. D. Yao, and B. G. Ryder. A formal framework for program anomaly detection. In Research in Attacks, Intrusions, and Defenses, pages 270–292. Springer, 2015.
[35] R. Srinivasan, P. Dasgupta, T. Gohad, and A. Bhattacharya. Determining the integrity of application binaries on unsecure legacy machines using software based remote attestation. In Proceedings of the 6th International Conference on Information Systems Security, ICISS’10, pages 66–80, Berlin, Heidelberg, 2010. Springer-Verlag.
[36] F. Valeur, D. Mutz, and G. Vigna. A learning-based approach to the detection of SQL attacks. In Proceedings of the Second International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, DIMVA’05, pages 123–140, Berlin, Heidelberg, 2005. Springer-Verlag.
[37] R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan. Soot - a Java bytecode optimization framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research, CASCON ’99, pages 13–. IBM Press, 1999.
[38] G. Vigna, F. Valeur, D. Balzarotti, W. Robertson, C. Kruegel, and E. Kirda. Reducing errors in the anomaly-based detection of web-based attacks through the combined analysis of web requests and SQL queries. J. Comput. Secur., 17(3):305–329, Aug. 2009.
[39] D. Wagner and D. Dean. Intrusion detection via static analysis. In Proceedings of the IEEE Symposium on Security and Privacy, S&P 2001, pages 156–168, 2001.
[40] K. Xu, D. Yao, B. Ryder, and K. Tian. Probabilistic program modeling for high-precision anomaly classification. In Computer Security Foundations Symposium (CSF), 2015 IEEE 28th, pages 497–511, July 2015.


Lorenzo Bossi is a postdoc at the Department of Computer Science at Purdue University. His research interests are in data protection from insider threats. He received his PhD in computer science from Insubria University in Italy.

Elisa Bertino is a professor of computer science at Purdue University, and serves as Director of the Purdue Cyber Center and Research Director of the Center for Education and Research in Information Assurance and Security (CERIAS). She is also an adjunct professor of Computer Science & Info Tech at RMIT. Prior to joining Purdue in 2004, she was a professor and department head at the Department of Computer Science and Communication of the University of Milan. She has been a visiting researcher at the IBM Research Laboratory (now Almaden) in San Jose, at the Microelectronics and Computer Technology Corporation, at Rutgers University, and at Telcordia Technologies. Her recent research focuses on data security and privacy, digital identity management, policy systems, and security for the Internet-of-Things. She is a Fellow of ACM and of IEEE. She received the IEEE Computer Society 2002 Technical Achievement Award, the IEEE Computer Society 2005 Kanai Award, and the ACM SIGSAC 2014 Outstanding Contributions Award. She is currently serving as EiC of IEEE Transactions on Dependable and Secure Computing.

Syed Rafiul Hussain is a PhD student at the Department of Computer Science at Purdue University. His research interests are in data protection from insider threats and provenance techniques for sensor networks.
