JST An Automatic Test Generation Tool

This document introduces JST, a tool that automatically generates test cases for industrial Java applications involving strings. It uses a hybrid symbolic execution engine combining numeric and string solving. The tool enhances an existing symbolic execution platform to support essential Java libraries, data structures, string constraints, regular expressions, and their interactions with numbers. It aims to improve test coverage, scalability, and performance for testing large industrial Java applications.

JST: An Automatic Test Generation Tool for Industrial Java Applications with Strings

Indradeep Ghosh∗, Nastaran Shafiei∗∗†, Guodong Li∗, Wei-Fan Chiang∗∗‡
∗ Software Systems Innovation Group, Fujitsu Laboratories of America, Sunnyvale, CA, USA
† Department of Computer Science and Engineering, York University, Toronto, ON, Canada
‡ School of Computing, University of Utah, Salt Lake City, UT, USA
∗∗ Contributed to this work during an internship program at Fujitsu Laboratories of America, Sunnyvale, CA, USA

Abstract: In this paper we present JST, a tool that automatically generates a high coverage test suite for industrial strength Java applications. This tool uses a numeric-string hybrid symbolic execution engine at its core which is based on the Symbolic Java PathFinder platform. However, in order to make the tool applicable to industrial applications the existing generic platform had to be enhanced in numerous ways that we describe in this paper. The JST tool consists of newly supported essential Java library components and widely used data structures; novel solving techniques for string constraints, regular expressions, and their interactions with integer and floating point numbers; and key optimizations that make the tool more efficient. We present a methodology to seamlessly integrate the features mentioned above to make the tool scalable to industrial applications that are beyond the reach of the original platform in terms of both applicability and performance. We also present extensive experimental data to illustrate the effectiveness of our tool.

I. INTRODUCTION

With the ubiquitous presence of software programs permeating almost all aspects of daily life, it has become a necessity to provide robust and reliable software. Traditionally, software quality has been assured through manual testing which is tedious, difficult, and often gives poor coverage of the source code, especially when relying on random testing approaches. This has led to much recent work in the arena of formal validation and testing. One such formal technique is symbolic execution [1], [2], [3], [4] which can be used to automatically generate test inputs with high structural coverage for the program under test.

Symbolic execution is a model checking technique that treats input variables to a program as unknown quantities or symbols [5]. It then creates complex equations by executing all possible finite paths in the program with the symbolic variables. These equations are then solved through an SMT (Satisfiability Modulo Theories, [6]) solver to obtain test cases and error scenarios, if any. Thus this technique can reason about all possible values of a symbolic input in an application. Though it is very powerful, and can lead to detection of interesting input values that uncover corner case bugs, it is computationally intensive. Hence some techniques and methodology are needed to make symbolic execution scale to industrial size applications.

Even though traditional symbolic execution has mainly dealt with numeric input variables, industrial Java applications are different as almost all inputs to these applications are strings. While some string inputs are used and manipulated as strings inside these applications, other inputs are converted to integers or floating point numbers after extensive format checking in the string domain. Such back-and-forth conversions pose unique challenges to the symbolic execution tool which has traditionally handled only numeric constraints well. Even in the numeric domain traditional SMT solvers cannot handle non-linear equations as solving such equations is undecidable in general. However, the industrial examples that we deal with routinely have non-linear operations which we have to tackle during symbolic execution. Instead of giving up on such situations, we have devised some techniques that are able to circumvent the problem in certain situations which we describe in this paper. Finally, symbolic execution suffers from the path explosion problem which is even more acute in large industrial examples. In this paper we describe effective steps to eliminate large portions of the symbolic search tree thus making the analysis engine scale to large examples.

The tool that we present in this paper, JST (Java String Testing), is a comprehensive Java testing tool that addresses the above described issues in traditional symbolic execution engines. It has extensive support for string operations and complex interactions between strings and numbers. For example, the JST symbolic executor can handle virtually all Java string operations, regular expressions, as well as string-number conversions. In addition, we have also added support for symbolic container classes like Maps, Arrays, etc., and other numeric data structures like BigDecimal and BigInteger which are widely used in financial Java applications.

JST is based on the Java PathFinder (JPF) model checker (http://babelfish.arc.nasa.gov/trac/jpf/wiki/projects/jpf-core) and its symbolic execution extension, Symbolic PathFinder [2], [7]. JPF consists of a highly configurable and easy to extend toolkit which is the main reason for using it as our underlying platform. JPF implements its own Java virtual machine (JVM) to execute Java bytecode which we have extended to handle all Java primitive types and Strings. We have addressed many bottlenecks in this process through a variety of innovations, each of which is essential to achieve
scalability of the tool and the desired quality of results. We summarize these experiences in the following sections.

We organize the paper as follows. We first give the background and a motivating example. This is followed by the details of the symbolic execution framework and the string-numeric hybrid solver. Finally, after presenting experimental results, we end with the discussion and conclusion.

II. BACKGROUND AND MOTIVATING EXAMPLE

Symbolic execution is used to achieve high code coverage by reasoning on all possible input values. It characterizes each program path it explores with a path condition encoded as a conjunction of Boolean clauses. A path condition denotes a set of branching decisions. When the execution is finished, multiple path conditions may be generated, each corresponding to a feasible execution path with respect to the symbolic inputs. The solutions to path conditions are the test inputs that will assure that the program under test runs along a particular concrete path during concrete execution. Exhaustive testing is achieved by exploring all true paths.

In addition to comprehensiveness, symbolic execution has other benefits. First, it actually “executes” the code in a real environment, hence eliminating the need to build models, or apply abstract analysis. Second, it is highly automated, producing test cases without requiring the intervention from users [1], [8]. This makes it a good choice to apply symbolic execution to realistic programs such as web applications.

On the other hand, there exist various problems that preclude its widespread use. The main issue is poor scalability due to the problem of path explosion. This is because the engine creates a new path for every comparison or branching instruction and may create thousands of paths. What is worse, each branching or assertion check (e.g. on memory out-of-bound accesses) in the program will invoke the solver for a satisfiability check. The vast number of invocations to the solver makes constraint solving the main bottleneck of a symbolic execution tool. This necessitates good solving techniques, which is one focus of this work. Specifically, we have developed an in-house string-numeric solver to cater to the special needs of Java web applications.

Now, consider an example with its corresponding branching tree shown in Figure 1. As the tree shows, at each branching point (if statement) the path condition splits into two, and each branch extends with new constraints. Notation L(q) represents the length of string q. This example first checks whether the symbolic input s starts with '-' such that s may represent a negative number. If so, it checks whether s is of a popular format represented by a regular expression (e.g. a string starting with '-', followed by at least one digit, and a comma, and then 3 digits). Then it checks whether ',' appears in s. If not then s is converted into a BigDecimal number. Otherwise, the substring after character ',' is taken and converted into an integer x. Then the computation continues by checking whether x is not less than 100. This is a typical computation sequence in web applications. The application first performs format checking on input strings. Then it converts the strings to numbers and finally, branches over arithmetic operations on those numbers. For example, a valid test case for path 4 is “-1,000,200”. It is not trivial to automatically cover all the branches through constraint solving, especially when the string formats and the numeric computation are very complicated.

The satisfiability of string+numeric constraints is an undecidable problem in general ([9] shows that a small subset of string operations will result in undecidability of a hybrid set of constraints). Hence practical solutions are important to tackle string-intensive programs. Existing string solvers (see [10] for a comparison of the automaton-based approaches) cannot fulfill our needs completely. These solvers provide no support or only very limited support for hybrid constraints, i.e. non-trivial combinations of numeric constraints and string constraints. In contrast, our solver supports almost all Java string operations and, more importantly, hybrid constraints. In addition, we use various techniques to control and optimize the solving during symbolic execution rather than use the solver as a black box.

The motivating example in Figure 1 involves the following techniques:

1) s.matches(...) checks whether s is of a specific format, and f=Integer.parseInt(s) checks whether s is an integer. For these cases, we need to intersect the automaton representing s and the ones modeling the regular expression and the syntax of valid integers. This may incur complicated automaton refinements.
2) s.lastIndexOf(...) may return a symbolic integer whose value is used to break s into multiple parts. Some parts may be converted into numbers (e.g. through parseInt) and then used in intensive computations (e.g., x>=100). Hence the path condition may contain hybrid constraints manipulating strings and numbers in tricky ways. Our solver handles these constraints by connecting the string and numeric domains through guided iterations, relational graphs, refinement rules, and other advanced techniques.
3) f is further converted into a specific data structure, BigDecimal, which stores the numeric values in an ad-hoc format. Then some rounding operations are applied. We model many widely-used data structures.
4) In order to make the tool scale, we use relaxed solving, a two-tier execution and solving technique, to handle hybrid constraints at run-time. Specifically, we separate the two domains and avoid iterations between them when checking the satisfiability of an intermediate branch, and only apply the iterations at the end of a path. This key optimization enables us to achieve a good balance between accuracy and performance.

Among these, Item 4 has not been used in any prior work, and Item 2 has been studied to a very limited extent. The automaton refinement scheme in Item 1 is quite different from existing automaton-based ones. One of the first attempts at solving hybrid constraints using automata in the string domain is described in our earlier work [11]. This paper summarizes the advances that we have made in the solver after that. Even with all these techniques, it is still a challenge to handle industrial strength Java applications. We will discuss the difficulties we encountered and the solutions that we finally settled on, in the following sections.

     1) String s; // symbolic input
     2) if (s.charAt(0)=='-') {
     3)   if (s.matches("-\d+,\d{3}"))
     4)     ...; // path 1
     5)   else {
     6)     int i=s.lastIndexOf(',');
     7)     if (i==-1) {
     8)       int f=Long.parseLong(s);
     9)       BigDecimal f1=BigDecimal(f,2);
    10)       BigDecimal d1=BigDecimal(-1);
    11)       if (f1.divide(3,BigDecimal.ROUND_HALF_UP).equals(d1))
    12)         ...; // path 2
    13)       else
    14)         ...; // path 3
    15)     } else {
    16)       String s1=s.substring(i+1);
    17)       int x=Integer.parseInt(s1);
    18)       if (x>=100)
    19)         ...; // path 4
    20)       else
    21)         ...; // path 5
    22)     }
    23)   }
    24) } else
    25)   ...; // path 6

Branching tree (the branch at each listing line splits the path condition in two):

    PC0:  true
    Branch at line 2)
      PC1:  s.charAt(0)='-' ∧ s.subString(0,1)="-"
      PC2:  s.charAt(0)≠'-' ∧ s.subString(0,1)≠"-"                    (path 6)
    Branch at line 3)
      PC3:  PC1 ∧ s.matches("-\d+,\d{3}")                             (path 1)
      PC4:  PC1 ∧ s.notMatches("-\d+,\d{3}")
    Branch at line 7)
      PC5:  PC4 ∧ s.notContains(",") ∧ i=-1
      PC6:  PC4 ∧ s.subString(i,i+1)=","
    Branch at line 11)
      PC7:  PC5 ∧ f=VOF(s) ∧ f1=(f/10^2) ∧ Round(f1/3)=d1             (path 2)
      PC8:  PC5 ∧ f=VOF(s) ∧ f1=(f/10^2) ∧ Round(f1/3)≠d1             (path 3)
    Branch at line 18)
      PC9:  PC6 ∧ s1=s.subString(i+1) ∧ VOF(s1)≥100                   (path 4)
      PC10: PC6 ∧ s1=s.subString(i+1) ∧ VOF(s1)<100                   (path 5)

Fig. 1. Our motivating example and its corresponding branching tree with path conditions.
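To make the six paths of the motivating example easier to try out concretely, the following is a minimal runnable sketch of the Figure 1 listing. It is an interpretation rather than the authors' code: the class name `MotivatingExample` and the method `classify` are hypothetical, the `BigDecimal` calls are written with `valueOf`, and the rounding branch is assumed to mean Round(f1/3) = d1 (as in the path conditions), so the division rounds to scale 0 and the comparison uses `compareTo` instead of `equals`.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Concrete rendering of the Figure 1 example; returns the number of the path taken. */
final class MotivatingExample {

    static int classify(String s) {
        if (s.charAt(0) == '-') {
            if (s.matches("-\\d+,\\d{3}")) {
                return 1;                                  // path 1
            }
            int i = s.lastIndexOf(',');
            if (i == -1) {
                long f = Long.parseLong(s);                // throws NumberFormatException on malformed input
                BigDecimal f1 = BigDecimal.valueOf(f, 2);  // f / 10^2
                BigDecimal d1 = BigDecimal.valueOf(-1);
                // Assumption: Round(f1/3) rounds to an integer with HALF_UP.
                BigDecimal q = f1.divide(BigDecimal.valueOf(3), 0, RoundingMode.HALF_UP);
                return q.compareTo(d1) == 0 ? 2 : 3;       // paths 2 and 3
            }
            String s1 = s.substring(i + 1);
            int x = Integer.parseInt(s1);
            return x >= 100 ? 4 : 5;                       // paths 4 and 5
        }
        return 6;                                          // path 6
    }
}
```

For instance, the input “-1,000,200” given in the text fails the regular expression (it has two commas), splits at the last comma, and reaches path 4 because 200 ≥ 100.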
III. THE EXECUTION FRAMEWORK

JST is built on top of the JPF platform, an explicit state model checker that implements a customized Java virtual machine as its core and can execute all Java bytecode instructions. JPF offers a highly configurable structure, and a variety of mechanisms to incorporate extensions. This allows us to integrate our extensions into JPF without tying them too closely to the original implementation. This is important to make our extensions more general. For example, our string solver can be used as a stand-alone tool without involving JPF. Figure 2 shows JST's main components. JST's executor implements new instruction semantics, especially for those involving string operations. It uses our own hybrid solver for string-numeric solving.

[Figure 2: block diagram with the components JPF, Sym-JPF, Symbolic Executor, JST Hybrid Solver, and Library Model.]

Fig. 2. Main components in JST.

JST also models many Java libraries to improve the symbolic execution performance. For example, we have created symbolic versions of all the container libraries such as Set, Map, etc. This allows us to model their behaviors in a way that reduces the branching during symbolic execution of these classes. These optimizations are crucial to reduce the number of useless paths.

In customizing the symbolic execution, JST defines different semantics for many bytecode instructions. The JPF framework allows for changing the semantics of a bytecode instruction by implementing the instruction factory as a configurable abstract factory. In this way one can choose which Instruction classes to use (i.e. these classes capture the bytecode instructions and define their execution semantics). JST overrides Instruction classes to introduce new or customized semantics for all Java bytecode instructions. Hence JST interprets Java programs in a different way. Though this mechanism is the same as in Symbolic PathFinder, JST has different execution semantics for many of the bytecodes than Symbolic PathFinder in order to handle symbolic strings and their interaction with numbers. For example, when a “method invocation” instruction such as invokestatic or invokevirtual is executed, we skip its original Symbolic PathFinder processing, and activate the special handling of pre-selected methods for strings, containers, and so on. This approach also speeds up the symbolic execution process and leads to smaller search trees. We handle the methods of about 25 Java classes using this approach, including String, BigDecimal, Integer, PrintStream, etc.

JST is designed to be loosely coupled with the JPF core platform. Main components such as the solver and the library models can be used separately without involving JPF. The executor core is largely independent of JPF too. Hence it will be minimally affected when JPF changes its core. Our experience shows that such design separation is important to develop significant extensions over an evolving platform like JPF.
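The abstract-factory mechanism described in this section can be pictured with a small stand-alone sketch. The types below are simplified, hypothetical stand-ins and are not JPF's or JST's actual classes; the sketch only shows how swapping the factory changes what an invocation instruction does for a pre-selected set of classes while everything else falls back to the default behavior.

```java
import java.util.Set;

/** Simplified stand-in for a bytecode instruction; NOT the real JPF type. */
interface Instruction {
    String execute(String className, String methodName);
}

/** Simplified stand-in for a configurable instruction factory; NOT the real JPF type. */
interface InstructionFactory {
    Instruction invokeVirtual();
}

/** Default semantics: every invocation is interpreted by stepping into the callee. */
final class DefaultFactory implements InstructionFactory {
    @Override
    public Instruction invokeVirtual() {
        return (cls, method) -> "interpret " + cls + "." + method;
    }
}

/** Customized semantics: pre-selected classes are routed to a library model instead. */
final class ModelingFactory implements InstructionFactory {
    // Hypothetical selection; the paper lists String, BigDecimal, Integer among about 25 classes.
    private static final Set<String> MODELED =
            Set.of("java.lang.String", "java.math.BigDecimal", "java.lang.Integer");

    private final InstructionFactory fallback = new DefaultFactory();

    @Override
    public Instruction invokeVirtual() {
        return (cls, method) -> MODELED.contains(cls)
                ? "model " + cls + "." + method
                : fallback.invokeVirtual().execute(cls, method);
    }
}
```

Because the executor only depends on the factory interface, the customized semantics can be selected by configuration, which is the loose coupling the section argues for.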
IV. THE SOLVER

Path conditions usually contain both numeric (integer and real) constraints and string constraints. String constraints impose restrictions on string variables, e.g. s.startsWith("ab") for symbolic string s. In our implementation, we separate these two kinds of constraints in a path condition. Numeric constraints are handled using existing solvers like Coral [12], Yices [13], and Choco (http://www.emn.fr/z-info/choco-solver/). We mainly use Yices for linear constraints, and Coral for simple non-linear constraints as we found these two solvers to be superior in performance to most of the other unrestricted-use solvers. String constraints, on the other hand, are handled by our in-house solver developed over a Finite State Machine (FSM) package [14]. In addition, we reconcile the automaton-based string constraints and the numeric ones in the solver, which is a unique challenge that all prior work [10] fails to address well.

A. Hybrid Solver with Iterations

[Figure 3: flowchart. Nodes: numeric & string constraints (input); Find numeric solutions; Solution found?; Add additional string constraints from RULE library; Find string solutions; Solution found?; Time out?; Add additional numeric constraints from RULE library. Outcomes: Unsatisfiable, Satisfiable, Give up.]

Fig. 3. The general solving process in JST

Solving hybrid constraints requires communication between the two domains. This often requires multiple iterations between the domains to reach an agreement on whether the combined constraints are sat (satisfiable) or unsat (unsatisfiable). Our solver in JST is no exception, with the main flow shown in Figure 3. At each iteration, the numeric solver first attempts to solve the numeric constraints. If unsat, then we can safely terminate the entire solver and report unsat. Otherwise, the results of the variables used by both the numeric and the string constraints are sent to the string solver. Other information may also be sent from the numeric solver to the string solver to make the iteration converge faster.

If the string solver can also find a solution, then the path condition is sat and the solver assigns concrete values to (a subset of) numeric and string variables. If no solution is found, a new iteration is performed after the string solver sends the “learned” constraints to the numeric one. This process continues until a solution is found or a pre-defined time or iteration limit is reached.

It should be noted that such an iteration method is not new by itself; SMT solvers [6] use iterations to combine multiple theories. In one sense, our hybrid solver is an SMT solver that combines a string theory and some numeric theories. The key, however, is how to perform efficient iterations in such a solver. In subsequent sections we describe a variety of techniques to optimize the iteration process.

B. String Constraint Solving

We use an FSM to represent a symbolic string variable such that all possible values of this variable constitute the language accepted by this FSM. Our implementation is based on the FSM package dk.brics.automaton described in [14]. Basically, an edge (transition) in an automaton is associated with a character range (interval). When an automaton is refined, we may change not only its nodes and edges but also the edge ranges. For example, the following shows the automata for the symbolic string s (left, accepting all values) and the regular expression “-\d+,\d{3}” (right) in the motivating example. For simplicity, we assume ASCII characters ranging from 0 to 255 while our implementation supports Unicode characters.

[Figure: two automata. Left: the automaton for s, whose start state carries a self-loop labeled [0-255]. Right: the automaton for the regular expression, with edges labeled '-', ['0'-'9'], ',', and ['0'-'9'].]

String operations are modeled by automaton operations. For example, the concatenation of two (symbolic) strings s1 and s2 is implemented by appending s2's automaton to s1's, by adding transitions from all accepting states of s1's automaton to all initial states of s2's automaton. Many such operations are supported by the dk.brics.automaton package. However, we have to model many more string operations. We have also extended the package for solving rather than static analysis (for which the original package is designed [14]). While static analysis allows over-approximations, string solving must be accurate and report no false solutions. Hence we need to use extra techniques to ensure the preciseness.

Our system supports most of the string operations in the String class in the Java standard library. Some of these operations are applied by performing regular expression operations in the underlying package, such as union, intersection and complement (here we use automaton intersection and language intersection interchangeably). There are other operations, such as trim, substring, and toUpperCase, that require extra handling. For example, Figure 4 shows how we deal with substring(2,4) which returns a substring from begin index 2 to end index 4 (not inclusive). Roughly, we first advance 2 transitions from the start state, then mark the reached states as the new start state snew, and then identify all states reachable from snew in 2 transitions as new accepting states. Finally, we intersect
this automaton with the one accepting all words of length 2 to get the final automaton. There are other operations from the String class that need similar extensions, e.g. substring, replace, replaceAll, replaceFirst, toUpperCase, toLowerCase, and split.

[Figure 4: automaton with states marked Start, New Start, Accept, and New Accept.]

Fig. 4. Model operation substring(2, 4) using automaton.

String constraints depict the relation between strings (and numbers). The dk.brics.automaton package does not address string constraints directly. Hence we enhance it to refine string values according to given string constraints. This procedure includes (1) automaton refinement, (2) fix-point calculation, and (3) optimizations to speed up the convergence. In the motivating example, for constraint s.charAt(0)=='-', we intersect s with an automaton accepting any string starting with character '-'. Later on, since a portion of s is converted into an integer using parseInt, we intersect this portion with the automaton modeling all valid integers; we also need to refine s based on the updated portion, i.e. intersect s's automaton with the one that accepts strings ending with this portion. Basically, if a constraint enforces a relation over strings s1 and s2, e.g. s1.startsWith(s2), then we refine (1) s1 by enforcing that it starts with s2, and (2) s2 by enforcing that it is the beginning part of s1. This process is repeated until no more refinement is possible and a fix-point (on the possible string values) is reached. We will skip the details here, but the basic algorithm is similar to abstract interpretation [15], e.g. for abstract domains such as integer intervals.

In addition to automaton refinement, we apply special handling to some cases where the pure automaton model is inadequate. Take constraint s1 < s2 ∧ s2 < s3 ∧ s3 < s1 for example. It is difficult to model s1 < s2 using automata, as the automaton capturing the < relation may have a huge size. Furthermore, this constraint should be proven to be false immediately without involving any automaton computation. In our implementation, we introduce two extra models for such constraints. In the first one, each string variable is associated with a “number” representative, e.g. s1's representative is integer i_s1, such that this constraint can be falsified immediately by the numeric solver through checking i_s1 < i_s2 ∧ i_s2 < i_s3 ∧ i_s3 < i_s1. This model allows us to prove unsat cases quickly.

In the second model, we maintain a relational graph with the string variables and their relations as the nodes and edges respectively. This graph is used to produce concrete values respecting the logical relations. For example, the following graph indicates that string s1 is less than s2, and s2 starts with s3, and so on. After all the node automata are refined, we get that s2 = s3 α∗ and s3 = α∗ xy α∗ where α denotes any character. Then we can search the graph in a post-topological order to find out a concrete assignment to all the string variables, e.g. s3 = “xy”, s2 = “xyb”, and s1 = “xya”. Since the string values have been refined before the search, finding concrete solutions is usually very fast.

[Figure: relational graph over s1, s2, s3 and the constant "xy", with edges labeled ≠, <, ≥, startsWith, and contains.]

C. Numeric and String Solving Interactions

String constraints and numeric constraints must agree on the same value of every shared symbolic variable. Roughly, we have the numeric solver N give some candidate values, and ask the string solver S whether these values violate the string constraints. If no violation is found, then these values are valid for both domains. Otherwise, S gives some feedback, like the conflicts it learns, to N, and asks for other candidates. Next, N adds the conflicts and the negation of the current assignment, and then searches for another valid assignment using some heuristics. Note that only the concrete values of shared variables will be passed from N to S, and N can learn some string conflicts and pass them to S too. For example, numeric constraint s.length() > 5 enforces that the corresponding automaton should be intersected with one accepting strings of length > 5.

Since the two domains interact with each other mainly through concrete values, it is important to avoid using fruitless values. During the iteration we exchange the constraints learned from each domain (called interactive constraints) so as to speed up the convergence. For example, consider string constraint s1=s2.trim(). A numeric constraint L(s1)≤L(s2) is added into N, where L(s1) and L(s2) are (symbolic) integer variables representing the lengths of s1 and s2. In our implementation, interactive constraints are modeled in a RULE library which we describe next.

Interactive Constraint Propagation. The RULE library includes commonly occurring patterns that we observed when applying JST to a wide range of Java applications. These patterns are particularly useful in the web and financial application domain, especially those from S to N. Table I shows an excerpt of these patterns. The “a” rules have been used by other solvers [9], [16], [17], while the “b” rules are new in JST. One example is (s.startsWith('-') ∧ n=Integer.valueOf(s)) where s is a string variable and n is an integer. This pattern leads to numeric constraint n<0. For another example, if the numeric solution found for VOF(s) is 5, then string constraint s.equals("5") is enforced.
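The exchange between N and S described in Section IV-C can be pictured with a deliberately tiny loop. This is a toy sketch under stated assumptions, not JST's solver: the only shared variable is the length of a string s1, the only string constraint is “s1 is a decimal digit string whose value is at least min” (with min assumed non-negative), and the names `ToyHybridLoop`, `solveString`, `nextCandidate`, and `solve` are hypothetical.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

final class ToyHybridLoop {

    /** Toy string solver S: a digit string of exactly len characters with value >= min, if one exists. */
    static Optional<String> solveString(int len, int min) {
        String digits = String.valueOf(min);
        if (digits.length() > len) {
            return Optional.empty();                 // conflict: too short to encode min
        }
        return Optional.of("0".repeat(len - digits.length()) + digits);
    }

    /** Toy numeric solver N: the smallest length >= 1 that has not been blocked yet. */
    static int nextCandidate(Set<Integer> blocked) {
        int len = 1;
        while (blocked.contains(len)) {
            len++;
        }
        return len;
    }

    /** Iterates N -> S until both domains agree or the iteration limit is reached. */
    static Optional<String> solve(int min, int maxIterations) {
        Set<Integer> blocked = new HashSet<>();
        for (int it = 0; it < maxIterations; it++) {
            int len = nextCandidate(blocked);        // N proposes a value for the shared variable L(s1)
            Optional<String> s1 = solveString(len, min);
            if (s1.isPresent()) {
                return s1;                           // both domains agree on this candidate
            }
            blocked.add(len);                        // N adds the negation of the rejected assignment
        }
        return Optional.empty();                     // limit reached: give up
    }
}
```

With min = 100 the loop rejects lengths 1 and 2 before succeeding at length 3, which is the kind of fruitless iteration that a learned length constraint passed to N would avoid.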
TABLE I
Some rules in the RULE library. Here s1, s2, and s3 are string variables; i and j are integer variables; n is a numeric variable; c is a positive integer constant.

    #     Observed Constraint              Derived Constraints
    ----  -------------------------------  ------------------------------------------------------
    a.1   s3=s1.concat(s2)                 L(s3)=L(s1)+L(s2)
    a.2   s2=s1.trim()                     L(s2)≤L(s1)
    a.3   s2=s1.substring(i,j)             i≥0 ∧ j≤L(s1) ∧ i≤j ∧ L(s2)=j-i
    b.1   i=s.lastIndexOf(c)               s.substring(i,i+1).equals(c)
    b.2   s.charAt(0)='-' ∧ n=VOF(s)       n<0
    b.3   s.charAt(0)='+' ∧ n=VOF(s)       n>0
    b.4   VOF(s)=n                         s.equals("n")
    b.5   VOF(s)<0                         s.charAt(0)='-'
    b.6   VOF(s)>c                         L(s)≥log(c) ∧ s.charAt(0)≠'-'
    b.7   VOF(s)<c                         (L(s)≤log(c) ∧ s.charAt(0)≠'-') ∨ s.charAt(0)='-'
    b.8   L(s.trim())=c                    L(s)≥c
For illustration, consider path 4 in the motivating example. The situation becomes worse when the solver is invoked mul-
We give below its path condition pc where some additional tiple times. Suppose a program contains n satisfiable branches,
“S to N” constraints (#1-#4) have been added into the nu- then O(2n ) paths can be generated and O(2n ) queries are in-
meric domain. To solve this pc, the numeric solver N first voked on the solver. If m iterations are needed for each query,
obtains a valid assignment to numeric variables, e.g. L(s)=2, then O(2n m) iterations happen in total. Fortunately, we have
L(s1)=1, i=0, and x=100, then passes these values to the observed in real applications that many path constraints will
string solver S. Unfortunately S cannot find a solution since turn into unsat in subsequent computations quickly leading to
constraint x = parseInt(s1) is unsat. It can ask N to perform false paths. We also know that only the solved values at the
new iterations until s1’s length is at least 3. Or, the solver end of a symbolic path will contribute to valid test cases.
uses a rule to derive from “x=parseInt(s1) ∧ x≥100” Based on these observations, we apply a two-phase solving
an additional constraint L(s1)≥3. Moreover, the automata technique during symbolic execution. For the intermediate
enforce that s’s length is at least 2 more than that of s1. branching nodes, we use relaxed solving which checks only
With these two new constraints the solver can quickly find a whether the string and the numeric constraints are sat without
valid solution, e.g. s=“-,100”, s1=“100”, i=1, and x=100.

    Numeric                           String
    1: L(s) > 0                       s.charAt(0) = ‘−’
    2: i = −1 ∨ 0 ≤ i < L(s)          i = s.lastIndexOf(‘,’)
    3: i + 1 + L(s1) = L(s)           s1 = s.substring(i + 1)
    4: L(s1) > 0                      x = parseInt(s1)
       i ≠ −1 ∧ x ≥ 100               ¬(s.matches(-\d+,\d{3}))

When S cannot find a valid solution, additional constraints are passed to N and then a new iteration starts by throwing away previous candidate values. For example, suppose that numeric shared variables a, b, and c with the solutions 5, −7, 0 are found unsat in S, then we will need to add the numeric constraint ¬(a = 5 ∧ b = −7 ∧ c = 0) or a ≠ 5 ∨ b ≠ −7 ∨ c ≠ 0. Clearly this may lead to an exponential number of case splittings. To mitigate the blow-up, we apply an optimization through utilizing a feature provided by the Yices solver: Yices returns satisfiable values closest to zero for numeric variables. Hence we can safely try three narrower cases: a > 5; b < −7; and c ≠ 0, which searches the same space but can converge faster. We also consider another case, (a > 5 ∧ b < −7 ∧ c ≠ 0), in hope of finding a valid solution rapidly. Our experience demonstrates that such simple improvements and heuristics can have considerable effects in practice.

D. Relaxed Solving and Execution

One of the main bottlenecks of our hybrid solver is that it may need many iterations to find a solution or derive unsat.

multiple iterations. At the end of each path, we use the usual regular solving with full iterations and constraint propagation. This method dramatically improves the performance since we not only avoid expensive solving for intermediate nodes but are able to quickly cut out many false paths using the over-approximation techniques of relaxed solving.

The relaxed solving process is similar to the regular one but without multiple iterations. It starts with solving the numeric constraints. If a solution is found, it passes additional constraints (e.g. those in Table I) but not the values of shared variables to the string solver. Note that these additional constraints represent a strict over-approximation of the set of solutions possible in the numeric domain. If the string solver cannot find a solution, the path condition pc is unsatisfiable because no matter what other solution is passed from the numeric domain it will satisfy the over-approximation constraint and hence be unsat in the string domain. Otherwise, pc is regarded to be sat. Thus in the relaxed mode the two domains do not communicate through concrete values.

The main disadvantage of relaxed solving is that it may explore infeasible paths whose path conditions are sat in the relaxed mode while unsat in the regular mode. This seems to increase the number of intermediate paths. However, this happens rarely in practice since (1) we still use the string solver S and the numeric solver N to rule out most infeasible paths, and (2) an infeasible path is often falsified by subsequent relaxed solving at intermediate nodes. Thus relaxed solving can be very effective in pruning infeasible paths early.
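As a rough illustration of why a relaxed check can be so much cheaper than iterating, the following Java sketch contrasts the two modes on a toy path condition. It is a model built on stated assumptions, not JST's implementation: the class and method names are hypothetical, the numeric constraint is reduced to x ≥ lowerBound with x = parseInt(s), the string constraint is reduced to "s is a non-negative decimal numeral with at most maxDigits digits", and the derived bound L(s) ≥ digits(lowerBound) stands in for the kind of over-approximating constraint handed to the string solver.

```java
final class RelaxedSolvingSketch {
    // Iteration cap for the regular mode (assumed here; mirrors the 2,000 used in the experiments).
    static final int ITERATION_LIMIT = 2000;

    // Toy string solver S: can a numeral of at most maxDigits digits spell the value x?
    static boolean stringSolverAccepts(int x, int maxDigits) {
        return x >= 0 && Integer.toString(x).length() <= maxDigits;
    }

    // Regular mode: the numeric side proposes concrete values for x, smallest first,
    // and excludes each rejected value before the next iteration.
    // Returns the number of iterations used, or -1 if the cap is hit without an answer.
    static int regularIterations(int lowerBound, int maxDigits) {
        requireNonNegative(lowerBound);
        int candidate = lowerBound;
        for (int iteration = 1; iteration <= ITERATION_LIMIT; iteration++) {
            if (stringSolverAccepts(candidate, maxDigits)) {
                return iteration;
            }
            candidate++; // models adding the constraint "x != previous candidate"
        }
        return -1;
    }

    // Relaxed mode: no concrete value is exchanged. The numeric side only passes
    // the over-approximation L(s) >= digits(lowerBound), checked once by the string side.
    // A "true" answer is tentative; a "false" answer proves the path unsat.
    static boolean relaxedSat(int lowerBound, int maxDigits) {
        requireNonNegative(lowerBound);
        int minLength = Integer.toString(lowerBound).length();
        return minLength <= maxDigits;
    }

    private static void requireNonNegative(int lowerBound) {
        if (lowerBound < 0) {
            throw new IllegalArgumentException("this sketch assumes lowerBound >= 0");
        }
    }

    public static void main(String[] args) {
        System.out.println(regularIterations(100, 2)); // prints -1 (cap reached)
        System.out.println(relaxedSat(100, 2));        // prints false (pruned in one round)
        System.out.println(regularIterations(100, 3)); // prints 1
        System.out.println(relaxedSat(100, 3));        // prints true (tentatively sat)
    }
}
```

For x ≥ 100 and at most two digits, the regular loop spends all 2,000 iterations without an answer, while the relaxed check refutes the path in a single round. When the relaxed check answers sat, the result is only tentative, so full solving is still required at the end of the path.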
[Figure 5: two branching trees, labeled “Regular” and “Relaxed”, each rooted at a node with path condition pc. Level i branches on Si ↔ Ni versus ¬Si ↔ ¬Ni. The regular tree iterates n times on every branch (↔n); the relaxed tree iterates once (↔1) at intermediate nodes. Leaves marked • are invalid; the leaf marked ◦ is valid.]
Fig. 5. Comparing regular solving and relaxed solving.

Consider Figure 5 which contains two branching trees starting from a node with path condition pc. Each tree branches over a sequence of conditions S1 ↔m N1, ..., Sk ↔m Nk, where Si ↔m Ni denotes the ith condition such that its numeric constraint Ni and string constraint Si iterate m times. When m = 1 the relaxed mode is used. The regular mode assumes m = n where n is the average iteration number. The “regular” tree is of height k, incurring (2^(k+1) − 2)n iterations in total. Suppose all the leaf nodes except the rightmost one are invalid (marked by •) and the rightmost path is valid (marked by ◦). In this case exploring the others is fruitless. This can be avoided through the relaxed mode. Assume that in all the paths except for the rightmost one, Si and Ni may conflict with Sj and Nj for i ≠ j, i.e. the paths become unsat in the relaxed solving phase. Hence there is no need to perform expensive iterative solving on those paths.

The number of iterations now is reduced to 2^(k+1) − 2 + kn, an improvement of (2^(k+1) − 2)n / (2^(k+1) − 2 + kn) ≈ n times (for large k) over the regular mode. That is, if the average iteration number is 1000, then relaxed solving can produce a speed-up of 1000x. This is validated in the experimental results section.

E. Handling Some Nonlinear Operations

Some applications that we tested generate path conditions involving nonlinear numeric constraints. Solving them is undecidable in general and beyond the capabilities of SMT solvers. For these constraints, we tried the Coral solver [12] which is a randomized solver that uses machine-learning algorithms to search solutions for complex non-linear constraints.

Unfortunately, using such a random solver considerably slows down the solving process (it has been reported that Coral can be > 100x slower than SMT solvers for linear constraints [12]). To mitigate this problem, we transform some nonlinear constraints to equivalent linear ones, which can then be solved using faster SMT solvers. One example is the round method in the Java class java.lang.Math which returns the closest integer to the given floating point number. If the nearest integers to the given number are equidistant, this operation returns the greater integer (e.g. round(2.5) becomes 3). Specifically, for the operation round(e), we add the following linear constraint to the numeric set, where e is a real variable, and x1, x2, and result are introduced integer variables. result represents the value of round(e), and it replaces all its occurrences in numeric constraints. When e is positive, the constraint enforces e − 0.5 < x1 ≤ e + 0.5, e.g. when e = 1.6, constraint 1.1 < x1 ≤ 2.1 implies that result = 2. Variable x2 is for the case when e is negative.

    ((e − 0.5 < x1) ∧ (e + 0.5 ≥ x1)) ∧
    ((e − 0.5 ≤ x2) ∧ (e + 0.5 > x2)) ∧
    (result = (if (e > 0) x1 x2))

Similarly, we replace other operations in the Math class, such as ceil and floor, with their equivalent linear constraints. Many rounding methods in the BigDecimal class are also handled in this manner. While most conversions are not complicated, this represents one important enhancement we use to scale JST to handle industrial applications. It is obvious that this transformation can only work for non-linear functions that are piecewise linear. Otherwise we still need to use Coral and abort in case of a time-out.

V. EVALUATION

We evaluate JST on three string-intensive benchmark examples whose characteristics are described in Table II. They represent many other similar applications we have tested. For each benchmark we create a driver and some stubs to produce a closed system on which JST can be run. Since the first two benchmarks are very large and rely on multiple external libraries, packages and jars, and a lot of source code is missing, we focus on the parts where the core logic is implemented. Despite their small sizes (1-4k lines), they represent the most complicated string and numeric computations in the applications. Note that, in order to test the core logic, we have to symbolically execute the whole application, leading to huge symbolic paths through the complete application. We only measure the coverage of the core logic although many other parts of the applications are also covered (many of which are not sensitive to the symbolic inputs). We run the experiments on an Ubuntu Linux machine with a quad-core 3.4GHz Intel Core i7 processor and 8GB of RAM.

We could not compare our results with any other freely available tool because we found all of them to be completely inadequate in handling all the String operations that existed in our examples and the specific interactions between the numeric and string domains. There is also currently no standard format for expressing String constraints. Thus it is not possible to translate all the complex constraints to operations that other String-Numeric constraint solving tools can understand.

Table III shows the result of running JST on Example A while gradually increasing the number of symbolic inputs to 4. We also present some results for different combinations of
TABLE II
BENCHMARK STATISTICS

    Application  Description                      #classes  #bytecode  #lines of core logic  #Inputs
    A            Business to Business Ordering    2,801     1,226K     3,379                 4 (strings)
    B            Financial Application 1          5,232     2,704K     2,613                 5 (strings)
    C            Financial Application 2 Snippet  3         1,672      1,157                 3 (strings)

symbolic variables. All these inputs are string variables though some are converted to integers and double values within the application. The number of true paths in the table corresponds to the actual test cases generated at valid end states in the program. However, there are also some unhandled exception paths that are encountered during the symbolic execution, like NumberFormatException, ArithmeticException, etc., which are shown in Column 3. These exception paths usually indicate that the application is missing the error handling code. Manual investigation indicates that a small portion of them are “real” issues (wrong assumptions or bugs). In Column 4 the maximum number of iterations required between the numeric and string domains to arrive at a valid solution is shown. As expected this number does increase as we have more symbolic inputs interacting with each other, but due to the optimizations mentioned in subsection IV-B, JST is able to keep the number of iterations to a reasonable limit. There are no unsolved constraints in this case and with 4 symbolic inputs we can achieve greater than 80% code coverage on the core logic part of the application. We will discuss in Section VII the reasons for the coverage not attaining 100%.

TABLE III
EXPERIMENTAL RESULTS FOR EXAMPLE A

    #Sym.Vars      #Paths (True)  #Paths (Exception)  Time     #Iterations
    1              6              1                   8s       1
    2 (subset 1)   27             5                   16s      1
    2 (subset 2)   35             9                   55s      2
    3 (subset 1)   595            16                  3:24m    5
    3 (subset 2)   493            25                  4:01m    5
    4              1,971          95                  11:08m   24

Table IV shows the effectiveness of the relaxed solving described in subsection IV-D. In this experiment we run multiple experiments on Example B, first by using the relaxed mode (Columns 2-4) and subsequently turning this feature off (Columns 5-7). The hybrid solver is allowed 2,000 iterations before giving up. The overall time-out for a run is set to 48 hours. Again we gradually increase the number of symbolic variables from 1 to 5 and again we report results for multiple combinations of the same number of variables. In the relaxed mode run for the 5 variable case, JST finishes after 1 day with a total of about 28,000 test cases. Again, we can achieve about 80% code coverage on the core logic code. All valid path conditions can be solved or proved unsat and the maximum number of iterations needed among the solvers is 268. However, when we turn off that feature the effect is a dramatic reduction in effectiveness of JST. JST times out with as few as 3 symbolic variables, and even for the cases it finishes, it is unable to find many solutions within the stipulated 2,000 iterations. The reasons for this are twofold (see Section IV-D too). First, the relaxed mode is able to prove some intermediate path conditions unsat in the middle of the symbolic execution tree, thus pruning off large portions of the symbolic search tree. This speeds up the exploration of the complete symbolic execution tree. Second, even at a true leaf node, the rule based constraint exchange between the numeric and string domains is able to quickly converge to a solution, whereas without that feature the hybrid solver often hits the 2,000 iteration limit without finding a solution. Thus a lot of time is wasted without producing any new results. Note that the relaxed mode does not lead to over-approximations since full solving is applied on the leaf nodes to eliminate false “tentative” paths.

TABLE IV
EXPERIMENTAL RESULTS FOR EXAMPLE B. #T.P. AND #E.P. DENOTE #TRUE PATHS AND #EXCEPTION PATHS RESPECTIVELY. IN RELAXED MODE, #T.P. CONTAINS “#FINAL PATHS/#TENTATIVE PATHS”.

                   With Relaxed Mode            Without Relaxed Mode
    #Sym. Vars     #T.P.    #E.P.    Time       #T.P.  #E.P.  Time
    1              3/5      9        7s         3      9      35s
    2 (subset 1)   15/26    15       20s        15     15     2:37h
    2 (subset 2)   20/36    18       22s        20     18     3:05h
    3 (subset 1)   178      183      7:47m      -      -      TO
    3 (subset 2)   253      301      9:42m      -      -      TO
    4 (subset 1)   890      2,535    4:31h      -      -      TO
    4 (subset 2)   1,430    3,242    5:54h      -      -      TO
    5              7,156    20,530   28:50h     -      -      TO

Finally, we present results to demonstrate the effectiveness of non-linear constraint modeling as described in subsection IV-E. In this case we run experiments on Example C, which is carved out of a second financial application for unit testing. These classes consist of a lot of different types of rounding operations on the Java BigDecimal data type. We make multiple runs with the non-linear modeling turned on (Table V, Columns 2-4) and then off (Table V, Columns 5-7). When the non-linear modeling is turned on all the resulting path conditions can be solved by the Yices [13] solver, but when it is turned off we have to invoke the non-linear solver Coral [12]. Again it is evident from the table that we can get orders of magnitude improvement in runtimes and dramatic improvements in the quality of results. Since this is unit testing, with all input variables symbolic, we can obtain 100% code coverage in the whole example. However, when we turn off this feature we run into frequent time-outs inside Coral. This is to be expected as the Coral algorithm is by definition incomplete and weaker than SMT solving, which relies on efficient algorithms for solving Boolean and numeric constraints. Of course there will be cases when it is impossible to model non-linear constraints with piecewise linear operations. In such cases we do invoke Coral as a last resort and give up in case Coral times out without finding a solution.

TABLE V
EXPERIMENTAL RESULTS FOR EXAMPLE C. #T.P. AND #E.P. DENOTE THE NUMBERS OF TRUE PATHS AND EXCEPTION PATHS RESPECTIVELY.

                   With Nonlinear Models        Without Nonlinear Models
    #Sym. Vars     #T.P.  #E.P.  Time           #T.P.  #E.P.  Time
    1              5      0      1s             3      0      9s
    2 (subset 1)   35     10     19s            28     8      57:1m
    2 (subset 2)   29     22     23s            24     7      45:1m
    3              175    50     1:39m          31     10     6:37h

Overall, our experience indicates that “automaton+iteration” is a promising way to handle hybrid constraints. In addition, other key features that make the tool practical include: (1) using relaxed solving to prune paths and speed up the solving, (2) converting non-linear expressions to linear ones whenever possible, (3) optimizing Java libraries to cater for the needs of symbolic execution, and (4) applying divide-and-conquer techniques to divide the state space.

VI. RELATED WORK

Table VI gives a brief comparison of string solvers in terms of string model, support for basic string operations, string relations and regular expressions, support for string and numeric interactions, using rules during the iterations, and the targeted applications. For instance, “Reln” denotes relational constraints such as s1 > s2. Roughly, the solvers can be divided into two kinds: bitvector based and automaton based.

Microsoft’s solver [9] encodes string operations with bitvectors. Constraints over string lengths are derived from string constraints before the bitvectors are instantiated and solved. It does not support string relations and regular sets, and does not apply constraint propagation from the numeric domain to the string domain. Hampi [16] handles fixed-size strings but supports constraints checking membership in regular languages and fixed-size context-free languages. It converts string constraints into fixed-size bitvectors. Numeric constraints are not supported (except for the simple ones over the lengths). Kudzu [17] uses bitvector encoding and derived length constraints. It uses the Hampi solver, and translates constraints into the input of Hampi. Kudzu provides support for many basic string operations such as concatenation, plus limited support for symbolic numbers and their interactions with strings.

A recent evaluation of automaton-based string solvers is given in [10]. Among these tools, the Rex tool [18] uses automata and an SMT solver, and represents automaton transitions using logical predicates. This allows it to handle numbers using the SMT solver. Its encoding is shown to be too inefficient, and slower than Hampi by several orders of magnitude [10]. Stranger [20] uses an automaton-based method to model string constraints and length bounds for abstract interpretation. It models many string operations including replacement but no interactions with numbers.

A lazy solving technique is proposed in [23]. It uses a graph to represent constraints over string variables, and searches the graph with back-tracking to guess solutions. It can find a solution (if it exists) fast but the unsat case is much more difficult. However, it handles only a very limited set of string and regular operations, and allows no symbolic numbers.

In [21], [22] a string solver similar to ours is implemented for JPF. This solver uses range automata and bitvectors to model a subset of string operations and regular languages. It supports fewer string operations, has a limited set of iteration rules, and maintains a weak connection between the automaton domain and the numeric domain. It does not support many of the features described here such as relaxed solving.

As for symbolic executors, some widely used symbolic execution engines [1], [2], [3], [4] handle high level languages such as C and Java. Recently we have extended KLEE [1] to handle C++ programs [8] and GPU programs [24]. None of these tools has a built-in string solver. Hence they rely on specific extensions such as Rex for Pex [4], and Hampi for KLEE. Note that only JST uses relaxed solving and bi-directional flow of derived constraints.

VII. DISCUSSIONS AND CONCLUSION

This paper describes the JST tool which is used to uncover bugs and generate test vectors for industrial strength Java applications. Though the tool is based on the well known Symbolic Java PathFinder platform, the extensions needed to make the tool applicable to industrial examples are non-trivial. This is evident from Table VII which enumerates the size of the different blocks that we have implemented in order to achieve our goals.

TABLE VII
DEVELOPMENT SIZE OF JST. WE COUNT ONLY OUR EXTENSIONS.

    Main Component      Extend Over    l.o.c
    Execution Engine    Sym. JPF [2]   23,418
    Automaton Package   JSA [14]       4,546
    Hybrid Solver       –              8,784
    Nonlinear Solver    Coral [12]     1,180
    Others              –              1,500
    Total                              39,428

The primary issues that we have tackled are extensive support for different Java primitives and libraries that are frequently encountered in such examples, support for symbolic strings, and multiple enhancements in the hybrid string-numeric solver to make the approach scalable. While the original platform had support for some numeric and Boolean primitive types, we have implemented support for all symbolic primitive types in Java as well as symbolic strings. While implementing symbolic strings we have had to add support for most of the string manipulation methods in the Java String, StringBuilder and StringBuffer libraries. Even some conversions between String and CharSequence, and String and Character arrays, are tackled along with methods
TABLE VI
COMPARISON OF STRING SOLVERS. NOTATIONS “⊕”, “ ” AND “–” INDICATE “STRONG”, “WEAK” AND “NONE” RESPECTIVELY.

Solver String Model Support w. Numeral + Rules App.

Basic Reln R.E. S→N N→S
Microsoft [9] bitvector ⊕ – – ⊕ – .NET
Hampi [16] bitvector – ⊕ – – – Web,C
Kudzu [17] bitvector ⊕ – ⊕ – JavaScript
Rex [18] automaton+SMT ⊕ – ⊕ – – General
Stranger [19], [20] bit automaton ⊕ – – – PHP
Lazy [10] range automaton – ⊕ – – N/A
[21], [22] range automaton ⊕ ⊕ – Java
JST range automaton ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ Java

in those classes. Java regular expressions are also supported. In addition we have created symbolic models for different Java container classes like Vector, ArrayList etc. Some arithmetic classes like BigDecimal and BigInteger are also modeled as they are used frequently in financial applications. After enhancing the symbolic support for the various data types and libraries that we have encountered, an efficient hybrid numeric-string solver was needed to solve the symbolic constraints arising out of the symbolic execution. Though a lot of work exists in the literature to do this, we have found it inadequate to tackle unique challenges arising out of type conversions using functions like .valueOf(), .parseInt(), etc. In such cases we have devised a rule based constraint propagation technique between the numeric and string domains so that the solver can quickly decide unsat on a hybrid formula or use fewer iterations to arrive at a solution. We have also modeled certain non-linear constraints in a piecewise linear fashion so that they can be solved quickly by a standard SMT solver.

The effectiveness of JST incorporating the above set of enhancements is shown on two large industrial strength examples. In the results we are not able to achieve 100% code coverage for the sections under test due to deficiencies in the drivers, which are manually generated. Note that the complete coverage of the code under test not only depends on the various data values in the inputs that exercise different branches, but also on the different event scenarios that the driver takes the code through. While symbolic execution is very effective in finding data values that cover all branches, it is unable to unearth possible scenarios that the driver has completely missed. This is one of the deficiencies of JST which we are looking into next. The tool is currently being used internally by Java developers for unit and component level testing on the order of 2K-20K source lines. System level testing is not advised with JST as the symbolic execution becomes unmanageable even with all the above enhancements. Even though we have described a tool applicable to the Java language, it is possible to use similar techniques to handle C/C++, JavaScript or other popular programming languages.

ACKNOWLEDGMENT

The authors would like to thank Tadahiro Uehara and Shoichiro Fujiwara from Fujitsu Labs Japan for extensive testing of JST and providing examples to make the tool better.

REFERENCES

[1] C. Cadar, D. Dunbar, and D. R. Engler, “KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in OSDI, 2008.
[2] C. S. Pǎsǎreanu and N. Rungta, “Symbolic PathFinder: symbolic execution of Java bytecode,” in ASE, 2010.
[3] K. Sen, D. Marinov, and G. Agha, “CUTE: a concolic unit testing engine for C,” in ESEC/FSE, 2005.
[4] N. Tillmann and J. De Halleux, “Pex: white box test generation for .net,” in International conference on Tests and proofs (TAP), 2008.
[5] J. King, “Symbolic execution and program testing,” Communications of the ACM, vol. 19, no. 7, pp. 385–394, 1976.
[6] D. Kroening and O. Strichman, Decision Procedures: An Algorithmic Point of View. Springer Publishing Company, Incorporated, 2008.
[7] C. S. Pǎsǎreanu, P. C. Mehlitz, D. H. Bushnell, K. Gundy-Burlet, M. Lowry, S. Person, and M. Pape, “Combining unit-level symbolic execution and system-level concrete execution for testing nasa software,” in ISSTA, 2008.
[8] G. Li, I. Ghosh, and S. P. Rajan, “KLOVER: A symbolic execution and automatic test generation tool for C++ programs,” in CAV, 2011.
[9] N. Bjørner, N. Tillmann, and A. Voronkov, “Path feasibility analysis for string-manipulating programs,” in TACAS, 2009.
[10] P. Hooimeijer and M. Veanes, “An evaluation of automata algorithms for string analysis,” in VMCAI, 2011.
[11] D. Shannon, I. Ghosh, S. Rajan, and S. Khurshid, “Efficient symbolic execution of strings for validating web applications,” in 2nd Workshop on Defects in Large Software Systems, 2009.
[12] M. Souza, M. Borges, M. d’Amorim, and C. S. Pǎsǎreanu, “CORAL: solving complex constraints for symbolic pathfinder,” in NFM, 2011.
[13] B. Dutertre and L. D. Moura, “The Yices SMT Solver,” Computer Science Laboratory, SRI International, Tech. Rep., 2006.
[14] A. S. Christensen, A. Møller, and M. I. Schwartzbach, “Precise analysis of string expressions,” in SAS, 2003.
[15] P. Cousot and R. Cousot, “Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints,” in POPL, 1977.
[16] V. Ganesh, A. Kiezun, S. Artzi, P. J. Guo, P. Hooimeijer, and M. D. Ernst, “HAMPI: A string solver for testing, analysis and vulnerability detection,” in CAV, 2011.
[17] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and D. Song, “A Symbolic Execution Framework for JavaScript,” in IEEE Symposium on Security and Privacy, 2010.
[18] M. Veanes, P. de Halleux, and N. Tillmann, “Rex: Symbolic regular expression explorer,” in ICST, 2010.
[19] F. Yu, T. Bultan, and O. H. Ibarra, “Symbolic string verification: Combining string analysis and size analysis,” in TACAS, 2009.
[20] F. Yu, M. Alkhalaf, and T. Bultan, “Stranger: An automata-based string analysis tool for PHP,” in TACAS, 2010.
[21] G. Redelinghuys, “Symbolic string execution,” Master’s thesis, University of Stellenbosch, 2012.
[22] G. Redelinghuys, W. Visser, and J. Geldenhuys, “Symbolic execution of programs with strings,” in SAICSIT Conf., 2012.
[23] P. Hooimeijer and W. Weimer, “Solving string constraints lazily,” in ASE, 2010.
[24] G. Li, P. Li, G. Sawaga, G. Gopalakrishnan, I. Ghosh, and S. P. Rajan, “GKLEE: Concolic verification and test generation for GPUs,” in PPoPP, 2012.