Code Verification by Static Analysis A Mathematica
All content following this page was uploaded by Fabrizio Marinelli on 21 May 2014.
∗
LIX, École Polytechnique, F-91128 Palaiseau, France
Email:[email protected], {leroux,liberti}@lix.polytechnique.fr
†
DIIGA, Università Politecnica delle Marche, Ancona, Italy
Email:[email protected]
Abstract
1 Introduction
In this paper we propose a mathematical program whose optimal solution encodes a sequence of intervals
X such that: (a) the relation x ∈ X holds for all values x taken by the variables x of a given computer
program during its execution, and (b) for all other sequences of intervals Y ⊊ X, the relation x ∈ Y might
become false during the execution of the program. An X obeying (a) is called a program invariant (because
the relation x ∈ X remains true during the program execution); among the program invariants, the
smallest ones (i.e. those obeying (b)) are most interesting. Invariants are used to verify given properties
of computer programs, such as for example “the variable xi never exceeds the bounds [0, 10]”: if we are
able to show that the smallest invariant for xi is [1, 5] then we are sure that the property is verified. This
should also explain why large invariants are less interesting: the interval [−∞, ∞] might be an invariant,
but it can only prove the trivial property xi ∈ [−∞, ∞]. The verification of such properties is very useful
in finding bugs before programs are actually run, and therefore is extremely important in safety-critical
applications. The discipline that studies program invariants is called Static Analysis (SA).
We recall that the syntax of a computer language over an alphabet is a set of rules establishing whether
a given string of characters in the alphabet belongs to the language or not. Its semantics is an assignment
of mathematical entities (for example, sets) to the variable symbols of strings in the language. Syntax
1 Corresponding author.
and semantics of a language are usually defined conjunctively by means of formal grammars (for example
using the software tools LEX and YACC [17]). SA by Abstract Interpretation (AI) [6, 7] aims to find
program invariants as over-approximations (also called abstract semantics) of the sets of values (also
called concrete semantics) that the program variables can take at each control point of the program.
We usually restrict abstract semantics to belong to a pre-specified class of sets (called abstract domain):
e.g. intervals, octagons, spheres, polyhedra, etc. Abstract domains may be relational (e.g. polyhedra,
linear equations, linear congruences, octagons), where each lattice element is parametrized according
to several variable symbols, or non-relational (e.g. intervals, translated integer ideals such as aZ + b).
Whenever the application requires finding the abstract semantics explicitly, domains which are amenable
to a finite numeric representation are used. For example, closed real intervals can be represented by
triplets (lower and upper bounds, and a binary value stating whether the interval is empty or not),
octagons can be represented by a set of eight planes, and so on. Abstract domains are made into lattices
(under setwise inclusion) via the definition of appropriate union and intersection operators. Given such a
lattice (L, ⊆), the action of the program on the abstract semantics can be seen as a function F mapping
L to itself. Thus some X ∈ L is invariant with respect to the program action if it obeys the fixpoint
equations for F :
X = F (X), (1)
usually called semantic equations. A solution of (1) is a fixpoint of F . In particular, the least fixpoint of
F with respect to set inclusion is the smallest invariant of the computer program encoded in F .
The best known techniques for finding a solution to (1) are Kleene’s Iteration (KI) and Policy Iteration
(PI). KI [6, 8] is an iterative procedure based on applying F to the largest lattice element until convergence
to a fixpoint is attained. The main disadvantage of KI is that in general there is no finite time bound
for its termination, even for programs only involving integer affine arithmetic. A significant body of
research on KI exists, however, implying bounded termination for many special cases (for example if the
program variables can only attain values within given bounds known a priori), excellent implementations
and good scalability with respect to program size. PI [4, 12, 13] is a sort of Newton’s method for lattices
which only converges to a guaranteed least fixpoint in bounded time under some additional conditions
on F — namely non-expansiveness — playing the same role as convexity in the traditional Newton’s
method. The main disadvantage of PI is that if the program is not non-expansive, there is no guarantee
about the minimality of the found fixpoint. Both techniques can be adapted to be used with most of the
abstract domains aforementioned: it suffices to re-define the lattice’s union and intersection operators.
The alternative approach proposed in this paper finds a guaranteed least fixpoint in (exponentially)
bounded time, at least for programs involving integer arithmetic. Although its applicability is currently
limited to the abstract domain of intervals, we hope to extend this approach to other domains. We use
the Mathematical Programming (MP) language in order to describe the smallest interval feasible in (1)
and employ a Branch-and-Bound algorithm to solve this MP to optimality. Depending on the operations
carried out in the computer program being analyzed, our MP may be a Mixed-Integer Linear Program
(MILP) or a Mixed-Integer Nonlinear Program (MINLP). Typical techniques for finding optimal solutions
of MILPs and MINLPs are based on the Branch-and-Bound (BB) algorithm [1]. In the case of MILPs, the
BB algorithm yields finite convergence to an exact optimum [27]. In the case of MINLPs, in general, the
spatial Branch-and-Bound (sBB) algorithm yields finite convergence only to an ϵ-approximate optimum
(for a given ϵ > 0) [26]. As long as the computer program is limited to integer affine arithmetic, we get
a MILP that we can solve exactly. If the computer program has integer non-affine arithmetic, we get
a MINLP that we can reformulate exactly to a MILP [19, 20]. If the computer program has floating
point affine arithmetic, we can derive a MILP whose optimal solution in principle exactly encodes the
least fixpoint of (1), but whose numeric parameters are so badly scaled that only a good approximation
can be provided. In the remaining case (floating-point non-affine arithmetic) we can only provide an
approximation. In this paper we only explicitly discuss the case of integer affine arithmetic.
We remark that optimization techniques were previously employed in software verification [5, 22, 23]
but in different contexts. In [5], the analysis addresses semi-algebraic programs of the form while B do
C od, where B is a boolean condition and C is an imperative command, and aims to establish loop
termination and (not necessarily smallest) invariants, which are found using Lagrangian relaxation and
semidefinite programming. In [22], a program is seen as a dynamical system: its termination and boundedness
properties are inferred from appropriate Lyapunov functions, which are approximated via finite sequences of
values found using convex optimization techniques. In [23], mathematical programming type constraints
are proposed to model several arithmetic operators; the union and intersection operators, however,
necessary to model (1), are not treated.
This paper makes two original contributions. Most importantly, it provides a new method for solving
the semantic equations (1) that finds a provably optimal solution in bounded time, at least for programs
with integer affine arithmetic, and approximate solutions for floating point and/or non-affine arithmetic.
Secondly, our MP is built from the computer program’s syntactical elements via a first reformulation
to a flowchart-like computational model and a second reformulation to semantic equations. This
contrasts with the more usual, static way of presenting an MP: it is interesting that a static set of
declarations (the MP) can capture the semantics of a dynamic computer program cast in an imperative
language.
The rest of this paper is organized as follows. In Sect. 2 we introduce a graph-based syntax (close to
flowcharts) for computer programs. This syntax allows us to derive semantic equations where the right
hand side involves at most one operator. In Sect. 3 we present an MP formulation whose constraints
model a relaxation of the semantic equations applied to interval domains, and whose objective function
identifies the least fixpoint of (1). In Sect. 4 we discuss the solution techniques employed and in Sect. 5
we present our computational results. Sect. 6 concludes the paper.
2 A graph model for the semantic equations
Abstract interpretation provides a way to represent the action of a computer program on the domain X
of the program variables as a function F acting on X. After the application of the program, the domain
becomes F (X). This means that domains with the property that X = F (X) are invariant with respect
to the action of the computer program: these are precisely the fixpoints of F . The main interest here is
finding the smallest fixpoints with respect to set inclusion, as they represent the minimal set of values
that the program variables may take after any run of the computer program. The semantic equations
(1) provide a representation of the semantics of a given computer program. Deriving (1) explicitly from
a piece of code can be done directly by using the formal grammar that generates the language used to
write the code (e.g. by means of such well-established tools as LEX and YACC [17]). Since the intended
readership of this paper is likely to be more literate on graph theory than formal languages, we are first
going to encode the concrete semantics of a program in a flowchart-like graph [3, 15], and then assign an
abstract semantics X to the arcs of this graph.
where δ⁻(v) = {u ∈ V | (u, v) ∈ A} and δ⁺(v) = {w ∈ V | (v, w) ∈ A}. The full framework of abstract
interpretation [6, 7] allows many types of code analyses (flow analysis, performance analysis and so on).
We are focusing here on the static determination of the domains of program variables. Because of this
scope limitation, we can drop some of the formalisms of [6] and only consider program variable domains
on the arcs A.
Let X be the domain of the program variables, extended with infinity values (e.g. for a program with
two 4-byte non-negative integer variables, $X = \{0, \dots, 2^{32} - 1, \pm\infty\}^2$). For all (u, v) ∈ A let X(uv) be
the domain of the program variables on the arc (u, v). Depending on the context we would also write
Xi to indicate the domain of the program variables on the arc i ≤ m. To all nodes u ∈ V with ℓ(u) = S
(assignment nodes) we associate a function φu : X → X such that, if the program variables x have values
x̄ at u and δ + (u) = {v}, then x have values φu (x̄) at v (we remark that we limit assignments to arithmetic
functions, all of them extended to deal with infinity values). For all nodes u ∈ V with ℓ(u) = T (test
nodes) we consider a set τu ⊆ X such that, for given values x̄ of the program variables x, the test is true
if x̄ ∈ τu and false otherwise. We can take the following statements as axioms describing the semantics
of the program.
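1. For a node $v \in V$ such that $\ell(v) = \mathsf{S}$, $\delta^-(v) = \{u\}$ and $\delta^+(v) = \{w\}$ we have
$X_{(vw)} = \varphi_v(X_{(uv)}) = \bigcup_{\bar{x} \in X_{(uv)}} \varphi_v(\bar{x})$.
2. For a node $v \in V$ such that $\ell(v) = \mathsf{T}$, $\delta^-(v) = \{u\}$ and $\delta^+(v) = \{w_1, w_2\}$ we have
$X_{(vw_1)} = X_{(uv)} \cap \tau_v$ and $X_{(vw_2)} = X_{(uv)} \cap (X \setminus \tau_v)$.
3. For every node $v \in V$ with $\ell(v)$ not in $\{\mathsf{S}, \mathsf{T}\}$ we have
$\bigcup_{u \in \delta^-(v)} X_{(uv)} = \bigcup_{w \in \delta^+(v)} X_{(vw)}$.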
By using (3)-(6) and by Axioms 1-3, for all (u, v) ∈ A it is possible to write the following equations
describing the action of the program on the domain variables on (u, v):
\begin{align}
X_{(uv)} &= f^-_{uv}(\{X_{(tu)} \mid t \in \delta^-(u)\}) \tag{7}\\
X_{(uv)} &= f^+_{uv}(\{X_{(vw)} \mid w \in \delta^+(v)\}), \tag{8}
\end{align}
where $f^-_{uv}$, $f^+_{uv}$ are functions whose exact form is derived from Axioms 1-3 and involves operators/functions/sets
∩, ∪, φ, τ . Eq. (7)-(8) provide the explicit form of the semantic equations (1). See Fig. 1 for a simple,
worked out example that appears in most of the abstract interpretation literature.
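The semantic equations of Fig. 1 can be explored numerically with a naive fixpoint iteration. The sketch below is only an illustration under stated assumptions: intervals are encoded as Python tuples, the empty interval as `None`, and the iteration starts from empty intervals on every arc; the helper names (`join`, `meet`, `add`, `kleene`) are hypothetical and not taken from the paper.

```python
import math

EMPTY = None  # assumed encoding of the empty interval


def join(a, b):
    """Interval union: the smallest interval containing both arguments."""
    if a is EMPTY:
        return b
    if b is EMPTY:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    """Interval intersection; EMPTY when the arguments do not overlap."""
    if a is EMPTY or b is EMPTY:
        return EMPTY
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else EMPTY


def add(a, b):
    """Interval sum; EMPTY if either operand is empty."""
    if a is EMPTY or b is EMPTY:
        return EMPTY
    return (a[0] + b[0], a[1] + b[1])


def F(X):
    """Right-hand sides of the six semantic equations listed in Fig. 1."""
    X1, X2, X3, X4, X5, X6 = X
    return (
        (-math.inf, math.inf),         # X1 = [-inf, inf]
        (1, 1),                        # X2 = [1, 1]
        join(X2, X5),                  # X3 = X2 u X5
        meet(X3, (-math.inf, 100)),    # X4 = X3 n [-inf, 100]
        add(X4, (1, 1)),               # X5 = X4 + [1, 1]
        meet(X3, (101, math.inf)),     # X6 = X3 n [101, inf]
    )


def kleene(F, size):
    """Apply F repeatedly, starting from empty intervals, until X = F(X)."""
    X = (EMPTY,) * size
    while True:
        Y = F(X)
        if Y == X:
            return X
        X = Y


X1, X2, X3, X4, X5, X6 = kleene(F, 6)
print(X3, X4, X5, X6)
```

For this particular example the iteration stops after roughly a hundred sweeps because the loop test bounds the variable; in general no such bound is available, which is the termination issue discussed in the introduction.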
Eq. (5) says that junction nodes cannot have other junction nodes as immediate predecessors. The
reason for this choice is that it reduces the number of nodes, as two successive junction nodes (j1 , j2 ) ∈ A
can be contracted into a single junction node j without changing the program semantics (Fig. 2). On the
other hand, this implies the existence of in-stars $\delta^-(v)$ of various cardinalities. The contraction operation
mentioned above, however, can be reversed: a single junction node j with $\delta^-(j) = \{v_1, \dots, v_k\}$ can be
replaced by a sequence of k − 1 junction nodes $\{j_1, \dots, j_{k-1}\}$ by adding arcs $(j_{i-1}, j_i)$ for all 2 ≤ i ≤ k − 1,
$(v_{i+1}, j_i)$ for all 1 ≤ i ≤ k − 1, and $(v_1, j_1)$ (Fig. 2). In the sequel, we employ the following, modified
version of (5):
\[ \forall v \in V \quad \ell(v) = \mathsf{J} \leftrightarrow |\delta^-(v)| = 2 \wedge |\delta^+(v)| = 1, \tag{9} \]
because in the corresponding semantic equations (1) the union operator always has at most two operands.
We call a program graph defined as per (2)-(6) a contracted graph, whereas the program graph obtained
by replacing (5) with (9) an expanded graph. It is easy to establish that in a program graph such
that (2)-(4), (6), (9) hold, the semantic equations (1) involving the union operator all have the form
X(vw) = X(u1 v) ∪ X(u2 v) for nodes u1 , u2 , v, w ∈ V . Furthermore, an expanded graph G = (V, A) and
a contracted graph G′ = (V ′ , A′ ) for the same program give rise to sets of semantic equations having
essentially the same set of solutions.
    //a: l(a) = E
    int x = 1; //b: l(b) = S
    //c: l(c) = J
    [Program graph: nodes a, b, c, d, ... joined by arcs 1, ..., 6, with τd = [−∞, 100]; drawing not reproduced]

    Semantic equations          Instantiated for intervals
    X1 = Id(input)              X1 = [−∞, ∞]
    X2 = φb(X1)                 X2 = [1, 1]
    X3 = X2 ∪ X5                X3 = [1, 1] ∪ X5
    X4 = X3 ∩ τd                X4 = X3 ∩ [−∞, 100]
    X5 = φe(X4)                 X5 = X4 + [1, 1]
    X6 = X3 ∩ (X ∖ τd).         X6 = X3 ∩ [101, ∞].

Figure 1: The program graph and semantic equations worked out for a simple but classic example.

[Figure 2: a sequence of junction nodes j1, j2 with predecessors v1, v2, v3, and the equivalent single junction node j with the same predecessors; drawing not reproduced]

Next, we remark that all tests can be written in the form (variable ∈ interval). Expressions f1 ∧ f2
appearing in tests can be replaced by sequences of adjacent test nodes representing “if f1 then if f2
then …”. Expressions f1 ∨ f2 appearing in tests can be rewritten as “if f1 then p endif; if f2 then
if ¬f1 then p endif”, where p is a set of instructions of the program to be executed only if f1 ∨ f2
holds. This program transformation can be applied recursively so that all tests are of the form f or ¬f,
where f are boolean conditions not including ∧ and ∨, i.e. they have the form $(f(x) \in \tau)$, where x are
program variables, $f : \mathbb{R}^n \to \mathbb{R}$ and $\tau = [\tau^L, \tau^U]$ is an interval. Negated conditions $f(x) \notin \tau$ can be
replaced by pairs $f(x) < \tau^L \vee f(x) > \tau^U$, which we know how to rewrite eliminating the ∨ operator.
Tests such as $f(x) \in \tau$ can be rewritten by adding a new program variable $y_f$ and an assignment node
$y_f = f(x)$ to the program, so that the test condition becomes $y_f \in \tau$. Finally, we can break up complex
assignment nodes $x_i = f(x)$ (where f is a mathematical expression involving arithmetic operators) into
a list of assignment nodes involving at most one operator on the right hand side by introducing
added program variables and assignment nodes. This is similar to the standard MINLP form described
in [24, 18].
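The last rewriting step, breaking a complex assignment into assignments with at most one operator each, can be sketched with Python's standard `ast` module. This is a minimal illustration, assuming Python 3.9 or later (for `ast.unparse`), only the operators +, − and ×, and hypothetical names `y1, y2, ...` for the added program variables; none of these choices come from the paper.

```python
import ast

# Operators supported by this sketch (an assumption, not the paper's full set).
OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def to_single_operator_form(assignment):
    """Split 'x = <expression>' into assignments with one operator each."""
    stmt = ast.parse(assignment).body[0]
    if not (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1):
        raise ValueError("expected a single assignment")
    steps = []

    def operand(expr):
        # Variables and constants are used as they are.
        if isinstance(expr, (ast.Name, ast.Constant)):
            return ast.unparse(expr)
        # Any other sub-expression gets its own added variable y1, y2, ...
        rhs = binary(expr)
        name = f"y{len(steps) + 1}"
        steps.append(f"{name} = {rhs}")
        return name

    def binary(expr):
        if not isinstance(expr, ast.BinOp) or type(expr.op) not in OPS:
            raise ValueError("unsupported expression")
        return f"{operand(expr.left)} {OPS[type(expr.op)]} {operand(expr.right)}"

    value = stmt.value
    rhs = binary(value) if isinstance(value, ast.BinOp) else operand(value)
    steps.append(f"{ast.unparse(stmt.targets[0])} = {rhs}")
    return steps


for line in to_single_operator_form("x = 2 * a + b + 1"):
    print(line)
```

Each emitted line corresponds to one assignment node of the rewritten program, so every resulting semantic equation involves a single unary or binary operator.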
This yields a rewriting of a program P into a program P′ with program variables x, y such that the
smallest fixpoint (X*, Y*) of the semantic equations for P′ has the property that X* is the smallest
fixpoint of the semantic equations for P. The argument (technical but not difficult) is based on using
the added assignment nodes in P′ in order to eliminate Y from the semantic equations for P′ and obtain
those for P . The relevance of this rewriting is that we can assume that the semantic equations only
involve unary and binary operators on the variable domains. Given a program graph G = (V, A), the
3 MATHEMATICAL PROGRAMMING FORMULATION 6
where O0 is the set of 0-ary operators (or constants), Ou = {Id, c×, ∩τ } is the set of unary operators (Id
being the identity, c× the multiplication by a constant, and ∩τ the intersection with a constant interval)
and Ob = {+, ∪} is the set of binary operators. We shall assume that for all considered domains (type of
sets assigned to the X symbols) we have Xi − Xj = Xi + (−1) × Xj : this is certainly true for intervals.
We note that we need Id because it is the operator assigned to the arc incident to the entry nodes, and
if the program graph represents a function in a program, the first arc actually “copies” the values passed
as function arguments.
One last important remark: every arc i ≤ m of the program graph encodes the changes carried out to
all program variables $x = (x_1, \dots, x_n)$. Thus, each $X_i$ is a sequence of sets $(X_{i1}, \dots, X_{in})$. If the operator
$\otimes_i$ on the arc i only changes the value of variable $x_j$, it is taken for granted that $\otimes_i$ acts as the identity
for all other program variables $x_k$ with k ≠ j. Hence, X is in fact an m × n rectangular array where $X_{ij}$
is a set for all i ≤ m, j ≤ n.
In this section, we assume that the domain type of the abstract semantics is I, the set of all closed
intervals in R ∪ {±∞}. Since with the syntax given in Sect. 2 each semantic equation (10) only involves
one operator that only acts on one single program variable, all the other program variables being fixed,
it suffices to exhibit MP constraints defining the behaviour of the following operators:
We first remark that we can relax set equality with set inclusion provided we minimize with respect
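to the interval width $|Z| = z^U - z^L$. More precisely, as long as R, S are not empty,
\begin{align*}
\forall \otimes \in O_0 \quad & \operatorname{argmin}\{|Z| \mid Z \supseteq \otimes\} = \{Z \in I \mid Z = \otimes\} \\
\forall \otimes \in O_u \quad & \operatorname{argmin}\{|Z| \mid Z \supseteq \otimes R\} = \{Z \in I \mid Z = \otimes R\} \\
\forall \otimes \in O_b \quad & \operatorname{argmin}\{|Z| \mid Z \supseteq R \otimes S\} = \{Z \in I \mid Z = R \otimes S\}.
\end{align*}
The provided MP constraints will therefore actually model a post-fixpoint $Z \in I^m$, i.e. an interval vector
Z such that Z ⊇ F(Z), equality being enforced by the minimization of the interval widths.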
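The relaxation argument can be checked by brute force on small integer intervals. The sketch below enumerates every interval Z inside an assumed search box [−20, 20] that contains a target interval and returns those of minimum width; the box size and the helper names are illustrative assumptions.

```python
def contains(z, r):
    """True if the interval z = (lower, upper) contains the interval r."""
    return z[0] <= r[0] and r[1] <= z[1]


def width(z):
    return z[1] - z[0]


def interval_sum(r, s):
    return (r[0] + s[0], r[1] + s[1])


def hull(r, s):
    """Interval union: smallest interval containing both r and s."""
    return (min(r[0], s[0]), max(r[1], s[1]))


def argmin_width_supersets(target, lo=-20, hi=20):
    """All minimum-width integer intervals within [lo, hi] containing target."""
    candidates = [
        (a, b)
        for a in range(lo, hi + 1)
        for b in range(a, hi + 1)
        if contains((a, b), target)
    ]
    best = min(width(z) for z in candidates)
    return [z for z in candidates if width(z) == best]


R, S = (1, 3), (2, 4)
print(argmin_width_supersets(interval_sum(R, S)))
print(argmin_width_supersets(hull(R, S)))
```

In both cases the only minimum-width superset is the target interval itself, which is why inclusion constraints combined with width minimization recover equality for non-empty operands.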
Observe that I is not closed under ∩τ since ∅ ∉ I. Indeed the behaviour of the empty set w.r.t. the
operators in $O_u \cup O_b$ is not the same as other intervals: Id∅ = c × ∅ = ∅ ∩ τ = ∅, and ∅ ∪ S = S, R ∪ ∅ = R,
∅ + S = R + ∅ = ∅. There is no interval in I with such properties: thus, the empty interval cannot be
represented in the form $[z^L, z^U]$. We therefore introduce, for each interval Z, a binary variable z̄ that has
value 1 if and only if Z is empty:
\[ Z = \begin{cases} [z^L, z^U] & \text{if } \bar{z} = 0 \\ \emptyset & \text{if } \bar{z} = 1. \end{cases} \tag{11} \]
Equivalently, each interval Z ∈ I is represented by a triplet $(z^L, z^U, \bar{z}) \in \mathbb{R}^2 \times \{0, 1\}$ such that if z̄ = 1
the MP constraints enforce the empty interval behaviour and, by convention, $z^L = z^U = 0$.
In all cases described in this section, the definition of interval must be enforced:
\[ z^L \le z^U, \quad r^L \le r^U, \quad s^L \le s^U. \tag{12} \]
In the following, we will choose a suitable constant to represent the infinity value, exploiting the
fact that practical computers are equivalent to bounded Turing machines rather than truly universal
Turing machines: commonly available implementations of the floating point number field FP are such
that ∃M ∈ R ∀r ∈ FP (|r| ≤ M) and ∃ε > 0 ∀r, s ∈ FP (r ≠ s → |r − s| > ε). Therefore, all the intervals
are considered bounded:
\[ -M \le z^L, \quad z^U \le M \tag{13} \]
and the assignment z = M is read as z = ∞. Notationwise, for a set $\Theta(x_1, \dots, x_p) \subseteq \mathbb{R}^p$ and a (sub)list
of coordinate directions $x' = (x_{i_1}, \dots, x_{i_q})$, we indicate by proj(Θ, x′) the projection of Θ on the subspace
spanned by unit vectors in the coordinate directions listed in x′. Most of the arguments below follow
from interval analysis [14].
\begin{align}
z^L &\le \tau^L \tag{14}\\
z^U &\ge \tau^U \tag{15}\\
\bar{z} &= 0. \tag{16}
\end{align}
The only point worth emphasizing is that in case an interval constant appears explicitly in the semantic
equations, we assume the interval to be non-empty (there exists no valid imperative assignment statement
in the reference programming language corresponding to the constant assignment Z = ∅). Thus, there
are some intervals Z for which z̄ is forced to be zero: without (16), the objective function direction (see
(56) below) would force all such variables to be set at 1, which would cause all the relevant constraints
to be inactive, which in turn would yield the empty set as the (invalid) least fixpoint of (10).
The unary operator Id(R) is modeled by slightly modifying constraints (14)-(16) as follows:
\begin{align}
z^L &\le r^L + 2M\bar{z} \tag{17}\\
z^U &\ge r^U - 2M\bar{z} \tag{18}\\
\bar{z} &= \bar{r}. \tag{19}
\end{align}
3.2 Sum
In interval arithmetic, the sum of two non-empty intervals R, S is the interval $Z = [r^L + s^L, r^U + s^U]$. In
order to extend the semantics of the sum operator to the set of closed intervals in R ∪ {±∞}, we introduce
the following binary variables and constraints:
• $z_+^{LR} = 1$ if and only if $r^L > -\infty$;
• $z_+^{UR} = 1$ if and only if $r^U < +\infty$;
• $z_+^{LS} = 1$ if and only if $s^L > -\infty$;
• $z_+^{US} = 1$ if and only if $s^U < +\infty$;
• $z_+^{L} = 1$ if $r^L = -\infty$ or $s^L = -\infty$;
• $z_+^{U} = 1$ if $r^U = +\infty$ or $s^U = +\infty$.
\begin{align}
\epsilon - M(3 - 2z_+^{LR}) \le r^L &\le M(2z_+^{LR} - 1) \tag{20}\\
M(1 - 2z_+^{UR}) \le r^U &\le M(3 - 2z_+^{UR}) - \epsilon \tag{21}\\
\epsilon - M(3 - 2z_+^{LS}) \le s^L &\le M(2z_+^{LS} - 1) \tag{22}\\
M(1 - 2z_+^{US}) \le s^U &\le M(3 - 2z_+^{US}) - \epsilon \tag{23}\\
2z_+^{L} &\ge 2 - z_+^{LR} - z_+^{LS} \tag{24}\\
2z_+^{U} &\ge 2 - z_+^{UR} - z_+^{US} \tag{25}\\
z^L &\le (r^L + s^L)(1 - z_+^{L}) - M z_+^{L} + 2M\bar{z} \tag{26}\\
z^U &\ge (r^U + s^U)(1 - z_+^{U}) + M z_+^{U} - 2M\bar{z} \tag{27}\\
2\bar{z} &\ge \bar{r} + \bar{s} \tag{28}\\
\bar{z} &\le \bar{r} + \bar{s}. \tag{29}
\end{align}
3.1 Lemma
Let $\Theta = \{(z^L, z^U, r^L, r^U, s^L, s^U, z_+^{LR}, z_+^{UR}, z_+^{LS}, z_+^{US}, z_+^{L}, z_+^{U}, \bar{z}, \bar{r}, \bar{s}) \mid (12), (13), (20)\text{-}(29)\}$. Then, if R, S ≠ ∅,
$Z = \mathrm{proj}(\Theta, (z^L, z^U)) \supseteq R + S$; otherwise z̄ = 1.
If one or both of the operands R and S are empty, i.e., if either r̄ = 1 or s̄ = 1, then (28) implies z̄ = 1. As
a consequence, constraints (26) and (27) are inactive and the variables $z^L$ and $z^U$ are free. □
Observe that constraints (26) and (27) are needed to guarantee model feasibility since they correctly
allow the operations $r^U + M = M$ and $r^L - M = -M$. Moreover, it is easy to provide cases having least
fixpoints with at least one interval that diverges to infinity.
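The behaviour of constraints (12), (13) and (20)-(29) can be inspected by exhaustive enumeration on a toy instance. The sketch below assumes integer bounds, M = 10 standing for infinity and ϵ = 1; intervals are triplets (lower, upper, empty flag), and the function names are hypothetical. It is a brute-force check, not the solution method used in the paper.

```python
from itertools import product

M, EPS = 10, 1  # assumed toy values: M plays the role of infinity, EPS = 1


def sum_feasible(z, r, s, b):
    """Check (12), (13), (20)-(29) for triplets (lower, upper, empty_flag)."""
    (zL, zU, zbar), (rL, rU, rbar), (sL, sU, sbar) = z, r, s
    LR, UR, LS, US, L, U = b  # the six binary variables z_+^{LR}, ..., z_+^{U}
    return all((
        zL <= zU and -M <= zL and zU <= M,               # (12), (13) for Z
        EPS - M * (3 - 2 * LR) <= rL <= M * (2 * LR - 1),  # (20)
        M * (1 - 2 * UR) <= rU <= M * (3 - 2 * UR) - EPS,  # (21)
        EPS - M * (3 - 2 * LS) <= sL <= M * (2 * LS - 1),  # (22)
        M * (1 - 2 * US) <= sU <= M * (3 - 2 * US) - EPS,  # (23)
        2 * L >= 2 - LR - LS,                              # (24)
        2 * U >= 2 - UR - US,                              # (25)
        zL <= (rL + sL) * (1 - L) - M * L + 2 * M * zbar,  # (26)
        zU >= (rU + sU) * (1 - U) + M * U - 2 * M * zbar,  # (27)
        2 * zbar >= rbar + sbar,                           # (28)
        zbar <= rbar + sbar,                               # (29)
    ))


def tightest_sum(r, s):
    """Return all minimum-width feasible triplets (zL, zU, zbar)."""
    feasible = set()
    for zL in range(-M, M + 1):
        for zU in range(zL, M + 1):
            for zbar in (0, 1):
                if any(sum_feasible((zL, zU, zbar), r, s, b)
                       for b in product((0, 1), repeat=6)):
                    feasible.add((zL, zU, zbar))
    best = min(zU - zL for zL, zU, _ in feasible)
    return {z for z in feasible if z[1] - z[0] == best}


print(tightest_sum((1, 3, 0), (2, 4, 0)))   # finite operands
print(tightest_sum((1, M, 0), (2, 4, 0)))   # upper bound of R at "infinity"
```

With finite operands the minimum-width solution is the ordinary interval sum; when the upper bound of R equals M, constraint (27) pins the upper bound of Z to M instead of M + 4, which is the saturation discussed above.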
The modeling of the c× operator is very similar to that described for the sum. Again, the main issue
concerns the extension of the semantics to the closed intervals in R ∪ {±∞}, but now we need fewer
variables and constraints since c× is a unary operator. Assume c ≥ 0 and consider the following binary
variables and constraints:
• $z_\times^{L} = 1$ if and only if $r^L > -\infty$;
• $z_\times^{U} = 1$ if and only if $r^U < +\infty$;
\begin{align}
\epsilon - M(3 - 2z_\times^{L}) \le r^L &\le M(2z_\times^{L} - 1) \tag{30}\\
M(1 - 2z_\times^{U}) \le r^U &\le M(3 - 2z_\times^{U}) - \epsilon \tag{31}\\
z^L &\le c\,r^L(1 - z_\times^{L}) - M z_\times^{L} + 2M\bar{z} \tag{32}\\
z^U &\ge c\,r^U(1 - z_\times^{U}) + M z_\times^{U} - 2M\bar{z} \tag{33}\\
\bar{z} &= \bar{r}. \tag{34}
\end{align}
3.4 Union
In interval analysis, the union of two non-empty intervals R, S is the convex hull of the setwise union
R ∪ S of the two intervals, i.e. the smallest interval containing both:
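\begin{align}
z^L &\le r^L + 2M\bar{z} \tag{35}\\
z^L &\le s^L + 2M\bar{z} \tag{36}\\
z^U &\ge r^U - 2M\bar{z} \tag{37}\\
z^U &\ge s^U - 2M\bar{z} \tag{38}\\
\bar{z} &\ge \bar{r} + \bar{s} - 1 \tag{39}\\
2\bar{z} &\le \bar{r} + \bar{s}. \tag{40}
\end{align}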
3.2 Lemma
Let $\Theta = \{(z^L, z^U, r^L, r^U, s^L, s^U, \bar{z}, \bar{r}, \bar{s}) \mid (12), (13), (35)\text{-}(40)\}$. Then, if either R or S is not empty (or
neither is), $Z = \mathrm{proj}(\Theta, (z^L, z^U)) \supseteq R \cup S$; otherwise, if R = S = ∅, z̄ = 1.
Proof. By (39) and (40), z̄ = 1 if and only if r̄ = s̄ = 1, i.e., if and only if R, S are both empty. In
that case, constraints (35)-(38) are all inactive. Otherwise, by (35)-(36) and (37)-(38), z̄ = 0 implies
$z^L \le \min(r^L, s^L)$ and $z^U \ge \max(r^U, s^U)$, so Z ⊇ R ∪ S. □
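The union constraints can be checked in the same exhaustive way. The sketch below assumes integer bounds and M = 10 as the infinity constant; the triplet encoding (lower, upper, empty flag) and the function names are illustrative assumptions.

```python
M = 10  # assumed toy constant playing the role of infinity


def union_feasible(z, r, s):
    """Check (12), (13) and (35)-(40) for triplets (lower, upper, empty_flag)."""
    (zL, zU, zbar), (rL, rU, rbar), (sL, sU, sbar) = z, r, s
    return all((
        zL <= zU and -M <= zL and zU <= M,  # (12), (13) for Z
        zL <= rL + 2 * M * zbar,            # (35)
        zL <= sL + 2 * M * zbar,            # (36)
        zU >= rU - 2 * M * zbar,            # (37)
        zU >= sU - 2 * M * zbar,            # (38)
        zbar >= rbar + sbar - 1,            # (39)
        2 * zbar <= rbar + sbar,            # (40)
    ))


def tightest_union(r, s):
    """Return all minimum-width feasible triplets (zL, zU, zbar)."""
    feasible = {
        (zL, zU, zbar)
        for zL in range(-M, M + 1)
        for zU in range(zL, M + 1)
        for zbar in (0, 1)
        if union_feasible((zL, zU, zbar), r, s)
    }
    best = min(zU - zL for zL, zU, _ in feasible)
    return {z for z in feasible if z[1] - z[0] == best}


print(tightest_union((1, 3, 0), (5, 6, 0)))
```

For two non-empty operands the unique minimum-width solution is the convex hull of R ∪ S, as stated in the lemma; when both operands carry the empty flag, every feasible triplet has z̄ = 1.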
3.5 Intersection
1. if $r^U < \tau^L$ then $Z = \emptyset$;
2. if $r^L > \tau^U$ then $Z = \emptyset$;
3. if $r^L \le \tau^L$ and $r^U \le \tau^U$ and $r^U \ge \tau^L$ then $Z = [\tau^L, r^U]$;
4. if $r^L \le \tau^L$ and $r^U \ge \tau^U$ then $Z = [\tau^L, \tau^U]$;
5. if $r^L \ge \tau^L$ and $r^U \le \tau^U$ then $Z = [r^L, r^U]$;
6. if $r^L \ge \tau^L$ and $r^U \ge \tau^U$ and $r^L \le \tau^U$ then $Z = [r^L, \tau^U]$;
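The six cases above collapse into a short computation: the intersection is empty in cases 1 and 2, and otherwise its endpoints are the larger lower bound and the smaller upper bound. A minimal sketch, assuming non-empty operands encoded as (lower, upper) tuples and `None` for the empty result:

```python
def intersect(r, tau):
    """Intersection of two non-empty closed intervals given as (lower, upper)."""
    rL, rU = r
    tL, tU = tau
    if rU < tL or rL > tU:
        return None                    # cases 1 and 2: empty intersection
    return (max(rL, tL), min(rU, tU))  # cases 3-6


tau = (3, 10)
print(intersect((0, 2), tau))    # case 1
print(intersect((12, 15), tau))  # case 2
print(intersect((0, 5), tau))    # case 3
print(intersect((0, 20), tau))   # case 4
print(intersect((4, 6), tau))    # case 5
print(intersect((5, 20), tau))   # case 6
```

The mixed-integer constraints that follow encode this same case distinction with binary variables, since max and min cannot be written directly as linear constraints.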
The intersection R ∩ τ can be modeled by the following binary variables and constraints:
Variables $z_\cap^{UL}$ and $z_\cap^{LU}$ are needed to model the empty intersection. It is immediate to see that
(41)-(42) define $z_\cap^{UL}$ and (43)-(44) define $z_\cap^{LU}$. Constraint (45) ensures that Z will be set to empty if at
least one of the intervals R and τ is empty, whereas constraint (46) models cases 1 and 2. Finally,
constraint (47) forces Z to be non-empty when R or τ are not empty or when cases 1 and 2 do not
occur.
Variables $z_\cap^{L\tau}$ and $z_\cap^{LR}$ and constraints (50) and (51) assign to the lower endpoint of Z one lower
endpoint between $\tau^L$ and $r^L$. Similarly, the upper endpoint of Z is modeled by variables $z_\cap^{U\tau}$ and $z_\cap^{UR}$
and by constraints (52) and (53). Clearly, $z_\cap^{L\tau}$ and $z_\cap^{LR}$ as well as $z_\cap^{U\tau}$ and $z_\cap^{UR}$ are mutually exclusive,
as imposed by constraints (48) and (49).
Observe that if z̄ is set to one then all the variables $z_\cap^{L\tau}$, $z_\cap^{LR}$, $z_\cap^{U\tau}$ and $z_\cap^{UR}$ are forced to be zero by
constraints (48) and (49) and, therefore, variables $z^L$ and $z^U$ are free.
3.3 Lemma
Let $\Theta = \{(z^L, z^U, r^L, r^U, \tau^L, \tau^U, z_\cap^{LU}, z_\cap^{UL}, z_\cap^{L\tau}, z_\cap^{LR}, z_\cap^{U\tau}, z_\cap^{UR}, \bar{z}, \bar{r}, \bar{\tau}) \mid (12), (13), (41)\text{-}(53)\}$. Then
$Z = \mathrm{proj}(\Theta, (z^L, z^U))$ contains R ∩ τ, otherwise z̄ = 1.
Case 1: by (41)-(42), $r^U < \tau^L$ implies $z_\cap^{UL} = 1$ and, by (46), z̄ = 1. Constraints (48) and (49) then
enforce $z_\cap^{L\tau} = z_\cap^{LR} = z_\cap^{U\tau} = z_\cap^{UR} = 0$ and therefore (50)-(53) are all satisfied for any value of $z^L$ and $z^U$.
Finally, by (12), $r^L \le \tau^U$ and hence $z_\cap^{LU} = 0$ by (43) and (44).
Case 2: by (43) and (44), $\tau^U < r^L$ implies $z_\cap^{LU} = 1$ and, by (46), z̄ = 1. As in the previous case,
constraints (48) and (49) enforce $z_\cap^{L\tau} = z_\cap^{LR} = z_\cap^{U\tau} = z_\cap^{UR} = 0$ and therefore (50)-(53) are all satisfied
for any value of $z^L$ and $z^U$. Since $\tau^L \le r^U$ is implied by (12), $z_\cap^{UL} = 0$ is enforced by (41) and (42).
Case 3: $r^U \ge \tau^L$, (41) and (42) imply $z_\cap^{UL} = 0$, whereas $z_\cap^{LU} = 0$ by (12), (43) and (44). The setting
$z_\cap^{UL} = z_\cap^{LU} = 0$ makes (46) inactive and enforces z̄ = 0 through (47). By (48), exactly one between
$z_\cap^{L\tau} = 1$ and $z_\cap^{LR} = 1$ must occur. In the former case $z^L = \tau^L$ by (50); in the latter case $z^L = r^L$ by
(51). Analogously, by (49), exactly one between $z_\cap^{U\tau} = 1$ and $z_\cap^{UR} = 1$ must occur. In the former case
$z^U = \tau^U$ by (52); in the latter case $z^U = r^U$ by (53). In any case, Z contains R ∩ τ.
Cases 4 and 5: (12), (41) and (42) enforce $z_\cap^{UL} = 0$, whereas (12), (43) and (44) enforce $z_\cap^{LU} = 0$. So far
the cases boil down to Case 3.
Case 6: $r^L \le \tau^U$, (43) and (44) imply $z_\cap^{LU} = 0$, whereas $z_\cap^{UL} = 0$ by (12), (41) and (42). So far the case
boils down to Case 3.
Now suppose R = ∅ or τ = ∅, i.e., r̄ = 1 or τ̄ = 1. Constraints (45) and (47) enforce z̄ = 1 and make
(46) inactive. $z_\cap^{L\tau} = z_\cap^{LR} = z_\cap^{U\tau} = z_\cap^{UR} = 0$ by constraints (48) and (49) and therefore (50)-(53) are all
satisfied for any value of $z^L$ and $z^U$. □
We now consider (10) where $X_i = (X_{i1}, \dots, X_{in})$, with $X_{ij} \in I(M)$ for all i ≤ m and j ≤ n. The
empty interval needs special handling whenever it appears in unreachable segments of code: $X_{ij} = \emptyset$ in
fact implies that the j-th variable cannot take any value at control point i or, equivalently, it signals
that the control point i is never executed for any given program input. Clearly, if the control point i is
unreachable, no variable can ever take a value at i. Thus, $X_{ij} = \emptyset$ implies $X_{ik} = \emptyset$ for any k ≠ j. This
behaviour however is not correctly modeled by the constraints given above, as shown in the following
example:
The intersection X31 = [0, 0] ∩ X21 results in an empty interval indicating that the code x2 = 4 is
unreachable (the conditional statement x1 == 0 is in fact always false). However, the equation X42 =
[4, 4], being modeled by constraints (14)-(16), results in the non-empty interval [4, 4]. Therefore, X72 =
[0, 4] instead of X72 = [0, 0].
Due to unreachability, the interval Xij = R ⊗ S should be empty even though the constraints given
above for R ⊗ S would normally not yield the empty interval. Therefore, when n > 1 a further binary
variable x̂ is required in order to distinguish the case R ⊗ S = ∅ from the case when i is unreachable.
Thus, we introduce:
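\[ \hat{x}_{ij} = \begin{cases} 1 & \text{if } R \otimes S = \emptyset \\ 0 & \text{otherwise,} \end{cases} \tag{54} \]
replace z̄ with ẑ in (16), (19), (28), (29), (34), (39), (40), (45), (46) and (47), and add the following
linking constraints:
\[
\begin{aligned}
\forall i \le m,\ j \le n \quad & \bar{x}_{ij} \le n\,\hat{x}_{ij} + \sum_{k=1,\,k \ne j}^{n} \bar{x}_{ik} \\
\forall i \le m,\ j \le n \quad & \hat{x}_{ij} \le n\,\bar{x}_{ij} - \sum_{k=1,\,k \ne j}^{n} \bar{x}_{ik} \\
\forall i \le m \quad & \sum_{k=1}^{n} \bar{x}_{ik} \le n \sum_{k=1}^{n} \hat{x}_{ik}.
\end{aligned}
\tag{55}
\]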
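The effect of the linking constraints (55) at a single control point i can be verified by enumerating all binary assignments. The sketch below assumes the three inequalities exactly as written above, with `xbar` and `xhat` holding the n values x̄_ik and x̂_ik; the function names are hypothetical.

```python
from itertools import product


def linking_ok(xbar, xhat):
    """Check the three families of constraints (55) at one control point."""
    n = len(xbar)
    for j in range(n):
        others = sum(xbar) - xbar[j]            # sum over k != j of xbar_k
        if not xbar[j] <= n * xhat[j] + others:
            return False
        if not xhat[j] <= n * xbar[j] - others:
            return False
    return sum(xbar) <= n * sum(xhat)


def feasible_pairs(n):
    """All binary pairs (xbar, xhat) satisfying (55) for n program variables."""
    bits = list(product((0, 1), repeat=n))
    return {(xbar, xhat) for xbar in bits for xhat in bits
            if linking_ok(xbar, xhat)}


for xbar, xhat in sorted(feasible_pairs(2)):
    print("xbar =", xbar, " xhat =", xhat)
```

In this enumeration every feasible assignment has all x̄_ik equal: they are all 1 as soon as some operator at the control point yields the empty set (some x̂_ik = 1), and all 0 otherwise, which matches the unreachability behaviour described above.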
For each i ≤ m, the i-th semantic equation $X_i = F_i(X)$ is replaced by its post-fixpoint relaxation
$X_i \supseteq F_i(X)$: we let $g_i(x^L, x^U, \bar{x}, \hat{x}, z_\otimes) \le 0$ be the constraints (given in Sect. 3.1-3.5) whose feasible
region defines precisely all those intervals X such that $X_i \supseteq F_i(X)$.
3.4 Theorem
The least fixpoint of (10) is the optimal solution of the following mathematical program:
\[
\left.
\begin{array}{rl}
\min & \sum\limits_{i \le m} \sum\limits_{j \le n} (x^U_{ij} - x^L_{ij} - \bar{x}_{ij}) \\
\forall i \le m & g_i(x^L, x^U, \bar{x}, \hat{x}, z_\otimes) \le 0 \\
 & -M \le x^L \le x^U \le M \\
 & \bar{x}, \hat{x}, z_\otimes \in \{0, 1\}.
\end{array}
\right\}
\tag{56}
\]
Proof. First note that $I(M)^{mn}$ is a complete lattice with respect to inclusion, union (defined over intervals
as the smallest interval containing both arguments), and intersection.
Claim 1. For all i ≤ m, $F_i$ is monotone in the interval lattice.
For all U, W, Y ∈ I(M) such that U ⊆ W, it is easy to establish that U ∩ Y ⊆ W ∩ Y, U ∪ Y ⊆ W ∪ Y,
U + Y ⊆ W + Y. For these binary operators, monotonicity in the second argument follows by
commutativity. For unary operators, c × U ⊆ c × W (c ∈ R) is also easy to establish.
Since F is monotone, it has a unique least fixpoint by Tarski’s fixpoint theorem [25]: there is $X \in I(M)^{mn}$
such that F(X) = X. Furthermore, the least fixpoint coincides with the least post-fixpoint of F [25],
i.e. F(X) ⊆ X. As shown in Sect. 3.1-3.5, the intervals in $I(M)^{mn}$ that are feasible with respect to the
constraints $g_i(x^L, x^U, \bar{x}, \hat{x}, z_\otimes) \le 0$ are precisely those for which $F_i(X) \subseteq X_i$ for each i ≤ m.
Claim 2. The objective function, mapping $I(M)^{mn}$ to R, is strictly increasing.
For non-empty intervals, this is obvious. Otherwise, notice that whenever $\bar{x}_{ij} = 1$ for some
i ≤ m, j ≤ n, all the relevant constraints in $g_i \le 0$ are inactive, save for (12): the objective
function direction then ensures that at the optimum, $x^L_{ij} = x^U_{ij}$. Since $\bar{x}_{ij} = 1$ if and only if
$X_{ij} = \emptyset$, the value of the objective function when $X_{ij}$ is empty is lower.
Thus, any globally optimal solution of (56) is the least fixpoint of F. □
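The statement of the theorem can be illustrated on a scaled-down variant of the example of Fig. 1 without any MILP solver. The sketch below makes several assumptions that are not in the paper: the loop test is x ≤ 3 instead of x ≤ 100, intervals range over the integers 0..4, the empty interval is `None`, only arcs 3 to 6 are treated as unknowns, and post-fixpoints are found by plain enumeration rather than Branch-and-Bound.

```python
from itertools import product

M = 4          # assumed small bound playing the role of infinity
EMPTY = None   # assumed encoding of the empty interval
INTERVALS = [EMPTY] + [(a, b) for a in range(M + 1) for b in range(a, M + 1)]


def contains(z, r):
    if r is EMPTY:
        return True
    if z is EMPTY:
        return False
    return z[0] <= r[0] and r[1] <= z[1]


def join(a, b):
    if a is EMPTY:
        return b
    if b is EMPTY:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a, b):
    if a is EMPTY or b is EMPTY:
        return EMPTY
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else EMPTY


def add(a, b):
    if a is EMPTY or b is EMPTY:
        return EMPTY
    return (a[0] + b[0], a[1] + b[1])


def F(X):
    """Equations for arcs 3..6 of the scaled-down loop (X2 = [1, 1] inlined)."""
    X3, X4, X5, X6 = X
    return (join((1, 1), X5), meet(X3, (0, 3)), add(X4, (1, 1)), meet(X3, (4, M)))


def cost(X):
    """Objective of (56): interval width, minus one for each empty interval."""
    return sum(-1 if x is EMPTY else x[1] - x[0] for x in X)


def least_cost_post_fixpoint():
    """Minimise the objective over all X with X_i containing F_i(X)."""
    post = [X for X in product(INTERVALS, repeat=4)
            if all(contains(x, fx) for x, fx in zip(X, F(X)))]
    return min(post, key=cost)


def iterate_from_empty():
    """Naive fixpoint iteration, used here only as an independent check."""
    X = (EMPTY,) * 4
    while F(X) != X:
        X = F(X)
    return X


print(least_cost_post_fixpoint())
print(iterate_from_empty())
```

On this toy instance the minimum-cost post-fixpoint coincides with the fixpoint reached by iteration, in line with the argument of the proof: every post-fixpoint contains the least fixpoint, and the objective is strictly increasing.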
The theorem above shows that when the domain type for program variables consists of uniformly
bounded closed intervals, the least fixpoint of the semantic equations (1) is characterised as the optimal
solution of a special mathematical program that is built component-wise (i.e. control point by control
point) along the abstract semantics of the program.
If only integer arithmetic is used in the computer program, i.e., when all decision variables xL , xU are
constrained to be integer, the mathematical model (56) can be always written as a MILP. In fact, in the
case of affine arithmetic the operators appearing in (10) are only constant, Id, c×, +, ∩τ and ∪. The
associated constraints (see Sect. 3.1-3.5) are all linear except (26)-(27), (32)-(33), (41)-(44) and (50)-(53),
exhibiting products between a binary and a continuous (or integer) decision variable. Due to their particular
form, they admit a well-known exact linear reformulation (in the sense given in [19, 20]).
In the case of non-affine integer arithmetic, additional operators (such as e.g. inverse, constant power
and product) must be considered. The constraints needed to model them — which we do not report here
— exhibit nonlinear terms, but all the nonlinearities can be reformulated exactly to linear terms via the
introduction of added variables and constraints (in particular, by the reformulations Int2Bin, ProdBin,
ProdBinCont, AbsDiff and Step 2, p. 188 of [20]). Finally, we can choose ε = 1, which means that
the obtained MILP will be well-scaled and practically solvable.
If floating point affine arithmetic is used in the computer program, we get a MILP where ε must be set
to the smallest possible floating point number: although the optimal solution of this MILP in principle
encodes the guaranteed least fixpoint of (10), practical solution will not be achievable without a possibly
large numerical error. In any case, for a suitable choice of ε we still derive an over-approximation of the
least fixpoint, which is a valid invariant. If floating point non-affine arithmetic is used, products, inverses
and powers of continuous decision variables cannot be reformulated to linear terms exactly as in the
integer case. Although the obtained MINLP is badly scaled, its optimal solution encodes the guaranteed
least fixpoint of (10). However, in this case even a good choice of ε would yield a MINLP for which we
can only hope to find an ε-approximate solution at best.
4 Solution techniques
The model developed in Sect. 3 belongs to the family of MILP formulations. These can be solved by
employing an exact BB algorithm such as the one implemented in CPLEX [16] or a heuristic such as local
branching [10] or feasibility pump [9] if a guarantee of fixpoint minimality is not needed. More precisely,
the qualitative difference between an exact and a heuristic algorithm, in this context, concerns only the minimality of the objective function value. In other words, both approaches provide valid solutions to the semantic equations (1), as these are the constraints of the model, but only the exact algorithm can guarantee optimality of the solution with respect to inclusion-wise fixpoint minimization.
Our implementation employs a parser to transform a C program into a set of semantic equations
(1). These are then automatically transformed into a mathematical program as per Sect. 3 expressed
in the AMPL language [11]. The AMPL interpreter is finally instructed to call the CPLEX 11 solver
[16], which solves the problem to optimality. As further detailed in Sect. 5, although the worst-case time
complexity of any BB algorithm is exponential, the particular form of the mathematical programming
formulations obtained from typical C code snippets includes many equality constraints, which usually speed
up the solution process considerably (as they may be used to substitute a decision variable out of the
model). We were able to ascertain empirically that the practical performance of the proposed solution
method mainly depends on the number of intersections present in the semantic equations, i.e., on the
number of test nodes present in the program graph. Therefore the BB performance would seem to be
mainly linked to the cyclomatic number, the well-known McCabe software complexity metric [21], rather
than to the actual number of code lines.
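For reference, the cyclomatic number of a control-flow graph with E arcs, N nodes and P connected components is E − N + 2P. The sketch below computes it for small hand-written graphs; the edge-list encoding and the node names are assumptions made for illustration, not the program-graph representation used in the implementation.

```python
from typing import Iterable, Tuple

Edge = Tuple[str, str]

def cyclomatic_number(edges: Iterable[Edge], num_components: int = 1) -> int:
    """McCabe's cyclomatic number E - N + 2P of a control-flow graph."""
    edge_list = list(edges)
    nodes = {v for e in edge_list for v in e}
    return len(edge_list) - len(nodes) + 2 * num_components

# Straight-line code: no test node.
straight = [("a", "b"), ("b", "c")]

# A single while loop: entry -> test -> body -> test, test -> exit.
while_loop = [("entry", "test"), ("test", "body"), ("body", "test"), ("test", "exit")]

# Two loops in sequence: each test node adds one to the metric.
two_loops = while_loop + [
    ("exit", "test2"), ("test2", "body2"), ("body2", "test2"), ("test2", "exit2"),
]
```

Each test node adds one to the metric regardless of how many lines the loop body contains, which is consistent with the observation that the number of intersections, rather than the code length, drives the BB effort.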
Mathematical program (56) relies on the numerical constant M, which represents a valid upper bound on the finite values that each variable can take at each control point of the program. The value assigned to M can be critical. On the one hand, underestimated values for M result in a feasible set that is strictly included in the set of fixpoints of F, or even empty, so that Thm. 3.4 no longer holds. On the other hand, excessively large values for M could make the model ill-conditioned and harder to solve. We are therefore interested in computing the smallest valid value for the upper bound M. Let i be the arc between nodes u and v, and let $\bar{X}_i = [\bar{x}^L_i, \bar{x}^U_i]$, i ≤ m, be a tight over-approximation of $X_i$, with $\bar{x}^L_i, \bar{x}^U_i \in (\mathbb{R} \cup \{\pm\infty\})^n$. $\bar{X}_i$ can be easily derived on the basis of the rules listed in Table 1:
operator       X_i                  x̄^L_i                    x̄^U_i
constant       X_i = [a^L, a^U]     a^L                      a^U
identity       X_i = Id(X_j)        x̄^L_j                    x̄^U_j
arithmetic     X_i = φ_u(X_j)       φ_u(x̄^L_j)               φ_u(x̄^U_j)
intersection   X_i = X_j ∩ τ        τ^L                      τ^U
union          X_i = X_j ∪ X_k      min{x̄^L_j, x̄^L_k}        max{x̄^U_j, x̄^U_k}
4.1 Proposition
The smallest valid upper bound M for program (56) is given by
$$M = \max\Big\{ \max_{i \le m} \{ |\bar{x}^L_i| : \bar{x}^L_i \ne -\infty \},\ \max_{i \le m} \{ |\bar{x}^U_i| : \bar{x}^U_i \ne +\infty \} \Big\} + \varepsilon. \qquad (57)$$
Proof. The set V of the nodes of the program graph can be totally ordered so that each cycle $(v_p, \ldots, v_q)$ of G with p < q (representing loops in the program) has $\ell(v_p) = \mathrm{J}$. Therefore, we can assume j < i for all the indices in Table 1 and, except for the union operator, an over-approximation of each interval $X_i$ can be obtained by a forward substitution process. Observe that the operand $X_k$ in Table 1 corresponds to the arc (q, p) of the relevant cycle and, due to the particular structure of cycles, $\ell(v_{p+1}) = \mathrm{T}$. Since the intersection is over-approximated by the constant interval τ and $\ell(v_h) \ne \mathrm{J}$ for p < h ≤ q, all
the intervals between nodes $v_{p+1}$ and $v_q$ can be over-approximated by forward substitution as well, and therefore so can $X_k$. □
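The substitution process of the proof can be sketched as follows for a single program variable (n = 1). The rules are those of Table 1 and the final bound follows (57). The dictionary encoding of the equations, the toy program, the assumption that every arithmetic operator φ_u is monotonically increasing, and all function names are illustrative assumptions. Because an intersection is bounded by τ alone, the recursion never follows a loop back to its own join node.

```python
import math
from functools import lru_cache

INF = math.inf

# Hypothetical encoding of the equations for: x = 0; while (x < 10) x = x + 1;
EQUATIONS = {
    1: ("constant", (0, 0)),
    2: ("union", 1, 4),                      # join node: X2 = X1 U X4
    3: ("intersection", 2, (-INF, 9)),       # test x < 10 holds
    4: ("arithmetic", 3, lambda v: v + 1),   # assumed increasing operator
    5: ("intersection", 2, (10, INF)),       # loop exit
}

def over_approximation(equations):
    """Apply the rules of Table 1 to every interval X_i."""
    @lru_cache(maxsize=None)
    def bound(i):
        eq = equations[i]
        kind = eq[0]
        if kind == "constant":
            return eq[1]
        if kind == "identity":
            return bound(eq[1])
        if kind == "arithmetic":
            lo, hi = bound(eq[1])
            phi = eq[2]
            return (phi(lo), phi(hi))
        if kind == "intersection":
            return eq[2]  # bounded by tau only; the operand X_j is ignored
        if kind == "union":
            (lj, uj), (lk, uk) = bound(eq[1]), bound(eq[2])
            return (min(lj, lk), max(uj, uk))
        raise ValueError(f"unknown operator: {kind}")

    return {i: bound(i) for i in equations}

def smallest_big_m(bounds, eps=1):
    """Formula (57): largest finite absolute bound, plus eps."""
    lows = [abs(lo) for lo, _ in bounds.values() if lo != -INF]
    highs = [abs(hi) for _, hi in bounds.values() if hi != INF]
    return max(lows + highs, default=0) + eps

bounds = over_approximation(EQUATIONS)
big_m = smallest_big_m(bounds)
```

On this toy system the only finite bounds are 0, 9 and 10, so the sketch returns M = 11 with ε = 1, which does dominate every value that x takes during execution.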
Similar arguments can be used for computing the largest valid value for the lower bound ε. However, in our computational experiments ε was set to 1, since all the instances use only integer variables.
Our implementation is able to deal with many program variables of different types (integer or floating
point), arrays, and function calls. In particular, each array is treated as a single summary object.
According to this approach, known as array smashing [2], the interval that represents the array is shared
by all the array elements.
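A minimal sketch of the idea behind array smashing is given below: every write is joined into one shared interval, whatever the index, and every read returns that interval. The class and method names are hypothetical and do not reflect the actual implementation.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]

class SmashedArray:
    """Abstract array whose elements all share one summary interval."""

    def __init__(self) -> None:
        self.summary: Optional[Interval] = None  # None: nothing written yet

    def write(self, index: int, value: Interval) -> None:
        # The index is deliberately ignored: the written value is joined
        # into the summary, so earlier values are never overwritten.
        if self.summary is None:
            self.summary = value
        else:
            lo, hi = self.summary
            self.summary = (min(lo, value[0]), max(hi, value[1]))

    def read(self, index: int) -> Optional[Interval]:
        # Any element may hold any value covered by the summary.
        return self.summary

a = SmashedArray()
a.write(0, (1, 1))    # a[0] = 1
a.write(1, (7, 7))    # a[1] = 7
a.write(2, (-2, -2))  # a[2] = -2
```

After these three writes, reading any element yields the interval [-2, 7]. The representation needs only one interval per array, at the price of losing all per-element precision.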
5 Computational results
Computational experience on a set of small instances has been gathered in order to validate the mathematical model. These instances can be downloaded from http://www.lix.polytechnique.fr/~liberti/verif-instances.zip. These are simple C programs without pointers or dynamic memory allocation.
For each instance, Table 2 lists the number of lines of code, variables, loops, and the maximum level of
loop nesting (columns lines, vars, loops and indent lev., respectively).
For each instance, column simplex reports the total number of pivot operations needed to solve the
linear relaxations associated to the nodes of the branch-and-bound tree, whereas column BB nodes shows
the size of the branch-and-bound tree.
The computational results were obtained on an Intel Xeon 2.4GHz with 8GB RAM running Linux.
Notice that 84% of the instances are solved at the root node (i.e. no variable branching is ever required) just by means of preprocessing and/or the simplex algorithm; 32% of these are solved by preprocessing only. For all instances requiring full BB, the search tree is very small: 46.71 nodes on average. The mean number of simplex iterations is also very small (79.04), resulting in negligible computational times in practice. This suggests that our method could also be used on considerably larger instances than those we tested.
Solutions obtained by CPLEX 11 have been compared to those provided by an implementation of the
Policy Iteration (PI) algorithm [4], which only guarantees optimality in the case of non-expansive fixpoint
mappings F . This is apparent in Table 2: instances for which our method finds a better fixpoint are
emphasized in boldface. As per the usual efficacy/efficiency trade-off, it is also apparent that the simple
PI implementation we used is one order of magnitude faster than the state-of-the-art CPLEX solver; so
for large-scale programs, obtaining a guaranteed optimal solution might be computationally infeasible.
Lastly, our test PI implementation could not tackle static arrays and function calls (hence the ‘-’ sign in
the table).
6 Conclusion and extensions
We described a new method, based on mathematical programming, for finding guaranteed least fixpoints of the semantic equations arising from static analysis of computer programs. We developed our modelling procedure using non-relational domains (interval domains for the values of program variables) and presented some promising computational results.
One extremely attractive feature of using mathematical programming as a solution method for the semantic equations (1) is that it allows arbitrary relations between the program variables to be added seamlessly. Supposing we know that during the execution of a given program a set of program variables $x \in \mathbb{R}^n$ always attains values within a given polyhedron $\{x \in \mathbb{R}^n \mid Ax \le b\}$, it suffices to add the constraints linking lower and upper variable bounds according to the linear relations $Ax \le b$. This has a special relevance
Acknowledgments
We are profoundly indebted to Prof. Eric Goubault, who introduced us to the mysteries of static analysis by abstract interpretation. Financial support from the Île-de-France research council (post-doctoral fellowship), the System@tic consortium (“EDONA” project), ANR 07-JCJC-0151 “Ars”, ANR 08-SEGI-023 “Asopt” and Digiteo Emergence “Paso” is gratefully acknowledged.
References
[1] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. Data Structures and Algorithms. Addison-Wesley,
Reading, MA, 1983.
[2] B. Blanchet, P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival.
Design and implementation of a special-purpose static program analyzer for safety-critical real-
time embedded software. In T. Mogensen, D. Schmidt, and I. Sudborough, editors, The Essence of
Computation: Complexity, Analysis, Transformation, volume 2566 of LNCS, pages 85–108. Springer-
Verlag, Berlin, 2002.
[3] C. Böhm and G. Jacopini. Flow diagrams, Turing machines and languages with only two formation
rules. Communications of the ACM, 9(5):366–371, 1966.
[4] A. Costan, S. Gaubert, E. Goubault, M. Martel, and S. Putot. A policy iteration algorithm for
computing fixed points in static analysis of programs. In K. Etessami and S.K. Rajamani, editors,
Computer Aided Verification, volume 3576 of LNCS, pages 462–475. Springer, 2005.
[5] P. Cousot. Proving program invariance and termination by parametric abstraction, Lagrangian
relaxation and semidefinite programming. In R. Cousot, editor, Verification, Model Checking and
Abstract Interpretation, volume 3385 of LNCS, pages 17–19, 2005.
[6] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of
programs by construction of approximations of fixed points. Principles of Programming Languages,
4:238–252, 1977.
[7] P. Cousot and R. Cousot. Systematic design of program analysis frameworks. In Conference Record
of the Sixth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages,
pages 269–282, San Antonio, Texas, 1979. ACM Press, New York, NY.
[8] P. Cousot and R. Cousot. Comparing the Galois connection and widening/narrowing approaches
to abstract interpretation. In M. Bruynooghe and M. Wirsing, editors, Programming Language
Implementation and Logic Programming, volume 631 of LNCS, pages 269–295. Springer, 1992.
[9] M. Fischetti, F. Glover, and A. Lodi. The feasibility pump. Mathematical Programming, 104(1):91–
104, 2005.
[10] M. Fischetti and A. Lodi. Local branching. Mathematical Programming, 98:23–47, 2003.
[11] R. Fourer and D. Gay. The AMPL Book. Duxbury Press, Pacific Grove, 2002.
[12] S. Gaubert, E. Goubault, A. Taly, and S. Zennou. Static analysis by policy iteration on relational
domains. In R. De Nicola, editor, European Symposium on Programming (ESOP), volume 4421 of
LNCS, pages 237–252. Springer, 2007.
[13] T. Gawlitza and H. Seidl. Precise fixpoint computation through strategy iteration. In R. De Nicola,
editor, European Symposium on Programming (ESOP), volume 4421 of LNCS, pages 300–315. Springer, 2007.
[14] E. Hansen. Global Optimization Using Interval Analysis. Marcel Dekker, Inc., New York, 1992.
[15] D. Harel, P. Norvig, J. Rood, and T. To. A universal flowcharter. In 2nd Computers in Aerospace
Conference, volume A79-54378/24-59, pages 218–224, New York, 1979. AIAA.
[16] ILOG. ILOG CPLEX 11.0 User’s Manual. ILOG S.A., Gentilly, France, 2008.
[17] R. Levine, T. Mason, and D. Brown. Lex and Yacc. O’Reilly, Cambridge, second edition, 1995.
[18] L. Liberti. Writing global optimization software. In L. Liberti and N. Maculan, editors, Global
Optimization: from Theory to Implementation, pages 211–262. Springer, Berlin, 2006.
[21] T.J. McCabe. A complexity measure. IEEE Transactions on Software Engineering, 2(4):308–320,
1976.
[22] M. Roozbehani, A. Megretski, and E. Feron. Convex optimization proves software correctness. In
Proceedings of the American Control Conference, 2005.
[23] R. Rugina and M. Rinard. Symbolic bounds analysis of pointers, array indices, and accessed memory
regions. ACM Transactions on Programming Languages and Systems, 27(2):183–235, 2005.
[24] E.M.B. Smith and C.C. Pantelides. A symbolic reformulation/spatial branch-and-bound algorithm
for the global optimisation of nonconvex MINLPs. Computers & Chemical Engineering, 23:457–478,
1999.
[25] A. Tarski. A lattice-theoretical fixpoint theorem and its applications. Pacific Journal of Mathematics,
5(2):285–309, 1955.
[26] H. Tuy. Convex Analysis and Global Optimization. Kluwer Academic Publishers, Dordrecht, 1998.