Deterministic Parallel Fixpoint Computation
Abstract interpretation is a general framework for expressing static program analyses. It reduces the problem
of extracting properties of a program to computing an approximation of the least fixpoint of a system of
equations. The de facto approach for computing the approximation of this fixpoint uses a sequential algorithm
based on weak topological order (WTO). This paper presents a deterministic parallel algorithm for fixpoint
computation by introducing the notion of weak partial order (WPO). We present an algorithm for constructing
a WPO in almost-linear time. Finally, we describe Pikos, our deterministic parallel abstract interpreter, which
extends the sequential abstract interpreter IKOS. We evaluate the performance and scalability of Pikos on a
suite of 1017 C programs. When using 4 cores, Pikos achieves an average speedup of 2.06x over IKOS, with a
maximum speedup of 3.63x. When using 16 cores, Pikos achieves a maximum speedup of 10.97x.
CCS Concepts: • Software and its engineering → Automated static analysis; • Theory of computation
→ Program analysis.
Additional Key Words and Phrases: Abstract interpretation, Program analysis, Concurrency
ACM Reference Format:
Sung Kook Kim, Arnaud J. Venet, and Aditya V. Thakur. 2020. Deterministic Parallel Fixpoint Computation.
Proc. ACM Program. Lang. 4, POPL, Article 14 (January 2020), 33 pages. https://doi.org/10.1145/3371082
1 INTRODUCTION
Program analysis is a widely adopted approach for automatically extracting properties of the
dynamic behavior of programs [Balakrishnan et al. 2010; Ball et al. 2004; Brat and Venet 2005;
Delmas and Souyris 2007; Jetley et al. 2008]. Program analyses are used, for instance, for program
optimization, bug finding, and program verification. To be effective, a program analysis needs to be
efficient, precise, and deterministic (the analysis always computes the same output for the same
input program) [Bessey et al. 2010]. This paper aims to improve the efficiency of program analysis
without sacrificing precision or determinism.
Abstract interpretation [Cousot and Cousot 1977] is a general framework for expressing static
program analyses. A typical use of abstract interpretation to determine program invariants involves:
C1 An abstract domain A that captures relevant program properties. Abstract domains have been
developed to perform, for instance, numerical analysis [Cousot and Halbwachs 1978; Miné
2004, 2006; Oulamara and Venet 2015; Singh et al. 2017; Venet 2012], heap analysis [Rinetzky
et al. 2005; Wilhelm et al. 2000], and information flow [Giacobazzi and Mastroeni 2004].
Authors’ addresses: Sung Kook Kim, Computer Science, University of California, Davis, Davis, California, 95616, U.S.A.,
[email protected]; Arnaud J. Venet, Facebook, Inc. Menlo Park, California, 94025, U.S.A., [email protected]; Aditya V. Thakur,
Computer Science, University of California, Davis, Davis, California, 95616, U.S.A., [email protected].
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses,
contact the owner/author(s).
© 2020 Copyright held by the owner/author(s).
2475-1421/2020/1-ART14
https://doi.org/10.1145/3371082
Proc. ACM Program. Lang., Vol. 4, No. POPL, Article 14. Publication date: January 2020.
C2 An equation system X = F (X) over A that captures the abstract program behavior:
X1 = F1(X1, . . . , Xn), X2 = F2(X1, . . . , Xn), . . . , Xn = Fn(X1, . . . , Xn)    (1)
Each index i ∈ [1, n] corresponds to a control point of the program, the unknowns Xi of the
system correspond to the invariants to be computed for these control points, and each Fi is a
monotone operator incorporating the abstract transformers and control flow of the program.
C3 Computing an approximation of the least fixpoint of Eq. 1. The exact least solution of the system
can be computed using Kleene iteration starting from the least element of A^n provided A is
Noetherian. However, most interesting abstract domains require the use of widening to ensure
termination, which may result in an over-approximation of the invariants of the program. A
subsequent narrowing iteration tries to improve the post solution via a downward fixpoint
iteration. In practice, abstract interpreters compute an approximation of the least fixpoint. In
this paper, we use “fixpoint” to refer to such an approximation of the least fixpoint.
The iteration strategy specifies the order in which the equations in Eq. 1 are applied during fixpoint
computation and where widening is performed. For a given abstraction, the efficiency, precision,
and determinism of an abstract interpreter depend on the iteration strategy. The iteration strategy
is determined by the dependencies between the individual equations in Eq. 1. If this dependency
graph is acyclic, then the optimal iteration strategy is any topological order of the vertices in the
graph. This is not true when the dependency graph contains cycles. Furthermore, each cycle in the
dependency graph needs to be cut by at least one widening point.
Since its publication, Bourdoncle’s algorithm [Bourdoncle 1993] has become the de facto approach
for computing an efficient iteration strategy for abstract interpretation. Bourdoncle’s algorithm
determines the iteration strategy from the weak topological order (WTO) of the vertices in the
dependency graph corresponding to the equation system. However, there are certain disadvantages to
Bourdoncle’s algorithm: (i) the iteration strategy computed by Bourdoncle’s algorithm is inherently
sequential: WTO gives a total order of the vertices in the dependency graph; (ii) computing WTO
using Bourdoncle’s algorithm has a worst-case cubic time complexity; (iii) the mutually-recursive
nature of Bourdoncle’s algorithm makes it difficult to understand (even for seasoned practitioners
of abstract interpretation); and (iv) applying Bourdoncle’s algorithm, as is, to deep dependency
graphs can result in a stack overflow in practice [Crab 2018; ReDex 2017].
This paper addresses the above disadvantages of Bourdoncle’s algorithm by presenting a concurrent
iteration strategy for fixpoint computation in an abstract interpreter (§5). This concurrent
fixpoint computation can be efficiently executed on modern multi-core hardware. The algorithm for
computing our iteration strategy has a worst-case almost-linear time complexity, and lends itself to
a simple iterative implementation (§6). The resulting parallel abstract interpreter, however, remains
deterministic: for the same program, all possible executions of the parallel fixpoint computation
give the same result. In fact, the fixpoint computed by our parallel algorithm is the same as that
computed by Bourdoncle’s sequential algorithm (§7).
To determine the concurrent iteration strategy, this paper introduces the notion of a weak partial
order (WPO) for the dependency graph of the equation system (§4). WPO generalizes the notion of
WTO: a WTO is a linear extension of a WPO (§7). Consequently, the almost-linear time algorithm for
WPO can also be used to compute WTO. The algorithm for WPO construction handles dependency
graphs that are irreducible [Hecht and Ullman 1972; Tarjan 1973]. The key insight behind our
approach is to adapt algorithms for computing loop nesting forests [Ramalingam 1999, 2002] to the
problem of computing a concurrent iteration strategy for abstract interpretation.
We have implemented our concurrent fixpoint iteration strategy in a tool called Pikos (§8). Using
a suite of 1017 C programs, we compare the performance of Pikos against the state-of-the-art
abstract interpreter IKOS [Brat et al. 2014], which uses Bourdoncle’s algorithm (§9). When using 4
cores, Pikos achieves an average speedup of 2.06x over IKOS, with a maximum speedup of 3.63x.
We see that Pikos exhibits a larger speedup when analyzing programs that took longer to analyze
using IKOS. Pikos achieved an average speedup of 1.73x on programs for which IKOS took less
than 16 seconds, while Pikos achieved an average speedup of 2.38x on programs for which IKOS
took greater than 508 seconds. The scalability of Pikos depends on the structure of the program
being analyzed. When using 16 cores, Pikos achieves a maximum speedup of 10.97x.
The contributions of the paper are as follows:
• We introduce the notion of a weak partial order (WPO) for a directed graph (§4), and show
how this generalizes the existing notion of weak topological order (WTO) (§7).
• We present a concurrent algorithm for computing the fixpoint of a set of equations (§5).
• We present an almost-linear time algorithm for WPO and WTO construction (§6).
• We describe our deterministic parallel abstract interpreter Pikos (§8), and evaluate its
performance on a suite of C programs (§9).
§2 presents an overview of the technique; §3 presents mathematical preliminaries; §10 describes
related work; §11 concludes.
2 OVERVIEW
Abstract interpretation is a general framework that captures most existing approaches for static
program analyses and reduces extracting properties of programs to approximating their semantics
[Cousot and Cousot 1977; Cousot et al. 2019]. Consequently, this section is not meant to capture all
possible approaches to implementing abstract interpretation or describe all the complex
optimizations involved in a modern implementation of an abstract interpreter. Instead it is only meant to set
the appropriate context for the rest of the paper, and to capture the relevant high-level structure of
abstract-interpretation implementations such as IKOS [Brat et al. 2014].
Fixpoint equations. Consider the simple program P represented by its control flow graph (CFG)
in Figure 1(a). We will illustrate how an abstract interpreter would compute the set of values
that variable x might contain at each program point i in P. In this example, we will use the
standard integer interval domain [Cousot and Cousot 1976, 1977] represented by the complete
lattice ⟨Int, ⊑, ⊥, ⊤, ⊔, ⊓⟩ with Int ≝ {⊥} ∪ {[l, u] | l, u ∈ ℤ ∧ l ≤ u} ∪ {[−∞, u] | u ∈ ℤ} ∪ {[l, ∞] |
l ∈ ℤ} ∪ {[−∞, ∞]}. The partial order ⊑ on Int is interval inclusion with the empty interval ⊥ = ∅
encoded as [∞, −∞] and ⊤ = [−∞, ∞].
Figure 1(b) shows the corresponding equation system X = F (X), where X = (X0 , X1 , . . . , X8 ).
Each equation in this equation system is of the form Xi = Fi (X0 , X1 , . . . , X8 ), where the variable
Xi ∈ Int represents the interval value at program point i in P and Fi is monotone. The operator
+ represents the (standard) addition operator over Int. As is common (but not necessary), the
dependencies among the equations reflect the CFG of program P.
The exact least solution X^lfp of the equation system X = F(X) would give the required set of
values for variable x at program point i. Let X^0 = (⊥, ⊥, . . . , ⊥) and X^(i+1) = F(X^i), i ≥ 0 represent
the standard Kleene iterates, which converge to X^lfp.
Chaotic iteration. Instead of applying the function F during Kleene iteration, one can use chaotic
iterations [Cousot 1977; Cousot and Cousot 1977] and apply the individual equations Fi . The order
in which the individual equations are applied is determined by the chaotic iteration strategy.
Widening. For non-Noetherian abstract domains, such as the interval abstract domain,
termination of this Kleene iteration sequence requires the use of a widening operator (▽) [Cousot 2015;
Cousot and Cousot 1977]. A set of widening points W is chosen and the equation for i ∈ W is replaced
by Xi = Xi ▽Fi (X0 , . . . , Xn ). An admissible set of widening points “cuts” each cycle in the dependency
(a) Control flow graph of P (drawing not reproduced). Vertex labels: 0; 1: x=0; 2; 3: x=x+1; 4; 5: x=x+1; 6: x=0; 7; 8: x=1.
(b) X0 = ⊤, X1 = [0, 0], X2 = X1 ⊔ X3, X3 = X2 + [1, 1], X4 = X2 ⊔ X8 ⊔ X5 ⊔ X6, X5 = X4 + [1, 1], X6 = [0, 0], X7 = X4, X8 = [1, 1].
(c) X0 = ⊤, X1 = [0, 0], X2 = X2 ▽ (X1 ⊔ X3), X3 = X2 + [1, 1], X4 = X4 ▽ (X2 ⊔ X8 ⊔ X5 ⊔ X6), X5 = X4 + [1, 1], X6 = [0, 0], X7 = X4, X8 = [1, 1].
Fig. 1. (a) A simple program P that updates x; (b) Corresponding equation system for interval domain;
(c) Corresponding equation system with vertices 2 and 4 as widening points.
graph of the equation system by the use of a widening operator to ensure termination [Cousot
and Cousot 1977]. Finding a minimal admissible set of widening points is an NP-complete problem
[Garey and Johnson 2002]. A possible widening operator for the interval abstract domain is
defined by: ⊥ ▽ I = I ▽ ⊥ = I for I ∈ Int, and [i, j] ▽ [k, l] = [if k < i then −∞ else i, if l > j then ∞ else j].
This widening operator is non-monotone. The application of a widening operator may result in a
crude over-approximation of the least fixpoint; more sophisticated widening operators as well as
techniques such as narrowing can be used to ensure precision [Amato and Scozzari 2013; Amato
et al. 2016; Cousot and Cousot 1977; Gopan and Reps 2006; Kim et al. 2016]. Although the discussion
of our fixpoint algorithm uses a simple widening strategy (§5), our implementation incorporates
more sophisticated widening and narrowing strategies implemented in IKOS (§8).
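As a minimal sketch, the interval widening operator defined above can be written in Python. The tuple encoding of intervals, the use of None for ⊥, and the function names are assumptions made for this illustration; they are not taken from IKOS or Pikos. The final lines illustrate why the operator is non-monotone in its first argument.

```python
import math

BOT = None  # assumed encoding of the empty interval ⊥


def leq(a, b):
    """Interval inclusion a ⊑ b."""
    return a is BOT or (b is not BOT and b[0] <= a[0] and a[1] <= b[1])


def widen(a, b):
    """[i, j] ▽ [k, l]: any bound that grows is pushed to infinity."""
    if a is BOT:
        return b
    if b is BOT:
        return a
    (i, j), (k, l) = a, b
    lo = -math.inf if k < i else i
    hi = math.inf if l > j else j
    return (lo, hi)


# An unstable upper bound jumps to infinity in one step.
grown = widen((0, 0), (0, 1))

# Non-monotonicity: (0, 1) ⊑ (0, 2), yet widening both with (0, 2)
# gives results that are not ordered the same way.
small = widen((0, 1), (0, 2))   # upper bound grows, so it becomes infinite
large = widen((0, 2), (0, 2))   # nothing grows, so the interval is kept
```

Here `small` is strictly larger than `large` even though its first argument was smaller, which is the non-monotone behavior noted in the text.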
Bourdoncle’s approach. Bourdoncle [1993] introduces the notion of hierarchical total order (HTO)
of a set and weak topological order (WTO) of a directed graph (see §7). An admissible set of widening
points as well as a chaotic iteration strategy, called the recursive strategy, can be computed using
a WTO of the dependency graph of the equation system. A WTO for the equation system in
Figure 1(b) is T ≝ 0 8 1 (2 3) (4 5 6) 7. The set of elements between two matching parentheses
is called a component of the WTO, and the first element of a component is called the head of the
component. Notice that components are non-trivial strongly connected components (“loops”) in
the directed graph of Figure 1(a). Bourdoncle [1993] proves that the set of component heads is an
admissible set of widening points. For Figure 1(b), the set of heads {2, 4} is an admissible set of
widening points. Figure 1(c) shows the corresponding equation system that uses widening.
The iteration strategy generated using WTO T is S1 ≝ 0 8 1 [2 3]∗ [4 5 6]∗ 7, where an occurrence of i
in the sequence represents applying the equation for Xi, and [. . .]∗ is the “iterate until stabilization”
operator. A component is stabilized if iterating over its elements does not change their values.
The component heads 2 and 4 are chosen as widening points. The iteration sequence S1 should be
interpreted as “apply the equation for X0, then apply the equation for X8, then apply the equation for X1,
then repeatedly apply the equations for X2 and X3 until stabilization” and so on. Furthermore, Bourdoncle
[1993] showed that stabilization of a component can be detected by the stabilization of its head.
For instance, stabilization of component {2, 3} can be detected by the stabilization of its head 2.
This property minimizes the number of (potentially expensive) comparisons between abstract
values during fixpoint computation. For the equation system of Figure 1(c), the use of Bourdoncle’s
recursive iteration strategy would give us X7^fp = [0, ∞].
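The recursive strategy can be sketched in Python on the equation system of Figure 1(c). This is an illustrative reading of the strategy, not the IKOS implementation: the nested-list encoding of the WTO, the interval encoding, and all function names are assumptions. Stabilization of a component is detected at its head only, as described above.

```python
import math

BOT, TOP = None, (-math.inf, math.inf)  # assumed encodings of ⊥ and ⊤


def join(*xs):
    vals = [x for x in xs if x is not BOT]
    if not vals:
        return BOT
    return (min(v[0] for v in vals), max(v[1] for v in vals))


def widen(a, b):
    if a is BOT or b is BOT:
        return join(a, b)
    return (-math.inf if b[0] < a[0] else a[0],
            math.inf if b[1] > a[1] else a[1])


def add(a, b):
    return BOT if a is BOT or b is BOT else (a[0] + b[0], a[1] + b[1])


# Equation system of Figure 1(c): widening at the heads 2 and 4.
F = {
    0: lambda X: TOP,
    1: lambda X: (0, 0),
    2: lambda X: widen(X[2], join(X[1], X[3])),
    3: lambda X: add(X[2], (1, 1)),
    4: lambda X: widen(X[4], join(X[2], X[8], X[5], X[6])),
    5: lambda X: add(X[4], (1, 1)),
    6: lambda X: (0, 0),
    7: lambda X: X[4],
    8: lambda X: (1, 1),
}

# WTO T = 0 8 1 (2 3) (4 5 6) 7; a nested list is a component, head first.
WTO = [0, 8, 1, [2, 3], [4, 5, 6], 7]


def iterate(items, X):
    for item in items:
        if isinstance(item, list):
            stabilize(item, X)
        else:
            X[item] = F[item](X)


def stabilize(component, X):
    head, body = component[0], component[1:]
    first = True
    while True:
        new = F[head](X)
        changed = new != X[head]
        X[head] = new
        if not changed and not first:
            return  # the head is stable, hence so is the component
        iterate(body, X)
        first = False


X = {i: BOT for i in F}
iterate(WTO, X)
```

After the run, `X[7]` holds the interval [0, ∞] stated in the text, and the heads `X[2]` and `X[4]` both hold [0, ∞].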
Asynchronous iterations. The iteration strategy produced by Bourdoncle’s approach is necessarily
sequential, because the iteration sequence is generated from a total order. One could
3 MATHEMATICAL PRELIMINARIES
A binary relation R on set S is a subset of the Cartesian product of S and S; that is, R ⊆ S × S.
Given S ′ ⊆ S, let R⇂S ′ = R ∩ (S ′ × S ′). A relation R on set S is said to be one-to-one iff for all
w, x, y, z ∈ S, (x, z) ∈ R and (y, z) ∈ R implies x = y, and (w, x) ∈ R and (w, y) ∈ R implies x = y. A
transitive closure of a binary relation R, denoted by R+ , is the smallest transitive binary relation
that contains R. A reflexive transitive closure of a binary relation R, denoted by R∗ , is the smallest
reflexive transitive binary relation that contains R.
A preorder (S, R) is a set S and a binary relation R over S that is reflexive and transitive. A partial
order (S, R) is a preorder where R is antisymmetric. Two elements u, v ∈ S are comparable in a
partial order (S, R) if (u, v) ∈ R or (v, u) ∈ R. A linear (total) order or chain is a partial order in
which every pair of its elements are comparable. A partial order (S, R′) is an extension of a partial
order (S, R) if R ⊆ R′; an extension that is a linear order is called a linear extension. There exists a
linear extension for every partial order [Szpilrajn 1930].
Given a partial order (S, R), define ⌊⌊x⌉R ≝ {y ∈ S | (x, y) ∈ R}, and ⌊x⌉⌉R ≝ {v ∈ S | (v, x) ∈ R},
and ⌊⌊x, y⌉⌉R ≝ ⌊⌊x⌉R ∩ ⌊y⌉⌉R. A partial order (S, R) is a forest if for all x ∈ S, (⌊x⌉⌉R, R) is a chain.
Example 3.1. Let (Y , T) be a partial order with Y = {y1 , y2 , y3 , y4 } and T = {(y1 , y2 ), (y2 , y3 ),
(y2 , y4 )}∗ . Let Y ′ = {y1 , y2 } ⊆ Y , then T⇂Y ′ = {(y1 , y1 ), (y1 , y2 ), (y2 , y2 )}.
We see that the partial order (Y , T) is a forest because for all y ∈ Y , (⌊y⌉⌉T , T) is a chain.
⌊⌊y1 , y1 ⌉⌉T = {y1 } ⌊⌊y1 , y2 ⌉⌉T = {y1 , y2 } ⌊⌊y1 , y3 ⌉⌉T = {y1 , y2 , y3 } ⌊⌊y1 , y4 ⌉⌉T = {y1 , y2 , y4 }
⌊⌊y4 , y1 ⌉⌉T = ∅ ⌊⌊y4 , y2 ⌉⌉T = ∅ ⌊⌊y4 , y3 ⌉⌉T = ∅ ⌊⌊y4 , y4 ⌉⌉T = {y4 }
■
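The sets in Example 3.1 can be checked mechanically. The following Python sketch encodes relations as sets of pairs; the helper names (`rt_closure`, `restrict`, `up`, `down`, `between`, `is_forest`) are assumptions introduced for this illustration.

```python
from itertools import product


def rt_closure(S, R):
    """Reflexive transitive closure R* of R over the set S."""
    closure = set(R) | {(s, s) for s in S}
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def restrict(R, S2):
    """R restricted to S2, i.e. R ∩ (S2 × S2)."""
    return {(a, b) for (a, b) in R if a in S2 and b in S2}


def up(x, S, R):
    """Elements above x: {y ∈ S | (x, y) ∈ R}."""
    return {y for y in S if (x, y) in R}


def down(x, S, R):
    """Elements below x: {v ∈ S | (v, x) ∈ R}."""
    return {v for v in S if (v, x) in R}


def between(x, y, S, R):
    """Elements above x and below y."""
    return up(x, S, R) & down(y, S, R)


def is_chain(C, R):
    return all((u, v) in R or (v, u) in R for u in C for v in C)


def is_forest(S, R):
    return all(is_chain(down(x, S, R), R) for x in S)


Y = {"y1", "y2", "y3", "y4"}
T = rt_closure(Y, {("y1", "y2"), ("y2", "y3"), ("y2", "y4")})
T_restricted = restrict(T, {"y1", "y2"})
forest = is_forest(Y, T)
```

With these definitions, `between("y1", "y3", Y, T)` yields {y1, y2, y3} and `between("y4", "y1", Y, T)` is empty, matching the values listed in the example.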
A directed graph G(V, →) is defined by a set of vertices V and a binary relation → over V. The
reachability among vertices is captured by the preorder →∗: there is a path from vertex u to vertex
v in G iff u →∗ v. G is a directed acyclic graph (DAG) iff (V, →∗) is a partial order. A topological order
of a DAG G corresponds to a linear extension of the partial order (V, →∗). We use G⇂V′ to denote
the subgraph (V ∩ V′, →⇂V′). Given a directed graph G(V, →), a depth-first numbering (DFN) is the
order in which vertices are discovered during a depth-first search (DFS) of G. A post depth-first
numbering (post-DFN) is the order in which vertices are finished during a DFS of G. A depth-first
tree (DFT) of G is a tree formed by the edges used to discover vertices during a DFS. Given a DFT
of G, an edge u → v is called (i) a tree edge if v is a child of u in the DFT; (ii) a back edge if v is an
ancestor of u in the DFT; (iii) a forward edge if it is not a tree edge and v is a descendant of u in the
DFT; and (iv) a cross edge otherwise [Cormen et al. 2009]. In general, a directed graph might contain
multiple connected components and a DFS yields a depth-first forest (DFF). The lowest common
ancestor (LCA) of vertices u and v in a rooted tree T is a vertex that is an ancestor of both u and v
and that has the greatest depth in T [Tarjan 1979]. It is unique for all pairs of vertices.
A strongly connected component (SCC) of a directed graph G(V, →) is a subgraph of G such that
u →∗ v for all u, v in the subgraph. An SCC is trivial if it only consists of a single vertex without
any edges. A feedback edge set B of a graph G(V, →) is a subset of → such that (V, (→ \ B)∗) is a
partial order; that is, the directed graph G(V, → \ B) is a DAG. The problem of finding the minimum
feedback edge set is NP-complete [Karp 1972].
Example 3.2. Let G(V, →) be the directed graph shown in Figure 1(a). The ids used to label the
vertices V of G correspond to a depth-first numbering (DFN) of the directed graph G. The following
lists the vertices in increasing post-DFN numbering: 3, 5, 6, 7, 4, 2, 1, 8, 0. Edges (3, 2), (5, 4), and
(6, 4) are back edges for the DFF that is assumed by the DFN, edge (8, 4) is a cross edge, and the
rest are tree edges. The lowest common ancestor (LCA) of 3 and 7 in this DFF is 2. The subgraphs
induced by the vertex sets {2, 3}, {4, 5}, {4, 6}, and {4, 5, 6} are all non-trivial SCCs. The minimum
feedback edge set of G is F = {(3, 2), (5, 4), (6, 4)}. We see that the graph G(V, → \ F) is a DAG. ■
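The numbering and edge classification of Example 3.2 can be reproduced with a short depth-first search. In the Python sketch below, the adjacency lists are an assumption: they are inferred from the dependencies in the equation system of Figure 1(b) and from the edges named in the example, with successors visited in increasing label order. The function name `dfs_classify` is likewise illustrative.

```python
def dfs_classify(graph, root):
    """Return (dfn, post_order, edge_kinds) for a DFS from root."""
    dfn, post, kinds, finished = {}, [], {}, set()

    def visit(u):
        dfn[u] = len(dfn)                 # discovery number
        for v in graph.get(u, []):
            if v not in dfn:
                kinds[(u, v)] = "tree"
                visit(v)
            elif v not in finished:
                kinds[(u, v)] = "back"    # v is an ancestor still open
            elif dfn[v] > dfn[u]:
                kinds[(u, v)] = "forward"
            else:
                kinds[(u, v)] = "cross"
        finished.add(u)
        post.append(u)                    # finishing order

    visit(root)
    return dfn, post, kinds


# Assumed edge list for the graph of Figure 1(a).
G = {0: [1, 8], 1: [2], 2: [3, 4], 3: [2],
     4: [5, 6, 7], 5: [4], 6: [4], 7: [], 8: [4]}

dfn, post, kinds = dfs_classify(G, 0)
back_edges = {e for e, k in kinds.items() if k == "back"}
```

Under these assumptions the vertex labels coincide with the DFN, `post` lists the vertices in the post-DFN order given in the example, and `back_edges` equals the feedback edge set F.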
For each (x, h) ∈ N, the set ⌊⌊h, x⌉⌉⪯ ≝ {u ∈ S | h ⪯ u ⪯ x} defines a component of the HPO,
with x and h referred to as the exit and head of the component. A component can be identified
using either its head or its exit due to condition H2; we use Ch or Cx to denote a component with
head h and exit x. Condition H3 states that the nesting relation N is in the opposite direction of the
partial order ⪯. The reason for this convention will be clearer when we introduce the notion of
WPO (Definition 4.3), where we show that the nesting relation N has a connection to the feedback
edge set of the directed graph. Condition H4 implies that the set of components CH is well-nested;
that is, two components should be either mutually disjoint or one must be a subset of the other.
Condition H5 states that if an element v depends upon an element u in a component C x , then
either v depends on the exit x or v is in the component C x . Recall that (x, h) ∈ N and h ⪯ u ⪯ x
implies u ∈ C x by definition. Furthermore, v ⪯ x and u ⪯ v implies v ∈ C x . Condition H5 ensures
determinism of the concurrent iteration strategy (§5); this condition ensures that the value of u
does not “leak” from C x during fixpoint computation until the component C x stabilizes.
Example 4.2. Consider the partial order (Y , T) defined in Example 3.1. Let N1 = {(y3 , y1 ), (y4 , y2 )}.
(Y , T, N1 ) violates condition H4. In particular, the components Cy3 = Cy1 = {y1 , y2 , y3 } and Cy4 =
Cy2 = {y2 , y4 } are neither disjoint nor is one a subset of the other. Thus, (Y , T, N1 ) is not an HPO.
Let N2 = {(y3 , y1 )}. (Y , T, N2 ) violates condition H5. In particular, y2 ∈ Cy3 and (y2 , y4 ) ∈ T, but
we do not have y3 ≺ y4 or y4 ⪯ y3 . Thus, (Y , T, N2 ) is not an HPO.
Let N3 = {(y2 , y1 )}. (Y , T, N3 ) is an HPO satisfying all conditions H1–H5. ■
Building upon the notion of an HPO, we now define a Weak Partial Order (WPO) for a directed
graph G(V, →). In the context of fixpoint computation, G represents the dependency graph of the
fixpoint equation system. To find an effective iteration strategy, the cyclic dependencies in G need
to be broken. In effect, a WPO partitions the preorder →∗ into a partial order ⇀∗ and an edge set
defined using a nesting relation ⇢.
Definition 4.3. A weak partial order W for a directed graph G(V, →) is a 4-tuple (V, X, ⇀, ⇢)
such that:
W1. V ∩ X = ∅.
W2. ⇢ ⊆ X × V, and for all x ∈ X, there exists v ∈ V such that x ⇢ v.
W3. ⇀ ⊆ (V ∪ X) × (V ∪ X).
W4. H(V ∪ X, ⇀∗, ⇢) is a hierarchical partial order (HPO).
W5. For all u → v, either (i) u ⇀+ v, or (ii) u ∈ ⌊⌊v, x⌉⌉⇀∗ and x ⇢ v for some x ∈ X. ■
[Figure 2 drawings not reproduced: (a) shows G1 over vertices 1–10; (b) shows W1 over vertices 1–10 and exits x2, x3, x6.]
Fig. 2. (a) Directed graph G1. Vertices V are labeled using depth-first numbering (DFN); (b) WPO W1 for G1
with exits X = {x2, x3, x6}.
Example 4.5. Consider the directed graph G1(V, →) in Figure 2(a). Figure 2(b) shows a WPO
W1(V, X, ⇀, ⇢) for G1, where X = {x2, x3, x6}, which satisfies all conditions in Definition 4.3. One
can verify that (V ∪ X, ⇀∗, ⇢) satisfies all conditions in Definition 4.1 and is an HPO.
Suppose we were to remove x6 ⇀ 5 and instead add 6 ⇀ 5 to W1 (to more closely match the
edges in G1), then this change would violate condition H5, and hence condition W4.
If we were to only remove x6 ⇀ 5 from W1, then it would still satisfy condition W4. However,
this change would violate condition W5. ■
Definition 4.6. For graph G(V, →) and its WPO W(V, X, ⇀, ⇢), the back edges of G with respect to
the WPO W, denoted by B_W, are defined as B_W ≝ {(u, v) ∈ → | ∃x ∈ X. u ∈ ⌊⌊v, x⌉⌉⇀∗ ∧ x ⇢ v}. ■
In other words, (u, v) ∈ B_W if u → v satisfies condition W5-(ii) in Definition 4.3. Theorem 4.7
proves that B_W is a feedback edge set for G, and Theorem 4.9 shows that the subgraph (V, → \ B_W)
forms a DAG. Together these two theorems capture the fact that the WPO W(V, X, ⇀, ⇢) partitions
the preorder →∗ of G(V, →) into a partial order ⇀∗ and a feedback edge set B_W.
Theorem 4.7. For graph G(V, →) and its WPO W(V, X, ⇀, ⇢), B_W is a feedback edge set for G.
Proof. Let v1 → v2 → · · · → vn → v1 be a cycle of n distinct vertices in G. We will show that there
exists i ∈ [1, n) such that vi → vi+1 ∈ B_W; that is, vi ∈ ⌊⌊vi+1, x⌉⌉⇀∗ and x ⇢ vi+1 for some x ∈ X.
If this were not true, then v1 ⇀+ · · · ⇀+ vn ⇀+ v1 (using W5). Therefore, v1 ⇀+ vn and vn ⇀+ v1,
which contradicts the fact that ⇀∗ is a partial order. Thus, B_W cuts all cycles at least once and is a
feedback edge set. □
Example 4.8. For the graph G1(V, →) in Figure 2(a) and WPO W1(V, X, ⇀, ⇢) in Figure 2(b),
B_W1 = {(4, 3), (8, 6), (5, 2)}. One can verify that B_W1 is a feedback edge set for G1. ■
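Definition 4.6 can be exercised on a small case. The Python sketch below uses a hypothetical four-vertex graph with a single loop and a hand-built WPO for it; neither is taken from the paper's figures. Scheduling constraints and the nesting relation are given as lists of pairs, and the names `back_edges` and `is_acyclic` are illustrative. An edge (u, v) is reported as a back edge when some exit x nests v and u lies between v and x in the scheduling order.

```python
def reach(succ, start):
    """Elements reachable from start via zero or more scheduling constraints."""
    seen, stack = {start}, [start]
    while stack:
        for w in succ.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def back_edges(edges, sched, nesting):
    succ = {}
    for u, w in sched:
        succ.setdefault(u, []).append(w)
    result = set()
    for (u, v) in edges:
        for (x, h) in nesting:
            # u is in the component with head v and exit x
            if h == v and u in reach(succ, v) and x in reach(succ, u):
                result.add((u, v))
    return result


def is_acyclic(vertices, edges):
    """Kahn-style check that (vertices, edges) is a DAG."""
    indeg = {v: 0 for v in vertices}
    for _, v in edges:
        indeg[v] += 1
    ready = [v for v in vertices if indeg[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return removed == len(vertices)


# Hypothetical graph: 1 -> 2 -> 3 -> 2 (loop), 2 -> 4.
V = [1, 2, 3, 4]
E = {(1, 2), (2, 3), (3, 2), (2, 4)}
# Hand-built WPO: exit "x2" closes the loop headed by 2.
SCHED = [(1, 2), (2, 3), (3, "x2"), ("x2", 4)]
NEST = [("x2", 2)]

B = back_edges(E, SCHED, NEST)
```

For this input `B` contains only the loop-closing edge (3, 2), and removing it leaves an acyclic graph, in line with the statement that the back edges form a feedback edge set.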
Proof. Each edge (u, v) ∈ B_W satisfies W5-(ii) by definition. Therefore, all edges in (→ \ B_W)
must satisfy W5-(i). Thus, u ⇀+ v for all edges (u, v) ∈ (→ \ B_W), and (→ \ B_W)∗ ⊆ ⇀∗. □
Given the tuple W(V, X, ⇀, ⇢) and a set S, we use W⇂S to denote the tuple (V ∩ S, X ∩
S, ⇀⇂S, ⇢⇂S). The following two theorems enable us to decompose a WPO into sub-WPOs, which
allows us to use structural induction when proving properties of WPOs.
Theorem 4.10. For graph G(V, →) and its WPO W(V, X, ⇀, ⇢), W⇂C is a WPO for subgraph
G⇂C for all C ∈ C_W.
Proof. We show that W⇂C satisfies all conditions W1–W5 in Definition 4.3 for all C ∈ C_W.
Conditions W1, W2, W3, W4-[H1, H2, H3, H4] trivially hold.
[W4-H5] If v ∉ C, H5 is true because (u, v) ∉ ⇀∗⇂C. Else, H5 is still satisfied with ⇀∗⇂C.
[W5] We show that u ⇀+ v implies u ⇀+⇂C v, if u, v ∈ C. Let C = ⌊⌊h, x⌉⌉⇀∗ with x ⇢ h. If u ⇀+⇂C v
is false, there exists w ∈ ⌊⌊u⌉⇀∗ ∩ ⌊v⌉⌉⇀∗ such that w ∉ C. However, u ⇀+ w and w ∉ C implies
x ⇀+ w (using H5). This contradicts that (V ∪ X, ⇀∗) is a partial order, because w ∈ ⌊v⌉⌉⇀∗ and
v ∈ ⌊⌊h, x⌉⌉⇀∗ implies w ⇀+ x. Thus, W5 is satisfied. □
Theorem 4.11. For graph G(V, →) and its WPO W(V, X, ⇀, ⇢), if V ∪ X = ⌊⌊h, x⌉⌉⇀∗ for some
(x, h) ∈ ⇢, then W⇂S is a WPO for subgraph G⇂S, where S ≝ V ∪ X \ {h, x}.
Proof. We show that W⇂S satisfies all conditions W1–W5 in Definition 4.3. Conditions W1, W2,
W3, W4-[H1, H2, H3, H4] trivially hold.
[W4-H5] x has no outgoing and h has no incoming scheduling constraints. Thus, W⇂S still
satisfies H5.
[W5] Case (i) is still satisfied because h only had outgoing scheduling constraints and x only had
incoming scheduling constraints. Case (ii) is still satisfied due to H2 and H4. □
Example 4.12. The decomposition of WPO W1 for graph G 1 in Figure 2 is:
[Diagram not reproduced: W1 as drawn in Figure 2(b), with nested regions labeled W1, W⇂C2, W⇂S2, W⇂C3, W⇂C6, and W⇂S6.]
Theorem 4.14. For graph G(V, →) and its WPO W(V, X, ⇀, ⇢), if there is a cycle in G consisting
of vertices V′, then there exists C ∈ C_W^0 such that V′ ⊆ C.
Proof. Assume that the theorem is false. Then, there exist multiple maximal components that
partition the vertices in the cycle. Let (u, v) be an edge in the cycle where u and v are in different
maximal components. By W5, u ⇀+ v, and by H5, xu ⇀+ v, where xu is the exit of the maximal
component that contains u. By the definition of the component, v ⇀+ xv, where xv is the exit of
the maximal component that contains v. Therefore, xu ⇀+ xv. Applying the same reasoning for all
such edges in the cycle, we get xu ⇀+ xv ⇀+ · · · ⇀+ xu. This contradicts the fact that (V ∪ X, ⇀∗) is
a partial order for the WPO W. □
Init:
    conclusion: forall v ∈ V, X[v ↦ (v is entry ? ⊤ : ⊥)]    forall v ∈ V ∪ X, N[v ↦ 0]
NonExit:
    premise:    v ∈ V    N(v) = NumSchedPreds(v)
    conclusion: N[v ↦ 0]    ApplyF(v)    forall v ⇀ w, N[w ↦ (N[w] + 1)]
ComponentStabilized(x) ≝ ∃h ∈ V. x ⇢ h ∧ Fh(X) ⊑ X(h)
SetNForComponent(x) ≝ forall v ∈ Cx, N[v ↦ NumOuterSchedPreds(v, x)]
NumSchedPreds(v) ≝ |{u ∈ V ∪ X | u ⇀ v}|
NumOuterSchedPreds(v, x) ≝ |{u ∈ V ∪ X | u ⇀ v, u ∉ Cx, v ∈ Cx}|
Fig. 3. Deterministic concurrent fixpoint algorithm for WPO. X maps an element in V to its value. N maps
an element in V ∪ X to its count of executed scheduling predecessors. Operations on N are atomic.
Corollary 4.15. For G(V, →) and its WPO W(V, X, ⇀, ⇢), if G is a non-trivial strongly connected
graph, then there exists h ∈ V and x ∈ X such that C_W^0 = {⌊⌊h, x⌉⌉⇀∗} and ⌊⌊h, x⌉⌉⇀∗ = V ∪ X.
Proof. Because there exists a cycle in the graph, there must exist at least one component in
the WPO. Let h ∈ V and x ∈ X be the head and exit of a maximal component in W. Because V ∪ X
contains all elements in the WPO, ⌊⌊h, x⌉⌉⇀∗ ⊆ V ∪ X. Now, suppose ⌊⌊h, x⌉⌉⇀∗ ⊉ V ∪ X. Then,
there exists v ∈ V ∪ X such that v ∉ ⌊⌊h, x⌉⌉⇀∗. If v ∈ V, because the graph is strongly connected,
there exists a cycle v →+ h →+ v. Then, by Theorem 4.14, v ∈ ⌊⌊h, x⌉⌉⇀∗, which is a contradiction. If
v ∈ X, then there exists w ∈ V such that v ⇢ w by W2. Due to H4, w ∉ ⌊⌊h, x⌉⌉⇀∗. By the same
reasoning as the previous case, this leads to a contradiction. □
Rule NonExit applies to a non-exit element v ∈ V whose scheduling predecessors are all
executed (N(v) = NumSchedPreds(v)). This rule applies the function Fv to update the value Xv
(ApplyF(v)). The definition of the function ApplyF shows that widening is applied at the image of
the nesting relation (see Theorem 5.3). The rule then notifies the scheduling successors of v that v has executed by
incrementing their counts. Because elements within a component can be iterated multiple times,
the count of an element is reset after its execution. If there is no component in the WPO, then only
the NonExit rule is applicable, and the algorithm reduces to a DAG scheduling algorithm.
Rules CompStabilized and CompNotStabilized are applied to an exit x (x ∈ X) whose scheduling
predecessors are all executed (N(x) = NumSchedPreds(x)). If the component Cx is stabilized,
CompStabilized is applied, and CompNotStabilized otherwise. A component is stabilized if
iterating it once more does not change the values of elements inside the component. Boolean
function ComponentStabilized checks the stabilization of Cx by checking the stabilization of its
head (see Theorem 5.5). Upon stabilization, rule CompStabilized notifies the scheduling successors
of x and resets the count for x.
Example 5.1. Consider WPO W1 in Figure 2(b). An iteration sequence generated by the concurrent
fixpoint algorithm for WPO W1 is:
Time step in ℕ →
Scheduled elements u ∈ V ∪ X (elements on different rows may execute in the same time step):
    1 2 3 4 x3 3 4 x3 3 4 x3 5 x2 10
    6 7 8 x6 6 7 8 x6
    9 9
The initial value of N(8) is 0. Applying NonExit to 7 and 9 increments N(8) to 2. N(8) now
equals NumSchedPreds(8), and NonExit is applied to 8. Applying NonExit to 8 updates X8 by
applying the function F8, increments N(x6), and resets N(8) to 0. Due to the reset, the same thing
happens when C6 is iterated once more.
The initial value of N(x6) is 0. Applying NonExit to 8 increments N(x6) to 1, which equals
NumSchedPreds(x6). The stabilization of component C6 is checked at x6. If it is stabilized,
CompStabilized is applied to x6, which increments N(5) and resets N(x6) to 0. ■
If the component Cx is not stabilized, rule CompNotStabilized is applied instead. This rule does
not notify the scheduling successors of x, blocking further advancement until the component
stabilizes. To drive the iteration over Cx, SetNForComponent(x) sets the count of each element v in Cx
to NumOuterSchedPreds(v, x), the number of scheduling predecessors of v that are not in Cx. In particular,
the count for the head of Cx, whose scheduling predecessors are all outside Cx, is set to the number of all
its scheduling predecessors, allowing rule NonExit to be applied to the head. The map NumOuterSchedPreds(v, x),
which returns the number of outer scheduling predecessors of v w.r.t. component Cx, can be
computed by running the WPO construction twice: in the first run, compute NumSchedPreds; in the
second run, initialize NumOuterSchedPreds to 0s, and set NumOuterSchedPreds(v′, exit[v]) to
NumSchedPreds(v′) minus the number of scheduling predecessors of v′ found so far whenever a scheduling
constraint targeting v′ with v ≠ exit[v] is found on Line 26 or Line 47 of Algorithm 2 in §6. The
rule also resets the count for x.
Example 5.2. Let G2(V, ↣) be a graph with vertices 1 to 6, and let W2 be its WPO with elements
1, 2, 3, 4, 5, 6, x3, and x2 (the inline drawings of G2 and W2 are omitted).
An iteration sequence generated by the concurrent fixpoint algorithm for WPO W2 is:
Time step in ℕ →
Scheduled element   1   2   3   4   x3  3   4   x3  x2  2   3   4   x3  x2
u ∈ V ∪ X               6   5                               5
Consider the element 4, whose scheduling predecessors in the WPO are 3, 5, and 6. Further-
more, 4 ∈ C 3 and 4 ∈ C 2 with C 3 ⊊ C 2 . After NonExit is applied to 4, N (4) is reset to 0.
Then, if the stabilization check of C 3 fails at x 3 , CompNotStabilized sets N (4) to 2, which is
NumOuterSchedPreds(4, x 3 ). If it is not set to 2, then the fact that elements 5 and 6 are executed will
not be reflected in N (4), and the iteration over C 3 will be blocked at element 4. If the stabilization
check of C 2 fails at x 2 , CompNotStabilized sets N (4) to NumOuterSchedPreds(4, x 2 ) = 1. ■
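The counting rules can be replayed sequentially to see how NumOuterSchedPreds unblocks element 4. The sketch below is an illustration under stated assumptions, not the paper's implementation: the scheduling constraints of W2 are inferred from the prose of Example 5.2 (the drawing is not available here), values Xv and ApplyF are omitted, a FIFO queue replaces concurrent execution, and a scripted oracle stands in for ComponentStabilized. All names are hypothetical.

```python
from collections import deque

# Hypothetical encoding of WPO W2: scheduling constraints inferred from the
# prose of Example 5.2 (4 has predecessors 3, 5, 6; C3 is nested in C2).
SCHED = [("1", "2"), ("1", "6"), ("2", "3"), ("2", "5"), ("3", "4"),
         ("5", "4"), ("6", "4"), ("4", "x3"), ("x3", "x2")]
COMPONENTS = {"x3": {"3", "4", "x3"},                    # C3, keyed by its exit
              "x2": {"2", "3", "4", "5", "x3", "x2"}}    # C2, keyed by its exit


def num_outer_sched_preds(preds, component, v):
    """Number of scheduling predecessors of v that lie outside the component."""
    return sum(1 for p in preds[v] if p not in component)


def run(sched, components, stabilized):
    """Apply NonExit / CompStabilized / CompNotStabilized one element at a time."""
    elements = sorted({e for edge in sched for e in edge})
    succs = {e: [] for e in elements}
    preds = {e: [] for e in elements}
    for u, v in sched:
        succs[u].append(v)
        preds[v].append(u)
    count = {e: 0 for e in elements}                     # the counts N
    ready = deque(e for e in elements if not preds[e])   # entry elements
    trace = []

    def notify_successors(u):
        for w in succs[u]:
            count[w] += 1
            if count[w] == len(preds[w]):
                ready.append(w)

    while ready:
        u = ready.popleft()
        trace.append(u)
        count[u] = 0                                     # every rule resets N(u)
        if u not in components or stabilized(u):
            notify_successors(u)                         # NonExit, CompStabilized
        else:                                            # CompNotStabilized
            for v in components[u] - {u}:
                count[v] = num_outer_sched_preds(preds, components[u], v)
                if count[v] == len(preds[v]):
                    ready.append(v)                      # only the head qualifies
    return trace, preds


# Scripted oracle standing in for ComponentStabilized: answers per check, in order.
answers = {"x3": iter([False, True, True]), "x2": iter([False, True])}
trace, preds = run(SCHED, COMPONENTS, lambda x: next(answers[x]))
print(" ".join(trace))
```

The printed trace is one sequential interleaving of the iteration sequence in Example 5.2. Replacing the call to num_outer_sched_preds with 0 would leave N(4) short of NumSchedPreds(4) after a failed check at x3, which is the blocking described above.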
In ApplyF, the image of ⇢ is chosen as the set of widening points. These are the heads of the
components. The following theorem proves that the set of component heads is an admissible set of
widening points, which guarantees the termination of the fixpoint computation:
Theorem 5.3. Given a dependency graph G(V, ↣) and its WPO W(V, X, →, ⇢), the set of
component heads is an admissible set of widening points.
Proof. Theorem 4.7 proves that B_W ≝ {(u, v) ∈ ↣ | ∃x ∈ X. u ∈ ⌊⌊v, x⌉⌉_{→∗} ∧ x ⇢ v} is a
feedback edge set. Consequently, the set of component heads {h | ∃x ∈ X. x ⇢ h} is a feedback vertex
set. Therefore, the set of component heads is an admissible set of widening points [Cousot and Cousot 1977]. □
Example 5.4. The set of component heads {2, 3, 6} is an admissible set of widening points for the
WPO W1 in Figure 2(b). ■
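The feedback-vertex-set property used in the proof of Theorem 5.3 can be checked mechanically: deleting the candidate vertices must leave an acyclic graph. The sketch below does this with Kahn's algorithm. The edge list is an assumption: Figure 2(a) is not reproduced in this text, so EDGES is a hypothetical graph that is merely consistent with the heads {2, 3, 6} of Example 5.4, and the function names are illustrative.

```python
def is_acyclic(vertices, edges):
    """Kahn's algorithm: a graph is acyclic iff every vertex can be peeled off."""
    indegree = {v: 0 for v in vertices}
    succs = {v: [] for v in vertices}
    for u, v in edges:
        succs[u].append(v)
        indegree[v] += 1
    ready = [v for v in vertices if indegree[v] == 0]
    peeled = 0
    while ready:
        u = ready.pop()
        peeled += 1
        for w in succs[u]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return peeled == len(vertices)


def is_feedback_vertex_set(vertices, edges, candidate):
    """True if removing `candidate` from the graph breaks every cycle."""
    rest = [v for v in vertices if v not in candidate]
    kept = [(u, v) for u, v in edges
            if u not in candidate and v not in candidate]
    return is_acyclic(rest, kept)


# Hypothetical stand-in for G1 (assumed edges, not taken from Figure 2(a)).
VERTICES = list(range(1, 11))
EDGES = [(1, 2), (2, 3), (3, 4), (4, 3), (4, 5), (2, 6), (6, 7), (6, 9),
         (7, 8), (9, 8), (8, 6), (8, 5), (5, 2), (5, 10)]

print(is_feedback_vertex_set(VERTICES, EDGES, {2, 3, 6}))
print(is_feedback_vertex_set(VERTICES, EDGES, {2, 3}))
```

On this assumed graph the first check succeeds, while dropping 6 from the candidate set leaves the cycle through 6, 7, and 8 intact.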
The following theorem justifies our definition of ComponentStabilized; viz., checking the
stabilization of Xh is sufficient for checking the stabilization of the component Ch .
Theorem 5.5. During the execution of the concurrent fixpoint algorithm with WPO W(V, X, →, ⇢),
stabilization of the head h implies the stabilization of the component Ch at its exit for all Ch ∈ CW.
Proof. Suppose there exists an element v ∈ Ch that is not stabilized despite the stabilization of h.
That is, Xv changes if Ch is iterated once more. For this to be possible, there must exist u such that
u ↣ v and whose value Xu changed after the last application of function Fv. By W5 and u ↣ v, it is
either (i) u →+ v or (ii) u ∈ ⌊⌊v, x⌉⌉_{→∗} = Cv and x ⇢ v for some x ∈ X. It cannot be case (i) because,
if it were, function Fu could not be applied after Fv. Even if u were in some other component
Cx′, due to H5, x′ →+ v, resulting in the same conclusion. Therefore, it must be case (ii). By H4,
Cv ⊊ Ch. However, because u ∈ Cv, our algorithm checks the stabilization of v at the exit of Cv
after the last application of function Fu. This contradicts the assumption that Xu changed after the
last application of function Fv. □
A WPO W(V, X, →, ⇢) where V = {v} and X = → = ⇢ = ∅ is said to be a trivial WPO, which is
represented as v. It can only be a WPO for a trivial SCC with vertex v. A WPO W(V, X, →, ⇢)
where V = {h}, X = {x}, → = {(h, x)}, and ⇢ = {(x, h)} is said to be a self-loop WPO, and is represented
as h x. It can only be a WPO for a trivial SCC with vertex h or a single vertex h with a self-loop.
The following theorem proves that the concurrent fixpoint algorithm in Figure 3 is deterministic.
Theorem 5.6. Given a WPO W(V, X, →, ⇢) for a graph G(V, ↣) and a set of monotone,
deterministic functions {Fv | v ∈ V}, the concurrent fixpoint algorithm in Figure 3 is deterministic,
computing the same approximation of the least fixpoint for the given set of functions.
Proof. We use structural induction on the WPO W to show this.
[Base case]: The two cases for the base case are (i) W = v and (ii) W = h x . If W =
v , v is the only vertex in G. Functions are assumed to be deterministic, so applying the function
Fv () in rule NonExit of Figure 3 is deterministic. Because Fv () does not take any arguments, the
computed value Xv is a unique fixpoint of Fv ().
Fig. 4. (a) Directed graph G3. Vertices V are labeled using depth-first numbering (DFN); (b) WPO W3 for G3
with exits X = {x2, x4, x5, x6}.
graph. Exit xh is moved from V′ to X′ on Line 23, scheduling constraints regarding the head h are
added on Line 24, and xh ⇢ h is added on Line 25 to satisfy W5 for the removed back edges.
Example 6.1. Consider the graph G 3 in Figure 4(a). SCC(G 3 ) on Line 1 of ConstructWPOTD re-
turns a trivial SCC with vertex 1 and three non-trivial SCCs with vertex sets {5, 6, 7, 8}, {2, 3},
and {4}. For the trivial SCC, sccWPO on Line 3 returns ( 1 , 1, 1). For the non-trivial SCCs, it
contracted to a single vertex) of the modified graph (V′, ↣′) is acyclic. Therefore, by applying
reasoning similar to that of the base case to this graph of super-nodes, we see that the lemma holds. □
Armed with the above lemma, we now prove that ConstructWPOTD constructs a WPO.
Theorem 6.3. Given a graph G(V, ↣) and its depth-first forest D, the returned value (V, X, →, ⇢)
of ConstructWPOTD(G, D) is a WPO for G.
Proof. We show that the returned value (V, X, →, ⇢) satisfies all properties W1–W5 in
Definition 4.3.
[W1] V equals the vertex set of the input graph, and X consists only of the newly created exits.
[W2] For all exits, xh ⇢ h is added on Lines 12 and 25. These are the only places where stabilization
constraints are created.
[W3] All scheduling constraints are created on Lines 6, 12, and 24.
[W4-H1] (→∗_i, Vi ∪ Xi) is reflexive and transitive by definition. Because the graph with maximal
SCCs contracted to single vertices (super-nodes) is acyclic, scheduling constraints on Line 6 cannot
create a cycle. Also, Line 24 only adds outgoing scheduling constraints and does not create a cycle.
Therefore, (→∗, V ∪ X) is antisymmetric.
[W4-H2] Exactly one stabilization constraint is created per exit on Lines 12 and 25. Because h is
removed from the graph afterwards, it does not become a target of another stabilization constraint.
[W4-H3] By Lemma 6.2, xh ⇢ h implies h →+ xh.
[W4-H4] Because the maximal SCCs on Line 3 are disjoint, by Lemma 6.2, all components
⌊⌊hi, xi⌉⌉_{→∗} are disjoint.
[W4-H5] All additional scheduling constraints going outside of a component have exits as their
sources on Line 6.
[W5] For u ↣ v, either (i) a scheduling constraint is added on Line 6 or 24, or (ii) a stabilization
constraint is added on Line 12 or 25. In the case of (i), one can check that the property holds for
Line 24. For Line 6, if u is a trivially maximal SCC, xi = u. Else, u →∗ xi by Lemma 6.2, and with
the added xi → v, u →+ v. In the case of (ii), u ∈ ⌊⌊h, xh⌉⌉_{→∗} by Lemma 6.2 where v = h. □
The next theorem proves that the WPO constructed by ConstructWPOTD does not include super-
fluous scheduling constraints, which could reduce concurrency during the fixpoint computation.
Theorem 6.4. For a graph G(V, ↣) and its depth-first forest D, the WPO W(V, X, →, ⇢) returned by
ConstructWPOTD(G, D) has the smallest →∗ among the WPOs for G with the same set of ⇢.
If G is not strongly connected, then by the induction hypothesis, the theorem holds for the
returned values of sccWPO on Line 3 for all maximal SCCs. Line 6 only adds the required
scheduling constraints to satisfy W5 and H5 for dependencies between different maximal SCCs. □
Auxiliary data structures rep, exit, and R are initialized on Line 5. The map exit maps an SCC
(represented by its header h) to its corresponding exit x h . Initially, exit[v] is set to v, and updated
on Line 32 when a non-trivial SCC is discovered by the algorithm.
The map R maps a vertex to a set of edges, and is used to handle irreducible graphs [Hecht
and Ullman 1972; Tarjan 1973]. Initially, R is set to ∅, and updated on Line 17. The function
findNestedSCCs relies on the assumption that the graph is reducible. This function follows the
edges backwards to find nested SCCs using rep(p) instead of predecessor p, as on Lines 36 and
43, to avoid repeatedly searching inside the nested SCCs. rep(p) is the unique entry to the nested
SCC that contains the predecessor p if the graph is reducible. To make this algorithm work for
irreducible graphs as well, cross or forward edges are removed from the graph initially by function
removeAllCrossForwardEdges (called on Line 6) to make the graph reducible. Removed edges
are then restored by function restoreCrossForwardEdges (called on Line 10) right before the
edges are used. The graph is guaranteed to remain reducible when a removed edge u ↣ v is restored as
u ↣ rep(v) at the point where h is the lowest common ancestor (LCA) of u and v in the depth-first forest. Cross
and forward edges are removed on Line 16 and are stored at their LCAs in D on Line 17. Then, as h
hits the LCAs in the loop, the removed edges are restored on Line 20. Because the graph edges are
modified, the map O is used to track the original edges. O[v] returns the set of original non-back edges that
now target v after the modification. The map O is initialized on Line 8 and updated on Line 21.
The call to function constructWPOForSCC(h) on Line 11 constructs a WPO for the largest
SCC such that h = arg minv ∈V ′ DF N (D, v), where V ′ is the vertex set of the SCC. For example,
constructWPOForSCC(5) constructs WPO for SCC with vertex set {5, 6, 7, 8}. Because the loop on
Line 9 traverses the vertices in descending DFN, the WPO for a nested SCC is constructed before
that of the enclosing SCC. For example, constructWPOForSCC(6) and constructWPOForSCC(8) are
called before constructWPOForSCC(5), which construct WPOs for nested SCCs with vertex sets
{6, 7} and {8}, respectively. Therefore, constructWPOForSCC(h) reuses the constructed sub-WPOs.
The call to function findNestedSCCs(h) on Line 23 returns Nh, the set of representatives of the nested
SCCs, as well as Ph, the predecessors of h along back edges. If Ph is empty, then the SCC is
trivial, and the function immediately returns on Line 24. Line 26 adds scheduling constraints for
the dependencies crossing the nested SCCs. As in ConstructWPOTD, each such constraint must originate from the exit
of the maximal SCC that contains u but not v′ for u ↣ v′. Because u ↣ v′ is now u ↣ rep(v′), O[v],
where v = rep(v′), is looked up to find u ↣ v′. The map exit is used to find the exit, where rep(u) is the
representative of the maximal SCC that contains u but not v. If the parameter lift is true, a scheduling
constraint targeting v is also added, forcing all scheduling predecessors outside of a component to
be visited before the component’s head. Similarly, function connectWPOsOfMaximalSCCs is called
after the loop on Line 47 to connect WPOs of maximal SCCs. The exit, scheduling constraints to the
exit, and stabilization constraints are added on Lines 29–31. After the WPO is constructed, Line 32
updates the map exit, and Line 33 updates the partition.
Step    updates                   Current partition    Current WPO (elements only)
h=8     -                         1|2|3|4|5|6|7|8      1 2 3 4 5 6 7 8
h=7     -                         1|2|3|4|5|6|7|8      1 2 3 4 5 6 7 8
h=6     -                         1|2|3|4|5|67|8       1 2 3 4 5 6 7 x6 8
h=5     -                         1|2|3|4|5678         1 2 3 4 5 6 7 x6 8 x5
h=4     -                         1|2|3|4|5678         1 2 3 4 x4 5 6 7 x6 8 x5
h=3     -                         1|2|3|4|5678         1 2 3 4 x4 5 6 7 x6 8 x5
h=2     -                         1|23|4|5678          1 2 3 x2 4 x4 5 6 7 x6 8 x5
h=1     {(7, 2), (8, 4)} added    1|23|4|5678          1 2 3 x2 4 x4 5 6 7 x6 8 x5
Final   -                         1|23|4|5678          1 2 3 x2 4 x4 5 6 7 x6 8 x5
Table 1. Steps of ConstructWPOBU for graph G3 (Figure 4). The Current WPO column lists only the
elements of the WPO constructed so far.
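The partition is a disjoint-set structure in which rep(v) is the head of the SCC currently containing v. The following minimal sketch replays the merges that produce the Current partition column of Table 1 for G3. The class and method names are illustrative and are not taken from Algorithm 2; union by rank is omitted for brevity, so this version does not by itself give the almost-linear bound.

```python
class Partition:
    """Disjoint sets whose representative is the head of the enclosing SCC."""

    def __init__(self, vertices):
        self.parent = {v: v for v in vertices}

    def rep(self, v):
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:           # path compression
            self.parent[v], v = root, self.parent[v]
        return root

    def merge(self, head, v):
        """Merge the set of v into the set of head; head's representative wins."""
        self.parent[self.rep(v)] = self.rep(head)

    def show(self):
        groups = {}
        for v in sorted(self.parent):
            groups.setdefault(self.rep(v), []).append(str(v))
        return "|".join("".join(groups[r]) for r in sorted(groups))


partition = Partition(range(1, 9))
steps = [partition.show()]
partition.merge(6, 7)                           # step h = 6: SCC {6, 7}
steps.append(partition.show())
partition.merge(5, 6)                           # step h = 5: SCC {5, 6, 7, 8}
partition.merge(5, 8)
steps.append(partition.show())
partition.merge(2, 3)                           # step h = 2: SCC {2, 3}
steps.append(partition.show())
print(steps)
```

Because vertices are processed in descending DFN, the head passed to merge is always its own representative at that point, which is why the head stays the representative of the merged set.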
Example 6.6. Table 1 describes the steps of ConstructWPOBU for the irreducible graph G 3 (Figure 4).
The updates column shows the modifications to the graph edges, the Current partition column
shows the changes in the disjoint-set data structure, and the Current WPO column shows the
WPO constructed so far. Each row shows the changes made in each step of the algorithm. Row
‘Init’ shows the initialization step on Lines 5–6. Row ‘h = k’ shows the iteration with h = k of the loop
on Lines 9–11. The loop iterates over the vertices in descending DFN: 8, 7, 6, 5, 4, 3, 2, 1. Row ‘Final’
shows the final step after the loop on Line 12.
During initialization, the cross or forward edges {(7, 3), (8, 4)} are removed, making G3 reducible.
These edges are added back as {(7, rep(3)), (8, rep(4))} = {(7, 2), (8, 4)} in h = 1, where 1 is the LCA
of both (7, 3) and (8, 4). G3 remains reducible after restoration. In step h = 5, WPOs for the nested
SCCs are connected with 5 → 6 and x6 → 8. The new exit x5 is created, connected to the WPO via
8 → x5, and x5 ⇢ 5 is added. Finally, 1 → 2, 1 → 5, x5 → 3, x5 → 4, and x2 → 4 are added, connecting
the WPOs for maximal SCCs. If lift is true, scheduling constraint x5 → 2 is also added. ■
A hierarchical total order is a string over the alphabet S augmented with left and right parentheses.
A hierarchical total order of S induces a total order ⪯ over the elements of S. The elements between
two matching parentheses are called a component, and the first element of a component is called
the head. The set of heads of the components containing the element l is denoted by ω(l).
Definition 7.2. A weak topological order (WTO) of a directed graph is a hierarchical total order of
its vertices such that for every edge u → v, either u ≺ v or v ⪯ u and v ∈ ω(u). ■
A WTO factors out a feedback edge set from the graph using the matching parentheses and
topologically sorts the rest of the graph to obtain a total order of vertices. A feedback edge set
defined by a WTO is {(u, v) ∈ ↣ | v ⪯ u and v ∈ ω(u)}.
Example 7.3. The WTO of graph G 1 in Figure 2(a) is 1 (2 (3 4) (6 7 9 8) 5) 10. The feedback edge
set defined by this WTO is {(4, 3), (8, 6), (5, 2)}, which is the same as that defined by WPO W1 . ■
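Definition 7.2 can be checked directly from the parenthesized string. The sketch below parses a WTO, computes the position of every vertex and the set ω of heads enclosing it, and returns the feedback edge set, raising an error if an edge violates the definition. The edge list is an assumption, since Figure 2(a) is not reproduced in this text: it is a hypothetical graph chosen to be consistent with the WTO of Example 7.3. Function names are illustrative.

```python
def parse_wto(text):
    """Return (position, heads_of) for a WTO written with parentheses."""
    position, heads_of = {}, {}
    open_heads = []                 # heads of the components currently open
    next_is_head = False
    for token in text.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            next_is_head = True
        elif token == ")":
            open_heads.pop()
        else:
            vertex = int(token)
            if next_is_head:        # first element of a component is its head
                open_heads.append(vertex)
                next_is_head = False
            position[vertex] = len(position)
            heads_of[vertex] = set(open_heads)   # omega(vertex)
    return position, heads_of


def feedback_edges(edges, position, heads_of):
    """Edges u -> v with v before-or-equal u; each must satisfy v in omega(u)."""
    back = set()
    for u, v in edges:
        if position[u] < position[v]:
            continue                # forward in the total order
        if v not in heads_of[u]:
            raise ValueError(f"edge {u} -> {v} violates the WTO condition")
        back.add((u, v))
    return back


WTO = "1 (2 (3 4) (6 7 9 8) 5) 10"
# Hypothetical stand-in for the edges of G1 (assumed, not from Figure 2(a)).
EDGES = [(1, 2), (2, 3), (3, 4), (4, 3), (4, 5), (2, 6), (6, 7), (6, 9),
         (7, 8), (9, 8), (8, 6), (8, 5), (5, 2), (5, 10)]

position, heads_of = parse_wto(WTO)
back = feedback_edges(EDGES, position, heads_of)
print(sorted(back))
```

For this assumed edge list the computed set is {(4, 3), (8, 6), (5, 2)}, the feedback edge set stated in Example 7.3.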
Algorithm 3 presents a top-down recursive algorithm for constructing a WTO for a graph G
and its depth-first forest D. Notice the use of increasing post DFN order when merging the results
on Line 5. In general, a reverse post DFN order of a graph is its topological order. Therefore,
ConstructWTOTD , in effect, topologically sorts the DAG of SCCs recursively. Because it is recursive,
it preserves the components and their nesting relationship. Furthermore, by observing the corre-
spondence between ConstructWTOTD and ConstructWPOTD , we see that ConstructWTOTD (G, D) and
ConstructWPOTD (G, D, ∗) construct the same components with same heads and nesting relationship.
The definition of HPO (Definition 4.1) generalizes the definition of hierarchical total order
(Definition 7.1) to partial orders, while the definition of WPO (Definition 4.3) generalizes the
definition of WTO (Definition 7.2) to partial orders. In other words, the two definitions define the
same structure if we strengthen H1 to a total order. If we view the exits in X as closing parentheses
“)” and x ⇢ h as matching parentheses (h . . . ), the correspondence between the two definitions
becomes clear. The conditions that a hierarchical total order must be well-parenthesized and that it
disallows two consecutive “(” correspond to conditions H4, H2, and W1. While H5 is not specified in
Bourdoncle’s definition, it directly follows from the fact that ⪯ is a total order. Finally, the condition
in the definition of WTO matches W5. Thus, using the notion of WPO, we can define a WTO as:
Definition 7.4. A weak topological order (WTO) for a graph G(V, ↣) is a WPO W(V, X, →, ⇢)
for G where (V ∪ X, →∗) is a total order. ■
Definition 7.4 hints at how a WTO for a graph can be constructed from a WPO. The key is
to construct a linear extension of the partial order (V ∪ X, →∗) of the WPO, while ensuring that
properties H1, H4, and H5 continue to hold. ConstructWTOBU (Algorithm 4) uses the above insight
to construct a WTO of G in almost-linear time, as proved by the following two theorems.
Theorem 7.5. Given a directed graph G and its depth-first forest D, the returned value (V, X, →, ⇢)
of ConstructWTOBU(G, D) is a WTO for G.
Proof. The call ConstructWPOBU(G, D, true) on Line 1 constructs a WPO for G. With lift set to
true in ConstructWPOBU, all scheduling predecessors outside of a component are visited before
the head of the component. The algorithm then visits the vertices in topological order according
to →. Thus, the additions to → on Line 10 do not violate H1 and lead to a total order (V ∪ X, →∗).
Furthermore, because a stack is used as the worklist and because of H5, once a head of a component
is visited, no element outside the component is visited until all elements in the component are
visited. Therefore, the additions to → on Line 10 preserve the components and their nesting
relationship, satisfying H4. Because the exit is the last element visited in the component, no
scheduling constraint is added from inside of the component to outside, satisfying H5. Thus,
ConstructWTOBU (G, D) constructs a WTO for G. □
Example 7.6. For graph G1 in Figure 2(a), ConstructWTOBU(G1, D) returns the following WTO:
1 2 3 4 x3 6 7 9 8 x6 5 x2 10
ConstructWTOBU first constructs the WPO W1 in Figure 2(b) using ConstructWPOBU. The partial
order →∗ of W1 is extended to a total order by adding x3 → 6 and 7 → 9. The components and
their nesting relationship in W1 are preserved in the constructed WTO. This WTO is equivalent to
1 (2 (3 4) (6 7 9 8) 5) 10 in Bourdoncle’s representation (see Definition 7.2 and Example 7.3). ■
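The equivalence between the two representations can be made concrete with a few lines of code. The sketch below rewrites a totally ordered WPO sequence into Bourdoncle's parenthesized form by opening a parenthesis at each head and closing one at each exit. It assumes, purely as a naming convention for this illustration, that the exit of the component with head h is written xh; the function name is hypothetical.

```python
def to_bourdoncle(tokens):
    """Rewrite a WTO given as a sequence with exits into parenthesized form.

    Assumes exits are spelled 'x' + head, e.g. 'x3' closes the component of 3.
    """
    heads = {t[1:] for t in tokens if t.startswith("x")}
    parts = []
    for t in tokens:
        if t.startswith("x"):
            parts[-1] += ")"        # an exit closes the innermost open component
        elif t in heads:
            parts.append("(" + t)   # a head opens its component
        else:
            parts.append(t)
    return " ".join(parts)


sequence = "1 2 3 4 x3 6 7 9 8 x6 5 x2 10".split()
print(to_bourdoncle(sequence))
```

Applied to the sequence of Example 7.6, this yields 1 (2 (3 4) (6 7 9 8) 5) 10.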
Theorem 7.8. Given a directed graph G and its depth-first forest D, ConstructWTOBU (G, D) and
ConstructWTOTD (G, D) construct the same WTO for G.
Proof. As shown above, ConstructWTOTD(G, D) and ConstructWPOTD(G, D) construct the same
components with same heads and nesting relationship. Therefore, using Theorem 6.7, we can
conclude that ConstructWTOBU (G, D) and ConstructWTOTD (G, D) construct the same WTO. □
The following theorem shows that our concurrent fixpoint algorithm in Figure 3 computes the
same fixpoint as Bourdoncle’s sequential fixpoint algorithm:
Theorem 7.9. The fixpoint computed by the concurrent fixpoint algorithm in Figure 3 using the
WPO constructed by Algorithm 2 is the same as the one computed by Bourdoncle’s sequential
algorithm that uses the recursive iteration strategy.
Proof. With both the stabilization constraint, ⇢, and matching parentheses, (. . . ), interpreted as the
“iteration until stabilization” operator, our concurrent iteration strategy for ConstructWTOBU (G, D)
computes the same fixpoint as Bourdoncle’s recursive iteration strategy for ConstructWTOTD (G, D).
The only change we make to a WPO in ConstructWTOBU is adding more scheduling constraints.
Further, Theorem 5.6 proved that our concurrent iteration strategy is deterministic. Thus, our
concurrent iteration strategy computes the same fixpoint when using the WPO constructed by
either ConstructWTOBU (G, D) or ConstructWPOBU (G, D, false). Therefore, our concurrent fixpoint
algorithm in Figure 3 computes the same fixpoint as Bourdoncle’s sequential fixpoint algorithm. □
8 IMPLEMENTATION
Our deterministic parallel abstract interpreter, which we called Pikos, was built using IKOS [Brat
et al. 2014], an abstract-interpretation framework for C/C++ based on LLVM.
Sequential baseline IKOS. IKOS performs interprocedural analysis to compute invariants for all
program points, and can detect and prove the absence of runtime errors in programs. To compute
the fixpoint for a function, IKOS constructs the WTO of the CFG of the function and uses Bour-
doncle’s recursive iteration strategy [Bourdoncle 1993]. Context sensitivity during interprocedural
analysis is achieved by performing dynamic inlining during fixpoint: formal and actual parameters
are matched, the callee is analyzed, and the return value at the call site is updated after the callee
returns. This inlining also supports function pointers by resolving the set of possible callees and
joining the results.
Pikos. We modified IKOS to implement our deterministic parallel abstract interpreter using Intel’s
Threading Building Blocks (TBB) library [Intel 2019]. We implemented the almost-linear time
algorithm for WPO construction (§6). We implemented the deterministic parallel fixpoint iterator
(§5) using TBB’s parallel_do. Multiple callees at an indirect call site are analyzed in parallel using
TBB’s parallel_reduce. We refer to this extension of IKOS as Pikos; we use Pikos⟨k⟩ to refer to
the instantiation of Pikos that uses up to k threads.
Path-based task spawning in Pikos. Pikos relies on TBB’s tasks to implement the parallel
fixpoint iterator. Our initial implementation would spawn a task for each WPO element when it is
ready to be scheduled. Such a naive approach resulted in Pikos being slower than IKOS; there were
10 benchmarks where speedup of Pikos⟨2⟩ was below 0.90x compared to IKOS, with a minimum
speedup of 0.74x. To counter such behavior, we implemented a simple path-based heuristic for
spawning tasks during fixpoint computation. We assign ids to each element in the WPO W as
follows: assign id 1 to the elements along the longest path in W , remove these elements from W
and assign id 2 to the elements along the longest path in the resulting graph, and so on. The length
of the path is based on the number of instructions as well as the size of the functions called along
the path. During the fixpoint computation, a new task is spawned only if the id of the current
element differs from that of the successor that is ready to be scheduled. Consequently, elements
along critical paths are executed in the same task.
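The id assignment described above amounts to repeatedly extracting the heaviest remaining path from the DAG of scheduling constraints. The sketch below is one way to do this and is not the Pikos implementation: the weights stand in for the instruction-count and callee-size estimate, the element names and weights are made up, and all function names are hypothetical.

```python
def heaviest_path(order, succs, weight):
    """Heaviest path in a DAG; `order` lists its nodes topologically."""
    best, following = {}, {}
    for u in reversed(order):                 # successors are handled first
        best[u], following[u] = weight[u], None
        for w in succs[u]:
            if weight[u] + best[w] > best[u]:
                best[u], following[u] = weight[u] + best[w], w
    node = max(order, key=best.get)
    path = []
    while node is not None:
        path.append(node)
        node = following[node]
    return path


def assign_path_ids(order, sched, weight):
    """Id 1 for the heaviest path, id 2 for the heaviest path in the rest, ..."""
    ids, remaining, next_id = {}, list(order), 1
    while remaining:
        alive = set(remaining)
        succs = {u: [w for a, w in sched if a == u and w in alive]
                 for u in remaining}
        for u in heaviest_path(remaining, succs, weight):
            ids[u] = next_id
        remaining = [u for u in remaining if u not in ids]
        next_id += 1
    return ids


def spawns_new_task(ids, current, successor):
    """A task is spawned only when the successor lies on a different path."""
    return ids[current] != ids[successor]


# Made-up diamond-shaped example: a -> b -> d and a -> c -> d.
ORDER = ["a", "b", "c", "d"]
SCHED = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
WEIGHT = {"a": 1, "b": 5, "c": 2, "d": 1}

ids = assign_path_ids(ORDER, SCHED, WEIGHT)
print(ids)
```

In this example a, b, and d share id 1 and therefore run in one task, while c receives id 2, so only the step from a to c spawns a new task.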
Memory allocator for concurrency. We experimented with three memory allocators optimized
for parallelism: Tcmalloc [Google 2019], Jemalloc [Evans 2019], and Tbbmalloc [Intel 2019]. Tcmalloc
was chosen because it performed the best in our settings for both Pikos and IKOS.
Abstract domain. Our fixpoint computation algorithm is orthogonal to the abstract domain in use.
Pikos⟨k⟩ works for all abstract domains provided by IKOS as long as the domain is thread-safe.
These abstract domains include interval [Cousot and Cousot 1977], congruence [Granger 1989],
gauge [Venet 2012], and DBM [Miné 2001]. Variable-packing domains [Gange et al. 2016] could
not be used because their implementations were not thread-safe. We intend to explore thread-safe
implementations for these domains in the future.
9 EXPERIMENTAL EVALUATION
In this section, we study the runtime performance of Pikos (§8) on a large set of C programs using
IKOS as the baseline. The experiments were designed to answer the following questions:
RQ0 [Determinism] Is Pikos deterministic? Is the fixpoint computed by Pikos the same as that
computed by IKOS?
RQ1 [Performance] How does the performance of Pikos⟨4⟩ compare to that of IKOS?
RQ2 [Scalability] How does the performance of Pikos⟨k⟩ scale as we increase the number of
threads k?
Platform. All experiments were run on Amazon EC2 C5 instances, which use 3.00 GHz Intel Xeon Platinum
8124M CPUs. IKOS and Pikos⟨k⟩ with 1 ≤ k ≤ 4 were run on c5.2xlarge (8 vCPUs, 4 physical cores,
16GB memory), Pikos⟨k⟩ with 5 ≤ k ≤ 8 on c5.4xlarge (16 vCPUs, 8 physical cores, 32GB memory),
and Pikos⟨k⟩ with 9 ≤ k on c5.9xlarge (36 vCPUs, 18 physical cores, 72GB memory). Dedicated
EC2 instances and BenchExec [Beyer et al. 2019] were used to improve reliability of timing results.
The Linux kernel version was 4.4, and gcc 8.1.0 was used to compile Pikos⟨k⟩ and IKOS.
Abstract Domain. We experimented with both interval and gauge domain, and the analysis
precision was set to track immediate values, pointers, and memory. The results were similar for
both interval and gauge domain. We show the results using the interval domain. Because we are
only concerned with the time taken to perform fixpoint computation, we disabled program checks,
such as buffer-overflow detection, in both IKOS and Pikos.
Benchmarks. We chose 4319 benchmarks from the following two sources:
SVC We selected all 2701 benchmarks from the Linux, control-flows, and loops categories of
SV-COMP 2019 [Beyer 2019]. These categories are well suited for numerical analysis, and
have been used in recent work [Singh et al. 2018a,b]. Programs from these categories have
indirect function calls with multiple callees at a single call site, large switch statements,
nested loops, and irreducible CFGs.
OSS We selected all 1618 programs from the Arch Linux core packages that are primarily written
in C and whose LLVM bitcode is obtainable by gllvm [gllvm 2019]. These include, but are
not limited to, apache, coreutils, cscope, curl, dhcp, fvwm, gawk, genius, ghostscript,
gnupg, iproute, ncurses, nmap, openssh, postfix, r, socat, vim, and wget.
We checked that the time taken by IKOS and Pikos⟨1⟩ was the same; thus, any speedup achieved
by Pikos⟨k⟩ is due to parallelism in the fixpoint computation. Note that the time taken for WTO
and WPO construction is very small compared to actual fixpoint computation, which is why
Pikos⟨1⟩ does not outperform IKOS. The almost-linear algorithm for WPO construction (§6.2)
is an interesting theoretical result, which shows a new connection between the algorithms of
Bourdoncle and Ramalingam. However, the practical impact of the new algorithm is in preventing
stack overflow in the analyzer that occurs when using a recursive implementation of Bourdoncle’s
WTO construction algorithm. Such stack overflows occur when analyzing SV-COMP benchmarks
as well as production code [Crab 2018; ReDex 2017].
Fig. 5. Log-log scatter plot of analysis time taken by IKOS and Pikos⟨4⟩ on 1017 benchmarks. Speedup is
defined as the analysis time of IKOS divided by analysis time of Pikos⟨4⟩. 1.00x, 2.00x, and 4.00x speedup
lines are shown. Benchmarks that took longer to analyze in IKOS tended to have higher speedup.
There were 130 benchmarks for which IKOS took longer than 4 hours. To include these bench-
marks, we made the following modification to the dynamic function inliner, which implements
the context sensitivity in interprocedural analysis in both IKOS and Pikos: if the call depth during
the dynamic inlining exceeds the given limit, the analysis returns ⊤ for that callee. For each of
the 130 benchmarks, we determined the largest limit for which IKOS terminated within 4 hours.
Because our experiments are designed to understand the performance improvement in fixpoint
computation, we felt this was a reasonable thing to do.
(a) 0% ~ 25% (5.02 seconds ~ 16.01 seconds) (b) 25% ~ 50% (16.04 seconds ~ 60.45 seconds)
(c) 50% ~ 75% (60.85 seconds ~ 508.14 seconds) (d) 75% ~ 100% (508.50 seconds ~ 14368.70 seconds)
Fig. 6. Histograms of speedup of Pikos⟨4⟩ for different ranges. Figure 6(a) shows the distribution of benchmarks
that took from 5.02 seconds to 16.01 seconds in IKOS. They are the bottom 25% in terms of the analysis time
in IKOS. The distribution tended toward a higher speedup in the upper range.
respectively. Total speedup of all the benchmarks was 2.16x. As we see in Figure 5, benchmarks
for which IKOS took longer to analyze tended to have greater speedup in Pikos⟨4⟩. Top 25%
benchmarks in terms of the analysis time in IKOS had higher averages than the total benchmarks,
with arithmetic, geometric, and harmonic mean of 2.38x, 2.29x, and 2.18x, respectively. Table 2
shows the speedups for the five benchmarks with the highest speedup and the longest analysis
time in IKOS.
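The summary statistics reported here can be reproduced from per-benchmark timings as follows. This is a sketch under assumptions: the timings are made up and are not measurements from the evaluation, and "total speedup" is taken to mean total baseline time divided by total parallel time, which is an interpretation rather than a definition stated in this text.

```python
from statistics import geometric_mean, harmonic_mean, mean


def summarize_speedups(baseline_seconds, parallel_seconds):
    """Per-benchmark speedup = baseline time / parallel time."""
    speedups = [b / p for b, p in zip(baseline_seconds, parallel_seconds)]
    return {
        "arithmetic": mean(speedups),
        "geometric": geometric_mean(speedups),
        "harmonic": harmonic_mean(speedups),
        # Assumed definition: ratio of summed times, dominated by long runs.
        "total": sum(baseline_seconds) / sum(parallel_seconds),
    }


# Made-up timings in seconds, for illustration only.
summary = summarize_speedups([10.0, 40.0, 90.0], [10.0, 20.0, 30.0])
print(summary)
```

For positive values the arithmetic mean is never below the geometric mean, which is never below the harmonic mean, matching the ordering of the three reported averages. Under the assumed definition, a total speedup above the arithmetic mean indicates that the longer-running benchmarks gain more.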
Figure 6 provides details about the distribution of the speedup achieved by Pikos⟨4⟩. Frequency
on y-axis represents the number of benchmarks that have speedups in the bucket on x-axis. A
bucket size of 0.25 is used, ranging from 0.75 to 3.75. Benchmarks are divided into 4 ranges using the
analysis time in IKOS, where 0% represents the benchmark with the shortest analysis time in IKOS
and 100% represents the longest. The longer the analysis time was in IKOS (higher percentile), the
more the distribution tended toward a higher speedup. The most frequent bucket was 1.25x-1.50x
with frequency of 52 for the range 0% ~ 25%. For the range 25% ~ 50%, it was 1.50x-1.75x with
frequency of 45; for the range 50% ~ 75%, 3.00x-3.25x with frequency of 38; and for the range 75% ~
100%, 3.00x-3.25x with frequency of 50. Overall, 533 benchmarks out of 1017 (52.4%) had speedups
over 2.00x. The number of benchmarks with more than 3.00x speedup was 106 out of 1017 (10.4%).
Fig. 7. Box and violin plot for speedup of Pikos⟨k⟩ with k ∈ {2, 4, 6, 8}.
Benchmarks with high speedup contained code with large switch statements nested inside loops.
For example, ratpoison-1.4.9/ratpoison, a tiling window manager, had an event-handling loop
that dispatches events using a switch statement with 15 cases. Each switch case called an event
handler that contained further branches, exposing more parallelism. Most of the analysis time for
this benchmark was spent in this loop. On the other hand, benchmarks with low speedup usually
had a single dominant execution path. An example of such a benchmark is xlockmore-5.56/xlock,
a program that locks the local X display until a password is entered.
The maximum speedup gained by parallelism in Pikos⟨4⟩ was 3.63x, where 4.00x is the maximum
possible speedup. The arithmetic, geometric, and harmonic means of the speedup were 2.06x, 1.95x,
and 1.84x, respectively. The performance was generally better for the benchmarks for which IKOS
took longer to analyze.
Fig. 8(a). Speedup of Pikos⟨k⟩ for 3 benchmarks with different scalability coefficients. The lines show the linear regressions of these benchmarks.
Fig. 8(b). Distribution of scalability coefficients for 1017 benchmarks. (x, y) in the plot means that y benchmarks have a scalability coefficient of at least x.
                                                                          Speedup of Pikos⟨k⟩
Benchmark (5 out of 1017. Criteria: Scalability)   Src.   IKOS (s)   k = 4   k = 8   k = 12   k = 16
audit-2.8.4/aureport                               OSS      684.29   3.63x   6.57x    9.02x   10.97x
feh-3.1.3/feh.bc                                   OSS     9004.83   3.55x   6.57x    8.33x    9.39x
ratpoison-1.4.9/ratpoison                          OSS     1303.73   3.36x   5.65x    5.69x    5.85x
ldv-linux-4.2-rc1/32_7a-net-ethernet-intel-igb     SVC     1206.27   3.10x   5.44x    5.71x    6.46x
ldv-linux-4.2-rc1/08_1a-net-wireless-mwifiex       SVC    10224.21   3.12x   5.35x    6.20x    6.64x
Table 3. Five benchmarks with the highest scalability out of 1017 benchmarks.
Figure 7 shows the box and violin plots for the speedup obtained by Pikos⟨k⟩, k ∈ {2, 4, 6, 8}. Box
plots show the quartiles and the outliers, and violin plots show the estimated distribution of the
observed speedups. The box plot [Tukey 1977] on the left summarizes the distribution of the results
for each k using the lower inner fence (Q1 − 1.5 ∗ (Q3 − Q1)), first quartile (Q1), median, third quartile
(Q3), and upper inner fence (Q3 + 1.5 ∗ (Q3 − Q1)). Data beyond the inner fences (outliers) are
plotted as individual points. The box plot reveals that while the benchmarks above the median (middle
line in the box) scaled, speedups for the ones below the median saturated. The violin plot [Hintze and
Nelson 1998] on the right supplements the box plot by plotting the probability density of the results
between the minimum and maximum. In the best case, the speedup scaled from 1.77x at k = 2 to
3.63x, 5.07x, and 6.57x at k = 4, 6, and 8. For k = 2, 4, 6, and 8, the arithmetic means were 1.48x,
2.06x, 2.26x, and 2.46x; the geometric means were 1.46x, 1.95x, 2.07x, and 2.20x; and the harmonic
means were 1.44x, 1.84x, 1.88x, and 1.98x.
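The inner fences used by the box plot can be computed as in the following Python sketch. It relies on `statistics.quantiles` from the standard library (Python 3.8+) with the "inclusive" method; the choice of quartile convention is an assumption, since the text does not say which one was used, and the sample values are hypothetical.

```python
import statistics


def inner_fences(values):
    """Return (lower, upper) inner fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values):
    """Values lying beyond the inner fences, plotted as individual points."""
    lower, upper = inner_fences(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical values chosen so that one point is an obvious outlier.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
lower, upper = inner_fences(sample)
print(lower, upper, outliers(sample))
```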
Fig. 9. CFG of function per_event_detailed in aureport for which Pikos had the maximum scalability.
This function calls the appropriate handler based on the event type. This function is called inside a loop.
To better measure the scalability of Pikos⟨k⟩ for individual benchmarks, we define a scalability
coefficient as the slope of the linear regression of the speedups on the number of threads. The
maximum scalability coefficient is 1, meaning that the speedup increases linearly with the number
of threads. If the scalability coefficient is 0, the speedup is the same regardless of the number of
threads used. If it is negative, the speedup decreases as the number of threads increases. The
measured scalability coefficients are shown in Figure 8. Figure 8(a) illustrates benchmarks exhibiting
different scalability coefficients. For the benchmark with coefficient 0.79, the speedup of Pikos
increases by roughly 4, from 2x to 6x, with 6 more threads. For the benchmark with coefficient 0, the
speedup does not increase with more threads. Figure 8(b) shows the distribution of scalability
coefficients for all benchmarks. From this plot we can infer, for instance, that 124 benchmarks have
a scalability coefficient of at least 0.4. For these benchmarks, the speedup increased by at least 2
when 5 more threads were given.
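As an illustration of the scalability coefficient, the following Python sketch computes the least-squares slope of speedup against thread count. The function name is hypothetical, and feeding it the best-case speedups quoted in the text for k = 2, 4, 6, 8 is an assumption about which data points enter the regression.

```python
def scalability_coefficient(threads, speedups):
    """Slope of the ordinary least-squares line fitting speedup to thread count."""
    n = len(threads)
    mean_t = sum(threads) / n
    mean_s = sum(speedups) / n
    covariance = sum((t - mean_t) * (s - mean_s) for t, s in zip(threads, speedups))
    variance = sum((t - mean_t) ** 2 for t in threads)
    if variance == 0:
        raise ValueError("need at least two distinct thread counts")
    return covariance / variance


# Best-case speedups quoted in the text for k = 2, 4, 6, 8.
best_case = scalability_coefficient([2, 4, 6, 8], [1.77, 3.63, 5.07, 6.57])
# A hypothetical benchmark whose speedup does not change with more threads.
flat = scalability_coefficient([2, 4, 6, 8], [1.2, 1.2, 1.2, 1.2])
print(f"best case: {best_case:.2f}, flat: {flat:.2f}")
```

Under these assumptions the best-case slope comes out at about 0.79, consistent with the coefficient reported for that benchmark, and the flat series yields a coefficient of 0.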
Table 3 shows the speedup of Pikos⟨k⟩ for k ≥ 4 for a selection of five benchmarks that had the
highest scalability coefficient in the prior experiment. In particular, we wanted to explore the limits
of scalability of Pikos⟨k⟩ for this smaller selection of benchmarks. With scalability coefficient 0.79,
the speedup of audit-2.8.4/aureport reached 10.97x using 16 threads. This program is a tool
that produces summary reports of the audit system logs. Like ratpoison, it has an event-handler
loop consisting of a large switch statement as shown in Figure 9.
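One way to read Table 3 is to divide each speedup by the number of threads, giving the fraction of the maximum possible speedup that was achieved. This ratio is a derived quantity, not a metric reported in the evaluation. The Python sketch below uses the Table 3 values; the names `TABLE3` and `fraction_of_ideal` are hypothetical.

```python
# Speedups of Pikos<k> copied from Table 3, keyed by benchmark and thread count k.
TABLE3 = {
    "audit-2.8.4/aureport": {4: 3.63, 8: 6.57, 12: 9.02, 16: 10.97},
    "feh-3.1.3/feh.bc": {4: 3.55, 8: 6.57, 12: 8.33, 16: 9.39},
    "ratpoison-1.4.9/ratpoison": {4: 3.36, 8: 5.65, 12: 5.69, 16: 5.85},
    "ldv-linux-4.2-rc1/32_7a-net-ethernet-intel-igb": {4: 3.10, 8: 5.44, 12: 5.71, 16: 6.46},
    "ldv-linux-4.2-rc1/08_1a-net-wireless-mwifiex": {4: 3.12, 8: 5.35, 12: 6.20, 16: 6.64},
}


def fraction_of_ideal(speedup, threads):
    """Achieved speedup divided by the maximum possible speedup with `threads` threads."""
    return speedup / threads


efficiency = {
    name: {k: fraction_of_ideal(s, k) for k, s in row.items()}
    for name, row in TABLE3.items()
}

for name, row in efficiency.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```

For aureport the ratio declines from about 0.91 at k = 4 to about 0.69 at k = 16, whereas for ratpoison it drops to about 0.37 at k = 16, reflecting the saturation visible in the table.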
In the best case, the speedup of Pikos⟨k⟩ scaled from 1.77x to 3.63x, 5.07x, and 6.57x with k =
2, 4, 6, and 8. For this benchmark, the speedup reached 10.97x with 16 threads. The scalability
depends on the structure of the analyzed program: programs with multiple paths of similar
length exhibit high scalability.
10 RELATED WORK
Since its publication in 1993, Bourdoncle’s algorithm [Bourdoncle 1993] has become the de facto
approach to solving equations in abstract interpretation. Many advances have been developed since,
but they rely on Bourdoncle’s algorithm; in particular, different ways of intertwining widening and
narrowing during fixpoint computation with an aim to improve precision [Amato and Scozzari
2013; Amato et al. 2016; Halbwachs and Henry 2012].
C Global Surveyor (CGS) [Venet and Brat 2004], which performed array bounds checking, was the
first attempt at distributed abstract interpretation. It performed distributed batch processing, and a
relational database was used for both storage and communication between processes. As a result, the
communication costs were too high, and the analysis did not scale beyond four CPUs.
Monniaux [2005] describes a parallel implementation of the ASTRÉE analyzer [Cousot et al.
2005]. It relies on dispatch points that divide the control flow between long executions, which are
found in embedded applications; the tool analyzes these two control paths in parallel. Unlike our
approach, this parallelization technique is not applicable to programs with irreducible CFGs. The
particular parallelization strategy can also lead to a loss in precision. The experimental evaluation
found that the analysis does not scale beyond 4 processors.
Dewey et al. [2015] present a parallel static analysis for JavaScript by dividing the analysis into
an embarrassingly parallel reachability computation on a state transition system, and a strategy for
selectively merging states during that reachability computation.
Prior work has explored the use of parallelism for specific program analyses. BOLT [Albarghouthi
et al. 2012] uses a map-reduce framework to parallelize a top-down analysis for use in verification
and software model checking. Graspan [Wang et al. 2017] implements a single-machine disk-based
graph system to solve graph reachability problems for interprocedural static analysis. Graspan is
not a generic abstract interpreter, and solves data-flow analyses in the IFDS framework [Reps et al.
1995]. Su et al. [2014] describe a parallel points-to analysis via CFL-reachability. Garbervetsky et al.
[2017] use an actor model to implement a distributed call-graph analysis.
McPeak et al. [2013] parallelize the Coverity Static Analyzer [Bessey et al. 2010] to run on an
8-core machine by mapping each function to its own work unit. Tricorder [Sadowski et al. 2015]
is a cloud-based static-analysis platform used at Google. It supports only simple, intraprocedural
analyses (such as code linters), and is not designed for distributed whole-program analysis.
Sparse analysis [Oh et al. 2014, 2012] and database-backed analysis [Weiss et al. 2015] are
orthogonal approaches that reduce the memory cost of static analysis. Newtonian program
analysis [Reps 2018; Reps et al. 2017] provides an alternative to Kleene iteration used in this paper.
11 CONCLUSION
We presented a generic, parallel, and deterministic algorithm for computing a fixpoint of an
equation system for abstract interpretation. The iteration strategy used for fixpoint computation is
constructed from a weak partial order (WPO) of the dependency graph of the equation system. We
described an axiomatic and constructive characterization of WPOs, as well as an efficient almost-
linear time algorithm for constructing a WPO. This new notion of WPO generalizes Bourdoncle’s
weak topological order (WTO). We presented a linear-time algorithm to construct a WTO from a
WPO, which results in an almost-linear algorithm for WTO construction given a directed graph.
The previously known algorithm for WTO construction had a worst-case cubic time-complexity.
We also showed that the fixpoint computed using the WPO-based parallel fixpoint algorithm is the
same as that computed using the WTO-based sequential fixpoint algorithm.
We presented Pikos, our implementation of a WPO-based parallel abstract interpreter. Using a
suite of 1017 open-source programs and SV-COMP 2019 benchmarks, we compared the performance
of Pikos against the IKOS abstract interpreter. Pikos⟨4⟩ achieves an average speedup of 2.06x over
IKOS, with a maximum speedup of 3.63x. Pikos⟨4⟩ showed greater than 2.00x speedup for 533
benchmarks (52.4%) and greater than 3.00x speedup for 106 benchmarks (10.4%). Pikos⟨4⟩ exhibits
a larger speedup when analyzing programs that took longer to analyze using IKOS. Pikos achieved
an average speedup of 1.73x on programs for which IKOS took less than 16 seconds, and
2.38x on programs for which IKOS took more than 508 seconds.
The scalability of Pikos depends on the structure of the program being analyzed, with Pikos⟨16⟩
exhibiting a maximum speedup of 10.97x.
ACKNOWLEDGMENTS
The authors would like to thank Maxime Arthaud for help with IKOS. This material is based upon
work supported by a Facebook Testing and Verification research award, and AWS Cloud Credits
for Research.
REFERENCES
Aws Albarghouthi, Rahul Kumar, Aditya V Nori, and Sriram K Rajamani. 2012. Parallelizing top-down interprocedural
analyses. In ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI). ACM.
Gianluca Amato and Francesca Scozzari. 2013. Localizing Widening and Narrowing. In Static Analysis Symposium (SAS)
(Lecture Notes in Computer Science), Vol. 7935. Springer, 25–42.
Gianluca Amato, Francesca Scozzari, Helmut Seidl, Kalmer Apinis, and Vesal Vojdani. 2016. Efficiently intertwining widening
and narrowing. Science of Computer Programming (SCP) 120 (2016), 1–24.
Gogul Balakrishnan, Malay K. Ganai, Aarti Gupta, Franjo Ivancic, Vineet Kahlon, Weihong Li, Naoto Maeda, Nadia
Papakonstantinou, Sriram Sankaranarayanan, Nishant Sinha, and Chao Wang. 2010. Scalable and precise program
analysis at NEC. In Formal Methods in Computer-Aided Design (FMCAD).
Thomas Ball, Byron Cook, Vladimir Levin, and Sriram K Rajamani. 2004. SLAM and Static Driver Verifier: Technology
transfer of formal methods inside Microsoft. In Integrated formal methods.
Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott
McPeak, and Dawson Engler. 2010. A few billion lines of code later: using static analysis to find bugs in the real world.
Communications of the ACM (CACM) 53, 2 (2010), 66–75.
Dirk Beyer. 2019. Automatic Verification of C and Java Programs: SV-COMP 2019. In Tools and Algorithms for the Construction
and Analysis of Systems, Dirk Beyer, Marieke Huisman, Fabrice Kordon, and Bernhard Steffen (Eds.). Springer International
Publishing, Cham, 133–155.
Dirk Beyer, Stefan Löwe, and Philipp Wendler. 2019. Reliable benchmarking: requirements and solutions. International
Journal on Software Tools for Technology Transfer 21, 1 (01 Feb 2019), 1–29. https://doi.org/10.1007/s10009-017-0469-y
François Bourdoncle. 1993. Efficient chaotic iteration strategies with widenings. In Formal Methods in Programming and
Their Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, 128–141.
Guillaume Brat, Jorge A Navas, Nija Shi, and Arnaud Venet. 2014. IKOS: A framework for static analysis based on abstract
interpretation. In International Conference on Software Engineering and Formal Methods. Springer, 271–277.
Guillaume Brat and Arnaud Venet. 2005. Precise and scalable static program analysis of NASA flight software. In 2005 IEEE
Aerospace Conference. IEEE, 1–10.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms, Third Edition
(3rd ed.). The MIT Press.
Patrick Cousot. 1977. Asynchronous iterative methods for solving a fixpoint system of monotone equations. Technical Report.
Research Report IMAG-RR-88, Université Scientifique et Médicale de Grenoble.
Patrick Cousot. 2015. Abstracting induction by extrapolation and interpolation. In Verification, Model Checking, and Abstract
Interpretation (VMCAI). Springer.
P. Cousot and R. Cousot. 1976. Static Determination of Dynamic Properties of Programs. In International Symposium on
Programming. Paris.
P. Cousot and R. Cousot. 1977. Abstract interpretation: a unified lattice model for static analysis of programs by construction
or approximation of fixpoints. In Conference Record of the Fourth Annual ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages. ACM Press, New York, NY, Los Angeles, California, 238–252.
Patrick Cousot, Radhia Cousot, Jerôme Feret, Laurent Mauborgne, Antoine Miné, David Monniaux, and Xavier Rival. 2005.
The ASTRÉE analyzer. In European Symposium on Programming (ESOP), Vol. 5. 21–30.
Patrick Cousot, Roberto Giacobazzi, and Francesco Ranzato. 2019. A²I: abstract² interpretation. PACMPL 3, POPL (2019),
42:1–42:31. https://doi.org/10.1145/3290355
Patrick Cousot and Nicolas Halbwachs. 1978. Automatic discovery of linear restraints among variables of a program. In
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). ACM, 84–96.
Crab. 2018. Possibly stack overflow while computing WTO of a large CFG. https://github.com/seahorn/crab/issues/18.
Accessed November 2019.
David Delmas and Jean Souyris. 2007. Astrée: From research to industry. In Static Analysis Symposium (SAS).
Kyle Dewey, Vineeth Kashyap, and Ben Hardekopf. 2015. A parallel abstract interpreter for JavaScript. In Proceedings of the
13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 34–45.
Jason Evans. 2019. Jemalloc. https://github.com/jemalloc/jemalloc.
Graeme Gange, Jorge A. Navas, Peter Schachte, Harald Søndergaard, and Peter J. Stuckey. 2016. An Abstract Domain of
Uninterpreted Functions. In 17th International Conference on Verification, Model Checking, and Abstract Interpretation
(VMCAI).
Diego Garbervetsky, Edgardo Zoppi, and Benjamin Livshits. 2017. Toward Full Elasticity in Distributed Static Analysis: The
Case of Callgraph Analysis. In Foundations of Software Engineering (FSE).
Michael R Garey and David S Johnson. 2002. Computers and intractability. Vol. 29. W. H. Freeman, New York.
Roberto Giacobazzi and Isabella Mastroeni. 2004. Abstract non-interference: Parameterizing non-interference by abstract
interpretation. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). ACM.
Gagandeep Singh, Markus Püschel, and Martin T. Vechev. 2018b. A practical construction for decomposing numerical
abstract domains. PACMPL 2, POPL (2018), 55:1–55:28.
Yu Su, Ding Ye, and Jingling Xue. 2014. Parallel pointer analysis with CFL-reachability. In 2014 43rd International Conference
on Parallel Processing (ICPP). IEEE, 451–460.
Edward Szpilrajn. 1930. Sur l’extension de l’ordre partiel. Fundamenta mathematicae 1, 16 (1930), 386–389.
Robert Tarjan. 1973. Testing flow graph reducibility. In Proceedings of the fifth annual ACM symposium on Theory of
computing. ACM, 96–107.
Robert Endre Tarjan. 1979. Applications of Path Compression on Balanced Trees. J. ACM 26, 4 (Oct. 1979), 690–715.
John W. Tukey. 1977. Exploratory data analysis. In Addison-Wesley series in behavioral science : quantitative methods.
Arnaud Venet. 2012. The Gauge Domain: Scalable Analysis of Linear Inequality Invariants. In Computer Aided Verification
(CAV). Springer, 139–154.
Arnaud Venet and Guillaume P. Brat. 2004. Precise and efficient static array bound checking for large embedded C programs.
In ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI). ACM.
Kai Wang, Aftab Hussain, Zhiqiang Zuo, Guoqing Xu, and Ardalan Amiri Sani. 2017. Graspan: A single-machine disk-based
graph system for interprocedural static analyses of large-scale systems code. In Architectural Support for Programming
Languages and Operating Systems (ASPLOS). ACM, 389–404.
Cathrin Weiss, Cindy Rubio-González, and Ben Liblit. 2015. Database-Backed Program Analysis for Scalable Error
Propagation. In 37th IEEE/ACM International Conference on Software Engineering (ICSE).
Reinhard Wilhelm, Mooly Sagiv, and Thomas Reps. 2000. Shape analysis. In Compiler Construction (CC). Springer, 1–17.