Sparse Flow-Sensitive Pointer Analysis for Multithreaded Programs

Yulei Sui, Peng Di, and Jingling Xue
UNSW Australia
[Figure 1 shows five small Pthreads fragments, each built from p = &x; q = &y; r = &z and stores/loads through p in a main thread and a forked thread: (a) Interleaving, pt(c) = {y, z}; (b) Soundness, pt(c) = {y, z}; (c) Precision, pt(c) = {y}; (d) Data-flow, pt(c) = {y}; (e) Sparsity, pt(c) = {y, z}.]
Figure 1: Examples for illustrating some challenges faced by flow-sensitive pointer analysis for multithreaded C programs (with irrelevant code elided). For brevity, fork() and join() represent pthread_create() and pthread_join() in the Pthreads API.
[Figure 2 depicts the FSAM pipeline: a pre-analysis first computes thread-oblivious def-use chains; an interleaving analysis (fork/join), a value-flow analysis (memory accesses), and a lock analysis (lock/unlock) then add thread-aware def-use chains before the sparse analysis runs.]
Figure 2: FSAM: a sparse flow-sensitive pointer analysis framework for multithreaded C programs.
with restricted parallelism patterns, unstructured and low-level constructs in the Pthreads API allow programmers to express richer parallelism patterns. However, such flexible non-lexically-scoped parallelism significantly complicates MHP analysis. For example, a thread may outlive its spawning thread, or be joined partially along some program paths, or indirectly in one of its child threads. In Figure 1(b), thread t2 executes independently of its spawning thread t1 and will stay alive even after t1 has been joined by the main thread. Thus, *p = r executed in the main thread may interleave with the two statements *p = q and c = *p in bar() executed by t2. A sound points-to set for c is pt(c) = {y, z}.

How to maintain precision can also be challenging. Synchronization statements (e.g., fork/join and lock/unlock) must be tracked carefully to reduce spurious interleavings among non-parallel statements. In Figure 1(c), *p = r, *p = q, and c = *p are always executed serially in that order. By performing a strong update at *p = q with respect to thread ordering, we can discover that c points to y stored in x by *p = q (not z stored in x at *p = r, since x has been strongly updated with &y, killing &z). Thus, pt(c) = {y}.

How do we scale flow-sensitive pointer analysis to large multithreaded C programs? One option is to adopt a data-flow analysis that iteratively propagates the points-to facts generated at a statement s to every other statement s' that is either reachable along the control flow or may happen in parallel with s, without knowing whether the facts are needed at s' or not. This traditional approach computes and maintains a separate points-to graph at each program point in order to accommodate the side effects of all parallel threads. Blindly propagating the points-to information this way under all thread interleavings is inefficient in both time and space. In Figure 1(d), c = *p in the main thread can interleave with *p = q and *x = r in thread t. However, propagating the points-to information generated at *x = r to c = *p is unnecessary, since *p and *x are not aliases. So pt(c) = {y}.

Finally, how do we improve scalability by propagating points-to facts sparsely, along only a set of pre-computed def-use chains? It turns out that this pre-computation is much more challenging in the multithreaded setting than in the sequential setting [10]. Imprecise handling of synchronization statements (e.g., fork/join and lock/unlock) may lead to spurious def-use chains, reducing both the scalability and precision of the subsequent sparse analysis. In Figure 1(e), pt(c) = {y, z}, if l1 and l2 are must aliases pointing to the same lock. However, if a pre-computed def-use edge is added from *p = u to c = *p, then following this spurious edge makes the analysis not only less efficient but also less precise by concluding that pt(c) = {y, z, v} is possible.

1.2 Our Solution

In this paper, we present FSAM, a new Flow-Sensitive pointer Analysis for handling large Multithreaded C programs (using Pthreads). We address the aforementioned challenges by performing sparse analysis along the def-use chains precomputed by a pre-analysis and a series of thread interference analysis phases, as illustrated in Figure 2. To bootstrap the sparse analysis, a pre-analysis (by applying Andersen's pointer analysis algorithm [2]) is first
performed flow- and context-insensitively to discover over-approximately the points-to information in the program.

Based on the pre-analysis, some thread-oblivious def-use edges are identified. Then thread interleavings are analyzed to discover all the missing thread-sensitive def-use edges. Our interleaving analysis reasons about fork and join operations flow- and context-sensitively to discover may-happen-in-parallel (MHP) statement pairs. Our value-flow analysis adds the thread-aware def-use edges for MHP statement pairs with common value flows to produce so-called aliased pairs. Our lock analysis analyzes lock/unlock operations flow- and context-sensitively to identify those interfering aliased pairs based on the happens-before relations established among their corresponding mutex regions.

Finally, a sparse flow-sensitive pointer analysis algorithm is applied by propagating the points-to facts sparsely along the pre-computed def-use chains, rather than along all program points with respect to the program's control flow.

This paper makes the following contributions:

[Figure 3:
(a) C code:       p = &a;  a = &b;  q = &c;  *p = *q;
(b) Partial SSA:  p = &a;  t1 = &b;  *p = t1;  q = &c;  t2 = *q;  *p = t2;]
Figure 3: A C code fragment and its partial SSA form.

or global variable with its address taken or a dynamically created abstract heap object (at, e.g., a malloc() site).

Figure 3 shows a code fragment and its corresponding partial SSA form, where p, q, t1, t2 ∈ T and a, b, c ∈ A. Note that a is indirectly accessed at the store *p = t1 by introducing a top-level pointer t1 in the partial SSA form. Complex statements like *p = *q are decomposed into basic ones by introducing a top-level pointer t2.
[Figure 4 shows a sparse def-use graph for the statements s1: *p = q; s2: v = *w; s3: *x = y; s4: s = *r, with pre-computed points-to sets pt(p) = {a, b}, pt(w) = {b}, pt(x) = {a}, and pt(r) = {a}: (a) the statements annotated with μ/χ functions, (b) μ/χ after SSA renaming, and (c) the resulting def-use graph.]
Figure 4: A sparse def-use graph.

[Figure 5 gives three rules for statically modeling fork and join operations:
[T-FORK]    t ⇒_(c,fk_i) t' and t' ⇒_(c',fk_i') t''  imply  t ⇒_(c,fk_i) t''
[T-JOIN]    t ⇐_(c,jn_i) t' and t' ⇐full_(c',jn_i') t''  imply  t ⇐_(c,jn_i) t''
[T-SIBLING] t'' ⇒_(c,fk_i) t and t'' ⇒_(c',fk_i') t', where i ≠ i' or c ≠ c',  imply  t ⋈ t']
Figure 5: Static modeling of fork and join operations.
a ∈ A may be pointed to by r to represent a potential use of a at the load. Similarly, a store, e.g., *x = y, is annotated with a function a = χ(a) to represent a potential def and use of a at the store. If a can be strongly updated, then a receives whatever y points to and the old contents in a are killed. Otherwise, a must also incorporate its old contents, resulting in a weak update to a. Third, each address-taken variable, e.g., a, is converted into SSA form (Figure 4(b)), with each μ(a) treated as a use of a, and each a = χ(a) as both a def and a use of a. Finally, an indirect def-use chain of a is added from a definition of a, identified as a_n (version n) at a store, to its uses at a store or a load, resulting in two indirect def-use edges of a, i.e., s1 ↪_a s3 and s3 ↪_a s4 (Figure 4(c)). Any φ function introduced for an address-taken variable a during the SSA conversion is ignored, as a is not versioned.

Every callsite is also annotated with μ and χ functions to expose its indirect uses and defs. As is standard, passing arguments into and returning results from functions are modeled by copies, so the def-use chains across procedural boundaries are added similarly. For details, we refer to [10].

Once the def-use chains are in place for the program, flow-sensitive pointer analysis can be performed sparsely, i.e., by propagating points-to information only along these pre-computed def-use edges. For example, the points-to sets of a computed at s1 are propagated to s3 with s2 bypassed, resulting in significant savings in both time and memory.

3. The FSAM Approach

We first describe a static thread model used for handling fork and join operations (Section 3.1). We then introduce our FSAM framework (Figure 2), focusing on how to pre-compute def-use chains (Sections 3.2 and 3.3), and discuss thereafter how to perform the subsequent sparse analysis for multithreaded C programs (Section 3.4).

3.1 Static Thread Model

Abstract Threads A program starts its execution from its main function in the main (root) thread. An abstract thread t refers to a call of pthread_create() at a context-sensitive fork site during the analysis. Thus, a thread t always refers to a context-sensitive fork site, i.e., a unique runtime thread, unless t is multi-forked, in which case t may represent more than one runtime thread.

Definition 1 (Multi-Forked Threads). A thread t ∈ M is a multi-forked thread if its fork site, say, fk_i, resides in a loop, a recursion cycle, or its spawner thread t' ∈ M.

Intra-Thread CFG For an abstract thread t, its intra-thread control flow graph, ICFG_t, is constructed as in [15], where a node s represents a program statement and an edge from s1 to s2 signifies a possible transfer of control from s1 to s2. For convenience, a call site is split into a call node and a return node. Three kinds of edges are distinguished: (1) an intra-procedural control flow edge s → s' from node s to its successor s', (2) an interprocedural call edge s →call_i s' from a call node s to the entry node s' of a callee at callsite i, and (3) an interprocedural return edge s →ret_i s' from an exit node s of a callee to the return node s' at callsite i. There are no outgoing edges for a fork or join site. Function pointers are resolved by the pre-analysis.

Modeling Thread Forks and Joins Figure 5 gives three rules for modeling fork and join operations statically. We write t ⇒_(c,fk_i) t' to represent the spawning relation that a spawner thread t creates a spawnee thread t' at a context-sensitive fork site (c, fk_i), where c is a context stack represented by a sequence of callsites, [cs_0, ..., cs_n], from the entry of the main function to the fork site fk_i. Note that the callsites inside each strongly-connected cycle in the call graph of the program are analyzed context-insensitively.

For a thread t forked at (c, fk_i), we write S_t to stand for its start procedure, where the execution of t begins. Entry(S_t) = (c', s) maps S_t to its first statement (c', s), where c' = c.push(i), context-sensitively.

Consider the three rules in Figure 5. The spawning relation t ⇒_(c,fk_i) t' is transitive, representing the fact that t can create t' directly or indirectly at a fork site fk_i ([T-FORK]).

We handle only the join operations identified by [T-JOIN] and ignore the rest in the program. The joining relation t ⇐_(c,jn_i) t' indicates that a spawnee t' is joined by its spawner t at a join site (c, jn_i). As our pre-analysis is
[Figure 6 shows a program P — void main() { s1: *p = ...; fk1: fork(t1, foo); s2: *p = ...; } void foo() { ... } — together with its def-use edges on the object o: (a) Program P, (b) def-use for Pseq, (c) fork-related def-use, (d) join-related def-use.]
Figure 6: Thread-oblivious def-use edges (where p and q are found to point to o during the pre-analysis).
flow- and context-insensitive, we achieve soundness by requiring t' joined at a join site pthread_join() in the program to be excluded from M, so that t' represents a unique runtime thread (under all contexts). Note that the joining relation is not transitive in the same sense as the spawning relation. In Pthreads programs, a thread can be joined fully along all program paths or partially along some but not all paths. Given t ⇐_(c,jn_i) t' and t' ⇐_(c',jn_i') t'', t ⇐_(c,jn_i) t'' holds when t' ⇐_(c',jn_i') t'' is a full join, denoted t' ⇐full_(c',jn_i') t''.

If neither t ⇒_(c,fk_i) t' nor t' ⇒_(c',fk_i') t holds, then t and t' are siblings, denoted t ⋈ t' ([T-SIBLING]). In this case, t and t', where t ≠ t', share a common ancestor thread t''. Furthermore, t and t' do not happen in parallel if one happens before the other (as defined below).

Definition 2 (Happens-Before (HB) Relation for Sibling Threads). Given two sibling threads t and t', t happens before t', denoted t > t', if the fork site of t' is backward reachable to a join site of t along every program path.

Presently, FSAM does not model other synchronization constructs such as barriers and signal/wait, resulting in sound, i.e., over-approximate, results.

3.2 Computing Thread-Oblivious Def-Use Chains

Given a multithreaded C program P, we transform P into a sequential version Pseq, representing one possible thread interleaving of P. We then derive the def-use chains from Pseq as the thread-oblivious def-use chains for P, based on the points-to information obtained during the pre-analysis. There are three steps, illustrated in Figure 6.

In Step 1, we transform P into Pseq by replacing every fork statement fk in P by calls to the start procedures of all the threads spawned at fk. Let S_fk be the set of these start procedures. We keep the join operations that we can handle by [T-JOIN] and ignore the rest. We then follow [10] to compute the def-use chains for Pseq, as discussed in Section 2.2. Given P in Figure 6(a), where S_fk1 = {foo}, we obtain its Pseq by replacing fork(t, foo) with a call to foo(). The def-use chains for Pseq are given in Figure 6(b).

In Step 2, we add the missing fork-related def-use edges at a fork site fk by assuming S_fk = ∅, since every start procedure, say, fun, in S_fk may be executed nondeterministically later. This means that any value flow added in Step 1 that goes through fun can also bypass it altogether. For every such def-use chain that starts at a statement s before the callsite for fun and ends at a statement s' after it, we add a def-use edge from s to s'. (Technically, a weak update is performed for every a = χ(a) function associated with the callsite for fun, so that the old contents of a prior to the call are preserved.) In our example, Figure 6(b) becomes Figure 6(c) with the fork-related def-use edge s1 ↪_o s2 being added.

In Step 3, we deal with every direct join operation handled by our static thread model ([T-JOIN]). Let join(t') be a candidate join site executed in the spawner thread t, which implies, as discussed in Section 3.1, that t' is a unique runtime thread to be joined. Let fun' be the start procedure of t'. In one possible thread interleaving, this join statement plays a similar role to an exception-catching statement for an exception thrown at the end of fun'. Given this implicit control flow, we need to make the modification side effects of fun' visible at the join site. Let a ∈ A be an address-taken variable defined at the exit of fun'. For the first use of a reachable from the join site along every program path in ICFG_t, we add a def-use edge from that definition to the use. In our example, Figure 6(c) becomes Figure 6(d) with the join-related def-use edge s4 ↪_o s3 added.

3.3 Computing Thread-Aware Def-Use Chains

For a program P, we must also discover the def-use chains formed by all the other thread interleavings besides Pseq. Such def-use chains are thread-aware and are computed with the three thread interference analyses incorporated in FSAM.

3.3.1 Interleaving Analysis

As shown in Figure 2, FSAM invokes this as the first of the three interference analyses to compute thread-aware def-use chains. The objective here is to reason about fork and join operations to identify all MHP statements in the program.

Our interleaving analysis operates flow- and context-sensitively on the ICFGs of all the threads (but uses points-to information from the pre-analysis). For a statement s in thread t's ICFG_t, our analysis approximates which threads may run in parallel with t when s is executed, denoted as
[Figure 7 formulates the interleaving analysis as data-flow rules; those recovered here are:
[I-DESCENDANT]: given t ⇒_(c,fk_i) t', the edge (t, c, fk_i) → (t, c, s), and (c', s') = Entry(S_t'), conclude {t'} ⊆ I(t, c, s) and {t} ⊆ I(t', c', s').
[I-INTRA]: given (t, c, s) → (t, c, s'), conclude I(t, c, s) ⊆ I(t, c, s').
[I-RET]: given (t, c, s) →ret_i (t, c', s') with i = c.peek() and c' = c.pop(), conclude I(t, c, s) ⊆ I(t, c', s').
The remaining rules ([I-SIBLING], [I-JOIN], and [I-CALL], discussed below) are elided here.]
Figure 7: Interleaving analysis (where → denotes a control flow edge in a thread's ICFG introduced in Section 3.1).
[Figure 8(a) gives the program:
main() { s1; fk1: fork(t1, foo1); s2; jn1: join(t1); fk2: fork(t2, foo2); s3; jn2: join(t2); }
foo1() { fk3: fork(t3, bar); jn3: join(t3); }
foo2() { cs4: bar(); s4; }
bar() { s5; }
(b) Thread relations. Fork: t0 ⇒_([],fk1) t1, t1 ⇒_([1],fk3) t3, t0 ⇒_([],fk1) t3, t0 ⇒_([],fk2) t2. Join: t0 ⇐_([],jn1) t1, t1 ⇐_([1],jn3) t3, t0 ⇐_([],jn1) t3, t0 ⇐_([],jn2) t2. Sibling: t1 ⋈ t2, t3 ⋈ t2. HB: t1 > t2, t3 > t2.
(c) Thread interleavings: I(t0, [], s1) = ∅; I(t0, [], s2) = {t1, t3}; I(t0, [], s3) = {t2}; I(t2, [2], s4) = {t0}; I(t3, [1,3], s5) = {t0, t1}; I(t2, [2,4], s5) = {t0}.
(d) MHP pairs: (t0, [], s2) ∥ (t3, [1,3], s5); (t0, [], s3) ∥ (t2, [2,4], s5); (t0, [], s3) ∥ (t2, [2], s4).]
Figure 8: An illustrating example for interleaving analysis (with t0 denoting the main thread).
I(t, c, s), where c is a calling context capturing one instance of s when its enclosing method is invoked under c. For example, if I(t1, c, s) = {t2, t3}, then threads t2 and t3 may be alive when s is executed under context c in t1.

Statement s1 in thread t1 may happen in parallel with statement s2 in thread t2, denoted (t1, c1, s1) ∥ (t2, c2, s2), if the following holds (with M from Definition 1): t2 ∈ I(t1, c1, s1) ∧ t1 ∈ I(t2, c2, s2) if t1 ≠ t2, and t1 ∈ M otherwise.

Given ⇒ (spawning relation), ⇐ (joining relation), ⋈ (thread sibling), and > (HB from Definition 2), our interleaving analysis is formulated as a forward data-flow problem (V, ⊓, F) (Figure 7). Here, V represents the set of all thread interleaving facts, ⊓ is the meet operator (∪), and F : V → V represents the set of transfer functions associated with each node in an ICFG.

[I-DESCENDANT] handles thread creation t ⇒_(c,fk_i) t' at a fork site (c, fk_i). The statement (c, s) that appears immediately after (c, fk_i) in ICFG_t may happen in parallel with the entry statement (c', s') of the start procedure of thread t'. Given two sibling threads t and t', the entry statements (c, s) and (c', s') of their start procedures may interleave with each other if neither t > t' nor t' > t ([I-SIBLING]). [I-JOIN] represents the fact that a descendant thread will no longer be alive after it has been joined at a join site. For a thread t, [I-CALL] and [I-RET] ([I-INTRA]) propagate data-flow facts interprocedurally by matching calls and returns context-sensitively (intraprocedurally).

Example 1. We illustrate our interleaving analysis with the program in Figure 8. As shown in Figure 8(a), the main thread t0 creates two threads, t1 and t2, at fork sites fk1 and fk2, respectively. In its start procedure foo1, t1 spawns another thread t3 and fully joins it later at jn3. Figure 8(b) shows all the thread relations. Note that t2 continues to execute after its two sibling threads t1 and t3 have terminated due to jn1, which joins t1 directly and t3 indirectly.

The results of applying the rules in Figure 7 are listed in Figure 8(c). Due to context-sensitivity, our analysis has identified precisely the three MHP relations given in Figure 8(d). As bar() is called under two contexts, s5 has two different instances, (t3, [1,3], s5) and (t2, [2,4], s5). The former
[Figure 9(a) gives the program:
main() { ...; cs1: bar(); fk2: fork(t1, foo1); fk3: fork(t2, foo2); }
foo1() { s1: *p = ...; lock(l1); s2: *p = ...; s3: *p = ...; unlock(l1); }
foo2() { lock(l2); cs4: bar(); unlock(l2); }
bar() { s4: ... = *q; }
// p and q point to the same object o; l1 and l2 point to the same lock
Pre-computed aliases: AS(*p, *q) = {o}. Lock spans: (t1, [2], s2) ∈ sp_l1, (t1, [2], s3) ∈ sp_l1, (t2, [3,4], s4) ∈ sp_l2. MHP relations: (t1, [2], s1) ∥ (t2, [3,4], s4), (t1, [2], s2) ∥ (t2, [3,4], s4), (t1, [2], s3) ∥ (t2, [3,4], s4). Inter-thread value-flows: s1 ↪_o s4, s2 ↪_o s4 (crossed out), s3 ↪_o s4.]
Figure 9: A lock analysis example (with irrelevant code elided), avoiding s2 ↪_o s4 that would be added by [THREAD-VF].
one may happen in parallel with (t0, [], s2) and the latter with (t0, [], s3). As our analysis is context-sensitive, (t0, [], s3) ∥ (t2, [2], s4) but (t0, [], s2) ∦ (t2, [2], s4).

3.3.2 Value-Flow Analysis

Given a pair of MHP statements, we make use of the points-to information discovered during the pre-analysis to add the potential (thread-aware) def-use edges in between. In partial SSA form, the top-level pointers in T are kept in registers and are thus thread-local. However, the address-taken variables in A can be accessed by concurrent threads via loads and stores. It is only necessary to consider inter-thread value-flows for MHP store-load and store-store pairs [(t, c, s), (t', c', s')], where s is a store *p = ... and s' is a load ... = *q or a store *q = .... Hence, [THREAD-VF] comes into play, where AS(*p, *q) is the set of objects in V pointed to by both p and q (due to the pre-analysis):

[THREAD-VF]: given s : *p = ..., s' : ... = *q or *q = ..., (t, c, s) ∥ (t', c', s'), and o ∈ AS(*p, *q), add the def-use edge s ↪_o s'.

Example 2. For the program in Figure 6(a), we apply [THREAD-VF] to add all the missing thread-aware def-use chains on top of Figure 6(d). According to the pre-analysis, AS(*p, *q) = {o}. As (t0, [], s2) ∥ (t1, [1], s4), s2 ↪_o s4 is added. As (t0, [], s2) ∥ (t1, [1], s5), s2 ↪_o s5 is added. While (t1, [1], s4) ∥ (t0, [], s2), s4 ↪_o s2 has already been added as a thread-oblivious def-use edge (Section 3.2).

3.3.3 Lock Analysis

Statements from different mutex regions are interference-free if these regions are protected by a common lock. By capturing lock correlations, we can avoid some spurious def-use edges introduced by [THREAD-VF] in the two lock-release spans defined below. We do this by performing a flow- and context-sensitive analysis for lock/unlock operations (based on the points-to information from the pre-analysis).

Definition 3 (Lock-Release Spans). A lock-release span sp_l at a context-sensitive lock site (t, c, lock(l)) consists of the statements starting from (c, lock(l)) to the corresponding release site (c', unlock(l')) in ICFG_t, obtained with a forward reachability analysis with calls and returns being matched context-sensitively, where l and l' point to the same singleton (i.e., runtime) lock object, denoted l ≡ l'.

Just as in the case of MHP analysis for fork/join operations, context-sensitivity ensures that lock analysis can distinguish the different calling contexts under which a statement appears inside a lock-release span. In Figure 9, bar() is called twice, but only the instance of statement (t2, [3,4], s4) called from cs4 is inside the lock-release span sp_l2.

Definition 4 (Span Head). For an object o ∈ A, HD(sp_l, o) represents the set of context-sensitive loads or stores that may access o at the head of the span sp_l: HD(sp_l, o) = {(t, c, s) ∈ sp_l | ∄ (t', c', s') ∈ sp_l : s' ↪_o s}.

Definition 5 (Span Tail). For an object o ∈ A, TL(sp_l, o) represents the set of context-sensitive stores that may access o at the tail of the span sp_l: TL(sp_l, o) = {(t, c, s) ∈ sp_l | s is a store, ∄ (t', c', s') ∈ sp_l : (s' is a store ∧ s ↪_o s')}.

Definition 6 (Non-Interference Lock Pairs). Let (t, c, s) ∥ (t', c', s') be an MHP statement pair, where s is a store, such that both statements are protected by at least one common lock, i.e., ∃ l, l' : (t, c, s) ∈ sp_l ∧ (t', c', s') ∈ sp_l' ∧ l ≡ l'. We say that the pair is a non-interference lock pair if (t, c, s) ∉ TL(sp_l, o) ∨ (t', c', s') ∉ HD(sp_l', o) holds.

By refining [THREAD-VF] with Definition 6 taken into account, some spurious value-flows are filtered out.

Example 3. In Figure 9, the two lock-release spans sp_l1 and sp_l2 are protected by a common lock, since *l1 and *l2 are found to be must aliases. By applying [THREAD-VF] alone, all three def-use edges shown in Figure 9 would be added. By Definition 6, however, s2 inside sp_l1 cannot interleave with s4 inside sp_l2. So s2 ↪_o s4 is spurious and can be ignored.

3.4 Sparse Analysis

Once all the def-use chains have been built, the sparse flow-sensitive pointer analysis algorithm developed for sequential C programs [10], given in Figure 10, can be reused in the
[Figure 10 gives the rules of the sparse flow-sensitive pointer analysis, where pt(s, v) denotes the points-to set of v immediately after statement s:
[P-ADDR]: s : p = &o implies {o} ⊆ pt(s, p).
[P-COPY]: s : p = q with s1 ↪_q s implies pt(s1, q) ⊆ pt(s, p).
[P-PHI]: s : p = φ(q, r) with s1 ↪_q s and s2 ↪_r s implies pt(s1, q) ⊆ pt(s, p) and pt(s2, r) ⊆ pt(s, p).
[P-LOAD]: s : p = *q with s2 ↪_q s, o ∈ pt(s2, q), and s1 ↪_o s implies pt(s1, o) ⊆ pt(s, p).
[P-STORE]: s : *p = q with s2 ↪_p s, o ∈ pt(s2, p), and s1 ↪_q s implies pt(s1, q) ⊆ pt(s, o).
[P-SU/WU]: s : *p = ... with s1 ↪_o s and o ∈ A \ kill(s, p) implies pt(s1, o) ⊆ pt(s, o), where kill(s, p) = {o'} if pt(s, p) = {o'} ∧ o' ∈ singletons; A if pt(s, p) = ∅; and ∅ otherwise.]
Figure 10: Sparse flow-sensitive pointer analysis.

Table 1: Program statistics.

Benchmark    | Description                        | LOC
word_count   | Word counter based on map-reduce   | 6330
kmeans       | Iterative clustering of 3-D points | 6008
radiosity    | Graphics                           | 12781
automount    | Manage autofs mount points         | 13170
ferret       | Content similarity search server   | 15735
bodytrack    | Body tracking of a person          | 19063
httpd_server | Http server                        | 52616
mt_daapd     | Multi-threaded DAAP daemon         | 57102
raytrace     | Real-time raytracing               | 84373
x264         | Media processing                   | 113481
Total        |                                    | 380659

multithreaded setting. For a variable v, pt(s, v) denotes its points-to set computed immediately after statement s. The first five rules deal with the five types of statements introduced in Section 2.1 by following the pre-computed def-use chains ↪. The last rule enables a strong or weak update at a store, whichever is appropriate, where singletons [17] is the set of objects in A representing unique locations, obtained by excluding heap objects, arrays, and local variables in recursion.

FSAM is sound since (1) its pre-analysis is sound, (2) the def-use chains constructed for the program (as described in Sections 3.2 and 3.3) are over-approximate, and (3) the sparse analysis given in Figure 10 is as precise as the traditional iterative data-flow analysis [10].

4. Evaluation

The objective is to show that our sparse flow-sensitive pointer analysis, FSAM, is significantly faster than the traditional data-flow-based flow-sensitive pointer analysis, denoted NonSparse, while consuming less memory, in analyzing large multithreaded C programs using Pthreads.

4.1 Experimental Setup

We have selected a set of 10 multithreaded C programs, including the two largest (word_count and kmeans) from Phoenix-2.0, the five largest (radiosity, ferret, bodytrack, raytrace and x264) from Parsec-3.0, and three open-source applications (automount, mt_daapd and httpd-server), as shown in Table 1. All our experiments were conducted on a platform consisting of a 2.70GHz Intel Xeon Quad Core CPU with 64 GB memory, running Ubuntu Linux (kernel version 3.11.0).

The source code of each program is compiled into bit code files using clang and then merged together using the LLVM Gold Plugin at link time (LTO) to produce a whole-program bc file. In addition, the compiler option mem2reg is turned on to promote memory into registers.

4.2 Implementation

We have implemented FSAM in LLVM (version 3.5.0). Andersen's analysis (using the constraint resolution techniques from [23]) is used to perform its pre-analysis indicated in Figure 2. In order to distinguish the concrete runtime threads represented by an abstract multi-forked thread (Definition 1) inside a loop, we use LLVM's SCEV alias analysis to correlate a fork-join pair. Figure 11 shows a code snippet from word_count, where a fixed number of threads are forked and joined in two "symmetric" loops. FSAM can recognize that any statement in a slave thread (with its start routine wordcount_map) does not happen in parallel with the statements after its join executed in the main thread.

// word_count-pthread.c
140 for (i = 0; i < num_procs; i++) {
166     pthread_create(&tid[i], &attr, wordcount_map, (void*)out) != 0);
167 }
170 for (i = 0; i < num_procs; i++) {
173     pthread_join(tid[i], (void **)(void*)&ret_val) != 0);
175 }
...
Figure 11: A multi-forked example in word_count.

FSAM is field-sensitive. Each field of a struct is treated as a separate object, but arrays are considered monolithic. Positive weight cycles (PWCs) that arise from processing fields are detected and collapsed [22]. The call graph of a program is constructed on-the-fly. Distinct allocation sites are modeled by distinct abstract objects [10, 32].
4.3 Methodology

We are not aware of any flow-sensitive pointer analysis for multithreaded C programs with Pthreads in the literature or any publicly available implementation. RR [25] is closest; it performs an iterative flow-sensitive data-flow-based pointer analysis on structured parallel code regions in Cilk programs. However, C programs with Pthreads are unstructured, requiring MHP analysis to discover their parallel code regions. PCG [14] is a recent MHP analysis for Pthreads that distinguishes whether two procedures may execute concurrently. We have implemented RR also in LLVM (3.5.0) for multithreaded C programs with their parallel regions discovered by PCG, denoted NonSparse, as the baseline.

To understand FSAM better, we also analyze the impact of each of its phases on the performance of sparse flow-sensitive points-to resolution. To do this, we measure the slowdown of FSAM with each phase turned off individually: (1) No-Interleaving: with our interleaving analysis turned off and the results from PCG used instead, (2) No-Value-Flow: with our value-flow analysis turned off (i.e., o ∈ AS(*p, *q) in [THREAD-VF] disregarded), and (3) No-Lock: with our lock analysis turned off.

Note that some spurious def-use edges may be avoided by more than one phase. Despite this, these three configurations allow us to measure their relative performance impact.

4.4 Results and Analysis

Table 2 gives the analysis times and memory usage of FSAM against NonSparse. FSAM spends less than 22 minutes altogether in analyzing all 10 programs (totaling 380 KLOC). For the two largest programs, raytrace and x264, FSAM spends just under 5 and 9 minutes, respectively, while NonSparse fails to finish analyzing either within two hours. For the remaining 8 programs analyzable by both, FSAM is 12x faster and uses 28x less memory than NonSparse, on average. For the two programs with over 50 KLOC, FSAM is 11x faster and uses 117x less memory for httpd_server, and 29x faster and uses 89x less memory for mt_daapd.

Table 2: Analysis time and memory usage.

Program      | Time (secs) FSAM | Time (secs) NonSparse | Memory (MB) FSAM | Memory (MB) NonSparse
word_count   | 3.04             | 17.40                 | 13.79            | 53.76
kmeans       | 2.50             | 18.19                 | 18.27            | 53.19
radiosity    | 6.77             | 29.29                 | 38.65            | 95.00
automount    | 8.66             | 83.82                 | 27.56            | 364.67
ferret       | 13.49            | 87.10                 | 52.14            | 934.57
bodytrack    | 128.80           | 2809.89               | 313.66           | 12410.16
httpd_server | 191.22           | 2079.43               | 55.78            | 6578.46
mt_daapd     | 90.67            | 2667.55               | 37.92            | 3403.26
raytrace     | 284.61           | OOT                   | 135.06           | OOT
x264         | 531.55           | OOT                   | 129.58           | OOT

For small programs, such as word_count and kmeans, FSAM yields little performance benefit over NonSparse due to the relatively few statements and simple thread synchronizations used. For larger ones, which contain more pointers,

[...] analysis is more beneficial than the other two in reducing spurious def-use edges passed to the final sparse analysis.

Interleaving analysis is very useful for kmeans, httpd-server and mt_daapd in avoiding spurious MHP pairs. These programs adopt the master-slave pattern, so the slave threads perform their tasks in their start procedures while the master thread handles some post-processing task after having joined all the slave threads. Precise handling of join operations is critical in avoiding spurious MHP relations between the statements in the slave threads and those after their join sites in the master thread.

Value-flow analysis is effective in reducing redundant def-use edges among concurrent threads in most of the programs evaluated. For automount, ferret and mt_daapd, value-flow analysis has avoided adding over 80% of the (spurious) def-use edges. In these programs, the concurrent threads frequently manipulate not only global variables but also their local variables. Thus, value-flow analysis can prevent the subsequent sparse analysis from blindly propagating a lot of points-to information for non-shared memory locations.

Lock analysis is beneficial for programs such as automount and radiosity that make extensive use of locks (with hundreds of lock-release spans) to protect their critical code sections. In these programs, some lock-release spans can cover many statements accessing the globally shared objects. Figure 13 gives a pair of lock-release spans with a common lock accessing the shared global task queue in two threads. The spurious def-use chains
loads/stores and complex thread synchronization primitives, from the write at line 457 in dequeue task to all the
FSAM has a more distinct advantage, with the best speedup statements accessing the shared task queue object in
39x observed at bodytrack and the best memory usage re- enqueue task are avoided by our analysis.
duction at httpd server. FSAM has achieved these bet-
ter results by propagating and maintaining significantly less 5. Related Work
points-to information than N ON S PARSE.
We discuss the related work on sparse flow-sensitive pointer
Figure 12 shows the relative impact of each of FSAM’s
analysis and pointer analysis for multithreaded programs.
three thread interference analysis phases on its analysis ef-
ficiency for the three configurations defined in Section 4.3. Sparse Flow-Sensitive Pointer Analysis Sparse analysis,
The performance impact of each phase varies considerably a recent improvement over the classic iterative data-flow
across the programs evaluated. On average, value-flow anal- approach, can achieve flow-sensitivity more efficiently by
Figure 12: Impact of FSAM's three thread interference analysis phases on its analysis efficiency. [Bar chart: slowdown over FSAM (y-axis, 0x to 8x) under No-Interleaving, No-Value-Flow and No-Lock for each of the ten programs, with off-scale bars labeled 19.7x, 13.9x and 18.8x.]
References

[1] S. Agarwal, R. Barik, V. Sarkar, and R. K. Shyamasundar. May-happen-in-parallel analysis of X10 programs. In PPoPP '07, pages 183–193.
[2] L. Andersen. Program analysis and specialization for the C programming language. PhD thesis, 1994.
[3] R. Barik. Efficient computation of may-happen-in-parallel information for concurrent Java programs. In LCPC '05, pages 152–169.
[4] J.-D. Choi, R. Cytron, and J. Ferrante. Automatic construction of sparse data flow evaluation graphs. In POPL '91, pages 55–66.
[5] J.-D. Choi, R. Cytron, and J. Ferrante. On the efficient engineering of ambitious program analysis. IEEE Transactions on Software Engineering, 20(2):105–114, 1994.
[6] F. Chow, S. Chan, S. Liu, R. Lo, and M. Streich. Effective representation of aliases and indirect memory operations in SSA form. In CC '96, pages 253–267.
[7] P. Di, Y. Sui, D. Ye, and J. Xue. Region-based may-happen-in-parallel analysis for C programs. In ICPP '15, pages 889–898.
[8] I. Evans, F. Long, U. Otgonbaatar, H. Shrobe, M. Rinard, H. Okhravi, and S. Sidiroglou-Douskos. Control Jujutsu: On the weaknesses of fine-grained control flow integrity. In CCS '15, pages 901–913.
[9] S. J. Fink, E. Yahav, N. Dor, G. Ramalingam, and E. Geay. Effective typestate verification in the presence of aliasing. ACM Transactions on Software Engineering and Methodology, 17(2):1–34, 2008.
[10] B. Hardekopf and C. Lin. Flow-sensitive pointer analysis for millions of lines of code. In CGO '11, pages 289–298.
[11] B. Hardekopf and C. Lin. Semi-sparse flow-sensitive pointer analysis. In POPL '09, pages 226–238.
[12] M. Hind, M. Burke, P. Carini, and J.-D. Choi. Interprocedural pointer alias analysis. ACM Transactions on Programming Languages and Systems, 21(4):848–894, 1999.
[13] M. Hind and A. Pioli. Assessing the effects of flow-sensitivity on pointer alias analyses. In SAS '98, pages 57–81.
[14] P. G. Joisha, R. S. Schreiber, P. Banerjee, H. J. Boehm, and D. R. Chakrabarti. A technique for the effective and automatic reuse of classical compiler optimizations on multithreaded code. In POPL '11, pages 623–636.
[15] W. Landi and B. Ryder. A safe approximate algorithm for interprocedural aliasing. In PLDI '92, pages 235–248.
[16] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In CGO '04, pages 75–86.
[17] O. Lhoták and K.-C. A. Chung. Points-to analysis with efficient strong updates. In POPL '11, pages 3–16.
[18] L. Li, C. Cifuentes, and N. Keynes. Boosting the performance of flow-sensitive points-to analysis using value flow. In FSE '11, pages 343–353.
[19] Y. Li, T. Tan, Y. Sui, and J. Xue. Self-inferencing reflection resolution for Java. In ECOOP '14, pages 27–53.
[20] S. Nagarakatte, M. M. K. Martin, and S. Zdancewic. Everything you want to know about pointer-based checking. In SNAPL '15, pages 190–208.
[21] H. Oh, K. Heo, W. Lee, W. Lee, and K. Yi. Design and implementation of sparse global analyses for C-like languages. In PLDI '12, pages 229–238.
[22] D. Pearce, P. Kelly, and C. Hankin. Efficient field-sensitive pointer analysis of C. ACM Transactions on Programming Languages and Systems, 30(1), 2007.
[23] F. Pereira and D. Berlin. Wave propagation and deep propagation for pointer analysis. In CGO '09, pages 126–135.
[24] P. Pratikakis, J. S. Foster, and M. W. Hicks. LOCKSMITH: context-sensitive correlation analysis for race detection. In PLDI '06, pages 320–331.
[25] R. Rugina and M. Rinard. Pointer analysis for multithreaded programs. In PLDI '99, pages 77–90.
[26] A. Salcianu and M. Rinard. Pointer and escape analysis for multithreaded programs. In PPoPP '01, pages 12–23.
[27] Y. Smaragdakis, M. Bravenboer, and O. Lhoták. Pick your contexts well: understanding object-sensitivity. In POPL '11, pages 17–30.
[28] Y. Sui, D. Ye, and J. Xue. Static memory leak detection using full-sparse value-flow analysis. In ISSTA '12, pages 254–264.
[29] Y. Sui, S. Ye, J. Xue, and P.-C. Yew. SPAS: Scalable path-sensitive pointer analysis on full-sparse SSA. In APLAS '11, pages 155–171.
[30] Y. Wang, T. Kelly, M. Kudlur, S. Lafortune, and S. A. Mahlke. Gadara: Dynamic deadlock avoidance for multithreaded programs. In OSDI '08, pages 281–294.
[31] J. Whaley and M. S. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In PLDI '04, pages 131–144.
[32] S. Ye, Y. Sui, and J. Xue. Region-based selective flow-sensitive pointer analysis. In SAS '14, pages 319–336.
[33] H. Yu, J. Xue, W. Huo, X. Feng, and Z. Zhang. Level by level: making flow- and context-sensitive pointer analysis scalable for millions of lines of code. In CGO '10, pages 218–229.

A. Artifact Description

Summary: The artifact includes the full implementation of the FSAM and NONSPARSE analyses, together with the benchmarks and scripts needed to reproduce the data in this paper.

Description: You may find the artifact package and all the instructions on how to use FSAM via the following link: http://www.cse.unsw.edu.au/~corg/fsam

A brief checklist is as follows:

• index.html: the detailed instructions for reproducing the experimental results in the paper.
• FSAM.ova: a virtual image file (4.6GB) containing an installed Ubuntu OS and the FSAM project.
• Full source code of FSAM, developed on top of the SVF framework (http://unsw-corg.github.io/SVF).
• Scripts used to reproduce the data in the paper, including ./table2.sh and ./figure12.sh.
• Micro-benchmarks to validate pointer analysis results.

Platform: All the results related to analysis times and memory usage in our paper were obtained on a 2.70GHz Intel Xeon Quad Core CPU running Ubuntu Linux with 64GB of memory. For the VM image, we recommend allocating at least 16GB of memory to the virtual machine. The OS in the virtual machine image is Ubuntu 12.04. VirtualBox 4.1.12 or newer is required to run the image.

License: LLVM Release License (The University of Illinois/NCSA Open Source License (NCSA))