
Efficient Field-Sensitive Pointer Analysis for C

David J. Pearce Paul H. J. Kelly Chris Hankin


Department of Computing, Imperial College, London, SW7 2BZ, UK

ABSTRACT
The subject of this paper is flow- and context-insensitive pointer analysis. We present a novel approach for precisely modelling struct variables and indirect function calls. Our method emphasises efficiency and simplicity and extends the language of set-constraints. We experimentally evaluate the precision/cost trade-off using a benchmark suite of 7 common C programs between 5,000 and 150,000 lines of code. Our results indicate the field-sensitive analysis is more expensive to compute, but yields significantly better precision.

Categories and Subject Descriptors
F.3.2 [Semantics of Programming Languages]: Program Analysis

General Terms
Algorithms, Theory, Languages, Verification

Keywords
Set-Constraints, Pointer Analysis

1. INTRODUCTION
Pointer analysis is the problem of statically determining the runtime targets of pointer variables in a program. We say that a solution is sound if the inferred target set for each variable contains all actual runtime targets for that variable. A solution is imprecise if, for any variable, the inferred target set is larger than necessary. Thus, the most imprecise but sound solution has each variable pointing to every other. Obtaining a perfect (i.e. flow- and context-sensitive) solution, however, is undecidable in general [14] and, in practice, obtaining even relatively imprecise information (i.e. flow- and context-insensitive) is expensive [13]. The main contributions of this paper are:

1. A small extension to the language of set-constraints, which elegantly formalises a field-sensitive pointer analysis for the C language. As a byproduct, function pointers are supported for free with this mechanism.

2. The largest experimental investigation to date into the trade-offs in time and precision of field-insensitive and -sensitive analyses for C.

In some sense, our formulation can be regarded as an instance of the general framework for field-sensitive pointer analysis by Yong et al. [28]. In particular, we argue it is equivalent to the most precise, yet portable analysis their system can describe. However, this work goes beyond their initial treatment by considering efficient implementation and some important algorithmic issues which were not addressed.

PASTE'04, June 7–8, 2004, Washington, DC, USA.
Copyright 2004 ACM 1-58113-910-1/04/0006...$5.00.

2. CONSTRAINT-BASED ANALYSIS
Set-constraints [1] were first used by Andersen for performing pointer analysis [2]. Since then, this approach has become popular and, recently, was shown capable of analysing million-line programs [10, 15]. We use the following set-constraint language to formulate our pointer analysis:

    p ⊇ q | p ⊇ {q} | p ⊇ ∗q | ∗p ⊇ q | ∗p ⊇ {q}

where p and q are constraint variables and ∗ is the dereference operator. We can think of each variable as containing the set it points to. Thus, p ⊇ {x} states that p points to x. Note, constraints involving "∗" are called complex. To perform the analysis we first translate the source program into this language, by mapping each variable to a unique constraint variable and converting assignments to constraints. Consider the following, where the solution (shown below the line) is obtained using the rules of Figure 1:

    int *f(int *p){ return p; }          (1) f∗ ⊇ fp
    int g() { int x,y,*p,*q,**r,**s;
      s = &p;                            (2) gs ⊇ {gp}
      if(...) p = &x;                    (3) gp ⊇ {gx}
      else p = &y;                       (4) gp ⊇ {gy}
      r = s;                             (5) gr ⊇ gs
      q = f(*r);                         (6) fp ⊇ ∗gr
    }                                    (7) gq ⊇ f∗

    (8)  gr ⊇ {gp}   (trans, 2+5)
    (9)  fp ⊇ gp     (deref2, 6+8)
    (10) fp ⊇ {gx}   (trans, 3+9)
    (11) fp ⊇ {gy}   (trans, 4+9)
    (12) f∗ ⊇ {gx}   (trans, 1+10)
    (13) f∗ ⊇ {gy}   (trans, 1+11)
    (14) gq ⊇ {gx}   (trans, 7+12)
    (15) gq ⊇ {gy}   (trans, 7+13)

Variable names are augmented with scope information to ensure uniqueness. Also, f∗ represents the return value of f.

    [trans]    τ1 ⊇ {τ2}    τ3 ⊇ τ1      ⟹  τ3 ⊇ {τ2}
    [deref1]   τ1 ⊇ ∗τ2     τ2 ⊇ {τ3}    ⟹  τ1 ⊇ τ3
    [deref2]   ∗τ1 ⊇ τ2     τ1 ⊇ {τ3}    ⟹  τ3 ⊇ τ2
    [deref3]   ∗τ1 ⊇ {τ2}   τ1 ⊇ {τ3}    ⟹  τ3 ⊇ {τ2}

Figure 1: An inference system for pointer analysis
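
For illustration only, one plausible C encoding of these constraint forms, assuming each constraint variable has already been mapped to a unique integer index, might look as follows (the type and field names here are illustrative, not part of our implementation):

    /* Illustrative encoding: constraint variables are small integers and
     * each constraint is a tagged record mirroring the five forms of the
     * set-constraint language above. */
    enum constraint_kind {
        SIMPLE,              /* p ⊇ q    */
        TRIVIAL,             /* p ⊇ {q}  */
        COMPLEX_LOAD,        /* p ⊇ ∗q   */
        COMPLEX_STORE,       /* ∗p ⊇ q   */
        COMPLEX_STORE_ADDR   /* ∗p ⊇ {q} */
    };

    struct constraint {
        enum constraint_kind kind;
        int lhs;             /* index of p */
        int rhs;             /* index of q */
    };

    /* For example, "s = &p" inside g would be encoded as the record
     * { TRIVIAL, idx(gs), idx(gp) }, matching constraint (2) above. */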

From the derivation, we can build the target set of a variable v from constraints of the form v ⊇ {x}. Thus, constraints 14+15 give q = {gx, gy}, meaning q may point to gx or gy anywhere in the program. To ensure the solution is sound, we must derive all facts. One feature of the system is that control-flow and calling context are ignored. This is called flow- and context-insensitivity and causes imprecision. But, without these simplifications, our analysis would not scale to large programs.

To perform the analysis efficiently we use a directed graph, where each variable is represented by a unique node and each constraint p ⊇ q by an edge p ← q. For each node n we also have a solution set, Sol(n), initialised from all constraints of the form n ⊇ {x}. So, for the previous example, we have:

    gs → gr
    {gp}   {}

    gp        fp → f∗ → gq
    {gx,gy}   {}   {}    {}

Here, each solution is placed below its node. The graph is then solved by propagating the solution of each node into all reachable successors. During this, complex constraints involving n are evaluated as Sol(n) changes. For a constraint ∗n ⊇ q, this is done by adding an edge v ← q, ∀v ∈ Sol(n). The final graph looks like:

    gs → gr
    {gp}   {gp}

    gp      → fp      → f∗      → gq
    {gx,gy}   {gx,gy}   {gx,gy}   {gx,gy}

Notice the new edge, caused by constraint 6 (fp ⊇ ∗gr). Various techniques for speeding up this computation have been proposed in the literature and the reader is referred to [19, 10, 15, 7, 24, 20]. One important consideration is that, for efficiency reasons, it is desirable to implement Sol(n) as an integer set. This permits the use of data structures supporting efficient set union, such as bit vectors or sorted arrays, and is achieved by indexing each variable.
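
For illustration, a minimal bit-vector realisation of such an integer set, supporting the union operation the propagation step relies on, might look as follows (this sketch is illustrative only and not the implementation evaluated later):

    #include <stdlib.h>

    #define BITS_PER_WORD (8 * sizeof(unsigned long))

    /* Sol(n) as a bit vector over variable indices. */
    struct bitset {
        unsigned long *words;
        size_t nwords;
    };

    static void bitset_init(struct bitset *s, size_t nvars) {
        s->nwords = (nvars + BITS_PER_WORD - 1) / BITS_PER_WORD;
        s->words  = calloc(s->nwords, sizeof(unsigned long));
    }

    static void bitset_add(struct bitset *s, int v) {
        s->words[v / BITS_PER_WORD] |= 1UL << (v % BITS_PER_WORD);
    }

    /* dst ∪= src (both indexed over the same variable universe); returns
     * non-zero if dst grew, which is what drives further propagation. */
    static int bitset_union(struct bitset *dst, const struct bitset *src) {
        int changed = 0;
        for (size_t i = 0; i < dst->nwords; i++) {
            unsigned long old = dst->words[i];
            dst->words[i] |= src->words[i];
            changed |= (dst->words[i] != old);
        }
        return changed;
    }
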
There are two limitations to the system described so far: it cannot handle aggregates or function pointers. For the former, three approaches are common: field-insensitive, where field information is discarded by modelling an aggregate with a single constraint variable; field-based, where all instances of a particular field are modelled with one variable; field-sensitive, where each instance of a field is modelled with a separate variable. The following clarifies this:

    typedef struct { int *f1; int *f2; } aggr;
    aggr a,b;             (insensitive)   (based)      (sensitive)
    int *c,d,e,f;
    a.f1 = &d;            a ⊇ {d}         f1 ⊇ {d}     af1 ⊇ {d}
    a.f2 = &f;            a ⊇ {f}         f2 ⊇ {f}     af2 ⊇ {f}
    b.f1 = &e;            b ⊇ {e}         f1 ⊇ {e}     bf1 ⊇ {e}
    c = a.f1;             c ⊇ a           c ⊇ f1       c ⊇ af1

The field-insensitive and field-based solutions are imprecise in different ways, with their relative precision depending upon the program in question. The focus of this paper is in extending our system to be field-sensitive. In fact, there exist several field-sensitive pointer analyses for Java, which are formulated using set constraints [21, 15, 17, 26]. However, Java presents a simpler problem and we must go beyond these to achieve our goal.

In the literature, function pointers are either dealt with in ad hoc ways (e.g. [10, 15]) or through a special lam constructor (e.g. [9, 8]). The latter gives something like:

    int f(int *p) { return p; }      f∗ ⊇ fp
    int (*p)(int*) = &f;             p ⊇ {lamf(fp)}
    int *q = ...;                    q ⊇ {...}
    p(q);                            ∗p(q)

with a corresponding rule for function application:

    ∗p(τ1, ..., τn)    p ⊇ {lamv(v1, ..., vn)}    ⟹    ∀1 ≤ i ≤ n. vi ⊇ τi

The main issue here is the implementation of lam. Certainly, we don't wish to sacrifice the ability to implement solutions as integer sets. One approach is to place the lam constructs into a table, so they are identified by index. Thus, if care is taken to avoid clashes with the variable identifiers, the two element types can co-exist in the same solution set. However, this is inelegant as we must litter our algorithm with special type checks. For example, when dealing with ∗p ⊇ q, we must check for lam values in Sol(p). In the next Section, we present a simple alternative, which forms part of our field-sensitive formulation — meaning field-sensitive analyses can model function pointers for free.

3. EXTENDING THE BASIC MODEL
The main observation behind our method is that, since variables are identified by integers, we can reference one as an offset from another. Thus, we introduce the following forms:

    p ⊇ ∗(q+k) | ∗(p+k) ⊇ q | ∗(p+k) ⊇ {q}

Here k is an arbitrary constant and ∗(p+k) means "load p into a temporary set, add k to each element and dereference as before". When k = 0, these forms are equivalent to those of the original language. The corresponding inference rules are given in Figure 2, where idx maps variables to their index. Now, suppose in our source program there is some function f, accepting x parameters. If the address of f has been taken, we create a block of x consecutively indexed variables, where the first represents the first parameter of f and so on. Thus, we model the address of f using the first index in the block, allowing us to reference the other parameters as an offset. We can also model return values using this mechanism by allocating another variable after the last parameter. Thus, we can determine the offset of the return value from the type of the function pointer being dereferenced. The following aims to clarify this:

    [deref4]   τ1 ⊇ ∗(τ2+k)     τ2 ⊇ {τ3}   idx(τ4) = idx(τ3)+k    ⟹  τ1 ⊇ τ4
    [deref5]   ∗(τ1+k) ⊇ τ2     τ1 ⊇ {τ3}   idx(τ4) = idx(τ3)+k    ⟹  τ4 ⊇ τ2
    [deref6]   ∗(τ1+k) ⊇ {τ2}   τ1 ⊇ {τ3}   idx(τ4) = idx(τ3)+k    ⟹  τ4 ⊇ {τ2}
    [add]      τ1 ⊇ τ2+k        τ2 ⊇ {τ3}   idx(τ4) = idx(τ3)+k    ⟹  τ1 ⊇ {τ4}

Figure 2: Extended inference rules. Note, add is only needed for the second half of Section 3.

    void f(int **p, int *q)        (1,2)  idx(fp) = 0, idx(fq) = 1
    { *p = q;                      (3)    ∗fp ⊇ fq
    }
    void g(...) { void (*p)(int**,int*);
      int *a,*b,c;                 (4,5)  idx(gp) = 2, idx(ga) = 3
                                   (6,7)  idx(gb) = 4, idx(gc) = 5
      p = &f;                      (8)    gp ⊇ {fp}
      b = &c;                      (9)    gb ⊇ {gc}
      p(&a,b);                     (10)   ∗(gp+0) ⊇ {ga}
    }                              (11)   ∗(gp+1) ⊇ gb

    (12) fp ⊇ {ga}    (deref6, 8, 10, 1)
    (13) fq ⊇ gb      (deref5, 1, 2, 8, 11)
    (14) fq ⊇ {gc}    (trans, 9, 13)
    (15) ga ⊇ fq      (deref2, 3, 12)
    (16) ga ⊇ {gc}    (trans, 14, 15)

Thus, we see &f is translated as fp — its first parameter. One difficulty is the use of invalid casts:

    void f(int *p) { ... }                  idx(fp) = 0
    int g(int *a, int *b) {                 idx(ga) = 1, idx(gb) = 2
      void (*p)(int*,int*);                 idx(gp) = 3
      p = (void(*)(int*,int*)) &f;          gp ⊇ {fp}
      *p(a,b);                              ∗(gp+0) ⊇ ga
    }                                       ∗(gp+1) ⊇ gb

Here, ∗(gp+1) ⊇ gb derives ga ⊇ gb as idx(ga) = idx(fp)+1. This is unfortunate, although it is unclear how to model the above anyway. To prevent such unwanted propagation we can extend our mechanism with end() information for each variable. This determines where the enclosing block of consecutively allocated variables ends and we only permit offsets which remain within this. For example, in the above, end(fp) = 0 and end(ga) = end(gb) = 2 and we can identify the problem as idx(∗(gp+1)) > end(∗gp).
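
For illustration, assuming per-variable end values are stored in an array indexed by variable number, the guard on a dereference target v at offset k is a single comparison (an illustrative sketch, not our actual code):

    /* Reject any offset that escapes the target's block of consecutively
     * allocated variables (cf. idx(∗(gp+1)) > end(∗gp) above). */
    int offset_ok(int v, int k, const int *end)
    {
        return v + k <= end[v];
    }
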
We now consider how this system can be made field-sensitive, which we have already indicated is easier for Java than C. So, what is the difference? The answer is that, in C, we can take the address of a field. Indeed, it turns out the language of the previous Section is enough for field-sensitive analysis of Java. This is achieved by using blocks of constraint variables, as we did for functions, to represent aggregates. For example:

    typedef struct { int *f1; int *f2; } aggr;
    aggr a,*b;         idx(af1) = 0, idx(af2) = 1, idx(b) = 2
    int *p,**q,c;      idx(p) = 3, idx(q) = 4, idx(c) = 5
    b = &a;            b ⊇ {af1}
    b->f2 = &c;        ∗(b+1) ⊇ {c}
    p = b->f2;         p ⊇ ∗(b+1)

But, how can we translate "q = &(b->f2);"? The problem is that we want to load the index of af2 into Sol(q), but there is no mechanism for this. So, we extend the language to permit the translation: q ⊇ b+1, meaning load b into a temporary, add 1 to each element and merge into q. Note the inference rule in Figure 2. This form can be represented by turning the constraint graph into a weighted multigraph, where weight determines increment. However, this introduces the Positive Weight Cycle (PWC) problem:

    aggr a,*p; void *q;
    q = &a;                     q ⊇ {a}
    p = q;                      p ⊇ q
    q = &p->f2;                 q ⊇ p+1
    /* now use q as int* */

This is legal and well-defined C code. The cycle arises from flow-insensitivity and, we argue, any such analysis must deal with this. Note, cycles can also arise from an imprecise model of the heap (see below). In general, the problem is that cycles describe infinite derivations. To overcome this, we use end() information, as with function pointers, so that a variable is only incremented within its enclosing block.

Another problem with weighted edges is that cycle elimination is now unsafe. Eliminating cycles is a common optimisation for speeding up the analysis, which exploits the fact that nodes in a cycle have the same solution [19, 10, 7]. Thus, each cycle is replaced by a single representative — reducing the size and complexity of the graph. Returning to our problem, we observe that cycles can be collapsed if there is a zero weighted path between all nodes and intra-cycle weighted edges are preserved as self loops.

The heap is a further source of complication. One approach, used by most pointer analyses, is to model all objects returned by a particular call to malloc with one variable. This has some implications, highlighted in the following:

    typedef struct { double d1; int *f2; } aggr1;
    typedef struct { int *f1; int *f3; } aggr2;

    void *f(int s) { return malloc(s); }      f∗ ⊇ {HEAP0}
    void *g(int s) { return malloc(s); }      g∗ ⊇ {HEAP1}
    aggr1 *p = f(sizeof(aggr1));              p ⊇ f∗
    aggr2 *q = f(sizeof(aggr2));              q ⊇ f∗
    int *x = f(100);                          x ⊇ f∗
    int *y = g(100);                          y ⊇ g∗

The issue is that we cannot, in general, determine which heap variables will be used as aggregates. Indeed, the same variable can be used as both aggregate and scalar (e.g. HEAP0 above). Thus, we either model heap variables field-insensitively or assume they can be treated as aggregates. Our choice is the latter, raising a further problem: how many fields should each heap variable have? A simple solution is to give them the same number as the largest struct in the program. Effectively, then, each heap variable is modelling the C union of all structs. So, in the above, HEAP0 and HEAP1 both model aggr1 and aggr2 and are implemented with two constraint variables: the first representing fields f1 and d1; the second f2 and f3. The observant reader will have noticed something strange here: the first constraint variable models fields of different sizes. This seems a problem as, for example, writing to d1 should invalidate f1 and f3.

                Ver      LOC      Triv    Simp    Comp    # Vars            # Heap        # PWC
    make        3.79     16164    1427    4417    1557    4773 / 6920       69 / 1794     0
    gawk        3.1.0    19598    2263    7797    2265    7288 / 10125      96 / 2496     0
    bash        2.05     55324    3594    12076   2659    10831 / 13109     36 / 936      0
    emacs       20.7     93151    11715   10437   5135    17961 / 38170     172 / 12900   0
    sendmail    8.11.4   49053    5444    9595    2286    10218 / 12869     13 / 949      1
    named       9.2.0    75599    17848   28972   24088   34649 / 47101     24 / 1704     1
    gs          6.51     159853   21653   44030   36431   63568 / 100209    17 / 1887     2

Table 1: LOC measures non-comment, non-blank lines. Initial constraints are Trivial (p ⊇ {q}), Simple (p ⊇ q) and Complex (involving '∗'). The total number of constraint variables is given in # Vars, with # Heap showing the number modelling the heap. For each, the two numbers are for the field-insensitive and -sensitive analyses respectively. Finally, # PWC counts positive weight cycles in the final graph (for the sensitive analysis).

In practice, however, this cannot be exploited without using undefined C, such as:

    aggr1 *p = malloc(sizeof(aggr1));        idx(HEAP0_0) = 0
                                             idx(HEAP0_1) = 1
    int a,*r;                                p ⊇ {HEAP0_0}
    aggr2 *q = (aggr2 *) p;                  q ⊇ p
    q->f3 = &a;                              ∗(q+1) ⊇ {a}
    p->d1 = 1.0;  /* clobbers q->f3 */       ∗(p+0) ⊇ {?}
    r = q->f3;                               r ⊇ ∗(q+1)

Here, our analysis unsoundly concludes that r only points to a. Note the special value "?", used to indicate that a pointer may target anything. In general, we are not concerned with this issue as our objective is to model portable C programs only. Finally, nested structs are easily dealt with by "inlining" them into their enclosing struct, so that each nested field is modelled by a distinct constraint variable.

4. EXPERIMENTAL STUDY
We now present empirical data on a range of benchmarks comparing two example field-sensitive and -insensitive solvers. Figure 3 provides pseudo-code for the field-sensitive solver. The insensitive algorithm is similar, but operates on the simpler language of Figure 1. Notice that cycles are identified with Tarjan's algorithm [25] and not the partial online detector from [7]. In our experience, we have found this configuration to be highly efficient, not least because Tarjan's algorithm can topologically sort the graph for free. Our implementation also used bit vectors for the solution sets, applied the variable substitution methods of [20] and the (hash-based) duplicate set compaction scheme from [10].

To generate constraints, the SUIF 2.0 compiler was employed and a few points must be made about this: the approach of Section 3 was used for modelling the heap; string constants were all treated as one object; lastly, external function calls were modelled with hand-crafted summary functions. Note, we were able to compile all the benchmarks with only superficial modifications, such as adding extra "#include" directives for missing standard library headers. Finally, the experimental machine was a 900MHz Athlon with 1GB of main memory, running Redhat 8.0 (Psyche). The executables were compiled using gcc 3.2, with "-O3".

Table 1 provides information on our benchmarks, which are all open source and available online. In particular, we note the number of variables differs between the insensitive and sensitive analyses. This was expected as, in the latter, each aggregate is now modelled using several variables. In fact, there will be more constraints for similar reasons, although these are omitted for brevity as they are essentially the same.

    foreach y ∈ V do changed(y) = true;
    while ∃y. changed(y) do
        collapse zero weight cycles with Tarjan's algorithm;
        foreach n ∈ V in (weak) topological order do
            if changed(n) then
                changed(n) = false;
                // process complex constraints involving ∗n
                foreach c ∈ C(n) do case c of
                    ∗(n+k) ⊇ w:
                        foreach v ∈ Sol[n] do
                            x = v + k;
                            if x ≤ end(v) ∧ (w → x) ∉ E do
                                E ∪= {w → x};
                                if Sol[w] ⊈ Sol[x] do
                                    Sol[x] ∪= Sol[w]; changed(x) = true;
                    w ⊇ ∗(n+k):
                        foreach v ∈ Sol[n] do
                            x = v + k;
                            if x ≤ end(v) ∧ (x → w) ∉ E do
                                E ∪= {x → w};
                                if Sol[x] ⊈ Sol[w] do
                                    Sol[w] ∪= Sol[x]; changed(w) = true;
                    ∗(n+k) ⊇ {w}:
                        foreach v ∈ Sol[n] do
                            x = v + k;
                            if x ≤ end(v) ∧ w ∉ Sol[x] do
                                Sol[x] ∪= {w}; changed(x) = true;
                // process outgoing edges from n
                foreach n →k w ∈ E do
                    foreach v ∈ Sol[n] do
                        x = v + k;
                        if x ≤ end(v) do Sol[w] ∪= {x};
                    if Sol[w] changed then changed(w) = true;

Figure 3: The Field-Sensitive Pointer Analysis Algorithm. All scalar variables (e.g. n and x) have integer type and Sol is an array of integer sets, where Sol[n] is initialised by constraints of the form n ⊇ {q}. C(n) contains all constraints involving "∗n". The algorithm consists of an outer loop which iterates until no change is observed. On each iteration, zero weighted cycles are collapsed and then all remaining nodes are visited in topological order. To visit a node n, any complex constraints involving it are evaluated and then Sol[n] is propagated along all outgoing edges. Note the use of end information to prevent infinite loops arising from positive weight cycles.
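
For illustration, the first case of Figure 3, evaluating a single constraint ∗(n+k) ⊇ w, might be realised as follows; this deliberately naive, dense-array sketch is illustrative only (the solver we evaluate uses the bit-vector sets and graph representation described earlier, and the names and bounds below are assumptions):

    #include <stdbool.h>

    #define MAXVARS 1024

    static bool sol[MAXVARS][MAXVARS];   /* sol[n][v]: n may point to v        */
    static bool edge[MAXVARS][MAXVARS];  /* edge[a][b]: propagate Sol[a] to b  */
    static int  end_of[MAXVARS];         /* last index of v's enclosing block  */
    static bool changed[MAXVARS];

    /* Evaluate ∗(n+k) ⊇ w: each target v of n is shifted by k, checked
     * against its enclosing block, and a new edge w -> v+k is added,
     * propagating Sol[w] into Sol[v+k]. */
    static void eval_store(int n, int k, int w, int nvars)
    {
        for (int v = 0; v < nvars; v++) {
            if (!sol[n][v])
                continue;
            int x = v + k;
            if (x > end_of[v])
                continue;                /* offset escapes v's block: drop */
            if (!edge[w][x]) {
                edge[w][x] = true;       /* new edge w -> x                */
                for (int t = 0; t < nvars; t++) {
                    if (sol[w][t] && !sol[x][t]) {
                        sol[x][t] = true;    /* Sol[x] ∪= Sol[w]           */
                        changed[x] = true;
                    }
                }
            }
        }
    }
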

                    Time/s    Avg Working   Avg Deref    Dereference Sites (% of normalised total)
                              Set Size      Size (N)     0      1      2      3-10   11-100   101-1000   1000+
    make      fdi   0.07      10.3          336.7        5.7    7.3    0.77   5.0    16.0     66.0       0.0
              fds   0.17      7.0           17.2         5.9    20.0   6.3    4.2    64.0     0.0        0.0
    gawk      fdi   0.12      28.7          633.9        6.2    7.8    6.3    4.3    27.0     0.93       48.0
              fds   0.3       15.3          22.4         7.1    24.0   21.0   7.0    41.0     0.0        0.0
    bash      fdi   0.51      84.5          543.0        5.4    7.6    2.7    3.4    20.0     61.0       0.0
              fds   0.53      52.5          86.7         5.4    24.0   6.1    2.9    5.8      56.0       0.0
    emacs     fdi   0.4       12.2          79.3         24.0   17.0   27.0   2.7    7.9      21.0       0.53
              fds   0.69      2.6           5.4          25.0   25.0   32.0   10.0   7.6      0.37       0.03
    sendmail  fdi   0.49      58.7          558.4        3.5    17.0   1.8    4.8    8.0      65.0       0.0
              fds   2.05      106.5         214.2        4.6    23.0   3.3    4.2    4.4      60.0       0.0
    named     fdi   30.0      570.5         2865.5       3.4    3.6    3.9    28.0   1.4      1.2        58.0
              fds   129.1     2042.9        2167.7       5.1    8.4    1.3    28.0   2.2      2.7        52.0
    gs        fdi   277.4     1148.7        7703.1       32.0   2.7    1.2    3.1    3.8      1.7        56.0
              fds   2510.4    5977.0        7365.2       33.0   7.2    2.9    2.1    0.25     0.35       54.0

Table 2: Experimental data on the field-sensitive (fds) and field-insensitive (fdi) formulations of our analysis.

Table 1 also looks at the number of positive weight cycles in the final graph. It is important to realise the count may be higher during solving, as some cycles may end up being combined. Nevertheless, we believe this figure indicates that positive weight cycles are rare.

Table 2 looks at the effect on time and precision of using our field-sensitive analysis versus its insensitive counterpart. The data clearly shows that field-sensitivity is more expensive to compute. The average working set size gives the average size of the final solution sets computed by the algorithm. This figure gives insight into the cost of a set union operation during solving and, hence, we expected a correlation with execution time. Unfortunately, only three benchmarks appear to support this and one explanation might be that the greater accuracy of the sensitive analysis means there are fewer cycles to collapse. Furthermore, we note the number of positive weight cycles appears to impact upon the average working set size. "Avg Deref" reports the average set size at dereference sites. However, to facilitate a comparison (in terms of precision) between the two analyses we must normalise this value. To understand why, consider a pointer p which targets the first three fields of some struct a. For the insensitive analysis, we have the solution p ⊇ {a}, whilst the sensitive analysis gives p ⊇ {af1, af2, af3}. Thus, the latter seems less accurate since it is larger. However, this is misleading as the insensitive analysis actually concludes that p may point to any field of a. Therefore, we normalise the insensitive solution by counting each aggregate by the number of variables representing it in the sensitive formulation.
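
For illustration, the normalised size of an insensitive points-to set can be computed by weighting each target by the number of variables representing it in the sensitive formulation (an illustrative sketch, not our measurement code):

    /* targets:  the insensitive points-to set at a dereference site
     * nfields:  per-variable count of representing field-sensitive
     *           constraint variables (1 for scalars)                 */
    int normalised_size(const int *targets, int ntargets, const int *nfields)
    {
        int size = 0;
        for (int i = 0; i < ntargets; i++)
            size += nfields[targets[i]];
        return size;
    }
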
We also break up the average deref figure to show its distribution. Note that zero sized sets arise from unreachable code, typically occurring when a function is linked with the program, but not actually called. The results are encouraging and show the field-sensitive analysis to give more precise results across the board. However, it appears the payoff decreases with program size. In particular, a large proportion of sets for the two largest benchmarks have a thousand elements or more. We believe the main reason for this trend can be attributed to the number of variables modelling the heap — which does not increase with program size. Thus, these variables will likely be modelling an increasingly large number of actual heap objects. In fact, the data for emacs appears to support this, since it has an unusually high number of heap variables and appears to have a much better distribution of sets than the others.

5. RELATED WORK
Flow- and context-insensitive pointer analysis has been studied extensively in the literature (see e.g. [19, 15, 7, 10, 20, 2, 23, 5]). These works can, for the most part, be placed into two camps: extensions of either Andersen's [2] or Steensgaard's [23] algorithm. The former use inclusion constraints (i.e. set-constraints) and are more precise but slower, while the latter adopt unification systems and sacrifice precision in favour of speed. Thus, new developments tend to be focused either on speeding up Andersen's algorithm (e.g. [10, 7, 20, 19]) or on improving the precision of Steensgaard's (e.g. [5, 6, 16]). Furthermore, there have been numerous studies on the relative precision of these two approaches (see e.g. [17, 9, 12, 22, 5, 6]), with the results confirming that set-constraints offer useful improvements in precision. We refer the reader to [11] for a more thorough survey of pointer analysis.

For field-sensitive pointer analysis of C, there are several previous works (e.g. [27, 28, 3, 16]), although only two are for the flow- and context-insensitive setting. The first, due to Yong et al. [28], is a framework covering a spectrum of analyses from complete field-insensitivity through various levels of field-sensitivity. The main difference from our work is the approach taken to modelling field-addresses where, instead of using integer offsets, the actual field names themselves are used. To understand this, consider:

    typedef struct { int *f1; int *f2; } aggr1;

    aggr1 a,*b; int *p,c;
    a.f1 = &c;              a.f1 ⊇ {c}
    b = &a;                 b ⊇ {a}
    p = b->f2;              p ⊇ (∗b)||f2

Here, the || operator is just string concatenation, where a||b ⇒ a.b and (∗a)||b ⇒ c.b, if a ⊇ {c}. Thus, p ⊇ (∗b)||x replaces p ⊇ ∗(b+k) from our system. While this difference appears trivial, it hides some complications. For example:

    typedef struct { int *f1; int *f2; } aggr1;
    typedef struct { int *f3; int *f4; } aggr2;
    aggr1 a; aggr2 b; void *c; int d;

    b.f3 = &d;                     b.f3 ⊇ {d}
    c = &b;                        c ⊇ {b}
    a = (struct aggr1) *c;         a.f1 ⊇ (∗c)||f1
                                   a.f2 ⊇ (∗c)||f2

The above is well-defined C code, but the last statement presents an issue for the name string approach. This is because the type of "a" determines which fields are involved in the assignment. The problem is that the constraint variables b.f1 and b.f2 (arising from (∗c)||f1 and (∗c)||f2) do not exist, as "b" has a different, but compatible, type to "a". To overcome this, Yong et al. introduce three functions, normalise, lookup and resolve, whose purpose is to bridge the gap between different names representing the same location (such as b.f1 and b.f3 in the above). The key point is that, by using offsets instead of name strings, our system avoids these issues entirely and, thus, provides a simpler and more elegant formalisation.

An important feature of the Yong et al. framework is the ability to describe both portable and non-portable analyses. The latter can be used to support commonly found, but undefined, C coding practices which rely on implementation-specific information, such as type size and alignment. In contrast, our system as described cannot safely handle such practices. However, with some small modification, it could be made to do so, whilst retaining its relative simplicity.

Yong et al. also examine the precision obtainable with field-sensitivity and, although smaller benchmarks were used, their findings match ours. Finally, they do not discuss the PWC problem, perhaps because it is only relevant to particular instances of their framework. Nevertheless, to obtain an equivalent analysis to ours, this issue must be addressed. Indeed, Chandra and Reps do so in their analyses, which they describe as an instance of the Yong et al. framework [3, 4]. Their solution is to adopt a worst-case assumption about pointers in positive weight cycles (i.e. they point to every field of each target). Unfortunately, they do not provide any experimental data which could be used as the basis of a comparison with our system.

6. CONCLUSION
We have presented a novel approach to modelling indirect function calls and aggregates for pointer analysis of C. Furthermore, we evaluated its effect on time and precision using an example implementation. Our results indicate that field-sensitivity, while offering greater precision, is expensive to compute. In the future, we are interested in extending our formulation to be flow-sensitive and investigating the pros and cons of doing this. We have also been investigating the potential of several new graph algorithms for speeding up pointer analysis [19, 18]. Finally, the reader is referred to [18] for a more thorough examination of this material.

7. REFERENCES
[1] A. Aiken. Introduction to set constraint-based program analysis. Sci. Comp. Prog., 35(2–3):79–111, 1999.
[2] L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, DIKU, University of Copenhagen, 1994.
[3] S. Chandra and T. Reps. Physical type checking for C. In Proc. ACM Workshop on Program Analysis for Software Tools and Engineering, pages 66–75, 1999.
[4] S. Chandra and T. Reps. Physical type checking for C. Technical Report BL0113590-990302-04, Bell Laboratories, Lucent Technologies, 1999.
[5] M. Das. Unification-based pointer analysis with directional assignments. In Proc. ACM Conf. Programming Language Design and Implementation, pages 35–46, 2000.
[6] M. Das, B. Liblit, M. Fähndrich, and J. Rehof. Estimating the impact of scalable pointer analysis on optimization. In Proc. Static Analysis Symposium, volume 2126 of LNCS, pages 260–278. Springer, 2001.
[7] M. Fähndrich, J. S. Foster, Z. Su, and A. Aiken. Partial online cycle elimination in inclusion constraint graphs. In Proc. ACM Conf. Programming Language Design and Implementation, pages 85–96, 1998.
[8] J. S. Foster, M. Fähndrich, and A. Aiken. Flow-insensitive points-to analysis with term and set constraints. Technical Report CSD-97-964, University of California, Berkeley, 1997.
[9] J. S. Foster, M. Fähndrich, and A. Aiken. Polymorphic versus monomorphic flow-insensitive points-to analysis for C. In Proc. Static Analysis Symposium, volume 1824 of LNCS, pages 175–198. Springer, 2000.
[10] N. Heintze and O. Tardieu. Ultra-fast aliasing analysis using CLA: A million lines of C code in a second. In Proc. ACM Conf. Programming Language Design and Implementation, pages 254–263, 2001.
[11] M. Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM Workshop on Program Analysis for Software Tools and Engineering, pages 54–61, 2001.
[12] M. Hind and A. Pioli. Which pointer analysis should I use? In Proc. ACM Symp. Software Testing and Analysis, pages 113–123, 2000.
[13] S. Horwitz. Precise flow-insensitive may-alias analysis is NP-hard. ACM Transactions on Programming Languages and Systems, 19(1):1–6, January 1997.
[14] W. Landi. Undecidability of static analysis. ACM Letters on Programming Languages and Systems, 1(4):323–337, 1992.
[15] O. Lhoták and L. J. Hendren. Scaling Java points-to analysis using SPARK. In Proc. Conf. Compiler Construction, volume 2622 of LNCS, pages 153–169. Springer, 2003.
[16] D. Liang and M. J. Harrold. Efficient points-to analysis for whole-program analysis. In Proc. Foundations of Software Engineering, volume 1687 of LNCS, pages 199–215. Springer, 1999.
[17] D. Liang, M. Pennings, and M. J. Harrold. Extending and evaluating flow-insensitive and context-insensitive points-to analyses for Java. In Proc. ACM Workshop on Program Analysis for Software Tools and Engineering, pages 73–79, 2001.
[18] D. J. Pearce. Some directed graph algorithms and their application to pointer analysis (work in progress). PhD thesis, Imperial College, London, 2004.
[19] D. J. Pearce, P. H. J. Kelly, and C. Hankin. Online cycle detection and difference propagation for pointer analysis. In Proc. IEEE Workshop on Source Code Analysis and Manipulation, pages 3–12, 2003.
[20] A. Rountev and S. Chandra. Off-line variable substitution for scaling points-to analysis. In Proc. ACM Conf. Programming Language Design and Implementation, pages 47–56, 2000.
[21] A. Rountev, A. Milanova, and B. G. Ryder. Points-to analysis for Java using annotated constraints. In Proc. ACM Conf. Object-Oriented Programming, Systems, Languages and Applications, pages 43–55, 2001.
[22] M. Shapiro and S. Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proc. ACM Symp. Principles of Programming Languages, pages 1–14, 1997.
[23] B. Steensgaard. Points-to analysis in almost linear time. In Proc. ACM Symp. Principles of Programming Languages, pages 32–41, 1996.
[24] Z. Su, M. Fähndrich, and A. Aiken. Projection merging: Reducing redundancies in inclusion constraint graphs. In Proc. ACM Symp. Principles of Programming Languages, pages 81–95, 2000.
[25] R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.
[26] J. Whaley and M. S. Lam. An efficient inclusion-based points-to analysis for strictly-typed languages. In Proc. Static Analysis Symposium, volume 2477 of LNCS, pages 180–195. Springer, 2002.
[27] R. P. Wilson and M. S. Lam. Efficient context-sensitive pointer analysis for C programs. In Proc. ACM Conf. Programming Language Design and Implementation, pages 1–12, 1995.
[28] S. H. Yong, S. Horwitz, and T. Reps. Pointer analysis for programs with structures and casting. In Proc. ACM Conf. Programming Language Design and Implementation, pages 91–103, 1999.
