Static Single Assignment Book
Lots of authors
Part I
Vanilla SSA

CHAPTER 1
Introduction
J. Singer
In computer programming, as in real life, names are useful handles for concrete
entities. The key message of this book is that having unique names for distinct
entities reduces uncertainty and imprecision.
For example, consider overhearing a conversation about ‘Homer.’ Without
any more contextual clues, you cannot disambiguate between Homer Simpson
and Homer the classical Greek poet; or indeed, any other people called Homer
that you may know. As soon as the conversation mentions Springfield (rather
than Smyrna), you are fairly sure that the Simpsons television series (rather than
Greek poetry) is the subject. On the other hand, if everyone had a unique name,
then there would be no possibility of confusing 20th century American cartoon
characters with ancient Greek literary figures.
This book is about the static single assignment form (SSA), which is a nam-
ing convention for storage locations (variables) in low-level representations of
computer programs. The term static indicates that SSA relates to properties
and analysis of program text (code). The term single refers to the uniqueness
property of variable names that SSA imposes. As illustrated above, this enables a
greater degree of precision. The term assignment means variable definitions. For
instance, in the code
x = y + 1;
the variable x is being assigned the value of the expression (y + 1). This is a definition,
or assignment statement, for x. A compiler engineer would interpret the above
assignment statement to mean that the lvalue of x (i.e., the memory location
labeled as x) should be modified to store the value (y + 1).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.1 Definition of SSA
The simplest, least constrained, definition of SSA can be given using the following
informal prose:
“A program is defined to be in SSA form if each variable is a target of exactly one assignment statement in the program text.”
However there are various, more specialized, varieties of SSA, which impose
further constraints on programs. Such constraints may relate to graph-theoretic
properties of variable definitions and uses, or the encapsulation of specific
control-flow or data-flow information. Each distinct SSA variety has specific
characteristics. Basic varieties of SSA are discussed in Chapter 2. Part III of this
book presents more complex extensions.
One important property that holds for all varieties of SSA, including the simplest
definition above, is referential transparency: i.e., since there is only a single
definition for each variable in the program text, a variable’s value is independent
of its position in the program. We may refine our knowledge about a particu-
lar variable based on branching conditions, e.g. we know the value of x in the
conditionally executed block following an if statement that begins with
if (x == 0)
however the underlying value of x does not change at this if statement. Pro-
grams written in pure functional languages are referentially transparent. Such
referentially transparent programs are more amenable to formal methods and
mathematical reasoning, since the meaning of an expression depends only on
the meaning of its subexpressions and not on the order of evaluation or side
effects of other expressions. For a referentially opaque program, consider the
following code fragment.
x = 1;
y = x + 1;
x = 2;
z = x + 1;
A naive (and incorrect) analysis may assume that the values of y and z are equal,
since they have identical definitions of (x + 1). However the value of variable
x depends on whether the current code position is before or after the second
definition of x , i.e., variable values depend on their context. When a compiler
transforms this program fragment to SSA code, it becomes referentially transpar-
ent. The translation process involves renaming to eliminate multiple assignment
statements for the same variable. Now it is apparent that y and z are equal if and
only if x1 and x2 are equal.
x1 = 1;
y = x1 + 1;
x2 = 2;
z = x2 + 1;
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Informal semantics of SSA
In the previous section, we saw how straightline sequences of code can be trans-
formed to SSA by simple renaming of variable definitions. The target of the
definition is the variable being defined, on the left-hand side of the assignment
statement. In SSA, each definition target must be a unique variable name. Conversely,
variable names can be used multiple times on the right-hand side of any
assignment statements, as source variables for definitions. Throughout this book,
renaming is generally performed by adding integer subscripts to original variable
names. In general this is an unimportant implementation feature, although it
can prove useful for compiler debugging purposes.
The φ-function is the most important SSA concept to grasp. It is a special
statement, known as a pseudo-assignment function. Some call it a “notational
fiction.” The purpose of a φ-function is to merge values from different incoming
paths, at control flow merge points.
Consider the following code example and its corresponding control-flow graph
(CFG) representation:

x = input();
if (x == 42)
then
  y = 1;
else
  y = x + 2;
end
print(y);

[CFG: the entry node computes x ← input() and tests (x = 42)?; block A on the true branch performs y ← 1, block B on the false branch performs y ← x + 2; both branches join in a final node that executes print(y).]
At the control-flow merge point, a φ-function introduces a new variable y3, which takes the value of either y1 or y2. Thus the
SSA version of the program is:

x = input();
if (x == 42)
then
  y1 = 1;
else
  y2 = x + 2;
end
y3 = φ(y1, y2);
print(y3);

[CFG: the entry node computes x ← input() and tests (x = 42)?; block A performs y1 ← 1, block B performs y2 ← x + 2; the join node executes y3 ← φ(A : y1, B : y2) followed by print(y3).]
In terms of their position, φ-functions are generally placed at control flow
merge points, i.e., at the heads of basic blocks that have multiple predecessors
in control-flow graphs. A φ-function at block b has n parameters if there are n
incoming control-flow paths to b . The behavior of the φ-function is to select
dynamically the value of the parameter associated with the actually executed
control-flow path into b . This parameter value is assigned to the fresh vari-
able name, on the left-hand side of the φ-function. Such pseudo-functions are
required to maintain the SSA property of unique variable definitions, in the pres-
ence of branching control flow. Hence, in the above example, y3 is set to y1 if
control flows from basic block A, and set to y2 if it flows from basic block B . Notice
that the CFG representation here adopts a more expressive syntax for φ-functions
than the standard one, as it associates predecessor basic block labels Bi with
corresponding SSA variable names ai, i.e., a0 = φ(B1 : a1, . . . , Bn : an). Throughout
this book, basic block labels will be omitted from φ-function operands when the
omission does not cause ambiguity.
It is important to note that, if there are multiple φ-functions at the head
of a basic block, then these are executed in parallel, i.e., simultaneously not
sequentially. This distinction becomes important if the target of a φ-function
is the same as the source of another φ-function, perhaps after optimizations
such as copy propagation (see Chapter 8). When φ-functions are eliminated
in the SSA destruction phase, they are sequentialized using conventional copy
operations, as described in 21.6. This subtlety is particularly important in the
context of register allocated code (see Chapter 22).
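To make the parallel-evaluation requirement concrete, the following Python sketch (the helper and its data structures are illustrative assumptions, not code from this book) evaluates all φ-functions at a block head by reading every selected operand before writing any target; evaluating the same φ-functions one after the other could let one φ read a value that another φ of the same block has just overwritten.

def eval_phis_parallel(phis, pred_index, env):
    """Evaluate the phi-functions at the head of a block simultaneously.

    phis       : list of (target, operands); operands[i] is the variable name
                 carried by the i-th incoming control-flow edge.
    pred_index : index of the edge along which control actually arrived.
    env        : mapping from variable names to runtime values (updated in place).
    """
    # Read phase: select every operand before any target is written.
    values = [env[operands[pred_index]] for _, operands in phis]
    # Write phase: only now assign the targets.
    for (target, _), value in zip(phis, values):
        env[target] = value

# x2 <- phi(x1, ...) and y2 <- phi(x2, ...) at the same block head:
# y2 must receive the old value of x2, not the one x2 is being assigned.
env = {"x1": 10, "x2": 99}
eval_phis_parallel([("x2", ["x1"]), ("y2", ["x2"])], pred_index=0, env=env)
assert env["x2"] == 10 and env["y2"] == 99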
Strictly speaking, φ-functions are not directly executable in software, since the
dynamic control-flow path leading to the φ-function is not explicitly encoded as
an input to the φ-function. This is tolerable, since φ-functions are generally only
used during static analysis of the program. They are removed before any program
interpretation or execution takes place. However, there are various executable
extensions of φ-functions, such as φif or γ functions (see Chapter 14), which
take an extra parameter to encode the implicit control dependence that dictates
the argument the corresponding φ-function should select. Such extensions are
useful for program interpretation (see Chapter 14), if-conversion (see Chapter 20),
or hardware synthesis (see Chapter 23).
We present one further example in this section, to illustrate how a loop control-flow structure appears in SSA. Here is the non-SSA version of the program and its corresponding control-flow-graph SSA version:

x = 0;
y = 0;
while (x < 10) {
  y = y + x;
  x = x + 1;
}
print(y);

[CFG of the SSA version: an entry block sets x1 ← 0 and y1 ← 0; the loop header computes x2 ← φ(x1, x3) and y2 ← φ(y1, y3) and tests (x2 < 10)?; the loop body computes y3 ← y2 + x2 and x3 ← x2 + 1 and branches back to the header; the exit block executes print(y2).]

The SSA code features two φ-functions in the loop header; these merge incoming
definitions from before the loop for the first iteration, and from the loop
body for subsequent iterations.
It is important to outline that SSA should not be confused with (dynamic) sin-
gle assignment (DSA or simply SA) form used in automatic parallelization. Static
single assignment does not prevent multiple assignments to a variable during
program execution. For instance, in the SSA code fragment above, variables y3
and x3 in the loop body are redefined dynamically with fresh values at each loop
iteration.
Full details of the SSA construction algorithm are given in Chapter 3. For now,
it is sufficient to see that:
1. A φ-function has been inserted at the appropriate control flow merge point
where multiple reaching definitions of the same variable converged in the
original program.
2. Integer subscripts have been used to rename variables x and y from the
original program.
[Figure 1.1: non-zero value analysis of a small program, (a) with dense data-flow facts of the form [x, y] attached to the entry and exit of every basic block of the CFG, and (b) with sparse data-flow facts attached directly to the SSA variable names x1, x2, y1, y2, y3.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3 Comparison with classical data-flow analysis
As we will discover further in Chapters 13 and 8, one of the major advantages of
SSA form concerns data-flow analysis. Data-flow analysis collects information
about programs at compile time in order to make optimizing code transforma-
tions. During actual program execution, information flows between variables.
Static analysis captures this behavior by propagating abstract information, or
data-flow facts, using an operational representation of the program such as
the control-flow graph (CFG). This is the approach used in classical data-flow
analysis.
Often, data-flow information can be propagated more efficiently using a func-
tional, or sparse, representation of the program such as SSA. When a program is
translated into SSA form, variables are renamed at definition points. For certain
data-flow problems (e.g. constant propagation) this is exactly the set of pro-
gram points where data-flow facts may change. Thus it is possible to associate
data-flow facts directly with variable names, rather than maintaining a vector of
data-flow facts indexed over all variables, at each program point.
Figure 1.1 illustrates this point through an example of non-zero value analysis.
For each variable in a program, the aim is to determine statically whether that
variable can contain a zero integer value (i.e., null) at runtime. Here 0 represents
the fact that the variable is null, 0̸ the fact that it is non-null, and ⊤ the fact
that it is maybe-null. With classical dense data-flow analysis on the CFG in
Figure 1.1(a), we would compute information about variables x and y for each
of the entry and exit points of the six basic blocks in the CFG, using suitable
data-flow equations. Using sparse SSA-based data-flow analysis on Figure 1.1(b),
we compute a single data-flow fact for each SSA variable name, and that fact is valid wherever the variable is live.
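As a small illustration of the sparse approach (the lattice encoding and the program representation are assumptions made for this sketch, not the book's notation), non-zero value analysis over SSA can attach one fact to each SSA variable name and propagate it through φ-functions only:

def join(a, b):
    """Least upper bound on the lattice {unreached, "0", "nz", "top"}."""
    if a == "unreached":
        return b
    if b == "unreached":
        return a
    return a if a == b else "top"

def nonzero_analysis(defs):
    """defs maps each SSA name to ("const", value) or ("phi", [operand names]).
    Returns one data-flow fact per SSA name, instead of a vector of facts
    for every variable at every program point."""
    fact = {v: "unreached" for v in defs}
    changed = True
    while changed:
        changed = False
        for v, d in defs.items():
            if d[0] == "const":
                new = "0" if d[1] == 0 else "nz"
            else:                                   # phi: join the operands' facts
                new = "unreached"
                for op in d[1]:
                    new = join(new, fact[op])
            if new != fact[v]:
                fact[v], changed = new, True
    return fact

# x1 = 0; x2 = 1; x3 = phi(x1, x2): x3 is maybe-null (top).
print(nonzero_analysis({"x1": ("const", 0), "x2": ("const", 1),
                        "x3": ("phi", ["x1", "x2"])}))
# {'x1': '0', 'x2': 'nz', 'x3': 'top'}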
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.4 SSA in context
SSA for High-Level Languages. So far, we have presented SSA as a useful feature
for compiler-based analysis of low-level programs. It is interesting to note that
some high-level languages enforce the SSA property. The SISAL language is
defined in such a way that programs automatically have referential transparency,
since multiple assignments are not permitted to variables. Other languages allow
the SSA property to be applied on a per-variable basis, using special annotations
like final in Java, or const and readonly in C#.
The main motivation for allowing the programmer to enforce SSA in an ex-
plicit manner in high-level programs is that immutability simplifies concurrent
programming. Read-only data can be shared freely between multiple threads,
without any data dependence problems. This is becoming an increasingly im-
portant issue, with the shift to multi- and many-core processors.
High-level functional languages claim referential transparency as one of the
cornerstones of their programming paradigm. Thus functional programming
supports the SSA property implicitly. Chapter 6 explains the dualities between
SSA and functional programming.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.5 About the rest of this book
In this chapter, we have introduced the notion of SSA. The rest of this book
presents various aspects of SSA, from the pragmatic perspective of compiler
engineers and code analysts. The ultimate goals of this book are:
1. To demonstrate clearly the benefits of SSA-based analysis.
2. To dispel the fallacies that prevent people from using SSA.
This section gives pointers to later parts of the book that deal with specific topics.
CHAPTER 2
Properties and Flavors
P. Brisk
F. Rastello
Recall from the previous chapter that a procedure is in SSA form if every variable
is defined only once, and every use of a variable refers to exactly one definition.
Many variations, or flavors, of SSA form that satisfy these criteria can be defined,
each offering its own considerations. For example, different flavors vary in terms
of the number of φ-functions, which affects the size of the intermediate repre-
sentation; some variations are more difficult to construct, maintain, and destruct
compared to others. This chapter explores these SSA flavors and provides insight
regarding their relative merits in certain contexts.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1 Def-use and use-def chains
Under SSA form, each variable is defined once. Def-use chains are data structures
that provide, for the single definition of a variable, the set of all its uses. In turn,
a use-def chain, which under SSA consists of a single name, uniquely specifies
the definition that reaches the use. As we will illustrate further in the book (see
Chapter 8), def-use chains are useful for forward data-flow analysis as they provide
direct connections that shorten the propagation distance between nodes that
generate and use data-flow information.
Because of its single definition per variable property, SSA form simplifies
def-use and use-def chains in several ways. First, SSA form simplifies def-use
chains as it combines the information as early as possible. This is illustrated by
Figure 2.1 where the def-use chain in the non-SSA program requires as many
merges as there are uses of x , whereas the corresponding SSA form allows early
and more efficient combination.
Second, as it is easy to associate each variable with its single defining operation,
use-def chains can be represented and maintained almost for free. As
this constitutes the skeleton of the so-called SSA graph (see Chapter 14), when
considering a program under SSA form, use-def chains are implicitly considered
as a given. The explicit representation of use-def chains simplifies backward
propagation, which favors algorithms such as dead-code elimination.
For forward propagation, since def-use chains are precisely the reverse of
use-def chains, computing them is also easy; maintaining them requires min-
imal effort. However, even without def-use chains, some lightweight forward
propagation algorithms such as copy folding are possible: using a single pass
that processes operations along a topological order traversal of a forward CFG,1
most definitions are processed prior to their uses. When processing an operation,
the use-def chain provides immediate access to the prior computed value of an
argument. Conservative merging is performed when, at some loop headers, a φ-
function encounters an unprocessed argument. Such a lightweight propagation
engine proves to be fairly efficient.
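The following sketch shows the shape of such a propagation engine (the instruction encoding and helper names are assumptions for illustration): blocks are visited in a topological order of the forward CFG, a table maps each SSA name to its known representative, and copies are folded by a use-def lookup; a φ-function whose operands do not all resolve to the same representative is merged conservatively.

def copy_fold(blocks, topo_order):
    """blocks[label]: list of instructions, each a 3-tuple:
         ("copy", dst, src), ("phi", dst, [srcs]) or ("op", dst, [srcs]).
       topo_order: block labels in topological order of the forward CFG
       (back-edges removed), so most definitions are seen before their uses."""
    value = {}                                   # SSA name -> representative after folding
    rep = lambda v: value.get(v, v)              # use-def lookup; unprocessed names map to themselves

    for label in topo_order:
        for kind, dst, src in blocks[label]:
            if kind == "copy":
                value[dst] = rep(src)            # the copy is folded away
            elif kind == "phi":
                ops = {rep(s) for s in src}
                # Conservative merge: keep dst unless all operands agree.
                value[dst] = ops.pop() if len(ops) == 1 else dst
            else:
                value[dst] = dst                 # an ordinary operation defines itself
    return value

# b1 <- a1; c1 <- b1 + ... : the use of b1 in c1's definition resolves to a1.
print(copy_fold({"B": [("copy", "b1", "a1"), ("op", "c1", ["b1"])]}, ["B"]))
# {'b1': 'a1', 'c1': 'c1'}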
[Figure 2.1: def-use chains for a variable x with two definitions (x ← 1 and x ← 2) and two uses (y ← x + 1 and z ← x + 2); under SSA, a single φ-function x3 ← φ(x1, x2) combines the definitions and both uses refer to x3.]
1
A forward control-flow graph is an acyclic reduction of the CFG obtained by removing back-
edges.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Minimality
SSA construction is a two-phase process: placement of φ-functions, followed by
renaming. The goal of the first phase is to generate code that fulfills the single
reaching-definition property, as already outlined. Minimality is an additional
property relating to code that has φ-functions inserted, but prior to renaming;
Chapter 3 describes the classical SSA construction algorithm in detail, while this
section focuses primarily on describing the minimality property.
A definition D of variable v reaches a point p in the CFG if there exists a path
from D to p that does not pass through another definition of v . We say that
a code has the single reaching-definition property iff no program point can be
reached by two definitions of the same variable. Under the assumption that the
single reaching-definition property is fulfilled, the minimality property states
the minimality of the number of inserted φ-functions.
This property can be characterized using the following notion of join sets. Let
n1 and n2 be distinct basic blocks in a CFG. A basic block n3 , which may or may
not be distinct from n1 or n2, is a join node of n1 and n2 if there exist at least two
non-empty paths, i.e., paths containing at least one CFG edge, from n1 to n3 and
from n2 to n3 , respectively, such that n3 is the only basic block that occurs on
both of the paths. In other words, the two paths converge at n3 and no other CFG
node. Given a set S of basic blocks, n3 is a join node of S if it is the join node of at
least two basic blocks in S . The set of join nodes of set S is denoted J (S ).
Intuitively, a join set corresponds to the placement of φ-functions. In other
words, if n1 and n2 are basic blocks that both contain a definition of variable v ,
then we ought to instantiate φ-functions for v at every basic block in J ({n1 , n2 }).
Generalizing this statement, if Dv is the set of basic blocks containing definitions
of v , then φ-functions should be instantiated in every basic block in J (Dv ). As
inserted φ-functions are themselves definition points, some new φ-functions
should be inserted at J (Dv ∪J (Dv )). Actually it turns out that J (S ∪J (S )) = J (S ),
so the join set of the set of definition points of a variable in the original program
characterizes exactly the minimum set of program points where φ-functions
should be inserted.
We are not aware of any optimizations that require a strict enforcement of
the minimality property. However, placing φ-functions only at the join sets can
be done easily using a simple topological traversal of the CFG as described in
Chapter 4, Section 4.4. Classical techniques place φ-functions of a variable v at
J (Dv ∪ {r }), with r the entry node of the CFG. There are good reasons for that as
we will explain further. Finally, as explained in Chapter 3, Section 3.3 for reducible
flow graphs, some copy-propagation engines can easily turn a non-minimal SSA
code into a minimal one.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Strict SSA form and dominance property

A procedure is defined to be strict if every variable is defined before it is used
along every path from the entry to the exit point; otherwise, it is non-strict. Some
languages, such as Java, impose strictness as part of the language definition;
others, such as C/C++, impose no such restrictions. The code in Figure 2.2a is
non-strict as there exists a path from the entry to the use of a that does not go
through the definition. If this path is taken through the CFG during the execution,
then a will be used without ever being assigned a value. Although this may be
permissible in some cases, it is usually indicative of a programmer error or poor
software design.
Under SSA, because there is only a single (static) definition per variable, strictness
is equivalent to the dominance property: each use of a variable is dominated
by its definition. In a CFG, basic block n1 dominates basic block n2 if every path
in the CFG from the entry point to n2 includes n1 . By convention, every basic
block in a CFG dominates itself. Basic block n1 strictly dominates n2 if n1 domi-
nates n2 and n1 ≠ n2. We use the symbols n1 dom n2 and n1 sdom n2 to denote
dominance and strict dominance respectively.
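For concreteness, a simple (not asymptotically optimal) way to compute the dominance relation is to iterate the standard data-flow equations over the CFG; the sketch below, whose input representation is an assumption of this example, then exposes the dom and sdom tests defined above.

def dominators(preds, entry):
    """preds: node -> list of CFG predecessors. Returns node -> set of its dominators.
    Simple iterative formulation of dom(n) = {n} plus the intersection of dom(p)
    over all predecessors p of n."""
    nodes = set(preds) | {entry}
    dom_sets = {n: set(nodes) for n in nodes}
    dom_sets[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom_sets[p] for p in preds[n])) | {n}
            else:
                new = {n}
            if new != dom_sets[n]:
                dom_sets[n], changed = new, True
    return dom_sets

def dom(d, a, b):            # a dom b
    return a in d[b]

def sdom(d, a, b):           # a sdom b
    return a != b and a in d[b]

# Diamond CFG r -> {A, B} -> C: r dominates C, but A does not strictly dominate C.
d = dominators({"r": [], "A": ["r"], "B": ["r"], "C": ["A", "B"]}, "r")
assert dom(d, "r", "C") and not sdom(d, "A", "C")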
Adding an (undefined) pseudo-definition of each variable to the entry point
(root) of the procedure ensures strictness. The single reaching-definition property
discussed previously mandates that each program point be reachable by exactly
one definition (or pseudo-definition) of each variable. If a program point U is a
use of variable v , then the reaching definition D of v will dominate U ; otherwise,
there would be a path from the CFG entry node to U that does not include D .
If such a path existed, then the program would not be in strict SSA form, and a
φ-function would need to be inserted somewhere in J ({r, D }) as in our example of
Figure 2.2b where ⊥ represents the undefined pseudo-definition. The so-called
minimal SSA form? is a variant of SSA form that satisfies both the minimality
and dominance properties. As shall be seen in Chapter 3, minimal SSA form is
obtained by placing the φ-functions of variable v at J (Dv ∪ {r }) using the formalism
of dominance frontier. If the original procedure is non-strict, conversion to
minimal SSA will create a strict SSA-based representation. Here, strictness refers
solely to the SSA representation; if the input program is non-strict, conversion to
and from strict SSA form cannot address errors due to uninitialized variables. To
finish with, the use of an implicit pseudo-definition in the CFG entry node to
enforce strictness does not change the semantics of the program by any means.
SSA with dominance property is useful for many reasons that directly origi-
nate from the structural properties of the variable live-ranges. The immediate
dominator or “idom” of a node N is the unique node that strictly dominates
N but does not strictly dominate any other node that strictly dominates N . All
nodes but the entry node have immediate dominators. A dominator tree is a
tree where the children of each node are those nodes it immediately dominates.
Because the immediate dominator is unique, it is a tree with the entry node as
root. For each variable, its live-range, i.e., the set of program points where it is live,
is a sub-tree of the dominator tree.

[Figure 2.2: (a) a non-strict program in which a and b are each defined on only one side of a branch yet used after the merge; (b) its strict SSA counterpart, in which φ-functions with the undefined pseudo-definition ⊥ as one operand are inserted at the merge point.]

In particular, a greedy coloring scheme that traverses
the dominator tree, i.e., a “tree scan,” can color all of the variables in the program,
without requiring the explicit construction of an interference graph. The tree
scan algorithm can be used for register allocation, which is discussed in greater
detail in Chapter 22.
As we have already mentioned, most φ-function placement algorithms are
based on the notion of dominance frontier (see Chapters 3 and 4) and conse-
quently do provide the dominance property. As we will see in Chapter 3, this
property can be broken by copy propagation: in our example of Figure 2.2b, the
argument a 1 of the copy represented by a 2 = φ(a 1 , ⊥) can be propagated and
every occurrence of a 2 can be safely replaced by a 1 ; the now identity φ-function
can then be removed obtaining the initial code, that is still SSA but not strict
anymore. Making a non-strict SSA code strict is about the same complexity as
SSA construction (actually we need a pruned version as described below). Still
the “strictification” usually concerns only a few variables and a restricted region
of the CFG: the incremental update described in Chapter 5 will do the work with
less effort.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.4 Pruned SSA form
One drawback of minimal SSA form is that it may place φ-functions for a variable
at a point in the control-flow graph where the variable was not actually live prior
to SSA. Many program analyses and optimizations, including register allocation,
are only concerned with the region of a program where a given variable is live.
The primary advantage of eliminating those dead φ-functions is that the resulting
form has far fewer φ-functions than minimal SSA form in most cases. It is possible to construct
such a form while still maintaining the minimality and dominance properties
otherwise. The new constraint is that every use point for a given variable must
be reached by exactly one definition, as opposed to all program points. Pruned
SSA form satisfies these properties.
Under minimal SSA, φ-functions for variable v are placed at the entry points
of basic blocks belonging to the set J (Dv ∪ {r }). Under pruned SSA, we suppress the
instantiation of a φ-function at the beginning of a basic block if v is not live at
the entry point of that block. One possible way to do this is to perform liveness
analysis prior to SSA construction, and then use the liveness information to
suppress the placement of φ-functions as described above; another approach
is to construct minimal SSA and then remove the dead φ-functions using dead
code elimination; details can be found in Chapter 3.
Figure 2.3a shows an example of minimal non-pruned SSA. The corresponding
pruned SSA form would remove the dead φ-function that defines Y3 since Y1 and
Y2 are only used in their respective definition blocks.
[Figure 2.3: a program with two structurally identical conditionals on P1; the first defines Y3 ← φ(Y1, Y2), where Y1 and Y2 are used only in their own branches, and the second defines Z3 ← φ(Z1, Z2), which is used afterwards.]
Fig. 2.3 Non pruned SSA form allows value numbering to determine that Y3 and Z 3 have the same
value.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.5 Conventional and transformed SSA form
In many non-SSA and graph coloring based register allocation schemes, regis-
ter assignment is done at the granularity of webs. In this context, a web is the
maximal union of def-use chains that have either a use or a def in common.
As an example, the code of Figure 2.4a leads to two separate webs for variable
a . The conversion to minimal SSA form replaces each web of a variable v in the
pre-SSA program with some variable names vi. In pruned SSA, these variable
names partition the live-range of the web: at every point in the procedure where
the web is live, exactly one variable vi is also live; and none of the vi are live at
any point where the web is not.
Based on this observation, we can partition the variables in a program that
has been converted to SSA form into φ-equivalence classes that we will refer to as
φ-webs. We say that x and y are φ-related to one another if they are referenced
by the same φ-function, i.e., if x and y are either parameters of or defined by the
φ-function. The transitive closure of this relation defines an equivalence relation
that partitions the variables defined locally in the procedure into equivalence
classes, the φ-webs. Intuitively, the φ-equivalence class of a resource represents
a set of resources “connected” via φ-functions. For any freshly constructed SSA
code, the φ-webs exactly correspond to the register webs of the original non-SSA
code.
Conventional SSA form (C-SSA) is defined as SSA form for which each φ-web
is interference free. Many program optimizations such as copy propagation
may transform a procedure from conventional to a non-conventional (T-SSA for
Transformed-SSA) form, in which some variables belonging to the same φ-web
interfere with one another. Figure 2.4c shows the corresponding transformed
SSA form of our previous example: here variable a 1 interferes with variables a 2 ,
a 3 , and a 4 , since it is defined at the top and used last.
Bringing back the conventional property of a T-SSA code is as “difficult” as
translating out of SSA (also known as SSA “destruction,” see Chapter 3). Indeed,
the destruction of conventional SSA form is straightforward: each φ-web can be
replaced with a single variable; all definitions and uses are renamed to use the new
variable, and all φ-functions involving this equivalence class are removed. SSA
destruction starting from non-conventional SSA form can be performed through
a conversion to conventional SSA form as an intermediate step. This conversion
is achieved by inserting copy operations that dissociate interfering variables from
the connecting φ-functions. As those copy instructions will have to be inserted
at some points to get rid of φ-functions, for machine level transformations such
as register allocation or scheduling, T-SSA provides an inaccurate view of the
resource usage. Another motivation for sticking to C-SSA is that the names used
in the original program might help capture some properties otherwise difficult to
discover. Lexical partial redundancy elimination (PRE) as described in Chapter 11
illustrates this point.
Apart from those specific examples, most current compilers choose not to
maintain the conventional property. Still, we should outline that, as later de-
scribed in Chapter 21, checking if a given φ-web is (and if necessary turning
it back to) interference free can be done in linear time (instead of the naive
quadratic time algorithm) in the size of the φ-web.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.6 A stronger definition of interference
Throughout this chapter, two variables have been said to interfere if their live-
ranges intersect. Intuitively, two variables with overlapping lifetimes will require
two distinct storage locations; otherwise, a write to one variable will overwrite
the value of the other. In particular, this definition has applied to the discussion
of φ-webs and conventional SSA form above.

[Figure 2.4: (a) a non-SSA program in which variable a forms two separate webs; (b) its conventional SSA form; (c) the transformed (T-SSA) form obtained after copy propagation, in which a1 interferes with a2, a3, and a4.]

For strict SSA form, checking whether two live-ranges intersect reduces to a check on definition points;
the two notions are strictly equivalent: two live-ranges intersect iff one contains
the definition of the other.
Secondly, consider two variables u and v , whose live-ranges overlap. If we
can prove that u and v will always hold the same value at every place where
both are live, then they do not actually interfere with one another. Since they
always have the same value, a single storage location can be allocated for both
variables, because there is only one unique value between them. Of course, this
new criterion is in general undecidable. Still, a technique such as global value
numbering, which is straightforward to implement under SSA (see Section 11.5.1),
can do a fairly good job, especially in the presence of code with many variable-
to-variable copies, such as one obtained after a naive SSA destruction pass (see
Chapter 3). In that case (see Chapter 21), the difference between the refined
notion of interference and the non-value-based one is significant.
This refined notion of interference has significant implications if applied to
SSA form. In particular, the interference graph of a procedure is no longer chordal,
as any edge between two variables whose lifetimes overlap could be eliminated
by this property.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.7 Further readings
The advantages of def-use and use-def chains provided for almost free under
SSA are well illustrated in Chapters 8 and 13.
The notion of minimal SSA and a corresponding efficient algorithm to com-
pute it were introduced by Cytron et al. [94]. For this purpose they extensively
develop the notion of dominance frontier of a node n , DF (n ) = J (n , r ). The
fact that J + (S ) = J (S ) has been actually discovered later, with a simple proof
by Wolfe [316]. More details about the theory on (iterated) dominance frontier
can be found in Chapters 3 and 4. The post-dominance frontier, which is its
symmetric notion, also known as the control dependence graph, finds many
applications. Further discussions on control dependence graph can be found in
Chapter 14.
Most SSA papers implicitly consider the SSA form to fulfill the dominance
property. The first technique that really exploits the structural properties of the
strictness is the fast SSA destruction algorithm developed by Budimlić et al. [58]
and revisited in Chapter 21.
The notion of pruned SSA has been introduced by Choi, Cytron, and Fer-
rante [71]. The example of Figure 2.3 to illustrate the difference between pruned
and non pruned SSA has been borrowed from Cytron et al. [94]. The notions of
conventional and transformed SSA were introduced by Sreedhar et al. in their
seminal paper [277] for destructing SSA form. The description of the existing
techniques to turn a general SSA into either a minimal, a pruned, a conventional,
or a strict SSA is provided in Chapter 3.
2.7 Further readings 25
The ultimate notion of interference was first discussed by Chaitin in his semi-
nal paper [65] that presents the graph coloring approach for register allocation.
His interference test is similar to the refined test presented in this chapter. In the
context of SSA destruction, Chapter 21 addresses the issue of taking advantage
of the dominance property with this refined notion of interference.
CHAPTER 3
Standard Construction and Destruction Algorithms
J. Singer
F. Rastello
[Figure 3.1: running example control-flow graph. The entry node r leads to block A; block B contains y ← 0; x ← 0, block C contains tmp ← x; x ← y; y ← tmp, block D contains x ← f(x, y), and block E contains ret x. Blocks A, D, and E each have several predecessors.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1 Construction
The original construction algorithm for SSA form consists of two distinct phases.
1. φ-function insertion performs live-range splitting to ensure that any use
of a given variable v is reached by exactly one definition of v. The resulting
live-ranges exhibit the property of having a single definition, which occurs
at the beginning of each live-range.
2. Variable renaming assigns a unique variable name to each live-range. This
second phase rewrites variable names in program statements such that the
program text contains only one definition of each variable, and every use
refers to its corresponding unique reaching definition.
As already outlined in Chapter 2, there are different flavors of SSA with distinct
properties. In this chapter, we focus on the minimal SSA form.
For a given set of nodes S in a CFG, the join set J (S) is the set of join nodes of
S, i.e., nodes in the CFG that can be reached by two (or more) distinct elements
of S using disjoint paths. Join sets were introduced in Chapter 2, Section 2.2.
Let us consider some join set examples from the program in Figure 3.1.
1. J ({B , C }) = {D }, since it is possible to get from B to D and from C to D
along different, non-overlapping, paths.
2. Again, J ({r, A, B , C , D , E }) = {A, D , E } (where r is the entry), since the nodes
A, D , and E are the only nodes with multiple predecessors in the program.
The dominance frontier of a node n, DF(n), is the border of the CFG region
that is dominated by n . More formally,
• node x strictly dominates node y if x dominates y and x ≠ y;
• the set of nodes DF(n ) contains all nodes x such that n dominates a prede-
cessor of x but n does not strictly dominate x .
For instance, in our Figure 3.1, the dominance frontier of the y defined in
block B is the first operation of D, while the DF of the y defined in block C would
be the first operations of D and E.
Note that DF is defined over individual nodes, but for simplicity of presentation,
we overload it to operate over sets of nodes too, i.e., DF(S) = ⋃s∈S DF(s).
The iterated dominance frontier DF+(S) is obtained by iterating the computation
of DF until reaching a fixed point, i.e., it is the limit DFi→∞(S) of the sequence:

DF1(S) = DF(S)
DFi+1(S) = DF(S ∪ DFi(S))
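As a sketch (the map-of-sets representation is an assumption of this example), the fixed point can be computed with a simple worklist: every node added to the result is itself a new definition point whose dominance frontier must be added in turn.

def iterated_df(df, S):
    """Iterated dominance frontier DF+(S).
    df: node -> set of nodes in its dominance frontier;  S: set of definition nodes."""
    result, work = set(), set(S)
    while work:
        n = work.pop()
        for d in df.get(n, set()):
            if d not in result:
                result.add(d)        # a phi-function will be placed here ...
                work.add(d)          # ... and it is a new definition point
    return result

# With DF-edges (C, E) and (E, G), a definition in C leads to
# phi-functions in E and then in G (see Figure 3.3).
print(iterated_df({"C": {"E"}, "E": {"G"}}, {"C"}))   # {'E', 'G'}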
Construction of minimal SSA requires for each variable v the insertion of
φ-functions at J (Defs(v )), where Defs(v ) is the set of nodes that contain defi-
nitions of v . The original construction algorithm for SSA form uses the iterated
dominance frontier DF+ (Defs(v )). This is an over-approximation of the join set, since
DF+ (S ) = J (S ∪ {r }), i.e., the original algorithm assumes an implicit definition of
every variable at the entry node r .
For the example of Figure 3.1, the algorithm will thus insert φ-functions for x at the beginning of nodes A, D, and E. Figure 3.2 shows
the example CFG program with φ-functions for x inserted.
[Figure 3.2 contents: the CFG of Figure 3.1 with x ← φ(x, x) inserted at the beginning of blocks A, D, and E; block D additionally keeps x ← f(x, y), and block E still ends with ret x.]
Fig. 3.2 Example control-flow graph, including inserted φ-functions for variable x .
Algorithm 3.1: Standard algorithm for inserting φ-functions

1  for v : variable names in original program do
2      F ← {}                          ▷ set of basic blocks where a φ is added
3      W ← {}                          ▷ set of basic blocks that contain definitions of v
4      for d ∈ Defs(v) do
5          let B be the basic block containing d
6          W ← W ∪ {B}
7      while W ≠ {} do
8          remove a basic block X from W
9          for Y : basic block ∈ DF(X) do
10             if Y ∉ F then
11                 add v ← φ(...) at entry of Y
12                 F ← F ∪ {Y}
13                 if Y ∉ Defs(v) then
14                     W ← W ∪ {Y}
Consider the execution of this algorithm for variable x in our running example, tracking the contents of F and
W at the start of each while loop iteration. At the beginning, the CFG looks like
Figure 3.1. At the end, when all the φ-functions for x have been placed, the
CFG looks like Figure 3.2.
Provided the dominator tree is given, the computation of the dominance fron-
tier is quite straightforward. As illustrated by Figure 3.3, this can be understood
using the DJ-graph notation. The skeleton of the DJ-graph is the dominator tree
of the CFG, whose edges are the D-edges (dominance edges). This is augmented with
J-edges (join edges) that correspond to all edges of the CFG whose source does
not strictly dominate its destination. A DF-edge (dominance frontier edge) is an
edge whose destination is in the dominance frontier of its source. By definition,
there is a DF-edge (a, b) between every pair of CFG nodes a, b such that a dominates
a predecessor of b, but does not strictly dominate b. In other words, for each
J -edge (a , b ), all ancestors of a (including a ) that do not strictly dominate b
have b in their dominance frontier. For example, in Figure 3.3, (F,G ) is a J-edge,
so {(F,G ), (E ,G ), (B ,G )} are DF-edges. This leads to the pseudo-code given in
Algorithm 3.2, where for every edge (a, b) we visit all ancestors of a to add b to
their dominance frontier.
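A compact sketch of that idea (the idom map is an assumed input; it is simply the parent relation of the dominator tree): for every CFG edge (a, b), climb the dominator tree from a and add b to the dominance frontier of every node visited until the immediate dominator of b is reached.

def dominance_frontiers(cfg_edges, idom):
    """cfg_edges: iterable of CFG edges (a, b).
       idom: node -> immediate dominator (dominator-tree parent), None for the entry.
       Returns node -> set of nodes in its dominance frontier."""
    df = {n: set() for n in idom}
    for a, b in cfg_edges:
        runner = a
        # Every ancestor of a (including a) that does not strictly dominate b
        # has b in its dominance frontier; idom(b) is where the walk stops.
        while runner is not None and runner != idom[b]:
            df[runner].add(b)
            runner = idom[runner]
    return df

# With a dominator-tree chain A -> B -> E -> F and idom(G) = A (one possible
# reading of Figure 3.3), the single J-edge (F, G) yields the DF-edges
# (F, G), (E, G) and (B, G) quoted in the text.
idom = {"A": None, "B": "A", "E": "B", "F": "E", "G": "A"}
print(dominance_frontiers([("F", "G")], idom))
# {'A': set(), 'B': {'G'}, 'E': {'G'}, 'F': {'G'}, 'G': set()}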
Since the iterated dominance frontier is simply the transitive closure of the
dominance frontier, we can define the DF+ -graph as the transitive closure of the
DF-graph. In our example, as {(C , E ), (E ,G )} are DF-edges, (C ,G ) is a DF+ -edge.
Hence, a definition of x in C will lead to inserting φ-functions in E and G . We
can compute the iterated dominance frontier for each variable independently,
as outlined in this chapter, or “cache” it to avoid repeated computation of the
iterated dominance frontier of the same node. This leads to more sophisticated
algorithms detailed in Chapter 4.
[Figure 3.3: a CFG with nodes A to G, its DJ-graph (the dominator tree augmented with J-edges), and the corresponding DF-edges; a definition of x in node C leads, through the DF+-edges, to φ-functions in E and G.]
Once φ-functions have been inserted using this algorithm, the program usu-
ally still contains several definitions per variable; however, now there is a single
definition statement in the CFG that reaches each use. For each variable use in
a φ-function, it is conventional to treat it as if the use actually occurs on
the corresponding incoming edge or at the end of the corresponding predecessor
node. If we follow this convention, then def-use chains are aligned with the
CFG dominator tree. In other words, the single definition that reaches each use
dominates that use.
The variable renaming algorithm translates our running example from Fig-
ure 3.1 into the SSA form of Figure 3.4a. The table in Figure 3.4b gives a walk-
through example of Algorithm 3.3, only considering variable x. The labels li
mark instructions in the program that mention x, shown in Figure 3.4a. The
table records how x.reachingDef evolves: “xold then xnew” indicates that a definition
makes xnew the new reaching definition, while “xold updated into xnew” indicates that the
recorded reaching definition is corrected because it does not dominate the current program point.
[Figure 3.4(a): the running example in SSA form.
r: entry
A: l1: x1 ← φ(x5, ⊥); y1 ← φ(y4, ⊥)
B: y2 ← 0; l2: x2 ← 0
C: l3: tmp ← x1; l4: x3 ← y1; y3 ← tmp
D: l5: x4 ← φ(x2, x3); y4 ← φ(y2, y3); l6: x5 ← f(x4, y4)
E: l7: x6 ← φ(x5, x3); y5 ← φ(y4, y3); l8: ret x6]

(b) Walk-through of renaming for variable x:

BB | x mention | x.reachingDef
r  | l1 use    | ⊥
A  | def l1    | ⊥ then x1
B  | def l2    | x1 then x2
B  | l5 use    | x2
C  | l3 use    | x2 updated into x1
C  | def l4    | x1 then x3
C  | l5 use    | x3
C  | l7 use    | x3
D  | def l5    | x3 updated into x1 then x4
D  | l6 use    | x4
D  | def l6    | x4 then x5
D  | l1 use    | x5
D  | l7 use    | x5
E  | def l7    | x5 then x6
E  | l8 use    | x6
Fig. 3.4 SSA form of the example of Figure 3.1.
In the stack-based variant of the renaming algorithm, a fresh name is pushed onto the per-variable
stack at each definition (we update the reachingDef field at this point). The top stack value is peeked when a
variable use is encountered (we read from the reachingDef field at this point).
Multiple stack values may be popped when moving to a different node in the
dominator tree (we always check whether we need to update the reachingDef field
before we read from it). While the slot-based algorithm requires more memory,
it can take advantage of an existing working field for a variable, and be more
efficient in practice.
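A condensed sketch of the renaming phase in its stack-based form (the instruction encoding below is an assumption of this example, not Algorithm 3.3's exact pseudo-code): walk the dominator tree depth-first, rewrite uses from the top of each variable's stack, push a fresh name at each definition, fill in the φ operands of CFG successors, and pop the pushed names when leaving the subtree.

def rename(blocks, succs, children, entry):
    """blocks[b]: list of instruction dicts
         {'phi': bool, 'def': original var or None, 'uses': list of vars}.
       For a phi, 'uses' is a dict mapping predecessor label -> original var.
       succs[b]: CFG successors;  children[b]: dominator-tree children.
       Rewrites the instructions in place with SSA names."""
    counters, stacks = {}, {}

    def top(v):
        s = stacks.get(v)
        return s[-1] if s else "\u22a5"                 # ⊥: no reaching definition

    def walk(b):
        pushed = []
        for ins in blocks[b]:
            if not ins['phi']:                          # ordinary uses read the reaching def
                ins['uses'] = [top(v) for v in ins['uses']]
            if ins['def'] is not None:                  # every def gets a fresh name
                v = ins['def']
                counters[v] = counters.get(v, 0) + 1
                ins['def'] = f"{v}{counters[v]}"
                stacks.setdefault(v, []).append(ins['def'])
                pushed.append(v)
        for s in succs[b]:                              # phi operands along the edge b -> s
            for ins in blocks[s]:
                if ins['phi']:
                    ins['uses'][b] = top(ins['uses'][b])
        for c in children[b]:                           # recurse along the dominator tree
            walk(c)
        for v in pushed:                                # restore stacks on the way back up
            stacks[v].pop()

    walk(entry)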
3.1.4 Summary
Now let us review the flavour of SSA form that this simple construction algo-
rithm produces. We refer back to several SSA properties that were introduced in
Chapter 2.
• It is minimal (see Section 2.2). After the φ-function insertion phase, but
before variable renaming, the CFG contains the minimal number of inserted
φ-functions to achieve the property that exactly one definition of each vari-
able v reaches every point in the graph.
• It is not pruned (see Section 2.4). Some of the inserted φ-functions may be
dead, i.e., there is not always an explicit use of the variable subsequent to
the φ-function (e.g., y5 in Figure 3.4a).
• It is conventional (see Section 2.5). The transformation that renames all φ-
related variables into a unique representative name and then removes all
φ-functions is a correct SSA-destruction algorithm.
• Finally, it has the dominance property (see Section 2.3). Each variable use
is dominated by its unique definition. This is due to the use of iterated
dominance frontiers during the φ-placement phase, rather than join sets.
Whenever the iterated dominance frontier of the set of definition points of
a variable differs from its join set, then at least one program point can be
reached both by r (the entry of the CFG) and one of the definition points. In
other words, as in Figure 3.1, one of the uses of the φ-function inserted in
block A for x does not have any actual reaching definition that dominates
it. This corresponds to the ⊥ value used to initialize each reachingDef slot
in Algorithm 3.3. Actual implementation code can use a NULL value, cre-
ate a fake undefined variable at the entry of the CFG, or create undefined
pseudo-operations on-the-fly just before the particular use.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Destruction
SSA form is a sparse representation of program information, which enables
simple, efficient code analysis and optimization. Once we have completed SSA-based
optimization passes, and certainly before code generation, it is necessary
to eliminate φ-functions since these are not executable machine instructions.
This elimination phase is known as SSA destruction.
When freshly constructed, an untransformed SSA code is conventional and
its destruction is straightforward: one simply has to rename all φ-related variables
(source and destination operands of the same φ-function) into a unique
representative variable. Then, each φ-function should have syntactically identical
names for all its operands, and thus can be removed to coalesce the related
live-ranges.
We refer to a set of φ-related variables as a φ-web. We recall from Chapter 2
that conventional SSA is defined as a flavor under which each φ-web is free
from interferences. Hence, if all variables of a φ-web have non-overlapping
live-ranges then the SSA form is conventional. The discovery of φ-webs can be
performed efficiently using the classical union-find algorithm with a disjoint-set
data structure, which keeps track of a set of elements partitioned into a number of
disjoint (non-overlapping) subsets. The φ-webs discovery algorithm is presented
in Algorithm 3.4.
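A sketch of that discovery pass with a minimal union-find, assuming φ-functions are available as (destination, operands) pairs:

def phi_webs(phi_functions, variables):
    """Partition variables into phi-webs: the destination and all operands of
    each phi-function end up in the same equivalence class."""
    parent = {v: v for v in variables}

    def find(v):                                  # find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    for dest, operands in phi_functions:
        for op in operands:
            union(dest, op)

    webs = {}
    for v in variables:
        webs.setdefault(find(v), set()).add(v)
    return list(webs.values())

# y3 = phi(y1, y2) and x2 = phi(x1, x3) give two webs; z1 stays alone.
print(phi_webs([("y3", ["y1", "y2"]), ("x2", ["x1", "x3"])],
               ["x1", "x2", "x3", "y1", "y2", "y3", "z1"]))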
While freshly constructed SSA code is conventional, this may not be the case
after performing some optimizations such as copy propagation. Going back to
conventional SSA form requires the insertion of copies. The simplest (although
not the most efficient) way to destroy non-conventional SSA form is to split all
critical edges, and then replace φ-functions by copies at the end of predecessor
basic blocks. A critical edge is an edge from a node with several successors to a
node with several predecessors. The process of splitting an edge, say (b1, b2), in-
volves replacing edge (b1 , b2 ) by (i) an edge from b1 to a freshly created basic block
and by (ii) another edge from this fresh basic block to b2 . As φ-functions have
a parallel semantic, i.e., have to be executed simultaneously not sequentially,
the same holds for the corresponding copies inserted at the end of predecessor
basic blocks. To this end, a pseudo instruction called a parallel copy is created to
represent a set of copies that have to be executed in parallel. The replacement of
parallel copies by sequences of simple copies is handled later on. Algorithm 3.5
presents the corresponding pseudo-code that makes non-conventional SSA con-
ventional. As already mentioned, SSA destruction of such form is straightforward.
However, Algorithm 3.5 can be slightly modified to directly destruct SSA by deleting
line 13, replacing a′i by a0 in the following lines, and adding “remove the
φ-function” after them.
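The sketch below captures the simple destruction scheme described above (the block and edge representation is an assumption of this example): critical edges are split first, then every φ-function is lowered into copies placed at the end of its predecessor blocks, and copies that land in the same block form one parallel copy.

def destruct_ssa(blocks, preds, succs):
    """blocks[b]: {'phis': [(dest, {pred_label: source_var})], 'code': [...]}.
       preds / succs: CFG predecessor and successor maps (updated in place)."""
    # Phase 1: split critical edges (several successors -> several predecessors).
    for b in list(blocks):
        for p in list(preds[b]):
            if len(succs[p]) > 1 and len(preds[b]) > 1:
                fresh = f"{p}_to_{b}"                       # new block on the edge (p, b)
                blocks[fresh] = {'phis': [], 'code': []}
                succs[p] = [x if x != b else fresh for x in succs[p]]
                preds[b] = [x if x != p else fresh for x in preds[b]]
                succs[fresh], preds[fresh] = [b], [p]
                for _dest, args in blocks[b]['phis']:       # re-route the phi operands
                    args[fresh] = args.pop(p)
    # Phase 2: lower phis into copies at the end of their predecessors.
    pcopy = {b: [] for b in blocks}                          # b -> parallel copy
    for b in blocks:
        for dest, args in blocks[b]['phis']:
            for p, src in args.items():
                pcopy[p].append((dest, src))                 # executed simultaneously
        blocks[b]['phis'] = []
    return pcopy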
We stress that the above destruction technique has several drawbacks: first
because of specific architectural constraints, region boundaries, or exception
handling code, the compiler might not permit the splitting of a given edge; sec-
ond, the resulting code contains many temporary-to-temporary copy operations.
In theory, reducing the frequency of these copies is the role of the coalescing dur-
ing the register allocation phase. A few memory- and time-consuming coalescing
heuristics mentioned in Chapter 22 can handle the removal of these copies ef-
fectively. Coalescing can also, with less effort, be performed prior to the register
allocation phase. As opposed to a (so-called conservative) coalescing during
register allocation, this aggressive coalescing does not have to preserve the colorability of the
interference graph. Further, the process of copy insertion itself might take a
substantial amount of time and might not be suitable for dynamic compilation.
The goal of Chapter 21 is to cope both with non-splittable edges and difficulties
related to SSA destruction at machine code level, but also aggressive coalescing
in the context of resource constrained compilation.
Once φ-functions have been replaced by parallel copies, we need to sequen-
tialize the parallel copies, i.e., replace them by a sequence of simple copies. This
phase can be performed immediately after SSA destruction or later on, perhaps
even after register allocation (see Chapter 22). It might be useful to postpone
the copy sequentialization since it introduces arbitrary interference between
variables. As an example, a1 ← a2 ∥ b1 ← b2 (where inst1 ∥ inst2 represents two
instructions inst1 and inst2 to be executed simultaneously) can be sequentialized
into a1 ← a2; b1 ← b2, which would make b2 interfere with a1, while the other way
round, b1 ← b2; a1 ← a2, would make a2 interfere with b1 instead.
If we still decide to replace parallel copies by a sequence of simple copies
immediately after SSA destruction, this can be done as shown in Algorithm 3.6. To
see that this algorithm converges, one can visualize the parallel copy as a graph
where nodes represent resources and edges represent transfer of values: the
number of steps is exactly the number of cycles plus the number of non-self
edges of this graph. The correctness comes from the invariance of the behavior
of seq; pcopy. An optimized implementation of this algorithm will be presented
in Chapter 21.
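The sketch below gives one simple sequentialization strategy in the spirit of Algorithm 3.6 (the pair-list encoding and the temporary's name are assumptions): emit any copy whose destination is not needed as a source by the remaining copies, and when only cycles remain, break one with a temporary.

def sequentialise(parallel_copy, tmp="tmp"):
    """parallel_copy: list of (dest, src) pairs with pairwise-distinct dests.
    Returns an equivalent sequence of simple copies; `tmp` is assumed to be a
    fresh name used only to break cycles such as a1 <- a2 || a2 <- a1."""
    pending = [(d, s) for d, s in parallel_copy if d != s]   # drop self-copies
    seq = []
    while pending:
        ready = [(d, s) for d, s in pending
                 if all(d != s2 for _, s2 in pending)]       # d not read by any pending copy
        if ready:
            d, s = ready[0]
            seq.append((d, s))
            pending.remove((d, s))
        else:                                                # only cycles remain: break one
            d, s = pending[0]
            seq.append((tmp, d))                             # save the value about to be lost
            seq.append((d, s))
            pending = [(d2, tmp if s2 == d else s2)
                       for d2, s2 in pending if (d2, s2) != (d, s)]
    return seq

print(sequentialise([("a1", "a2"), ("a2", "a1")]))
# [('tmp', 'a1'), ('a1', 'a2'), ('a2', 'tmp')]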
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3 SSA property transformations
As discussed in Chapter 2, SSA comes in different flavors. This section describes
algorithms that transform arbitrary SSA code into the desired flavor. Making SSA
conventional corresponds exactly to the first phase of SSA destruction (described
in Section 3.2) that splits critical edges and introduces parallel copies (sequen-
tialized later in bulk or on-demand) around φ-functions. As already discussed,
this straightforward algorithm has several drawbacks addressed in Chapter 21.
Making SSA strict, i.e., fulfilling the dominance property, is as “hard” as con-
structing SSA. Of course, a pre-pass through the graph can detect the offending
variables that have definitions that do not dominate their uses. Then there are
several possible single-variable φ-function insertion algorithms (see Chapter 4)
that can be used to patch up the SSA, by restricting attention to the set of non-
conforming variables. The renaming phase can also be applied with the same
filtering process. As the number of variables requiring repair might be a small
proportion of all variables, a costly traversal of the whole program can be avoided
by building the def-use chains (for non-conforming variables) during the detection
pre-pass. Renaming can then be done on a per-variable basis or, better (if
pruned SSA is preferred), the reconstruction algorithm presented in Chapter 5
can be used for both φ-function placement and renaming.
The construction algorithm described above does not build pruned SSA form.
If available, liveness information can be used to filter out the insertion of φ-
functions wherever the variable is not live: the resulting SSA form is pruned. Al-
ternatively, pruning SSA form is equivalent to a dead-code elimination pass after
SSA construction. As use-def chains are implicitly provided by SSA form, dead-
φ-function elimination simply relies on marking actual uses (non-φ-function
ones) as useful and propagating usefulness backward through φ-functions. Al-
gorithm 3.7 presents the relevant pseudo-code for this operation. Here, stack is
used to store useful and unprocessed variables defined by φ-functions.
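A sketch of this marking pass (the program encoding is an assumption of this example): start from the φ definitions that have a real, non-φ use, and propagate usefulness backward through φ operands; every unmarked φ-function is dead and can be deleted.

def useful_phis(phis, real_uses):
    """phis: dict  phi destination -> list of its operands.
       real_uses: set of variables used by ordinary (non-phi) instructions.
       Returns the set of phi destinations that must be kept."""
    useful = {d for d in phis if d in real_uses}     # phis with an actual use
    stack = list(useful)                             # useful but unprocessed phi defs
    while stack:
        d = stack.pop()
        for op in phis[d]:                           # operands of a useful phi are useful
            if op in phis and op not in useful:
                useful.add(op)
                stack.append(op)
    return useful

# In Figure 3.4a, y5 = phi(y4, y3) has no real use and is dead,
# whereas y4 = phi(y2, y3) feeds the computation of x5 and is kept.
print(useful_phis({"y4": ["y2", "y3"], "y5": ["y4", "y3"]}, {"y4", "x6"}))
# {'y4'}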
To construct pruned SSA form via dead code elimination, it is generally much
faster to first build semi-pruned SSA form, rather than minimal SSA form, and
then apply dead code elimination. Semi-pruned SSA form is based on the obser-
vation that many variables are local, i.e., have a small live-range that is within
a single basic block. Consequently, pruned SSA would not instantiate any φ-
functions for these variables. Such variables can be identified by a linear traversal
over each basic block of the CFG. All of these variables can be filtered out: min-
imal SSA form restricted to the remaining variables gives rise to the so-called
semi-pruned SSA form.

3.3.1 Pessimistic φ-function insertion
In the pessimistic approach, φ-functions are first inserted for every variable at every control-flow merge point; unnecessary ones are then removed by repeatedly applying the simple rewrite rules T1 and T2 of Figure 3.5 to the def-use relations between SSA variables.

[Figure 3.5: rule T1(ai) removes a self-reference, rewriting ai ← φ(. . . , aj, ai) into ai ← φ(. . . , aj); rule T2(ai) removes a φ-function all of whose remaining operands are the same name, deleting ai ← φ(aj, . . . , aj) and replacing every use of ai by aj.]
Fig. 3.5 T1 and T2 rewrite rules for SSA-graph reduction, applied to def-use relations between SSA
variables.
This approach can be implemented using a worklist, which stores the candi-
date nodes for simplification. Using the graph made up of def-use chains (see
Chapter 14), the worklist can be initialized with successors of non-φ-functions.
However, for simplicity, we may initialize it with all φ-functions. Of course, if loop
nesting forest information is available, the worklist can be avoided by traversing
the CFG in a single pass from inner to outer loops, and in a topological order
within each loop (header excluded). But since we believe the main motivation
for this approach to be its simplicity, the pseudo-code shown in Algorithm 3.8
uses a work queue.
This algorithm is guaranteed to terminate in a bounded number of steps. At every iteration of the while loop, it removes a φ-function from the work queue W. Whenever it adds new φ-functions to W, it removes a φ-function from the program. The number of φ-functions in the program is bounded, so the number of insertions into W is bounded. The queue could be replaced by a worklist, with insertions and removals performed in arbitrary order; the algorithm would be less efficient, but the end result would be the same.
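The following Python sketch illustrates one possible work-queue implementation of this simplification, in the spirit of Algorithm 3.8; it works on the SSA def-use graph (ignoring block layout), and the helpers uses_of and replace_uses, as well as the φ object model, are assumptions of the sketch.

    from collections import deque

    def simplify_phis(phis, uses_of, replace_uses):
        """Remove pessimistically inserted phi-functions with the T1/T2 rules.

        phis              : iterable of phi objects with .result and .operands
        uses_of(v)        : instructions that use variable v
        replace_uses(a,b) : rewrites every use of a by b
        """
        work = deque(phis)               # initialise with all phi-functions
        alive = set(phis)
        while work:
            phi = work.popleft()
            if phi not in alive:
                continue
            # T1: drop operands that refer to the phi's own result.
            phi.operands = [v for v in phi.operands if v is not phi.result]
            # T2: a phi whose remaining operands name a single variable is a copy.
            sources = set(phi.operands)
            if len(sources) == 1:
                src = sources.pop()
                # Phis reading the removed result may become simplifiable in turn.
                work.extend(u for u in uses_of(phi.result) if u in alive)
                replace_uses(phi.result, src)
                alive.discard(phi)
        return alive                     # the phi-functions that remain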
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4 Further readings
The early literature on SSA form [93, 94] introduces the two phases of the con-
struction algorithm we have outlined in this chapter, and discusses algorithmic
complexity on common and worst-case inputs. These initial presentations trace
the ancestry of SSA form back to early work on data-flow representations by
Shapiro and Saint [268].
Briggs et al. [54] discuss pragmatic refinements to the original algorithms for
SSA construction and destruction, with the aim of reducing execution time. They
introduce the notion of semi-pruned form, show how to improve the efficiency of
the stack-based renaming algorithm, and describe how copy propagation must
be constrained to preserve correct code during SSA destruction.
There are numerous published descriptions of alternative algorithms for SSA
construction, in particular for the φ-function insertion phase. The pessimistic
approach that first inserts φ-functions at all control-flow merge points and
then removes unnecessary ones using simple T1/T2 rewrite rules was proposed
by Aycock and Horspool [18]. Brandis and Mössenböck [49] describe a simple,
syntax-directed approach to SSA construction from well structured high-level
source code. Throughout this textbook, we consider the more general case of
SSA construction from arbitrary CFGs.
A reducible CFG is one that will collapse to a single node when it is transformed
using repeated application of T1/T2 rewrite rules. Aho et al. [2] describe the
concept of reducibility and trace its history in early compilers literature.
Sreedhar and Gao [273] pioneer linear-time complexity φ-function insertion
algorithms based on DJ-graphs. These approaches have been refined by other
researchers. Chapter 4 explores these alternative construction algorithms in
depth.
Blech et al. [34] formalize the semantics of SSA, in order to verify the correct-
ness of SSA destruction algorithms. Boissinot et al. [41] review the history of
SSA destruction approaches, and highlight misunderstandings that led to in-
correct destruction algorithms. Chapter 21 presents more details on alternative
approaches to SSA destruction.
There are instructive dualisms between concepts in SSA form and functional
programs, including construction, dominance and copy propagation. Chapter 6
explores these issues in more detail.
CHAPTER 4
Advanced Construction Algorithms for SSA — (D. Das, U. Ramakrishna, V. Sreedhar)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1 Basic algorithm
We start by recalling the basic algorithm already described in Chapter 3. The
original algorithm for φ-functions is based on computing the dominance frontier
(DF) set for the given control-flow graph. The dominance frontier DF(x) of a node x is the set of all nodes z such that x dominates a predecessor of z, without strictly dominating z. For example, DF(8) = {6, 8} in Figure 4.1. The basic algorithm for the insertion of φ-functions consists in computing the iterated dominance frontier (DF+) for the set of all definition points (or nodes where variables are defined). Let Defs(v) be the set of nodes where variable v is defined. Given that the dominance frontier for a set of nodes is just the union of the DF sets of its members, we can compute DF+(Defs(v)) as the limit of the following recurrence equation (where S is initially Defs(v)):
DF+_1(S) = DF(S)
DF+_{i+1}(S) = DF(S ∪ DF+_i(S))
A φ-function is then inserted at each join node in the DF+(Defs(v)) set.
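As an illustration, the recurrence can be computed as a simple fixed point once the per-node DF sets are known; the following Python sketch (with illustrative names df and defs) is one way to do so.

    def iterated_df(df, defs):
        """Fixed-point computation of DF+(defs), following the recurrence above.

        df   : dict mapping each CFG node to its dominance-frontier set DF(node)
        defs : the set Defs(v) of nodes that define variable v
        """
        def df_of_set(nodes):
            result = set()
            for n in nodes:
                result |= df[n]
            return result

        df_plus = df_of_set(defs)
        while True:
            nxt = df_of_set(defs | df_plus)
            if nxt == df_plus:
                return df_plus
            df_plus = nxt

φ-functions for v would then be placed at every join node of iterated_df(df, Defs(v)).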
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Computation of DF+(S) using DJ-graphs
We now present a linear time algorithm for computing the DF+ (S ) set of a given
set of nodes S without the need for explicitly pre-computing the full DF set.
The algorithm uses the DJ-graph (see Chapter 3, Section 3.1.2 and Figure 3.3b)
representation of a CFG. The DJ-graph for our example CFG is also shown in
Figure 4.1b. Rather than explicitly computing the DF set, this algorithm uses a
DJ-graph to compute the DF+ (Defs(v )) on the fly.
Now let us try to understand how to compute the DF set for a single node using the DJ-graph. Consider the DJ-graph shown in Figure 4.1b, where the depth of a node is its distance from the root in the dominator tree. The first key observation is that a DF-edge never goes down to a greater depth. To give a rough intuition of why this property holds, suppose there were a DF-edge from 8 to 7; then there would be a path from 3 to 7 through 8 that does not go through 6, which contradicts the dominance of 7 by 6.
As a consequence, to compute DF(8) we can simply walk down the dominator (D) tree from node 8 and, from each visited node y, identify all join (J) edges y → z such that z.depth ≤ 8.depth. For our example the J-edges that satisfy this condition are 10 → 8 and 9 → 6. Therefore DF(8) = {6, 8}. To generalize the example, we can compute the DF of a node x using the following formula (see Figure 4.2a for an illustration):

DF(x) = {z | ∃ y ∈ dominated(x) : (y → z) is a J-edge ∧ z.depth ≤ x.depth}

where

dominated(x) = {y | x dom y}
[Fig. 4.1: the running example. (a) CFG with definitions of v in nodes 3 and 4, φ-functions for v in nodes 5 and 6, and a use of v in node 9; (b) the corresponding DJ-graph, with nodes 3 and 11 at depth 2, nodes 4, 5, 6 and 8 at depth 3, nodes 7 and 9 at depth 4, and node 10 below.]
Now we can extend the above idea to compute the DF+ for a set of nodes, and
hence the insertion of φ-functions. This algorithm does not precompute DF;
given a set of initial nodes S = Defs(v ) for which we want to compute the relevant
set of φ-functions, a key observation can be made. Let w be an ancestor node of
a node x on the dominator tree. If DF(x ) has already been computed before the
computation of DF(w ), the traversal of dominated(x ) can be avoided and DF(x )
directly used for the computation of DF(w ). This is because nodes reachable
from dominated(x ) are already in DF(x ). However, the converse may not be true,
and therefore the order of the computation of DF is crucial.
To illustrate the key observation consider the example DJ-graph in Figure 4.1b,
and let us compute DF+ ({3, 8}). It is clear from the recursive definition of DF+ that
we have to compute DF(3) and DF(8) as a first step. Suppose we start with node 3 and compute DF(3). The resulting DF set is DF(3) = {2}. Suppose we next compute the DF set for node 8; the resulting set is DF(8) = {6, 8}.
Notice here that we have already visited node 8 and its sub-tree when visiting
node 3. We can avoid such duplicate visits by ordering the computation of DF
set so that we first compute DF(8) and then during the computation of DF(3) we
avoid visiting the sub-tree of node 8, and use the result DF(8) that was previously
computed.
Thus, to compute DF(w ), where w is an ancestor of x in the DJ-graph, we do
not need to compute it from scratch as we can re-use the information computed
as part of DF(x ) as shown. For this, we need to compute the DF of deeper (based
on depth) nodes (here, x ), before computing the DF of a shallower node (here,
w ). The formula is as follows, with Figure 4.2b illustrating the positions of nodes
z and z 0 .
Fig. 4.2 Illustrating examples to the DF formulas: (a) computing DF(x) from the J-edges leaving the subtree dominated by x; (b) the relative positions of w, x, z, and z′ when DF(x) is reused for DF(w).
In this section we present the algorithm for computing DF+. For a node x, let x.depth be its depth from the root node r, with r.depth = 0. To ensure that
the nodes are processed according to the above observation we use a simple
array of sets OrderedBucket, and two functions defined over this array of sets: (1)
InsertNode(n ) that inserts the node n in the set OrderedBucket[n .depth], and (2)
GetDeepestNode() that returns a node from the OrderedBucket with the deepest
depth number.
In Algorithm 4.1, at first we insert all nodes belonging to S in the Ordered-
Bucket. Then the nodes are processed in a bottom-up fashion over the DJ-graph
from deepest node depth to least node depth by calling Visit(x ). The proce-
dure Visit(x ) essentially walks top-down in the DJ-graph avoiding already visited
nodes. During this traversal it also peeks at destination nodes of J-edges. When-
ever it notices that the depth number of the destination node of a J-edge is less
than or equal to the depth number of the current_x, the destination node is added
to the DF+ set (Line 4) if it is not present in DF+ already. Notice that at Line 5 the
destination node is also inserted in the OrderedBucket if it was never inserted
before. Finally, at Line 9 we continue to process the nodes in the sub-tree by
visiting over the D-edges. When the algorithm terminates, the set DF+ contains
the iterated dominance frontier for the initial set S .
Procedure Visit(y)
1 foreach J-edge y → z do
2    if z.depth ≤ current_x.depth then
3       if z ∉ DF+ then
4          DF+ ← DF+ ∪ {z}
5          if z ∉ S then InsertNode(z)
6 foreach D-edge y → y′ do
7    if y′.visited = false then
8       y′.visited ← true
         /* if (y′.boundary = false)   ▷ see the section on further readings for details */
9       Visit(y′)
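The OrderedBucket driver and the Visit procedure can be combined into a single Python sketch as follows; the graph encoding (depth, j_edges, d_children) and all helper names are assumptions of this sketch, not the book's interface.

    def df_plus_dj(S, depth, j_edges, d_children):
        """OrderedBucket-based DF+ computation over the DJ-graph.

        S             : the definition nodes Defs(v)
        depth[n]      : depth of node n in the dominator tree (root has depth 0)
        j_edges[n]    : targets of J-edges leaving n
        d_children[n] : children of n in the dominator tree (D-edge targets)
        """
        max_depth = max(depth.values())
        buckets = [[] for _ in range(max_depth + 1)]   # the OrderedBucket array
        in_bucket = set()

        def insert_node(n):                            # InsertNode(n)
            buckets[depth[n]].append(n)
            in_bucket.add(n)

        for n in S:
            insert_node(n)

        df_plus = set()
        visited = set()

        def visit(y, current_x):                       # Procedure Visit(y)
            for z in j_edges.get(y, ()):               # peek at J-edge targets
                if depth[z] <= depth[current_x] and z not in df_plus:
                    df_plus.add(z)
                    if z not in in_bucket:
                        insert_node(z)
            for c in d_children.get(y, ()):            # descend along D-edges
                if c not in visited:
                    visited.add(c)
                    visit(c, current_x)

        # Process nodes from the deepest bucket upwards (GetDeepestNode).
        for d in range(max_depth, -1, -1):
            while buckets[d]:
                x = buckets[d].pop()
                if x not in visited:
                    visited.add(x)
                    visit(x, x)
        return df_plus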
[Fig. 4.3: successive states of the OrderedBucket (levels 0 to 5) as the nodes of S = {1, 3, 4, 7} are processed and nodes 2, 5, and 6 are inserted.]
In Figure 4.3, some of the phases of the algorithm are depicted for clarity.
The OrderedBucket is populated with the nodes 1, 3, 4 and 7 corresponding to
S = Defs(v ) = {1, 3, 4, 7}. The nodes are inserted in the buckets corresponding to
the depths at which they appear. Hence, node 1 which appears at depth 0 is in
the 0-th bucket, node 3 is in bucket 2 and so on. Since the nodes are processed bottom-up, the first node that is visited is node 7. The J-edge 7 → 2 is considered, and as the DF+ set is empty, it is updated to hold node 2 according to Line 4 of the Visit procedure. In addition, InsertNode(2) is invoked and node 2 is inserted in bucket 2. The next node visited is node 4. The J-edge 4 → 5 is considered, which results in the new DF+ = {2, 5}. The final DF+ set converges to {2, 5, 6} when node 5 is visited. Subsequent visits of other nodes do not add anything to the DF+ set. An interesting case arises when node 3 is visited. Node 3 finally causes nodes 8, 9 and 10 also to be visited (Line 9, during the downward traversal along the D-edges). However, when node 10 is visited, considering the J-edge 10 → 8 does not result in an update of the DF+ set, as the depth of node 8 is greater than that of node 3.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Data-flow computation of DF+-graph using DJ-graph
Data-flow equation: consider a J-edge y → z. Then, for all nodes x such that x dominates y and x.depth ≥ z.depth:

DF+(x) = DF+(x) ∪ DF+(z) ∪ {z}
The set of data-flow equations for each node n in the DJ-graph can be solved
iteratively using a top-down pass over the DJ-graph. To check whether multiple
passes are required over the DJ-graph before a fixed point is reached for the
data-flow equations, we devise an “inconsistency condition” stated as follows:
Inconsistency Condition:
For a J-edge y′ → x, if y′ does not satisfy DF+(y′) ⊇ DF+(x), then the node y′ is said to be inconsistent.
The algorithm described in the next section is directly based on the method
of building up the DF+ (x ) sets of the nodes as each J-edge is encountered in an
iterative fashion by traversing the DJ-graph top-down. If no node is found to be
inconsistent after a single top-down pass, all the nodes are supposed to have reached their fixed-point DF+ values.
Function TDMSC-Main(DJ-graph)
Input: A DJ-graph representation of a program.
Output: The DF+ sets for the nodes.
1 foreach node x ∈ DJ-graph do
2    DF+(x) ← {}
3 repeat TDMSC-I(DJ-graph) until it returns false

Function TDMSC-I(DJ-graph)
1 RequireAnotherPass ← false
2 foreach edge e do e.visited ← false
3 while z ← next node in B(readth) F(irst) S(earch) order of DJ-graph do
4    foreach incoming J-edge e = y → z do
5       if not e.visited then
6          e.visited ← true
7          x ← y
8          while (x.depth ≥ z.depth) do
9             DF+(x) ← DF+(x) ∪ DF+(z) ∪ {z}
10            lx ← x
11            x ← parent(x)              ▷ dominator tree parent
12         foreach incoming J-edge e′ = y′ → lx do
13            if e′.visited then
14               if DF+(y′) ⊉ DF+(lx) then   ▷ check inconsistency
15                  RequireAnotherPass ← true
16 return RequireAnotherPass
The first and direct variant of the approach laid out above is poetically termed
TDMSC-I. This variant works by scanning the DJ-graph in a top-down fashion as
shown in Line 3 of Function TDMSC-I. All DF+ (x ) sets are set to the empty set
before the initial pass of TDMSC-I. The DF+ (x ) sets computed in a previous pass
are carried over if a subsequent pass is required.
The DJ-graph is visited depth by depth. During this process, for each node z encountered, if there is an incoming J-edge y → z, as in Line 4, then a separate
bottom-up pass starts at Line 8 (see Figure 4.4a for a snapshot of the variables
during algorithm execution).
This bottom-up pass traverses all nodes x such that x dominates y and x.depth ≥ z.depth, updating the DF+(x) values using the aforementioned data-
flow equation. Line 12 is used for the inconsistency check. RequireAnotherPass
is set to true only if a fixed point is not reached and the inconsistency check
succeeds for some node.
There are some subtleties in the algorithm that should be noted. Line 12 of the algorithm visits incoming edges of lx only when lx is at the same depth as z, which is the current depth of inspection; the incoming edges of lx's posterity are at a depth greater than that of node z and have not been visited yet.
Here, we will briefly walk through TDMSC-I using the DJ-graph of Figure 4.1b (reprinted here as Figure 4.4b). Moving top-down over the graph, the first J-edge encountered is when z = 2, i.e., 7 → 2. As a result, a bottom-up climbing of the nodes happens, starting at node 7 and ending at node 2, and the DF+ sets of these nodes are updated so that DF+(7) = DF+(6) = DF+(3) = DF+(2) = {2}. The next J-edge to be visited can be any of 5 → 6, 9 → 6, 6 → 5, 4 → 5, or 10 → 8 at depth = 3. Assume node 6 is visited first, and thus it is 5 → 6, followed by 9 → 6. This results in DF+(5) = DF+(5) ∪ DF+(6) ∪ {6} = {2, 6}, DF+(9) = DF+(9) ∪ DF+(6) ∪ {6} = {2, 6}, and DF+(8) = DF+(8) ∪ DF+(6) ∪ {6} = {2, 6}. Now, let 6 → 5 be visited. Hence, DF+(6) = DF+(6) ∪ DF+(5) ∪ {5} = {2, 5, 6}. At this point, the inconsistency check comes into the picture for the edge 6 → 5, as 5 → 6 is another J-edge that is already visited and is an incoming edge of node 6. Checking for DF+(5) ⊇ DF+(6) fails, implying that DF+(5) needs to be computed again. This will be done in a succeeding pass, as suggested by the RequireAnotherPass value of true. In a second iterative pass, the J-edges are visited in the same order. Now, when 5 → 6 is visited, DF+(5) = DF+(5) ∪ DF+(6) ∪ {6} = {2, 5, 6}, as this time DF+(5) = {2, 6} and
DF+(6) = {2, 5, 6}. On a subsequent visit of 6 → 5, DF+(6) is also set to {2, 5, 6}. The inconsistency does not appear any more, and the algorithm proceeds to handle the edges 4 → 5, 9 → 6 and 10 → 8, which have also been visited in the earlier pass. TDMSC-I is repeatedly invoked by a different function which calls it in a loop till RequireAnotherPass is returned as false, as shown in Function TDMSC-Main.
[Fig. 4.4: (a) a snapshot of the variables x, y, z, lx, e, e′, and y′ during a bottom-up climb of TDMSC-I; (b) the DJ-graph of Figure 4.1b, reprinted.]
Once the iterated dominance frontier relation is computed for the entire CFG,
inserting the φ-functions is a straightforward application of the DF+ (x ) values
for a given Defs(x ), as shown in Algorithm 4.2.
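Once DF+ is available, the insertion step itself is only a few lines; the following Python sketch (with illustrative names) shows the intended use of the computed sets in the spirit of Algorithm 4.2.

    def insert_phis(variables, defs, df_plus, insert_phi):
        """Place phi-functions once DF+ can be computed for every variable.

        defs[v]    : the set Defs(v) of nodes defining v
        df_plus(S) : the iterated dominance frontier of a set of nodes S
        insert_phi : callback adding a phi-function for v at the entry of a node
        """
        for v in variables:
            for node in df_plus(defs[v]):
                insert_phi(v, node)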
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Computing iterated dominance frontier using loop
nesting forests
This section illustrates the use of loop nesting forests to construct the iterated dominance frontier (DF+) of a set of vertices in a CFG. This method works with reducible as well as irreducible loops.
A loop nesting forest is a data structure that represents the loops in a CFG and the containment relation between them. In the example shown in Figure 4.5a, the loops with back edges 11 → 9 and 12 → 2 are both reducible loops. The
corresponding loop nesting forest is shown in Figure 4.5b and consists of two
loops whose header nodes are 2 and 9. The loop with header node 2 contains
the loop with header node 9.
[Fig. 4.5: (a) a CFG with definitions of the variable in nodes 4, 5, 7, and 12, a use of it (... ← v), and back edges 11 → 9 and 12 → 2; (b) the corresponding loop nesting forest, in which the loop headed by node 2 contains the loop headed by node 9.]
The idea is to use the forward CFG, an acyclic version of the control-flow graph (i.e., without back edges), and construct the DF+ for a variable in this context:
whenever two distinct definitions reach a join point, it belongs to the DF+ . Then,
we take into account the back edges using the loop nesting forest: if a loop
contains a definition, its header also belongs to the DF+ .
A definition node d "reaches" another node u if there is a non-empty path in the graph from d to u which does not contain any redefinition. If at least two definitions reach a node u, then u belongs to DF+(S), where S = Defs(x) consists of these definition nodes. This suggests Algorithm 4.3, which works for acyclic graphs. For a given S, we can compute DF+(S) as follows:
• Initialize DF+ to the empty set;
• Using a topological order, compute, by a forward data-flow analysis, the subset of S ∪ DF+ that can reach each node;
• Add a node to DF+ if it is reached by more than one node of this subset.
For Figure 4.5, the forward CFG of the graph G, termed Gfwd, is formed by dropping the back edges 11 → 9 and 12 → 2. Also, r is a specially designated node that is the root of the CFG. For the definitions of v in nodes 4, 5, 7 and 12 in Figure 4.5, the subsequent nodes (forward) reached by multiple definitions are 6
and 8: node 6 can be reached by any one of the two definitions in nodes 4 or 5, and node 8 by either the definition from node 7 or one of 4 or 5. Note that the back edges do not exist in the forward CFG and hence node 2 is not part of the DF+ set yet. We will see later how the DF+ set for the entire graph is computed by considering the contribution of the back edges.
10 if |ReachingDefs| = 1 then
11 UniqueReachingDef(u) ← ReachingDefs;
12 else
13 DF+ ← DF+ ∪ {u};
14 return DF+
Let us walk through this algorithm computing DF+ for variable v , i.e., S =
{4, 5, 7, 12}. The nodes in Figure 4.5 are already numbered in topological order.
Nodes 1 to 5 have only one predecessor, none of them being in S , so their Uni-
queReachingDef stays r, and DF+ is still empty. For node 6, its two predecessors
belong to S , hence ReachingDefs = {4, 5}, and 6 is added to DF+ . Nothing changes
for 7, then for 8 its predecessors 6 and 7 are respectively in DF+ and S : they are
added to ReachingDefs, and 8 is then added to DF+ . Finally, for nodes 8 to 12,
their UniqueReachingDef will be updated to node 8, but this will not change DF+
anymore which will end up being {6, 8}.
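The acyclic-graph procedure just described can be sketched in Python as follows, assuming the forward CFG is given by predecessor lists and a topological order; all names are illustrative and not taken from the book.

    def df_plus_acyclic(preds, topo_order, S, root):
        """DF+ on an acyclic (forward) CFG by forward reaching-definition analysis.

        preds[u]   : predecessors of u in the forward CFG
        topo_order : nodes of the forward CFG in topological order
        S          : the definition nodes Defs(v)
        root       : entry node r (acts as a pseudo-definition)
        """
        df_plus = set()
        unique_def = {root: root}            # UniqueReachingDef
        for u in topo_order:
            if u == root:
                continue
            reaching = set()
            for p in preds[u]:
                # A predecessor that defines v, or that is itself a join in DF+,
                # acts as the definition reaching u along that edge.
                if p in S or p in df_plus:
                    reaching.add(p)
                else:
                    reaching.add(unique_def[p])
            if len(reaching) == 1:
                unique_def[u] = next(iter(reaching))
            else:
                df_plus.add(u)
                unique_def[u] = u            # u now acts as the definition below it
        return df_plus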
A reducible graph can be decomposed into an acyclic graph and a set of back edges. The contribution of back edges to the iterated dominance frontier can be identified by using the loop nesting forest. If a vertex v is contained in a loop, then DF+(v) will contain the loop header, i.e., the unique entry of the reducible loop. For any vertex v, let HLC(v) denote the set of loop headers of the loops containing v. Given a set of vertices S, it turns out that DF+(S) = HLC(S) ∪ DF+fwd(S ∪ HLC(S)),
where HLC(S) = ∪_{v∈S} HLC(v), and where DF+fwd denotes the DF+ restricted to the forward CFG Gfwd.
[Fig. 4.6: an irreducible loop made of nodes v and w with two entry nodes (a); its loop nesting forest, with v and w as headers (b); the acyclic graph obtained by removing the edges v → w and w → v (c); and the transformed graph with the dummy node θ connected from the headers' predecessors u and s and to the headers v and w (d).]
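The formula translates almost literally into code; the sketch below assumes hlc(v) returns the loop headers containing v (taken from the loop nesting forest) and forward_df_plus is, for example, the acyclic procedure sketched earlier.

    def df_plus_reducible(S, hlc, forward_df_plus):
        """DF+ for a reducible CFG: DF+(S) = HLC(S) ∪ DF+fwd(S ∪ HLC(S)).

        hlc(v)          : headers of all loops containing v (loop nesting forest)
        forward_df_plus : DF+ computation on the forward (acyclic) CFG
        """
        headers = set()
        for v in S:
            headers |= hlc(v)
        return headers | forward_df_plus(S | headers)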
We will now briefly explain how graphs containing irreducible loops can be
handled. The insight behind the implementation is to transform the irreducible
loop in such a way that an acyclic graph is created from the loop without changing
the dominance properties of the nodes.
The loop in the graph of Figure 4.6a, which is made up of nodes v and w, is irreducible as it has two entry nodes, v and w. We let the headers be those entry
nodes. It can be transformed to the acyclic graph (c) by removing the back edges,
i.e. the edges within the loop that point to the header nodes, in other words,
edges v → w and w → v . We create a dummy node, θ , to which we connect all
predecessors of the headers (u and s ), and that we connect to all the headers of
the loop (v and w ), creating graph (d).
Following this transformation, the graph is now acyclic, and computing the
DF+ for the nodes in the original irreducible graph translates to computing DF+
using the transformed graph to get DF+fwd, and using the loop forest of the original
graph (b).
The crucial observation that allows this transformation to create an equivalent
acyclic graph is the fact that the dominator tree of the transformed graph remains
identical to the original graph containing an irreducible cycle.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Concluding remarks and further readings
Concluding remarks
Although all these algorithms claim to be better than the original algorithm by
Cytron et al., they are difficult to compare due to the unavailability of these
algorithms in a common compiler framework.
In particular, while constructing the whole DF+ set seems very costly in the
classical construction algorithm, its cost is actually amortized as it will serve
to insert φ-functions for many variables. It is, however, interesting not to pay
this cost whenever we only have a few variables to consider, for instance when
repairing SSA as in the next chapter.
Note also that people have observed in production compilers that, during SSA
construction, what seems to be the most expensive part is the renaming of the
variables and not the insertion of φ-functions.
Further readings
The algorithm computing DF+ without the explicit DF-graph is from Das &
Ramakrishna [98]. For iterative DF+ set computation, they also exhibit TDMSC-
II, an improvement to algorithm TDMSC-I. This improvement is fueled by the
observation that for an inconsistent node u , the DF+ sets of all nodes w such that
w dominates u and w.depth ≥ u .depth, can be locally corrected for some special
cases. This heuristic works very well for certain classes of problems—especially
for CFGs with DF-graphs having cycles consisting of a few edges. This eliminates
extra passes as an inconsistent node is made consistent immediately on being
detected.
Finally, the part on computing DF+ sets using loop nesting forests is based on
Ramalingam’s work on loops, dominators, and dominance frontiers [245].
CHAPTER 5
SSA Reconstruction — (S. Hack)
Fig. 5.1 Adding a second definition as a side-effect of spilling: (a) original program; (b) after spilling x0, SSA broken; (c) SSA reconstructed.
Many optimizations perform such program modifications, and maintaining SSA is often one of the more complicated and error-prone parts in such optimizations, owing to the insertion of additional φ-functions and the correct redirection of the uses of the variable.
Another example of such a transformation is path duplication, which is discussed in Chapter 20. Several popular compilers such as GCC and LLVM perform one or another variant of path duplication, for example when threading jumps.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1 General considerations
In this chapter, we will discuss two algorithms. The first is an adaptation of the classical dominance-frontier-based algorithm. The second performs a search from the uses of the variables to the definition and places φ-functions on demand at appropriate places. In contrast to the first, the second algorithm might not construct minimal SSA form in general; however, it does not need to update its internal data structures when the CFG is modified.
We consider the following scenario: The program is represented as a control-
flow graph (CFG) and is in SSA form with dominance property. For the sake of
simplicity, we assume that each instruction in the program only writes to a single
variable. An optimization or transformation violates SSA by inserting additional
definitions for an existing SSA variable, like in the examples above. The original
variable and the additional definitions can be seen as a single non-SSA variable
that has multiple definitions and uses. Let in the following v be such a non-SSA
variable.
When reconstructing SSA for v , we will first create fresh variables for every
definition of v to establish the single-assignment property. What remains is
associating every use of v with a suitable definition. In the algorithms, v.defs
denotes the set of all instructions that define v . A use of a variable is a pair
consisting of a program point (an instruction) and an integer denoting the index
of the operand at this instruction.
Both algorithms presented in this chapter share the same driver routine de-
scribed in Algorithm 5.1. First, we scan all definitions of v so that for every basic
block b we have the list b .defs that contains all instructions in the block which de-
fine one of the variables in v.defs. It is best to sort this list according to the schedule of the instructions in the block from back to front, making the latest definition the first in the list.
Then, all uses of the variable v are traversed to associate them with the proper
definition. This can be done by using precomputed use-def chains if available or
scanning all instructions in the dominance subtree of v ’s original SSA definition.
For each use, we have to differentiate whether the use is in a φ-function or not.
If so, the use occurs at the end of the predecessor block that corresponds to the
position of the variable in the φ’s argument list. In that case, we start looking
for the reaching definition from the end of that block. Otherwise, we scan the
instructions of the block backwards until we reach the first definition that is
before the use (Line 14). If there is no such definition, we have to find one that
reaches this block from outside.
We use two functions, FindDefFromTop and FindDefFromBottom that search
the reaching definition respectively from the beginning or the end of a block.
FindDefFromBottom actually just returns the last definition in the block, or calls FindDefFromTop if there is none.
The two presented approaches to SSA repairing differ in the implementation
of the function FindDefFromTop. The differences are described in the next two
sections.
19 v′ ← version of v defined by d
20 rewrite use of v by v′ in inst
Procedure FindDefFromBottom(v , b )
1 if b.defs ≠ ∅ then
2 return latest instruction in b .defs
3 else
4 return FindDefFromTop(v , b )
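A hedged Python sketch of the driver routine described above (Algorithm 5.1) might look as follows; the block and instruction attributes (block, preds, position, set_operand, defined_variable) are assumptions of the sketch, and find_def_from_top is whichever of the two strategies of the following sections is used.

    def reconstruct_ssa(v, find_def_from_top):
        """Rewire every use of the (no longer single-assignment) variable v."""
        # Group definitions per block, latest definition first
        # (each block is assumed to start with an empty defs list).
        for d in v.defs:
            d.block.defs.append(d)
        for b in {d.block for d in v.defs}:
            b.defs.sort(key=lambda i: i.position, reverse=True)

        def find_def_from_bottom(block):
            return block.defs[0] if block.defs else find_def_from_top(v, block)

        for inst, idx in v.uses():          # a use is (instruction, operand index)
            if inst.is_phi:
                # The use happens at the end of the matching predecessor block.
                d = find_def_from_bottom(inst.block.preds[idx])
            else:
                # Scan backwards inside the block for a definition above the use.
                d = next((i for i in inst.block.defs
                          if i.position < inst.position), None)
                if d is None:
                    d = find_def_from_top(v, inst.block)
            inst.set_operand(idx, d.defined_variable())   # rewrite the use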
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 Reconstruction based on the dominance frontier
This algorithm follows the same principles as the classical SSA construction
algorithm by Cytron et al. as described in Chapter 3. We first compute the iterated
dominance frontier (DF+ ) of v . This set is a sound approximation of the set where
φ-functions must be placed—it might contain blocks where a φ-function would
be dead. Then, we search for each use u the corresponding reaching definition.
This search starts at the block of u . If that block b is in the DF+ of v , a φ-function
needs to be placed at its entrance. This φ-function becomes a new definition
of v and has to be inserted in v.defs and in b .defs. The operands of the newly
created φ-function will query their reaching definitions by recursive calls to
FindDefFromBottom on predecessors of b . Because we inserted the φ-function
into b .defs before searching for the arguments, no infinite recursion can occur
(otherwise, it could happen for instance with a loop back edge).
If the block is not in the DF+ , the search continues in the immediate dominator
of the block. This is because in SSA, every use of a variable must be dominated by
its definition.1 Therefore, the reaching definition is the same for all predecessors
of the block, and hence for the immediate dominator of this block.
Procedure FindDefFromTop(v, b)
▷ SSA Reconstruction based on Dominance Frontiers
1 if b ∈ DF+(v.defs) then
2    v′ ← fresh variable
3    d ← new φ-function in b: v′ ← φ(...)
4    append d to b.defs
5    foreach p ∈ b.preds do
6       o ← FindDefFromBottom(v, p)
7       v′ ← version of v defined by o
8       set corresponding operand of d to v′
9 else
10   d ← FindDefFromBottom(v, b.idom)      ▷ search in immediate dominator
11 return d
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.3 Search-based reconstruction
The second algorithm presented here is adapted from an algorithm designed to
construct SSA from the abstract syntax tree, but it also works well on control-flow
1
The definition of an operand of a φ-function has to dominate the corresponding predecessor
block.
graphs. Its major advantage over the algorithm presented in the last section is
that it requires neither dominance information nor dominance frontiers.
Thus it is well suited to be used in transformations that change the control-flow
graph. Its disadvantage is that potentially more blocks have to be visited during
the reconstruction. The principal idea is to start a search from every use to find
the corresponding definition, inserting φ-functions on the fly while caching
the SSA variable alive at the beginning of basic blocks. As in the last section, we
only consider the reconstruction for a single variable called v in the following.
If multiple variables have to be reconstructed, the algorithm can be applied to
each variable separately.
The algorithm performs a backward depth-first search in the CFG to collect the reaching definitions of the variable v in question at each block, recording the SSA variable
that is alive at the beginning of a block in the “beg” field of this block. If the CFG
is an acyclic graph (DAG), all predecessors of a block can be visited before the
block itself is processed, as we are using a post-order traversal following edges
backward. Hence, we know all the definitions that reach a block b : if there is
more than one definition, we need to place a φ-function in b , otherwise it is not
necessary.
If the CFG has loops, there are blocks for which not all reaching definitions can
be computed before we can decide whether a φ-function has to be placed or
not. In a loop, recursively computing the reaching definitions for a block b will
end up at b itself. To avoid infinite recursion when we enter a block during the
traversal, we first create a φ-function without arguments, “pending_φ.” This
creates a new definition vφ for v which is the variable alive at the beginning of
this block.
When we return to b after traversing the rest of the CFG, we decide whether
a φ-function has to be placed in b by looking at the reaching definition for
every predecessor. These reaching definitions can be either vφ itself (loop in the
CFG without a definition of v ), or some other definitions of v . If there is only
one such other definition, say w , then pending_φ is not necessary and we can
remove it, propagating w downward instead of vφ . Note that in this case it will
be necessary to “rewrite” all uses that referred to pending_φ to w . Otherwise,
we keep pending_φ and fill its missing operands with the reaching definitions.
In this version of function FindDefFromTop, this check is done by the function
Phi-Necessary.
Procedure FindDefFromTop(b)
▷ Search-based SSA Reconstruction
Input: b, a basic block
1 if b.top ≠ ⊥ then
2    return b.top
3 pending_φ ← new φ-function in b
4 vφ ← result of pending_φ
5 b .top ← vφ
6 reaching_defs ← []
7 foreach p ∈ b .preds do
8 reaching_defs ← reaching_defs ∪ FindDefFromBottom(v , p )
9 vdef ← Phi-Necessary(vφ , reaching_defs)
10 if vdef = vφ then
11 set arguments of pending_φ to reaching_defs
12 else
13 rewire all uses of pending_φ to vdef
14 remove pending_φ
15 b .top ← vdef
16 return vdef
8 assert (other ≠ ⊥)   ▷ this assertion would be violated if reaching_defs contained only pending_φ, which can never happen
9 return other
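The check performed by Phi-Necessary, as described in the preceding paragraphs, can be sketched in Python as follows; the body is an assumed reconstruction rather than the book's own pseudo-code.

    def phi_necessary(v_phi, reaching_defs):
        """Decide whether pending_phi is needed.

        Returns v_phi if at least two distinct definitions other than v_phi
        reach the block (the phi must stay); otherwise returns the single
        other definition, to which all uses of pending_phi can be rewired.
        """
        other = None
        for d in reaching_defs:
            if d is v_phi:
                continue                  # a self-reference through a loop edge
            if other is None:
                other = d
            elif d is not other:
                return v_phi              # two distinct sources: phi is necessary
        # reaching_defs cannot consist solely of v_phi, so 'other' is set here.
        assert other is not None
        return other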
In programs with loops, it can be the case that the local optimization performed
when function FindDefFromTop calls Phi-Necessary does not remove all unnec-
essary φ-functions. This can happen in loops where φ-functions can become
unnecessary because other φ-functions are optimized away. Consider the exam-
ple in Figure 5.2. We look for a definition of x from block E . If Algorithm Find-
[Fig. 5.2: blocks A to E, with a definition of x in block A (x0 ← ...) and a use of x in block E; after reconstruction, φ-functions such as x2 ← φ(x0, x1) and x1 ← φ(x2, x1) may remain even though not all of them are necessary.]

CHAPTER 6
Functional Representations of SSA — (L. Beringer)

Viewing SSA as a functional program offers several benefits:
1. Relating the core ideas of SSA to concepts from other areas of compiler and
programming language research provides conceptual insight into the SSA
discipline and thus contributes to a better understanding of the practical
appeal of SSA to compiler writers;
2. Reformulating SSA as a functional program makes explicit some of the syn-
tactic conditions and semantic invariants that are implicit in the definition
and use of SSA. Indeed, the introduction of SSA itself was motivated by a
similar goal: to represent aspects of program structure—namely the def-use
relationships—explicitly in syntax, by enforcing a particular naming disci-
pline. In a similar way, functional representations directly enforce invariants
such as “all φ-functions in a block must be of the same arity,” “the variables
assigned to by these φ-functions must be distinct,” “φ-functions are only
allowed to occur at the beginning of a basic block,” or “each use of a vari-
able should be dominated by its (unique) definition.” Constraints such as
these would typically have to be validated or (re-)established after each op-
timization phase of an SSA-based compiler, but are typically enforced by
construction if a functional representation is chosen. Consequently, less code
is required, improving the robustness, maintainability, and code readability
of the compiler;
3. The intuitive meaning of “unimplementable” φ-instructions is complemented by a concrete execution model, facilitating the rapid implementation of interpreters. This enables the compiler developers to experimentally vali-
date SSA-based analyses and transformations at their genuine language level,
without requiring SSA destruction. Indeed, functional intermediate code can
often be directly emitted as a program in a high-level mainstream functional
language, giving the compiler writer access to existing interpreters and com-
pilation frameworks. Thus, rapid prototyping is supported and high-level
evaluation of design decisions is enabled;
4. Formal frameworks of program analysis that exist for functional languages
become applicable. Type systems provide a particularly attractive formalism
due to their declarativeness and compositional nature. As type systems for
functional languages typically support higher-order functions, they can be
expected to generalize more easily to interprocedural analyses than other
static analysis formalisms;
5. We obtain a formal basis for comparing variants of SSA—such as the variants
discussed elsewhere in this book—, for translating between these variants,
and for constructing and destructing SSA. Correctness criteria for program
analyses and associated transformations can be stated in a uniform manner,
and can be proved to be satisfied using well-established reasoning principles.
Rather than discussing all these considerations in detail, the purpose of the
present chapter is to informally highlight particular aspects of the correspon-
dence and then point the reader to some more advanced material. Our exposition
is example-driven but leads to the identification of concrete correspondence
pairs between the imperative/SSA world and the functional world.
Like the remainder of the book, our discussion is restricted to code occurring
in a single procedure.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.1 Low-level functional program representations
Functional languages represent code using declarations of the form
function f (x0 , . . . , xn ) = e (6.1)
let v = 3 in
   let y = (let v = 2 × v in 4 × v end)          (6.2)
   in y × v end
end

The scope of the outer binding for v spans both inner let-bindings, the scopes of which are themselves not nested inside one another, as the inner binding of v occurs in the e1 position of the let-binding for y.
In contrast to an assignment in an imperative language, a let-binding for
variable x hides any previous value bound to x for the duration of evaluating
e2 but does not permanently overwrite it. Bindings are treated in a stack-like
fashion, resulting in a tree-shaped nesting structure of boxes in our code excerpts.
For example, in the above code, the inner binding of v to value 2 × 3 = 6 shadows
the outer binding of v to value 3 precisely for the duration of the evaluation of the
expression 4 × v . Once this evaluation has terminated (resulting in the binding
of y to 24), the binding of v to 3 becomes visible again, yielding the overall result
of 72.
The concepts of binding and static scope ensure that functional programs
enjoy the characteristic feature of SSA, namely the fact that each use of a variable
is uniquely associated with a point of definition. Indeed, the point of definition
for a use of x is given by the nearest enclosing binding of x . Occurrences of
variables in an expression that are not enclosed by a binding are called free. A
well-formed procedure declaration contains all free variables of its body amongst
its formal parameters. Thus, the notion of scope makes explicit the invariant that
each use of a variable should be dominated by its (unique) definition.
In contrast to SSA, functional languages achieve the association of definitions
to uses without imposing the global uniqueness of variables, as witnessed by
the duplicate binding occurrences for v in the above code. As a consequence
of this decoupling, functional languages enjoy a strong notion of referential transparency: the choice of x as the variable holding the result of e1 depends
only on the free variables of e2 . For example, we may rename the inner v in code
(6.2) to z without altering the meaning of the code:
let v = 3 in
   let y = (let z = 2 × v in 4 × z end)          (6.3)
   in y × v end
end
Note that this conversion formally makes the outer v visible for the expression
4 × z , as indicated by the index v, z decorating its surrounding box.
In order to avoid altering the meaning of the program, the choice of the newly
introduced variable has to be such that confusion with other variables is avoided.
Formally, this means that a renaming
let x = e1 in e2 end to let y = e1 in e2 [y ↔ x ] end
can only be carried out if y is not a free variable of e2 . Moreover, in case that e2
already contains some preexisting bindings to y , the substitution of x by y in
e2 (denoted by e2 [y ↔ x ] above) first renames these preexisting bindings in a
suitable manner. Also note that the renaming only affects e2 —any occurrences of
x or y in e1 refer to conceptually different but identically named variables, but the
static scoping discipline ensures these will never be confused with the variables
involved in the renaming. In general, the semantics-preserving renaming of
bound variables is called α-renaming. Typically, program analyses for functional
languages are compatible with α-renaming in that they behave equivalently
for fragments that differ only in their choice of bound variables, and program
transformations α-rename bound variables whenever necessary.
A consequence of referential transparency, and thus a property typically enjoyed by functional languages, is compositional equational reasoning: the meaning of a piece of code e depends only on its free variables, and can be calculated from the meaning of its subexpressions. For example, the meaning of
a phrase let x = e1 in e2 end only depends on the free variables of e1 and on the
free variables of e2 other than x . Hence, languages with referential transparency
allow one to replace a subexpression by some semantically equivalent phrase
without altering the meaning of the surrounding code. Since semantic preserva-
tion is a core requirement of program transformations, the suitability of SSA for
formulating and implementing such transformations can be explained by the
proximity of SSA to functional languages.
once the evaluation of the current code fragment has terminated. Syntactically, continuations are expressions that may occur in functional position (i.e., are typically applied to argument expressions), as is the case for the variable k in the following modification of code (6.2):
let v = 3 in
let y = (let v = 2 × v in 4 × v end)
(6.4)
in k (y × v ) end
end
In effect, k represents any function that may be applied to the result of expres-
sion (6.2).
Surrounding code may specify the concrete continuation by binding k to a
suitable expression. It is common practice to write these continuation-defining
expressions in λ-notation, i.e., in the form λ x .e where x typically occurs free in
e . The effect of the expression is to act as the (unnamed) function that sends x
to e (x ), i.e., formal parameter x represents the place-holder for the argument
to which the continuation is applied. Note that x is α-renameable, as λ acts as
a binder. For example, a client of the above code fragment wishing to multiply
the result by 2 may insert code (6.4) in the e2 position of a let-binding for k that
contains λ x . 2 × x in its e1 -position, as in the following code:
let k = λ x . 2 × x
in let v = 3 in
let y = (let z = 2 × v in 4 × z end)
(6.5)
in k (y × v ) end
end
end
When the continuation k is applied to the argument y × v , the (dynamic)
value y × v (i.e., 72) is substituted for x in the expression 2 × x , just like in an
ordinary function application.
Alternatively, the client may wrap fragment (6.4) in a function definition with
formal argument k and construct the continuation in the calling code, where
he would be free to choose a different name for the continuation-representing
variable:
function f (k ) =
let v = 3 in
let y = (let z = 2 × v in 4 × z end)
in k (y × v ) end (6.6)
end
in let k = λ x . 2 × x in f (k ) end
end.
This makes CPS form a discipline of programming using higher-order func-
tions, as continuations are constructed “on-the-fly” and communicated as argu-
ments of other function calls. Typically, the caller of f is itself parametric in its
continuation, as in
function g (k ) =
   let k′ = λ x . k (x + 7) in f (k′) end.        (6.7)
function h (y , k ) =
   let x = 4 in
      let k′ = λ z . k (z × x )
      in if y > 0
         then let z = y × 2 in k′(z ) end         (6.8)
         else let z = 3 in k′(z ) end
      end
   end
Fig. 6.1 Control-flow graph for code (6.8) (a), and SSA representation (b).
The SSA form of this CFG is shown in Figure 6.1b. If we apply similar renamings
of z to z 1 and z 2 in the two branches of (6.8), we obtain the following fragment:
function h (y , k ) =
   let x = 4 in
      let k′ = λ z . k (z × x )
      in if y > 0
         then let z1 = y × 2 in k′(z1 ) end       (6.9)
         else let z2 = 3 in k′(z2 ) end
      end
   end

In a direct-style reading of the same program, the control-flow merge point is modelled by a local function rather than by a continuation:

function h (y ) =
   let x = 4 in
      function h′(z ) = z × x
      in if y > 0
         then let z = y × 2 in h′(z ) end         (6.10)
         else let z = 3 in h′(z ) end
      end
   end
where the local function h′ plays a similar role as the continuation k′ and is jointly called from both branches. In contrast to the CPS representation, however, the body of h′ returns its result directly rather than by passing it on as an argument to some continuation. Also note that neither the declaration of h nor that of h′ contain additional continuation parameters. Thus, rather than handing its
result directly over to some caller-specified receiver (as communicated by the
continuation argument k ), h simply returns control back to the caller, who is
then responsible for any further execution. Roughly speaking, the effect is similar
to the imperative compilation discipline of always setting the return address
of a procedure call to the instruction pointer immediately following the call
instruction.
A stricter format is obtained if the granularity of local functions is required to
be that of basic blocks:
function h (y ) =
   let x = 4 in
      function h′(z ) = z × x
      in if y > 0
         then function h1 () = let z = y × 2 in h′(z ) end
              in h1 () end                        (6.11)
         else function h2 () = let z = 3 in h′(z ) end
              in h2 () end
      end
   end
Independent of the granularity level of local functions, the process of moving from the CFG to the SSA form is again captured by suitably α-renaming the bindings of z in h1 and h2:
function h (y ) =
   let x = 4 in
      function h′(z ) = z × x
      in if y > 0
         then function h1 () = let z1 = y × 2 in h′(z1 ) end
              in h1 () end                        (6.12)
         else function h2 () = let z2 = 3 in h′(z2 ) end
              in h2 () end
      end
   end
Again, the role of the formal parameter z of the control-flow merge point function h′ is identical to that of a φ-function. In accordance with the fact that the basic
blocks representing the arms of the conditional do not contain φ-functions, the
local functions h1 and h2 have empty parameter lists—the free occurrence of y
in the body of h1 is bound at the top level by the formal argument of h .
For both direct style and CPS the correspondence to SSA is most pronounced for code in let-normal form: each intermediate result must be explicitly named by a
variable, and function arguments must be names or constants. Syntactically, let-
normal form isolates basic instructions in a separate category of primitive terms
a and then requires let-bindings to be of the form let x = a in e end. In particular,
neither jumps (conditional or unconditional) nor let-bindings are primitive.
Let-normalized form is obtained by repeatedly rewriting code as follows:
   let x = (let y = e in e′ end) in e″ end
is rewritten into
   let y = e in (let x = e′ in e″ end) end,
subject to the side condition that y is not free in e″. For example, let-normalizing
code (6.3) pulls the let-binding for z to the outside of the binding for y , yielding
let v = 3 in
   let z = 2 × v in
      let y = 4 × z in y × v end                  (6.13)
   end
end
Such a chain of nested, non-shadowing bindings is commonly abbreviated by a single let that introduces its bindings one after the other:

let v = 3,
    z = 2 × v,                                    (6.14)
    y = 4 × z
in y × v end
Summarizing our discussion up to this point, Table 6.1 collects some corre-
spondences between functional and imperative/SSA concepts.
Table 6.1 Correspondence pairs between functional form and SSA (part I).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.2 Functional construction and destruction of SSA
The relationship between SSA and functional languages is extended by the corre-
spondences shown in Table 6.2. We discuss some of these aspects by considering
the translation into SSA, using the program in Figure 6.2 as a running example.
Table 6.2 Correspondence pairs between functional form and SSA: program structure.
[Fig. 6.2: the running example CFG. Block b1: v ← 1; z ← 8; y ← 4. Block b2: x ← 5 + y; y ← x × z; x ← x − 1; if (x = 0). Block b3: w ← v + y; return w. The true branch of b2 leads to b3; the false branch loops back to b2.]
function f1 () = let v = 1, z = 8, y = 4
in f2 (v, z , y ) end
and f2 (v, z , y ) = let x = 5 + y , y = x × z , x = x − 1
(6.15)
in if x = 0 then f3 (y , v ) else f2 (v, z , y ) end
and f3 (y , v ) = let w = y + v in w end
in f1 () end
Each formal parameter of a function fi is the target of one φ-function in the corresponding block bi. The arguments of these φ-functions are the arguments in the corresponding
positions in the calls to fi . As the number of arguments in each call to fi coincides
with the number of formal parameters of fi , the φ-functions in bi are all of the
same arity, namely the number of call sites to fi . In order to coordinate the
relative positioning of the arguments of the φ-functions, we choose an arbitrary
enumeration of these call sites.
function f1 () = let v1 = 1, z 1 = 8, y1 = 4
in f2 (v1 , z 1 , y1 ) end
and f2 (v2 , z 2 , y2 ) = let x1 = 5 + y2 , y3 = x1 × z 2 , x2 = x1 − 1
(6.16)
in if x2 = 0 then f3 (y3 , v2 ) else f2 (v2 , z 2 , y3 ) end
and f3 (y4 , v3 ) = let w1 = y4 + v3 in w1 end
in f1 () end
[Fig. 6.3: the SSA form of the running example (left) and its dominator tree b1 → b2 → b3 (right). Block b1: v1 ← 1; z1 ← 8; y1 ← 4. Block b2: v2 ← φ(v1, v2); z2 ← φ(z1, z2); y2 ← φ(y1, y3); x1 ← 5 + y2; y3 ← x1 × z2; x2 ← x1 − 1; if (x2 = 0). Block b3: y4 ← φ(y3); v3 ← φ(v2); w1 ← v3 + y4; return w1.]
6.2.2 λ-dropping
Block sinking analyzes the static call structure to identify which function defini-
tions may be moved inside each other. For example, whenever our set of function
declarations contains definitions f (x1 , . . . , xn ) = e f and g (y1 , . . . , ym ) = eg where
f 6= g and such that all calls to f occur in e f or eg , we can move the declaration
for f into that of g —note the similarity to the notion of dominance. If applied
aggressively, block sinking indeed amounts to making the entire dominance tree
structure explicit in the program representation. In particular, algorithms for
computing the dominator tree from a CFG discussed elsewhere in this book can
be applied to identify block sinking opportunities, where the CFG is given by the
call graph of functions.
In our example (6.15), f3 is only invoked from within f2 , and f2 is only called
in the bodies of f2 and f1 (see the dominator tree in Figure 6.3 (right)). We may
thus move the definition of f3 into that of f2 , and the latter one into f1 .
Several options exist as to where f should be placed in its host function. The
first option is to place f at the beginning of g , by rewriting to
function g (y1 , . . . , ym ) = function f (x1 , . . . , xn ) = e f
in eg end.
This transformation does not alter the meaning of the code, as the declaration
of f is closed: moving f into the scope of the formal parameters y1 , . . . , ym (and
also into the scope of g itself) does not alter the bindings to which variable uses
inside e f refer.
Applying this transformation to example (6.15) yields the following code:
function f1 () =
function f2 (v, z , y ) =
function f3 (y , v ) = let w = y + v in w end
in let x = 5 + y , y = x × z , x = x − 1
in if x = 0 then f3 (y , v ) else f2 (v, z , y ) end (6.17)
end
in let v = 1, z = 8, y = 4 in f2 (v, z , y ) end
end
in f1 () end
An alternative strategy is to insert f near the end of its host function g , in the
vicinity of the calls to f . This brings the declaration of f additionally into the
scope of all let-bindings in eg . Again, referential transparency and preservation
of semantics are respected as the declaration on f is closed. In our case, the
alternative strategy yields the following code:
function f1 () =
   let v = 1, z = 8, y = 4
in function f2 (v, z , y ) =
let x = 5 + y , y = x × z , x = x − 1
in if x = 0
then function f3 (y , v ) = let w = y + v in w end
(6.18)
in f3 (y , v ) end
else f2 (v, z , y )
end
in f2 (v, z , y ) end
end
in f1 () end
In general, one would insert f directly prior to its call if g contains only a single
call site for f . In case that g contains multiple call sites for f , these are (due to
their tail-recursive positioning) in different arms of a conditional, and we would
insert f directly prior to this conditional.
Both outlined placement strategies result in code whose nesting structure
reflects the dominance relationship of the imperative code. In our example,
code (6.17) and (6.18) both nest f3 inside f2 inside f1 , in accordance with the
dominator tree of the imperative program shown in Figure 6.3.
The rationale for these clauses is that removing x from f ’s parameter list means
that any free occurrence of x in f ’s body is now bound outside of f ’s declaration.
Therefore, for each call to f to be correct, one needs to ensure that this outside
binding coincides with the one containing the call.
Similarly, we may simultaneously drop a parameter x occurring in all decla-
rations of a block of mutually recursive functions f1 , . . . , fn , if the scope for x at
the point of declaration of the block coincides with the tightest scope in force
at any call site to some fi outside the block, and if in each call to some fi inside
some f j , the tightest scope for x is the one associated with the formal parameter
x of f j . In both cases, dropping a parameter means to remove it from the list of
formal parameter lists of the function declarations concerned, and also from the
argument lists of the corresponding function calls.
In code (6.18), these conditions sanction the removal of both parameters from
the nonrecursive function f3 . The scope applicable for v at the site of declaration
of f3 and also at its call site is the one rooted at the formal parameter v of f2 . In
case of y , the common scope is the one rooted at the let-binding for y in the
body of f2 . We thus obtain the following code:
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (v, z , y ) =
let x = 5 + y , y = x × z , x = x − 1
in if x = 0
then function f3 () = let w = y + v in w end
(6.19)
in f3 () end
else f2 (v, z , y )
end
in f2 (v, z , y ) end
end
in f1 () end
Considering the recursive function f2 next we observe that the recursive call is in
the scope of the let-binding for y in the body of f2 , preventing us from removing
y . In contrast, neither v nor z have binding occurrences in the body of f2 . The
scopes applicable at the external call site to f2 coincide with those applicable at
its site of declaration and are given by the scopes rooted in the let-bindings for v
and z . Thus, parameters v and z may be removed from f2 :
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (y ) =
let x = 5 + y , y = x × z , x = x − 1
in if x = 0
then function f3 () = let w = y + v in w end
(6.20)
in f3 () end
else f2 (y )
end
in f2 (y ) end
end
in f1 () end
Interpreting the uniquely-renamed variant of (6.20) back in SSA yields the desired code with a single φ-function, for variable y at the beginning of block b2, see Figure 6.4. This φ-function cannot be eliminated, because y is redefined in the loop body, so two distinct definitions of y (y1 and y3) reach the beginning of b2.
[Fig. 6.4: SSA code corresponding to (6.20). Block b1: v1 ← 1; z1 ← 8; y1 ← 4. Block b2: y2 ← φ(y1, y3); x1 ← 5 + y2; y3 ← x1 × z1; x2 ← x1 − 1; if (x2 = 0). Block b3: ...; return w1.]
witnessed by our use of the let-binding construct for binding code-representing expressions to the variables k in our syntax for CPS.
Thus, the dominator tree immediately suggests a function nesting scheme,
where all children of a node are represented as a single block of mutually recursive
function declarations.2
The choice as to where functions are placed corresponds to variants of SSA. For example, in loop-closed SSA form (see Chapters 14 and 10), SSA names that
are defined in a loop must not be used outside the loop. To this end, special-
purpose unary φ-nodes are inserted for these variables at the loop exit points.
As the loop is unrolled, the arity of these trivial φ-nodes grows with the number
of unrollings, and the program continuation is always supplied with the value
the variable obtained in the final iteration of the loop. In our example, the only
loop-defined variable used in f3 is y —and we already observed in code (6.17)
how we can prevent the dropping of y from the parameter list of f3 : we insert f3 at
the beginning of f2 , preceding the let-binding for y . Of course, we would still like
to drop as many parameters from f2 as possible, hence we apply the following
placement policy during block sinking: functions that are targets of loop-exiting
function calls and have live-in variables that are defined in the loop are placed
at the beginning of the loop headers. Other functions are placed at the end of
their hosts. Applying this policy to our original program (6.15) yields (6.21).
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (v, z , y ) =
function f3 (y , v ) = let w = y + v in w end
in let x = 5 + y , y = x × z , x = x − 1
(6.21)
in if x = 0 then f3 (y , v ) else f2 (v, z , y ) end
end
in f2 (v, z , y ) end
end
in f1 () end
We may now drop v (but not y ) from the parameter list of f3 , and v and z from
f2 , to obtain code (6.22).
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (y ) =
function f3 (y ) = let w = y + v in w end
in let x = 5 + y , y = x × z , x = x − 1
(6.22)
in if x = 0 then f3 (y ) else f2 (y ) end
end
in f2 (y ) end
end
in f1 () end
The SSA form corresponding to (6.22) contains the desired loop-closing φ-node
for y at the beginning of b3 , as shown in Figure 6.5a. The nesting structure of
2
Refinements of this representation will be sketched in Section 6.3.
(a) Loop-closed form: b3 begins with y3 ← φ(y4 ). (b) Loop-unrolled form: the loop body is duplicated as block b20 , and b3 begins with y3 ← φ(y4 , y40 ). (Remaining figure content omitted.)
Fig. 6.5 Loop-closed (a) and loop-unrolled (b) forms of running example program, corresponding to
codes (6.22) and (6.23), respectively.
both (6.21) and (6.22) coincides with the dominance structure of the original
imperative code and its loop-closed SSA form.
We unroll? the loop by duplicating the body of f2 , without duplicating the
declaration of f3 :
function f1 () =
let v = 1, z = 8, y = 4
in function f2 (y ) =
function f3 (y ) = let w = y + v in w end
in let x = 5 + y , y = x × z , x = x − 1
in if x = 0 then f3 (y )
else function f20 (y ) =
(6.23)
let x = 5 + y , y = x × z , x = x − 1
in if x = 0 then f3 (y ) else f2 (y ) end
in f20 (y ) end
end
in f2 (y ) end
end
in f1 () end
Both calls to f3 are in the scope of the declaration of f3 and contain the appropriate
loop-closing arguments. In the SSA reading of this code—shown in
Figure 6.5b—the first instruction in b3 has turned into a non-trivial φ-node. As
expected, the parameters of this φ-node correspond to the two control-flow
arcs leading into b3 , one for each call site to f3 in code (6.23). Moreover, the call
and nesting structure of (6.23) is indeed in agreement with the control flow and
dominance structure of the loop-unrolled SSA representation.
The above example code excerpts where variables are not made distinct exhibit a
further pattern: the argument list of any call coincides with the list of formal pa-
rameters of the invoked function. This discipline is not enjoyed by functional pro-
grams in general, and is often destroyed by optimizing program transformations.
However, programs that do obey this discipline can be immediately converted to
imperative non-SSA form. Thus, the task of SSA destruction amounts
to converting a functional program with arbitrary argument lists into one where
argument lists and formal parameter lists coincide for each function. This can
be achieved by introducing additional let-bindings of the form let x = y in e end.
For example, a call f (v, z , y ) where f is declared as function f (x , y , z ) = e may be
converted to
let x = v, a = z , z = y , y = a in f (x , y , z )
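Introducing these let-bindings amounts to sequentializing a parallel copy, the “windmills” problem mentioned in Section 6.4. The following OCaml sketch is one standard way to do this, under the assumptions that all destinations are distinct and that the name tmp is fresh; it is meant as an illustration, not as a rendering of a particular algorithm from the literature.

(* A parallel copy is a list of (destination, source) pairs with pairwise distinct
   destinations. The result is an ordered list of bindings that can be read as
   nested "let dst = src in ..." around the rewritten call. *)
let sequentialize ?(tmp = "tmp") (copies : (string * string) list) =
  let pending = ref (List.filter (fun (d, s) -> d <> s) copies) in
  let emitted = ref [] in
  while !pending <> [] do
    (* A pending copy (d, s) may be emitted once no other pending copy still reads d. *)
    let blocked (d, _) = List.exists (fun (_, s') -> s' = d) !pending in
    match List.find_opt (fun c -> not (blocked c)) !pending with
    | Some ((d, s) as c) ->
        emitted := (d, s) :: !emitted;
        pending := List.filter (fun c' -> c' != c) !pending
    | None ->
        (* Only cycles remain: save the source of one copy into tmp and redirect
           that copy, which unblocks the copy targeting the saved variable. *)
        (match !pending with
         | (d, s) :: rest ->
             emitted := (tmp, s) :: !emitted;
             pending := (d, tmp) :: rest
         | [] -> ())
  done;
  List.rev !emitted

(* Example: for the call f (v, z, y) with formals (x, y, z),
   sequentialize [("x","v"); ("y","z"); ("z","y")] yields
   [("x","v"); ("tmp","z"); ("z","y"); ("y","tmp")], i.e., the bindings
   let x = v in let tmp = z in let z = y in let y = tmp in f (x, y, z),
   matching the bindings shown above with a in the role of tmp. *)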
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.3 Refined block sinking and loop nesting forests
For a block b with immediately dominated blocks b1 , . . . , bn , the dominance-based placement yields a block of function declarations for the fi , nested inside the declaration of f :
function f (. . .) = let . . . < body of b > . . . in
function f1 (. . .) = e1 ▷ body of b1 , with calls to b , bi
.
. (6.24)
.
and fn (. . .) = en ▷ body of bn , with calls to b , bi
in . . . < calls to b, bi from b > . . . end
Fig. 6.6 Example placement for reducible flow graphs. (a) Control-flow graph; (b) Overlay of CFG
(dashed edges) arcs between dominated blocks onto dominance graph (solid edges).
Consider the control-flow graph of Figure 6.6a and its enriched dominance tree shown in Figure 6.6b. A possible
(but not unique) ordering of the children of b is [b5 , b1 , b3 , b2 , b4 ], resulting in the
nesting shown in code (6.25).
The code respects the dominance relationship in much the same way as the naive
placement, but additionally makes f1 inaccessible from within e5 , and makes f3
inaccessible from within f1 or f5 . As the reordering does not move function
declarations inside each other (in particular: no function declaration is brought
into or moved out of the scope of the formal parameters of any other function)
the reordering does not affect the potential to subsequently perform parameter
dropping.
Declaring functions using λ-abstraction brings further improvements. This
enables us not only to syntactically distinguish between loops and non-recursive
control-flow structures using the distinction between let and letrec? present in
many functional languages, but also to further restrict the visibility of function
names. Indeed, while b3 is immediately dominated by b in the above example,
its only control-flow predecessors are b2 /g and b4 . We would hence like to make
the declaration of f3 local to the tuple (f2 , f4 ), i.e., invisible to f . This can be
achieved by combining let/letrec bindings with pattern matching, if we insert
the shared declaration of f3 between the declaration of the names f2 and f4 and
the λ-bindings of their formal parameters pi :
Fig. 6.7 Graphs G0 , G1 , and G2 over the nodes u , v , w, x , annotated with the loops ({u , v, w, x }, {u , v }) and ({w, x }, {w, x }) identified by successive steps of the construction (figure content omitted).
Figure 6.8a shows the CFG-enriched dominance tree of G0 . The body of loop
L 0 is easily identified as the maximal SCC, and likewise the body of L 1 once the
cycles (u , w ) and (x , v ) are broken by the removal of the back-edges w → u and
x → v.
The loop nesting forest resulting from Steensgaard’s construction is shown in
Figure 6.8b. Loops are drawn as ellipses decorated with the appropriate header
nodes, and nested in accordance with the containment relation B1 ⊂ B0 between
the bodies.
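For concreteness, here is a naive OCaml sketch of this construction (our own code and naming, with no claim to efficiency): strongly connected components are computed by plain reachability, the headers of a loop are those nodes of the component with a predecessor outside of it, the edges from the body to the headers are removed, and the construction recurses on the remaining subgraph.

(* Control-flow graph as an adjacency list: node -> successors. *)
type graph = (string * string list) list

let succs (g : graph) n = try List.assoc n g with Not_found -> []
let nodes (g : graph) = List.map fst g

(* All nodes reachable from n (including n itself). *)
let rec reachable g visited n =
  if List.mem n visited then visited
  else List.fold_left (reachable g) (n :: visited) (succs g n)

(* Strongly connected component of n: all m with n ->* m and m ->* n. *)
let scc g n =
  List.filter (fun m -> List.mem n (reachable g [] m)) (reachable g [] n)

(* A loop: its body, its headers, and the nested loops found inside it. *)
type loop = Loop of string list * string list * loop list

let rec loop_forest (g : graph) : loop list =
  let ns = nodes g in
  let is_cyclic c =
    List.length c > 1 || List.exists (fun m -> List.mem m (succs g m)) c
  in
  (* Loop bodies of this level: maximal SCCs containing a cycle. *)
  let bodies =
    List.sort_uniq compare
      (List.filter_map
         (fun n ->
            let c = List.sort compare (scc g n) in
            if is_cyclic c then Some c else None)
         ns)
  in
  List.map
    (fun body ->
       (* Headers: nodes of the body with a predecessor outside the body. *)
       let headers =
         List.filter
           (fun h ->
              List.exists
                (fun p -> (not (List.mem p body)) && List.mem h (succs g p))
                ns)
           body
       in
       (* Degenerate case (no entry from outside): make every node a header. *)
       let headers = if headers = [] then body else headers in
       (* Remove the edges from the body to its headers, then recurse. *)
       let inner =
         List.map
           (fun n ->
              (n,
               List.filter
                 (fun s -> List.mem s body && not (List.mem s headers))
                 (succs g n)))
           body
       in
       Loop (body, headers, loop_forest inner))
    bodies

(* On a graph shaped like Fig. 6.8 (entry -> {u, v}, u <-> w, w <-> x, v <-> x,
   plus loop exits), this yields an outer loop with headers [u; v] whose body
   contains an inner loop with headers [w; x], matching the forest of Fig. 6.8b. *)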
In the functional representation, a loop L = (B , {h1 , . . . , hn }) yields a function
declaration block for functions h1 , . . . , hn , with private declarations for the non-
headers from B \H. In our example, loop L 0 provides entry points for the headers
u and v but not for its non-headers w and x . Instead, the loop comprised of the
latter nodes, L 1 , is nested inside the definition of L 0 , in accordance with the loop
nesting forest.
Fig. 6.8 Illustration of Steensgaard’s construction of loop nesting forests: (a) CFG-enriched dominance
tree; (b) resulting loop nesting forest.
function entry(. . .) =
let . . . < body of entry > . . .
in letrec (u , v ) = ▷ define outer loop L 0 , with headers u, v
letrec (w, x ) = ▷ define inner loop L 1 , with headers w, x
let exit = λpexit . . . . < body of exit >
in ( λpw . . . . < body of w, with calls to u, x, and exit > . . .
(6.27)
, λpx . . . . < body of x, with calls to w, v, and exit > . . . )
end ▷ end of inner loop
in ( λpu . . . . < body of u, with call to w > . . .
, λpv . . . . < body of v, with call to x > . . . )
end ▷ end of outer loop
in . . . < calls from entry to u and v > . . .
By placing L 1 inside L 0 according to the scheme from code (6.26) and making
exit private to L 1 , we obtain the representation (6.27) which captures all the
essential information of Steensgaard’s construction. Effectively, the functional
reading of the loop nesting forest extends the earlier correspondence between
the nesting of individual functions and the dominance relationship to groups
of functions and basic blocks: loop L 0 dominates L 1 in the sense that any path
from entry to a node in L 1 passes through L 0 ; more specifically, any path from
entry to a header of L 1 passes through a header of L 0 .
In general, each step of Steensgaard’s construction may identify several loops,
as a CFG may contain several maximal SCCs. As the bodies of these SCCs are
necessarily non-overlapping, the construction yields a forest comprised of trees
shaped like the loop nesting forest in Figure 6.8b. As the relationship between
the trees is necessarily acyclic, the declarations of the function declaration tuples
corresponding to the trees can be placed according to the loop-extended notion
of dominance.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
6.4 Concluding remarks and further readings
Further readings
Shortly after the introduction of SSA, O’Donnell [222] and Kelsey [167] noted the
correspondence between let-bindings and points of variable declaration and its
extension to other aspects of program structure using continuation-passing style.
Appel [14, 13] popularized the correspondence using a direct-style representation,
building on his earlier experience with continuation-based compilation [11].
Continuations and low-level functional languages have been an object of
intensive study since their inception about four decades ago [303, 182]. For ret-
rospective accounts of the historical development, see Reynolds [255] and Wads-
worth [307]. Early studies of CPS and direct style include work by Reynolds and
Plotkin [253, 254, 238]. Two prominent examples of CPS-based compilers are
those by Sussman et al. [288] and Appel [11]. An active area of research con-
cerns the relative merit of the various functional representations, algorithms for
formal conversion between these formats, and their efficient compilation to ma-
chine code, in particular with respect to their integration with program analyses
and optimizing transformations [95, 252, 168]. A particularly appealing variant
is that of Kennedy [168], where the approach to explicitly name control-flow
points such as merge points is taken to its logical conclusion. By mandating all
control-flow points to be explicitly named, a uniform representation is obtained
that allows optimizing transformations to be implemented efficiently, avoiding
the administrative overhead to which some of the alternative approaches are
susceptible.
Occasionally, the term direct style refers to the combination of tail-recursive
functions and let-normal form, and the conditions on the latter notion are
strengthened so that only variables may appear as branch conditions. Varia-
tions of this discipline include administrative normal form (A-normal form,
ANF) [126], B-form [289], and SIL [294].
Closely related to continuations and direct-style representation are monadic
intermediate languages as used by Benton et al. [28] and Peyton-Jones et al. [232].
These partition expressions into categories of values and computations, similar
to the isolation of primitive terms in let-normal form [254, 238]. This allows
one to treat side-effects (memory access, IO, exceptions, etc.) in a uniform way,
following Moggi [208], and thus simplifies reasoning about program analyses
and the associated transformations in the presence of impure language features.
Lambda-lifting and dropping are well-known transformations in the func-
tional programming community, and are studied in-depth by Johnsson [162] and
Danvy et al. [96].
Rideau et al. [256] present an in-depth study of SSA destruction, including a
verified implementation in the proof assistant Coq for the “windmills” problem,
i.e., the task of correctly introducing φ-compensating assignments. The local
algorithm to avoid the lost-copy problem and swap problem identified by Briggs
et al. [54] was given by Beringer [29]. In this solution, the algorithm to break the
cycles is in line with the results of May [204].
We are not aware of previous work that transfers the analysis of loop nesting
forests to the functional setting, or of loop analyses in the functional world that
correspond to loop nesting forests. Our discussion of Steensgaard’s construction
was based on a classification of loop nesting forests by Ramalingam [245], which
also served as the source of the example in Figure 6.7. Two alternative construc-
tions discussed by Ramalingam are those by Sreedhar, Gao and Lee [276], and
Havlak [147]. To us, it appears that a functional reading of Sreedhar, Gao and
Lee’s construction would essentially yield the nesting mentioned at the begin-
ning of Section 6.3. Regarding Havlak’s construction, the fact that entry points
of loops are not necessarily classified as headers appears to make an elegant
representation in functional form at least challenging.
Extending the syntactic correspondences between SSA and functional lan-
guages, similarities may be identified between their characteristic program anal-
ysis frameworks, data-flow analyses and type systems. Chakravarty et al. [66]
prove the correctness of a functional representation of Wegmann and Zadeck’s
SSA-based sparse conditional constant propagation algorithm [311]. Beringer et
al. [30] consider data-flow equations for liveness and read-once variables, and for-
mally translate their solutions to properties of corresponding typing derivations.
Laud et al. [185] present a formal correspondence between data-flow analyses
and type systems but consider a simple imperative language rather than SSA.
At present, intermediate languages in functional compilers do not provide
syntactic support for expressing nesting forests directly. Indeed, most functional
compilers do not perform advanced analyses of nested loops. As an exception to
this rule, the MLton compiler (http://mlton.org) implements Steensgaard’s
algorithm for detecting loop nesting forests, leading to a subsequent analysis of
the loop unrolling and loop switching transformations [287].
Concluding remarks
relationship rather than characterizing the values held in the variables at runtime.
Noting a correspondence between the types associated with a variable and the
sets of def-use paths, the authors admit types to be formulated over type variables
whose introduction and use corresponds to the introduction of φ-nodes in SSA.
Finally, Pop et al.’s model [242] dispenses with control flow entirely and instead
views programs as sets of equations that model the assignment of values to
variables in a style reminiscent of partial recursive functions. This model is
discussed in more detail in Chapter 10.
φ Part II
Analysis
CHAPTER 7
Introduction M. Schordan
F. Rastello
The goal of this chapter is to provide an overview of the benefits of SSA for
program analyses and optimizations. We illustrate how SSA makes analyses
more convenient because of its similarity to functional programs. Technically,
it is the def-use chains explicitly expressed through the SSA graph, together with
the static single information property, that make SSA so convenient.
There are several analyses that propagate information through the SSA graph.
The “propagation engine” chapter (Chapter 8) gives a fairly general description of the
mechanism.
TODO: refine following key points
• shows how SSA form facilitates the design and implementation of analyses
equivalent to traditional data flow analyses.
• The SSA property allows analysis time and memory consumption to be reduced.
• The presented propagation engine is an extension of the well-known approach
by Wegman and Zadeck for sparse conditional constant propagation.
• The basic algorithm is not limited to constant propagation.
• It allows a large class of data-flow problems to be solved more efficiently than
the iterative work list algorithm for solving data-flow equations. The basic idea
is to directly propagate information computed at the unique definition of a
variable to all its uses.
• Data-flow analyses based on SSA form rely on a specialized program representation
based on SSA graphs, which resemble traditional use-def chains.
• BUT: not all data flow analyses can be modeled.
As already mentioned in Chapter ??, SSA form can come in different flavors.
The vanilla one is strict SSA or, equivalently, SSA form with the dominance
property. The most common SSA construction algorithm exploits this dominance
property by two means: first, it allows join sets for φ-placement to be computed
in a very efficient way using the dominance frontier; second, it allows variable
renaming to be performed using a folding scheme along the dominance tree. The
notions of dominance and dominance frontier are two structural properties that
make SSA form singular for compiler analysis and transformations.
Those two aspects are illustrated in this part through two chapters: Chapter 9
shows how the loop nesting forest and the dominance property can be exploited
to devise a very efficient liveness analysis; Chapter 11 shows how the dominance
frontier, which allows a minimal number of φ-functions to be inserted during SSA
construction, can also be used to minimize redundant computations.
Chapter 10 illustrates how capturing properties of the SSA graph itself (circuits)
can be used to determine induction variables.
TODO: refine following key points
• extraction of the reducible loop tree can be done on the SSA graph itself
• the induction variable analysis is based on the detection of self references in
the SSA representation and on its characterization
• algorithm translates the SSA representation into a representation of polyno-
mial functions, describing the sequence of values that SSA variables hold
during the execution of loops.
• The number of iterations is computed as the minimum solution of a polynomial
inequality with integer solutions, also called a Diophantine inequality.
A central task of compilers is to optimize a given input program such that the
resulting code is more efficient in terms of execution time, code size, or some
other metric of interest. However, in order to perform these optimizations, typi-
cally some form of program analysis is required to determine if a given program
transformation is applicable, to estimate its profitability, and to guarantee its
correctness.
Data-flow analysis is a simple yet powerful approach to program analysis that
is utilized by many compiler frameworks and program analysis tools today. We
will introduce the basic concepts of traditional data-flow analysis in this chapter
and will show how the static single assignment form (SSA) facilitates the design
and implementation of equivalent analyses. We will also show how the SSA
property allows us to reduce the compilation time and memory consumption of
the data-flow analyses that this program representation supports.
Traditionally, data-flow analysis is performed on a control-flow graph repre-
sentation (CFG) of the input program. Nodes in the graph represent operations
and edges represent the potential flow of program execution. Information on
certain program properties is propagated among the nodes along the control-flow
edges until the computed information stabilizes, i.e., no new information can be
inferred from the program.
The propagation engine presented in the following sections is an extension of
the well known approach by Wegman and Zadeck for sparse conditional constant
propagation (also known as SSA-CCP). Instead of using the CFG they represent
the input program as an SSA graph as defined in Chapter 14: operations are again
represented as nodes in this graph, however, the edges represent data dependen-
cies instead of control flow. This representation allows a selective propagation
of program properties among data dependent graph nodes only. As before, the
processing stops when the information associated with the graph nodes stabi-
lizes. The basic algorithm is not limited to constant propagation and can also be
applied to solve a large class of other data-flow problems efficiently. However,
not all data-flow analyses can be modeled. In this chapter, we will also investigate
the limitations of the SSA-based approach.
The remainder of this chapter is organized as follows. First, the basic concepts
of (traditional) data-flow analysis are presented in Section 8.1. This will provide
the theoretical foundation and background for the discussion of the SSA-based
propagation engine in Section 8.2. We then provide an example of a data-flow
analysis that can be performed efficiently by the aforementioned engine, namely
copy propagation in Section 8.3.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.1 Preliminaries
Data-flow analysis is at the heart of many compiler transformations and optimiza-
tions, but also finds application in a broad spectrum of analysis and verification
tasks in program analysis tools such as program checkers, profiling tools, and
timing analysis tools. This section gives a brief introduction to the basics of data-
flow analysis. Due to space considerations, we cannot cover this topic in full
depth.
As noted before, data-flow analysis derives information from certain interest-
ing program properties that may help to optimize the program. Typical examples
of interesting properties are: the set of live variables at a given program point, the
particular constant value a variable may take, or the set of program points that
are reachable at run-time. Liveness information, for example, is critical during
register allocation, while the two latter properties help to simplify computations
and to identify dead code.
The analysis results are gathered from the input program by propagating
information among its operations considering all potential execution paths. The
propagation?is typically performed iteratively until the computed results stabilize.
Formally, a data-flow problem can be specified using a monotone framework
that consists of:
• a complete lattice representing the property space,
• a flow graph resembling the control flow of the input program, and
• a set of transfer functions modeling the effect of individual operations
on the property space.
Property Space: A key concept for data-flow analysis is the representation
of the property space via partially ordered sets (L , ⊑), where L represents some
interesting program property and ⊑ represents a reflexive, transitive, and anti-
symmetric relation. Using the ⊑ relation, upper and lower bounds, as well as least
upper and greatest lower bounds, can be defined for subsets of L .
A particularly interesting class of partially ordered sets are complete lattices, where
all subsets have a least upper bound as well as a greatest lower bound. These
bounds are unique and are denoted by ⊔ and ⊓, respectively. In the context
of program analysis the former is often referred to as the join operator, while
the latter is termed the meet operator. Complete lattices have two distinguished
elements, the least element and the greatest element, often denoted by ⊥ and
⊤, respectively.
An ascending chain is a totally ordered subset {l1 , . . . , ln } of a complete lattice.
A chain is said to stabilize if there exists an index m , where ∀i > m : li = lm . An
analogous definition can be given for descending chains.
Program Representation: The functions of the input program are represented point
as control-flow graphs, where the nodes represent operations, or instructions,
and edges denote the potential flow of execution at run-time. Data-flow infor-
mation is then propagated from one node to another adjacent node along the
respective graph edge using in and out sets associated with every node. If there
exists only one edge connecting two nodes, data can be simply copied from one
set to the other. However, if a node has multiple incoming edges, the information
from those edges has to be combined using the meet or join operator.
Sometimes, it is helpful to reverse the flow graph to propagate information,
i.e., reverse the direction of the edges in the control-flow graph. Such analyses are
termed backward analyses, while those using the regular flow graph are forward
analyses.
Transfer Functions:?Aside from the control flow, the operations of the pro-
gram need to be accounted for during analysis. Usually these operations change
the way data is propagated from one control-flow node to the other. Every oper-
ation is thus mapped to a transfer function, which transforms the information
available from the in set of the flow graph node of the operation and stores the
result in the corresponding out set.
Putting all those elements together—a complete lattice, a flow graph, and a set of
transfer functions—yields an instance of a monotone framework. This framework
describes a set of data-flow equations whose solution will ultimately converge
to the solution of the data-flow analysis. A very popular and intuitive way to
solve these equations is to compute the maximal (minimal) fixed point (MFP)
using an iterative work list algorithm. The work list contains edges of the flow
graph that have to be revisited. Visiting an edge consists of first combining the
information from the out set of the source node with the in set of the target
node, using the meet or join operator, then applying the transfer function of
the target node. The obtained information is then propagated to all successors
of the target node by appending the corresponding edges to the work list. The
algorithm terminates when the data-flow information stabilizes, as the work list
then becomes empty.
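The following OCaml sketch of such an iterative work-list solver (for a forward problem, with all type and field names ours) may help to make the above concrete; the only deviation from the prose is that an edge is re-appended only when the information at its target actually changed, which is what eventually empties the work list.

(* An instance of a monotone framework over nodes 0 .. num_nodes - 1. *)
type 'a framework = {
  num_nodes : int;
  edges     : (int * int) list;      (* control-flow edges (src, dst) *)
  init      : int -> 'a;             (* initial in-information per node *)
  join      : 'a -> 'a -> 'a;        (* combine operator (join or meet) *)
  equal     : 'a -> 'a -> bool;
  transfer  : int -> 'a -> 'a;       (* effect of the node's operation *)
}

(* Iterative MFP computation with a work list of flow edges. *)
let solve fw =
  let inp = Array.init fw.num_nodes fw.init in
  let out = Array.init fw.num_nodes (fun n -> fw.transfer n inp.(n)) in
  let worklist = Queue.create () in
  List.iter (fun e -> Queue.add e worklist) fw.edges;
  while not (Queue.is_empty worklist) do
    let (src, dst) = Queue.pop worklist in
    (* Combine the out-set of the source with the in-set of the target ... *)
    let combined = fw.join inp.(dst) out.(src) in
    if not (fw.equal combined inp.(dst)) then begin
      inp.(dst) <- combined;
      (* ... apply the target's transfer function ... *)
      out.(dst) <- fw.transfer dst combined;
      (* ... and revisit the target's outgoing edges. *)
      List.iter (fun (s, d) -> if s = dst then Queue.add (s, d) worklist) fw.edges
    end
  done;
  (inp, out)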
A single flow edge can be appended several times to the work list in the course
of the analysis. It may even happen that an infinite feedback loop prevents the
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.2 Data-flow Propagation under SSA Form
SSA form allows us to solve a large class of data-flow problems more efficiently
than the iterative work list algorithm presented previously. The basic idea is to
directly propagate information computed at the unique definition of a variable to
all its uses. In this way, intermediate program points that neither define nor use
the variable of interest do not have to be taken into consideration, thus reducing
memory consumption and compilation time.
Programs in SSA form exhibit φ-operations placed at join points of the original
CFG. In the following, we assume that possibly many φ-operations are associated
with the corresponding CFG nodes at those join points.
Data-flow analyses under SSA form rely on a specialized program representa-
tion based on SSA graphs, which resemble traditional def-use chains and simplify
the propagation of data-flow information. The nodes of an SSA graph correspond
to the operations of the program, including the φ-operations that are represented
by dedicated nodes in the graph. The edges of the graph connect the unique
definition of a variable with all its uses, i.e., edges represent true dependencies.
(a) SSA pseudo code:
1: y1 ← 6
2: if (. . . ) then
3: x1 ← 4
else
4: x2 ← 5
5: x3 ← φ(x1 , x2 )
6: z 1 ← x3 + y1
(b) SSA graph and (c) control-flow graph: content omitted.
Fig. 8.1 Example program and its SSA graph
Besides the data dependencies, the SSA graph captures the relevant join points
of the CFG of the program. A join point is relevant for the analysis whenever the
value of two or more definitions may reach a use by passing through that join.
The SSA form properties ensure that a φ-operation is placed at the join point
and that any use of the variable that the φ-function defines has been properly
updated to refer to the correct name.
Consider for example the code excerpt shown in Figure 8.1, along with its
corresponding SSA graph and CFG. Assume we are interested in propagating
information from the assignment of variable y1 , at the beginning of the code,
down to its unique use at the end. The traditional CFG representation causes
the propagation to pass through several intermediate program points. These
program points are concerned only with computations of the variables x1 , x2 ,
and x3 , and are thus irrelevant for y1 . The SSA graph representation, on the other
hand, propagates the desired information directly from definition to use sites,
without any intermediate step. At the same time, we also find that the control-
flow join following the conditional is properly represented by the φ-operation
defining the variable x3 in the SSA graph.
Even though the SSA graph captures data dependencies and the relevant join
points in the CFG, it lacks information on other control dependencies. However,
analysis results can often be improved significantly by considering the additional
information that is available from the control dependencies in the CFG. As an ex-
ample consider once more the code of Figure 8.1, and assume that the condition
associated with the if-statement is known to be false for all possible program
executions. Consequently, the φ-operation will select the value of x2 in all cases,
which is known to be of constant value 5. However, due to the shortcomings of
the SSA graph, this information cannot be derived. It is thus important to use
both the control-flow graph and the SSA graph during data-flow analysis in order
to obtain the best possible results.
The algorithm is shown in Algorithm 8.1 and processes two work lists, the
CFGWorkList, which contains edges of the control-flow graph, and the
SSAWorkList, which contains edges from the SSA graph. It proceeds by removing the
top element of either of those lists and processing the respective edge. Through-
out the main algorithm, operations of the program are visited to update the work
lists and propagate information using Algorithm 8.2.
The CFGWorkList is used to track edges of the CFG that were encountered to
be executable, i.e., where the data-flow analysis cannot rule out that a program
execution traversing the edge exists. Once the algorithm has determined that
a CFG edge is executable, it will be processed by Step 3 of the main algorithm.
First, all φ-operations of its target node need to be reevaluated due to the fact
that Algorithm 8.2 discarded the respective operands of the φ-operations so
far—because the control-flow edge was not yet marked executable. Similarly,
the operation of the target node has to be evaluated when the target node is
encountered to be executable for the first time, i.e., the currently processed
control-flow edge is the first of its incoming edges that is marked executable. Note
that this is only required the first time the node is encountered to be executable,
due to the processing of operations in Step 4b, which thereafter triggers the
reevaluation automatically when necessary through the SSA graph.
Regular operations as well as φ-operations are visited by Algorithm 8.2 when
the corresponding control-flow graph node has become executable, or when-
ever the data-flow information of one of their predecessors in the SSA graph
changed. At φ-operations, the information from multiple control-flow paths is
combined using the usual meet or join operator. However, only those operands
where the associated control-flow edge is marked executable are considered.
Conditional branches are handled by examining their conditions based on the
data-flow information computed so far. Depending on whether those conditions
are satisfiable or not, control-flow edges are appended to the CFGWorkList to
ensure that all reachable operations are considered during the analysis. Finally,
(Figure content omitted: in (a) the analysis derives (x1 , 4), (x2 , 5), and (z 1 , ⊥); in (b) only (x2 , 5) is considered and (z 1 , 11) is derived.)
(a) All code reachable (b) With unreachable code
Fig. 8.2 Sparse conditional data-flow propagation using SSA graphs
all regular operations are processed by applying the relevant transfer function
and possibly propagating the updated information to all uses by appending the
respective SSA graph edges to the SSAWorkList.
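The distinctive step, combining at a φ-operation only those operands whose incoming control-flow edge is already marked executable, can be sketched as follows for the constant-propagation lattice (a minimal OCaml illustration with names of our choosing):

(* Constant-propagation lattice: not yet known, a known constant, or not constant. *)
type value = Top | Const of int | Bot

let meet a b =
  match a, b with
  | Top, x | x, Top -> x
  | Const i, Const j when i = j -> Const i
  | _ -> Bot

(* Each φ-operand carries the SSA name and the executability of its CFG edge. *)
type phi_operand = { var : string; edge_executable : bool }

(* Visit a φ-operation: operands on non-executable edges are simply skipped. *)
let visit_phi (lookup : string -> value) (operands : phi_operand list) : value =
  List.fold_left
    (fun acc op -> if op.edge_executable then meet acc (lookup op.var) else acc)
    Top operands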
As an example, consider the program shown in Figure 8.1 and the constant
propagation problem. First, assume that the condition of the if-statement can-
not be statically evaluated, we thus have to assume that all its successors in
the CFG are reachable. Consequently, all control-flow edges in the program will
eventually be marked executable. This will trigger the evaluation of the constant
assignments to the variables x1 , x2 , and y1 . The transfer functions immediately
yield that the variables are all constant, holding the values 4, 5, and 6 respectively.
This new information will trigger the reevaluation of the φ-operation of variable
x3 . As both of its incoming control-flow edges are marked executable, the com-
bined information yields 4 ⊓ 5 = ⊥, i.e., the value is known not to be a particular
constant value. Finally, the assignment to variable z 1 is also reevaluated, but
the analysis shows that its value is not a constant, as depicted in Figure 8.2a. If,
however, the if-condition is known to be false for all possible program executions
a more precise result can be computed, as shown in Figure 8.2b. Neither the
control-flow edge leading to the assignment of variable x1 nor its outgoing edge
leading to the φ-operation of variable x3 are marked executable. Consequently,
the reevaluation of the φ-operation considers the data-flow information of its
second operand x2 only, which is known to be constant. This enables the analysis
to show that the assignment to variable z 1 is, in fact, constant as well.
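In terms of the visit_phi sketch above, the two scenarios of Figure 8.2 differ only in the executability flag of the operand coming from the x1 branch (env is a hypothetical environment holding the previously computed facts):

let env = function "x1" -> Const 4 | "x2" -> Const 5 | _ -> Top

(* (a) Both incoming edges executable: x3 = 4 ⊓ 5 = Bot, so z1 is not constant. *)
let x3_all_reachable =
  visit_phi env [ { var = "x1"; edge_executable = true };
                  { var = "x2"; edge_executable = true } ]

(* (b) The edge from the x1 branch is not executable: only x2 is considered,
   so x3 = Const 5 and z1 evaluates to the constant 5 + 6 = 11. *)
let x3_branch_dead =
  visit_phi env [ { var = "x1"; edge_executable = false };
                  { var = "x2"; edge_executable = true } ]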
8.2.3 Discussion
During the course of the propagation algorithm, every edge of the SSA graph is
processed at least once, whenever the operation corresponding to its definition
is found to be executable. Afterward, an edge can be revisited several times
depending on the height h of the lattice representing the property space of the
analysis. On the other hand, edges of the control-flow graph are processed at most
once. This leads to an upper bound in execution time of O (|ESSA |·h +|ECFG |), where
ESSA and ECFG represent the edges of the SSA graph and the control-flow graph
respectively. The size of the SSA graph increases with respect to the original non-
SSA program. Measurements indicate that this growth is linear, yielding a bound
that is comparable to the bound of traditional data-flow analysis. However, in
practice the SSA-based propagation engine outperforms the traditional approach.
This is due to the direct propagation from the definition of a variable to its uses,
without the costly intermediate steps that have to be performed on the CFG. The
overhead is also reduced in terms of memory consumption: instead of storing
the in and out sets capturing the complete property space on every program
point, it is sufficient to associate every node in the SSA graph with the data-flow
information of the corresponding variable only, leading to considerable savings
in practice.
8.2.4 Limitations
Unfortunately, the presented approach also has its limitations, because of the
exclusive propagation of information between data-dependent operations. This
prohibits the modeling of data-flow problems that propagate information to
program points that are not directly related by either a definition or a use of a
variable.
Consider, for example, the problem of available expressions that often occurs
in the context of redundancy elimination. An expression is available at a given
program point when the expression is computed and not modified thereafter
on all paths leading to that program point. In particular, this might include
program points that are independent from the expression and its operands, i.e.,
neither define nor use any of its operands. The SSA graph does not cover those
points, as it propagates information directly from definitions to uses without any
intermediate steps.
TODO: flo: rework this, saying that this is precisely a misconception. Furthermore,
data-flow analysis using SSA graphs is limited to forward problems. Due to the
structure of the SSA graph, it is not possible to simply reverse the edges in the
graph as it is done with flow graphs. For one, this would invalidate the nice
property of having a single source for incoming edges of a given variable, as vari-
ables typically have more than one use. In addition, φ-operations are placed at
join points with respect to the forward control flow and thus do not capture join
points in the reversed control-flow graph. SSA graphs are consequently not suited
to model backward problems in general. There are, however, program represen-
tations akin to the SSA format that can handle backward analyses. Chapter 13
gives an overview of such representations.
Fig. 8.3 (a): 1: x1 ← . . . 2: y1 ← x1 3: z 1 ← y1 . (b): 1: y1 ← . . . 2: x1 ← y1 3: x2 ← y1 4: x3 ← φ(x1 , x2 ).
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.3 Example—Copy Propagation
Even though data-flow analysis based on SSA graphs has its limitations, it is still
a useful and effective solution for interesting problems, as we will show in the
following example. Copy propagation?under SSA form is, in principle, very simple.
Given the assignment x ← y , all we need to do is to traverse the immediate uses
of x and replace them with y , thereby effectively eliminating the original copy
operation. However, such an approach will not be able to propagate copies past
φ-operations, particularly those in loops. A more powerful approach is to split
copy propagation into two phases: Firstly, a data-flow analysis is performed to
find copy-related variables throughout the program; Secondly, a rewrite phase
eliminates spurious copies and renames variables.
Fig. 8.4 (content omitted): a loop with operations 1: x1 ← . . . , 2: x2 ← φ(x1 , x4 ), 3: if (x2 = x1 ), 4: x3 ← . . . , 5: x4 ← φ(x3 , x2 ), 6: if (. . .), connected by control-flow edges e1 , . . . , e6 .
The analysis for copy propagation can be described as the problem of propagating
the copy-of value? of variables. Given a sequence of copies as shown in
Figure 8.3a, we say that y1 is a copy of x1 and z 1 is a copy of y1 . The problem
with this representation is that there is no apparent link from z 1 to x1 . In order to
handle transitive copy relations, all transfer functions operate on copy-of values
instead of the direct source of the copy. If a variable is not found to be a copy
of anything else, its copy-of value is the variable itself. For the above example,
this yields that both y1 and z 1 are copies of x1 , which in turn is a copy of itself.
The lattice of this data-flow problem is thus similar to the lattice used previously
for constant propagation. The lattice elements correspond to variables of the
program instead of integer numbers. The least element of the lattice represents
the fact that a variable is a copy of itself.
Similarly, we would like to obtain the result that x3 is a copy of y1 for the
example of Figure 8.3b. This is accomplished by choosing the join operator such
that a copy relation is propagated whenever the copy-of values of all the operands
of the φ-operation match. When visiting the φ-operation for x3 , the analysis
finds that x1 and x2 are both copies of y1 and consequently propagates that x3 is
also a copy of y1 .
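A compact OCaml rendering of this copy-of lattice and of the φ-join just described (names ours; the handling of executable edges from Section 8.2 is omitted for brevity):

(* The copy-of value of a variable: not yet known, or some representative
   variable; a variable that is no copy at all is a "copy of itself". *)
type copy_val = Unknown | CopyOf of string

(* Transfer function for a copy x <- y: follow y's copy-of value, so that
   transitive copies (z1 <- y1 <- x1) all map to the same representative. *)
let transfer_copy (copy_of : string -> copy_val) (y : string) : copy_val =
  match copy_of y with
  | Unknown -> Unknown          (* y's definition not visited yet: stay optimistic *)
  | v -> v

(* Join at a φ-operation defining dest: the copy relation survives only if the
   copy-of values of all already visited operands agree; otherwise dest is a
   copy of itself only. *)
let join_phi (copy_of : string -> copy_val) (dest : string) (operands : string list) =
  match List.filter (fun v -> v <> Unknown) (List.map copy_of operands) with
  | [] -> Unknown
  | v :: rest -> if List.for_all (( = ) v) rest then v else CopyOf dest

(* For Figure 8.3a: copy_of x1 = CopyOf "x1", hence y1 and z1 both end up as
   CopyOf "x1". For Figure 8.3b: x1 and x2 are CopyOf "y1", so the φ for x3
   joins to CopyOf "y1" as well. *)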
The next example shows a more complex situation where copy relations are
obfuscated by loops—see Figure 8.4. Note that the actual visiting order depends
on the shape of the CFG and immediate uses, in other words, the ordering used
here is meant for illustration only. Processing starts at the operation labeled 1,
with both work lists empty and the data-flow information > associated with all
variables.
1. Assuming that the value assigned to variable x1 is not a copy, the data flow
information for this variable is lowered to ⊥, the SSA edges leading to opera-
tions 2 and 3 are appended to the SSAWorkList, and the control-flow graph
edge e1 is appended to the CFGWorkList.
2. Processing the control-flow edge e1 from the work list causes the edge to be
marked executable and the operations labeled 2 and 3 to be visited. Since
edge e5 is not yet known to be executable, the processing of the φ-operation
yields a copy relation between x2 and x1 . This information is utilized in order
to determine which outgoing control-flow graph edges are executable for
the conditional branch. Examining the condition shows that only edge e3 is
executable and thus needs to be added to the work list.
3. Control-flow edge e3 is processed next and marked executable for the first
time. Furthermore, the φ-operation labeled 5 is visited. Due to the fact that
edge e4 is not known to be executable, this yields a copy relation between
x4 and x1 (via x2 ). The condition of the branch labeled 6 cannot be analyzed
and thus causes its outgoing control flow edges e5 and e6 to be added to the
work list.
4. Now, control-flow edge e5 is processed and marked executable. Since the
target operations are already known to be executable, only the φ-operation
is revisited. However, variables x1 and x4 have the same copy-of value x1 ,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8.4 Further Readings
Traditional data-flow analysis is well established and well described in numer-
ous papers. The book by Nielson, Nielson, and Hankin [216] gives an excellent
introduction to the theoretical foundations and practical applications. For re-
ducible flow graphs the order in which operations are processed by the work list
algorithm can be optimized [149, 164, 216], allowing tighter complexity bounds
to be derived. However, relying on reducibility is problematic because the flow graphs
are often not reducible even for proper structured languages. For instance, re-
versed control-flow graphs for backward problems can be—and in fact almost
always are—irreducible even for programs with reducible control-flow graphs,
for instance because of loops with multiple exits. Furthermore, experiments have
shown that the tighter bounds do not necessarily lead to improved compilation
times [85].
Apart from computing a fixed point (MFP) solution, traditional data-flow
equations can also be solved using a more powerful approach called the meet
over all paths (MOP) solution, which computes the in data-flow information for
a basic block by examining all possible paths from the start node of the control-
flow graph. Even though more powerful, computing the MOP solution is often
harder or even undecidable [216]. Consequently, the MFP solution is preferred
in practice.
The sparse propagation engine [220, 311], as presented in the chapter, is based
on the underlying properties of the SSA form. Other intermediate representations
offer similar properties. Static Single Information form (SSI) [271] allows both
backward and forward problems to be modeled by introducing σ operations,
which are placed at program points where data-flow information for backward
problems needs to be merged [270]. (TODO: florent: revise the discussion of
backward problems after the SSI chapter.) Chapter 13 provides additional information on the use of SSI
form for static program analysis. Bodík uses an extended SSA form, e -SSA, to
eliminate array bounds checks [36]. Ruf [260] introduces the value dependence
graph, which captures both control and data dependencies. He derives a sparse
representation of the input program, which is suited for data-flow analysis, using
a set of transformations and simplifications.
The sparse evaluation graph by Choi et al. [71] is based on the same basic idea
as the approach presented in this chapter: intermediate steps are eliminated
by by-passing irrelevant CFG nodes and merging the data-flow information
only when necessary. Their approach is closely related to the placement of φ-
operations and similarly relies on the dominance frontier during construction.
A similar approach, presented by Johnson and Pingali [160], is based on single-
entry/single-exit regions. The resulting graph is usually less sparse, but is also
less complex to compute. Ramalingam [247] further extends these ideas and
introduces the compact evaluation graph, which is constructed from the initial
CFG using two basic transformations. The approach is superior to the sparse
representations by Choi et al. as well as the approach presented by Johnson and
Pingali.
The previous approaches derive a sparse graph suited for data-flow analy-
sis using graph transformations applied to the CFG. Duesterwald et al. [111]
instead examine the data-flow equations, eliminate redundancies, and apply
simplifications to them.
CHAPTER 9
Liveness B. Boissinot
F. Rastello
This chapter illustrates the use of strict SSA properties to simplify and accelerate
liveness analysis, which determines for all variables the set of program points
where they are live, i.e., their values are potentially used by subsequent opera-
tions. Liveness information is essential to solve storage assignment problems,
eliminate redundancies, and perform code motion. For instance, optimizations
like software pipelining, trace scheduling, register-sensitive redundancy elimina-
tion (see Chapter 11), if-conversion (see Chapter 20), as well as register allocation
(see Chapter 22) heavily rely on liveness information.
Traditionally, liveness information is obtained by data-flow analysis?: liveness
sets are computed for all basic blocks and variables simultaneously by solving
a set of data-flow equations. These equations are usually solved by an iterative
algorithm, propagating information backwards through the control-flow graph
(CFG) until a fixed point is reached and the liveness sets stabilize. The number
of iterations depends on the control-flow structure of the considered program,
more precisely on the structure of its loops.
In this chapter, we show that, for strict SSA-form programs, the live-range of a
variable has valuable properties that can be expressed in terms of loop nesting
forest of the CFG and its corresponding directed acyclic graph, the forward-CFG.
Informally speaking, and restricted to reducible CFGs, those properties for a
variable v are:
• v is live at a program point q if and only if v is live at the entry h of the largest
loop/basic block (highest node in the loop nesting forest) that contains q
but not the definition of v .
• v is live at h if and only if there is a path in the forward-CFG from h to a use
of v that does not contain the definition.
A direct consequence of this property is the possible design of a data-flow
algorithm that computes liveness sets without the requirement of any iteration
to reach a fixed point: at most two passes over the CFG are necessary. The first
pass, very similar to traditional data-flow analysis, computes partial liveness sets
by traversing the forward-CFG backwards. The second pass refines the partial
liveness sets and computes the final solution by propagating forward along the
loop nesting forest. For the sake of clarity, we first present the algorithm for
reducible CFGs. Irreducible CFGs can be handled with a slight variation of the
algorithm, with no need to modify the CFG itself.
Another approach to liveness analysis more closely follows the classical defini-
tion of liveness: a variable is live at a program point q if q belongs to a path of the
CFG leading from a definition of that variable to one of its uses without passing
through another definition of the same variable. Therefore, the live-range?of a
variable can be computed using a backward traversal starting on its uses and
stopping when reaching its definition (unique under SSA).
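A minimal OCaml sketch of this per-variable path exploration, at basic-block granularity and ignoring the refinements needed for φ-operands and for uses inside the defining block itself (all names are ours):

(* Mark v live by walking CFG predecessors backwards from each use of v,
   stopping at the (unique) defining block. Returns the blocks where v is
   live-in and live-out, as hash tables used as sets. *)
let live_sets ~(preds : int -> int list) ~(def_block : int) ~(use_blocks : int list) =
  let live_in = Hashtbl.create 16 and live_out = Hashtbl.create 16 in
  let rec walk_up b =
    if b <> def_block && not (Hashtbl.mem live_in b) then begin
      Hashtbl.replace live_in b ();
      List.iter
        (fun p ->
           Hashtbl.replace live_out p ();   (* v is live-out of every predecessor *)
           walk_up p)
        (preds b)
    end
  in
  List.iter walk_up use_blocks;
  (live_in, live_out)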
One application of the properties of live-ranges under strict SSA-form is the
design of a simple liveness check algorithm. In contrast to classical data-flow
analyses, liveness check does not provide the set of variables live at a block, but
its characteristic function. Liveness check provides a query system to answer
questions such as “Is variable v live at location q ?” Its main features are:
1. The algorithm itself consists of two parts, a pre-computation part, and an
online part executed at each liveness query. It is not based on setting up and
subsequently solving data-flow equations;
2. The pre-computation is independent of variables, it only depends on the
structure of the control-flow graph; Hence, pre-computed information re-
mains valid upon adding or removing variables or their uses;
3. An actual query uses the def-use chain of the variable in question and deter-
mines the answer essentially by testing membership in pre-computed sets
of basic blocks.
We will first need to repeat basic definitions relevant in our context and pro-
vide the theoretical foundations in the next section, before presenting multiple
algorithms to compute liveness sets: The two-pass data-flow algorithm in Sec-
tion 9.2 and the algorithms based on path-exploration in Section 9.4. Finally, we
present the liveness check algorithm in Section 9.3.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.1 Definitions
Liveness? is a property relating program points to sets of variables which are
considered to be live at these program points. Intuitively, a variable is considered
live at a given program point when its value will be used in the future of any
dynamic execution. Statically, liveness can be approximated by following paths
backwards on the control-flow graph, connecting the uses of a given variable to
its definitions—or, in the case of SSA forms, to its unique definition. The variable
is said to be live at all program points along these paths. For a CFG node q ,
representing an instruction or a basic block, a variable v is live-in at q if there
is a path, not containing the definition of v, from q to a node where v is used
(including q itself). It is live-out at q if it is live-in at some successor of q .
The computation of live-in and live-out sets at the entry and the exit of basic
blocks is usually termed liveness analysis. It is indeed sufficient to consider only
these sets at basic block boundaries since liveness within a basic block is trivial to
recompute from its live-out set with a backward traversal of the block (whenever
the definition of a variable is encountered, it is pruned from the live-out set).
Live-ranges are closely related to liveness. Instead of associating program points
with sets of live variables, the live-range of a variable specifies the set of program
points where that variable is live. Live-ranges of programs under strict SSA form
exhibit certain useful properties (see Chapter 2), some of which can be exploited
for register allocation (see Chapter 22).
The special behavior of φ-functions often causes confusion on where exactly
its operands are actually used and defined. For a regular operation, variables are
used and defined where the operation takes place. However, the semantics?of
φ-functions (and in particular the actual place of φ-uses) should be defined
carefully, especially when dealing with SSA destruction. In algorithms for SSA
destruction (see Chapter 21), a use in a φ-function is considered live somewhere
inside the corresponding predecessor block, but, depending on the algorithm
and, in particular, the way copies are inserted, it may or may not be considered
as live-out for that predecessor block. Similarly, the definition of a φ-function
is always considered to be at the beginning of the block, but, depending on the
algorithm, it may or may not be marked as live-in for the block. To make the
description of algorithms easier, we follow the same definition as the one used in
Chapter 21, Section 21.2: For a φ-function a 0 = φ(a 1 , . . . , a n ) in block B0 , where
a i comes from block Bi :
• Its definition-operand?is considered to be at the entry of B0 , in other words
variable a 0 is live-in of B0 ;
• Its use-operands are at the exit of the corresponding predecessor basic blocks,
in other words, variable a i is live-out of basic block Bi .
This corresponds to placing a copy a 0 ← a i on each edge from Bi to B0 . The
data-flow equations given hereafter and the presented algorithms follow the
same semantics. They require minor modifications when other φ-semantics are
desired.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.2 Data-flow approaches
A well-known and frequently used approach to compute the live-in and live-out
sets of basic blocks is backward data-flow analysis?(see Chapter 8, Section 8.1).
The liveness sets are given by a set of equations that relate upward-exposed
uses and definitions to live-in and live-out sets. We say a use is upward-exposed in
a block when there is no local definition preceding it, i.e., the live-range “escapes”
the block at the top.
The sets of upward-exposed uses and definitions do not change during live-
ness analysis and can thus be pre-computed. In the following equations, we
denote PhiDefs(B ) the variables defined by φ-functions at the entry of block B ,
and PhiUses(B ) the set of variables used in a φ-function at the entry of a succes-
sor of the block B .
Informally, the live-in of block B are the variables defined in the φ-functions
of B , those used in B (and not defined in B ), and those which are just “passing
through.” On the other hand, the live-out are those that must be live for a succes-
sor S , i.e., either live-in of S (but not defined in a φ-function of S ) or used in a
φ-function of S .
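Spelled out, and writing UpwardExposed(B ) for the upward-exposed uses of B and Defs(B ) for all variables defined in B (these two names are ours), the informal description above corresponds to equations of roughly the following shape:

LiveIn(B ) = PhiDefs(B ) ∪ UpwardExposed(B ) ∪ (LiveOut(B ) \ Defs(B ))
LiveOut(B ) = PhiUses(B ) ∪ ⋃ S ∈ succs(B ) (LiveIn(S ) \ PhiDefs(S ))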
Fig. 9.1 An example of a reducible CFG. Forward CFG is represented using full edges; back-edges are
thickened. Backward pass on forward CFG sets v as live-in of node 5, but not of node 2. Forward
pass on loop nesting forest then sets v as live at node 6 but not at node 7.
Those two properties pave the way for describing the two steps that make up
our liveness set algorithm:
1. A backward pass propagates partial liveness information upwards using a
post-order traversal of the forward-CFG;
2. The partial liveness sets are then refined by traversing the loop nesting forest,
propagating liveness from loop-headers down to all basic blocks within
loops.
Algorithm 9.1 shows the necessary initialization and the high-level structure to
compute liveness in two passes.
in successors of B . Similarly, PhiDefs(B ) denotes the set of variables defined by a
φ-function in B .
The next phase, which traverses the loop nesting forest, is shown in Algo-
rithm 9.3. The live-in and live-out sets of all basic blocks within a loop are unified
with the liveness sets of its loop-header.
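The following is a minimal sketch of this two-pass structure; it is illustrative only, not the book's Algorithms 9.1-9.3. The CFG is passed in as plain Python dictionaries, and the per-block sets (defs, upexp, phi_defs, phi_uses) as well as the loop list are assumed to be supplied by the caller.

def liveness_sets(postorder, fwd_succs, defs, upexp, phi_defs, phi_uses, loops):
    """Two-pass liveness for a reducible CFG (sketch).

    postorder : blocks of the forward CFG in post-order (successors first)
    fwd_succs : block -> successors in the forward CFG (back-edges removed)
    defs, upexp, phi_defs, phi_uses : block -> set of variables
    loops     : list of (header, members) pairs, outermost loops first
    """
    live_in, live_out = {}, {}

    # Pass 1: partial liveness, propagated backwards over the forward CFG (a DAG).
    for b in postorder:
        out = set(phi_uses[b])                  # used by a phi of a successor
        for s in fwd_succs[b]:
            out |= live_in[s] - phi_defs[s]
        live_out[b] = out
        live_in[b] = phi_defs[b] | upexp[b] | (out - defs[b])

    # Pass 2: whatever is live at a loop header is live everywhere in the loop.
    # Loop-header phi definitions are excluded here; their liveness inside the
    # loop is already captured by the first pass.
    for header, members in loops:               # outermost loops first
        extra = live_in[header] - phi_defs[header]
        for b in members:                       # all blocks of the loop
            live_in[b] |= extra
            live_out[b] |= extra
    return live_in, live_out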
Example 1. The CFG of Figure 9.2a is a pathological case for iterative data-flow
analysis. The pre-computation phase does not mark variable a as live throughout
the two loops. An iteration is required for every loop nesting level until the final
solution is computed. In our algorithm, after the CFG traversal, the traversal of
the loop nesting forest (Figure 9.2b) propagates the missing liveness information
from the loop-header of loop L 2 down to all blocks within the loop’s body and all
inner loops, i.e., blocks 3 and 4 of L 3 .
9.2.1.1 Correctness
The first pass propagates the liveness sets using a post-order traversal of the forward CFG, Gfwd, obtained by removing all back-edges from the CFG G. The first two lemmas show that this pass correctly propagates liveness information to the loop-headers of the original CFG.
Lemma 2. Let G be a reducible CFG, v an SSA variable, and d its definition. Let p
be a node of G such that all loops containing p also contain d . Then v is live-in
at p iff there is a path in Gfwd , from p to a use of v that does not go through d .
Pointers to formal proofs are provided in the last section of this chapter. The
important property used in the proof is the dominance property that enforces the
full live-range of a variable to be dominated by its definition d . As a consequence,
any back-edge part of the live-range is dominated by d , and the associated loop
cannot contain d .
Algorithm 9.2, which propagates liveness information along the DAG Gfwd , can
only mark variables as live-in that are indeed live-in. Furthermore, if, after this
propagation, a variable v is missing in the live-in set of a CFG node p , Lemma 2
shows that p belongs to a loop that does not contain the definition of v. Let L be
such a maximal loop. According to Lemma 1, v is correctly marked as live-in at
the header of L . The next lemma shows that the second pass of the algorithm
(Algorithm 9.3) correctly adds variables to the live-in and live-out sets where they
are missing.
Our approach involves potentially many outermost excluding loop queries, especially for the liveness check algorithm developed further on. An efficient implementation of OLE is required. The technique proposed here and shown in Algorithm 9.5 is to pre-compute the set of ancestors from the loop-tree for every node.
(a) Irreducible CFG (b) Reducible CFG (c) Loop nesting forest
Fig. 9.3 A reducible CFG derived from an irreducible CFG, using the loop nesting forest. The transformation redirects edges arriving inside a loop to the loop header (here 9 → 6 into 9 → 5).
A simple set operation can then find the node we are looking for: the
ancestors of the definition node are removed from the ancestors of the query
point. From the remaining ancestors, we pick the shallowest. Using bitsets to encode the set of ancestors of a given node, indexed with a topological order of the loop tree, these operations are easily implemented. The removal is a bit inversion followed by a bitwise “and” operation, and the shallowest node is found by searching for the first set bit in the bitset. Since the number of loops (and thus the number of loop-headers) is rather small, the bitsets are themselves small as well, and this optimization does not result in much wasted space.
Consider a topological indexing of loop-headers: n .LTindex (n being a loop-
header) or reciprocally i .node (i being an index). For each node, we associate a
bitset (indexed by loop-headers) of all its ancestors in the loop tree: n .ancestors.
This can be computed using any topological traversal of the loop-tree by a
call of DFS_COMPUTE_ANCESTORS(Lr). Notice that some compiler intermediate representations consider Lr itself as a loop header; doing so in DFS_COMPUTE_ANCESTORS does not spoil the behavior of OLE.
Using this information, finding the outermost excluding loop can be done by
simple bitset operations as in Algorithm 9.5.
Algorithm 9.4: Compute the loop nesting forest ancestors.
Function DFS_compute_ancestors(node n)
  if n ≠ Lr then
    n.ancestors ← n.LTparent.ancestors
  else
    n.ancestors ← ∅   ▷ empty bitset
  if n.isLoopHeader then
    n.ancestors.add(n.LTindex)
  foreach s in n.LTchildren do
    DFS_COMPUTE_ANCESTORS(s)

Example 2. Consider the example of Figure 9.3c again and suppose the loops L2, L8, and L5 are respectively indexed 0, 1, and 2. Using big-endian notation for bitsets, Algorithm 9.4 would give binary labels 110 to node 9 and 101 to node 6. The outermost loop containing 6 but not 9 is given by the leading bit of
101 ∧ ¬110 = 001, i.e., L 5 .
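The same query is easy to express with integer bitmasks (a sketch with hypothetical helper names; loop indices are assumed to follow the topological numbering of the loop tree used above, so outer loops receive smaller indices):

def outermost_excluding_loop(anc_query, anc_def):
    """anc_query, anc_def: integer bitmasks of loop-tree ancestors,
    bit i set iff the node lies inside the loop of topological index i.
    Returns the index of the outermost loop containing the query node
    but not the definition node, or None if there is no such loop."""
    mask = anc_query & ~anc_def          # remove the definition's ancestors
    if mask == 0:
        return None
    # lowest set bit index = shallowest (outermost) remaining loop
    return (mask & -mask).bit_length() - 1

# Example 2 revisited with little-endian integer masks instead of the
# big-endian bit strings above: node 9 is inside L2 and L8, node 6 inside L2 and L5.
anc_9 = 0b011   # bits 0 (L2) and 1 (L8)
anc_6 = 0b101   # bits 0 (L2) and 2 (L5)
assert outermost_excluding_loop(anc_6, anc_9) == 2   # L5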
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.3 Liveness check using loop nesting forest and forward
reachability
In contrast to liveness sets, liveness check does not provide the set of variables live at a block, but provides a query system to answer questions such as “is variable v live at location q?” Such a framework is well suited for tree-scan based register allocation (see Chapter 22), SSA destruction (see Chapter 21), or hyperblock scheduling (see Chapter 18). Most register-pressure aware algorithms such as code motion are not designed to take advantage of a liveness check query system and still require sets. This query system can obviously be built on top of pre-computed liveness sets. Queries in O(1) are possible, at least at basic block boundaries, provided sparsesets or bitsets are used to allow for efficient element-wise queries. If sets are only stored at basic block boundaries, a query system at instruction granularity can still be obtained by using the list of uses of variables or by backward scans. Constant worst-case query time is lost in this scenario, but liveness sets, which would have to be incrementally updated at each (even minor) code transformation, can be avoided and replaced by less memory-consuming data structures that only depend on the CFG.
In the following, we consider the live-in query of variable a at node q. To avoid notational overhead, let a be defined at the CFG node d = def(a) and let u ∈ uses(a) be a node where a is used. Suppose that q is strictly dominated by d (otherwise a cannot be live at q). Lemmas 1, 2, and 3 stated in Section 9.2.1.1 can be rephrased as follows:
1. Let h be the header of the maximal loop containing q but not d. Let h be q if no such loop exists. Then a is live-in at h if and only if there exists a forward path that goes from h to u.
2. If a is live-in at the header of a loop, then it is live at every node inside the loop.
In other words, a is live-in at q if and only if there exists a forward path from h to u, where h is, if it exists, the header of the maximal loop containing q but not d, and q itself otherwise. Given the forward control-flow graph and the loop nesting forest, finding out whether a variable is live at some program point can be done in two steps. First, if there exists a loop containing the program point q and not the definition, pick the header of the largest such loop as the query point instead. Then check for reachability from q to any use of the variable in the forward CFG. As explained in Section 9.2.2, for an irreducible CFG, the modified forward CFG that redirects any edge s → t to the loop header of the outermost loop containing t but excluding s (t.OLE(s)) has to be used instead. Correctness is proved from the theorems used for liveness sets.
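At basic-block granularity this boils down to a very small query function (a sketch under the assumptions of this section; strictly_dominates, ole, and forward_reachable stand for the pre-computed facilities described in this chapter):

def is_live_in_at_block(q, d, uses, strictly_dominates, ole, forward_reachable):
    """Is the variable defined in block d live-in at block q?
    uses: blocks containing a use (phi-uses attributed to the predecessors)."""
    if not strictly_dominates(d, q):
        return False                      # dominance property: cannot be live at q
    h = ole(q, d)                         # header of largest loop containing q but not d
    if h is None:
        h = q                             # no such loop: query from q itself
    return any(forward_reachable(h, u) for u in uses)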
Algorithm 9.6 puts in a little more effort to provide a query system at instruction granularity. If q is in the same basic block as d (lines 8-13), then a is live at q if and only if there is a use outside the basic block, or inside the block but after q. If h is a loop-header, then a is live at q if and only if a use is forward-reachable from h (lines 19-20). Otherwise, if the use is in the same basic block as q, it must be after q for the variable to be live at q (lines 17-18). In this pseudo-code, upper-case letters denote basic blocks while lower-case letters denote program points at instruction granularity. “def(a)” is an operand. “uses(a)” is a set of operands. “basicBlock(u)” returns the basic block containing the operand u. Given the semantics of the φ-function instruction, the basic block returned by this function for a φ-function operand can be different from the block where the instruction textually occurs. Also, “u.order” provides the corresponding (increasing) ordering in the basic block. For a φ-function operand, the ordering number might be greater than the maximum ordering of the basic block if the semantics of the φ-function places the uses on outgoing edges of the predecessor block. Q.OLE(D) corresponds to Algorithm 9.5 given in Section 9.2.3. forwardReachable(H, U), which tells whether U is reachable in the modified forward CFG, will be described later.
The live-out check algorithm, given by Algorithm 9.7, only differs from the live-in check in lines 5, 11, and 17, which involve ordering comparisons. In line 5, if q is equal to d, the variable cannot be live-in while it might be live-out; in lines 11 and 17, if q is at a use point, that use makes the variable live-in but not necessarily live-out.
Algorithm 9.6 (live-in check), lines 12-20:
12     return false
13   H ← Q.OLE(D)
14   foreach u in uses(a) do
15     U ← basicBlock(u)
16     if (not isLoopHeader(H)) and U = Q and order(u) < order(q) then
17       continue
18     if forwardReachable(H, U) then
19       return true
20   return false

Algorithm 9.7 (live-out check), lines 12-20:
12     return false
13   H ← Q.OLE(D)
14   foreach u in uses(a) do
15     U ← basicBlock(u)
16     if (not isLoopHeader(H)) and U = Q and order(u) ≤ order(q) then
17       continue
18     if forwardReachable(H, U) then
19       return true
20   return false
The liveness check query system relies on pre-computations for efficient OLE and forwardReachable queries. The outermost excluding loop computation is identical to the one used for liveness sets. We explain here how to compute modified-forward reachability (i.e., forward reachability on the CFG transformed to handle irreducibility). In practice we do not explicitly build the modified-forward graph. To compute modified-forward reachability efficiently, we simply need to traverse the modified-forward graph in reverse topological order. A post-order traversal initiated by a call to the recursive function DFS_Compute_forwardReachable(r) (Algorithm 9.8) will do the job. Bitsets can be used to efficiently implement sets of basic blocks. Once forward reachability has been pre-computed this way, forwardReachable(H, U) returns true if and only if U ∈ H.forwardReachable.
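A sketch of this pre-computation, assuming blocks are numbered so they can serve as bit positions and that the modified forward graph (a DAG) is given as a successor map:

def compute_forward_reachable(succs, entry):
    """succs: block index -> iterable of successors in the modified forward CFG.
    Returns block index -> integer bitset of forward-reachable blocks
    (a block is considered to reach itself)."""
    reach = {}

    def dfs(b):
        if b in reach:
            return reach[b]
        reach[b] = 0                       # mark as visited
        r = 1 << b                         # a block reaches itself
        for s in succs.get(b, ()):
            r |= dfs(s)
        reach[b] = r
        return r

    dfs(entry)
    return reach

# forwardReachable(H, U) then becomes: bool(reach[H] & (1 << U))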
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.4 Liveness sets using path exploration
Another, perhaps more intuitive, way of computing liveness sets is closely related to the definition of the live-range of a given variable. As recalled earlier, a variable is live at a program point p if p belongs to a path of the CFG leading from a definition of that variable to one of its uses without passing through the definition. Therefore, the live-range of a variable can be computed using a backward traversal starting at its uses and stopping when reaching its (unique) definition.
Actual implementation of this idea can be done in several ways. In particular, the order in which use operands are processed, in addition to the way liveness sets are represented, can substantially impact performance. The variant we develop here uses a simple stack-like set representation that avoids expensive set-insertion operations and set-membership tests. The idea is to process use operands variable by variable: the processing of different variables is not intermixed, i.e., the processing of one variable is completed before the processing of another variable begins.
Depending on the particular compiler framework, a preprocessing step that performs a full traversal of the program (i.e., the instructions) might be required in order to derive the def-use chains for all variables, i.e., a list of all uses of each SSA variable. The traversal of the variable list and the processing of each variable's uses via its def-use chain is depicted in Algorithm 9.9.
Note that, in strict SSA form, in a given block, no use can appear before a
definition. Thus, if v is live-out or used in a block B , it is live-in iff it is not defined
in B . This leads to the code of Algorithm 9.10 for path exploration.
Algorithm 9.9: Compute liveness sets per variable using def-use chains.
1 Function Compute_LiveSets_SSA_ByVar(CFG)
2 foreach variable v do
3 foreach block B where v is used do
4 if v ∈ PhiUses(B ) then ▷ Used in the φ of a successor block
5 LiveOut(B ) = LiveOut(B ) ∪ {v }
6 Up_and_Mark(B , v )
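Algorithm 9.10 itself is not reproduced here; the following sketch of an Up_and_Mark routine follows the description above (strict SSA, so the walk stops at the variable's unique definition; preds, defs, and phi_defs are assumed per-block inputs):

def up_and_mark(B, v, live_in, live_out, preds, defs, phi_defs):
    """Backward path exploration for variable v, starting at a block B where v is live."""
    if v in defs[B] and v not in phi_defs[B]:
        return                              # killed by an ordinary definition in B
    if v in live_in[B]:
        return                              # already propagated through B
    live_in[B].add(v)
    if v in phi_defs[B]:
        return                              # do not walk past the phi that defines v
    for P in preds[B]:
        live_out[P].add(v)
        up_and_mark(P, v, live_in, live_out, preds, defs, phi_defs)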
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9.5 Further readings
The loop nesting forest considered in this chapter corresponds to the one obtained using Havlak's algorithm [147]. A more generalized definition exists and corresponds to the minimal loop nesting forest as defined by Ramalingam [245]. The handling of any minimal loop nesting forest is also detailed in Chapter 2 of [248].
Handling of irreducible CFGs can be done through CFG transformations such as node splitting [155, 2]. Such a transformation can lead to an exponential growth in the number of nodes. Ramalingam [245] proposed a transformation (different from the one presented here, but also without any exponential growth) that only maintains the dominance property (not the full semantics).
Finding the maximal loop not containing a node s but containing a node t (OLE) is a problem similar to finding the least common ancestor (LCA) of the two nodes s and t in the rooted loop nesting forest: the loop in question is the only direct child of LCA(s, t) that is an ancestor of t. As described in [27], an LCA query can be reduced to a Range Minimum Query (RMQ) problem that can itself be answered in O(1), with a pre-computation of O(n). The adaptation of LCA to provide an efficient algorithm for OLE queries is detailed in Chapter 2 of [248].
This chapter is a short version of Chapter 2 of [248], which among other details contains formal proofs and the handling of different φ-function semantics. Sparsesets are described by Cooper and Torczon [89].
CHAPTER 10
Loop Tree and Induction Variables
S. Pop, A. Cohen
This chapter presents an extension of SSA under which the extraction of the reducible loop tree can be done on the SSA graph alone. This extension also captures reducible loops in the CFG. The chapter first illustrates this property, then shows its usefulness through the problem of induction variable recognition.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10.1 Part of the CFG and loop tree can be exposed from
the SSA
In the classic definition of SSA, the CFG provides the skeleton of the program: basic blocks contain assignment statements defining SSA variable names, and basic blocks with multiple predecessors contain φ-nodes. Let us look at what happens when, starting from a classic SSA representation, we remove the CFG.
In order to remove the CFG, imagine a pretty-printer function that dumps only the arithmetic instructions of each basic block and skips the control instructions of an imperative program by traversing the CFG structure in any order. Does the representation obtained from this pretty-printer contain enough information to enable us to compute the same thing as the original program?1 Let us see what happens with an example in its CFG-based SSA representation:
B1
a ← ...
B2
b ← ...
B3
c ←a +b
B4
return c
After removing the CFG structure, listing the definitions in an arbitrary order, we
could obtain this:
return c
b ← ... Â some computation independent of a
c ←a +b
a ← ... Â some computation independent of b
And this SSA code is enough, in the absence of side effects, to recover an order
of computation that leads to the same result as in the original program. For
example, the evaluation of this sequence of statements would produce the same
result:
b ← ... Â some computation independent of a
a ← ... Â some computation independent of b
c ←a +b
return c
1
To simplify the discussion, we consider the original program to be free of side effect instruc-
tions.
We will now see how to represent the natural loops in the SSA form by systemati-
cally adding extra φ-nodes at the end of loops, together with extra information
about the loop exit predicate. Supposing that the original program contains a
loop:
B1
x ←3
B2
i ← φ(x , j )
if (i < N )
B3
j ←i +1
B4
k ← φexit (i )
B4
return k
Pretty printing, with a random order traversal, we could obtain this SSA code:
x ←3
return k
i ← φ(x , j )
k ← φ(i )
j ←i +1
We notice that some information is lost in this pretty-printing: the exit condition of the loop has been lost. We will have to record this information in the extension of the SSA representation. However, the loop structure still appears through the cyclic definition of the induction variable i. To expose it, we can rewrite this SSA code using simple substitutions, as:
i ← φ(3, i + 1)
k ← φ(i )
return k
Thus, the SSA name i is defined in terms of itself. This pattern is characteristic of the existence of a loop. We can remark that there are two kinds of φ-nodes used in this example:
• loop-φ nodes “i = φ(x, j)” (also denoted i = φentry(x, j) as in Chapter 14) have an argument that contains a self-reference j and an invariant argument x: here the defining expression “j = i + 1” contains a reference to the same loop-φ definition i, while x (here 3) is not part of the circuit of dependencies that involves i and j. Note that it is possible to define a canonical SSA form by limiting the number of arguments of loop-φ nodes to two.
• close-φ nodes “k = φexit(i)” (also denoted k = φexit(i) as in Chapter 14) capture the last value of a name defined in a loop. Names defined in a loop can only be used within that loop or in the arguments of a close-φ node (which “closes” the set of uses of the names defined in that loop). In a canonical SSA form it is possible to limit the number of arguments of close-φ nodes to one.
As we have seen in the above example, the exit condition of the loop disappeared during the basic pretty-printing of the SSA. To capture the semantics of the computation of the loop, we have to specify in the close-φ node when we exit the loop, so as to be able to derive which value will be available at the end of the loop. With our extension, which adds the loop exit condition to the syntax of the close-φ, the SSA pretty-printing of the above example would be:
x ←3
i ← φentry (x , j )
j ←i +1
k ← φexit (i ≥ N , i )
return k
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10.2 Analysis of Induction Variables
The purpose of induction variable analysis is to provide a characterization of the sequences of values taken by a variable during the execution of a loop. This characterization can be an exact function of the canonical induction variable of the loop (i.e., a loop counter that starts at zero with a step of one for each iteration of the loop) or an approximation of the values taken during the execution of the loop, represented by values in an abstract domain. In this section, we will see a possible characterization of induction variables in terms of sequences. The domain of sequences will be represented by chains of recurrences: as an example, a canonical induction variable with an initial value 0 and a stride 1 that occurs in the loop with label x will be represented by the chain of recurrences {0, +, 1}x.
The first phase of the induction variable analysis is the detection of the strongly connected components of the SSA graph. This can be performed by traversing the SSA use-def chains and detecting that some definitions are visited twice. For a self-referring use-def chain, it is possible to derive the step of the corresponding induction variable as the overall effect of one iteration of the loop on the value of the loop-φ node. When the step of an induction variable depends on another cyclic definition, one has to further analyze the inner cycle. The analysis of the induction variable ends when all the inner cyclic definitions used for the computation of the step have been analyzed. Note that it is possible to construct SSA graphs with strongly connected components that are impossible to characterize with chains of recurrences. This is precisely the case in the following example, which shows two inter-dependent circuits: the first involves a and b with step c + 2, and the second involves c and d with step a + 3. This would lead the analysis into an endless loop, which must be detected.
a ← φentry (0, b )
c ← φentry (1, d )
b ← c +2
d ←a +3
Fig. 10.1 Detection of the cyclic definition using a depth first search traversal of the use-def chains.
Once the def-use circuit and its corresponding overall loop update expression have been identified, it is possible to translate the sequence of values of the induction variable into a chain of recurrences. The syntax of a polynomial chain of recurrences is {base, +, step}x, where base and step may be arbitrary expressions or constants, and x is the loop to which the sequence is associated. As a chain of recurrences represents the sequence of values taken by a variable during the execution of a loop, the associated expression of a chain of recurrences is given by {base, +, step}x(ℓx) = base + step × ℓx, that is, a function of ℓx, the number of times the body of loop x has been executed.
When base or step translates to sequences varying in outer loops, the resulting
sequence is represented by a multivariate chain of recurrences. For example
{{0, +, 1} x , +, 2} y defines a multivariate chain of recurrence with a step of 1 in
loop x and a step of 2 in loop y , where loop y is enclosed in loop x .
When step translates into a sequence varying in the same loop, the chain of recurrences represents a polynomial of a higher degree. For example, {3, +, {8, +, 5}x}x represents a polynomial evolution of degree 2 in loop x. In this case, the chain of recurrences is also written omitting the extra braces: {3, +, 8, +, 5}x. The semantics of a chain of recurrences is defined using the binomial coefficient $\binom{n}{p} = \frac{n!}{p!\,(n-p)!}$, by the equation:

$$\{c_0, +, c_1, +, c_2, +, \dots, +, c_n\}_x(\ell_x) \;=\; \sum_{p=0}^{n} c_p \binom{\ell_x}{p},$$

with $\ell$ the iteration domain vector (the iteration loop counters for all the loops in which the chain of recurrences varies), and $\ell_x$ the iteration counter of loop x.
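As a quick sanity check of this formula, a univariate chain of recurrences can be evaluated directly (a small sketch using Python's math.comb for the binomial coefficients; {3, +, 8, +, 5}x is the degree-two example above):

from math import comb

def chrec_value(coeffs, iteration):
    """Value of {c0, +, c1, +, ..., +, cn}_x after `iteration` executions of loop x."""
    return sum(c * comb(iteration, p) for p, c in enumerate(coeffs))

# {0, +, 1}_x is the canonical loop counter ...
assert [chrec_value([0, 1], l) for l in range(4)] == [0, 1, 2, 3]
# ... and {3, +, 8, +, 5}_x matches the degree-two polynomial worked out below.
assert [chrec_value([3, 8, 5], l) for l in range(4)] == [3, 11, 24, 42]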
The last phase of the induction variable analysis consists in the instantiation
(or further analysis) of symbolic expressions left from the previous phase. This
includes the analysis of induction variables in outer loops, computing the last
value of the counter of a preceding loop, and the propagation of closed form ex-
pressions for loop invariants defined earlier. In some cases, it becomes necessary
to leave in a symbolic form every definition outside a given region, and these
symbols are then called parameters of the region.
Let us look again at the example of Figure 10.1 to see how the sequence of
values of the induction variable c is characterized with the chains of recurrences
notation. The first step, after the cyclic definition is detected, is the translation of
this information into a chain of recurrence: in this example, the initial value (or
base of the induction variable) is a and the step is e , and so c is represented by a
chain of recurrence {a , +, e }1 that is varying in loop number 1. The symbols are
then instantiated: a is trivially replaced by its definition leading to {3, +, e }1 . The
analysis of e leads to this chain of recurrence: {8, +, 5}1 that is then used in the
chain of recurrence of c , {3, +, {8, +, 5}1 }1 and that is equivalent to {3, +, 8, +, 5}1 ,
a polynomial of degree two:
$$F(\ell) \;=\; 3\binom{\ell}{0} + 8\binom{\ell}{1} + 5\binom{\ell}{2} \;=\; \frac{5}{2}\,\ell^2 + \frac{11}{2}\,\ell + 3.$$
One of the important static analyses for loops is to evaluate their trip count, i.e.,
the number of times the loop body is executed before the exit condition becomes
true. In common cases, the loop exit condition is a comparison of an induction
variable against some constant, parameter, or another induction variable. The
number of iterations is then computed as the minimum solution of a polynomial
inequality with integer solutions, also called a Diophantine inequality. When one
x ← 0
for i = 0; i < N; i++ do   ▷ loop1
  for j = 0; j < M; j++ do   ▷ loop2
    x ← x + 1

x0 ← 0
i ← φ¹entry(0, i + 1)
x1 ← φ¹entry(x0, x2)
x4 ← φ¹exit(i < N, x1)
j ← φ²entry(0, j + 1)
x3 ← φ²entry(x1, x3 + 1)
x2 ← φ²exit(j < M, x3)
x4 represents the value of variable x at the end of the original imperative program. The analysis of scalar evolutions for variable x4 would trigger the analysis of scalar evolutions for all the other variables defined in the loop-closed SSA form, as follows:
• First, the analysis of variable x4 would trigger the analysis of i, N, and x1:
  – the analysis of i leads to i = {0, +, 1}1, i.e., the canonical loop counter ℓ1 of loop1;
  – N is a parameter and is left under its symbolic form;
  – the analysis of x1 triggers the analysis of x0 and x2:
    · the analysis of x0 leads to x0 = 0;
    · analyzing x2 triggers the analysis of j, M, and x3;
    · j = {0, +, 1}2, i.e., the canonical loop counter ℓ2 of loop2;
    · M is a parameter;
    · x3 = φ²entry(x1, x3 + 1) = {x1, +, 1}2;
    · x2 = φ²exit(j < M, x3) is then computed as the last value of x3 after loop2, i.e., it is the chain of recurrences of x3 applied to the first iteration of loop2 that does not satisfy j < M, or equivalently ℓ2 < M. The corresponding Diophantine inequality ℓ2 ≥ M has minimum
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
10.3 Further readings
Induction variable detection has been studied extensively in the past because of
its central role in loop optimizations. Wolfe [315] designed the first SSA-based
induction variable recognition technique. It abstracts the SSA graph and classifies
inductions according to a wide spectrum of patterns.
When operating on a low-level intermediate representation with arbitrary
gotos, detecting the natural loops is the first step in the analysis of induction
variables. In general, and when operating on low-level code in particular, it is
preferable to use analyses that are more robust to complex control flow that
do not resort to an early classification into predefined patterns. Chains of re-
currences [19, 175, 326] have been proposed to characterize the sequence of
values taken by a variable during the execution of a loop [302], and they have proven to be more robust to the presence of complex, unstructured control flow, to the characterization of induction variables over modulo arithmetic such as unsigned wrap-around types in C, and to implementation in a production compiler [241].
The formalism and presentation of this chapter is derived from the thesis
work of Sebastian Pop. The manuscript [240] contains pseudo-code and links
to the implementation of scalar evolutions in GCC since version 4.0. The same
approach has also influenced the design of LLVM’s scalar evolution, but the
implementation is different. TODO: discuss types, modulo arithmetic, their treatment in GCC and LLVM, and cite this as a difficulty.
Induction variable analysis is used in dependence tests for scheduling and parallelization [314], and more recently, in the extraction of short-vector SIMD instructions [221].2 The Omega test [244] and parametric integer linear programming [121] have typically been used to reason about systems of parametric affine Diophantine inequalities. But in many cases, simplifications and approximations can
lead to polynomial decision procedures [20]. Modern parallelizing compilers
tend to implement both kinds, depending on the context and aggressiveness of
the optimization.
2
Note however that the computation of closed form expressions is not required for dependence
testing itself [317].
CHAPTER 11
Redundancy Elimination
F. Chow
Redundancy elimination is an important category of optimizations performed
by modern optimizing compilers. In the course of program execution, certain
computations may be repeated multiple times that yield the same results. Such
redundant computations can be eliminated by saving the results of the earlier computations and reusing them later instead of recomputing them.
There are two types of redundancies: full redundancy and partial redundancy.
A computation is fully redundant if the computation has occurred earlier re-
gardless of the flow of control. The elimination of full redundancy is also called
common subexpression elimination. A computation is partially redundant if the
computation has occurred only along certain paths. Full redundancy can be re-
garded as a special case of partial redundancy where the redundant computation
occurs regardless of the path taken.
There are two different views for a computation related to redundancy: how
it is computed and the computed value. The former relates to the operator and
the operands it operates on, which translates to how it is represented in the
program representation. The latter refers to the value generated by the compu-
tation in the static sense 1 . As a result, algorithms for finding and eliminating
redundancies can be classified into being syntax-driven or being value-driven.
?
In syntax-driven analyses, two computations are the same if they are the same
operation applied to the same operands that are program variables or constants.
In this case, redundancy can arise only if the variables’ values have not changed
between the occurrences of the computation. In value-based?analyses, redun-
dancy arises whenever two computations yield the same value. For example,
a + b and a + c compute the same result if b and c can be determined to hold the
same value. In this chapter, we deal mostly with syntax-driven redundancy elim-
ination. The last section will extend our discussion to value-based redundancy
elimination.
1 All values referred to in this chapter are static values viewed with respect to the program code. A static value can map to different dynamic values during program execution.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.1 Why partial redundancy elimination and SSA are related
Figure 11.1 shows the two most basic forms of partial redundancy. In Figure 11.1(a), a + b is redundant when the right path is taken. In Figure 11.1(b), a + b is redundant whenever the back edge (see Section 4.4.1) of the loop is taken. Both are examples of strictly partial redundancies, in which insertions are required to eliminate the redundancies. In contrast, a full redundancy can be deleted without requiring any insertion. Partial redundancy elimination (PRE) is powerful because it subsumes global common subexpression elimination and loop-invariant code motion.
We can visualize the impact on redundancies of a single computation as
shown in Figure 11.2. In the region of the control-flow graph dominated by the
occurrence of a + b , any further occurrence of a + b is fully redundant, assuming
a and b are not modified. Following the program flow, once we are past the
dominance frontiers, any further occurrence of a + b is partially redundant. In
constructing SSA form, dominance frontiers are where φ’s are inserted. Since
partial redundancies start at dominance frontiers, it must be related to SSA’s φ’s.
In fact, the same sparse approach to modeling the use-def relationships among
the occurrences of a program variable can be used to model the redundancy
relationships among the different occurrences of a + b .
2
The opposite of maximal expression tree form is the triplet form in which each arithmetic
operation always defines a temporary.
Fig. 11.2 Dominance frontiers (dashed) are boundaries between fully (highlighted basic blocks) and
partially redundant regions (normal basic blocks).
points in the control-flow graph.3 The resulting factored redundancy graph (FRG) can be regarded as the SSA form for expressions.
To make the expression SSA form more intuitive, we introduce the hypothetical temporary h, which can be thought of as the temporary that will be used to store the value of the expression. The FRG can be viewed as the SSA graph for h. Observe that we have not yet determined where h should be defined or used. In referring to the FRG, a use node will refer to a node in the FRG that is not a definition.
The SSA form for h is constructed in two steps similar to ordinary SSA form:
the Φ-Insertion step followed by the Renaming step. In the Φ-Insertion step, we
insert Φ’s at the dominance frontiers of all the expression occurrences, to ensure
that we do not miss any possible placement positions for the purpose of PRE, as
in Figure 11.3a. We also insert Φ’s caused by expression alteration. Such Φ’s are
triggered by the occurrence of φ’s for any of the operands in the expression. In
Figure 11.3b, the Φ at block 3 is caused by the φ for a in the same block, which in turn reflects the assignment to a in block 2.
Fig. 11.3 Examples of Φ-insertion: (a) at the dominance frontiers of the expression occurrences; (b) caused by the alteration of operand a.
The Renaming step assigns SSA versions to h such that occurrences renamed to identical h-versions will compute to the same values. We conduct a pre-order traversal of the dominator tree similar to the renaming step in SSA construction for variables, but with the following modifications: (1) in addition to a renaming stack for each variable, we maintain a renaming stack for the expression; (2) entries on the expression stack are popped as our dominator tree traversal
(2) Entries on the expression stack are popped as our dominator tree traversal
backtracks past the blocks where the expression originally received the version.
Maintaining the variable and expression stacks together allows us to decide ef-
ficiently whether two occurrences of an expression should be given the same
h -version.
There are three kinds of occurrences of the expression in the program: (real) the occurrences in the original program, which we call real occurrences; (Φ-def) the inserted Φ's; and (Φ-use) the use operands of the Φ's, which are regarded as occurring at the ends of the predecessor blocks of their corresponding edges.
3 Adhering to SSAPRE's convention, we use lower-case φ's in the SSA form of variables and upper-case Φ's in the SSA form for expressions.
During the visitation in Renaming, a Φ is always given a new version. For a non-Φ, i.e.,
cases (real) and (Φ-use), we check the current version of every variable in the
expression (the version on the top of each variable’s renaming stack) against
the version of the corresponding variable in the occurrence on the top of the
expression’s renaming stack. If all the variable versions match, we assign it the
same version as the top of the expression’s renaming stack. If any of the variable
versions does not match, for case (real), we assign it a new version, as in the
example of Figure 11.4a; for case (Φ-use), we assign the special class ⊥ to the
Φ-use to denote that the value of the expression is unavailable at that point, as in
the example of Figure 11.4b. If a new version is assigned, we push the version on
the expression stack.
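The core of that decision can be sketched as a small helper (a sketch only, with hypothetical names; var_stack holds the current SSA version of each variable, and expr_stack the h-versions together with the variable versions under which they were created):

def rename_real_occurrence(expr_vars, var_stack, expr_stack, new_h_version):
    """Assign an h-version to a real occurrence of the expression (sketch)."""
    current = {v: var_stack[v] for v in expr_vars}   # versions on top of the variable stacks
    if expr_stack and expr_stack[-1][1] == current:
        return expr_stack[-1][0]                     # all versions match: reuse the h-version
    h = new_h_version()                              # otherwise the occurrence gets a new version
    expr_stack.append((h, current))
    return h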
. . . ← a 1 +b1 [h1 ]
[h1 ] ← Φ([h1 ], ⊥)
. . . ← a 1 +b1 [h1 ]
. . . ← a 1 +b1 [h1 ] b2 ← . . .
. . . ← a 1 +b2 [h2 ]
(a) (b)
Fig. 11.4 Examples of expression renaming
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.2 How SSAPRE works
Referring to the expression being optimized as X , we use the term placement
to denote the set of points in the optimized program where X ’s computation
occurs. In contrast, original computation points refer to the points in the origi-
nal program where X ’s computation took place. The original program will be
transformed to the optimized program by performing a set of insertions and
deletions.
The objective of SSAPRE is to find a placement that satisfies the following four criteria, in this order:
– Correctness: X is fully available at all the original computation points.
– Safety: there is no insertion of X on any path that did not originally contain X.
– Computational optimality: no other safe and correct placement can result in fewer computations of X on any path from entry to exit in the program.
– Lifetime optimality: subject to computational optimality, the live range of the temporary introduced to store X is minimized.
Each occurrence of X at its original computation point can be qualified with
exactly one of the following attributes: (1) fully redundant; (2) strictly partially
redundant; (3) non-redundant.
As a code placement problem, SSAPRE follows the same two-step process used
in all PRE algorithms. The first step determines the best set of insertion points
that render as many strictly partially redundant occurrences fully redundant
as possible. The second step deletes fully redundant computations, taking into
account the effects of the inserted computations. As we consider this second
step to be well understood, the challenge lies in the first step for coming up with
the best set of insertion points. The first step will tackle the safety, computational
optimality and lifetime optimality criteria, while the correctness criterion is
delegated to the second step. For the rest of this section, we only focus on the first
step for finding the best insertion points, which is driven by the strictly partially
redundant occurrences.
We assume that all critical edges in the control-flow graph have been removed
by inserting empty basic blocks at such edges (see Algorithm 3.5). In the SSAPRE
approach, insertions are only performed at Φ-uses. When we say a Φ is a candidate
for insertion, it means we will consider inserting at its use operands to render
X available at the entry to the basic block containing that Φ. An insertion at a
Φ-use means inserting X at the incoming edge corresponding to that Φ operand.
In reality, the actual insertion is done at the end of the predecessor block.
As we have pointed out at the end of Section 11.1, insertions only need to be considered at the Φ's. The safety criterion implies that we should only insert at Φ's where X is downsafe (fully anticipated). Thus, we perform data-flow analysis
on the FRG to determine the downsafe attribute for Φ’s. Data-flow analysis can
be performed with linear complexity on SSA graphs, which we illustrate with the
Downsafety computation.
A Φ is not downsafe if there is a control-flow path from that Φ along which the
expression is not computed before program exit or before being altered by the
redefinition of one of its variables. Except for loops with no exit, this can happen
only due to one of the following cases: (dead) there is a path to exit or an alteration
of the expression along which the Φ result version is not used; or (transitive) the Φ result version appears as the operand of another Φ that is not downsafe. Case
(dead) represents the initialization for our backward propagation of ¬downsafe;
all other Φ’s are initially marked downsafe. The Downsafety propagation is based
on case (transitive). Since a real occurrence of the expression blocks the case
(transitive) propagation, we define a has_real_use flag attached to each Φ operand
and set this flag to true when the Φ operand is defined by another Φ and the path
from its defining Φ to its appearance as a Φ operand crosses a real occurrence.
The propagation of ¬downsafe is blocked whenever the has_real_use flag is true.
Algorithm 11.1 gives the DownSafety propagation algorithm. The initialization of the has_real_use flags is performed in the earlier Renaming phase.
8 Function Reset_downsafe(X )
9 if def(X ) is not a Φ then return
10 f ← def(X )
11 if not downsafe(f ) then return
12 downsafe(f ) ← false
13 foreach operand ω of f do
14 if not has_real_use(ω) then Reset_downsafe(ω)
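Only the Reset_downsafe helper appears above (lines 8-14); the driver that precedes it initializes every Φ as downsafe, clears the flag for the Φ's matching case (dead), and then propagates ¬downsafe. The following is a hedged sketch of that overall propagation, not the book's exact pseudo-code; Phi objects with downsafe and operands fields, and operands with has_real_use and defined_by fields, are assumed:

def downsafety(phis, initially_not_downsafe):
    """Sketch of the DownSafety propagation over the FRG."""
    for f in phis:                      # optimistic initialization
        f.downsafe = True
    for f in initially_not_downsafe:    # case (dead), detected during Renaming
        f.downsafe = False

    def reset_downsafe(operand):        # case (transitive), cf. Reset_downsafe above
        g = operand.defined_by          # None if the operand is not defined by a Phi
        if g is None or not g.downsafe:
            return
        g.downsafe = False
        for w in g.operands:
            if not w.has_real_use:
                reset_downsafe(w)

    for f in phis:
        if not f.downsafe:
            for w in f.operands:
                if not w.has_real_use:
                    reset_downsafe(w)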
At this point, we have eliminated the unsafe Φ’s based on the safety criterion.
Next, we want to identify all the Φ’s that are possible candidates for insertion, by
disqualifying Φ’s that cannot be insertion candidates in any computationally op-
timal placement. An unsafe Φ can still be an insertion candidate if the expression
is fully available there, though the inserted computation will itself be fully redun-
dant. We define the can_be_avail attribute for the current step, whose purpose
is to identify the region where, after appropriate insertions, the computation can
become fully available. A Φ is ¬can_be_avail if and only if inserting there violates
computational optimality. The can_be_avail attribute can be viewed as:
We illustrate our discussion in this section with the example of Figure 11.5,
where the program exhibits partial redundancy that cannot be removed by safe
code motion. The two Φ’s with their computed data-flow attributes are as shown.
If insertions were based on can_be_avail, a + b would have been inserted at
the exits of blocks 4 and 5 due to the Φ in block 6, which would have resulted in
unnecessary code motion that increases register pressure. By considering later,
no insertion is performed, which is optimal under safe PRE for this example.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.3 Speculative PRE
If we ignore the safety requirement of PRE discussed in Section 11.2, the resulting code motion will involve speculation. Speculative code motion suppresses redundancy in some path at the expense of another path where the computation is added but its result is unused. As long as the paths that are burdened with more
computations are executed less frequently than the paths where the redundant
computations are avoided, a net gain in program performance can be achieved.
Thus, speculative code motion should only be performed when there are clues
about the relative execution frequencies of the paths involved.
Without profile data, speculative PRE can be conservatively performed by
restricting it to loop-invariant computations. Figure 11.6 shows a loop-invariant
computation a + b that occurs in a branch inside the loop. This loop-invariant
code motion is speculative because, depending on the branch condition inside
the loop, it may be executed zero time, while moving it to the loop header causes
it to execute once. This speculative loop-invariant code motion is profitable
unless the path inside the loop containing the expression is never taken, which is
usually not the case. When performing SSAPRE, marking Φ’s located at the start
of loop bodies as downsafe will effect speculative loop invariant code motion.
Computations like indirect loads and divides are called dangerous computa-
tions because they may fault. Dangerous computations in general should not
be speculated. As an example, if we replace the expression a + b in Figure 11.6
by a /b and the speculative code motion is performed, it may cause a runtime
divide-by-zero fault after the speculation because b can be 0 at the loop header
while it is never 0 in the branch that contains a /b inside the loop body.
Dangerous computations are sometimes protected by tests (or guards) placed
in the code by the programmers or automatically generated by language com-
pilers like those for Java. When such a test occurs in the program, we say the
dangerous computation is safety-dependent on the control-flow point that es-
tablishes its safety. At the points in the program where its safety dependence is
satisfied, the dangerous instruction is fault-safe and can still be speculated.
We can represent safety dependences as value dependences in the form of abstract τ variables. Each runtime test that succeeds defines a τ variable on its fall-through path. During SSAPRE, we attach these τ variables as additional
operands to the dangerous computations related to the test. The τ variables are
also put into SSA form, so their definitions can be found by following the use-def
chains. The definitions of the τ variables have abstract right-hand-side values
that are not allowed to be involved in any optimization. Because they are abstract,
they are also omitted in the generated code after the SSAPRE phase. A dangerous
computation can be defined to have more than one τ operand, depending on its
semantics. When all its τ operands have definitions, it means the computation
is fault-safe; otherwise, it is unsafe to speculate. By including the τ operands
into consideration, speculative PRE automatically honors the fault-safety of
dangerous computations when it performs speculative code motion.
In Figure 11.7, the program contains a non-zero test for b. We define an additional τ operand for the divide operation in a/b in SSAPRE to indicate whether a non-zero test for b is available. At the start of the region
information whether a non-zero test for b is available. At the start of the region
guarded by the non-zero test for b , the compiler inserts the definition of τ1
with the abstract right-hand-side value τ-edge. Any appearance of a /b in the
region guarded by the non-zero test for b will have τ1 as its τ operand. Having a
defined τ operand allows a /b to be freely speculated in the region guarded by
the non-zero test, while the definition of τ1 prevents any hoisting of a /b past
the non-zero test.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.4 Register promotion via PRE
Variables and most data in programs normally start out residing in memory. It is the compiler's job to promote those memory contents to registers as much
as possible to speed up program execution. Load and store instructions have to
be generated to transfer contents between memory locations and registers. The
compiler also has to deal with the limited number of physical registers and find
an allocation that makes the best use of them. Instead of solving these problems
all at once, we can tackle them as two smaller problems separately:
Variables with no aliases are trivial register promotion candidates. They include
the temporaries generated during PRE to hold the values of redundant computa-
tions. Variables in the program can also be determined via compiler analysis or
by language rules to be alias-free. For those trivial candidates, one can rename
them to unique pseudo-registers, and no load or store needs to be generated.
Our register promotion is mainly concerned with scalar variables that have
aliases, indirectly accessed memory locations and constants. A scalar variable
can have aliases whenever its address is taken, or if it is a global variable, since
it can be accessed by function calls. A constant value is a register promotion
candidate whenever some operations using it have to refer to it through register
operands.
Since the goal of register promotion is to obtain the most efficient placement
for loads and stores, register promotion can be modeled as two separate problems: PRE of loads, followed by PRE of stores. In the case of constant values, our
use of the term load will extend to refer to the operation performed to put the
constant value in a register. The PRE of stores does not apply to constants.
From the point of view of redundancy, loads behave like expressions: the later
occurrences are the ones to be deleted. For stores, this is the reverse: as illustrated in the examples of Figure 11.8, the earlier stores are the ones to be deleted. The PRE of stores, also called partial dead code elimination, can thus be treated as the dual of the PRE of loads. Thus, performing PRE of stores has the effect of
moving stores forward, while inserting them as early as possible. Combining the
effects of the PRE of loads and stores results in optimal placements of loads and
stores while minimizing the live ranges of the pseudo-registers, by virtue of the
computational and lifetime optimality of our PRE algorithm.
(a) (b)
Fig. 11.9 Redundant loads after stores
When we perform the PRE of loads, we thus take the store occurrences into consideration. The Φ-insertion step will insert Φ's at the iterated dominance
frontiers of store occurrences. In the Rename step, a store occurrence is always
given a new h -version, because a store is a definition. Any subsequent load
renamed to the same h -version is redundant with respect to the store.
We apply the PRE of loads (LPRE) first, followed by the PRE of stores (SPRE). This ordering is based on the fact that LPRE is not affected by the result of SPRE, but LPRE creates more opportunities for SPRE by deleting loads that would otherwise have blocked the movement of stores. In addition, speculation is
required for the PRE of loads and stores in order for register promotion to do a
decent job in loops.
The example in Figure 11.10 illustrates what is discussed in this section. During
LPRE, A ← . . . is regarded as a store occurrence. The hoisting of the load of A to
the loop header does not involve speculation. The occurrence of A ← . . . causes r
to be updated by splitting the store into the two statements r ← . . . ; A ← r . In the
PRE of stores (SPRE), speculation is needed to sink A ← . . . to outside the loop
because the store occurs in a branch inside the loop. Without performing LPRE
first, the load of A inside the loop would have blocked the sinking of A ← . . . .
As mentioned earlier, SPRE is the dual of LPRE. Code motion in SPRE will have
the effect of moving stores forward with respect to the control-flow graph. Any
presence of (aliased) loads has the effect of blocking the movement of stores or
rendering the earlier stores non-redundant.
To apply the dual of the SSAPRE algorithm, it is necessary to compute a
program representation that is the dual of the SSA form, the static single use
(SSU) form (see Chapter 13 – SSU is a special case of SSI). In SSU, use-def edges
are factored at divergence points in the control-flow graph using σ-functions
(see Section 13.1.4). Each use of a variable establishes a new version (we say the
load uses the version), and every store reaches exactly one load.
We call our store PRE algorithm SSUPRE, which is made up of the corresponding steps in SSAPRE. The σ-function insertion and renaming phases construct the SSU form for the variable whose store is being optimized. The data-flow
analyses consist of UpSafety to compute the upsafe (fully available) attribute,
CanBeAnt to compute the can_be_ant attribute and Earlier to compute the ear-
lier attribute. Though store elimination itself does not require the introduction of
temporaries, lifetime optimality still needs to be considered for the temporaries
introduced in the LPRE phase, which hold the values up to the point where the stores are placed. It is desirable not to sink the stores too far down.
Figure 11.11 gives the SSU form and the result of SSUPRE on an example
program. The sinking of the store to outside the loop is traded for the insertion
of a store in the branch inside the loop. The optimized code no longer exhibits any store redundancy.

a ← ...         H(a) = v1
c ← ...         H(c) = v2
b ← c           H(b) = H(c) = v2
c ← a + c       H(c) = H(+(H(a), H(c))) = H(+(v1, v2)) = v3
d ← a + b       H(d) = H(+(H(a), H(b))) = H(+(v1, v2)) = v3
(a) straight-line code   (b) value numbers
Fig. 11.12 Value numbering in a local scope
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.5 Value-based redundancy elimination
The PRE algorithm we have described so far is not capable of recognizing re-
dundant computations among lexically different expressions that yield the same
value. In this section, we discuss redundancy elimination based on value analysis.
a1 ← ...         H(a1) = v1
c1 ← ...         H(c1) = v2
b1 ← c1          H(b1) = H(c1) = v2
c2 ← a1 + c1     H(c2) = H(+(H(a1), H(c1))) = H(+(v1, v2)) = v3
d1 ← a1 + b1     H(d1) = H(+(H(a1), H(b1))) = H(+(v1, v2)) = v3
(a) processed statements   (b) value numbers
Fig. 11.13 Global value numbering on SSA form
Global value numbering can be performed by visiting the nodes of the control-flow graph in a reverse post-order traversal of the dominator tree. This traversal strategy minimizes the instances in which a φ-use has an unknown value number, which arises only in the case of back edges from loops. When this arises, we have no choice but to assign a new value number to the variable defined by the φ-function. For example, in the following loop:
1 i1 ← 0
2 j1 ← 0
3 while <cond> do
4 i 2 ← φ(i 3 , i 1 )
5 j2 ← φ( j3 , j1 )
6 i3 ← i2 + 4
7 j3 ← j2 + 4
when we try to hash a value number for either of the two φ’s, the value num-
bers for i 3 and j3 are not yet determined. As a result, we create different value
numbers for i 2 and j2 . This makes the above algorithm unable to recognize that
i 2 and j2 can be given the same value number, or i 3 and j3 can be given the same
value number.
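A compact sketch of such a hash-based scheme follows (illustrative only; instructions are supplied in a reverse post-order as plain tuples, the two opaque definitions are modeled as loads of distinct constants, commutative operators are normalized by sorting operand value numbers, and an operand without a value number — the back-edge φ input just discussed — forces a fresh number):

def value_number(instrs, commutative=("+", "*")):
    """instrs: SSA instructions in reverse post-order, each (target, op, operands).
    Operands are SSA names (str) or integer constants."""
    vn, table, counter = {}, {}, 0

    def fresh():
        nonlocal counter
        counter += 1
        return "v%d" % counter

    for target, op, operands in instrs:
        if any(isinstance(x, str) and x not in vn for x in operands):
            vn[target] = fresh()                    # unknown operand: be pessimistic
            continue
        key_ops = tuple(vn[x] if isinstance(x, str) else ("const", x) for x in operands)
        if op in commutative:
            key_ops = tuple(sorted(key_ops, key=repr))
        if op == "copy":
            vn[target] = key_ops[0]                 # H(b) = H(c) for b <- c
            continue
        key = (op, key_ops)
        if key not in table:
            table[key] = fresh()
        vn[target] = table[key]
    return vn

# The straight-line example of Figure 11.13:
print(value_number([
    ("a1", "assign", [0]), ("c1", "assign", [1]),
    ("b1", "copy", ["c1"]),
    ("c2", "+", ["a1", "c1"]), ("d1", "+", ["a1", "b1"]),
]))
# {'a1': 'v1', 'c1': 'v2', 'b1': 'v2', 'c2': 'v3', 'd1': 'v3'}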
The above hash-based value numbering algorithm can be regarded as pes-
simistic, because it will not assign the same value number to two different expres-
sions unless it can prove they compute the same value. There exists a different
approach (see Section 11.6 for references) to performing value numbering that
is not hash-based and is optimistic. It does not depend on any traversal over the
program’s flow of control, and so is not affected by the presence of back edges. The
algorithm partitions all the expressions in the program into congruence classes.
Expressions in the same congruence class are considered equivalent because
they evaluate to the same static value. The algorithm is optimistic because when
it starts, it assumes all expressions that have the same operator to be in the same
congruence class. Given two expressions within the same congruence class, if
their operands at the same operand position belong to different congruence
classes, the two expressions may compute to different values, and thus should
not be in the same congruence class. This is the subdivision criterion. As the algo-
rithm iterates, the congruence classes are subdivided into smaller ones while the
total number of congruence classes increases. The algorithm terminates when
no more subdivision can occur. At this point, the set of congruence classes in
this final partition will represent all the values in the program that we care about,
and each congruence class is assigned a unique value number.
While such a partition-based algorithm is not obstructed by the presence
of back edges, it does have its own deficiencies. Because it has to consider one
operand position at a time, it is not able to apply commutativity to detect more
equivalences. Since it is not applied bottom-up with respect to the expression
tree, it is not able to apply algebraic simplifications while value numbering. To
get the best of both the hash-based and the partition-based algorithms, it is
possible to apply both algorithms independently and then combine their results
together to shrink the final set of value numbers.
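The partition-refinement idea can be sketched as follows. This is a simplified illustration, not the published algorithm verbatim: it assumes every variable has a single defining expression, treats φ-functions like ordinary operators, and makes constants and parameters their own classes.

def partition_based_vn(defs):
    """Optimistic, partition-based value numbering (illustrative sketch).

    defs: mapping SSA name -> (op, operands); operands are SSA names,
    constants, or parameters.  Returns a mapping from SSA name to a
    congruence-class identifier (its value number).
    """
    # Optimistic start: all expressions with the same operator are congruent.
    class_of = {v: (op,) for v, (op, _) in defs.items()}

    def operand_class(x):
        # An operand that is not defined here (a constant or parameter)
        # forms its own singleton class.
        return class_of.get(x, ('leaf', x))

    changed = True
    while changed:
        changed = False
        groups = {}
        for v, c in class_of.items():
            groups.setdefault(c, []).append(v)
        for c, members in groups.items():
            # Subdivision criterion: members whose operands, position by
            # position, fall into different classes cannot stay congruent.
            buckets = {}
            for v in members:
                sig = tuple(operand_class(o) for o in defs[v][1])
                buckets.setdefault(sig, []).append(v)
            if len(buckets) > 1:
                for sig, vs in buckets.items():
                    for v in vs:
                        class_of[v] = (c, sig)
                changed = True
    return class_of

On the loop shown earlier, the classes containing i2/j2 and i3/j3 are never subdivided by this refinement, so the optimistic pass proves them congruent where the pessimistic hash-based pass could not.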
So far, we have discussed finding computations that compute to the same values,
but have not addressed eliminating the redundancies among them. Two compu-
tations that compute to the same value exhibit redundancy only if there is a control-flow path that leads from one to the other.
An obvious approach is to consider PRE for each value number separately.
This can be done by introducing, for each value number, a temporary that stores
the redundant computations. But value-number-based PRE has to deal with the
issue of how to generate an insertion. Because the same value can come from
different forms of expressions at different points in the program, it is necessary
to determine which form to use at each insertion point. If the insertion point
is outside the live range of any variable version that can compute that value,
then the insertion point has to be disqualified. Due to this complexity, and the
expectation that strictly partial redundancy is rare among computations that
yield the same value, it seems to be sufficient to perform only full redundancy
elimination among computations that have the same value number.
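As a minimal illustration of that restricted, full-redundancy-only strategy, the sketch below reuses a precomputed value-number map and works within a single basic block; the triple encoding and helper names are assumptions of this sketch, not part of the algorithms cited here.

def local_fullred_elim(block, value_number):
    """block: list of (dest, op, operands); value_number: name -> value number.

    Replaces a computation whose value number was already produced earlier in
    the block by a copy from the earlier result (full redundancy only)."""
    first_def = {}     # value number -> name of the first computation of it
    out = []
    for dest, op, operands in block:
        vn = value_number[dest]
        if op not in ('const', 'copy', 'param') and vn in first_def:
            # The value is fully available: reuse it through a copy.
            out.append((dest, 'copy', (first_def[vn],)))
        else:
            first_def.setdefault(vn, dest)
            out.append((dest, op, operands))
    return out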
However, it is possible to broaden the scope and consider PRE among lexi-
cally identical expressions and value numbers at the same time. In this hybrid
approach, it is best to relax our restriction on the style of program representation
described in Section 11. By not requiring Conventional SSA Form, we can more
effectively represent the flow of values among the program variables. By regard-
ing the live range of each SSA version to extend from its definition to program exit,
we allow its value to be used whenever convenient. The program representation
can even be in the form of triplets, in which the result of every operation is im-
mediately stored in a temporary. It will just assign the value numbers of the right-hand sides to the left-hand-side variables. This hybrid approach (GVN-PRE – see
below) can be implemented based on an adaptation of the SSAPRE framework.
Since each φ-function in the input can be viewed as merging different value num-
bers from the predecessor blocks to form a new value number, the Φ-function
insertion step will be driven by the presence of φ’s for the program variables. The
FRGs can be formed from some traversal of the program code. Each FRG can be
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11.6 Further readings
The concept of partial redundancy was first introduced by Morel and Renvoise.
In their seminal work [209], Morel and Renvoise showed that global common
subexpressions and loop-invariant computations are special cases of partial
redundancy, and they formulated PRE as a code placement problem. The PRE
algorithm developed by Morel and Renvoise involves bidirectional data-flow
analysis, which incurs more overhead than unidirectional data-flow analysis.
In addition, their algorithm does not yield optimal results in certain situations.
A better placement strategy, called lazy code motion (LCM), was later developed by Knoop et al. [178, 180]. It improved on Morel and Renvoise’s results by avoiding unnecessary code movements, by removing the bidirectional nature
of the original PRE data-flow analysis and by proving the optimality of their
algorithm. After lazy code motion was introduced, there have been alternative
formulations of PRE algorithms that achieve the same optimal results, but differ
in the formulation approach and implementation details [110, 109, 226, 319].
The above approaches to PRE are all based on encoding program properties
in bit vector forms and the iterative solution of data-flow equations. Since the
bit vector representation uses basic blocks as its granularity, a separate algo-
rithm is needed to detect and suppress local common subexpressions. Chow et
al. [72, 170] came up with the first SSA-based approach to perform PRE. Their SSAPRE algorithm is an adaptation of LCM that takes advantage of the use-def information inherent in SSA. It avoids having to encode data-flow information in bit-vector form, and eliminates the need for a separate algorithm to suppress local common subexpressions. Their algorithm was the first to make use of SSA to solve data-flow problems for expressions in the program, taking advantage of SSA’s sparse representation so that fewer steps are needed to propagate
data-flow information. The SSAPRE algorithm thus brings the many desirable
characteristics of SSA-based solution techniques to PRE.
In the area of speculative PRE, Murphy et al. [214] introduced the concept of
fault-safety and used it in the SSAPRE framework for the speculation of dangerous
computations. When execution profile data are available, it is possible to tailor
the use of speculation to maximize runtime performance for the execution that
matches the profile. Xue and Cai [318] presented a computationally and lifetime
optimal algorithm for speculative PRE based on profile data. Their algorithm uses
bit-vector-based data-flow analysis and applies minimum cut to flow networks
formed out of the control-flow graph to find the optimal code placement. Zhou et al. [324] applied the minimum-cut approach to flow networks formed out of the FRG in the SSAPRE framework to achieve the same computationally and lifetime-optimal code motion. They showed that their sparse approach based on SSA results
in smaller flow networks, enabling the optimal code placements to be computed
more efficiently.
Lo et al. [194] showed that register promotion can be achieved by load place-
ment optimization followed by store placement optimization. Other optimiza-
tions can potentially be implemented using the SSAPRE framework, like code
hoisting, register shrink-wrapping [75] and live range shrinking. Moreover, PRE
has traditionally provided the context for integrating additional optimizations
into its framework. They include operator strength reduction [179] and linear
function test replacement [171].
Hash-based value numbering originated from Cocke and Schwartz [80], and Rosen et al. [258] extended it to global value numbering based on SSA. The partition-
based algorithm was developed by Alpern et al. [8]. Briggs et al. [53] presented
refinements to both the hash-based and partition-based algorithms, including
applying the hash-based method in a post order traversal of the dominator tree.
VanDrunen and Hosking proposed A-SSAPRE (anticipation-based SSAPRE)
that removes the requirement of Conventional SSA Form and is best for pro-
gram representations in the form of triplets [304]. Their algorithm determines
optimization candidates and constructs FRGs via a depth-first, preorder traversal
over the basic blocks of the program. Within each FRG, non-lexically identical
expressions are allowed, as long as there are potential redundancies among them.
VanDrunen and Hosking [305] subsequently presented GVN-PRE (Value-based
Partial Redundancy Elimination) which is claimed to subsume both PRE and
GVN.
φ
Part III
Extensions
CHAPTER 12
Introduction V. Sarkar
F. Rastello
So far, we have introduced the foundations of SSA form and its use in different
program analyses. We now motivate the need for extensions to SSA form to
enable a larger class of program analyses. The extensions arise from the fact
that many analyses need to make finer-grained distinctions between program
points and data accesses than what can be achieved by vanilla SSA form.
However, these richer flavors of extended SSA-based analyses still retain many of
the benefits of SSA form (e.g., sparse data-flow propagation) which distinguish
them from classical data-flow frameworks for the same analysis problems.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.1 Static single information form
The sparseness in vanilla SSA form arises from the observation that information
for an unaliased scalar variable can be safely propagated from its (unique) defi-
nition to all its reaching uses without examining any intervening program points.
As an example, SSA-based constant propagation aims to compute for each single
assignment variable, the (usually over-approximated) set of possible values car-
ried by the definition of that variable. For instance, consider an instruction that
defines a variable a, and uses two variables, b and c. An example is a = b + c.
In an SSA-form program, constant propagation will determine if a is a constant
by looking directly at the definition point of b and at the definition point of c . We
say that information is propagated from these two definition points directly to
the instruction a = b + c . However, there are many analyses for which definition
points are not the only source of new information. For example, consider an
if-then-else statement which uses a in both the then and else parts. If the branch
condition involves a , say a == 0, we now have additional information that can
distinguish the value of a in the then and else parts, even though both uses have
the same reaching definition. Likewise, the use of a reference variable p in certain
contexts can be the source of new information for subsequent uses e.g., the fact
that p is non-null because no exception was thrown at the first use.
The goal of Chapter 13 is to present a systematic approach to dealing with
these additional sources of information, while still retaining the space and time
benefits of vanilla SSA form. This approach is called Static Single Information (SSI)
form, and it involves additional renaming of variables to capture new sources of
information. For example, SSI form can provide distinct names for the two uses
of a in the then and else parts of the if-then-else statement mentioned above.
This additional renaming is also referred to as live range splitting, akin to the idea
behind splitting live ranges in optimized register allocation. The sparseness of SSI
form follows from formalization of the Partitioned Variable Lattice (PVL) problem
and the Partitioned Variable Problem (PVP), both of which establish orthogonality
among transfer functions for renamed variables. The φ-functions inserted at join
nodes in vanilla SSA form are complemented by σ-functions in SSI form that can
perform additional renaming at branch nodes and interior (instruction) nodes.
Information can be propagated in the forward and backward directions using
SSI form, enabling analyses such as range analysis (leveraging information from
branch conditions) and null pointer analysis (leveraging information from prior
uses) to be performed more precisely than with vanilla SSA form.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.2 Control dependencies
To expose parallelism and locality, one needs to get rid of the CFG at some point. For loop transformations and software pipelining, there is a need to manipulate a higher degree of abstraction to represent the iteration space of nested loops and to extend data-flow information to this abstraction. One can expose even more parallelism (at the level of instructions) by replacing control flow by control dependences: the goal is either to express a predicate expression under which a given basic block is to be executed, or to select afterwards (using similar predicate expressions) the correct value among a set of eagerly computed ones.
1. Technically, we say that SSA provides data flow (data dependences). The goal is to enrich it with control dependences. The program dependence graph (PDG) constitutes the basis of such IR extensions. Gated-SSA (GSA), mentioned below, provides an interpretable (data- or demand-driven) IR that uses this concept. Psi-SSA, also mentioned below, is a very similar IR, but one more appropriate for code generation on architectures with predication.
2. Note that such extensions sometimes face difficulties in handling loops correctly (the need to avoid deadlock between the loop predicate and the computation of the loop body, to replicate the behavior of infinite loops, etc.). However, we believe that, as we will illustrate further, loop-carried control dependences
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.3 Gated-SSA forms
As already mentioned, one of the strengths of SSA form is its associated data-flow graph (DFG), the SSA graph, which allows information to be propagated directly along the def-use chains. This is what makes data-flow analysis sparse. By combining the SSA graph with the control-flow graph, static analysis can be made context sensitive. This can be done in a more natural and powerful way by incorporating both the data-flow and the control-flow information in a unified representation.
The program dependence graph (PDG) adds to the data dependence edges
(SSA graph as the Data Dependence Graph – DDG) the control dependence edges
(Control Dependence Graph – CDG). As already mentioned, one of the main motivations for the development of the PDG was to aid automatic parallelization of instructions across multiple basic blocks. However, in practice it also exposes the relationship between the control predicates and their related control-dependent instructions, thus allowing the associated information to be propagated. A natural way to represent this relationship is through the use of gating functions that are used in some extensions such as Gated-SSA (GSA) or the Value State Dependence Graph (VSDG). Gating functions are directly interpretable versions of φ-nodes. As an example, φif(P, v1, v2) can be interpreted as a function that selects the value v1 if predicate P evaluates to true and the value v2 otherwise. PDG, GSA, and VSDG are described in Chapter 14.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.4 Psi-SSA form
Psi-SSA form (Chapter 15) addresses the need for modeling static single assign-
ment form in predicated operations. A predicated operation is an alternate rep-
resentation of a fine-grained control-flow structure, often obtained by using the
well known if-conversion transformation (see Chapter 20). A key advantage of
using predicated operations in a compiler’s intermediate representation is that it
can often enable more optimizations by creating larger basic blocks compared to
approaches in which predicated operations are modeled as explicit control-flow
graphs. From an SSA form perspective, the challenge is that a predicated opera-
tion may or may not update its definition operand, depending on the value of
the predicate guarding that assignment. This challenge is addressed in Psi-SSA
form by introducing ψ-functions that perform merge functions for predicated
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.5 Hashed SSA form
The motivation for SSI form arose from the need to perform additional renaming
to distinguish among different uses of an SSA variable. Hashed SSA (HSSA) form, introduced in Chapter 16, addresses another important requirement, viz., the
need to model aliasing among variables. For example, a static use or definition of
indirect memory access ∗p in the C language could represent the use or definition
of multiple local variables whose addresses can be taken and may potentially be
assigned to p along different control-flow paths. To represent aliasing of local
variables, HSSA form extends vanilla SSA form with MayUse (µ) and MayDef (χ)
functions to capture the fact that a single static use or definition could potentially
impact multiple variables. Note that MayDef functions can result in the creation
of new names (versions) of variables, compared to vanilla SSA form. HSSA form
does not take a position on the accuracy of alias analysis that it represents. It is
capable of representing the output of any alias analysis performed as a pre-pass
to HSSA construction. As summarized above, a major concern with HSSA form is
that its size could be quadratic in the size of the vanilla SSA form, since each use or
definition can now be augmented by a set of MayUse’s and MayDef’s respectively.
A heuristic approach to dealing with this problem is to group together all variable
versions that have no “real” occurrence in the program, i.e., do not appear in
a real instruction outside of a φ, µ or χ function. These versions are grouped
together into a single version called the zero version of the variable.
In addition to aliasing of locals, it is important to handle the possibility of
aliasing among heap-allocated variables. For example, ∗p and ∗q may refer to
the same location in the heap, even if no aliasing occurs among local variables.
HSSA form addresses this possibility by introducing a virtual variable for each
address expression used in an indirect memory operation, and renaming virtual
variables with φ functions as in SSA form. Further, the alias analysis pre-pass is
expected to provide information on which virtual variables may potentially be
aliased, thereby leading to the insertion of µ or χ functions for virtual variables
as well. Global value numbering is used to increase the effectiveness of the virtual
variable approach, since all indirect memory accesses with the same address
expression can be merged into a single virtual variable (with SSA renaming as
usual). In fact, the Hashed SSA name in HSSA form comes from the use of hashing
in most value-numbering algorithms.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.6 Array SSA form
In contrast to HSSA form, Array SSA form (Chapter 17) takes an alternate ap-
proach to modeling aliasing of indirect memory operations by focusing on alias-
ing in arrays as its foundation. The aliasing problem for arrays is manifest in the
fact that accesses to elements A[i ] and A[ j ] of array A refer to the same location
when i = j . This aliasing can occur with just local array variables, even in the
absence of pointers and heap-allocated data structures. Consider a program with
a definition of A[i ] followed by a definition of A[ j ]. The vanilla SSA approach can
be used to rename these two definitions to (say) A 1 [i ] and A 2 [ j ]. The challenge
with arrays arises when there is a subsequent use of A[k ]. For scalar variables,
the reaching definition for this use can be uniquely identified in vanilla SSA form.
However, for array variables, the reaching definition depends on the subscript
values. In this example, the reaching definition for A[k] will be A2 or A1 if k == j or k == i (or a prior definition A0 if k ≠ j and k ≠ i). To provide A[k] with a single
reaching definition, Array SSA form introduces a definition-Φ (d Φ) operator that
represents the merge of A 2 [ j ] with the prevailing value of array A prior to A 2 .
The result of this d Φ operator is given a new name, A 3 (say), which serves as the
single definition that reaches use A[k ] (which can then be renamed to A 3 [k ]).
This extension enables sparse data-flow propagation algorithms developed for
vanilla SSA form to be applied to array variables, as illustrated by the algorithm
for sparse constant propagation of array elements presented in this chapter. The
accuracy of analyses for Array SSA form depends on the accuracy with which
pairs of array subscripts can be recognized as being definitely-same (DS ) or
definitely-different (DD).
To model heap-allocated objects, Array SSA form builds on the observation
that all indirect memory operations can be modeled as accesses to elements of
abstract arrays that represent disjoint subsets of the heap. For modern object-
oriented languages like Java, type information can be used to obtain a parti-
tioning of the heap into disjoint subsets e.g., instances of field x are guaranteed
to be disjoint from instances of field y . In such cases, the set of instances of
field x can be modeled as a logical array (map) H x that is indexed by the object
reference (key). The problem of resolving aliases among field accesses p.x and
q .x then becomes equivalent to the problem of resolving aliases among array
accesses H x [p ] and H x [q ], thereby enabling Array SSA form to be used for
analysis of objects as in the algorithm for redundant load elimination among
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
12.7 Memory based data flow
SSA provides data flow / dependences between scalar variables. Execution order
of side-effecting instructions must also be respected. Indirect memory accesses can be treated very conservatively as such, leading to dependence edges (sometimes called state edges, see Chapter 14). Overly conservative dependences annihilate the potential for optimization. Alias analysis is the first step toward more precise dependence information. Representing this information efficiently in the IR is important. One could simply add a dependence (or a flow arc) between two "consecutive" instructions that may or must alias. Then, in the same spirit as SSA φ-nodes, which aim at combining information as early as possible (as opposed to standard def-use chains, see the discussion in Chapter 2), similar nodes can be used for memory dependences. Consider the following sequential C-like code
involving pointers, where p and q may alias:
∗p ← . . .; ∗q ← . . .; . . . ← ∗p ; . . . ← ∗q ;
Without the use of φ-nodes, the number of def-use chains required to link the assignments to their uses would be quadratic (four here). Hence the usefulness of generalizing SSA and its φ-nodes from scalars to memory accesses for sparse analyses. HSSA (see Chapter 16) and Array SSA (see Chapter 17) are two different implementations of this idea. One has to admit that, while this early combination is well suited for analysis or interpretation, the introduction of a φ-function might add a control dependence to an instruction that would not exist otherwise. In other words, only simple loop-carried dependences can be expressed this way. Let us illustrate this point using a simple example:
for i do
    A[i] ← f(A[i − 2])
This would easily allow for the propagation of the information that A2 ≥ 0. On the other hand, by adding this φ-node, it becomes difficult to deduce that iterations i and i + 1 can be executed in parallel: the φ-node adds a loop-carried dependence. If one is interested in performing more sophisticated loop transformations than just exposing fully parallel loops (such as loop interchange, loop tiling, or multidimensional software pipelining), then (Dynamic) Single Assignment forms are the representations of choice. There exist many formalisms, including Kahn Process Networks (KPN) and Fuzzy Data-flow Analysis (FADA), that implement this idea, but each time restrictions apply. This is part of the huge research area of automatic parallelization, which is outside the scope of this book. For further details we refer to the corresponding Encyclopedia of Parallel Computing [225].
CHAPTER 13
Static Single Information Form F. Pereira
F. Rastello
The objective of a data-flow analysis is to discover facts that are true about a pro-
gram. We call such facts information. Using the notation introduced in Chapter 8,
a piece of information is an element of the data-flow lattice. For example, the information
that concerns liveness analysis is the set of variables alive at a certain program
point. Similarly to liveness analysis, many other classical data-flow approaches
bind information to pairs formed by a variable and a program point. However, if
an invariant occurs for a variable v at any program point where v is alive, then
we can associate this invariant directly to v . If the intermediate representation of
a program guarantees this correspondence between information and variable for
every variable, then we say that the program representation provides the Static
Single Information (SSI) property.
In Chapter 8 we have shown how the SSA form allows us to solve sparse for-
ward data-flow problems such as constant propagation. In the particular case of
constant propagation, the SSA form lets us assign to each variable the invariant—
or information—of being constant or not. The SSA intermediate representation
gives us this invariant because it splits the live-ranges of variables in such a
way that each variable name is defined only once. Now we will show that live-
range splitting can also provide the SSI property not only to forward, but also to
backward data-flow analyses.
Different data-flow analyses might extract information from different pro-
gram facts. Therefore, a program representation may afford the SSI property
to some data-flow analyses, but not to all of them. For instance, the SSA form
naturally provides the SSI property to the reaching definition analysis. Indeed,
the SSA form provides the static single information property to any data-flow
analysis that obtains information at the definition sites of variables. These analy-
ses and transformations include copy and constant propagation as illustrated
in Chapter 8. However, for a data-flow analysis that derives information from
the use sites of variables, such as the class inference analysis we will describe in
Section 13.1.6, the information associated with a variable might not be unique along its entire live-range even under SSA: in that case the SSA form does not provide the SSI property.
There exist extensions of the SSA form that provide the SSI property to more
data-flow analyses than the original SSA does. Two classic examples—detailed
later—are the Extended-SSA (e-SSA) form, and the Static Single Information (SSI)
form. The e-SSA form provides the SSI property to analyses that take information
from the definition site of variables, and also from conditional tests where these
variables are used. The SSI form provides the static single information property to
data-flow analyses that extract information from the definition sites of variables
and from the last use sites (which we define later). These different intermediate
representations rely on a common strategy to achieve the SSI property: live-range
splitting. In this chapter we show how to use live-range splitting to build program
representations that provide the static single information property to different
types of data-flow analyses.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.1 Static single information
The goal of this section is to define the notion of Static Single Information, and to
explain how it supports the sparse data-flow analyses discussed in Chapter 8. With
this purpose, we revisit the concept of sparse analysis in Section 13.1.1. There
exists a special class of data-flow problems, which we call Partitioned Lattice
per Variable (PLV), that fits in the sparse data-flow framework of this chapter
very well. We will look more carefully into these problems in Section 13.1.2. The
intermediate program representations discussed in this chapter provide the
static single information property—formalized in Section 13.1.3—to any PLV
problem. We give in Section 13.1.5 algorithms to solve sparsely any data-flow
problem that contains the SSI property. This sparse framework is very broad:
many well-known data-flow problems are partitioned lattice, as we will see in
the examples of Section 13.1.6.
l1: i ← 0
l2: s ← 0
l3: while (i < 100) do
l4:     i ← i + 1
l5:     s ← s + i
l6: ret s

prog. point    [i]           [s]
0              top           top
1              [0, 0]        top
2              [0, 0]        [0, 0]
3              [0, 100]      [0, +∞[
4              [100, 100]    [0, +∞[
5              [0, 99]       [0, +∞[
6              [0, 100]      [0, +∞[
7              [0, 100]      [0, +∞[
Fig. 13.1 An example of a dense data-flow analysis that finds the range of possible values associated with each variable at each program point.
13.1.2 Partitioned lattice per variable (PLV) problems
The class of non-relational data-flow analysis problems we are interested in comprises those that bind information to pairs of program variables and program points. We designate this class of problems as Partitioned Lattice per Variable problems and formally describe them as follows.
x^p = ⋀_{s ∈ preds(p)} F^{s→p}(x^s),

where x^p denotes the abstract state associated with program point p, and F^{s→p} is the transfer function from predecessor s to p. The analysis can alternatively be written as a constraint system that binds to each program point p and each s ∈ preds(p) the equation x^p = x^p ∧ F^{s→p}(x^s) or, equivalently, the inequation x^p ⊑ F^{s→p}(x^s).
If the information associated with a variable is invariant along its entire live-
range, then we can bind this information to the variable itself. In other words,
we can replace all the constraint variables [v]^p by a single constraint variable [v], for each variable v and every p ∈ live(v). Consider the problem of range
analysis again. There are two types of control-flow points associated with non-
identity transfer functions: definitions and conditionals. (1) At the definition
point of variable v , Fv simplifies to a function that depends only on some [u ]
where each u is an argument of the instruction defining v ; (2) At the conditional
tests that use a variable v , Fv can be simplified to a function that uses [v ] and
possibly other variables that appear in the test. The other program points are associated with an identity transfer function and can thus be ignored: [v]^p = [v]^p ∧ F_v^{s→p}([v_1]^s, . . . , [v_n]^s) simplifies to [v]^p = [v]^p ∧ [v]^p, i.e., [v]^p = [v]^p. This
gives the intuition on why a propagation engine along the def-use chains of a
SSA-form program can be used to solve the constant propagation problem in an
equivalent, yet “sparser,” manner.
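As an illustration of that intuition, here is a small sketch of such a def-use-chain propagation engine for constant propagation. The encoding of the program as maps over SSA names and the tiny operator set are assumptions made for this example; every operand is assumed to be an SSA name present in defs.

from collections import deque

TOP, BOT = 'top', 'bottom'   # value not yet known / not a constant

def meet(a, b):
    if a == TOP:
        return b
    if b == TOP:
        return a
    return a if a == b else BOT

def sparse_constprop(defs, uses):
    """defs: SSA name -> (op, operands); uses: SSA name -> names of the
    variables whose defining expression reads it (the def-use chains)."""
    value = {v: TOP for v in defs}
    worklist = deque(defs)
    while worklist:
        v = worklist.popleft()
        op, ops = defs[v]
        if op == 'const':
            new = ops[0]
        elif op == 'phi':
            new = TOP
            for o in ops:
                new = meet(new, value[o])
        elif op == '+':
            a, b = value[ops[0]], value[ops[1]]
            if BOT in (a, b):
                new = BOT
            elif TOP in (a, b):
                new = TOP
            else:
                new = a + b
        else:
            new = BOT                          # unmodelled operation
        if new != value[v]:
            value[v] = new
            worklist.extend(uses.get(v, ()))   # revisit only the direct users
    return value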
A program representation that fulfills the Static Single Information (SSI) prop-
erty allows us to attach the information to variables, instead of program points,
and needs to fulfill the following four properties: Split forces the information
related to a variable to be invariant along its entire live-range; Info forces this
information to be irrelevant outside the live-range of the variable; Link forces
the def-use chains to reach the points where information is available for a trans-
fer function to be evaluated; Finally, Version provides a one-to-one mapping
between variable names and live-ranges.
We now give a formal definition of the SSI, and the four properties.

Property 1 (SSI). STATIC SINGLE INFORMATION: Consider a forward (resp. backward) monotone PLV problem E_dense stated as a set of constraints

[v]^p ⊑ F_v^{s→p}([v_1]^s, . . . , [v_n]^s)

for every variable v, each program point p, and each s ∈ preds(p) (resp. s ∈ succs(p)). A program representation fulfills the Static Single Information property if and only if it fulfills the following four properties:
We must split live-ranges using special instructions to provide the SSI properties. A naive way would be to split them between each pair of consecutive instructions; then we would automatically provide these properties, as the newly
created variables would be live at only one program point. However, this strategy
would lead to the creation of many trivial program regions, and we would lose
sparsity. We provide a sparser way to split live-ranges that fit Property 1 in Sec-
tion 13.2. We may also have to extend the live-range of a variable to cover every
program point where the information is relevant; We accomplish this last task by
inserting pseudo-uses and pseudo-definitions of this variable.
denotes m copies v_i' = v_i performed in parallel with instruction inst. This means that all the uses of inst plus all right-hand variables v_i are read simultaneously, then inst is computed, then all definitions of inst plus all left-hand variables v_i' are written simultaneously. For a usage example of parallel copies, we will see later in this chapter an example of null pointer analysis in Figure 13.4.
We call joins the program points that have one successor and multiple prede-
cessors. For instance, two different definitions of the same variable v might be
associated with two different constants; hence, providing two different pieces
of information about v. To prevent these definitions from reaching the same use of v, we merge them at the earliest program point where they meet. We do this via our
well-known φ-functions.
In backward analyses the information that emerges from different uses of
a variable may reach the same branch point, which is a program point with a
unique predecessor and multiple successors. To ensure Property 1, the use that
reaches the definition of a variable must be unique, in the same way that in
an SSA-form program the definition that reaches a use is unique. We ensure
this property via special instructions called σ-functions. The σ-functions are
the dual of φ-functions, performing a parallel assignment depending on the
execution path taken. The assignment

(l_1 : v_1^1, . . . , l_q : v_1^q) = σ(v_1) ∥ . . . ∥ (l_1 : v_m^1, . . . , l_q : v_m^q) = σ(v_m)

represents m σ-functions that assign to each variable v_i the value in v_i^j if control flows into block l_j. As with φ-functions, these assignments happen in parallel, i.e., the m σ-functions encapsulate m parallel copies. Also, notice that variables live in different branch targets are given different names by the σ-function.
Let us consider a unidirectional forward (resp. backward) PLV problem E_dense^ssi stated as a set of equations [v]^p ⊑ F_v^{s→p}([v_1]^s, . . . , [v_n]^s) (or, equivalently, [v]^p = [v]^p ∧ F_v^{s→p}([v_1]^s, . . . , [v_n]^s)) for every variable v, each program point p, and each
s ∈ preds(p ) (resp. s ∈ succs(p )). To simplify the discussion, any φ-function (resp.
σ-function) is seen as a set of copies, one per predecessor (resp. successor),
which leads to as many constraints. In other words, a φ-function such as p : a = φ(a_1 : l_1, . . . , a_m : l_m) gives us m constraints such as

[a]^p ⊑ F_a^{l_j→p}([a_1]^{l_j}, . . . , [a_n]^{l_j}),

which usually simplifies into [a]^p ⊑ [a_j]^{l_j}. This last can be written equivalently into the classical meet

[a]^p ⊑ ⋀_{l_j ∈ preds(p)} [a_j]^{l_j},

which usually simplifies into [a_j]^{l_j} ⊑ [a]^p. Given a program that fulfills the SSI
property for E_dense^ssi and the set of transfer functions F_v^s, we show here how to build an equivalent sparse constraint system.
• For each instruction i that defines (resp. uses) a variable v, let a . . . z be the set of used (resp. defined) variables. Because of the Link property, F_v^{s→p} (which we will denote F_v^i from now on) depends only on some [a]^s . . . [z]^s. Thus, there
Algorithm 13.1: Backward propagation engine under SSI
1 worklist ← ∅
2 foreach v ∈ vars do [v ] ← top
3 foreach i ∈ insts do push(worklist, i )
4 while worklist ≠ ∅ do
5 i ← pop(worklist)
6 foreach v ∈ i.uses do
7 [v ]new ← [v ] ∧ G vi ([i.defs])
8 if [v ] 6= [v ]new then
9 worklist ← worklist ∪ v.defs
10 [v ] ← [v ]new
the same left-hand-side. Technically, this is the reason why we manipulate a constraint system (a system with inequations) and not an equation system as in Chapter 8. Both systems can be solved using a scheme known as chaotic iteration,¹ such as the worklist algorithm we provide here. The slight and important difference for a constraint system, as opposed to an equation system, is that one
needs to meet G vi (. . . ) with the old value of [v ] to ensure the monotonicity of
the consecutive values taken by [v]. It would still be possible to enforce the SSU property, in addition to the SSA property, of our intermediate representation at the expense of adding more φ-functions and σ-functions. However, this guarantee is not necessary for every sparse analysis. The dead-code elimination
problem illustrates well this point: for a program under SSA form, replacing G vi
in Algorithm 13.1 by the property “i is a useful instruction or one of the variables
it defines is marked as useful” leads to the standard SSA-based dead-code elimi-
nation algorithm. The sparse constraint system does have several equations (one
per variable use) for the same left-hand-side (one for each variable). It is not
necessary to enforce the SSU property in this instance of dead-code elimination,
and doing so would lead to a less efficient solution in terms of compilation time
and memory consumption. In other words, a code under SSA form fulfills the
SSI property for dead-code elimination.
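A compact sketch of that standard SSA-based dead-code elimination, written as a backward worklist over use-def chains, is given below; the layout of the use/def maps is an assumption of this sketch, not the book's notation.

def ssa_dce(inst_uses, def_inst, roots):
    """inst_uses: instruction id -> SSA names it reads;
    def_inst:  SSA name -> instruction id that defines it;
    roots:     instructions that are useful by definition (stores, calls,
               returns, and branches needed for control flow)."""
    useful = set(roots)
    worklist = list(roots)
    while worklist:
        inst = worklist.pop()
        for v in inst_uses.get(inst, ()):      # walk use-def chains backwards
            d = def_inst.get(v)
            if d is not None and d not in useful:
                useful.add(d)
                worklist.append(d)
    return useful      # instructions outside this set define only dead values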
We start this section by revisiting the initial example of data-flow analysis of this chapter, given in Figure 13.1. A range analysis acquires information from either
the points where variables are defined, or from the points where variables are
tested. In the original figure we know that i must be bound to the interval [0, 0]
immediately after instruction l 1 . Similarly, we know that this variable is upper
bounded by 100 when arriving at l 4 , due to the conditional test that happens
before. Therefore, in order to achieve the SSI property, we should split the live-
ranges of variables at their definition points, or at the conditionals where they are
used. Figure 13.2 shows on the left the original example after live-range splitting.
In order to ensure the SSI property in this example, the live-range of variable i
must be split at its definition, and at the conditional test. The live-range of s , on
the other hand, must be split only at its definition point, as it is not used in the
conditional. Splitting at conditionals is done via σ-functions. The representation
¹ In the ideal world with monotone framework and lattice of finite height.
Split program:
  i1 = 0
  s1 = 0
  i2 = φ(i1, i4)
  s2 = φ(s1, s3)
  (i2 < 100)?
  (⊥, i3) = σ(i2)
  i4 = i3 + 1
  s3 = s2 + i4
  ret s2
Solution of the range analysis:
  [i1] = [i1] ∪ [0, 0] = [0, 0]
  [s1] = [s1] ∪ [0, 0] = [0, 0]
  [i2] = [i2] ∪ [i1] ∪ [i4] = [0, 100]
  [s2] = [s2] ∪ [s1] ∪ [s3] = [0, +∞[
  [i3] = [i3] ∪ ([i2] ∩ ]−∞, 99]) = [0, 99]
  [i4] = [i4] ∪ ([i3] + 1) = [1, 100]
  [s3] = [s3] ∪ ([s2] + [i4]) = [1, +∞[
Fig. 13.2 Live-range splitting on Figure 13.1 and a solution to this instance of the range analysis problem.
Class inference
Some dynamically typed languages, such as Python, JavaScript, Ruby or Lua, represent objects as tables containing methods and fields. It is possible to improve
the execution of programs written in these languages if we can replace these
simple tables by actual classes with virtual tables. A class inference engine tries
to assign a class to a variable v based on the ways that v is defined and used.
Figure 13.3 illustrates this optimization on a Python program (a). Our objective
is to infer the correct suite of methods for each object bound to variable v . Fig-
ure 13.3c shows the results of a dense implementation of this analysis. Because
type inference is a backward analysis that extracts information from use sites,
we split live-ranges using parallel copies at these program points, and rely on
σ-functions to merge them back, as shown on Figure 13.3b. The use-def chains
that we derive from the program representation lead naturally to a constraint
system, shown on Figure 13.3d, where [v j ] denotes the set of methods associated
with variable v j . A fix-point to this constraint system is a solution to our data-
flow problem. This instance of class inference is a Partitioned Variable Problem
(PVP),² because the data-flow information associated with a variable v can be
computed independently from the other variables.
² Actually, class inference is no longer a PVP as soon as we want to propagate the information through copies.
(Figure 13.3 panels: (a) the original Python program; (b) the program after live-range splitting, with parallel copies at the uses of v and a φ-function v6 ← φ(v4, v5); (c) the dense solution per program point; (d) the constraint system derived from the use-def chains and its solution.)
Fig. 13.3 Class inference analysis as an example of backward data-flow analysis that takes information from the uses of variables.
The objective of null pointer analysis is to determine which references may hold
null values. This analysis allows compilers to remove redundant null-exception
tests and helps developers find null pointer dereferences. Figure 13.4 illustrates
this analysis. Because information is produced not only at definition but also at use sites, we split live-ranges after each variable is used, as shown in Figure 13.4b.
For instance, we know that v2 cannot be null, otherwise an exception would
have been thrown during the invocation v1 .m(); Hence the call v2 .m() cannot
result in a null pointer dereference exception. On the other hand, we notice in
Figure 13.4c that the state of v4 is the meet of the state of v3 , definitely not-null,
and the state of v1 , possibly null, and we must conservatively assume that v4 may
be null.
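The lattice and the meet used in this example can be encoded as in the following toy sketch; the two-point lattice encoding is an illustrative assumption, and the variable names mirror Figure 13.4.

MAYBE_NULL, NOT_NULL = 'maybe-null', 'not-null'

def meet(a, b):
    # Conservative meet: a reference is not-null only if it is not-null
    # along every incoming path.
    return NOT_NULL if a == NOT_NULL and b == NOT_NULL else MAYBE_NULL

state = {}
state['v1'] = MAYBE_NULL                        # v1 <- foo(): nothing is known
state['v2'] = NOT_NULL                          # v1.m() did not throw, so v2 <- v1 is not null
state['v3'] = NOT_NULL                          # likewise after v2.m()
state['v4'] = meet(state['v3'], state['v1'])    # v4 <- phi(v3, v1)
assert state['v4'] == MAYBE_NULL                # conservatively maybe-null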
(a) l1: v ← foo(); l2: v.m(); l3: v.m(); l4: v.m()
(b) l1: v1 ← foo(); l2: v1.m() ∥ v2 ← v1; l3: v2.m() ∥ v3 ← v2; v4 ← φ(v3, v1); l4: v4.m()
(c) [v1] = [v1] ∧ 0 = 0; [v2] = [v2] ∧ A0 = A0; [v3] = [v3] ∧ A0 = A0; [v4] = [v4] ∧ ([v3] ∧ [v1]) = 0
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.2 Construction and destruction of the intermediate program representation
In the previous section we have seen how the static single information property
gives the compiler the opportunity to solve a data-flow problem sparsely. How-
ever, we have not yet seen how to convert a program to a format that provides
the SSI property. This is a task that we address in this section, via the three-step
algorithm from Section 13.2.2.
A live-range splitting strategy Pv = I↑ ∪ I↓ over a variable v consists of a set of
“oriented” program points. We let I↓ denote a set of points i with forward direction.
Similarly, we let I↑ denote a set of points i with backward direction. The live-
range of v must be split at least at every point in Pv . Going back to the examples
from Section 13.1.6, we have the live-range splitting strategies enumerated below.
The list in Figure 13.5 gives further examples of live-range splitting strategies.
Corresponding references are given in the last section of this chapter.
• Range analysis is a forward analysis that takes information from points where
variables are defined and conditional tests that use these variables. For in-
stance, in Figure 13.1, we have Pi = {l 1 , Out(l 3 ), l 4 }↓ where Out(l i ) is the exit
of l i (i.e., the program point immediately after l i ), and Ps = {l 2 , l 5 }↓ .
• Class inference is a backward analysis that takes information from the uses
of variables; thus, for each variable, the live-range splitting strategy is char-
acterized by the set Uses↑ where Uses is the set of use points. For instance, in
Figure 13.3, we have Pv = {l 4 , l 6 , l 7 }↑ .
Fig. 13.5 Live-range splitting strategies for different data-flow analyses. Defs (resp. Uses) denotes the set
of instructions that define (resp. use) the variable; Conds denotes the set of instructions that
apply a conditional test on a variable; Out(Conds) denotes the exits of the corresponding basic
blocks; LastUses denotes the set of instructions where a variable is used, and after which it is
no longer live.
• Null pointer analysis takes information from definitions and uses and prop-
agates this information forward. For instance, in Figure 13.4, we have that
Pv = {l 1 , l 2 , l 3 , l 4 }↓ .
necessary. As we have pointed out in Section 13.1.4, we might have, for the same original variable, many different sources of information reaching a common
program point. For instance, in Figure 13.1, there exist two definitions of variable
i , e.g., l 1 and l 4 , that reach the use of i at l 3 . The information that flows forward
from l 1 and l 4 collides at l 3 , the loop entry. Hence the live-range of i has to be split
immediately before l 3 —at In(l 3 )—leading, in our example, to a new definition i 1 .
In general, the set of program points where information collides can be easily
characterized by the notion of join sets and iterated dominance frontier (DF+ )
seen in Chapter 4. Similarly, split sets created by the backward propagation of
information can be over-approximated by the notion of iterated post-dominance
frontier (pDF+), which is the dual of DF+. That is, the post-dominance frontier is the dominance frontier in a CFG where the direction of the edges has been reversed.
Note that, just as the notion of dominance requires the existence of a unique
entry node that can reach every CFG node, the notion of post dominance requires
the existence of a unique exit node reachable by any CFG node. For control-flow
graphs that contain several exit nodes or loops with no exit, we can ensure the
single-exit property by creating a dummy common exit node and inserting some
never-taken exit edges into the program.
7 else S↑ ← S↑ ∪ Out(pDF+ (i ))
8 S↓ ← ;
9 foreach i ∈ S↑ ∪ Defs(v ) ∪ I↓ do
10 if i .is_branch then
11 foreach e ∈ outgoing_edges (i ) do
12 S↓ ← S↓ ∪ In(DF+ (e ))
13 else S↓ ← S↓ ∪ In(DF+ (i ))
14 S ← Pv ∪ S↑ ∪ S↓
 Split live-range of v by inserting φ, σ, and copies
15 foreach i ∈ S do
16 if i does not already contain any definition of v then
17 if i .is_join then insert “v ← φ(v, ..., v )” at i
18 else
19 if i .is_branch then insert “(v, ..., v ) ← σ(v )" at i
20 else insert a copy “v ← v " at i
Fig. 13.7 Live-range splitting. In(l ) denotes a program point immediately before l , and Out(l ) a program
point immediately after l .
Figure 13.7 shows the algorithm that we use to create new definitions of variables. This algorithm has three main phases. First, in lines 2-7 we create new definitions to split the live-ranges of variables due to backward collisions of
information. These new definitions are created at the iterated post-dominance
frontier of points that originate information. If a program point is a join node,
then each of its predecessors will contain the live-range of a different definition
of v , as we ensure in lines 5-6 of our algorithm. Notice that these new definitions
are not placed parallel to an instruction, but in the region immediately after it,
which we denote by “Out(. . . ).” In lines 8-13 we perform the inverse operation:
we create new definitions of variables due to the forward collision of information.
Our starting points S↓ , in this case, include also the original definitions of v , as
we see in line 9, because we want to stay in SSA form in order to have access to a
fast liveness check as described in Chapter 9. Finally, in lines 14-20 we actually
insert the new definitions of v . These new definitions might be created by σ
functions (due to Pv or to the splitting in lines 2-7); by φ-functions (due to Pv
or to the splitting in lines 8-13); or by parallel copies.
The rename algorithm in Figure 13.8 builds def-use and use-def chains for a pro-
gram after live-range splitting. This algorithm is similar to the classic algorithm
used to rename variables during the SSA construction that we saw in Chapter 3.
To rename a variable v we traverse the program’s dominance tree, from top to
bottom, stacking each new definition of v that we find. The definition currently
on the top of the stack is used to replace all the uses of v that we find during the
traversal. If the stack is empty, this means that the variable is not defined at this
point. The renaming process replaces the uses of undefined variables by ⊥ (see
comment of function stack.set_use). We have two methods, stack.set_use
and stack.set_def, that build the chains of relations between variables. Notice
that sometimes we must rename a single use inside a φ-function, as in lines 16-17
of the algorithm. For simplicity we consider this single use as a simple assignment
when calling stack.set_use, as one can see line 17. Similarly, if we must rename
a single definition inside a σ-function, then we treat it as a simple assignment,
like we do in lines 12-14 of the algorithm.
Just as Algorithm 3.7, the algorithm in Figure 13.9 eliminates φ-functions and
parallel copies that define variables not actually used in the code. By symmetry,
it also eliminates σ-functions and parallel copies that use variables not actually
defined in the code. We mean by “actual” instructions those that already existed
1 Function rename(var v)
 Compute use-def & def-use chains.
2 stack ← ;
3 foreach CFG node n in dominance order do
4 if ∃v ← φ(v : l 1 , . . . , v : l q ) in In(n ) then
5 stack.set_def(v ← φ(v : l 1 , . . . , v : l q ))
6 foreach instruction u in n that uses v do
7 stack.set_use(u)
8 if ∃ instruction d in n that defines v then
9 stack.set_def(d )
10 foreach instruction (. . .) ← σ(v ) in Out(n) do
11 stack.set_use((. . .) ← σ(v ))
12 if ∃ (v : l 1 , . . . , v : l q ) ← σ(v ) in Out(n ) then
13 foreach v : l i ← v in (v : l 1 , . . . , v : l q ) ← σ(v ) do
14 stack.set_def(v : l i ← v )
in the program before we transformed it with split. In line 2, “web” is fixed to the set of versions of v, so as to restrict the cleaning process to variable v, as we see in the first two loops. The “active” set is initialized to the actual instructions in line 4. Then, during the first loop (lines 5-8), we augment it with φ-functions, σ-functions, and
copies that can reach actual definitions through use-def chains. The correspond-
ing version of v is hence marked as defined (line 8). The next loop, lines 11-14
performs a similar process, this time to add to the active set instructions that
1 Function clean(var v )
2 let web = {vi |vi is a version of v }
3 defined ← ∅
4 active ← {inst | inst actual instruction and web ∩ inst.defs ≠ ∅}
5 while ∃inst ∈ active | web ∩ inst.defs \ defined ≠ ∅ do
6 foreach vi ∈ web ∩ inst.defs \ defined do
7 active ← active ∪ Uses(vi )
8 defined ← defined ∪ {vi }
9 used ← ∅
10 active ← {inst | inst actual instruction and web ∩ inst.uses ≠ ∅}
11 while ∃inst ∈ active | inst.uses \ used ≠ ∅ do
12 foreach vi ∈ web ∩ inst.uses\used do
13 active ← active ∪ Def (vi )
14 used ← used ∪ {vi }
Fig. 13.9 Dead and undefined code elimination. Original instructions not inserted by split are called
actual instructions. inst.defs denotes the set of variables defined by inst, and inst.uses
denotes the set of variables used by inst.
can reach actual uses through def-use chains. The corresponding version of v is
then marked as used (line 14). Each non-live variable, i.e., either undefined or dead (not used), and hence not in the “live” set (line 15), is replaced by ⊥ in all φ, σ, or copy functions where it appears, by the loop of lines 15-18. Finally, every useless φ, σ, or copy function is removed by lines 19-20.
Implementing σ-functions:
Fig. 13.10 (a) implementing σ-functions via single-arity φ-functions; (b) getting rid of copies and σ-functions.
φ-function for v . This case happens when the control-flow edge l → l j is critical:
a critical edge links a basic block with several successors to a basic block with
several predecessors. If l j already contains a φ-function v 0 ← φ(. . . , v j , . . .), then
we rename v j to v .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
13.3 Further readings
virtually every important compiler. Many compiler textbooks describe the theoretic basis of the notions of lattice, monotone data-flow framework, and fixed
points. For a comprehensive overview of these concepts, including algorithms
and formal proofs, we refer the interested reader to Nielson et al.’s book [217] on
static program analysis.
The original description of the intermediate program representation known
as Static Single Information form was given by Ananian in his Master’s thesis [9].
The notation for σ-functions that we use in this chapter was borrowed from
Ananian’s work. The SSI program representation was subsequently revisited by
Jeremy Singer in his PhD thesis [271]. Singer proposed new algorithms to convert
programs to SSI form, and also showed how this program representation could
be used to handle truly bidirectional data-flow analyses. We did not discuss
bidirectional data-flow problems, but the interested reader can find examples
of such analyses in Khedker et al.’s work [173]. Working on top of Ananian’s and
Singer’s work, Boissinot et al. [40] have proposed a new algorithm to convert
a program to SSI form. Boissinot et al. have also separated the SSI program
representation in two flavors, which they call weak and strong. Tavares et al. [290]
have extended the literature on SSI representations, defining building algorithms
and giving formal proofs that these algorithms are correct. The presentation that
we use in this chapter is mostly based on Tavares et al.’s work.
There exist other intermediate program representations that, like the SSI
form, make it possible to solve some data-flow problems sparsely. Well-known
among these representations is the Extended Static Single Assignment form,
introduced by Bodik et al. to provide a fast algorithm to eliminate array bound
checks in the context of a JIT compiler [37]. Another important representation,
which supports data-flow analyses that acquire information at use sites, is the
Static Single Use form (SSU). As uses and definitions are not fully symmetric (the live-range can “traverse” a use while it cannot traverse a definition), there exist different variants of SSU [237, 130, 194]. For instance, the “strict” SSU
form enforces that each definition reaches a single use, whereas SSI and other
variations of SSU allow two consecutive uses of a variable on the same path. All
these program representations are very effective, having seen use in a number
of implementations of flow analyses; however, they only fit specific data-flow
problems.
The notion of Partitioned Variable Problem (PVP) was introduced by Zadeck,
in his PhD dissertation [323]. Zadeck proposed fast ways to build data-structures
that allow one to solve these problems efficiently. He also discussed a number
of data-flow analyses that are partitioned variable problems. There are data-
flow analyses that do not meet the Partitioned Lattice per Variable property.
Notable examples include abstract interpretation problems on relational domains, such as Polyhedra [91], Octagons [207] and Pentagons [195].
In terms of data-structures, the first, and best known method proposed to sup-
port sparse data-flow analyses is Choi et al.’s Sparse Evaluation Graph (SEG) [71].
The nodes of this graph represent program regions where information produced by the data-flow analysis might change. Choi et al.'s ideas have been further developed in subsequent work.
CHAPTER 14
Graphs and Gating Functions J. Stanier
Many compilers represent the input program as some form of graph in order to
support analysis and transformation. Over time, a cornucopia of program graphs has been presented in the literature and subsequently implemented in real
compilers. Many of these graphs use SSA concepts as the core principle of their
representation, ranging from literal translations of SSA into graph form to more
abstract graphs which are implicitly in SSA form. We aim to introduce a selection
of program graphs which use these SSA concepts, and examine how they may be
useful to a compiler writer.
A well-known graph representation is the Control-Flow Graph (CFG) which
we encountered at the beginning of the book whilst being introduced to the
core concept of SSA. The CFG models control flow in a program, but the graphs
that we will study instead model data flow. This is useful as a large number of
compiler optimizations are based on data-flow analysis. In fact, all of the graphs that we consider in this chapter are data-flow graphs.
In this chapter, we will look at a number of SSA-based graph representations.
An introduction to each graph will be given, along with diagrams to show how
sample programs look when translated into that particular graph. Additionally,
we will describe the problems that each graph was created to solve, with refer-
ences to the literature for further research.
For this chapter, we assume that the reader already has familiarity with SSA
(see Chapter 1) and the applications that it is used for.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.1 Data-flow graphs
Since all of the graphs in this chapter are data-flow graphs, let us define them. A
data-flow graph (DFG) is a directed graph G = (V, E) where the edges E represent the flow of data from the result of one instruction to the input of another. An instruction executes once all of its input data values have been computed. When an instruction executes, it produces a new data value which is propagated to other connected instructions.
Whereas the CFG imposes a total ordering on instructions - the same ordering
that the programmer wrote them in - the DFG has no such concept of ordering;
it just models the flow of data. This means that it typically needs a companion
representation such as the CFG to ensure that optimized programs are still correct.
However, with access to both the CFG and DFG, optimizations such as dead code
elimination, constant folding and common subexpression elimination can be
performed effectively. But this comes at a price: keeping both graphs updated
during optimization can be costly and complicated.
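As a concrete (if simplified) illustration, a DFG node can be modelled as an operation plus the list of nodes whose results it consumes; the class and field names below are illustrative assumptions, not an API from this chapter.

class DFGNode:
    """One instruction in a data-flow graph; it consumes the results of its operands."""
    def __init__(self, op, operands=()):
        self.op = op                      # e.g. "+", "phi", "const", "print"
        self.operands = list(operands)    # nodes whose values flow into this one

    def ready(self, computed):
        # An instruction may execute once all of its input values have been computed.
        return all(o in computed for o in self.operands)

# c = a + b: the addition node can fire only after both constant nodes have fired.
a, b = DFGNode("const"), DFGNode("const")
c = DFGNode("+", [a, b])
assert c.ready({a, b}) and not c.ready({a})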
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.2 The SSA graph
We begin our exploration with a graph that is a literal representation of SSA: the
SSA Graph. The SSA Graph can be constructed from a program in SSA form by explicitly adding use-def chains. To demonstrate what the graph looks like, we
present some sample code in Figure 14.1 which is then translated into an SSA
Graph.
An SSA Graph consists of vertices that represent instructions (such as + and
print) or φ-functions, and directed edges that connect uses to definitions of
values. The outgoing edges of a vertex represent the arguments required for that
instruction, and the ingoing edge(s) to a vertex represent the propagation of
the instruction’s result(s) after they have been computed. We call these types
of graph demand-based representations. This is because in order to compute an instruction, we must first demand the results of the operands.
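Constructing the graph amounts to adding one demand edge from every use to the unique definition of the value it consumes. A minimal sketch, assuming a simple instruction record with dest and args fields (an illustrative model, not the book's):

from collections import namedtuple

Ins = namedtuple("Ins", "dest op args")   # one SSA instruction

def build_ssa_graph(instructions):
    # Single assignment: each variable name has exactly one defining instruction.
    def_of = {ins.dest: ins for ins in instructions if ins.dest is not None}
    edges = []
    for ins in instructions:
        for v in ins.args:
            if v in def_of:
                edges.append((ins, def_of[v]))   # demand edge: use -> definition
    return edges

prog = [Ins("x1", "const", []), Ins("y1", "+", ["x1", "x1"])]
print(build_ssa_graph(prog))   # two demand edges from the "+" node to the def of x1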
Although the textual representation of SSA is much easier for a human to read,
the primary benefit of representing the input program in graph form is that the
compiler writer is able to apply a wide array of graph-based optimizations by
using standard graph traversal and transformation techniques.
In the literature, the SSA Graph has been used to detect induction variables in loops (see Chapter 10), for performing instruction selection (see Chapter 19), operator strength reduction, rematerialization, and has been combined with an extended
SSA language to support compilation in a parallelizing compiler. The reader
should note that the exact specification of what constitutes an SSA Graph changes
from paper to paper. The essence of the intermediate representation (IR) has
been presented here, as each author tends to make small modifications for their
particular implementation.
[Fig. 14.1: sample code and its SSA Graph. The code initializes i0 ← 0 and a0 ← 0, loops with i1 ← φ(i0, i2), a1 ← φ(a0, a2), a2 ← a1 ∗ i1, i2 ← i1 + 1, exits when i1 > 100 or a2 > 20, and ends with a3 ← φ(a1, a2), i3 ← φ(i1, i2), print(a3 + i3).]
We illustrate the usefulness of the SSA Graph through a basic induction variable
(IV) recognition technique. A more sophisticated technique is developed in
Chapter 10. Given that a program is represented as an SSA Graph, the task of
finding induction variables is simplified. A basic linear induction variable i is a
variable that appears only in the form:
i = 10
while <cond> do
  ...
  i = i + k
  ...
In the SSA Graph, such induction variables form strongly connected components (SCCs) that can be easily discovered in linear time using any depth-first search traversal (a sketch of this detection follows the list below). Each such SCC must conform to the following constraints:
• The SCC contains only one φ-function, at the header of the loop.
• The SCC contains only addition and subtraction operators, and the right operand of the subtraction is not part of the SCC (e.g. no i = n − i assignments).
• The other operand of each addition or subtraction is loop invariant.
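A small sketch of this detection, under the assumption that SSA-graph nodes carry op, operands and is_loop_invariant fields (hypothetical names); strongly connected components are found with Tarjan's algorithm and then filtered with the constraints above:

def find_sccs(nodes):
    """Tarjan's algorithm over the SSA graph (linear in nodes plus edges)."""
    index, low, on_stack, stack, sccs, counter = {}, {}, set(), [], [], [0]

    def connect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in v.operands:
            if w not in index:
                connect(w); low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            scc = []
            while True:
                w = stack.pop(); on_stack.discard(w); scc.append(w)
                if w is v:
                    break
            sccs.append(scc)

    for v in nodes:
        if v not in index:
            connect(v)
    return sccs

def is_basic_iv(scc):
    phis = [n for n in scc if n.op == "phi"]
    if len(phis) != 1:                              # exactly one phi, at the loop header
        return False
    for n in scc:
        if n is phis[0]:
            continue
        if n.op not in ("+", "-"):                  # only additions and subtractions
            return False
        if n.op == "-" and n.operands[1] in scc:    # no i = n - i style assignments
            return False
        outside = [o for o in n.operands if o not in scc]
        if len(outside) != 1 or not outside[0].is_loop_invariant:
            return False                            # the other operand is loop invariant
    return True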
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.3 Program dependence graph
The Program Dependence Graph (PDG) represents both control and data depen-
dencies together in one graph. The PDG was developed to support optimizations
requiring reordering of instructions and graph rewriting for parallelism, as the
strict ordering of the CFG is relaxed and complemented by the presence of data
dependence information. The PDG is a directed graph G = (V , E ) where nodes
V are statements, predicate expressions or region nodes, and edges E represent
either control or data dependencies. Thus, the set of all edges E has two dis-
tinct subsets: the control dependence subgraph EC and the data dependence
subgraph ED .
Statement nodes represent instructions in the program. Predicate nodes test
a conditional statement and have true and false edges to represent the choice
taken on evaluation of the predicate. Region nodes group control dependen-
cies with identical source and label together. If the control dependence for a
region node is satisfied, then it follows that all of its children can be executed.
Thus, if a region node has three different control-independent statements as
immediate children, then those statements could potentially be executed in
parallel. Diagrammatically, rectangular nodes represent statements, diamond
nodes predicates, and circular nodes are region nodes. Dashed edges represent
control dependence, and solid edges represent data dependence. Loops in the
PDG are represented by back edges in the control dependence subgraph. We
show example code translated into a PDG in Figure 14.2.
Building a PDG is a multi-stage process in which control dependence conditions are grouped together and region nodes are inserted so that predicate nodes only have two successors.
[Fig. 14.2: example code (i ← 1; a ← 2 ∗ B[i]; A[i] ← a; i ← i + 1; return a, with predicates i > 100 and a > 20) translated into a PDG with Entry, region nodes R1, R2, R3, predicate nodes, and statement nodes.]
To begin with, an unpruned PDG is created by checking,
for each node of the CFG, which control region it depends on. This is done by
traversing the post dominator tree in post order, and mapping sets of control
dependencies to region nodes. For each node N visited in the post dominator
tree, the map is checked for an existing region node with the same set CD of
control dependencies. If none exists, a new region node R is created with these
control dependencies and entered into the map. R is made to be the only control
dependence predecessor of N . Next, the intersection INT of CD is computed
for each immediate child of N in the post dominator tree. If INT = CD then the
corresponding dependencies are removed from the child and replaced with a
single dependence on the child’s control predecessor. Then, a pass over the graph
is made to make sure that each predicate node has a unique successor for each
boolean value. If more than one exists, the corresponding edges are replaced by
a single edge to a freshly created region node that itself points to the successor
nodes.
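A sketch of the region-node mapping just described, assuming each CFG node already carries its control-dependence set and its post-dominator-tree children (field and class names are illustrative; the INT = CD factoring of the children is elided):

class RegionNode:
    def __init__(self, deps):
        self.deps = deps          # the set of control dependencies this region stands for
        self.children = []        # nodes whose control dependence is satisfied by it

def postorder(node):
    for c in node.pdom_children:
        yield from postorder(c)
    yield node

def insert_region_nodes(pdom_root):
    region_for = {}               # frozenset of control dependencies -> shared region node
    for n in postorder(pdom_root):
        deps = frozenset(n.control_deps)
        r = region_for.get(deps)
        if r is None:             # no region with this exact dependence set yet
            r = RegionNode(deps)
            region_for[deps] = r
        r.children.append(n)      # r becomes n's only control-dependence predecessor
    return region_for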
Finally, the data dependence subgraph is generated. This begins with the con-
struction of DAGs for each basic block where each upwards reaching leaf is called
a merge node. Data flow analysis is used to compute reaching definitions. All
individual DAGs are then connected together: edges are added from definitions
nodes to the corresponding merge nodes that may be reached. The resulting
graph is the data dependence subgraph, and PDG construction is complete.
The PDG has been used for generating code for parallel architectures and has
also been used in order to perform accurate program slicing and testing.
[Figure: a loop example (i ← 1; a ← 2 ∗ B[i]; A[i] ← a; i ← i + 1; return a) shown next to its Gated SSA form, in which gates such as p ← i1 > 100, q ← a2 > 20 and a1 ← φentry(⊥, a2) appear explicitly.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.4 Gating functions and GSA
[Fig. 14.4: structured code with nested conditionals on p, q and r defining a1, a2 and a3, shown next to the φif gates derived from the control dependencies of their operands.]
The most natural way to compute the gates uses a data-flow analysis that computes, for each program point and each variable, its unique reaching definition and the associated set of reaching paths. This set of paths is abstracted using a path expression. If the code is not already under SSA, and if at a merge point of the CFG its predecessor basic blocks are reached by different variables, a φ-function is inserted. The gate of each operand is set to the path expression of its corresponding incoming edge. If a unique variable reaches all the predecessor basic blocks, the corresponding path expressions are merged. Of course, a classical path compression technique can be used to minimize the number of visited edges. One can observe the similarities with the φ-function placement algorithm described in Section 4.4.
There also exists a relationship between the control dependencies and the
gates: from a code already under strict and conventional SSA form, one can
derive the gates of a φif function from the control dependencies of its operands.
This relationship is illustrated by Figure 14.4 in the simple case of a structured
code.
These gating functions are important as the concept will form components
of the Value State Dependence Graph later. GSA has seen a number of uses in
the literature including analysis and transformations based on data flow. With
the diversity of applications (see chapters 23 and 10), many variants of GSA
have been proposed. Those variations concern the correct handling of loops in
addition to the computation and representation of gates.
By using gating functions it becomes possible to construct IRs based solely
on data dependencies. These IRs are sparse in nature compared to the CFG,
making them good for analysis and transformation. This is also a more attractive
proposition than generating and maintaining both a CFG and DFG, which can be
complex and prone to human error. One approach has been to combine both of
these into one representation, as is done in the PDG. Alternatively, we can utilize
gating functions along with a data-flow graph for an effective way of representing
whole program information using data-flow information.
GSA is useful for performing symbolic analysis. Traditionally, symbolic analysis is performed by forward propagation of expressions through a program. However, complete forward substitution is expensive and can result in a large quantity of unused information and complicated expressions. Instead, backward, demand-driven substitutions can be performed using GSA, which only substitute needed information. Consider the following program:
1 JMAX ← EXPR
2 if p then
3 J ← JMAX − 1
4 else
5 J ← JMAX
6 assert (J ≤ JMAX)
With forward substitution, the use of J in the assertion expands to JMAX − 1 when p holds and to JMAX otherwise; in both cases J ≤ JMAX, thus the assert() statement evaluates to true. In real, non-trivial programs, these expressions can get unnecessarily long and complicated.
Using GSA instead allows for backwards, demand-driven substitutions. The
program above has the following GSA form:
1 JMAX1 ← EXPR
2 if p then
3 J1 ← JMAX1 − 1
4 else
5 J2 ← JMAX1
6 J3 ← φif (p , J1 , J2 )
7 assert (J3 ≤ JMAX1 )
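To make the demand-driven idea concrete, the toy sketch below (an illustrative encoding, not the chapter's algorithm) stores the gated definitions of this program and checks the assertion by expanding only the definitions the query needs, under both values of the gate p:

defs = {
    'J1': lambda env: env['JMAX1'] - 1,                      # J1 = JMAX1 - 1
    'J2': lambda env: env['JMAX1'],                          # J2 = JMAX1
    # J3 = phi_if(p, J1, J2): select J1 when the gate p holds, J2 otherwise
    'J3': lambda env: value('J1', env) if env['p'] else value('J2', env),
}

def value(name, env):
    """Evaluate `name`, expanding only the definitions it actually demands."""
    return defs[name](env)

# The assertion J3 <= JMAX1 holds for either value of the gate p
# (JMAX1 is sampled arbitrarily here).
for p in (True, False):
    env = {'p': p, 'JMAX1': 42}
    assert value('J3', env) <= env['JMAX1']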
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.5 Value state dependence graph
The gating functions defined in the previous section were used in the develop-
ment of a sparse data-flow graph IR called the Value State Dependence Graph
(VSDG). The VSDG is a directed graph consisting of operation nodes, loop and
merge nodes together with value and state dependency edges. Cycles are permit-
ted but must satisfy various restrictions. A VSDG represents a single procedure:
this matches the classical CFG. An example VSDG is shown in Figure 14.5.
[Fig. 14.5: an example VSDG for a function fac(), with entry node N0, return node N∞, and value and state (STATE) dependency edges.]
γ-nodes
The γ-node is similar to the φif gating function in being dependent on a control
predicate, rather than the control-independent nature of SSA φ-functions. A
γ-node γ(C : p , T : vtrue , F : vfalse ) evaluates the condition dependency p , and
returns the value of vtrue if p is true, otherwise vfalse . We generally treat γ-nodes
as single-valued nodes (contrast θ -nodes, which are treated as tuples), with the
effect that two separate γ-nodes with the same condition can be later combined
into a tuple using a single test. Figure 14.6 illustrates two γ-nodes that can be
combined in this way. Here, we use a pair (2-tuple) of values for ports T and F. We also see how two syntactically different programs can map to the
same structure in the VSDG.
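In other words, a γ-node behaves like a predicate-controlled select, and two γ-nodes sharing a condition can be folded into one node over a tuple of values. A minimal sketch (the function form is an illustration, not a VSDG implementation):

def gamma(p, v_true, v_false):
    """gamma(C: p, T: v_true, F: v_false): return v_true if p holds, else v_false."""
    return v_true if p else v_false

p = True                        # some condition dependency
x = gamma(p, 2, 4)              # two gamma-nodes guarded by the same predicate ...
y = gamma(p, 3, 5)
x, y = gamma(p, (2, 3), (4, 5)) # ... combined into one gamma-node over a 2-tuple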
θ -nodes
The θ -node models the iterative behavior of loops, modeling loop state with the
notion of an internal value which may be updated on each iteration of the loop.
It has five specific elements which represent dependencies at various stages of computation. The θ-node corresponds to a merge of the φentry and φexit nodes in Gated SSA. A θ-node θ(C: p, I: vinit, R: vreturn, L: viter, X: vexit) sets its
internal value to initial value vinit then, while condition value p holds true, sets
viter to the current internal value and updates the internal value with the repeat
value vreturn . When p evaluates to false computation ceases and the last internal
value is returned through vexit .
A loop which updates k variables will have: a single condition p, initial values vinit^1, . . . , vinit^k, loop iterations viter^1, . . . , viter^k, loop returns vreturn^1, . . . , vreturn^k, and loop exits vexit^1, . . . , vexit^k. The example in Figure 14.7 also shows a pair (2-tuple) of values being used on ports I, R, L, X, one for each loop-variant value.
The θ -node directly implements pretest loops (while, for); post-test loops
(do...while, repeat...until) are synthesized from a pre-test loop preceded
by a duplicate of the loop body. At first this may seem to cause unnecessary
duplication of code, but it has two important benefits: (1) it exposes the first
loop body iteration to optimization in post-test loops (cf. loop-peeling), and
(2) it normalizes all loops to one loop structure, which both reduces the cost of
optimization, and increases the likelihood of two schematically dissimilar loops
being isomorphic in the VSDG.
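Operationally, a θ-node can be read as the pre-test loop it models: the internal value starts at I, is exposed through L on every iteration, is replaced by R while C holds, and leaves through X when C fails. A small sketch (the Python encoding is illustrative):

def theta(cond, init, repeat):
    """theta(C: cond, I: init, R: repeat, L: iteration value, X: exit value).
    `init` is a tuple with one entry per loop-variant value; `cond` and `repeat`
    are functions of the current internal tuple."""
    internal = init
    while cond(internal):            # pre-test: the condition is checked first
        internal = repeat(internal)  # L feeds the body, R becomes the new value
    return internal                  # X: the last internal value leaves the loop

# The loop of Figure 14.7: i starts at 0, j at some value; while i < 10 the body
# performs j <- j - 1 and i <- i + 1 (the initial j is an arbitrary example value).
i_final, j_final = theta(lambda s: s[0] < 10, (0, 100),
                         lambda s: (s[0] + 1, s[1] - 1))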
[Fig. 14.6: two syntactically different programs, one assigning x and y in a single if-else on p (x ← 2, y ← 3 / x ← 4, y ← 5), the other using two separate if-else statements on the same predicate p, both mapping to the same VSDG with a single γ-node over the 2-tuples (2, 3) and (4, 5).]
[Fig. 14.7: a loop (j ← . . .; i ← 0; while i < 10 do j ← j − 1; i ← i + 1; . . . ← j) and its θ-node, with 2-tuples of values on the I, R, L and X ports.]
State nodes
Loads and stores compute both a value and a state. The call node takes both the name
of the function to call and a list of arguments, and returns a list of results; it is
treated as a state node as the function body may read or update state.
We maintain the simplicity of the VSDG by imposing the restriction that all
functions have one return node (the exit node N∞ ), which returns at least one
result (which will be a state value in the case of void functions). To ensure that
function calls and definitions are able to be allocated registers easily, we suppose
that the number of arguments to, and results from, a function is smaller than the
number of physical registers—further arguments can be passed via a stack as
usual.
Note also that the VSDG neither forces loop-invariant code into nor out of loop bodies, but rather allows later phases to determine the placement of loop-invariant nodes by adding serializing edges.
become dead after some other optimization. Thus, a dead node is a node that is not post-dominated by the exit node N∞. To perform dead node elimination, only two passes are required over the VSDG, resulting in linear runtime complexity: one pass to identify all of the live nodes, and a second pass to delete the unmarked (i.e., dead) nodes. It is safe because all nodes which are deleted are guaranteed never to be reachable from the return node.
5 Function WalkAndMark(n, G)
6   if n is marked then return
7   mark n
8   foreach node m ∈ N such that (n, m) ∈ (EV ∪ ES) do
9     WalkAndMark(m, G)
10 Function DeleteMarked(G)
11   foreach node n ∈ N do
12     if n is unmarked then delete(n)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14.6 Further readings
A compiler’s intermediate representation can be a graph, and many different
graphs exist in the literature. We can represent the control flow of a program
as a Control-Flow Graph (CFG) [5], where straight-line instructions are con-
tained within basic blocks and edges show where the flow of control may be
transferred to once leaving that block. A CFG is traditionally used to convert a
program to SSA form [94]. We can also represent programs as a type of Data-
flow Graph (DFG) [107, 108], and SSA can be represented in this way as an SSA
Graph [88]. An example was given that used the SSA Graph to detect a variety of induction variables in loops [315, 132]. It has also been used for performing instruction selection techniques [113, 264], operator strength reduction [88], rematerialization [55], and has been combined with an extended SSA language to aid compilation in a parallelizing compiler [283].
The Program Dependence Graph (PDG) as defined by Ferrante et al. [124] represents control and data dependencies in one graph. Their definition of control dependencies turns out to be equivalent to the post-dominance frontier.
CHAPTER 15
Psi-SSA Form F. de Ferrière
In the SSA representation, each definition of a variable is given a unique name,
and new pseudo definitions are introduced on φ-functions to merge values
coming from different control-flow paths. An example is given in figure 15.1(b).
Each definition is an unconditional definition, and the value of a variable is the
value of the expression on the unique assignment to this variable. This essential
property of the SSA representation no longer holds when definitions may be conditionally executed. When a variable is defined by a predicated operation,
the value of the variable will or will not be modified depending on the value
of a guard register. As a result, the value of the variable after the predicated
operation is either the value of the expression on the assignment if the predicate
is true, or the value the variable had before this operation if the predicate is false.
This is represented in figure 15.1(c) where we use the notation p ? a = op to
indicate that an operation a = op is executed only if predicate p is true, and is ignored otherwise. We will also use the notation p̄ to refer to the complement of
predicate p. The goal of the ψ-SSA form advocated in this chapter is to express
these conditional definitions while keeping the static single assignment property.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15.1 Definition and construction

Predicated operations are used to convert control-flow regions into straight-line code. Predicated operations may be used by the intermediate representation in an early stage of the compilation process as a result of inlining intrinsic functions. Later on, the compiler may also generate predicated operations through if-conversion optimizations as described in Chapter 20.
In figure 15.1(c), the use of a on the last instruction refers to the variable a 1 if p
is false, or to the variable a 2 if p is true. These multiple reaching definitions on the
use of a cannot be represented by the standard SSA representation. One possible
representation would be to use the Gated-SSA form, presented in Chapter 14. In
such a representation, the φ-function would be augmented with the predicate
p to tell which value between a 1 and a 2 is to be considered. However, Gated-
SSA is a completely different intermediate representation where the control
flow is no longer represented. This representation is more suited for program
interpretation than for optimizations at code generation level as addressed in
this chapter. Another possible representation would be to add a reference to
a1 on the definition of a2. p ? a2 = op2 | a1 would have the following semantics: a2 takes the value computed by op2 if p is true, or holds the value of a1 if p is
false. The use of a on the last instruction of Figure 15.1(c) would now refer to the
variable a 2 , which holds the correct value. The drawback of this representation
is that it adds dependencies between operations (here a flow dependence from
op1 to op2), which would prevent code reordering for scheduling.
Our solution is presented in figure 15.1(d). The φ-function of the SSA code
with control flow is “replaced” by a ψ-function on the corresponding predicated
code, with information on the predicate associated with each argument. This rep-
resentation is adapted to code optimization and code generation on a low-level
intermediate representation. A ψ-function a0 = ψ(p1?a1, . . . , pi?ai, . . . , pn?an)
defines one variable, a 0 , and takes a variable number of arguments a i ; each
argument a i is associated with a predicate pi . In the notation, the predicate pi
will be omitted if pi ≡ true.
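Reading the arguments in their definition order, the value of a ψ-function is the value of the last argument whose predicate holds (in well-formed code at least one predicate holds on every path). A toy sketch of this semantics, with an illustrative encoding:

def psi(*gated_args):
    """gated_args: (predicate, value) pairs, listed in definition (dominance) order."""
    result = None
    for pred, val in gated_args:
        if pred:
            result = val          # a later true predicate overrides earlier ones
    return result

# x2 = psi(p?a1, not-p?a2, q?a3) with p false and q true selects a3:
p, q = False, True
a1, a2, a3 = 1, -1, 0
assert psi((p, a1), (not p, a2), (q, a3)) == a3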
(a) control-flow code:
if (p)
then  a1 = 1;
else  a2 = −1;
x1 = φ(a1, a2)
if (q)
then  a3 = 0;
x2 = φ(x1, a3)

(b) Predicated code:
p ? a1 = 1;
p̄ ? a2 = −1;
x1 = ψ(p?a1, p̄?a2)
q ? a3 = 0;
x2 = ψ(p?a1, p̄?a2, q?a3)
We take the convention that the order of the arguments in a ψ-function is, from left to right, equal to the original order of their definitions, from top to bottom, in the control-flow dominance tree of the program in a non-SSA representation. This information is needed to maintain the correct semantics of the code during transformations of the ψ-SSA representation and to revert the code back to a non-ψ-SSA representation.
The construction of the ψ-SSA representation is a small modification of the standard algorithm to build an SSA representation (see Section 3.1). The insertion of ψ-functions is performed during the SSA renaming phase. During the SSA renaming phase, basic blocks are processed in their dominance order, and operations in each basic block are scanned from top to bottom. On an operation, for each predicated definition of a variable, a new ψ-function is inserted just after the operation: consider the definition of a variable x under predicate p2 (p2 ? x = op); suppose x1 is the current version of x before processing op, and that x1 is defined through predicate p1 (possibly true); after renaming x into a freshly created version, say x2, a ψ-function of the form x = ψ(p1?x1, p2?x) is inserted right after op. Then renaming of this new operation proceeds. The
first argument of the ψ-function is already renamed and thus is not modified.
The second argument is renamed into the current version of x which is x2 . On
the definition of the ψ-function, the variable x is given a new name, x3 , which
becomes the current version for further references to the x variable. This insertion
and renaming of a ψ-function is shown on Figure 15.3.
(a) Initial:       p2 ? x = op
(b) ψ-insertion:   p2 ? x = op
                   x = ψ(p1?x1, p2?x)
(c) op-renaming:   p2 ? x2 = op
                   x = ψ(p1?x1, p2?x)
(d) ψ-renaming:    p2 ? x2 = op
                   x3 = ψ(p1?x1, p2?x2)
Fig. 15.3 Construction and renaming of ψ-SSA
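A sketch of this insertion-and-renaming step in code, using an illustrative Op record and simple dictionaries for the renaming state (not the chapter's data structures):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Op:
    predicate: str                                   # guard, "true" if unpredicated
    dest: str                                        # defined variable
    kind: str = "op"                                 # "op" or "psi"
    args: List[Tuple[str, str]] = field(default_factory=list)   # (pred, var) pairs

def rename_predicated_def(op, cur_version, def_pred, counters, out):
    """Rename a predicated definition `p2 ? x = op` and insert its psi-function."""
    x = op.dest
    x1 = cur_version[x]                              # version of x live before op
    p1 = def_pred.get(x1, "true")                    # predicate of x1's definition
    counters[x] += 1
    x2 = f"{x}{counters[x]}"                         # op-renaming: x becomes x2
    op.dest = x2
    def_pred[x2] = op.predicate
    out.append(op)
    counters[x] += 1
    x3 = f"{x}{counters[x]}"                         # psi-renaming: result is x3
    out.append(Op("true", x3, "psi", [(p1, x1), (op.predicate, x2)]))
    def_pred[x3] = "true"                            # the psi itself is unpredicated
    cur_version[x] = x3                              # later uses of x now see x3
    return x3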
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15.2 SSA algorithms
With this definition of the ψ-SSA representation, implicit data-flow links to pred-
icated operations are now explicitly expressed through ψ-functions. Usual algo-
rithms that perform optimizations or transformations on the SSA representation
can now be easily adapted to the ψ-SSA representation, without compromising
the efficiency of the transformations performed. Actually, within the ψ-SSA repre-
sentation, predicated definitions behave exactly the same as non-predicated ones for optimizations on the SSA representation. Only the ψ-functions have to be treated in a specific way. As an example, the classical constant propagation algorithm under SSA can be easily adapted to the ψ-SSA representation. In this algorithm, the only modification is that ψ-functions have to be handled with the same rules as the φ-functions. Other algorithms such as dead code elimination (see Chapter 3), global value numbering, partial redundancy elimination (see Chapter 11), and induction variable analysis (see Chapter 10) are examples of algorithms that can easily be adapted to this representation with minor effort.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15.3 Psi-SSA algorithms
ψ-inlining recursively replaces in a ψ-function an argument ai that is defined by another ψ-function with the arguments of this other ψ-function. The predicate pi associated with argument ai is distributed with an and operation over the predicates associated with the inlined arguments. This is shown in figure 15.4.
(a) before ψ-inlining:
a1 = op1
p2 ? a2 = op2
x1 = ψ(a1, p2?a2)
p3 ? a3 = op3
x2 = ψ(p1?x1, p3?a3)

(b) after ψ-inlining:
a1 = op1
p2 ? a2 = op2
x1 = ψ(a1, p2?a2)   // dead
p3 ? a3 = op3
x2 = ψ(p1?a1, p1∧p2?a2, p3?a3)

Fig. 15.4 ψ-inlining of x1 into the ψ-function defining x2
Fig. 15.5 ψ-reduction. The first argument a 1 of the ψ-function can safely be removed
p2 ? a 2 = op2 p2 ? a 2 = op2
p2 ? a 3 = op3 p2 ? a 3 = op3
x2 = ψ(p2 ?a 2 , p2 ?a 3 ) x2 = ψ(p2 ?a 2 , p2 ?a 3 )
x3 = ψ(p2 ?a 2 )
p2 ? y1 = x2 p2 ? y1 = x3
Fig. 15.6 ψ-projection of x2 on p2 . Second argument a 3 can be removed.
p2 ? a 3 = op3 p2 ? a 3 = op3
p2 ? a 2 = op2 p2 ? a 2 = op2
x2 = ψ(p2 ?a 2 , p2 ?a 3 ) x2 = ψ(p2 ?a 3 , p2 ?a 2 )
Fig. 15.7 ψ-permutation of arguments a 2 and a 3
ψ-promotion changes one of the predicates used in a ψ-function into a larger predicate. Promotion must obey the following condition so that the semantics of the ψ-function is not altered by the transformation: consider an operation
a 0 = ψ(p1 ?x1 , ..., pi ?xi , ..., pn ?xn ) promoted into a 0 = ψ(p1 ?x1 , ..., pi0 ?xi , ..., pn ?xn )
with pi ⊆ pi0 , then pi0 must fulfill
(p′i \ (pi ∪ . . . ∪ pn)) ∩ (p1 ∪ . . . ∪ pi−1) = ∅     (15.1)

where p′i \ (pi ∪ . . . ∪ pn) corresponds to the possible increase of the predicate of the ψ-function, p1 ∪ . . . ∪ pn. This promotion must also satisfy the properties of ψ-functions,
and in particular, that the predicate associated with a variable in a ψ-function
must be included in or equal to the predicate on the definition of that variable
(which itself can be a ψ-function). A simple ψ-promotion is illustrated in Fig-
ure 15.8(c).
The ψ-SSA representation can be used on a partially predicated architecture,
where only a subset of the instructions supports a predicate operand. Figure 15.8
shows an example where some code with control-flow edges was transformed
into a linear sequence of instructions. Taking the example of an architecture
[Fig. 15.8: (a) an if-then-else computing a1 = ADD i1, 1 and a2 = ADD i1, 2 and merging with x = φ(a1, a2); (b) the if-converted code, with x = ψ(p?a1, p̄?a2); (c) the code after speculation of the ADD, with x = ψ(a1, p̄?a2).]
where the ADD operation cannot be predicated, the ADD operation must be speculated under the true predicate. On an architecture where the ADD operation can be predicated, it may also be profitable to perform speculation in order to reduce the number of predicates on predicated code and to reduce the number of operations to compute these predicates. Once speculation has been performed on the definition of a variable used in a ψ-function, the predicate associated with this argument can be promoted, provided that the semantics of the ψ-function is maintained (Equation 15.1).
Usually, the first argument of a ψ-function can be promoted under the true
predicate. Also, when disjoint conditions are computed, one of them can be
promoted to include the other conditions, usually reducing the number of predi-
cates. A side effect of this transformation is that it may increase the number of
copy instructions to be generated during the ψ-SSA destruction phase, as will be
explained in the following section.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15.4 Psi-SSA destruction
The SSA destruction phase reverts an SSA representation into a non-SSA represen-
tation. This phase must be adapted to the ψ-SSA representation. This algorithm
uses ψ-φ-webs to create a conventional ψ-SSA representation. The notion of φ-webs is extended to φ- and ψ-functions so as to derive the notion of conventional ψ-SSA (ψ-C-SSA) form. A ψ-φ-web is a non-empty, minimal set of variables such that if two variables are referenced on the same φ- or ψ-function then they are in the same ψ-φ-web. The property of the ψ-C-SSA form is that the
renaming into a single variable of all variables that belong to the same ψ-φ-web,
and the removal of the ψ and φ functions, results in a program with the same
semantics as the original program.
Now, consider Figure 15.9 to illustrate the transformations that must be per-
formed to convert a program from a ψ-SSA form into a program in ψ-C-SSA
form.
Looking at the first example (Figure 15.9(a)), the dominance order of the
definitions for the variables a and b differs from their order from left to right in
the ψ-function. Such code may appear after a code motion algorithm has moved
the definitions for a and b relative to each other. Here, the renaming of the
variables a, b and x into a single variable will not restore the semantics of the
original program. The order in which the definitions of the variables a, b and x
occur must be corrected. This is done through the introduction of the variable
c that is defined as a predicated copy of the variable b, after the definition of a.
Now, the renaming of the variables a, c and x into a single variable will result in
the correct behavior.
In Figure 15.9(d) the definition of the variable b has been speculated. However,
the semantics of the ψ-function is that the variable x will only be assigned the
value of b when p is true. A new variable c must be defined as a predicated copy
of the variable b, after the definition of b and p; in the ψ-function, variable b is
then replaced by variable c. The renaming of variables a, c and x into a single
variable will now follow the correct behavior.
In Figure 15.9(g), the renaming of the variables a, b, c, x and y into a single
variable will not give the correct semantics. In fact, the value of a used in the sec-
ond ψ-function would be overridden by the definition of b before the definition
of the variable c. Such code will occur after copy folding has been applied on a
ψ-SSA representation. We see that the value of a has to be preserved before the
definition of b. This is done through the definition of a new variable (d here), resulting in the code given in Figure 15.9(h). Now, the variables a, b and x can be renamed into a single variable, and the variables d, c and y will be renamed into another variable, resulting in a program in a non-SSA form with the correct behavior.
We will now present an algorithm that will transform a program from a ψ-SSA
form into its ψ-C-SSA form. This algorithm is made of three parts.
15.4.1 Psi-normalize
We define the notion of a normalized-ψ. The normalized form of a ψ-function has two characteristics: its arguments appear, from left to right, in the dominance order of their definitions, and the predicate associated with each argument is equal to the predicate used on the definition of that argument.
15.4.2 Psi-web
The role of the psi-web phase is to repair the ψ-functions that are part of a non-interference-free ψ-web. This case corresponds to the example presented in Figure 15.9(g). In the same way as there is a specific point of use for arguments on φ-functions for liveness analysis (e.g., see Section 21.2), we give a definition of the actual point of use of arguments on normalized ψ-functions for liveness analysis. With this definition, liveness analysis is computed accurately and an interference graph can be built. The cases where repair code is needed can be easily and accurately detected by observing that variables in a ψ-function interfere.
1
When a i is defined by a ψ-function, its definition may appear after the definition for a i −1 ,
although the non-ψ definition for a i appears before the definition for a i −1 .
a = op1 a = op1
p ? b = op2 b = p ? op2 : a
q ? c = op3 c = q ? op3 : b
x = ψ(a , p ?b , q ?c ) x =c
Given this definition of point of use of ψ-function arguments, and using the
usual point of use of φ-function arguments, a traditional liveness analysis can
be run. Then an interference graph can be built to collect the interferences
between variables involved in ψ or φ-functions. For the construction of the
interference graph, an interference between two variables that are defined on
disjoint predicates can be ignored.
ψ-webs are initialized with a single variable per ψ-web. Then, ψ-functions are processed one at a time, in no specific order, merging the ψ-webs of their operands together when they do not interfere. Two ψ-webs interfere if at least one variable in the first
ψ-web interferes with at least one variable in the other one. The arguments of
the ψ-function, say a 0 = ψ(p1 ?a 1 , ..., pi ?a i , ..., pn ?a n ), are processed from right
(a n ) to left (a 1 ). If the ψ-web that contains a i does not interfere with the ψ-web
that contains a 0 , they are merged together. Otherwise, repair code is needed. A
new variable, a i0 , is created and is initialized with a predicated copy pi ? a i0 = a i ,
inserted just above the definition for a i +1 , or just above the ψ-function in case
of the last argument. The current argument a i in the ψ-function is replaced
by the new variable a i0 . The interference graph is updated. This can be done
by considering the set of variables, say U, that ai interferes with. For each u ∈ U, if u is in the merged ψ-web, it should not interfere with a′i; if the definition of u dominates the definition of ai, it is live through the definition of ai, thus it should be made to interfere with a′i; last, if the definition of ai dominates the definition of u, it should be made to interfere only if this definition is within the live-range of a′i (see Chapter 9).
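The processing of one ψ-function can be sketched as follows; the web representation, the interference test and the insertion of repair copies are passed in as illustrative helpers rather than the actual implementation:

def process_psi(dest, args, web_of, interferes, emit_copy, fresh_name):
    """args: (predicate, variable) pairs in left-to-right order; web_of maps a
    variable to the set of variables in its psi-web."""
    def webs_interfere(w1, w2):
        return any(interferes(u, v) for u in w1 for v in w2)

    def merge(w1, w2):
        merged = w1 | w2
        for v in merged:
            web_of[v] = merged
        return merged

    web = web_of[dest]
    for i in range(len(args) - 1, -1, -1):            # from a_n down to a_1
        pred, a = args[i]
        if not webs_interfere(web_of[a], web):
            web = merge(web_of[a], web)               # coalesce a_i with the web
        else:
            a_new = fresh_name(a)                     # repair copy: pred ? a_new = a
            # placed just above the definition of a_{i+1}, or above the psi itself
            # when a_i is the last argument
            emit_copy(pred, a_new, a,
                      above=args[i + 1][1] if i + 1 < len(args) else dest)
            args[i] = (pred, a_new)
            web_of[a_new] = {a_new}
            web = merge(web_of[a_new], web)
            # the interference graph is updated here, as described above
    return web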
Consider the code in Figure 15.11 to see how this algorithm works. The live-
ness on the ψ-function creates a live-range for variable a that extends down to
the definition of b, but not further down. Thus, the variable a does not interfere
with the variables b, c or x. The live-range for variable b extends down to its
use in the definition of variable d. This live-range interferes with the variables
c and x. The live-range for variable c extends down to its use in the ψ-function
that defines the variable x. At the beginning of the processing of the ψ-function x = ψ(p?a, q?b, r?c), the ψ-webs are the singletons {a}, {b}, {c}, {x} and {d}. The argument list is processed from right to left, i.e., starting with variable c. {c} does not interfere with {x}, so they can be merged together, resulting in psiWeb = {x, c}.
Then, variable b is processed. Since it interferes with both x and c, repair code
is needed. A variable b’ is created, and is initialized just below the definition for
b, as a predicated copy of b. The interference graph is updated conservatively,
with no changes. psiWeb now becomes {x , b 0 , c }. Then variable a is processed,
and as no interference is encountered, {a } is merged to psiWeb. The final code
after SSA destruction is shown in Figure 15.11(c).
[Fig. 15.11: (a) before processing the ψ-function; (b) after processing the ψ-function; (c) after actual coalescing.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15.5 Further readings
In this chapter we mainly described the ψ-SSA representation and we detailed
specific transformations that can be performed thanks to this representation.
More details on the implementation of the ψ-SSA algorithms, and figures on the
benefits of this representation, can be found in [284] and [105].
We mentioned in this chapter that a number of classical SSA-based algorithms can be easily adapted to the ψ-SSA representation, usually by just adapting the rules on the φ-functions to the ψ-functions. Among these algorithms, we can mention the constant propagation algorithm described in [311], dead code elimination [210], global value numbering [78], partial redundancy elimination [73] and induction variable analysis [315], which have already been implemented in a ψ-SSA framework.
There are also other SSA representations that can handle predicated instructions, one of which is the Predicated SSA representation [62]. This representation is
CHAPTER 16
Hashed SSA form: HSSA M. Mantione
F. Chow
Hashed SSA (or HSSA for short) is an SSA extension that can effectively represent how aliasing relations affect a program in SSA form. It works equally well for aliasing among scalar variables and, more generally, for indirect load and store operations on arbitrary memory locations. This allows all common SSA-based optimizations to be applied uniformly both to local variables and to external memory areas.
It should be noted, however, that HSSA is a technique useful for representing
aliasing effects, but not for detecting aliasing. For this purpose, a separate alias
analysis pass must be performed, and the effectiveness of HSSA will be influenced
by the accuracy of this analysis.
The following sections explain how HSSA works. Initially, given aliasing infor-
mation, we will see how to represent them in SSA form for scalar variables. Then
we will introduce a technique that reduces the overhead of the above representa-
tion, avoiding an explosion in the number of SSA versions for aliased variables.
Subsequently we will represent indirect memory operations on external memory
areas as operations on "virtual variables" in SSA form, which will be handled uni-
formly with scalar (local) variables. Finally we will apply global value numbering
(GVN) to all of the above, obtaining the so called Hashed SSA form 1 .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.1 SSA and aliasing: µ and χ -functions
Aliasing occurs inside a compilation unit when a single storage location (that contains a value) can potentially be accessed through different program "variables". This can happen in one of the four following ways:
1
The name Hashed SSA comes from the use of hashing in value-numbering
• First, when two or more storage locations partially overlap. This, for instance,
happens with the C "union" construct, where different parts of a program
can access the same storage location under different names.
• Second, when a local variable is referred to by a pointer used in an indirect
memory operation. In this case the variable can be accessed in two ways:
directly, through the variable name, and indirectly, through the pointer that
holds its address.
• Third, when the address of a local variable is passed to a function, which in
turn can then access the variable indirectly.
• Finally, storage locations with global scope can obviously be accessed by dif-
ferent functions. In this case every function call can potentially access every
global location, unless the compiler uses global optimization techniques
where every function is analyzed before the actual compilation takes place.
The real problem with aliasing is that these different accesses to the same
program variable are difficult to predict. Only in the first case (explicitly overlapping locations) does the compiler have full knowledge of when each access takes place.
In all the other cases (indirect accesses through the address of the variable) the
situation becomes more complex, because the access depends on the address
that is effectively stored in the variable used in the indirect memory operation.
This is a problem because every optimization pass is concerned with the actual
value that is stored in every variable, and when those values are used. If variables
can be accessed in unpredictable program points, the only safe option for the
compiler is to handle them as "volatile" and avoid performing optimizations on
them, which is not desirable.
Intuitively, in the presence of aliasing the compiler could try to track the
values of variable addresses inside other variables (and this is exactly what HSSA
does), but the formalization of this process is not trivial. The first thing that is
needed is a way to model the effects of aliasing on a program in SSA form. To do
this, assuming that we have already performed alias analysis, we must formally
define the effects of indirect definitions and uses of variables. Particularly, each
definition can be a "MustDef" operand in the direct case, or a "MayDef" operand
in the indirect case. We will represent MayDef through the use of χ-functions.
Similarly, uses can be "MustUse" or "MayUse" operands (respectively in the direct
and indirect case), and we will represent MayUse through the use of µ-functions.
The semantics of the µ and χ operators can be illustrated through the C-like example
of Figure 16.1 where ∗p represents an indirect access with address p . Obviously
the argument of the µ-operator is the potentially used variable. Less obviously,
the argument to the χ-operator is the assigned variable itself. This expresses the
fact that the χ-operator only potentially modifies the variable, so the original
value could "flow through" it.
The use of µ and χ-operators does not alter the complexity of transforming
a program in SSA form. All that is necessary is a pre-pass that inserts them in
the program. Ideally, a µ and a χ should be placed in parallel?to the instruction
that led to its insertion. Parallel instructions are represented in Figure 16.1 using
the notation introduced in Section 13.1.4. Still, practical implementations may
choose to insert µ and χs before or after the instructions that involve aliasing.
Particularly, µ-functions could be inserted immediately before the involved state-
ment or expression, and χ-operators immediately after it. This distinction allows
us to model call effects correctly: the called function appears to potentially use
the values of variables before the call, and the potentially modified values appear
after the call.
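A sketch of such a pre-pass, where the alias information is abstracted behind two queries (may_use and may_def, illustrative names) that report the variables a statement may indirectly read or write:

def insert_mu_chi(statements, may_use, may_def):
    """Return the statement list annotated with mu/chi markers."""
    annotated = []
    for s in statements:
        # mu-functions go immediately before the statement: a call appears to
        # use the values of aliased variables before it executes ...
        for v in may_use(s):
            annotated.append(("mu", v))
        annotated.append(s)
        # ... and chi-functions go immediately after it: the potentially modified
        # values appear after the call or store. chi takes the variable itself as
        # its argument, because the old value may flow through unchanged.
        for v in may_def(s):
            annotated.append(("chi", v, v))
    return annotated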
Thanks to the systematic insertion of µ-functions and χ-functions, an assign-
ment of any scalar variable can be safely considered dead if it is not marked
live by a standard SSA based dead-code elimination. In our running example
of Figure 16.1(c), the potential side effect of any assignment to i , represented
through the µ-function at the return, allows detecting that the assignment to
i 4 is not dead. The assignment of value 2 to i would have been considered as
dead in the absence of the function call to f (), which potentially uses it (detected
through its corresponding µ-function).
(a) Initial C code:
1 i = 2
2 if j then
3   ...
4 else
5   f()
6   *p = 3
7 i = 4
8 return

(b) After µ and χ insertion:
1 i = 2
2 if j then
3   ...
4 else
5   f() ‖ µ(i)
6   *p = 3 ‖ i = χ(i)
7 i = 4
8 return ‖ µ(i)

(c) After φ insertion and versioning:
1 i1 = 2
2 if j1 then
3   ...
4 else
5   f() ‖ µ(i1)
6   *p = 3 ‖ i2 = χ(i1)
7 i3 = φ(i1, i2)
8 i4 = 4
9 return ‖ µ(i4)

Fig. 16.1 A program example where *p might alias i, and function f might indirectly use i but not alter it
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.2 Introducing “zero versions” to limit the explosion of
the number of variables
While it is true that µ and χ insertion does not alter the complexity of SSA con-
struction, applying it to a production compiler as described in the previous
section would make working with code in SSA form terribly inefficient. This is be-
cause χ-operators cause an explosion in the number of variable values, inducing
the insertion of new φ-functions which in turn create new variable versions. In
practice the resulting IR, and especially the number of distinct variable versions,
would be needlessly large. The biggest issue is that the SSA versions introduced
by χ-operators are useless for most optimizations that deal with variable values. χ definitions add uncertainty to the analysis of variable values: the actual value of a variable after a χ definition could be its original value, or it could be the one indirectly assigned by the χ.
Intuitively, the solution to this problem is to factor together all variable versions that are considered useless, so that no space is wasted to distinguish among them. We assign number 0 to this special variable version, and call it the "zero version".
Our notion of useless versions relies on the concept of "real occurrence of a
variable", which is an actual definition or use of a variable in the original program.
Therefore, in SSA form, variable occurrences in µ, χ and φ-functions are not "real
occurrences". In our example of Figure 16.1, i 2 have no real occurrence while i 1
and i 3 have. The idea is that variable versions that have no real occurrence do
not influence the program output. Once the program is converted back from
SSA form these variables are removed from the code. Since they do not directly
appear in the code, and their value is usually unknown, distinguishing among
them is almost pointless. For those reasons, we consider as zero versions the versions of variables that have no real occurrence, and whose value comes from at least one χ-function (possibly through φ-functions). An equivalent, recursive definition
is the following:
• The result of a χ has zero version if it has no real occurrence.
• If the operand of a φ has zero version, the φ result has zero version if it has
no real occurrence.
Algorithm 16.1 performs zero-version detection when only use-def chains, and not def-use chains, are available. A "HasRealOcc" flag is associated with each variable version; it is set to true whenever a real occurrence is met in the code. This
can be done while constructing the SSA form. A list "NonZeroPhiList", initially
empty, is also associated to each original program variable.
The time spent in the first iteration grows linearly with the number of variable
versions, which in turn is proportional to the number of definitions and therefore
to the code size. On the other hand, the while loop may, in the worst case, iterate
as many times as the longest chain of contiguous φ assignments in the program.
This bound can easily be reduced to the largest loop depth of the program by
traversing the versions using a topological order of the forward control-flow
graph. All in all, zero version detection in the presence of µ and χ-functions
does not change the complexity of SSA construction in a significant way, while
the corresponding reduction in the number of variable versions is definitely
desirable.
This loss of information has almost no consequences on the effectiveness of
subsequent optimization passes. Since variables with zero
versions have uncertain values, not being able to distinguish them usually only
slightly affects the quality of optimizations that operate on values. On the other
hand, when performing sparse dead-code elimination along use-def chains,
zero versions for which use-def chains have been broken must be assumed live.
However this has no practical effect. Zero versions have no real occurrence, so
13 changes = true
14 while changes do
15   changes = false
16   foreach vi ∈ v.NonZeroPhiList do
17     let V = vi.def.operands
18     if ∀vj ∈ V, vj.HasRealOcc then
19       vi.HasRealOcc = true
20       v.NonZeroPhiList.remove(vi)
21       changes = true
22     else if ∃vj ∈ V, vj.version = 0 then
23       vi.version = 0
24       v.NonZeroPhiList.remove(vi)
25       changes = true
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.3 SSA and indirect memory operations: virtual variables
The technique described in the previous sections only applies to "regular" variables in a compilation unit, and not to arbitrary memory locations accessed indirectly. As an example, in Figure 16.1, µ, χ, and φ-functions have been introduced to handle the aliased scalar variable i.
To apply SSA to indirectly accessed memory locations as well, each indirect memory operand is annotated with a virtual variable that stands for the location it may access. At one extreme, there would be one virtual variable for each indirect memory operation. Assignment factoring corresponds to making each virtual variable represent more than one indirect memory operand; at the other extreme, the most factored HSSA form would have a single virtual variable overall. In the example of Figure 16.2(c), we consider as given by alias analysis that b and b + 1 do not alias, and choose two virtual variables to represent the two corresponding distinct memory locations. In the general case, virtual variables can obviously alias with one another, as in Figure 16.2(b).
1 p =b 1 p =b 1 p =b
2 q =b 2 q =b 2 q =b
3 ∗p = . . . 3 ∗p = · · · k v ∗ = χ(v ∗ ) k w ∗ = χ(w ∗ ) 3 ∗p = · · · k x ∗ = χ(x ∗ )
4 p =p +1 4 p =p +1 4 p =p +1
5 · · · = ∗p 5 · · · = ∗p k µ(v ∗ ) k µ(w ∗ ) 5 · · · = ∗p k µ(y ∗ )
6 q =q +1 6 q =q +1 6 q =q +1
7 ∗p = . . . 7 ∗p = · · · k v ∗ = χ(v ∗ ) k w ∗ = χ(w ∗ ) 7 ∗p = · · · k y ∗ = χ(y ∗ )
8 · · · = ∗q 8 · · · = ∗q k µ(v ∗ ) k µ(w ∗ ) 8 · · · = ∗q k µ(y ∗ )
(a) Initial C code (b) v and w alias with ops 3,5,7, and 8 (c) x alias with op 3; y with 5,7, and
8
Fig. 16.2 Some virtual variables and their insertion depending on how they alias with operands.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16.4 GVN and indirect memory operations: HSSA
In the previous sections we sketched the foundations of a framework for dealing
with aliasing and indirect memory operations in SSA form: we identified the
effects of aliasing on local variables, introduced µ and χ-operators to handle
them, applied zero versioning to keep the number of SSA versions acceptable, and
defined virtual variables as a way to apply SSA also to memory locations accessed
indirectly. However, HSSA is complete only once Global Value Numbering is
applied to scalar and virtual variables, handling all of them uniformly (GVN, see Chapter 11).
GVN is normally used as a way to perform redundancy elimination which
means removing redundant expression computations from a program, typically
storing the result of a computation in a temporary variable and reusing it later
instead of performing the computation again. As the name suggests, GVN works by assigning a unique number to every expression in the program, with the idea that expressions identified by the same number are guaranteed to give the same result.
This value number is obtained using a hash function represented here as H(k e y ).
To identify identical expressions, each expression tree is hashed bottom up: as an
example, for p1 + 1, p1 and 1 are replaced by their respective value numbers, then
the expression is put into canonical form, and finally hashed, which ends up as H(+(H(b), H(1))), since H(p1) = H(b) in our example. If the program is in SSA form, computing value numbers that satisfy the above property is straightforward:
variables are defined only once, therefore two expressions that apply the same
operator to the same SSA variables are guaranteed to give the same result and
so can have the same value number. While computing global value numbers
for code that manipulates scalar variables is beneficial because it can be used
to implement redundancy elimination, applying GVN to virtual variables has
additional benefits.
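A toy sketch of this hashing scheme (illustrative, not the HSSA implementation): operands are numbered first, commutative expressions are put into a canonical form, and structurally identical keys share one value number, so p1 + 1 and q1 + 1 end up with the same number when H(p1) = H(q1) = H(b):

class ValueNumbering:
    def __init__(self):
        self.table = {}                  # canonical key -> value number
        self.next = 0

    def number(self, key):
        if key not in self.table:
            self.table[key] = self.next
            self.next += 1
        return self.table[key]

    def expr(self, op, *operand_numbers):
        ops = operand_numbers
        if op in ("+", "*"):             # canonicalize commutative operators
            ops = tuple(sorted(ops))
        return self.number((op,) + tuple(ops))

vn = ValueNumbering()
b, one = vn.number("b"), vn.number(1)
p2 = vn.expr("+", vn.number("b"), one)   # p1 + 1, with H(p1) = H(b)
q2 = vn.expr("+", vn.number("b"), one)   # q1 + 1, with H(q1) = H(b)
assert p2 == q2                          # same value number: same address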
First of all, it can be used to determine when two address expressions compute
the same address: this is guaranteed if they have the same global value number.
In our example, p2 and q2 will have the same value number h7 while p1 will
not. This allows indirect memory operands that have the same GVN for their
address expressions and the same virtual variable version to become a single
entity in the representation. This puts them in the same rank as scalar variables
and allows the transparent application of SSA based optimizations on indirect
memory operations. For instance, in the vector modulus computation described
above, every occurrence of the expression "p->x" will always have the same
GVN and therefore will be guaranteed to return the same value, allowing the
compiler to emit code that stores it in a temporary register instead of performing
the redundant memory reads (the same holds for "p->y"). Similarly, consider
the example of Figure 16.3. The loads of lines 5 and 8 cannot be considered redundant because the versions for v (v1∗ then v2∗ ) are different. On the other
hand the load of line 8 can be safely avoided using a rematerialization of the
value computed in line 7 as both the version for v (v2∗ ) and the value number
for the memory operands are identical. As a last example, if all the associated
virtual variable versions for an indirect memory store (defined in parallel by
χ-functions) are found to be dead, then it can be safely eliminated 3 . In other
words HSSA transparently extends DEADCE to work also on indirect memory
stores.
3
Note that any virtual variable that aliases with a memory region live-out of the compiled pro-
cedure is considered to alias with the return instruction of the procedure, and as a consequence
will lead to a live µ-function
   (a) Initial C code     (b) with one virtual variable        (c) HSSA statements
1  p = b                  p1 = b                               h1 = h0
2  q = b                  q1 = b                               h2 = h0
3  *p = 3                 *p1 = 3 ∥ v1* = χ(v0*)               h5 = h3 ∥ h4 = χ(H(v0*))
4  p = p + 1              p2 = p1 + 1                          h8 = h7
5  ... = *p               ... = *p2 ∥ µ(v1*)                   ... = h10 ∥ µ(h4)
6  q = q + 1              q2 = q1 + 1                          h11 = h7
7  *p = 4                 *p2 = 4 ∥ v2* = χ(v1*)               h14 = h12 ∥ h13 = χ(h4)
8  ... = *q               ... = *q2 ∥ µ(v2*)                   ... = h14 ∥ µ(h13)
Fig. 16.3 Some code after variable versioning, its corresponding HSSA form, along with its hash table
entries. q1 + 1, which simplifies into +(h0, h6), is hashed to h7, and ivar(*q2, v2*), which simplifies
into ivar(*h7, v2*), is hashed to h14.
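The property exploited here can be reduced to a simple comparison: two indirect memory operands denote the same value when their address expressions share a value number and they carry the same virtual variable version. A minimal sketch, with invented type names:

#include <tuple>

// An indirect memory operand reduced to what the comparison needs.
struct IndirectOperand {
    int addressValueNumber;       // GVN of the address expression
    int virtualVariableVersion;   // SSA version of the annotating virtual variable
};

// True when two indirect operands are guaranteed to denote the same value, as
// for the store of line 7 and the load of line 8 in Figure 16.3 (both h14),
// but not for the loads of lines 5 and 8 (different virtual variable versions).
bool sameMemoryValue(const IndirectOperand& a, const IndirectOperand& b) {
    return std::tie(a.addressValueNumber, a.virtualVariableVersion) ==
           std::tie(b.addressValueNumber, b.virtualVariableVersion);
}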
16.5 Building HSSA
We now present the HSSA construction algorithm. It is straightforward, because
it is a simple composition of µ and χ insertion, zero versioning and virtual
variable introduction (described in previous sections), together with regular SSA
renaming and GVN application.
Algorithm 16.2: SSA form construction
1. Perform alias analysis and assign a virtual variable to each indirect memory operand
2. Insert µ-functions and χ-functions for scalar and virtual variables
3. Insert φ-functions (considering both regular and χ assignments) as for standard SSA
construction
4. Perform SSA renaming on all scalar and virtual variables as for standard SSA construction
At the end of Algorithm 16.2 we have code in plain SSA form. The use of µ-
and χ-operators guarantees that SSA versions are correct also in the presence of
aliasing. Moreover, indirect memory operations are "annotated" with virtual
variables, and virtual variables themselves carry SSA version numbers. Note,
however, that virtual variables are "artificial" in the code and will not contribute
to the final code generation pass, because what really matters are the indirect
memory operations themselves.
The next steps correspond to Algorithm 16.3, where steps 5 and 6 can be done
using a single traversal of the program. At the end of this phase the code has
exactly the same structure as before, but the number of unique SSA versions has
diminished because of the application of zero versions.
Algorithm 16.3: Detecting zero versions
5. Perform DEADCE (also on χ and φ stores)
6. initialize HasRealOcc and NonZeroPhiList as for Algorithm 16.1, then run Algorithm 16.1
(Zero-version detection)
Algorithm 16.4: Applying GVN
7. Apply GVN to the whole program, processing basic blocks in a pre-order traversal of the
dominator tree, where:
a. expressions are processed bottom up, reusing existing hash table expression nodes
and using var nodes of the appropriate SSA variable version (the current one in the
dominator tree traversal)
b. two ivar nodes have the same value number if these conditions are both true:
• their address expressions have the same value number, and
• their virtual variables have the same versions, or are separated by definitions
that do not alias the ivar (possible to verify because of the dominator tree
traversal order)
As a result of this, each node in the code representation has a proper value
number, and nodes with the same number are guaranteed to produce the same
value (or hold it in the case of variables). The crucial issue is that the code must
be traversed following the dominator tree in pre-order. This is important because
when generating value numbers we must be sure that all the involved definitions
already have a value number. Since the SSA form is a freshly created one, it is
strict (i.e., all definitions dominate their uses). As a consequence, a dominator
tree pre-order traversal satisfies this requirement.
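As a rough illustration, assuming a dominator tree represented with explicit child lists, the required traversal is an ordinary recursive pre-order walk (the per-block value numbering is left as a stub):

#include <cstdio>
#include <vector>

struct DomNode {
    int blockId;
    std::vector<DomNode*> children;   // blocks immediately dominated by this one
};

void applyGVNToBlock(DomNode* b) {
    // A real implementation would hash the block's expressions bottom up here,
    // reusing table entries created while visiting the dominators of b.
    std::printf("value-numbering block %d\n", b->blockId);
}

// Pre-order guarantees a block is processed after all of its dominators, so
// under strict SSA every definition already has a value number at its uses.
void preorder(DomNode* b) {
    applyGVNToBlock(b);
    for (DomNode* child : b->children) preorder(child);
}

int main() {
    DomNode b2{2, {}}, b3{3, {}}, b1{1, {&b2, &b3}}, entry{0, {&b1}};
    preorder(&entry);
    return 0;
}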
Note that after this step virtual variables are not needed anymore, and can
be discarded from the code representation: the information they convey about
aliasing of indirect variables has already been used to generate correct value
numbers for ivar nodes.
Algorithm 16.5: Linking definitions
8. The left-hand side of each assignment (direct and indirect, real, φ and χ) is updated to
point to its var or ivar node (which will point back to the defining statement)
9. All φ, µ, and χ operands are also updated to point to the corresponding GVN table entry
At the end of the last steps listed in Algorithm 16.5, HSSA form is complete,
and every value in the program code is represented by a reference to a node in
the HSSA value table.
16.6 Using HSSA
16.7 Further Readings
Cite the original paper from Zhou et al. [325]. Cite the original paper from Chow
[74]. Cite work done in the GCC compiler (which was later scrapped due to compile
time and memory consumption problems, but the experience is valuable):
http://www.airs.com/dnovillo/Papers/mem-ssa.pdf (Memory SSA - A Unified
Approach for Sparsely Representing Memory Operations, D. Novillo, 2007 GCC
Developers' Summit, Ottawa, Canada, July 2007). Talk about the possible differences
(in terms of notations) that might exist between this chapter and the paper.
Cite the Alpern, Wegman, and Zadeck paper for GVN. Discuss the differences with
Array SSA. Cite the paper mentioned in Fred's paper about factoring. Give some
pointers to computation of alias analysis, but also representation of alias
information (e.g. points-to). Add a reference on the paper about register promotion
and mention the authors call it indirect removal.
CHAPTER 17
Array SSA Form V. Sarkar
K. Knobe
S. Fink
In this chapter, we introduce an Array SSA form that captures element-level data-
flow information for array variables, and coincides with standard SSA form when
applied to scalar variables. Any program with arbitrary control-flow structures
and arbitrary array subscript expressions can be automatically converted to this
Array SSA form, thereby making it applicable to structures, heap objects and any
other data structure that can be modeled as a logical array. A key extension over
standard SSA form is the introduction of a definition-Φ function that is capable
of merging values from distinct array definitions on an element-by-element
basis. There are several potential applications of Array SSA form in compiler
analysis and optimization of sequential and parallel programs. In this chapter,
we focus on sequential programs and use constant propagation as an exemplar
of a program analysis that can be extended to array variables using Array SSA
form, and redundant load elimination as an exemplar of a program optimization
that can be extended to heap objects using Array SSA form.
The rest of the chapter is organized as follows. Section 17.1 introduces full
Array SSA form for run-time evaluation and partial Array SSA form for static
analysis. Section 17.2 extends the scalar SSA constant propagation algorithm to
enable constant propagation through array elements. This includes an extension
to the constant propagation lattice to efficiently record information about array
elements and an extension to the work-list algorithm to support definition-Φ
functions (section 17.2.1), and a further extension to support non-constant (sym-
bolic) array subscripts (section 17.2.2). Section 17.3 shows how Array SSA form
can be extended to support elimination of redundant loads of object fields and ar-
ray elements in strongly typed languages, and section 17.4 contains suggestions
for further reading.
17.1 Array SSA form
1. Renamed array variables: All array variables are renamed so as to satisfy the
static single assignment property. Analogous to standard SSA form, control
Φ operators are introduced to generate new names for merging two or more
prior definitions at control-flow join points, and to ensure that each use
refers to precisely one definition.
2. Array-valued @ variables: For each static definition A j , we introduce an
@ variable (pronounced “at variable”) @A j that identifies the most recent
iteration vector at which definition A j was executed. We assume that all
@ variables are initialized to the empty vector, ( ), at the start of program
execution. Each update of a single array element, A j [k ] := . . ., is followed
by the statement, @A j [k ] := i where i is the iteration vector for the loops
surrounding the definition of A j .
3. Definition Φ’s: A definition-Φ operator is introduced in Array SSA form to deal
with preserving (“non-killing”) definitions of arrays. Consider A 0 and A 1 , two
renamed arrays that originated from the same array variable in the source
program such that A 1 [k ] := . . . is an update of a single array element and A 0 is
the prevailing definition at the program point just prior to the definition of A 1 .
A definition Φ, A 2 := d Φ(A 1 , @A 1 , A 0 , @A 0 ), is inserted immediately after the
definitions for A 1 and @A 1 . Since definition A 1 only updates one element of
A 0 , A 2 represents an element-level merge of arrays A 1 and A 0 . Definition Φ’s
did not need to be inserted in standard SSA form because a scalar definition
completely kills the old value of the variable.
4. Array-valued Φ operators: Another consequence of renaming arrays is that
a Φ operator for array variables must also return an array value. Consider a
(control or definition) Φ operator of the form, A 2 := Φ(A 1 , @A 1 , A 0 , @A 0 ). Its
semantics is defined element by element:

    A2[j] = if @A1[j] ⪰ @A0[j] then A1[j] else A0[j]        (17.1)

where ⪰ compares the recorded iteration vectors, so the more recent definition of element j wins.
The key extension over the scalar case is that the conditional expression
specifies an element-level merge of arrays A 1 and A 0 .
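A small executable rendering of this merge, assuming iteration vectors are flattened to integer timestamps and the element type is fixed to int, might look as follows:

#include <cstddef>
#include <vector>

// Element-level merge performed at run time by A2 := dΦ(A1, @A1, A0, @A0):
// each element comes from whichever definition wrote it more recently.
std::vector<int> dPhi(const std::vector<int>& A1, const std::vector<long>& atA1,
                      const std::vector<int>& A0, const std::vector<long>& atA0) {
    std::vector<int> A2(A0.size());
    for (std::size_t j = 0; j < A0.size(); ++j)
        A2[j] = (atA1[j] >= atA0[j]) ? A1[j] : A0[j];   // Equation (17.1)
    return A2;
}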
Figures 17.1 and 17.2 show an example program with an array variable, and
the conversion of the program to full Array SSA form as defined above.
@A 0 [∗] := ( ) ; @A 1 [∗] := ( )
Fig. 17.2 Conversion of program in figure 17.1 to Full Array SSA Form
We now introduce a partial Array SSA form for static analysis, that serves as
an approximation of full Array SSA form. Consider a (control or definition) Φ
statement, A 2 := Φ(A 1 , @A 1 , A 0 , @A 0 ). A static analysis will need to approximate
the computation of this Φ operator by some data-flow transfer function, L Φ . The
inputs and output of L Φ will be lattice elements for scalar/array variables that
are compile-time approximations of their run-time values. We use the notation
L (V ) to denote the lattice element for a scalar or array variable V . Therefore, the
statement, A 2 := Φ(A 1 , @A 1 , A 0 , @A 0 ), will in general be modeled by the data-flow
equation, L (A 2 ) = L Φ (L (A 1 ), L (@A 1 ), L (A 0 ), L (@A 0 )).
While the runtime semantics of Φ functions for array variables critically de-
pends on @ variables (Equation 17.1), many compile-time analyses do not need
the full generality of @ variables. For analyses that do not distinguish among
iteration instances, it is sufficient to model A 2 := Φ(A 1 , @A 1 , A 0 , @A 0 ) by a data-
flow equation, L (A 2 ) = L φ (L (A 1 ), L (A 0 )), that does not use lattice variables
L (@A 1 ) and L (@A 0 ). For such cases, a partial Array SSA form can be obtained
by dropping @ variables, and using the φ operator, A 2 := φ(A 1 , A 0 )
instead of A 2 := Φ(A 1 , @A 1 , A 0 , @A 0 ). A consequence of dropping @ variables is
that partial Array SSA form does not need to deal with iteration vectors, and
therefore does not require the control-flow graph to be reducible as in full Array
SSA form. For scalar variables, the resulting φ operator obtained by dropping @
variables exactly coincides with standard SSA form.
17.2 Sparse constant propagation of array elements
In this section, we describe the lattice representation used to model array values
for constant propagation. Let U^A_ind and U^A_elem be the universal set of index
values and the universal set of array element values, respectively, for an array
variable A. For an array variable, the set denoted by lattice element L(A) is a
subset of index-element pairs in U^A_ind × U^A_elem. There are three kinds of lattice
elements for array variables that are of interest in our framework:
1. L(A) = top ⇒ SET(L(A)) = { }
This "top" case indicates that the set of possible index-element pairs that
have been identified thus far for A is the empty set, { }.
2. L(A) = 〈(i1, e1), (i2, e2), . . .〉
⇒ SET(L(A)) = {(i1, e1), (i2, e2), . . .} ∪ (U^A_ind − {i1, i2, . . .}) × U^A_elem
The lattice element for this "constant" case is represented by a finite list
of constant index-element pairs, 〈(i1, e1), (i2, e2), . . .〉. The constant indices,
i1, i2, . . ., must represent distinct (non-equal) index values. The meaning of
this lattice element is that the current stage of analysis has identified the
constant values e1, e2, . . . for the array elements at indices i1, i2, . . ., while
the elements at all other indices may take any value.
3. L(A) = ⊥ ⇒ SET(L(A)) = U^A_ind × U^A_elem
This "bottom" case indicates that no useful information is available: any index
may be paired with any element value.
L(A1[k])                  | L(k) = top | L(k) = Constant                                                      | L(k) = ⊥
L(A1) = top               | top        | top                                                                  | ⊥
L(A1) = 〈(i1, e1), . . .〉  | top        | e_j, if ∃ (i_j, e_j) ∈ L(A1) with DS(i_j, L(k)) = true; ⊥, otherwise | ⊥
L(A1) = ⊥                 | ⊥          | ⊥                                                                    | ⊥
Fig. 17.3 Lattice computation for L(A1[k]) = L_[ ](L(A1), L(k)), where A1[k] is an array element read
operator
L(A1)            | L(i) = top | L(i) = Constant   | L(i) = ⊥
L(k) = top       | top        | top               | ⊥
L(k) = Constant  | top        | 〈(L(k), L(i))〉    | ⊥
L(k) = ⊥         | ⊥          | ⊥                 | ⊥
Fig. 17.4 Lattice computation for L(A1) = L_d[ ](L(k), L(i)), where A1[k] := i is an array element write
operator
We now describe how array lattice elements are computed for various op-
erations that appear in Array SSA form. We start with the simplest operation
viz., a read access to an array element. Figure 17.3 shows how L (A 1 [k ]), the
lattice element for array reference A 1 [k ], is computed as a function of L (A 1 )
and L (k ), the lattice elements for A 1 and k . We denote this function by L [ ]
i.e., L (A 1 [k ]) = L [ ] (L (A 1 ), L (k )). The interesting case in figure 17.3 occurs in
the middle cell when neither L (A 1 ) nor L (k ) is top or ⊥. The notation DS in
the middle cell in figure 17.3 represents a “definitely-same” binary relation i.e.,
DS (a , b ) = true if and only if a and b are known to have exactly the same value.
Next, consider a write access of an array element, which in general has the form
A 1 [k ] := i . Figure 17.4 shows how L (A 1 ), the lattice element for the array being
written into, is computed as a function of L (k ) and L (i ), the lattice elements
for k and i . We denote this function by L d [ ] i.e., L (A 1 ) = L d [ ] (L (k ), L (i )). As
before, the interesting case in figure 17.4 occurs in the middle cell when both
L (k ) and L (i ) are constant. For this case, the value returned for L (A 1 ) is simply
the singleton list, 〈 (L (k ), L (i )) 〉, which contains exactly one constant index-
element pair.
Now, we turn our attention to the φ functions. Consider a definition φ op-
eration of the form, A 2 := d φ(A 1 , A 0 ). The lattice computation for L (A 2 ) =
L d φ (L (A 1 ), L (A 0 )) is shown in figure 17.5. Since A 1 corresponds to a defini-
tion of a single array element, the list for L (A 1 ) can contain at most one pair (see
figure 17.4). Therefore, the three cases considered for L (A 1 ) in figure 17.5 are
L (A 1 ) = top, L (A 1 ) = 〈(i 0 , e 0 )〉, and L (A 1 ) = ⊥.
The notation UPDATE((i 0 , e 0 ), 〈(i 1 , e1 ), . . .〉) used in the middle cell in figure 17.5
denotes a special update of the list L (A 0 ) = 〈(i 1 , e1 ), . . .〉 with respect to the con-
stant index-element pair (i 0 , e 0 ). UPDATE involves four steps:
1. Compute the list T = { (i j , e j ) | (i j , e j ) ∈ L (A 0 ) and DD(i 0 , i j ) = true }.
Analogous to DS , DD denotes a “definitely-different” binary relation i.e.,
DD(a , b ) = true if and only if a and b are known to have distinct (non-equal)
values.
2. Insert the pair (i 0 , e 0 ) into T to obtain a new list, I .
3. (Optional) If there is a desire to bound the height of the lattice due to compile-
time considerations, and the size of list I exceeds a threshold size Z , then
one of the pairs in I can be dropped from the output list so as to satisfy the
size constraint.
4. Return I as the value of UPDATE((i 0 , e 0 ), 〈(i 1 , e1 ) , . . .〉).
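A compact sketch of UPDATE follows; the pair and relation types are invented, the DD relation is supplied by the caller, and the optional bound of step 3 simply drops the oldest surviving pair:

#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;                  // (constant index, constant element)
using DDRelation = std::function<bool(int, int)>;  // definitely-different on indices

std::vector<Pair> update(Pair i0e0, const std::vector<Pair>& lA0,
                         const DDRelation& DD, std::size_t Z) {
    std::vector<Pair> result;
    for (const Pair& p : lA0)                 // step 1: keep only the pairs whose index
        if (DD(i0e0.first, p.first))          //         is definitely different from i0
            result.push_back(p);
    result.push_back(i0e0);                   // step 2: insert (i0, e0)
    if (result.size() > Z)                    // step 3: optional lattice height bound
        result.erase(result.begin());
    return result;                            // step 4
}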
L(A2)                 | L(A0) = top | L(A0) = 〈(i1, e1), . . .〉            | L(A0) = ⊥
L(A1) = top           | top         | top                                  | top
L(A1) = 〈(i0, e0)〉    | top         | UPDATE((i0, e0), 〈(i1, e1), . . .〉)   | 〈(i0, e0)〉
L(A1) = ⊥             | ⊥           | ⊥                                    | ⊥
Fig. 17.5 Lattice computation for L(A2) = L_dφ(L(A1), L(A0)), where A2 := dφ(A1, A0)

L(A2) = L(A1) ⊓ L(A0)      | L(A0) = top | L(A0) = 〈(i1, e1), . . .〉 | L(A0) = ⊥
L(A1) = top                | top         | L(A0)                     | ⊥
L(A1) = 〈(i1', e1'), . . .〉 | L(A1)       | L(A1) ∩ L(A0)             | ⊥
L(A1) = ⊥                  | ⊥           | ⊥                         | ⊥
Y [3] := 99
if C then
D [1] := Y [3] ∗ 2
else
D [1] := Y [I ] ∗ 2
endif
Z := D [1]
Fig. 17.8 Array SSA form for the Sparse Constant Propagation Example
Fig. 17.9 Data-flow Equations for the Sparse Constant Propagation Example
Fig. 17.10 Solution to data-flow equations from figure 17.9, assuming I is unknown
Fig. 17.11 Solution to data-flow equations from figure 17.9, assuming I is known to be = 3
k := 2
do i := . . .
...
a [i ] := k ∗ 5
. . . := a [i ]
enddo
The first case is the most conservative solution. In the absence of any other
knowledge, it is always correct to state that DS (I1 , I2 ) = false and DD(I1 , I2 ) =
false.
The problem of determining if two symbolic index values are the same is
equivalent to the classical problem of global value numbering. If two indices i
and j have the same value number, then DS (i , j ) must be true. The problem of
computing DD is more complex. Note that DD, unlike DS , is not an equivalence
relation because DD is not transitive. If DD(A, B ) = true and DD(B , C ) = true, it
does not imply that DD(A, C ) = true. However, we can leverage past work on
array dependence analysis to identify cases for which DD can be evaluated to
true. For example, it is clear that DD(i , i + 1) = true, and that DD(i , 0) = true if i
is a loop index variable that is known to be ≥ 1.
Let us consider how the DS and DD relations for symbolic index values are
used by our constant propagation algorithms. Note that the specification of how
DS and DD are used is a separate issue from the precision of the DS and DD
values. We now describe how the lattice and the lattice operations presented in
section 17.2.1 can be extended to deal with non-constant subscripts.
First, consider the lattice itself. The top and ⊥ lattice elements retain the same
meaning as in section 17.2.1, viz., SET(top) = { } and SET(⊥) = U^A_ind × U^A_elem. Each
element in the lattice is a list of index-value pairs where the value is still required
to be constant but the index may be symbolic — the index is represented by its
value number.
We now revisit the processing of an array element read of A 1 [k ] and the pro-
cessing of an array element write of A 1 [k ]. These operations were presented in
section 17.2.1 (figures 17.3 and 17.4) for constant indices. The versions for non-
constant indices appear in figure 17.13 and figure 17.14. For the read operation in
figure 17.13, if there exists a pair (i j ,e j ) such that DS (i j ,VALNUM(k )) = true (i.e.,
i j and k have the same value number), then the result is e j . Otherwise, the result
is top or ⊥ as specified in figure 17.13. For the write operation in figure 17.14, if
the value of the right-hand-side, i , is a constant, the result is the singleton list
〈(VALNUM(k ), L (i ))〉. Otherwise, the result is top or ⊥ as specified in figure 17.14.
L(A1[k])                  | L(k) = top | L(k) = VALNUM(k)                                                              | L(k) = ⊥
L(A1) = top               | top        | top                                                                           | ⊥
L(A1) = 〈(i1, e1), . . .〉  | top        | e_j, if ∃ (i_j, e_j) ∈ L(A1) with DS(i_j, VALNUM(k)) = true; ⊥, otherwise     | ⊥
L(A1) = ⊥                 | ⊥          | ⊥                                                                             | ⊥
Fig. 17.13 Lattice computation for L(A1[k]) = L_[ ](L(A1), L(k)), where A1[k] is an array element read
operator. If L(k) = VALNUM(k), the lattice value of index k is a value number that represents a
constant or a symbolic value.
L(A1)             | L(i) = top | L(i) = Constant        | L(i) = ⊥
L(k) = top        | top        | top                    | ⊥
L(k) = VALNUM(k)  | top        | 〈(VALNUM(k), L(i))〉    | ⊥
L(k) = ⊥          | ⊥          | ⊥                      | ⊥
Fig. 17.14 Lattice computation for L(A1) = L_d[ ](L(k), L(i)), where A1[k] := i is an array element write
operator. If L(k) = VALNUM(k), the lattice value of index k is a value number that represents a
constant or a symbolic value.
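As an illustration of how these tables can be realized, the following sketch (with an invented lattice representation) implements the read operator of figure 17.13, reducing DS to equality of value numbers:

#include <optional>
#include <utility>
#include <vector>

struct ArrayLattice {                         // lattice value of an array variable
    bool top = false, bottom = false;
    std::vector<std::pair<int, int>> pairs;   // (VALNUM(index), constant element)
};
struct IndexLattice {                         // lattice value of the index k
    bool top = false, bottom = false;
    int valnum = -1;                          // meaningful only if neither flag is set
};
struct ReadResult { bool top, bottom; std::optional<int> constant; };

ReadResult readElement(const ArrayLattice& a1, const IndexLattice& k) {
    if (a1.bottom || k.bottom) return {false, true, std::nullopt};   // any ⊥ gives ⊥
    if (a1.top || k.top)       return {true, false, std::nullopt};   // otherwise top wins
    for (const auto& [vn, elem] : a1.pairs)
        if (vn == k.valnum)                   // DS(i_j, VALNUM(k)) as value-number equality
            return {false, false, elem};
    return {false, true, std::nullopt};       // no matching pair: ⊥
}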
17.3 Extension to objects: redundant load elimination

We introduce a formalism called heap arrays, which allows us to model object references
as associative arrays. An extended Array SSA form is constructed on heap
arrays by adding use-φ functions. For each field x , we introduce a hypothetical
one-dimensional heap array, H x . Heap array H x consolidates all instances of
field x present in the heap. Heap arrays are indexed by object references. Thus,
a GETFIELD of p.x is modeled as a read of element H x [p ], and a PUTFIELD of
q .x is modeled as a write of element H x [q ]. The use of distinct heap arrays for
distinct fields leverages the fact that accesses to distinct fields must be directed
to distinct memory locations in a strongly typed language. Note that field x is
considered to be the same field for objects of types C1 and C2 , if C2 is a subtype
of C1 . Accesses to one-dimensional array objects with the same element type
are modeled as accesses to a single two-dimensional heap array for that element
type, with one dimension indexed by the object reference as in heap arrays for
fields, and the second dimension indexed by the integer subscript.
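The modeling itself is straightforward; in the following toy sketch (not the book's IR), a heap array is simply an associative map from object references to field values, so a PUTFIELD becomes a keyed store and a GETFIELD a keyed load:

#include <unordered_map>

struct Obj {};                                // an object reference is just an address here

int main() {
    std::unordered_map<const Obj*, int> Hx;   // heap array for field x, indexed by reference
    Obj a, b;
    const Obj *p = &a, *q = &b, *r = p;       // r aliases p; q refers to a distinct object
    Hx[p] = 1;                                // p.x := 1   (PUTFIELD modeled as Hx[p] := 1)
    Hx[q] = 2;                                // q.x := 2
    int t = Hx[r];                            // ... := r.x (GETFIELD modeled as a read of Hx[r])
    return t == 1 ? 0 : 1;                    // returns 0: the write to q.x did not affect r.x
}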
Heap arrays are renamed in accordance with an extended Array SSA form that
contains three kinds of φ functions:
1. A control φ from scalar SSA form.
2. A definition φ (d φ) from Array SSA form.
3. A use φ (u φ) function creates a new name whenever a statement reads a
heap array element. u φ functions represent the extension in “extended”
Array SSA form.
The main purpose of the u φ function is to link together load instructions for
the same heap array in control-flow order. While u φ functions are used by the
redundant load elimination optimization presented in this chapter, it is not
necessary for analysis algorithms (e.g., constant propagation) that do not require
the creation of a new name at each use.
In this section, we show how global value numbering and allocation site infor-
mation can be used to efficiently compute definitely-same (DS ) and definitely-
different (DD) information for heap array indices, thereby reducing pointer anal-
ysis queries to array index queries. If more sophisticated pointer analyses are
available in a compiler, they can be used to further refine the DS and DD infor-
mation.
As an example, Figure 17.15 illustrates two different cases of scalar replace-
ment (load elimination) for object fields. The notation Hx[p] refers to a read-
/write access of heap array element, H x [p ]. For the original program in fig-
ure 17.15(a), introducing a scalar temporary T1 for the store (def) of p.x can
enable the load (use) of r.x to be eliminated i.e., to be replaced by a use of T1. Fig-
ure 17.15(b) contains an example in which a scalar temporary (T2) is introduced
for the first load of p.x, thus enabling the second load of r.x to be eliminated
i.e., replaced by T2. In both cases, the goal of our analysis is to determine that
the load of r.x is redundant, thereby enabling the compiler to replace it by a
use of scalar temporary that captures the value in p.x. We need to establish two
facts to perform this transformation: 1) object references p and r are identical
(definitely same) in all program executions, and 2) object references q and r are
distinct (definitely different) in all program executions.
As before, we use the notation V (i ) to denote the value number of SSA variable
i . Therefore, if V (i ) = V ( j ), then DS (i , j ) = true. For the code fragment of
figure 17.15, the statement r := p ensures that p and r are given the same value number,
V (p ) = V (r ), so that DS (p , r ) = true. The problem of computing DD for object
references is more complex than value numbering, and relates to pointer alias
analysis. We outline a simple and sound approach below, which can be replaced
by more sophisticated techniques as needed. It relies on two observations related
to allocation-sites:
1. Object references that contain the results of distinct allocation-sites must be
different.
2. An object reference containing the result of an allocation-site must be differ-
ent from any object reference that occurs at a program point that dom-
inates the allocation site in the control-flow graph. For example, in Fig-
ure 17.15, the presence of the allocation site in q := new Type1 ensures
that DD(p , q ) = true.
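A minimal sketch of DS and DD built only from these ingredients might look as follows; the RefInfo summary is hypothetical, and only the first allocation-site observation is encoded, since the second also needs dominance information:

#include <optional>

// Hypothetical per-reference summary: its global value number and, when the
// reference definitely holds the result of exactly one allocation site, that site's id.
struct RefInfo {
    int valueNumber;
    std::optional<int> allocSite;
};

// DS: equal value numbers imply definitely-same; answering false is always sound.
bool definitelySame(const RefInfo& a, const RefInfo& b) {
    return a.valueNumber == b.valueNumber;
}

// DD: observation 1 only (results of distinct allocation sites must differ);
// observation 2 would additionally need dominance information from the CFG.
bool definitelyDifferent(const RefInfo& a, const RefInfo& b) {
    return a.allocSite && b.allocSite && *a.allocSite != *b.allocSite;
}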
The main program analysis needed to enable redundant load elimination is index
propagation, which identifies the set of indices that are available at a specific
program point.
(a)  original                          after scalar replacement of the store p.x
     r := p                            r := p
     q := new Type1                    q := new Type1
     . . .                             . . .
     p.x := ...   // Hx[p] := ...      T1 := ...
     q.x := ...   // Hx[q] := ...      p.x := T1
     ... := r.x   // ... := Hx[r]      q.x := ...
                                       ... := T1

(b)  original                          after scalar replacement of the load p.x
     r := p                            r := p
     q := new Type1                    q := new Type1
     . . .                             . . .
     ... := p.x   // ... := Hx[p]      T2 := p.x
     q.x := ...   // Hx[q] := ...      ... := T2
     ... := r.x   // ... := Hx[r]      q.x := ...
                                       ... := T2

Fig. 17.15 Examples of scalar replacement
L(A2)            | L(A0) = top | L(A0) = 〈i1, . . .〉       | L(A0) = ⊥
L(A1) = top      | top         | top                       | top
L(A1) = 〈i0〉     | top         | UPDATE(i0, 〈i1, . . .〉)    | 〈i0〉
L(A1) = ⊥        | ⊥           | ⊥                         | ⊥

L(A2)            | L(A0) = top | L(A0) = 〈i1, . . .〉       | L(A0) = ⊥
L(A1) = top      | top         | top                       | top
L(A1) = 〈i0〉     | top         | L(A1) ∪ L(A0)             | L(A1)
L(A1) = ⊥        | ⊥           | ⊥                         | ⊥

L(A2) = L(A1) ⊓ L(A0) | L(A0) = top | L(A0) = 〈i1, . . .〉  | L(A0) = ⊥
L(A1) = top           | top         | L(A0)                | ⊥
L(A1) = 〈i1', . . .〉   | L(A1)       | L(A1) ∩ L(A0)        | ⊥
L(A1) = ⊥             | ⊥           | ⊥                    | ⊥

(a) Extended Array SSA form      (b) After index propagation          (c) After transformation
r := p                                                                r := p
q := new Type1                   L(H0x) = { }                         q := new Type1
. . .                            L(H1x) = {V(p) = V(r)}               . . .
H1x[p] := ...                    L(H2x) = {V(p) = V(r)}               T1 := ...
H2x := dφ(H1x, H0x)              L(H3x) = {V(q)}                      p.x := T1
H3x[q] := ...                    L(H4x) = {V(p) = V(r), V(q)}         q.x := ...
H4x := dφ(H3x, H2x)                                                   ... := T1
... := H4x[r]
Fig. 17.19 Trace of index propagation and load elimination transformation for program in figure 17.15(a)
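The transfer functions used during index propagation can be sketched over plain sets of value numbers (top and ⊥ handling omitted; the DD relation is supplied by the caller), consistent with the tables above:

#include <algorithm>
#include <iterator>
#include <set>

using VNSet = std::set<int>;   // indices (as value numbers) whose element is available

// Store A1[i0]: indices not definitely different from i0 may have been overwritten.
VNSet dPhiTransfer(int i0, const VNSet& a0, bool (*DD)(int, int)) {
    VNSet out;
    for (int i : a0)
        if (DD(i0, i)) out.insert(i);
    out.insert(i0);
    return out;
}

// Load linked by a uφ: the loaded index becomes available in addition to what was.
VNSet uPhiTransfer(const VNSet& a1, const VNSet& a0) {
    VNSet out = a0;
    out.insert(a1.begin(), a1.end());
    return out;
}

// Control-flow join: only indices available on both incoming paths survive.
VNSet controlPhiTransfer(const VNSet& a1, const VNSet& a0) {
    VNSet out;
    std::set_intersection(a1.begin(), a1.end(), a0.begin(), a0.end(),
                          std::inserter(out, out.begin()));
    return out;
}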
CHAPTER 18
SSA Form and Code Generation B. Dupont de Dinechin
• If-conversion using select, conditional move, or predicated instructions (Chapter 20).
• Use of specialized addressing modes such as auto-modified addressing [188] and modulo addressing.
• Exploitation of hardware looping [191] or static branch prediction hints.
• Matching fixed-point arithmetic and SIMD idioms to special instructions.
• Memory hierarchy optimizations, including cache prefetching and register preloading [101].
• VLIW instruction bundling, where parallel instruction groups constructed by postpass instruction scheduling are encoded into instruction bundles [166].
This sophistication of modern compiler code generation motivates the in-
troduction of the SSA form on the machine code program representation in
order to simplify some of the analyses and optimizations. In particular, liveness
analysis (Chapter 9), if-conversion (Chapter 20), unrolling-based loop optimiza-
tions (Chapter 10), and exploitation of special instructions or addressing modes
benefit significantly from the SSA form. Chapter 19 presents an advanced technique
of instruction selection on the SSA form by solving a specialized quadratic
assignment problem (PBQP). Although there is a debate as to whether or not SSA
form should be used in a register allocator, Chapter 22 makes a convincing case
for it. The challenge of correct and efficient SSA form destruction under the
constraints of machine code is addressed in Chapter 21. Finally, Chapter 23 illustrates
how the SSA form has been successfully applied to hardware compilation.
In this chapter, we review some of the issues of inserting the SSA form in a
code generator, based on experience with a family of production code generators
and linear assembly optimizers for the ST120 DSP core [104][99, 284, 249] and
the Lx/ST200 VLIW family [119][102, 103, 42, 41, 39]. Section 18.1 presents the
challenges of maintaining the SSA form on a program representation based on
machine instructions. Section 18.2 discusses two code generator optimizations
that seem at odds with the SSA form, yet must occur before register allocation.
One is if-conversion, whose modern formulations require an extension of the
SSA form. The other is prepass instruction scheduling, which does not seem to
benefit from the SSA form. Constructing and destructing the SSA form in a code
generator is required in such cases, so Section 18.3 characterizes various SSA form
destruction algorithms with regard to satisfying the constraints of machine
code.
18.1 SSA form engineering issues
We distinguish explicit operands, which are associated with a specific bit-field in the
instruction encoding, from implicit operands, without any encoding bits. Explicit
operands correspond to allocatable architectural registers, immediate values, or
instruction modifiers. Implicit operands correspond to dedicated architectural
registers, and to registers implicitly used by some instructions; for instance, the
status register, the procedure link register, or the stack pointer.
An operation is an instance of an instruction that composes a program. It
is seen by the compiler as an operator applied to a list of operands (explicit &
implicit), along with operand naming constraints, and has a set of clobbered
registers. The compiler view of operations also involves indirect operands, which
are not apparent in the instruction behavior, but are required to connect the flow
of values between operations. Indirect operands correspond to the registers used
for passing arguments and returning results at function call sites, and may also
be used for the registers encoded in register mask immediates.
specific machine instructions in the code generator internal representation, can
be provided as annotations that override the statically tabulated information.
Finally, code generation for some instruction set architectures requires that
pseudo-instructions with standard semantics be available, besides variants of
φ-functions and parallel copy operations.
• Machine instructions that operate on register pairs, such as the long multiplies
on the ARM, or more generally on register tuples, must be handled. In
such cases there is a need for pseudo-instructions to compose wide operands
in register tuples, and to extract independently register allocatable operands
from wide operands.
• Embedded processor architectures such as the Tensilica Xtensa [135] provide
zero-overhead loops (hardware loops), where an implicit conditional
branch back to the loop header is taken whenever the program counter
matches some address. The implied loop-back branch is also conveniently
materialized by a pseudo-instruction.
• Register allocation for predicated architectures requires that the live ranges
of temporary variables with predicated definitions be contained by pseudo-
instructions [133] that provide backward kill points for liveness analysis.
program point. One possibility is to inhibit the promotion of the stack pointer to
an SSA variable. Stack pointer definitions, including memory allocations through
alloca() and activation frame creation/destruction, are then encapsulated as
instances of a specific pseudo-instruction. Instructions that use the stack pointer
must be treated as special cases for the SSA form analyses and optimizations.
The SSA form requires that variable definitions be killing definitions. This is
not the case for target operands such as a status register that contains several
independent bit-fields. Moreover, some instruction effects on bit-fields may be
sticky, that is, with an implied disjunction (OR) with the previous value. Typical
sticky bits include the exception flags of IEEE 754 arithmetic, or the integer
overflow flag on DSPs with fixed-point arithmetic. When mapping a status register
to an SSA variable, any operation that partially reads or modifies the register bit-
fields should appear as reading and writing the corresponding variable.
Predicated execution and conditional execution are other sources of definitions
that do not kill their target register. The execution of predicated instructions
is guarded by the evaluation of a single-bit operand. The execution of conditional
instructions is guarded by the evaluation of a condition on a multi-bit operand.
We extend the ISA classification of [201] to distinguish four classes:
Partial predicated execution support Select instructions, first introduced
by the Multiflow TRACE architecture [84], are provided. These instructions
write to a destination register the value of one among two source operands,
depending on the condition tested on a third source operand. The Multiflow
TRACE 500 architecture was to include predicated store and floating-point
instructions [196].
Full predicated execution support Most instructions accept a Boolean predicate
operand, which nullifies the instruction effects if the predicate evaluates
to false. EPIC-style architectures also provide predicate define instructions
(PDIs) to efficiently evaluate predicates corresponding to nested conditions:
unconditional, conditional, parallel-or, parallel-and [133].
Partial conditional execution support Conditional move (cmov) instructions,
first introduced by the Alpha AXP architecture [35], are provided. cmov instructions
are available in the ia32 ISA since the Pentium Pro.
Full conditional execution support Most instructions are conditionally executed
depending on the evaluation of a condition of a source operand. On
the ARM architecture, the implicit source operand is a bit-field in the status
register and the condition is encoded on 4 bits. On the VelociTI TMS320C6x
architecture [266], the source operand is a general register encoded on 3 bits
and the condition is encoded on 1 bit.
18.1.5 Program representation invariants

Engineering a code generator requires decisions about what information is
transient, or belongs to the invariants of the program representation. An invariant
is a property which is ensured before and after each phase. Transient
information is recomputed as needed by some phases from the program representation
invariants. The applicability of the SSA form only spans the early phases
of the code generation process: from instruction selection, down to register allo-
cation. After register allocation, program variables are mapped to architectural
registers or to memory locations, so the SSA form analyses and optimizations no
longer apply. In addition, a program may be only partially converted to the SSA
form. This motivates the engineering of the SSA form as extensions to a baseline
code generator program representation.
Some extensions to the program representation required by the SSA form
are better engineered as invariants, in particular for operands, operations, basic
blocks, and control-flow graph. Operands which are SSA variables need to record
the unique operation that defines them as a target operand, and possibly to
maintain the list of where they appear as source operands. Operations such as
φ-functions, σ-functions of the SSI form [39] (see Chapter 13), and parallel
copies may appear as regular operations constrained to specific places in the
basic blocks. The incoming (resp. outgoing) arcs of basic blocks also need to be kept
in the same order as the operands of each of their φ-functions (resp. σ-functions).
A program representation invariant that impacts SSA form engineering is the
structure of loops. The modern way of identifying loops in a CFG is the construction
of a loop nesting forest as defined by Ramalingam [245]. Non-reducible
control flow allows for different loop nesting forests for a given CFG, yet high-level
information such as loop-carried memory dependences or user-level loop
annotations is provided to the code generator. This information is attached
to a loop structure, which thus becomes an invariant. The impact on the SSA
form is that some loop nesting forests, such as the Havlak [147] loop structure,
are more suitable than others, as they make it possible to attach to basic blocks
the results of key analyses such as SSA variable liveness [39] (see Chapter 9).
Live-in and live-out sets at basic block boundaries are also candidates for
being program representation invariants. However, when using and updating
liveness information under the SSA form, it appears convenient to distinguish the
φ-function contributions from the results of data-flow fix-point computation.
In particular, Sreedhar et al. [277] introduced the φ-function semantics that
became later known as multiplexing mode (see Chapter 21) where a φ-function
B0 : a 0 = φ(B1 : a 1 , . . . , Bn : a n ) makes a 0 live-in of basic block B0 , and a 1 , . . . a n
live-out of basic blocks B1 , . . . Bn . The classic basic block invariants LiveIn(B ) and
LiveOut(B ) are then complemented with PhiDefs(B ) and PhiUses(B ).
Finally, some compilers adopt the invariant that the SSA form be conventional
across the code generation phases. This approach is motivated by the fact
that classic optimizations such as SSAPRE [172] (see Chapter 11) require that
'the live ranges of different versions of the same original program variable do not
overlap', implying the SSA form to be conventional. Other compilers that use SSA
numbers and omit the φ-functions from the program representation [183] are
similarly constrained. Work by Sreedhar et al. [277] and by Boissinot et al. [41] clarified
how to convert the transformed SSA form to conventional SSA wherever required,
so there is no reason nowadays for this property to be an invariant.
18.2 Code generation phases and the SSA form
If-conversion refers to optimizations that convert a program region to straight-
line code. It is primarily motivated by instruction scheduling on instruction-level
parallel cores [201], as removing conditional branches makes it possible to:
• eliminate branch resolution stalls in the instruction pipeline,
• reduce uses of the branch unit, which is often single-issue,
• increase the size of the instruction scheduling regions.
In case of inner loop bodies, if-conversion further enables vectorization [6] and
software pipelining [228] (modulo scheduling). Consequently, control-flow re-
gions selected for if-conversion are acyclic, even though seminal techniques [6,
228] consider more general control-flow.
The scope and effectiveness of if-conversion depends on the ISA support. In
principle, any if-conversion technique targeted to full predicated or conditional
execution support may be adapted to partial predicated or conditional execution
support. For instance, non-predicated instructions with side-effects such as
memory accesses can be used in combination with select to provide a harmless
effective address in case the operation must be nullified [201].
Besides predicated or conditional execution, architectural support for if-
conversion is improved by supporting speculative execution. Speculative ex-
ecution (control speculation) refers to executing an operation before knowing
that its execution is required, such as when moving code above a branch [196]
or promoting operation predicates [201]. Speculative execution assumes that
instructions have reversible side effects, so speculating potentially excepting
instructions requires architectural support. On the Multiflow TRACE 300 architecture
and later on the Lx VLIW architecture [119], non-trapping memory loads
known as dismissible are provided. The IMPACT EPIC architecture speculative
execution [16] is generalized from the sentinel model [200].
The classic contributions to if-conversion did not consider the SSA form:
Allen et al. [6] Conversion of control dependences to data dependences, mo-
tivated by inner loop vectorization. They distinguish forward branches, exit
branches, and backward branches, and compute Boolean guards accordingly.
As this work pre-dates the Program Dependence Graph [124] (see Chapter 14),
complexity of the resulting Boolean expressions is an issue. When comparing
to later if-conversion techniques, only the conversion of forward branches is
relevant.
Park & Schlansker [228] Formulation of the 'RK algorithm' based on control
dependences. They assume a fully predicated architecture with only Conditional
PDIs. The R function assigns a minimal set of Boolean predicates to
basic blocks, and the K function expresses the way these predicates are computed.
The algorithm is general enough to process cyclic and irreducible
rooted flow graphs, but in practice it is applied to single entry acyclic regions.
Blickstein et al. [35] Pioneering use of cmov instructions to replace condi-
tional branches in the GEM compilers for the Alpha AXP architecture.
Lowney et al. [196] Matching of the innermost if-then constructs in the Mul-
tiflow Trace Scheduling compiler, in order to generate the select and the
predicated memory store operations.
Fang [118] The proposed algorithm assumes a fully predicated architecture
with conditional PDIs. It is tailored to acyclic regions with single entry and
multiple exits, and as such is able to compute R and K functions without relying
on explicit control dependences. The main improvement of this algorithm
over [228] is that it also speculates instructions up the dominance tree through
predicate promotion 1 , except for stores and PDIs. This work further proposes
a pre-optimization pass to hoist or sink common sub-expressions before
predication and speculation.
Leupers [192] The technique focuses on if-conversion of nested if-then-else
statements on architectures with full conditional execution support. A dy-
namic programming technique appropriately selects either a conditional
jump or a conditional instruction based implementation scheme for each
if-then-else statement, and the objective is the reduction of worst-case
execution time (WCET).
A few contributions to if-conversion did use the SSA form but only internally:
Jacome et al. [154] Proposition of the Static Single Assignment - Predicated
Switching (SSA-PS) transformation. It assumes a clustered VLIW architecture
fitted with predicated move instructions that operate inside clusters (internal
moves) or between clusters (external moves). The first idea of the SSA-PS
transformation is to realize the conditional assignments corresponding to φ-
functions via predicated switching operations, in particular predicated move
operations. The second idea is that the predicated external moves leverage
the penalties associated with inter-cluster data transfers. The SSA-PS transformation
predicates non-move operations and is apparently restricted to
innermost if-then-else statements.
Chuang et al. [77] A predicated execution support aimed at removing non-kill
register writes from the micro-architecture. They propose select instructions
1
The predicate used to guard an operation is promoted to a weaker condition
called phi-ops, predicated memory accesses, unconditional PDIs, and orp
instructions for or-ing multiple predicates. The RK algorithm is simplified
for the case of single-entry single-exit regions, and adapted to the proposed
architectural support. The other contribution is the generation of phi-ops,
whose insertion points are computed like the SSA form placement of the φ-
functions. The φ-function source operands are replaced by phi-lists, where
each operand is associated with the predicate of its source basic block. The
phi-lists are processed in topological order of the predicates to generate the
phi-ops.
benefit that already predicated code may be part of the input. In practice, these
contributions follow the generic steps of if-conversion proposed by Fang [118]:
• if-conversion region selection;
• code hoisting and sinking of common sub-expressions;
• assignment of predicates to the basic blocks;
• insertion of operations to compute the basic block predicates;
• predication or speculation of operations;
• and conditional branch removal.
The result of an if-converted region is a hyper-block, that is, a sequence of basic
blocks with predicated or conditional operations, where control may only enter
from the top, but may exit from one or more locations [202].
Although if-conversion based on the ψ-SSA form appears effective for the
different classes of architectural support, the downstream phases of the code
generator require some adaptations of the plain SSA form algorithms to handle
the ψ-functions. The largest impact of handling ψ-functions is apparent in the
ψ-SSA form destruction [105], whose original description [284] was incomplete.
In order to avoid such complexities, a code generator may adopt a simpler
solution than the ψ-functions to represent the non-kill effects of conditional
operations on target operands. The key observation is that under the SSA form, a
CMOV operation is equivalent to a select operation with a same resource naming
constraint between one source and the target operand. Unlike other predicated or
conditional instructions, a select instruction kills its target register. Generalizing
this observation provides a simple way to handle predicated or conditional
operations in plain SSA form:
• For each target operand of the predicated or conditional instruction, add a
corresponding source operand in the instruction signature.
• For each added source operand, add a same resource naming constraint
with the corresponding target operand.
This simple transformation enables the SSA form analyses and optimizations to
remain oblivious to predicated or conditional execution. The drawback of this
solution is that non-kill definitions of a given variable (before SSA variable renam-
ing) remain in dominance order across program transformations, as opposed to
ψ-SSA where predicate value analysis may enable this order to be relaxed.
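The observation underlying this transformation can be seen on ordinary source code (illustrative C++, not the book's IR): a conditional move only partially defines its target, whereas a select receives the previous value as an explicit source and therefore fully defines it:

// Conditional move: the definition of x is partial, so it does not kill x.
int cmovStyle(int x, bool p, int a) {
    if (p) x = a;              // x keeps its previous value when p is false
    return x;
}

// Select: x2 is a killing definition; the previous value x1 is an explicit source.
// Constraining x2 and x1 to the same register reproduces the cmov behaviour.
int selectStyle(int x1, bool p, int a) {
    int x2 = p ? a : x1;
    return x2;
}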
Further down the code generator, the last major phase before register allocation is
prepass instruction scheduling. Innermost loops with a single basic block, super-
block, or hyper-block body are candidates for software pipelining techniques such
as modulo scheduling [250]. For innermost loops that are not software pipelined,
and for other program regions, acyclic instruction scheduling techniques apply:
basic block scheduling [136]; super-block scheduling [151]; hyper-block scheduling
[202]; tree region scheduling [145]; or trace scheduling [196].
By definition, prepass instruction scheduling operates before register allocation.
At this stage, instruction operands are mostly virtual registers, except
for instructions with ISA or ABI constraints that bind them to specific architectural
registers. Moreover, preparation to prepass instruction scheduling includes
virtual register renaming, also known as register web construction, in order to
reduce the number of anti dependences and output dependences in the instruction
scheduling problem. Other reasons why it seems there is little to gain from
scheduling instructions on an SSA form of the program representation include:
• Except in case of trace scheduling, the classic scheduling regions are single-
entry and do not have control-flow merges. So there are no φ-functions in
case of acyclic scheduling, and only φ-functions in the loop header in case of
software pipelining. Keeping those φ-functions in the scheduling problem
has no direct benefits while adding significant complexity to the computation
of loop-carried dependences on virtual registers.
• Instruction scheduling must account for all the instruction issue slots re-
quired to execute a code region. If the only ordering constraints between
instructions, besides control dependences and memory dependences, are
limited to true data dependences on operands, code motion will create in-
terferences that must later be resolved by inserting copy operations in the
scheduled code region. (Except for interferences created by the overlapping
of live ranges that results from modulo scheduling, as these are resolved by
modulo renaming [181].) To prevent such code motion, scheduling instruc-
tions with SSA variables as operands must be constrained by adding extra
dependences to the scheduling problem.
• Some machine instructions have partial effects on special resources such
as the status register. Representing special resources as SSA variables even
though they are accessed at the bit-field level requires coarsening the instruc-
tion effects to the whole resource, as discussed in Section 18.1.4. Moreover,
def-use ordering implied by SSA form is not adapted to resources composed
of sticky bits, whose definitions can be reordered with regards to the next
use. Scheduling OR-type predicate define operations [265] raises the same
issues. An instruction scheduler is also expected to precisely track accesses
to unrelated or partially overlapping bit-fields in a status register.
• Aggressive instruction scheduling relaxes some flow data dependences that
are normally implied by def-use ordering. A first example is move renam-
ing [322], the dynamic switching of the definition of a source operand defined
by a copy operation when the consumer operations ends up being scheduled
at the same cycle or earlier. Another example is inductive relaxation [100],
where the dependence between additive induction variables and their use
as base in base+offset addressing modes is relaxed to the extent permitted
by the induction step and the range of the offset.
To summarize, trying to keep the SSA form inside the prepass instruction
scheduling appears more complex than operating on the program representa-
tion with classic compiler temporary variables. This representation is obtained
after SSA form destruction and aggressive coalescing. If required by the register
allocation, the SSA form should be re-constructed.
18.3 SSA form destruction algorithms
The destruction of the SSA form in a code generator is required before the prepass
instruction scheduling and software pipelining, as discussed earlier, and also
before non-SSA register allocation. A weaker form of SSA destruction is the con-
version of transformed SSA form to conventional SSA form, which is required by
classic SSA form optimizations such as SSA-PRE [172] and SSA form register allo-
cators [231]. For all such cases, the main objective is to ensure that the operand
naming constraints are satisfied.
The contributions to SSA form destruction techniques can be characterized
as an evolution towards correctness, the ability to manage operand naming
constraints, and the reduction of algorithmic time and memory requirements:
Cytron et al. [94] First technique for translating out of SSA, by ’naive replace-
ment preceded by dead code elimination and followed by coloring’. They
replace each φ-function B0 : a 0 = φ(B1 : a 1 , . . . , Bn : a n ) by n copies a 0 = a i ,
one per basic block Bi , before applying Chaitin-style coalescing.
Briggs et al. [54] The correctness issues of Cytron et al. [94] out of (trans-
formed) SSA form translation are identified and illustrated by the lost-copy
problem and the swap problem. These problems appear in relation with the
critical edges, and when the parallel assignment semantics of a sequence of
φ-functions at the start of a basic block is not accounted for [41]. Two
SSA form destruction algorithms are proposed, depending on the presence of
critical edges in the control-flow graph. However the need for parallel copy
operations to represent code after φ-function removal is not recognized.
Sreedhar et al. [277] This work is based on the definition of φ-congruence
classes as the sets of SSA variables that are transitively connected by a φ-
function. When none of the φ-congruence classes have members that inter-
fere, the SSA form is called conventional and its destruction is trivial: replace
all the SSA variables of a φ-congruence class by a temporary variable, and
remove the φ-functions. In general, the SSA form is transformed after program
optimizations, that is, some φ-congruence classes contain interferences. In
Method I, the SSA form is made conventional by inserting copy operations that
target the arguments of each φ-function in its predecessor basic blocks, and
also by inserting copy operations that source the target of each φ-function in
its basic block. The latter is the key for not depending on critical edge splitting
[41]. The code is then improved by running a new SSA variable coalescer that
grows the φ-congruence classes with copy-related variables, while keeping the
SSA form conventional. In Method II and Method III, the φ-congruence classes
are initialized as singletons, then merged while processing the φ-functions in
some order. In Method II, two variables of the current φ-function that inter-
fere directly or through their φ-congruence classes are isolated by inserting
copy operations for both. This ensures that the φ-congruence class which is
grown from the classes of the variables related by the current φ-function is
interference-free. In Method III, if possible only one copy operation is inserted
to remove the interference, and more involved choices about which variables
to isolate from the φ-function congruence class are resolved by a maximum
independent set heuristic. Both methods are correct except for a detail about
the live-out sets to consider when testing for interferences [41].
Leung & George [190] This work is the first to address the problem of satisfying
the same resource and the dedicated register operand naming constraints of the
SSA form on machine code. They identify that Chaitin-style coalescing after
SSA form destruction is not sufficient, and that adapting the SSA optimizations
to enforce operand naming constraints is not practical. They operate in three
steps: collect the renaming constraints; mark the renaming conflicts; and
reconstruct code, which adapts the SSA destruction of Briggs et al. [54]. This
work is also the first to make explicit use of parallel copy operations. A few
correctness issues were later identified and corrected by Rastello et al. [249].
Budimlić et al. [58] Contribution of a lightweight SSA form destruction moti-
vated by JIT compilation. It uses the (strict) SSA form property of dominance of
variable definitions over uses to avoid the maintenance of an explicit interfer-
ence graph. Unlike previous approaches to SSA form destruction that coalesce
increasingly larger sets of non-interfering φ-related (and copy-related) vari-
ables, they first construct SSA-webs with early pruning of obviously interfering
variables, then de-coalesce the SSA webs into non-interfering classes. They
propose the dominance forest explicit data-structure to speed-up these in-
terference tests. This SSA form destruction technique does not handle the
operand naming constraints, and also requires critical edge splitting.
Rastello et al. [249] The problem of satisfying the same resource and dedicated
register operand naming constraints of the SSA form on machine code is
revisited, motivated by erroneous code produced by the technique of Leung
& George [190]. Inspired by work of Sreedhar et al. [277], they include the φ-
related variables as candidates in the coalescing that optimizes the operand
naming constraints. This work avoids the patent of Sreedhar et al. (US patent
6182284).
Boissinot et al. [41] Formulation of a generic approach to SSA form destruc-
tion that is proved correct, handles operand naming constraints, and can be
optimized for speed. (See Chapter 21 for details of this generic approach.)
The foundation of this approach is to transform the program to conventional
SSA form by isolating the φ-functions like in Method I of Sreedhar et al. [277].
However, the copy operations inserted are parallel, so a parallel copy sequen-
tialization algorithm is provided. The task of improving the conventional SSA
CHAPTER 19
Instruction Code Selection — (D. Ebner, A. Krall, B. Scholz)
Fig. 19.1 Scenario: An instruction code selector translates a compiler's IR to a low-level machine-dependent representation. [The figure shows the code generator pipeline: the IR program passes through an optional instruction lowering stage, then instruction code selection driven by a machine description, then the machine-dependent backend producing target code.]
rule                                       cost
R1    s   ← reg                            0
R2    reg ← imm                            1
R3    imm ← CST                            0
R4    reg ← VAR                            0
R5    reg ← SHL(reg, reg)                  1
R6    reg ← SHL(reg, imm)                  1
R7    reg ← ADD(reg, reg)                  1
R8    reg ← LD(reg)                        1
R9    reg ← LD(ADD(reg, reg))              1
R10   reg ← LD(ADD(reg, SHL(reg, imm)))    1
[Fig. 19.2: the tree grammar above, together with an example data-flow tree built from VAR, CST, SHL, ADD, and LD nodes.]
sponding labels of nodes in the data-flow trees. The terminals of the grammar
are VAR, CST, SHL, ADD, and LD. Rules that translate from one non-terminal to
another are called chain rules, e.g., reg ← imm, which translates an immediate
value to a register. Note that there are multiple possibilities to obtain a cover of
the data-flow tree for the example shown in Figure 19.2. Each rule has associated
costs. The cost of a tree cover is the sum of the costs of the selected rules. For
example, the DFT could be covered by rules R3, R4, and R10, which would give a total cost for the cover of one cost unit. Alternatively, the DFT could be covered by rules R2, R3, R5, R7, and R8, which yields four cost units for the cover, i.e., four assembly instructions are issued. A dynamic programming algorithm selects a cost
optimal cover for the DFT.
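To make the dynamic programming step concrete, here is a small sketch in Python (not from the chapter; the grammar mirrors the rules of Figure 19.2, while the pattern encoding, the helper names, and the shape of the example DFT are assumptions made for illustration). It computes, for each node of a DFT, the cheapest cost of deriving every non-terminal, chain rules included.

INF = float("inf")

# A DFT node is (operator, tuple of child nodes).
def node(op, *kids):
    return (op, tuple(kids))

# Illustrative tree grammar in the spirit of Figure 19.2.
# A pattern is a non-terminal (string) or (operator, tuple of sub-patterns).
RULES = [
    ("R2",  "reg", "imm",                                                  1),
    ("R3",  "imm", ("CST", ()),                                            0),
    ("R4",  "reg", ("VAR", ()),                                            0),
    ("R5",  "reg", ("SHL", ("reg", "reg")),                                1),
    ("R6",  "reg", ("SHL", ("reg", "imm")),                                1),
    ("R7",  "reg", ("ADD", ("reg", "reg")),                                1),
    ("R8",  "reg", ("LD",  ("reg",)),                                      1),
    ("R10", "reg", ("LD",  (("ADD", ("reg", ("SHL", ("reg", "imm")))),)),  1),
]

def best(n, nt, memo=None):
    """Cheapest cost of rewriting the subtree rooted at n into non-terminal nt."""
    memo = {} if memo is None else memo
    if (n, nt) in memo:
        return memo[(n, nt)]
    memo[(n, nt)] = INF                      # guard against chain-rule cycles
    cost = min((c + match(pat, n, memo) for _, lhs, pat, c in RULES if lhs == nt),
               default=INF)
    memo[(n, nt)] = cost
    return cost

def match(pat, n, memo):
    """Cost of matching pattern pat at node n (the rule's own cost is added by best)."""
    if isinstance(pat, str):                 # non-terminal: solve recursively
        return best(n, pat, memo)
    op, subpats = pat
    if n[0] != op or len(n[1]) != len(subpats):
        return INF
    return sum(match(sp, kid, memo) for sp, kid in zip(subpats, n[1]))

# Assumed DFT: a load from address VAR + (VAR << CST).
dft = node("LD", node("ADD", node("VAR"), node("SHL", node("VAR"), node("CST"))))
print(best(dft, "reg"))                      # 1, matching the R3/R4/R10 cover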
Tree pattern matching on a DFT is limited to the scope of tree structures. To overcome this limitation, we can extend the scope of the matching algorithm
to the computational flow of a whole procedure. The use of the SSA form as an intermediate representation improves code generation by making def-use relationships explicit. Hence, SSA exposes the data flow of a translation unit and thereby aids the code generation process. Instead of using a textual SSA representation, we employ a graph representation of SSA called the SSA graph,1 which is an extension of DFTs and represents the data flow for the scalar variables of a procedure in SSA form. SSA graphs are a suitable representation for code generation: First, SSA graphs capture acyclic and cyclic information flow beyond basic block boundaries. Second, SSA graphs often arise naturally in modern compilers, as the intermediate code representation usually is already in SSA form. Third, output- and anti-dependencies do not exist in SSA graphs.
As even acyclic SSA graphs are in general not restricted to trees, no dynamic programming approach can be employed for instruction code selection. To get a handle on instruction code selection for SSA graphs, we will discuss in the following an approach based on a reduction to a partitioned boolean quadratic programming (PBQP) problem. Consider the code fragment of a dot-product routine and the corresponding SSA graph shown in Figure 19.3. The code implements a simple vector dot-product using fixed-point arithmetic. Each node in the SSA graph represents a single operation, while edges describe the flow of data that is produced at the source node and consumed at the target node. Incoming edges are ordered to reflect the argument order of the operation. In the figure, the colour of a node indicates the basic block to which the operation belongs.
The example in Figure 19.3 has fixed-point computations that need to be
modeled in the grammar. For fixed-point values most arithmetic and bit-wise
operations are identical to their integer equivalents. However, some operations have different semantics, e.g., multiplying two fixed-point values in format m.i results in a value with 2i fractional digits. The result of the multiplication has to be adjusted by a shift to the right (LSR). To accommodate fixed-point values, we add the following rules to the grammar introduced in Figure 19.2:
rule cost instruction
reg ← VAR is_fixed_point ? ∞ otherwise 0
fp ← VAR is_fixed_point ? 0 otherwise ∞
fp2 ← MUL(fp, fp) 1 MUL Rd, Rm, Rs
fp ← fp2 1 LSR Rd, Rm, i
fp ← ADD(fp, fp) 1 ADD Rd, Rm, Rs
fp2 ← ADD(fp2, fp2) 1 ADD Rd, Rm, Rs
fp ← PHI(fp, ...) 0
fp2 ← PHI(fp2, ...) 0
In the example the accumulation for double-precision fixed point values
(fp2) can be performed at the same cost as for the single-precision format (fp).
1 We consider its data-based representation here. See Chapter 14.
Thus, it would be beneficial to move the necessary shift from the inner loop
to the return block, performing the intermediate calculations in the extended
format. However, as a tree-pattern matcher generates code at the statement level, the information that values are held in the double-precision format cannot be hoisted across basic block boundaries. An instruction code selector that operates on the SSA graph is able to propagate the non-terminal fp2 across the φ node prior to the return and emit the code for the shift to the right in the return block.
In the following, we will explain how to perform instruction code selection on SSA graphs by means of a specialized quadratic assignment problem (PBQP). First, we discuss the instruction code selection problem by employing a discrete optimization problem called the partitioned boolean quadratic problem.
An extension of patterns to arbitrary acyclic graph structures, which we refer to
as DAG grammars, is discussed in Sub-Section 19.2.1.
fp_ dot_product(fp_ *p, fp_ *q) {
  fp_ s = 0, *e = p + N;
  while (p < e) {
    s = s + (*p) * (*q);
    p = p + 1;
    q = q + 1;
  }
  return s;
}
Fig. 19.3 Instruction code selection SSA graph for a vector dot-product in fixed-point arithmetic. fp_ stands for an unsigned short fixed-point type. [The corresponding SSA graph from the figure, with CST, LD, MUL, ADD, φ, comparison, and return nodes, is omitted here.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.1 Instruction code selection for tree patterns on SSA graphs
The matching problem for SSA graphs reduces to a discrete optimization problem
called Partitioned Boolean Quadratic Problem (PBQP). First, we will introduce
the PBQP problem and then we will describe the mapping of the instruction
code selection problem to PBQP.
Partitioned Boolean Quadratic Programming (PBQP) is a generalized quadratic assignment problem that has proven to be effective for a wide range of applications in embedded code generation, e.g., register assignment, address mode selection, or bank selection for architectures with partitioned memory. Instead of
problem-specific algorithms, these problems can be modeled in terms of generic
PBQPs that are solved using a common solver library. PBQP is flexible enough to
model irregularities of embedded architectures that are hard to cope with using
traditional heuristic approaches.
Consider a set of discrete variables X = {x1 , . . . , xn } and their finite domains
{D1 , . . . , Dn }. A solution of PBQP is a mapping h of each variable to an element in
its domain, i.e., an element of Di needs to be chosen for variable xi . The chosen
element imposes local costs and related costs with neighboring variables. Hence,
the quality of a solution is based on the contribution of two sets of terms.
1. For assigning variable xi to the element d i in Di . The quality of the assign-
ment is measured by a local cost function c (xi , d i ).
2. For assigning two related variables xi and x j to the elements d i ∈ Di and
d j ∈ D j . We measure the quality of the assignment with a related cost function
C (xi , x j , d i , d j ).
The total cost of a solution h is given as
\[ f = \sum_{1 \le i \le n} c\bigl(x_i, h(x_i)\bigr) \;+\; \sum_{1 \le i < j \le n} C\bigl(x_i, x_j, h(x_i), h(x_j)\bigr). \tag{19.1} \]
The PBQP problem seeks an assignment of the variables x_i with minimum total cost.
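As a small illustration of Equation (19.1), the sketch below (assumed data layout, not part of the chapter) evaluates the total cost f of a candidate solution h, storing only the non-zero related cost matrices:

INF = float("inf")

def pbqp_cost(c, C, h):
    """Total cost of Equation (19.1).
    c[i][d]   : local cost of assigning element index d to variable i
    C[(i, j)] : related cost matrix for the pair i < j (absent pairs cost 0)
    h[i]      : chosen element index for variable i"""
    total = sum(c[i][h[i]] for i in range(len(c)))
    for (i, j), M in C.items():
        total += M[h[i]][h[j]]
    return total

# Toy instance with two variables and two choices each.
c = [[0, 2], [1, 0]]
C = {(0, 1): [[0, INF], [3, 0]]}
print(pbqp_cost(c, C, [0, 0]))   # 0 + 1 + 0 = 1
print(pbqp_cost(c, C, [0, 1]))   # 0 + 0 + inf: this combination is forbidden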
In the following we represent both the local cost function and the related cost function in matrix form, i.e., the related cost function C(x_i, x_j, d_i, d_j) is decomposed for each pair (x_i, x_j). The costs for the pair are represented as a |D_i|-by-|D_j| matrix/table C_{ij}. A matrix element corresponds to an assignment (d_i, d_j). Similarly, the local cost function c(x_i, d_i) is represented by a cost vector \vec{c}_i enumerating the costs of the elements. A PBQP problem has an underlying graph structure G = (V, E, C, c), which we refer to as a PBQP graph. For each decision variable x_i we have a corresponding node v_i ∈ V in the graph, and for each cost matrix C_{i,j} that is not the zero matrix, we introduce an edge e = (v_i, v_j). The cost functions c and C map nodes and edges to the original cost vectors and matrices, respectively. We will present an example later in this chapter in the context of instruction code selection.
In general, finding a solution to this minimization problem is NP hard. How-
ever, for many practical cases, the PBQP instances are sparse, i.e.,many of the
cost matrices Ci , j are zero matrices and do not contribute to the overall solution.
Thus, optimal or near-optimal solutions can often be found within reasonable
time limits. Currently, there are two algorithmic approaches for PBQP that have
been proven to be efficient in practice for instruction code selection problems,
i.e.,a polynomial-time heuristic algorithm and a branch-&-bound based algo-
rithm with exponential worst case complexity. For a certain subclass of PBQP, the
algorithm produces provably optimal solutions in time O (nm 3 ), where n is the
number of discrete variables and m is the maximal number of elements in their
domains, i.e., m = max(|D_1|, . . . , |D_n|). For general PBQPs, however, the solution may not be optimal. To still obtain an optimal solution outside this subclass, branch-&-bound techniques can be applied.
In the following, we describe the modeling of instruction code selection for SSA
graphs as a PBQP problem. In the basic modeling, SSA and PBQP graphs coincide.
The variables xi of the PBQP are decision variables reflecting the choices of
applicable rules (represented by Di ) for the corresponding node of xi . The local
costs reflect the costs of the rules and the related costs reflect the costs of chain
rules making rules compatible with each other. This means that the number of
decision vectors and the number of cost matrices in the PBQP are determined
by the number of nodes and edges in the SSA graph respectively. The sizes of Di
depend on the number of rules in the grammar. A solution for the PBQP instance
induces a complete cost minimal cover of the SSA graph.
As in traditional tree pattern matching, an ambiguous graph grammar con-
sisting of tree patterns with associated costs and semantic actions is used. In-
put grammars have to be normalized. This means that each rule is either a
so-called base rule or a chain rule. A base rule is a production of the form nt_0 ← OP(nt_1, . . . , nt_k), where the nt_i are non-terminals and OP is a terminal symbol, i.e., an operation represented by a node in the SSA graph. A chain rule is a production of the form nt_0 ← nt_1, where nt_0 and nt_1 are non-terminals. A production rule nt ← OP1(α, OP2(β), γ) can be normalized by rewriting it into two production rules nt ← OP1(α, nt′, γ) and nt′ ← OP2(β), where nt′ is a new non-terminal symbol and α, β, and γ denote arbitrary pattern fragments.
This transformation can be iteratively applied until all production rules are either
chain rules or base rules. To illustrate this transformation, consider the grammar
in Figure 19.4, which is a normalized version of the tree grammar introduced in
Figure 19.2. Temporary non-terminal symbols t1, t2, and t3 are used to decom-
pose larger tree patterns into simple base rules. Each base rule spans across a
single node in the SSA graph.
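This iterative rewriting is easy to mechanize. The following sketch (illustrative Python; the pattern encoding, the t-prefix for fresh non-terminals, and the rule tuples are assumptions) normalizes one nested tree pattern into base rules, keeping the full cost at the root as in Figure 19.4:

import itertools

_fresh = itertools.count(1)

def normalize(rule):
    """Split one production (lhs, pattern, cost) into single-operator base rules.
    A pattern is a non-terminal (string) or (operator, tuple of sub-patterns).
    Nested operator sub-patterns get fresh non-terminals t1, t2, ... with cost 0,
    so that the full cost of the complex pattern stays at its root rule."""
    lhs, pattern, cost = rule
    out = []

    def flatten(pat):
        if isinstance(pat, str):
            return pat                           # already a non-terminal
        op, subs = pat
        kids = tuple(flatten(s) for s in subs)   # flatten children first
        if pat is pattern:                       # root keeps lhs and the cost
            out.append((lhs, (op, kids), cost))
            return lhs
        nt = "t%d" % next(_fresh)                # fresh helper non-terminal
        out.append((nt, (op, kids), 0))
        return nt

    flatten(pattern)
    return out

# Example: R10  reg <- LD(ADD(reg, SHL(reg, imm))) with cost 1
r10 = ("reg", ("LD", (("ADD", ("reg", ("SHL", ("reg", "imm")))),)), 1)
for r in normalize(r10):
    print(r)
# ('t1', ('SHL', ('reg', 'imm')), 0)
# ('t2', ('ADD', ('reg', 't1')), 0)
# ('reg', ('LD', ('t2',)), 1)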
rule                              cost
R1    imm ← CST                    0
R2    reg ← VAR                    0
R3    reg ← imm                    1
R4    reg ← SHL(reg, reg)          1
R5    reg ← SHL(reg, imm)          1
R6    t1  ← SHL(reg, imm)          0
R7    reg ← ADD(reg, reg)          1
R8    t2  ← ADD(reg, t1)           0
R9    t3  ← ADD(reg, reg)          0
R10   reg ← LDW(reg)               1
R11   reg ← LDW(t2)                1
R12   reg ← LDW(t3)                1
Fig. 19.4 PBQP instance derived from the example shown in Figure 19.2. The grammar has been normalized by introducing additional non-terminals. [The figure also shows, for each SSA node (VAR:a, VAR:i, CST:2, SHL, ADD, LDW), its local cost vector and the cost matrices on the edges; highlighted entries mark a cost-minimal solution.]
The instruction code selection problem for SSA graphs is modeled in PBQP as
follows. For each node u in the SSA graph, a PBQP variable x u is introduced. The
domain of variable x u is determined by the subset of base rules whose terminal
symbol matches the operation of the SSA node, e.g., there are three rules (R4, R5, R6) that can be used to cover the shift operation SHL in our example. The last rule is the result of automatic normalization of a more complex tree pattern. The cost vector \vec{c}_u = w_u · 〈c(R1), . . . , c(Rk_u)〉 of variable x_u encodes the local costs for a particular assignment, where c(Ri) denotes the associated cost of base rule Ri. The weight w_u is used as a parameter to optimize for various objectives, including speed (e.g., w_u is the expected execution frequency of the operation at node u) and space (e.g., w_u is set to one). In our example, both R4 and R5 have associated costs of one. Rule R6 contributes no local costs, as we account for the full costs of a complex tree pattern at its root node. All nodes have the same weight of one, thus the cost vector for the SHL node is 〈1, 1, 0〉.
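In code, assembling such a cost vector might look as follows (a sketch with assumed data structures; the rule encoding matches the normalization sketch above):

def cost_vector(ssa_op, normalized_rules, weight=1.0):
    """Domain and local PBQP cost vector for one SSA node.
    The domain is the subset of base rules whose operator matches the node;
    the vector entry for a rule is the node weight times the rule cost."""
    domain = [r for r in normalized_rules if r[1][0] == ssa_op]
    return domain, [weight * cost for (_lhs, _pat, cost) in domain]

NORMALIZED = [
    ("reg", ("SHL", ("reg", "reg")), 1),   # R4
    ("reg", ("SHL", ("reg", "imm")), 1),   # R5
    ("t1",  ("SHL", ("reg", "imm")), 0),   # R6
    ("reg", ("ADD", ("reg", "reg")), 1),   # R7
]
_, vec = cost_vector("SHL", NORMALIZED)
print(vec)   # [1.0, 1.0, 0.0] -- the vector <1, 1, 0> of the SHL node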
An edge in the SSA graph represents a data transfer between the result of an operation u, which is the source of the edge, and an operand of the operation v, which is the target of the edge. To ensure consistency among base rules and to account for the costs of chain rules, we impose costs dependent on the selection of variable x_u and variable x_v in the form of a cost matrix C_{uv}. An element in the matrix corresponds to the costs of selecting a specific base rule r_u ∈ R_u of the result and a specific base rule r_v ∈ R_v of the operand node. Assume that r_u is nt ← OP(. . .) and r_v is · · · ← OP(α, nt′, β), where nt′ is the non-terminal of the operand of v whose value is obtained from the result of node u. There are three possible cases:
1. If the non-terminals nt and nt′ are identical, the corresponding element in matrix C_{uv} is zero, since the result of u is compatible with the operand of node v.
2. If the non-terminals nt and nt′ differ and there exists a rule r: nt′ ← nt in the transitive closure of all chain rules, the corresponding element in C_{uv} has the costs of the chain rule, i.e., w_v · c(r).
3. Otherwise, the corresponding element in C_{uv} has infinite costs, prohibiting the selection of incompatible base rules.
As an example, consider the edge from CST:2 to node SHL in Figure 19.4. There is a single base rule R1, with local costs 0 and result non-terminal imm, for the constant. Base rules R4, R5, and R6 are applicable for the shift, of which the first one expects non-terminal reg as its second argument, while rules R5 and R6 both expect imm. Consequently, the corresponding cost matrix accounts for the cost of converting from imm to reg at index (1, 1) and is zero otherwise.
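The three cases above translate into a simple matrix construction. The sketch below is illustrative only: chain_cost is assumed to give the cheapest conversion cost between two distinct non-terminals in the transitive closure of the chain rules (absent entries mean no conversion exists).

INF = float("inf")

def edge_matrix(result_nts_u, operand_nts_v, chain_cost, weight_v=1.0):
    """PBQP cost matrix C_uv for an SSA edge u -> v.
    result_nts_u  : result non-terminal of each base rule selectable at u
    operand_nts_v : non-terminal expected (for the operand fed by u) by each
                    base rule selectable at v"""
    matrix = []
    for nt_res in result_nts_u:
        row = []
        for nt_want in operand_nts_v:
            if nt_res == nt_want:
                row.append(0)                                                   # case 1
            else:
                row.append(weight_v * chain_cost.get((nt_res, nt_want), INF))   # cases 2, 3
        matrix.append(row)
    return matrix

# Edge from CST:2 (only R1, producing imm) to SHL (R4, R5, R6 expecting reg, imm, imm):
print(edge_matrix(["imm"], ["reg", "imm", "imm"], {("imm", "reg"): 1}))
# [[1, 0, 0]] -- the chain rule reg <- imm is paid only when SHL picks R4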
Highlighted elements in Figure 19.4 show a cost-minimal solution of the PBQP with cost one. A solution of the PBQP directly induces a selection of base and chain rules for the SSA graph. The execution of the semantic action rules inside a basic block follows the order of basic blocks. Special care is necessary for chain
rules that link data flow across basic blocks. Such chain rules may be placed
inefficiently and a placement algorithm is required for some grammars.
*p = r + 1;
*q = p + 1;
*r = q + 1;
Fig. 19.5 DAG patterns may introduce cyclic data dependencies. [The figure shows the SSA graph of this code with VAR:p, VAR:q, VAR:r, INC, and ST nodes; each (INC, ST) pair is a candidate instance of a post-increment store pattern.]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.2 Extensions and generalizations
We will now outline, through the example in Figure 19.5, a possible problem
formulation for these generalized patterns in the PBQP framework discussed so
far. The code fragment contains three feasible instances of a post-increment store
pattern. Assuming that p , q , and r point to mutually distinct memory locations,
there are no further dependencies apart from the edges shown in the SSA graph.
If we select all three instances of the post-increment store pattern concurrently, the graph induced by the SSA edges becomes cyclic, and the code cannot be emitted. To overcome this difficulty, the idea is to express, in the modeling of the problem, a numbering of the chosen nodes that reflects the existence of a topological order.
Modeling
Fig. 19.6 PBQP graph for the example shown in Figure 19.5. M is a large integer value. We use k as a shorthand for the term 3 − 2M. [The figure shows one PBQP node per SSA operation (numbered 1 to 6), plus one node per complex-pattern instance (e.g., node 7: instance 〈2, 1〉) with states off, on1, on2, on3; the edges carry the ordering and consistency cost matrices.]
chosen nodes. To this end, we refine the state on such that it reflects a particular
index in a concrete topological order. Matrices among these nodes account for
data dependencies, e.g.,consider the matrix established among nodes 7 and 8.
Assuming instance 7 is on at index 2 (i.e., mapped to on2 ), the only remaining
choices for instance 8 are not to use the pattern (i.e., mapped to off) or to enable
it at index 3 (i.e., mapped to on3 ), as node 7 has to precede node 8.
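The ordering part of such a matrix can be generated mechanically. The sketch below is illustrative only (it matches the matrix shape shown in Figure 19.6); it encodes just the constraint that the row instance, when enabled, must receive a strictly smaller index than the column instance:

INF = float("inf")

def ordering_matrix(k):
    """Cost matrix over the states [off, on_1, ..., on_k] of two pattern instances,
    forbidding index pairs in which the row instance would not precede the
    column instance; turning either instance off costs nothing here."""
    states = range(k + 1)            # 0 stands for off, i > 0 for on_i
    return [[0 if (r == 0 or c == 0 or r < c) else INF for c in states]
            for r in states]

for row in ordering_matrix(3):
    print(row)
# [0, 0, 0, 0]
# [0, inf, 0, 0]
# [0, inf, inf, 0]
# [0, inf, inf, inf]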
Additional cost matrices are required to ensure that the corresponding proxy
state is selected on all the variables forming a particular pattern instance (which
can be modeled with combined costs of 0 or ∞ respectively). However, this
formulation allows for the trivial solution where all of the related variables en-
coding the selection of a complex pattern are set to off (accounting for 0 costs)
even though the artificial proxy state has been selected. We can overcome this
problem by adding a large integer value M to the costs for all proxy states. In
exchange, we subtract these costs from the cost vector of instances. Thus, the
penalties for the proxy states are effectively eliminated unless an invalid solution
is selected.
Cost matrices among nodes 1 to 6 do not differ from the basic approach
discussed before and reflect the costs of converting the non-terminal symbols
involved. It should be noted that, for general grammars and irreducible graphs, the heuristic solver of PBQP cannot guarantee to deliver a solution that satisfies all constraints modeled in terms of ∞ costs. This would be an NP-complete problem. One way to work around this limitation is to include a small set of rules that cover each node individually and that can be used as a fallback in situations where no feasible solution has been obtained, which is similar to macro substitution techniques and ensures a correct but possibly non-optimal matching. These limitations do not apply to exact PBQP solvers such as the branch-&-bound algorithm. It is also straightforward to extend the heuristic algorithm with a backtracking scheme on RN reductions, which would of course also be exponential in the worst case.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19.3 Concluding remarks and further reading
Aggressive optimizations for the instruction code selection problem are enabled by the use of SSA graphs. The whole flow of a function is taken into account rather than only a local scope. The move from basic tree-pattern matching [1] to SSA-based DAG matching is a relatively small step, as long as a PBQP library and some basic infrastructure (graph grammar translator, etc.) are provided. The complexity of
the approach is hidden in the discrete optimization problem called PBQP. Free
PBQP libraries are available from the web-pages of the authors and a library is
implemented as part of the LLVM [198] framework.
Many aspects of the PBQP formulation presented in this chapter could not be
covered in detail. The interested reader is referred to the relevant literature [114,
112] for an in-depth discussion.
As we move from acyclic linear code regions to whole functions, it becomes less clear in which basic block the selected machine instructions should be emitted. For chain rules, the obvious choices are often non-optimal. In [264], a
polynomial-time algorithm based on generic network flows is introduced that
allows a more efficient placement of chain rules across basic block boundaries.
This technique is orthogonal to the generalization to complex patterns.
CHAPTER 20
If-Conversion — (C. Bruel)
Very Long Instruction Word (VLIW) or Explicitly Parallel Instruction Computing (EPIC) architectures make Instruction Level Parallelism (ILP) visible within the Instruction Set Architecture (ISA), relying on static schedulers to organize the compiler output such that multiple instructions can be issued in each cycle.
If-conversion is the process of transforming a control-flow region with conditional branches into an equivalent predicated or speculated sequence of instructions, i.e., into a region of basic blocks (possibly a single one) referred to as a hyperblock. If-converted code replaces control dependencies by data dependencies, and thus exposes Instruction Level Parallelism very naturally within the new region at the software level.
Removing control hazards improves performance in several ways: By remov-
ing the misprediction penalty, the instruction fetch throughput is increased and
the instruction cache locality improved. Enlarging the size of basic blocks al-
lows earlier execution of long latency operations and the merging of multiple
control-flow paths into a single flow of execution, which can later be exploited by scheduling frameworks such as VLIW scheduling, hyperblock scheduling, or modulo scheduling.
Consider the simple example given in Figure 20.1, which represents the execution of an if-then-else-end statement on a 4-issue processor with non-biased branches. In this figure, r = q ? r1 : r2 stands for a select instruction where r
is assigned r1 if q is true, and r2 otherwise. With standard basic block ordering,
assuming that all instructions have a one cycle latency, the schedule height goes
from five cycles in the most optimistic case, to six cycles. After if-conversion the
execution path is reduced to four cycles with no branches, regardless of the test
outcome, and assuming a very optimistic one cycle branch penalty. But the main
benefit here is that it can be executed without branch disruption.
From this introductory example, we can observe that:
• the two possible execution paths have been merged into a single execution
path, implying a better exploitation of the available resources;
Fig. 20.1 [Schedules for the if-then-else-end example on a 4-issue processor: on the left, the original code with its two branching execution paths; on the right, the if-converted sequence ending with r = q ? r1 : r2 followed by r = r + 1.]
• the schedule height has been reduced, because instructions can be control
speculated before the branch;?
• the variables have been renamed, and a merge pseudo-instruction has been
introduced.
Thanks to SSA, the merging point is already materialized in the original con-
trol flow as a φ pseudo-instruction, and register renaming was performed by
SSA construction. Given this, the transformation to generate if-converted code
seems natural locally. Still, exploiting those properties on larger-scale control-flow regions requires a framework that we will develop further.
(a) fully predicated:          p ? x = a + b ;   p̄ ? x = a ∗ b
(b) speculative using select:  t1 = a + b ;   t2 = a ∗ b ;   x = p ? t1 : t2
(c) speculative using cmov:    x = a + b ;   t = a ∗ b ;   x = cmov p, t
Fig. 20.2 Conditional execution using different models
To be speculated, an instruction must not have any side effects, or hazards. For
instance, a memory load must not trap because of an invalid address. Memory
operations are a major impediment to if-conversion. This is regrettable because, like other long-latency instructions, speculative loads can be very effective at fetching data earlier in the instruction stream, reducing stalls. Modern architectures provide architectural support to dismiss invalid-address exceptions. Examples are the ldw.d dismissible load operation in the Multiflow Trace series of computers and in the STMicroelectronics ST231 processor, but also the speculative load of the Intel IA64. The main difference is that with a dismissible model, invalid memory access exceptions are not delivered, which can be problematic in embedded or kernel environments that rely on memory exceptions for correct behavior. A speculative model makes it possible to catch the exception, thanks to a token bit checked by a dedicated instruction. Some architectures, such as the IA64, offer both speculative and predicated memory operations. Stores can also be executed conditionally by speculating part of their address value, with additional constraints on the ordering of the memory operations due to possible aliasing between the two paths.
Figure 20.3 shows examples of various forms of speculative memory operations.
Note that the select instruction is an architectural instruction that does not need to be replaced during the SSA destruction phase. If the target architecture does not provide such a gated instruction, it can be emulated using two conditional moves. This translation can be done afterward, and the select instruction can still be used as an intermediate form. It allows the program to stay in full SSA form
where all the data dependencies are made explicit, and can thus be fed to all SSA
optimizers.
Fig. 20.3 Various forms of speculative memory operations:
(a) IA64 speculative load:             t = ld.s(addr) ;  chk.s(t) ;  p ? x = t
(b) Multiflow/ST231 dismissible load:  t = ldw.d(addr) ;  x = select p ? t : x
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
20.1 Basic transformations
Unlike global approaches, that identify a control-flow region and if-convert it
in one shot, the technique described in this chapter is based on incremental
reductions. To this end, we consider basic SSA transformations whose goal is
to isolate a simple diamond-DAG structure (informally an if-then-else-end)
that can be easily if-converted. The complete framework, that identifies and
incrementally performs the transformation, is described in Section 20.2.
The basic transformation that actually if-converts the code is the φ removal, which takes a simple diamond-DAG as an input, i.e., a single-entry-node/single-exit-node (SESE) DAG with only two distinct forward paths from its entry node to its exit node. The φ removal consists in (1) speculating the code of both branches in the entry basic block (denoted head); (2) then replacing the φ-function by a select; (3) finally, simplifying the control flow to a single basic block. This transformation is illustrated by the example of Figure 20.4.
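A minimal sketch of φ removal on a toy list-of-tuples representation (illustrative only; the instruction mnemonics, the phi tuple shape, and the select pseudo-instruction are assumptions, not the book's actual IR):

def phi_removal(head, branch_then, branch_else, exit_block, predicate):
    """If-convert a simple diamond-DAG:
    (1) speculate both branches into the entry block,
    (2) turn each 'x = phi(x_then, x_else)' of the exit block into a select,
    (3) return the resulting single basic block."""
    merged = list(head) + list(branch_then) + list(branch_else)
    for inst in exit_block:
        if inst[0] == "phi":
            _, dst, val_then, val_else = inst
            merged.append(("select", dst, predicate, val_then, val_else))
        else:
            merged.append(inst)
    return merged

head        = [("cmp", "p", "a", "b")]
branch_then = [("add", "x1", "g", "1")]
branch_else = [("mov", "x2", "y")]
exit_block  = [("phi", "x3", "x1", "x2"), ("use", "x3")]
for inst in phi_removal(head, branch_then, branch_else, exit_block, "p"):
    print(inst)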
The goal of the φ reduction?transformation is to isolate a diamond-DAG from
a structure that resembles a diamond-DAG but has side entries to its exit block.
This diamond-DAG can then be reduced using the φ removal transformation.
Nested if-then-else-end constructs in the original code can create such control flow. One can notice the similarity with the nested arity-two φ_if-functions used for gated-SSA (see Chapter 14). In the most general case, the join node of the considered
region has n predecessors with φ-functions of the form B0 : r = φ(B1 : r1 , B2 :
r2 , . . . , Bn : rn ), and is such that removing edges from B3 , . . . , Bn would give a
diamond-DAG. After the transformation, B1 and B2 point to a freshly created
basic block, say B12 , that itself points to B0 ; a new variable B12 : r12 = φ(B1 :
r1 , B2 : r2 ) is created in this new basic block; the φ-function in B0 is replaced
by B0 : r = φ(B12 : r12 , . . . , Bn : rn ). This is illustrated through the example of
Figure 20.5.
Fig. 20.5 [φ reduction: the predecessors B1 and B2 of the join block B0 are redirected to a fresh block B12 holding r12 = φ(r1, r2); the φ-function in B0 then takes r12 together with the remaining arguments such as r3.]
The objective of path duplication is to get rid of all side-entry edges that prevent a single-exit-node region from being a diamond-DAG. Through path duplication, all edges that point to a node other than the exit node or the intended entry node are "redirected" to the exit node. φ reduction can then be applied to the obtained region. More formally, consider two distinguished nodes, the entry node head and the single exit node of the region exit, such that there are exactly two different control-flow paths from head to exit; consider (if it exists) the first node side_i on one of the forward paths head → side_0 → · · · → side_p → exit that has at least two predecessors. The transformation duplicates the path P = side_i → · · · → side_p → exit into P′ = side′_i → · · · → side′_p → exit and redirects side_{i−1} (or head if i = 0) to side′_i. All the φ-functions along P and P′ for which the number of predecessors has changed have to be updated accordingly. Hence, a r = φ(side_p: r1, B2: r2, . . . , Bn: rn) in exit will be updated into r = φ(side′_p: r1, B2: r2, . . . , Bn: rn, side_p: r1); a r = φ(side_{i−1}: r0, r1, . . . , rm) originally in side_i will be updated
into r = φ(r1, . . . , rm) in side_i and into r = φ(r0), i.e., r = r0, in side′_i. Variable renaming (see Chapter 5) along with copy-folding can then be performed on P and P′. All steps are illustrated through the example of Figure 20.6.
Fig. 20.6 [Path duplication example: the path from side_0 to the exit is duplicated into side′_0 and side′_1, the φ-functions along both paths are updated, and variable renaming with copy-folding is then applied before if-converting the resulting diamond-DAG.]
The last transformation, namely the conjunctive predicate merge, concerns the if-conversion of a control-flow pattern that sometimes appears in code to represent logical-and or logical-or conditional operations. As illustrated by Figure 20.7, the goal is to get rid of side-exit edges that prevent a single-entry-node region from being a diamond-DAG. As opposed to path duplication, the transformation is restricted to a very simple pattern, highlighted in Figure 20.7, made up of three distinct basic blocks: head, which branches with predicate p to side or to exit; side, which is empty and itself branches with predicate q to another basic block outside of the region or to exit; and exit. Conceptually the transformation can be understood as first isolating the outgoing path p → q and then if-converting the obtained diamond-DAG.
Fig. 20.7 [Control-flow pattern targeted by the conjunctive predicate merge, and the code before and after the transformation, where φ-functions at the merge point become selects on p.]
The φ removal transformation described above assumed a speculative execution model. As we will illustrate hereafter, in the context of a predicated execution model, the choice of speculation versus predication is an optimization decision that should not be imposed by the intermediate representation. Also, transforming speculated code into predicated code can be viewed as a coalescing problem. The use of ψ-SSA (see Chapter 15) as the intermediate form of if-conversion
allows the decision of speculating some code to be postponed, while the coalescing problem is naturally handled by the ψ-SSA destruction phase.
Just as (control) speculating an operation on a control-flow graph corresponds to ignoring the control dependence with the conditional branch, speculating an operation in if-converted code corresponds to removing the data dependence with the corresponding predicate. On the other hand, on register-allocated code, speculation adds anti-dependencies. This trade-off can be illustrated through the example of Figure 20.9: For the fully predicated version of the code, the computation of p has to be done before the computations of x1 and x2; speculating the computation of x1 removes the dependence with p and allows it to be executed in parallel with the test (a < b); if both the computations of x1 and x2 are speculated, they cannot be coalesced and, when destructing ψ-SSA, the ψ-function will give rise to some select instruction; if only the computation of x1 is speculated, then x1 and x2 can be coalesced to x, but then an anti-dependence from x = a + b to p ? x = c appears that forbids its execution in parallel.
(a) predicated code   (b) fully speculated   (c) partially speculated   (d) after coalescing
Fig. 20.9 Speculation removes the dependency with the predicate but adds anti-dependencies between concurrent computations.
(a) nested if   (b) speculating the then branch   (c) predicating both branches
Fig. 20.10 Inner region ψ
The algorithm takes as input a CFG in SSA form and applies incremental re-
ductions using the list of candidate conditional basic blocks sorted in post-
order. Each basic block in the list designates the head of a sub-graph that can
be if-converted using the transformations described in Section 20.1. Post-order traversal allows each region to be processed from the innermost to the outermost. When the if-converted region cannot grow anymore because of resources, or because a basic block cannot be if-converted, then the next sub-graph candidate is considered, until the whole CFG has been explored. Note that, as the reduction proceeds, maintaining SSA can be done using the general technique described in Chapter 5. Basic local ad hoc updates can also be implemented instead.
Consider for example the CFG from the gnu wc (word count) program re-
ported in Figure 20.11a. The exit node BB7, and basic block BB3 that contains a
function call cannot be if-converted (represented in gray). The post-order list
of conditional blocks (represented in bold) is [BB11, BB17, BB16, BB14, BB10,
BB9, BB6, BB2]. (1) The first candidate region is composed of {BB11, BB2, BB12};
φ-reduction can be applied, promoting the instructions of BB12 in BB11; BB2
becomes the single successor of BB11. (2) The region headed by BB17 is then
considered; BB19 cannot yet be promoted because of the side entries coming both from BB15 and BB16; BB19 is duplicated into a BB19′ with BB2 as successor; BB19′ can then be promoted into BB17. (3) The region headed by BB16, which now has BB17 and BB19 as successors, is considered; BB19 is duplicated into BB19′′, so as to promote BB17 and BB19′ into BB16 through φ-reduction; BB19′ already contains predicated operations from the previous transformation, so a new merging predicate is computed and inserted; after the completion of φ-removal, BB16 has a unique successor, BB2. (4) BB14 is the head of the next candidate region; here, BB15 and BB16 can be promoted; again, since BB16 contains predicated and predicate-setting operations, a fresh predicate must be created to hold the merged conditions. (5) BB10 is then considered; BB14 needs to be duplicated into BB14′. The process finishes with the region headed by BB9.
Fig. 20.11 [Control-flow graph of the gnu wc program: (a) the original CFG, with the conditional blocks in bold and the non-if-convertible blocks (BB3, BB7) in gray; (b) the CFG after tail duplication of the blocks reachable from BB6; (c) the result of the global approach.]
Just as for the example of Figure 20.11, some basic blocks (such as BB3) may have
to be excluded from the region to if-convert. Tail duplication?can be used for this
purpose. Similar to path duplication?described in Section 20.1, the goal of tail
duplication is to get rid of incoming edges of a region to if-convert. This is usually
?
done in the context of hyperblock formation, which technique consists in, as
opposed to the inner-outer incremental technique described in this chapter, to if-
convert a region in “one shot”. Consider again the example of Figure 20.11a, and
suppose the set of selected basic blocks defining the region to if-convert consists
of all basic blocks from BB2 to BB19 excluding BB3, BB4, and BB7. Getting rid of
the incoming edge from BB4 to BB6 is possible by duplicating all basic blocks of
the region reachable from BB6 as shown in Figure 20.11b.
Formally, consider a region R made up of a set of basic blocks, a distinguished one, entry, and the others denoted (B_i)_{2≤i≤n}, such that any B_i is reachable from entry in R. Suppose a basic block B_s has some predecessors out_1, . . . , out_m that are not in R. Tail duplication consists in: (1) for all B_j (including B_s) reachable from B_s in R, create a basic block B′_j as a copy of B_j; (2) any branch from B′_j that points to a basic block B_k of the region is rerouted to its duplicate B′_k; (3) any branch from a basic block out_k to B_s is rerouted to B′_s. In our example, we would have entry = BB2, B_s = BB6, and out = BB4.
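The three steps can be sketched on a dictionary-based CFG as follows (illustrative Python; successor lists stand for the branches, and a trailing prime in a block name marks its duplicate):

def tail_duplicate(succs, region, b_s, outside_preds):
    """Tail duplication of block b_s for a region (a set of block names).
    succs maps each block to its list of successors and is updated in place."""
    # (1) duplicate every block of the region reachable from b_s
    reach, work = set(), [b_s]
    while work:
        b = work.pop()
        if b in reach:
            continue
        reach.add(b)
        work.extend(s for s in succs.get(b, []) if s in region)
    dup = {b: b + "'" for b in reach}
    # (2) duplicated branches staying inside the region target the duplicates
    for b in reach:
        succs[dup[b]] = [dup.get(s, s) for s in succs.get(b, [])]
    # (3) predecessors outside the region are rerouted to the duplicate of b_s
    for p in outside_preds:
        succs[p] = [dup[b_s] if s == b_s else s for s in succs[p]]

# Toy version of the wc situation: BB4, outside the region, enters it at BB6.
cfg = {"BB2": ["BB6"], "BB4": ["BB6"], "BB6": ["BB8"], "BB8": []}
tail_duplicate(cfg, region={"BB2", "BB6", "BB8"}, b_s="BB6", outside_preds=["BB4"])
print(cfg["BB4"], cfg["BB6'"])   # ["BB6'"] ["BB8'"]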
A global approach would just do as in Figure 20.11c: first select the region; second, get rid of side incoming edges using tail duplication; finally, perform if-conversion of the whole region in one shot. We would like to point out that there is no phasing issue with tail duplication. To illustrate this point, consider the example of Figure 20.12a where BB2 cannot be if-converted. The selected region is made up of all other basic blocks. Using a global approach as in standard hyperblock formation, tail duplication would be performed prior to any if-conversion. This would lead to the CFG of Figure 20.12b. Note that a new node, BB7, has been added here after the tail duplication by a process called branch coalescing. Applying if-conversion on the two disjoint regions respectively headed by BB4 and BB4′ would lead to the final code shown in Figure 20.12c. Our incremental scheme would first perform if-conversion of the region headed by BB4, leading to the code depicted in Figure 20.12e. Applying tail duplication to get rid of the side entry from BB2 would lead to exactly the same final code, as shown in Figure 20.12f.
Fig. 20.12 [Tail duplication and if-conversion commute: performing tail duplication before if-conversion (b, c) or after it (e, f) yields the same final code.]
20.2.3 Profitability
Fusing execution paths can over-commit the architecture's ability to execute multiple instructions in parallel: data dependencies and register renaming
introduce new register constraints. Moving operations earlier in the instruction
stream increases live-ranges. Aggressive if-conversion can easily exceed the pro-
\[ \mathrm{cost}_{\mathrm{predicated}} \;=\; \widehat{\mathit{head} \circ B_1 \circ \cdots \circ B_n \circ B'_1 \circ \cdots \circ B'_m} \]
where ◦ is the composition function that merges basic blocks together, removes the associated branches, and creates the predicate operations.
The profitability for the logical conjunctive merge of Figure 20.14 can be evaluated similarly. There are three paths impacted by the transformation: path_{p∧q} = [head, side, B1[, path_{p∧q̄} = [head, side, exit[, and path_{p̄} = [head, exit[, of respective probabilities prob(p ∧ q), prob(p ∧ q̄), and prob(p̄). The overall cost before the transformation (if branches are on p and q), \widehat{path}_{p∧q} + \widehat{path}_{p∧q̄} + \widehat{path}_{p̄}, simplifies to
\[ \mathrm{cost}_{\mathrm{control}} = \widehat{\mathit{head}} + \widehat{\mathit{side}} + \mathrm{prob}(p) \times (1 + \mathrm{prob}(q)) \times \mathit{br\_lat}, \qquad \widehat{\mathit{head}} = [\mathit{head}] \]
which should be compared to (if the branch on the new head block is on p ∧ q)
\[ \mathrm{cost}_{\mathrm{predicated}} = \widehat{\mathit{head} \circ \mathit{side}} = [\mathit{head} \circ \mathit{side}] + \mathrm{prob}(p) \times \mathrm{prob}(q) \times \mathit{br\_lat} \]
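As a numeric illustration (a sketch only; the block lengths, the branch latency, and the assumption that the probability-weighted cost of side is prob(p) × [side] are all made up for the example):

def cost_control(head_len, side_len, prob_p, prob_q, br_lat):
    # original control flow: head always executes, side executes when p holds,
    # and a branch penalty is paid when p (head -> side) and when p and q (side -> B1)
    return head_len + prob_p * side_len + prob_p * (1 + prob_q) * br_lat

def cost_predicated(merged_len, prob_p, prob_q, br_lat):
    # merged block with a single branch on p and q
    return merged_len + prob_p * prob_q * br_lat

print(cost_control(4, 2, prob_p=0.5, prob_q=0.5, br_lat=2))    # 6.5
print(cost_predicated(5, prob_p=0.5, prob_q=0.5, br_lat=2))    # 5.5: merging pays off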
Fig. 20.14 [Conjunctive predicate merge: the control flow with separate branches on p (in head) and q (in side) is replaced by a single branch on p ∧ q in the merged head block.]
Further readings

CHAPTER 21
SSA Destruction for Machine Code — (F. Rastello)
Chapter 3 provides a basic algorithm for destructing SSA that suffers from several limitations and drawbacks: first, it works under implicit assumptions that are not necessarily fulfilled at machine level; second, it must rely on subsequent phases to remove the numerous copy operations it inserts; finally, it substantially increases the size of the intermediate representation, thus making it unsuitable for just-in-time compilation.
same original variable should not interfere, while two names can. Such a flavor corresponds to the C-SSA form described in Chapter 2. The former simplifies the SSA destruction phase, while the latter simplifies, and allows more of, the transformations performed under SSA (updating C-SSA is very difficult). Apart from dedicated registers, for which optimizations are usually very careful in managing their live ranges, register constraints related to calling conventions or to the instruction set architecture might be handled by the register allocation phase. However, as we will see, the enforcement of register constraints impacts the register pressure as well as the number of copy operations. For those reasons we may want those constraints to be expressed earlier (such as for the pre-pass scheduler), in which case the SSA destruction phase might have to cope with them.
Code quality
The cleanest and simplest way to perform SSA destruction with good code quality is to first insert copy instructions to make the SSA form conventional, then take advantage of the SSA form to efficiently run aggressive coalescing (without breaking the conventional property), before eventually renaming φ-webs and getting rid of φ-functions. Unfortunately this approach will lead, in a transitional stage, to an intermediate representation with a substantial number of variables: the size of the liveness sets and of the interference graph classically used to perform coalescing becomes prohibitively large for dynamic compilation. To overcome this difficulty one can compute liveness and interference on demand, which, as we already mentioned, is made simpler by the use of SSA form (see Chapter 9). There remains the process of copy insertion itself, which might still take a substantial amount of time. To fulfill the memory and time constraints imposed by just-in-time compilation, one idea is to virtually insert those copies, and only effectively insert the non-coalesced ones.
This chapter addresses those three issues: handling of machine level con-
straints, code quality (elimination of copies), and algorithm efficiency (speed
and memory footprint). The layout falls into three corresponding sections.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.1 Correctness
In most cases, edge splitting can be avoided by treating φ-uses and φ-definition
operand symmetrically: instead of just inserting copies on the incoming control-
flow edges of the φ-node (one for each use operand), a copy is also inserted on
the outgoing edge (one for its defining operand). This has the effect of isolating
the value associated to the φ-node thus avoiding (as discussed further) SSA
destruction issues such as the well-known lost-copy problem?. The process of
φ-node isolation?is illustrated by Figure 21.1. The corresponding pseudo-code is
given in Algorithm 21.1. If, because of different φ-functions, several copies are
introduced at the same place, they should be viewed as parallel copies?. For that
reason, an empty parallel copy is initially inserted both at the beginning (i.e., right
after φ-functions, if any) and at the end of each basic block (i.e., just before the
branching operation, if any). Note that, as far as correctness is concerned, those
copies can be sequentialized in any order, as they concern different variables
(this is a consequence of φ-node isolation – see below).
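A sketch of this copy insertion on a toy representation (illustrative Python; the block dictionaries and the primed naming are assumptions, and each begin_pc / end_pc list stands for one parallel copy):

import itertools

_version = itertools.count()

def isolate_phis(blocks, phis):
    """blocks: name -> {'begin_pc': [...], 'end_pc': [...]} parallel copies.
    phis: list of (B0, a0, [(Bi, ai), ...]).
    Appends a copy ai' <- ai to the parallel copy ending each predecessor Bi and
    a copy a0 <- a0' to the parallel copy starting B0; returns the isolated phis."""
    isolated = []
    for b0, a0, args in phis:
        new_args = []
        for bi, ai in args:
            ai_new = "%s'%d" % (ai, next(_version))
            blocks[bi]["end_pc"].append((ai_new, ai))      # use-operand copies
            new_args.append((bi, ai_new))
        a0_new = "%s'%d" % (a0, next(_version))
        blocks[b0]["begin_pc"].append((a0, a0_new))        # definition-operand copy
        isolated.append((b0, a0_new, new_args))
    return isolated

blocks = {b: {"begin_pc": [], "end_pc": []} for b in ("B0", "B1", "B2")}
print(isolate_phis(blocks, [("B0", "a0", [("B1", "a1"), ("B2", "a2")])]))
print(blocks["B1"]["end_pc"], blocks["B0"]["begin_pc"])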
Fig. 21.1 [φ-node isolation: copies a′_1 ← a_1 and a′_2 ← a_2 are appended to the predecessor blocks B_1 and B_2, the φ-function becomes B_0: a′_0 ← φ(B_1: a′_1, B_2: a′_2), and the copy a_0 ← a′_0 is inserted just after it in B_0.]
When incoming edges are not split, inserting a copy not only for each argument of the φ-function but also for its result is important: without the copy a_0 ← a′_0, the φ-function directly defines a_0, whose live range can be long enough to intersect the live range of some a′_i, i > 0. Prior SSA destruction algorithms that did not perform the copy a_0 ← a′_0 identified two problems. (1) In the “lost-copy problem”, a_0 is used in a successor of B_i ≠ B_0, and the edge from B_i to B_0 is critical. (2) In the “swap problem”, a_0 is used in B_0 as a φ-function argument. In this latter case, if parallel copies are used, a_0 is dead before a′_i is defined. But, if copies are sequentialized blindly, the live range of a_0 can go beyond the definition point of a′_i and lead to incorrect code after renaming a_0 and a′_i with the same name.
φ-node isolation allows most of the issues that can be faced at machine level to be solved. However, there remain subtleties, listed below.
Limitations
There is a tricky case, when the basic block contains variables defined after the point of copy insertion. This is for example the case for the PowerPC bclr branch instructions with a behavior similar to hardware loops. In addition to the
condition, a counter u is decremented by the instruction itself. If u is used in a
φ-function in a direct successor block, no copy insertion can split its live range.
It must then be given the same name as the variable defined by the φ-function. If
both variables interfere, this is just impossible! For example, suppose that for the
code of Figure 21.2a, the instruction selection chooses a branch with decrement
(denoted br_dec) for Block B1 (Figure 21.2b). Then, the φ-function of Block B2 ,
which uses u , cannot be translated out of SSA by standard copy insertion because
u interferes with t 1 and its live range cannot be split. To destruct SSA, one could
add t 1 ← u − 1 in Block B1 to anticipate the branch. Or one could split the critical
edge between B1 and B2 as in Figure 21.2c. In other words, simple copy insertion is not enough in this case. We see several alternatives to solve the problem: (1) the
SSA optimization could be designed with more care; (2) the counter variable
must not be promoted to SSA; (3) some instructions must be changed; (4) the
control-flow edge must be split somehow.
There is another tricky case when a basic block has twice the same predecessor
?
block. This can result from consecutively applying copy-folding and control-
?
flow graph structural optimizations such as dead code elimination or empty
?
block elimination. This is the case for the example of Figure 21.3 where copy-
folding?would remove the copy a 2 ← b in Block B2 . If B2 is eliminated, there is no
way to implement the control dependence of the value to be assigned to a 3 other
than through predicated code (see chapters 15 and 14) or through the reinsertion
of a basic block between B1 and B0 by the split of one of the edges.
The last difficulty SSA destruction faces when performed at machine level
is related to register constraints such as instruction set architecture (ISA)? or
application binary interface (ABI)? constraints. For the sake of the discussion
we differentiate two kinds of resource constraints that we will refer to as operand pinning and live range pinning. The live range pinning of a variable v to a resource R will be represented as R_v, just as if v were a version of temporary R. The pinning of an operand to a resource R will be represented using the exponent ↑R on the corresponding operand. Live range pinning expresses the fact that the entire live
range of a variable must reside in a given resource (usually a dedicated register).
(a) Initial SSA code   (b) Branch with decrement   (c) C-SSA with additional edge splitting
Fig. 21.2 Copy insertion may not be sufficient. br_dec u, B1 decrements u, then branches to B1 if u ≠ 0.
(a) Initial C-SSA code   (b) T-SSA code   (c) After φ-isolation
Fig. 21.3 Copy-folding followed by empty block elimination can lead to SSA code for which destruction is not possible through simple copy insertion
(a) Operand pinning of an auto-increment:   p2↑T ← p1↑T + 1
(b) Corresponding live range pinning:       T_p1 ← p1 ;  T_p2 ← T_p1 + 1 ;  p2 ← T_p2
(c) Operand pinning of a function call:     a↑R0 ← f(b↑R0, c↑R1)
(d) Corresponding live range pinning:       R0_b′ = b ∥ R1_c′ = c ;  R0_a′ = f(R0_b′, R1_c′) ;  a = R0_a′
Fig. 21.4 Operand pinning and corresponding live range pinning
The scheme we propose in this section to perform SSA destruction that deals with
machine level constraints does not address compilation cost (in terms of speed
?
and memory footprint). It is designed to be simple. It first inserts parallel copies to
isolate φ-functions and operand pinning. Then it checks for interferences that
would persist. We will denote such interferences as strong?, as they cannot be
tackled through the simple insertion of temporary-to-temporary copies in the
code. We consider that fixing strong interferences should be done on a case-by-
case basis and restrict the discussion here to their detection.
As far as correctness is concerned, Algorithm 21.1 splits the data flow between
variables and φ-nodes through the insertion of copies. For a given φ-function
a 0 ← φ(a 1 , . . . , a n ), this transformation is correct as long as the copies can be
inserted close enough to the φ-function. It might not be the case if the insertion
point (for a use-operand) of copy a i0 ← a i is not dominated by the definition
point of a i (such as for argument u of the φ-function t 1 ← φ(u , t 2 ) for the code
of Figure 21.2b); symmetrically, it will not be correct if the insertion point (for
the definition-operand) of copy a 0 ← a 00 does not dominate all the uses of a 0 .
Precisely this leads to inserting in Algorithm 21.1 the following tests:
• line 9: “if the definition of a i does not dominate PCi then continue;”
• line 16: “if one use of a 0 is not dominated by PC0 then continue;”
?
For the discussion, we will denote as split operands the newly created local vari-
ables to differentiate them to the ones concerned by the two previous cases
(designed as non-split operands?). We suppose a similar process have been per-
formed for operand pinning to express them in terms of live range pinning with
very short (when possible) live ranges around the concerned operations.
At this point, the code is still under SSA and the goal of the next step is to check that it is conventional: this will obviously be the case only if all the variables of a φ-web can be coalesced together. But not only: the set of all variables pinned to a common resource must also be interference free. We say that x and y are pin-φ-related to one another if they are φ-related or if they are pinned to a common resource. The transitive closure of this relation defines an equivalence relation that partitions the variables defined locally in the procedure into equivalence classes, the pin-φ-webs. Intuitively, the pin-φ-equivalence class of a resource
Now, one needs to check that each web is interference free. A web contains
variables and resources. The notion of interference between two variables is the
one discussed in Section 2.6, for which we will propose an efficient implemen-
tation later in this chapter. A variable and a physical resource do not interfere,
while two distinct physical resources interfere with one another.
If any interference has been discovered, it has to be fixed on a case-by-case
basis. Note that some interferences such as the one depicted in Figure 21.3 can
be detected and handled initially (through edge splitting if possible) during the
copy insertion phase.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.2 Code quality
Once the code is in conventional SSA, the correctness problem is solved: destruc-
ting it is by definition straightforward, as it amounts to renaming all variables in each
Algorithm 21.2: The pin-φ-webs discovery algorithm, based on the union-find pattern
1 for each resource R do
2   web(R) ← {R}
3 for each variable v do
4   web(v) ← {v}
5   if v pinned to a resource R then
6     union(web(R), web(v))
φ-web into a unique representative name and then removing all φ-functions. To
improve the code, however, it is important to remove as many copies as possible.
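As a rough illustration of Algorithm 21.2, extended with the φ-related unions, the following Python sketch relies on a textbook union-find; the pinning map and the shape of the φ objects (args, result) are assumptions of the sketch, not the book's data structures.

class PinPhiWebs:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

    def build(self, variables, phis, pinning):
        # Algorithm 21.2: a variable pinned to a resource joins that resource's web.
        for v in variables:
            if v in pinning:
                self.union(v, pinning[v])
        # phi-related variables (operands and result) belong to the same web.
        for phi in phis:
            for arg in phi.args:
                self.union(arg, phi.result)
        return self

Each resulting equivalence class is a candidate pin-φ-web, on which the interference check described previously is then performed.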
Aggressive coalescing
Fig. 21.5 Running example: the code before and after φ-isolation (introducing x2′ ← φ(x1′, x3′) and the
copies x1′ ← x1, x3′ ← x3, x2 ← x2′), and the corresponding interference graphs; panel (c) shows the
interference graph under the multiplexing semantics, with {x1′, x2′, x3′} grouped together.
If the goal is not to destruct SSA completely but to remove as many copies as possible
while maintaining the conventional property, liveness of φ-function operands
should reproduce the behavior of the corresponding non-SSA code, as if the vari-
ables of the φ-web were all coalesced together. The semantics of the φ-operator
in the so-called multiplexing mode fits these requirements. The corresponding
interference graph on our example is depicted in Figure 21.5c.
Value-based interference
As said earlier, after the φ-isolation phase and the treatment of operand pinning
constraints, the code contains many overlapping live ranges that carry the same
value. Because of this, coalescing must use an accurate notion of interference to
be efficient. As already mentioned in Chapter 2, the ultimate notion of interfer-
ence involves two dynamic (i.e., related to the execution) notions: the notion of
liveness and the notion of value. Analyzing statically whether a variable is live at a given
execution point or whether two variables carry identical values is a difficult problem.
The scope of variable coalescing is usually not so large, and graph-coloring-based
register allocators commonly take the following conservative test: two variables
interfere if one is live at a definition point of the other and this definition is not a
copy between the two variables.
One can notice that, with this conservative interference definition, when a
and b are coalesced, the set of interferences of the new variable may be strictly
smaller than the union of interferences of a and b . Thus, simply merging the
two corresponding nodes in the interference graph is an over-approximation
with respect to the interference definition. For example, in a block with two
successive copies b = a and c = a where a is defined before, and b and c (and
possibly a ) are used after, it is considered that b and c interfere but that none of
them interfere with a . However, after coalescing a and b , c should not interfere
anymore with the coalesced variable. Hence the interference graph would have
to be updated or rebuilt.
However, in SSA, each variable has, statically, a unique value, given by its
unique definition. Furthermore, the “has-the-same-value” binary relation de-
fined on variables is, if the SSA form fulfills the dominance property, an equiva-
lence relation. The value of an equivalence class is the variable whose definition
dominates the definitions of all other variables in the class. Hence, using the
same scheme as in SSA copy folding, finding the value of a variable can be
done by a simple topological traversal of the dominance tree: when reaching
an assignment of a variable b, if the instruction is a copy b = a, V(b) is set to
V(a); otherwise V(b) is set to b. The interference test is now both simple and
of program points where x is live,
a interferes with b if live(a) intersects live(b) and V(a) ≠ V(b).
The first part reduces to def(a) ∈ live(b) or def(b) ∈ live(a) thanks to the
dominance property. In the previous example, a , b , and c have the same value
V (c ) = V (b ) = V (a ) = a , thus they do not interfere.
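A minimal Python sketch of this value computation, assuming the instructions can be visited in a pre-order of the dominance tree and expose the illustrative accessors is_copy, copy_source, and defined_variables:

def compute_values(instructions_in_dominance_preorder):
    # A copy b = a inherits V(a); any other definition starts a fresh value,
    # represented by the defined variable itself.
    V = {}
    for inst in instructions_in_dominance_preorder:
        for b in inst.defined_variables():
            if inst.is_copy():
                a = inst.copy_source()
                V[b] = V.get(a, a)    # fall back to a if a is not locally defined
            else:
                V[b] = b
    return V

With V at hand, the interference test above requires no update or rebuild of any data structure after a coalescing.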
Note that our notion of values is limited to the live ranges of SSA variables, as
we consider that each φ-function defines a new variable. We could propagate in-
formation through a φ-function when its arguments are equivalent (same value).
But we would face the complexity of global value numbering (see Chapter 11).
By comparison, our equality test in SSA comes for free.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.3 Speed and memory footprint
The approach described so far relies on a liveness
analysis and the interference graph construction. Finally, if a general coalescing
algorithm is used, a graph representation with adjacency lists (in addition to
the bit matrix) and a working graph to explicitly merge nodes when coalescing
variables would be required. All these constructions, updates, and manipulations are
time-consuming and memory-consuming. We may improve the whole process
by: (a) avoiding the use of any interference graph and liveness sets; (b) avoiding
the quadratic complexity of the interference check between two sets of variables
through an optimistic approach that first coalesces all copy-related variables (even
interfering ones), then traverses each set of coalesced variables and un-coalesces
one by one all the interfering ones; (c) emulating (“virtualizing”) the introduction
of the φ-related copies.
Interference check
Liveness sets and the interference graph are the major sources of memory usage. This
motivates, in the context of JIT compilation, not building any interference graph
at all, and relying on the liveness check described in Chapter 9 to test whether two live
ranges intersect or not. Let us suppose for this purpose that a “has-the-same-
value” equivalence relation is available thanks to a mapping V of variables to
symbolic values:
intersect(a, b) ⇔ liverange(a) ∩ liverange(b) ≠ ∅
             ⇔ a.def.op = b.def.op
               ∨ (a.def.op dominates b.def.op ∧ a.islive_out(b.def.op))
               ∨ (b.def.op dominates a.def.op ∧ b.islive_out(a.def.op))
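This check translates almost directly into code. The sketch below assumes that a dominance oracle and the live-out query of the liveness check are available under the illustrative names dominates and live_out:

def intersect(a, b, dominates, live_out):
    # Live-range intersection under strict SSA: one definition dominates the
    # other, so it suffices to test liveness at the dominated definition.
    if a.def_op == b.def_op:
        return True
    if dominates(a.def_op, b.def_op):
        return live_out(a, b.def_op)
    if dominates(b.def_op, a.def_op):
        return live_out(b, a.def_op)
    return False

def interfere(a, b, V, dominates, live_out):
    # Value-based refinement: intersecting live ranges carrying the same value
    # do not interfere.
    return intersect(a, b, dominates, live_out) and V[a] != V[b]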
The interference check outlined in the previous paragraph makes it possible to avoid build-
ing an interference graph of the SSA-form program. However, coalescing has the
effect of merging vertices, and interference queries actually have to be done be-
tween sets of vertices. To overcome this complexity issue, the technique proposed
here is based on a de-coalescing scheme. The idea is to first merge all copy and
φ-function related variables together. A merged-set might contain interfering
variables at this point. The principle is to identify some variables that interfere
with some other variables within the merged-set, and remove them (along with
the one they are pinned with) from the merged-set. As we will see, thanks to the
dominance property, this can be done linearly using a single traversal of the set.
In reference to register allocation, and graph coloring, we will associate the
notion of colors to merged-sets: all the variables of the same set are assigned
the same color, and different sets are assigned different colors. The process of
de-coalescing a variable is to extract it from its set; it is not put in another set,
just isolated: we say it is uncolored. Actually, variables pinned together have
to stay together. We denote by atomic-merged-set(v) the (interference-free) set of
variables pinned to a common resource that contains variable v. So the process
of un-coloring a variable might have the effect of un-coloring some others. In
other words, a colored variable is to be coalesced with variables of the same
color, and any uncolored variable v is to be coalesced only with the variables it
is pinned with, i.e., atomic-merged-set(v).
We suppose that variables have already been colored and the goal is to un-
color some of them (preferably not all of them) so that each merged-set becomes
interference free. We suppose that if two variables are pinned together they have
been assigned the same color, and that a merged-set cannot contain variables
pinned to different physical resources. Here we focus on a single merged-set and
the goal is to make it interference free within a single traversal. The idea exploits
the tree shape of variable live ranges under strict SSA. To this end, variables are
identified by their definition point and ordered using dominance accordingly.
Algorithm 21.3 performs a traversal of this set along the dominance order, en-
forcing at each step the subset of already considered variables to be interference
free. From now on, we will abusively designate as the dominators of a variable v the set
of variables with the same color as v whose definition dominates the definition of v.
Variables defined at the same program point are arbitrarily ordered, so as to use
the standard definition of immediate dominator (denoted v.idom, set to ⊥ if it does
not exist, and updated on lines 6-8). To illustrate the role of v.eanc in Algorithm 21.3, let us
consider the example of Figure 21.6 where all variables are assumed to be origi-
nally in the same merged-set: v.eanc (updated on line 16) represents the immediate
intersecting dominator with the same value as v; so we have b.eanc = ⊥ and
d.eanc = a. When line 14 is reached, cur_anc (if not ⊥) represents a dominating
variable interfering with v and with the same value as v.idom: when v is set to
c (c.idom = b), as b does not intersect c and as b.eanc = ⊥, we get cur_anc = ⊥, which
allows us to conclude that there is no dominating variable that interferes with c;
when v is set to e, d does not intersect e, but as a intersects d and has the same
value as d (otherwise a or d would have been uncolored), we have d.eanc = a
and thus cur_anc = a. This allows the interference of e with a to be detected on
line 18.
Algorithm 21.3: De-coalescing of a merged-set
1 cur_idom ← ⊥
2 foreach variable v of the merged-set in DFS pre-order of the dominance tree do
3   DeCoalesce(v, cur_idom)
4   cur_idom ← v
5 Function DeCoalesce(v, u)
6   while (u ≠ ⊥) ∧ (¬(u dominates v) ∨ uncolored(u)) do
7     u ← u.idom
8   v.idom ← u
9   v.eanc ← ⊥
10  cur_anc ← v.idom
11  while cur_anc ≠ ⊥ do
12    while cur_anc ≠ ⊥ ∧ ¬(colored(cur_anc) ∧ intersect(cur_anc, v)) do
13      cur_anc ← cur_anc.eanc
14    if cur_anc ≠ ⊥ then
15      if V(cur_anc) = V(v) then
16        v.eanc ← cur_anc
17        break
18      else   cur_anc and v interfere
19        if preferable to uncolor v then
20          uncolor atomic-merged-set(v)
21          break
22        else
23          uncolor atomic-merged-set(cur_anc)
24          cur_anc ← cur_anc.eanc
Fig. 21.6 Variable live ranges are sub-trees of the dominator tree. The example defines a ← . . . , then on
one branch b ← a; c ← b + 2; . . . ← c + 1 and on the other d ← a; e ← d + 1; . . . ← a + e.
The last step toward a memory-friendly and fast SSA-destruction algorithm con-
sists in emulating the initial introduction of copies and only actually inserting them
on the fly when they appear to be required. We use exactly the same algorithms
as for the solution without virtualization, and use a special location in the code,
identified as a “virtual” parallel copy, where the real copies, if any, will be placed.
Because of this, we consider a different semantics for φ-functions than the
multiplexing mode previously defined. To this end we differentiate φ-operands
for which a copy cannot be inserted (such as for the br_dec of Figure 21.2b) from the
others. We use the terms non-split and split operands introduced in Section 21.1.
For a φ-function B0 : a0 = φ(B1 : a1, . . . , Bn : an) and a split operand Bi : ai, we
denote the program point where the corresponding copy would be inserted
as the early point of B0 (early(B0) – right after the φ-functions of B0) for the
definition-operand, and as the late point of Bi (late(Bi) – just before the branching
instruction) for a use-operand.
Algorithm 21.4: De-coalescing with virtualization of φ-related copies
1 foreach c ∈ COLORS do c.cur_idom ← ⊥
2 foreach basic block B in CFG in DFS pre-order of the dominance tree do
3   foreach program point l of B in topological order do
4     if l = late(B) then
5       foreach c ∈ COLORS do c.curphi ← ⊥
6       foreach basic block B′ successor of B do
7         foreach operation Phi: “B′ : a0 = φ(. . . , B : v, . . . )” in B′ do
8           if ¬colored(Phi) then continue
9           else c ← color(Phi)
10          DeCoalesce_virtual(Phi, B : v, c.curphi, c.cur_idom)
11          if colored(Phi) then c.curphi ← Phi
12    else
13      foreach operation OP at l (including φ-functions) do
14        foreach variable v defined by OP do
15          if ¬colored(v) then continue
16          else c ← color(v)
17          DeCoalesce(v, c.cur_idom)
18          if colored(v) then c.cur_idom ← v
During the whole algorithm, we treat the copies placed at a given program point
as parallel copies, which is indeed the semantics of φ-functions. This gives sev-
eral benefits: a simpler implementation, in particular for defining and updating
liveness sets, a more symmetric implementation, and fewer constraints for the
final sequentialization of the parallel copies.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
21.4 Further readings
SSA destruction was first addressed by Cytron et al. [94] who propose to simply
replace each φ-function by copies in the predecessor basic block. Although this
11 while to_do ≠ [] do
12   while ready ≠ [] do
13     b ← ready.pop()                           pick a free location
14     a ← pred(b); c ← loc(a)                   available in c
15     emit_copy(c ↦ b)                          generate the copy
16     loc(a) ← b                                now, available in b
17     if a = c and pred(a) ≠ ⊥ then ready.push(a)    just copied, can be overwritten
naive translation seems, at first sight, correct, Briggs et al. [54] pointed out subtle
errors due to parallel copies and/or critical edges in the control-flow graph.
Two typical situations are identified, namely the “lost copy problem” and the
“swap problem”. The first solution, both simple and correct, was proposed by
Sreedhar et al. [277]. They address the associated problem of coalescing and
describe three solutions. The first one consists in three steps: (a) translate SSA
into CSSA, by isolating φ-functions; (b) eliminate redundant copies; (c) eliminate
φ-functions and leave CSSA. The third solution, which turns out to be nothing else
than the first solution with a virtualized isolation of φ-functions, is shown
to introduce fewer copies. The reason for that, identified by Boissinot et al., is
that in the presence of many copies the code contains many
intersecting variables that do not actually interfere. Boissinot et al. [41] revisited
Sreedhar et al.’s approach in the light of this remark and proposed the value-
based interference described in this chapter.
The ultimate notion of interference was discussed by Chaitin et al. [65] in
the context of register allocation. They proposed a simple conservative test:
two variables interfere if one is live at a definition point of the other and this
definition is not a copy between the two variables. This interference notion is the
most commonly used, see for example how the interference graph is computed
in [12]. Still they noticed that, with this conservative interference definition, after
coalescing some variables the interference graph has to be updated or rebuilt. A
counting mechanism to update the interference graph was proposed, but it was
considered to be too space consuming. Recomputing it from time to time was
preferred [65, 64].
The value-based technique described here can also obviously be used in the
context of register allocation even if the code is not under SSA form. The notion
of value may be approximated using data-flow analysis on specific lattices [8]
and under SSA form simple global value numbering [258] can be used.
Leung and George [190] addressed SSA destruction for machine code. Register
renaming constraints, such as calling conventions or dedicated registers, are
treated with pinned variables. A simple data-flow analysis scheme is used to place
repairing copies. By revisiting this approach to address the coalescing of copies,
Rastello et al. [249] pointed out and fixed a few errors present in the original
algorithm. While being very efficient in minimizing the introduced copies, this
algorithm is quite complicated to implement and not suited for just-in-time
compilation.
The first technique to address speed and memory footprint was proposed
by Budimlić et al. [58]. It proposes the de-coalescing technique, revisited in
this chapter, that exploits the underlying tree structure of the dominance relation
between variables of the same merged-set.
Last, this chapter describes a fast sequentialization algorithm that requires
the minimum number of copies. A similar algorithm has already been proposed
by C. May [204].
CHAPTER 22
Register Allocation F. Bouchez
S. Hack
F. Rastello
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22.1 Introduction
Let us first review the basics of register allocation, to help us understand the
choices made by graph-based and linear-scan style allocators.
Register allocation is usually performed per procedure. In each procedure,
a liveness analysis (see Chapter ??) determines for each variable the program
points where the variable is alive. The set of all program points where a variable is
alive is called the live-range of the variable, and all along this live-range, storage
needs to be allocated for that variable, ideally a register. When two variables
“exist” at the same time, they are conflicting for resources, i.e., they cannot reside
in the same location.
This resource conflict of two variables is called interference and is usually
defined via liveness: two variables interfere if (and only if) there exists a program
point where they are simultaneously alive, i.e., their live-ranges intersect. It
represents the fact that those two variables cannot share the same register. For
instance, in Figure 22.1, variables a and b interfere as a is alive at the definition
of b .
There are multiple questions that arise at that point that a register allocator has
to answer:
• Are there enough registers for all my variables? (spill test)
• If yes, how do I choose which register to assign to which variable? (assign-
ment)
• If no, how do I choose which variables to spill to memory? (spilling)
Without going into the details, let us see how linear-scan and graph-based
allocators handle these questions. Figure 22.1 will be used in the next paragraphs
to illustrate how these allocators work.
Linear-scan
Fig. 22.1 Linear scan makes an over-approximation of live-ranges as intervals, while a graph-based allo-
cator creates an interference graph capturing the exact interferences. Linear scan requires 5
registers in this case, while coloring the interference graph can be done with 4 registers.
Graph-based
Comparison
Linear-scan is a very fast allocator that works directly on the procedure. In its
model, its spill test is exact and its spilling minimizes the amount of loads and stores.
However, the model itself is very imprecise, as procedures generally are not just
straight-line code but involve complex flow structures such as if-conditions
and loops. In this model, the live-ranges are artificially longer, so they produce more
interferences than there actually are.
On the other hand, a graph-based allocator has a much more precise notion
of interference. Unfortunately, graph k-coloring is known to be an NP-complete
problem, and the interference graphs of programs are arbitrary. What causes this
is precisely the complex control-flow structure, as it creates cycles in the
interference graph. This means the allocator will use heuristics to color, and will
base its spill decisions on these heuristics.
Let us look again at the example of Figure 22.1, which presents a very simple code
with an if-condition. There are two initial variables a and b that end in the
branches, and two variables x and y are defined in those branches, as well as
a variable p that is always live. Linear scan would decide it needs five registers
(spill test), as the live ranges of all variables are artificially increased: for instance,
a is marked alive all along the left branch even though it is never used there.
IRC would create the graph on the right, which presents a 4-clique (a complete
sub-graph of size 4: with variables a, b, p, and x), hence would require at least 4
colors. This simple graph would actually be easily 4-colorable with a heuristic,
hence the spill test would succeed with four registers.
Still, one could remark that at each point of the procedure, only three variables
are alive at the same time. But since x interferes with b on the left branch, and
with a on the right branch, it is impossible to use only three registers.
In conclusion, linear-scan allocators are faster, and graph coloring ones have
better results in practice, but both approaches have an inexact spill test: linear-
scan has artificial interferences, and graph coloring uses a coloring heuristic.
Moreover, both require variables to be assigned to exactly one register for all their
live range. This means both allocators will potentially spill more variables than
strictly necessary, and we will see how SSA can help with this problem.
Fig. 22.2 Splitting variable x in the previous example breaks the interference between x and a. Now only
3 registers are required. SSA introduces splitting that guarantees Maxlive registers are enough.
Graph-based
Tree-scan (linear-style)
Under SSA, the live-ranges are intervals that can “branch,” but never “join.” This
allows for a simple generalization of the linear-scan mentioned above that we
call the tree-scan, and always succeeds in coloring the tree-shaped live-ranges
with Maxlive colors. This greedy assignment scans the dominance tree, coloring
the variables from the root to the leaves in a top-down order. This means the
variables are simply colored in the order of their definitions. This works because
branches of the tree are independent, so coloring one will not add constraints
on other parts of the tree, contrary to the general non-SSA case where there are
cycles.
The pseudo-code of the tree scan is shown in Algorithm 22.1. Intuitively, when
the scanning arrives at the definition of a variable, the only colored variables
are “above” it and since there is at most Maxlive − 1 other variables live at the
definition, there is always a free color.
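A minimal Python sketch of such a tree-scan, in the spirit of Algorithm 22.1; children(b) and live_at_def(v) are assumed to come from precomputed dominance and liveness information, and all names are illustrative.

def tree_scan_color(dom_root, R, children, live_at_def):
    color = {}

    def scan(block):
        for v in block.definitions_in_order():
            # At most Maxlive - 1 <= R - 1 already-colored variables are live
            # at the definition of v, so a free color always exists.
            used = {color[w] for w in live_at_def(v) if w in color}
            color[v] = next(c for c in range(R) if c not in used)
        for child in children(block):      # branches of the tree are independent
            scan(child)

    scan(dom_root)
    return color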
Conclusion
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22.2 Spilling
We have seen previously that, under SSA, it is easy to decide in polynomial time
whether there are enough registers or not, simply by checking whether Maxlive ≤ R, the
number of registers. The goal of this section is to present algorithms that will
lower the register pressure when it is too high, i.e., when Maxlive > R , by spilling
(assigning) some variables to memory.
Spilling has a different meaning depending on the type of allocator used. For
a scan-based allocator, the spilling decision happens when we are at a particular
program point. Although it is actually a bit more complex, the idea when spilling
a variable v is that we insert a store at that point, and a load just before its
next use, hence we are spilling only a part of the live-range. On the other hand, a
graph-based allocator has no notion of program points since the interferences
have been combined in an abstract structure: the interference graph. In the
graph-coloring setting, spilling means removing a node of the interference graph
and thus the entire live-range of a variable. This is called a spill-everywhere
strategy, which implies inserting load instructions in front of every use and store
instructions after each definition of the (non-SSA) variables. These loads and
stores require temporary variables that were not present in the initial graph.
Those variables also need to be assigned to registers, which means that whenever
the spilling/coloring is done, the interference graph is rebuilt and a new pass
of allocation is triggered, until no variable is spilled anymore: this is where the
“Iterated” comes from in the IRC name. In practice, a post-pass of a graph coloring
scheme scans each basic block separately, so as to, whenever possible, keep a
reloaded variable in a register between multiple uses.
In this section, we will consider the two approaches: the graph-based ap-
proach with a spill-everywhere scheme, and scan-based approach that allows
partial live-range spilling. In both cases, we will assume that the program was
in SSA before spilling. It is important to notice that there are pros and cons
of assuming so. In particular, the inability to coalesce or move the shuffle code
associated with φ-functions can lead to spurious load and store instructions on
CFG-edges. Luckily, these can be handled by a post-pass of partial redundancy
elimination (PRE, see Chapter ??), and we will consider here the spilling phase
as a full-fledged SSA program transformation.
Suppose we have R registers; the objective is to establish Maxlive ≤ R (Maxlive
lowering) by inserting loads and stores into the program. Indeed, as stated above,
lowering Maxlive to R ensures that a register allocation with R registers can be
found in polynomial time for SSA programs. Thus, spilling should take place
before registers are assigned and yield a program in SSA form. In such a decoupled
register-allocation scheme, the spilling phase is an optimization problem for
which we define the following constraints and objective function:
• the constraints that describe the universe of possible solutions express
that the resulting code should be R-colorable;
• the objective function expresses the fact that the (weighted) amount of in-
serted loads and stores should be minimized.
The constraints directly reflect the “spill test,” which expresses whether more
spilling is necessary or not. The objective is expressed with the profitability test:
among all variables, which one is more profitable to spill? The main implication
of spilling in SSA programs is that the spill test—which amounts to checking
whether Maxlive has been lowered to R or not—becomes precise.
The other related implication of the use of SSA form follows from this observa-
tion: consider a variable such that for any program point in its entire live-range
the register pressure is at most R , then spilling this variable is useless with regard
to the colorability of the code. In other words, spilling such a variable will never
be profitable. We will call this yes-or-no criterion, enabled by the use of SSA form,
the “usefulness test.”
We will see now how to choose, among all “useful” variables (with regard to
the colorability), the ones that seem most profitable. In this regard, we present
in the next section how SSA allows us to better account for the program structure
in the spilling decision, even in graph-based allocators, thanks to the enabled
capability to decouple spilling (allocation) from coloring (assignment). However,
register allocation under SSA shines the most in a scan-based setting, and we
present guidelines to help the spill decisions in such a scheme in Section 22.2.2.
Whenever a node n is spilled (assume only useful nodes are spilled), those addi-
tional fields of the interference graph must be updated as follows:
1. if n .pressure > R , for all its incoming edges (v → n ), v.useful is decremented
by one;
2. for all its successors p such that (n → p ).high = True, p.pressure is decre-
mented by one; if, following this decrement, p.pressure ≤ R , then for all the
incoming edges (v, p ) of p , v.useful is decremented by one.
In the context of a basic block, a simple algorithm that works well is the “fur-
thest first” algorithm that is presented in Algorithm 22.2. The idea is to scan the
block from top to bottom: whenever the register pressure is too high, we
spill the variable whose next use is the furthest away, and it is spilled only up to
this next use. In the evict function of Algorithm 22.2, this corresponds to maxi-
mizing distance_to_next_use_after(p). Spilling this variable frees a register for
the longest time, hence diminishing the chances of having to spill other variables
later. This algorithm is not optimal because it does not take into account the
fact that the first time we spill a variable is more costly than subsequent spills of
the same variable (the first time, a store and a load are added, but only a load
must be added afterwards). However, the general problem is NP-complete, and
this heuristic, although it may produce more stores than necessary, gives good
results on “straight-line code,” i.e., basic blocks.
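A minimal Python sketch of this furthest-first eviction on a single basic block, in the spirit of Algorithm 22.2; the instruction accessors and distance_to_next_use_after are assumed to exist, and the placement of the stores and loads is deliberately simplified.

def spill_block_furthest_first(block, R, distance_to_next_use_after):
    in_regs = set(block.live_in_in_registers)
    actions = []
    for p, inst in enumerate(block.instructions):
        for v in inst.uses():
            if v not in in_regs:
                actions.append(("load", v, p))      # reload just before the use
                in_regs.add(v)
        for v in inst.defs():
            in_regs.add(v)
        in_regs -= set(inst.last_uses())            # values dead after this point
        while len(in_regs) > R:                     # pressure too high: evict
            victim = max(in_regs,
                         key=lambda v: distance_to_next_use_after(v, p))
            actions.append(("store", victim, p))    # spilled only up to its next use
            in_regs.remove(victim)
    return actions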
We now present an algorithm that extends this “furthest first” algorithm to
general control-flow graphs. The idea is to scan the CFG using a topological order
and greedily evict sub live-ranges whenever the register pressure is too high.
There are two main issues:
1. Generalize the priority function distance_to_next_use_after to a general
CFG;
2. Find a way to initialize the “in_regs” set when starting to scan a basic block,
in the situation where predecessor basic blocks have not been processed yet
(e.g., at the entry of a loop).
Profitability to Spill
• If the left branch is taken, consider the execution trace (A B^100 C). In this
branch, the next use of y appears in a loop, while the next use of x appears
much further away, after the loop has fully executed. It is clearly considered to be
more profitable to evict variable x (at distance 101).
• If the right branch is taken, consider the execution trace (A D). In that
case, it is variable y that has the furthest use (at distance 2), so we would evict
variable y.
Looking at the example as a whole, we see that the left branch is not under
pressure, so spilling x would only help for program point p0 , and one would need
to spill another variable in block D (x is used at the beginning of D ), hence it
would be preferable to evict variable y .
On the other hand, if we modify the example a little by assuming a high regis-
ter pressure within the loop at program point p1 (by introducing other variables),
then evicting variable x would be preferred in order to avoid a load and store in
a loop!
This dictates the following remarks:
1. Program points with low register pressure can be ignored.
2. Program points within loops, or more generally with higher execution fre-
quency, should weigh more in the computation of the “distance” than
program points with lower execution frequency.
In the first case, we would evict y; in the second, we would evict x (plus
another variable later, when arriving at p3), which is the behaviour we wanted in
the first place.
For each visited basic block B , the set of variables that must reside in a register
is stored in B.in_regs. For each basic block, the initial value of this set has to be
computed before we start processing it. The heuristic for computing this set is
different for a “regular” basic block and for a loop entry. For a regular basic block,
as we assume a topological order traversal of the CFG, all its predecessors will
have been processed. Live-in variables fall into three sets:
1. The ones that are available in all predecessor basic blocks:
   B.allpreds_in_regs = ∩_{P ∈ pred(B)} P.in_regs
2. The ones that are available in some of the predecessor basic blocks:
   B.somepreds_in_regs = ∪_{P ∈ pred(B)} P.in_regs
3 B.in_regs ← allpreds_in_regs
4 while |B.in_regs| < R and |somepreds_in_regs| > |B.in_regs| do
5   let v ∈ (somepreds_in_regs \ B.in_regs) with minimum v.spill_profitability(B.entry)
6   add v to B.in_regs
For a basic block at the entry of a loop, as illustrated by the example of Fig-
ure 22.4, one does not want to account for the allocation in the predecessor basic blocks,
but to start from scratch instead. Assume the first basic block has already been
processed and one wants to compute B.in_regs:
1. Example (a): even if at the end of the predecessor basic block, x is not avail-
able in a register, one wants to insert a reload of x at p1 , i.e., include x in
B.in_regs. Not doing so would involve a reload at every iteration of the loop
at p2 .
2. Example (b): Even if, at the entry of the loop, x is available in a register, one
wants to spill it and restore it at p2 so as to lower the register pressure that is
too high within the loop. This means excluding x from B.in_regs.
This leads to Algorithm 22.4 where B.livein represents the set of live-in vari-
ables of B and L .Maxlive is the maximal register pressure in the whole loop
L . Init_inregs first fills B.in_regs with live-in variables that are used within
the loop L. Then, we fill it with live-through variables, but only those that can
survive the loop: if L .Maxlive > R , then L .Maxlive − R variables will have to be
spilled (hopefully some live-through variables); so no more than |B.livein| −
(L .Maxlive − R ) are allocated to a register at the entry of B .
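A rough Python sketch of this loop-entry initialization; livein, maxlive, uses, and spill_profitability are illustrative names, not the exact fields of Algorithm 22.4.

def init_inregs_loop_entry(B, L, R):
    # First, live-in variables that are used within the loop L.
    used_in_loop = [v for v in B.livein if L.uses(v)]
    in_regs = set(used_in_loop[:R])

    # Then live-through variables, but leave room for the L.maxlive - R values
    # that will have to be spilled inside the loop anyway.
    budget = min(R, len(B.livein) - max(0, L.maxlive - R))
    live_through = sorted(set(B.livein) - set(used_in_loop),
                          key=lambda v: v.spill_profitability(B.entry))
    for v in live_through:
        if len(in_regs) >= budget:
            break
        in_regs.add(v)
    return in_regs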
The overall algorithm for spilling comprises several phases: First, we pre-compute
both liveness and profitability metrics; Then we traverse the CFG in topolog-
ical order, and each basic block is scanned using the initial value of B.in_regs as
explained above. During this phase, we maintain the sets of live variables avail-
able in registers and in memory at basic block boundaries. The last phase of the
algorithm handles the insertion of shuffle code (loads and stores) where needed.³
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22.3 Coloring and coalescing
We advocate here a decoupled register allocation: First, lower the register pressure
so that Maxlive ≤ R; Second, assign variables to registers. Live-range splitting
ensures that, after the first phase is done, no more spilling will be required as
R will be sufficient, possibly at the cost of inserting register-to-register copies.
This is practical for instance when working on an existing compiler that uses a
classical register allocation algorithm, such as the IRC.
For instance, a solution could be to insert a copy for every variable at every
basic block boundary, then linear scan would be able to color each basic block
independently. However, this would add a lot of copies.
Minimal SSA form (see Chapter ??) has the nice property for us that it provides
sufficient live-range splitting: the insertion of φ-functions effectively splits vari-
ables just enough so that a greedy tree-scan coloring scheme can assign variables
to registers without more spilling.
We already mentioned in Section 22.1.3 that the well-known “Iterated Register
Coalescing” (IRC) allocation scheme, which uses a simplification scheme, can
take advantage of the SSA form property. We will show here that, indeed, the
underlying structural property makes a graph coloring simplification scheme
(recalled below) an “optimal” scheme. This is especially important because,
besides minimizing the amount of spill code, the second objective of register
allocation is to perform a “good coalescing,” i.e., try to minimize the amount
of register-to-register copies: a decoupled approach is practically viable if the
coalescing phase is effective in merging most of the live-ranges, introduced by
the splitting from SSA, by assigning live-ranges linked by copies to the same
register.
In this section, we will first present the traditional graph coloring heuristic,
based on a simplification scheme, and show how it successfully colors programs
under SSA form. We will then explain in greater detail the purpose of coalescing,
and how it translates when performed on SSA form program. Finally we will show
how to extend the graph-based (from IRC) and the scan-based (of Algorithm 22.1)
greedy coloring schemes to perform efficient coalescing.
3
Again, how to do this is already discussed in Chapter ??, which deals with SSA destruction.
13 if V ≠ ∅ then
14   Failure “The graph is not simplifiable”
15 return stack
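A compact Python sketch of such a simplification scheme, close in spirit to Algorithm 22.5 and operating on a plain adjacency-set dictionary:

def simplify(graph, R):
    # Repeatedly remove a node of degree < R and push it on a stack; if only
    # nodes of degree >= R remain, the graph is not known to be R-colorable.
    degree = {v: len(neigh) for v, neigh in graph.items()}
    removed, stack = set(), []
    worklist = [v for v in graph if degree[v] < R]
    while worklist:
        v = worklist.pop()
        if v in removed:
            continue
        removed.add(v)
        stack.append(v)
        for w in graph[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] == R - 1:      # w just became simplifiable
                    worklist.append(w)
    if len(removed) != len(graph):
        raise ValueError("The graph is not simplifiable")
    return stack

The returned stack is then consumed by the coloring pass of Algorithm 22.6.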
The greedy scheme is a coloring heuristic for general graphs, and as such,
it can get stuck; it happens whenever all remaining nodes have degree at least
R . In that case, we do not know whether the graph is R -colorable or not. In
Algorithm 22.6: Greedy coloring assignment used in the greedy coloring
scheme. The stack is the output of the Simplify function described in Algo-
rithm 22.5.
1 Function Assign_colors(G)
2   available ← new array of size R, with all values initialized to True
3   while stack ≠ ∅ do
4     v ← pop stack
5     foreach neighbor w of v in G do
6       available[color(w)] ← False      color already used by a neighbor
7     let col be a color such that available[col] = True
8     color(v) ← col
9     add v back to G
10    foreach neighbor w of v in G do
11      available[color(w)] ← True       reset for the next round
traditional register allocation, this is the trigger for spilling some variables so as
to get the simplification process unstuck. However, under the SSA form, if spilling
has already been done so that the maximum register pressure is at most R , the
greedy coloring scheme can never get stuck! We will not formally prove this fact
here but will nevertheless try to give insight as to why this is true.
The key to understanding that property is to picture the dominance tree, with
live-ranges as sub-trees of this tree, such as the one in Figure 22.2b. At the end
of each dangling branch there is a “leaf” variable: the one that is defined last
in this branch. These are the variables y1 , y2 , y3 and x3 on Figure 22.2b. We can
visually see that this variable will not have many intersecting variables: those are
the variables alive at its definition point, i.e., no more than Maxlive − 1, hence
at most R − 1. In Figure 22.2b, with Maxlive = 3, we see that each of them has
no more than two neighbors.
Considering again the greedy scheme, this means each of them is a candidate
for simplification. Once removed, another variable will become the new leaf
of that particular branch (e.g., x1 if y1 is simplified). This means simplification
can always happen at the end of the branches of the dominance tree, and the
simplification process can progress upwards until the whole tree is simplified.
In terms of graph theory, the general problem is knowing whether a graph is
k-colorable or not. Here, we can define a new class of graphs that contains the
graphs colorable with this simplification scheme.
Definition 7. A graph is greedy-k -colorable if it can be simplified using the sim-
plify function of Algorithm 22.5.
We then have the following theorem:
Theorem 1. Setting k = Maxlive, the interference graph of a code under SSA form
is always greedy-k -colorable.
This tells us that, if we are under SSA form and the spilling has already been
done so that Maxlive ≤ R, the classical greedy coloring scheme is guaranteed to
perform register allocation with R colors without any additional spilling, as the
graph is greedy-R-colorable.
Coalescing comes with several flavors, which can be either aggressive or con-
servative. Aggressively coalescing an interference graph means coalescing non-
interfering nodes (i.e., constraining the coloring) regardless of the chromatic
number of the resulting graph. An aggressive coalescing scheme is presented in
Chapter ??. Conservatively coalescing an interference graph means coalescing
non-interfering nodes without increasing the chromatic number of the graph. In
both cases, the objective function is the maximization of satisfied affinities, i.e.,
the maximization of the number of (weighted) affinities between nodes that have
been coalesced together. In the current context, we will focus on the conservative
scheme, as we do not want more spilling.
Obviously, because of the reducibility to graph-k -coloring, both coalescing
problems are NP-complete. However, graph coloring heuristics such as the Iter-
ated Register Coalescing use incremental coalescing schemes where affinities are
considered one after another. Incrementally, for two nodes linked by an affinity,
the heuristic will try to determine whether coalescing those two nodes will, with
regard to the coloring heuristic, increase the chromatic number of the graph or
not. If not, then the two corresponding nodes are (conservatively) coalesced. The
IRC considers two conservative coalescing rules that we recall here. Nodes with
degree strictly less than R are called low-degree nodes (those are simplifiable),
while others are called high-degree nodes.
Briggs merges u and v if the resulting node has fewer than R neighbors of high
degree. This node can always be simplified after its low-degree
neighbors are simplified; thus the graph remains greedy-R-colorable.
George merges u and v if all neighbors of u with high degree are also neighbors
of v . After coalescing and once all low-degree neighbors are simplified, one
gets a subgraph of the original graph, thus greedy-R -colorable too.
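Both rules are straightforward to express on an adjacency-set representation of the interference graph; the sketch below is illustrative and assumes degree counts interference edges only.

def briggs_test(u, v, adj, degree, R):
    # Briggs: the merged node must have fewer than R neighbors of high degree.
    merged_neighbors = (adj[u] | adj[v]) - {u, v}
    high_degree = [w for w in merged_neighbors if degree[w] >= R]
    return len(high_degree) < R

def george_test(u, v, adj, degree, R):
    # George: every high-degree neighbor of u must already be a neighbor of v.
    return all(degree[w] < R or w in adj[v] for w in adj[u] - {v})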
Algorithm 22.7: Pruned version of the Iterated Register Coalescing, which does
not handle spilling but only coloring plus coalescing. It can be used instead
of the Simplify function described in Algorithm 22.5, as it also produces a
stack usable by the Assign_colors function of Algorithm 22.6.
Input: Undirected greedy-R-colorable graph G = (V, I)
Data: R: number of colors
Data: A: the affinity edges of G
Data: For all v, degree[v] denotes the number of interfering neighbours of v in G
Data: For all v, affinities[v] denotes {(v, w) ∈ A}
1 Function Simplify_and_Coalesce(G)
2   stack ← ∅
3   simplifiable ← {v ∈ V | degree[v] < R and affinities[v] = ∅}
4   while V ≠ ∅ do
5     while simplifiable ≠ ∅ do
6       let v ∈ simplifiable
7       push v on stack
8       remove v from G: update V, I, and simplifiable sets accordingly
9     let a = (v, u) ∈ A of highest weight
       verify there is no interference (can happen after other merges)
10    if a exists and a ∉ I and can_be_coalesced(a) then
11      merge u and v into uv
12      update V, I, and simplifiable sets accordingly
13    remove a from A
14  return stack
Originally, those rules were used for any graph, not necessarily greedy-R -
colorable, and with an additional clique of pre-colored nodes—the physical
machine registers. With such general graphs, some restrictions on the applica-
bility of those two rules had to be applied when one of the two nodes was a
pre-colored one. But in the context of greedy-R -colorable graphs, we do not need
such restrictions.
However, in practice, those two rules give insufficient results to coalesce the
many moves introduced, for example, by a basic out-of-SSA conversion. The
main reason is that the decision is too local: it depends on the degree of
neighbors only. But these neighbors may have a high degree just because their
neighbors are not simplified yet, i.e., the coalescing test may be applied too early
in the simplify phase.
This is the reason why the IRC actually iterates: instead of giving up coalescing
when the test fails, the affinity is “frozen,” i.e., placed in a sleeping list and “awak-
ened” when the degree of one of the nodes implied in the rule changes. Thus,
affinities are in general tested several times, and move-related nodes—nodes
linked by affinities with other nodes—should not be simplified too early to ensure
the affinities get tested.
The advocated scheme, which corresponds to the pseudo-code of Figure ??
and is depicted in Figure 22.5, tries the coalescing of a given affinity only once,
and thus does not require any complex freezing mechanism as done in the
original IRC. This is made possible thanks to the following enhancement of the
conservative coalescing rule: Recall that the objective of the Briggs and George rules
is to test whether coalescing v and u breaks the greedy-R-colorable property
of G or not. Testing this property can be done by running function Simplify(G)
itself! Theoretically, this greatly increases the complexity, as for each affinity a
full simplification process could potentially be performed. However, experience
shows that the overhead is somewhat balanced by the large reduction in the number of
calls to “can_be_coalesced.” This approach still looks quite rough, hence we
named it the Brute coalescing heuristic. While this is more costly than only using
the Briggs and George rules, adding this “brute” rule improves the quality of the
result, in terms of removed move instructions.
Fig. 22.5 Combined coalescing and coloring simplification scheme including Brute-force rule.
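A sketch of the brute rule, reusing the simplify sketch shown with Algorithm 22.5: the merge of u and v is simulated on a copy of the interference graph and accepted only if the merged graph still simplifies completely with R colors.

def can_be_coalesced_brute(u, v, adj, R):
    uv = ("merged", u, v)
    merged = {}
    for x, neighbors in adj.items():
        if x in (u, v):
            continue
        merged[x] = {uv if y in (u, v) else y for y in neighbors}
    merged[uv] = (adj[u] | adj[v]) - {u, v}
    try:
        simplify(merged, R)     # greedy-R-colorability check
        return True
    except ValueError:
        return False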
When assigning a color to a variable, the biased coloring strategy first checks whether the
color already chosen for its equivalence class
is available, and picks it if it is. If not, it chooses a different color (based on the
other heuristics presented here) and updates the color of the class.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22.4 Practical Discussions and Further Readings
The number of papers on register allocation is humongous. The most popular
approach is based on graph coloring, including many extensions [76, 229, 197]
of the seminal paper of Chaitin et al. [63]. The Iterated Register Coalescing, men-
tioned several times in this chapter, is from George et al. [129]. The linear-scan
approach is also extremely popular in particular in the context of just-in-time
compilation. This elegant idea goes back to Traub et al. [296] and Poletto and
Sarkar [239]. While this original version of linear-scan contains lots of hidden sub-
tleties (e.g., computation of live-ranges, handling of shuffle code in the presence
of critical edges, etc.), its memory footprint is smaller than that of a graph coloring ap-
proach, and its practical complexity is smaller than that of the IRC. However, mostly due
to the highly over-approximated live-ranges, its apparent simplicity comes with
poor-quality resulting code. This led to the development of interesting but more
complex extensions such as the works of Wimmer and Sarkar [313, 262]. All those
approaches are clearly subsumed by the tree-scan approach [83], both in terms
of simplicity, complexity, and quality of result. On the other hand, the linear-scan
extension proposed by Barik in his thesis [21] is an interesting competitor, its
footprint being, possibly, more compact than the one used by a tree-scan.
There exists an elegant relationship between tree-scan coloring and graph
coloring: Back in 1974, Gavril [127] showed that the intersection graphs of sub-
trees are the chordal graphs. By providing an elimination scheme that exposes
the underlying tree structure, this relationship allowed to prove that chordal
graphs can be optimally colored in linear time with respect to the number of
edges in the graph. This is the rediscovering of this relationship in the context
of live-ranges of variables for SSA-form programs that motivated different re-
search groups [142, 56, 230, 45] to revisit register allocation in the light of this
interesting property. Indeed, at that time, most register allocation schemes were
incompletely assuming that the assignment part was hard by referring to the
NP-completeness reduction of Chaitin et al. to graph coloring. However, the
observation that SSA-based live-range splitting allows to decouple the alloca-
tion and assignment phases was not new [92, 117]. Back in the nineties, the
LaTTe [320] just-in-time compiler already implemented the ancestor of our tree-
scan allocator. The most aggressive live-range splitting that was proposed by
Appel and George [12] allowed to stress the actual challenge that past approaches
were facing when splitting live-range to help coloring, which is coalescing [46].
The PhD theses of Hack [141], Bouchez [43], and Colombet [82] address the diffi-
cult challenge of making a neat idea applicable to real life but without trading the
elegant simplicity of the original approach. For some more exhaustive related
work references, we refer to the bibliography of those documents.
As done in (too) many register allocation papers, the heuristics described in this
chapter assume a simple non-realistic architecture where all variables or registers
are equivalent and where the instruction set architecture does not impose any
specific constraint on register usage. Reality is different, including: 1. register
constraints such as 2-address mode instructions that impose two of the three
operands to use the same register, or instructions that impose the use of specific
registers; 2. registers of various sizes (vector registers usually leading to register
aliasing), historically known as the register pairing problem; 3. instruction operands
that cannot reside in memory. Finally, SSA-based approaches (but also any scheme that
relies on live-range splitting) must deal with critical edges possibly considered
abnormal (i.e., that cannot be split) by the compiler.
In the context of graph coloring, register constraints are usually handled by
adding an artificial clique of precolored nodes in the graph and splitting live-
ranges around instructions with precolored operands. This approach has several
disadvantages. First, it substantially increases the number of variables; Second, it
makes coalescing much harder. This motivated Colombet et al. [83] to introduce
the notion of antipathies (affinities with negative weight) and extend the coa-
lescing rules accordingly. The general idea is, instead of enforcing architectural
constraints, to simply express the cost (through affinities and antipathies) of
shuffle code inserted by a post-pass repairing. In a scan-based context, handling
of register constraints is usually done locally [211, 262]. The biased coloring strat-
egy used in this chapter is proposed by Braun et al. and Colombet et al. [52, 83]
and reduces the need for shuffle code.
In the context of graph-based heuristics, vector registers are usually handled
through a generalized graph coloring approach [272, 291]. In the context of scan-
based heuristics, the puzzle solver [231] is an elegant formulation that allows
expressing the local constraints as a puzzle.
One of the main problems of the graph-based approach is its underlying
assumption that the live-range of a variable is atomic. But when spilled, not all
instructions can access it through a memory operand. In other words, spilling has
the effect of removing the majority of the live-range, but leaving “chads” which
correspond to shuffle code around instructions that use the spilled variable.
These replace the removed node in the interference graph, which is usually
re-built at some point. Those subtleties and associated complexity issues are
exhaustively studied by Bouchez et al. [47].
As already mentioned, the semantics of φ-functions corresponds to parallel
copies on the incoming edges of the basic block where they textually appear.
When lowering φ-functions, the register for the def operand may not match the
register of the use operand. If done naively, this imposes the following:
1. splitting of the corresponding control-flow edge;
2. insertion of copies in this freshly created basic block;
3. use of spill code in case the parallel copy requires a temporary register but
none is available.
This issue shares similarities with the SSA destruction that was described in
Chapter 21. The advocated approach is, just as for the register constraints, to
express the cost of an afterward repairing in the objective function of the register
allocation scheme. Then, locally, the repairing can be expressed as a standard
graph coloring problem on a very small graph containing the variables created
by φ-node isolation (see Figure 21.1). However, it turns out that most of the
associated shuffle code can usually be moved (and even annihilated) to and
within the surrounding basic blocks. Such post-pass optimizations correspond
to the optimistic move insertion of Braun et al. [52] or the parallel copy motion
of Bouchez et al. [44].
The scan-based spilling heuristic described in this chapter is inspired by the
heuristic developed by Braun and Hack [51]. It is an extension of Belady’s al-
gorithm [26], which was originally designed for page eviction, but can easily
be shown to be optimal for interval graphs (straight-line code and single-use
variables). For straight-line code with multiple uses per variable, Farrach and
Liberatore [120] showed that, while this furthest-first strategy works reasonably well, the problem
can also be formalized as a flow problem. For a deeper discussion about the com-
plexity of the spilling problem under different configurations, we refer to the work
of Bouchez et al. [47]. To conclude on the spilling part of the register allocation
problem, we need to mention the important problem of load/store placement.
As implicitly done by our heuristic given in Algorithm 22.4, one should, whenever
possible, hoist shuffle code outside of loops. This problem corresponds to the
global code-motion addressed in Chapter 11, which should ideally be coupled
with the register allocation problem. Another related problem, not mentioned in
this chapter, is rematerialization [55]. Experience shows that: 1. Rematerialization
is one of the main sources of performance improvement for register allocation;
2. In the vast majority of cases, rematerialization simply amounts to reschedul-
ing some of the instructions. This remark highlights one of the major
limitations of the presented approach, common to almost all papers in the area of
register allocation: while scheduling and register allocation are clearly highly cou-
pled problems, all those approaches only consider a fixed schedule, and only a few
papers try to address the coupled problem [218, 235, 309, 212, 31, 81, 295, 251].
For a long time, static single assignment was not adopted by compiler designers.
One reason is the numerous copy instructions inserted by SSA destruc-
tion, that the compiler could not get rid of afterward. The coalescing heuristics
were not effective enough in removing all such copies, so even if it was clear that
live-range splitting was useful for improving colorability, i.e., avoiding spill code,
CHAPTER 23
Hardware Compilation using SSA P. C. Diniz
P. Brisk
This chapter describes the use of SSA-based high-level program representations for the realization of the corresponding computations as hardware digital circuits. We begin by highlighting, with an illustrative example, the benefits of using an SSA-based compiler intermediate representation in this hardware mapping process. The subsequent sections describe hardware translation schemes for discrete hardware logic structures or data-paths of hardware circuits, and outline several compiler transformations that benefit from SSA. We conclude with a brief survey of hardware compilation efforts, from both academia and industry, that have adopted SSA-based internal representations.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.1 Brief history and overview
Hardware compilation is the process by which a high-level language, or behavioral, description of a computation is translated into a hardware-based implementation, i.e., a circuit expressed in a hardware design language such as VHDL [15] or Verilog [292], which can be directly realized as an electrical (often digital) circuit.
Hardware-oriented languages such as VHDL or Verilog allow programmers to develop such digital circuits either by structural composition of blocks, using abstractions such as wires and ports, or behaviorally, by defining the input-output relations of the signals in these blocks. A mix of both design approaches is often found in medium to large designs. Using a structural approach, a circuit description will typically include discrete elements such as registers (flip-flops), which capture the state of the computation at specific events such as clock edges, and combinatorial elements that transform the values carried by wires.
[Figure: data-path generated for the VHDL assignment X <= (A*B)-(C*D)+F, where variable X is declared as std_logic_vector(0..7): two multipliers feed a subtractor and an adder, with multiplexers and registers capturing the intermediate and final values.]
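As a rough software restatement of the data-path in the figure above (a sketch for intuition only, not a synthesizable description), one evaluation of the assignment computes:

    def datapath_cycle(A, B, C, D, F):
        # Mirrors the figure's structure: two multipliers feed a subtractor,
        # whose result is added to F before being captured in the X register.
        m1 = A * B          # first multiplier
        m2 = C * D          # second multiplier
        s = m1 - m2         # subtractor
        return s + F        # adder output, latched into X at the clock edge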
Nevertheless, the emergence of multi-core processing has led to the introduction of new parallel programming languages and parallel programming constructs that are more amenable to hardware compilation than traditional languages. For example, MapReduce [106], originally introduced by Google to spread parallel jobs across clusters of servers, has been an effective programming model for FPGAs, as it naturally exposes task-level concurrency with data independence. Similarly, high-level languages based on parallel models of computation, such as synchronous data flow, or functional single-assignment languages, have also been shown to be good choices for hardware compilation: not only do they make data independence obvious, but in many cases the natural data partitioning they expose is a natural match for the spatial concurrency of FPGAs.
Although in the remainder of this chapter we focus primarily on the use of SSA representations for hardware compilation of imperative high-level programming languages, many of the emerging parallel languages, while including sequential constructs (such as control-flow graphs), also support truly concurrent constructs. These languages can be a natural fit for exploiting the spatial and customization opportunities of FPGA-based computing architectures. While the extension of SSA form to these emerging languages is an open area of research, the fundamental uses of SSA for hardware compilation, as discussed in this chapter, are likely to remain a solid foundation for the mapping of these parallel constructs to hardware.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.2 Why use SSA for hardware compilation?
[Figure: a conditional assignment shown without SSA (t ← a on the then-branch, t ← b on the else-branch, under predicate p) and in SSA form with t3 ← φ(t1, t2); in the hardware realization the φ-function becomes a multiplexer, driven by p, that selects between t1 and t2, and its output t3 feeds the subsequent addition producing v.]
While the Electronic Design Automation (EDA) community has for decades exploited similar information regarding data and control dependences for the generation of hardware circuits from increasingly higher-level representations (e.g., behavioral HDL), SSA-based representations make these dependences explicit in the intermediate representation itself. Classical compiler representations, using three-address instructions augmented with def-use chains, also expose the data-flow information available in an SSA-based representation. The latter, however, as we explore in the next section, facilitates the mapping and selection of hardware resources.
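In other words, once the controlling predicate is explicit, each φ-function maps directly onto a two-input multiplexer. A minimal software analogue, using the names of the figure above (a sketch, not a hardware description):

    def merged_value(p, a, b):
        t1 = a                 # value produced by the "then" path
        t2 = b                 # value produced by the "else" path
        t3 = t1 if p else t2   # t3 <- phi(t1, t2), a 2-to-1 multiplexer on p
        return t3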
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
23.3 Mapping a control-flow graph to hardware
[Figure: example CFG in which BB0 defines x1 (and y1), BB1 and BB2 both use x1, BB2 additionally defines x2 and uses y1, and BB3 computes x3 ← φ(x1, x2) and uses y1; the accompanying panels show the corresponding hardware circuits with storage and multiplexers.]
Fig. 23.4 Mapping of variable values across a hardware circuit using spatial mapping: (a) CFG; (b) naive multiplexer placement using liveness; (c) multiplexer placement using φ-functions
the logic circuits that implement the predicates associated with these nodes. If a hardware circuit that evaluates the predicate corresponding to a given region already exists, the implementation can simply reuse its output signal. This lazy code generation and predicate composition achieves the goal of hardware circuit sharing, as illustrated by the example in Figure 23.6, where some of the details have been omitted for simplicity. When using the PDG representation, however, care must be taken regarding the potential lack of referential transparency. To this end, it is often desirable to combine the SSA information with the PDG's regions to ensure correct reuse of the hardware that evaluates the predicates associated with each control-dependence region.
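A hedged sketch of this lazy sharing: each region is keyed by its (composed) predicate expression and the corresponding circuit is synthesized at most once. The cache, the predicate keys, and the synthesize helper below are illustrative placeholders, not part of any particular compiler discussed here.

    predicate_circuits = {}    # predicate expression -> output signal of its circuit

    def circuit_for(pred, synthesize):
        # Return the signal that evaluates `pred`, building the circuit only the
        # first time some control-dependence region requires this predicate.
        if pred not in predicate_circuits:
            predicate_circuits[pred] = synthesize(pred)
        return predicate_circuits[pred]

    # For the nested conditionals of Figure 23.6, the inner regions would reuse
    # the circuit for P(x, y) when composing predicates such as "P and Q".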
[Figure: source code with a conditionally redefined variable x (if x > 0 then x ← ... else x ← ...; v ← x + ...), its Gated-SSA form in which x1 and x2 are merged under the gate on x0, and the corresponding hardware circuit using spatial mapping.]
Fig. 23.5 Hardware generation example using Gated-SSA form
[Figure: nested conditional code guarded by predicates P(x, y) and Q(x, y), the corresponding PDG skeleton (region nodes omitted), and the hardware implementation in which the region predicates drive the multiplexers.]
Fig. 23.6 Use of the predicates in region nodes of the PDG for mapping into the multiplexers associated with each φ-function
The basic flavor of these transformations is to push the addition operators toward the outputs of a data-flow graph, so that they can be merged at the bottom. Examples of such transformations involving multiplexers are depicted in Figure 23.7(a,b). In the case of Figure 23.7(a), the transformation means that an addition is always executed, unlike in the original hardware design. This can lead to more predictable timing or more uniform power draw signatures.7 Figure 23.7(c) depicts a similar transformation that merges two multiplexers sharing a common input, while exploiting the commutativity of the addition operator. The SSA-based representation facilitates these transformations, as it explicitly indicates (by tracing backwards in the representation) which values are involved in the computation of the corresponding results. For the example in Figure 23.7(b), a compiler can quickly detect that the variable a is common to the two expressions associated with the φ-function.
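To make the flavor of these rewrites concrete, the following is a hedged software analogue of pushing an adder below a multiplexer so that a single adder is always exercised; the constant 0 plays the role of the "0" input visible in Figure 23.7, but the exact circuits of the figure may differ from this sketch.

    def before(p, a, b, c):
        # the adder sits above the multiplexer and is used on only one side
        return (a + b) if p else c

    def after(p, a, b, c):
        # multiplexers moved to the adder inputs: one adder, always executed
        return (a if p else c) + (b if p else 0)

    assert before(True, 1, 2, 3) == after(True, 1, 2, 3)
    assert before(False, 1, 2, 3) == after(False, 1, 2, 3)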
A second transformation that can be applied to multiplexers is specific to FPGAs, whose basic building block consists of a k-input lookup table (LUT) logic
7. An important issue in security-related aspects of the execution.
[Figure: three multiplexer/adder rewrites on small data-flow graphs, each shown before (left) and after (right) the transformation.]
Fig. 23.7 Multiplexer-operator transformations: juxtaposition of a multiplexer and an adder (a, b); reducing the number of multiplexers placed on the input of an adder (c)
8. A typical k-input LUT will include an arbitrary combinatorial functional block of those k inputs followed by an optional register element.
For spatially oriented hardware circuits, moving a φ-function from one basic block to another can alter the length of the wires that are required to transmit data between the hardware circuits corresponding to the various basic blocks. As the boundaries of basic blocks are natural synchronization points, where values are captured in hardware registers, the length of the wires dictates the maximum allowed hardware clock rate for synchronous designs. We illustrate this effect with the example depicted in Figure 23.8. In this figure each basic block is mapped to a distinct hardware unit, whose spatial implementation is approximated by a rectangle. A floor-planning algorithm must place each of the units in a two-dimensional plane while ensuring that no two units overlap. As can be seen in Figure 23.8(a), placing block 5 on the right-hand side of the plane results in several mid-range and one long-range wire connection. However, placing block 5 at the center of the design virtually eliminates all mid-range connections, as all connections corresponding to the transmission of the values of variable x are now nearest-neighbor connections.
Fig. 23.8 Example of the impact of φ-function movement in reducing hardware wire length
Further Readings
CHAPTER 24
Building SSA in a Compiler for PHP
P. Biggar
D. Gregg
Dynamic scripting languages such as PHP, Python and Javascript are among the
most widely-used programming languages.
Dynamic scripting languages provide flexible high-level features, a fast modify-
compile-test environment for rapid prototyping, strong integration with popular
strongly-typed programming languages, and an extensive standard library. Dy-
namic scripting languages are widely used in the implementation of many of
the best known web applications of the last decade such as Facebook, Wikipedia
and Gmail. Most web browsers support scripting languages such as Javascript for
client-side applications that run within the browser. Languages such as PHP and
Ruby are popular for server-side web pages that generate content dynamically
and provide close integration with back-end databases.
One of the most widely used dynamic scripting languages is PHP, a general-
purpose language that was originally designed for creating dynamic web pages.
PHP has many of the features that are typical of dynamic scripting languages.
These include a simple syntax, dynamic typing and the ability to dynamically
generate new source code during run time, and execute that code. These simple,
flexible features facilitate rapid prototyping, exploratory programming, and in
the case of many non-professional websites, a copy-paste-and-modify approach
to scripting by non-expert programmers.
Constructing SSA form for languages such as C/C++ and Java is a well-studied
problem. Techniques exist to handle the most common features of static lan-
guages, and these solutions have been tried and tested in production level com-
pilers over many years. In these static languages it is not difficult to identify a set
of scalar variables that can be safely renamed. Better analysis may lead to more
such variables being identified, but significant numbers of such variables can be
found with very simple analysis.
In our study of optimizing dynamic scripting languages, specifically PHP, we
find this is not the case. The information required to build SSA form — that is,
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24.1 SSA form in statically-typed languages
variables which are used and defined is immediately obvious from a simple syntactic check.
1: int func() {
2:   int x = 5, *y;
3:   y = &x;
4:   *y = 7;
5:   return x;
6: }
Fig. 24.2 C code in which aliasing is introduced by taking the address of a local variable
1: $x = 5;
2: $y =& $x;
3: $y = 7;
Fig. 24.3 PHP code in which $y becomes a reference to $x
The most common appearance of aliasing in PHP is due to variables that store references. Creating references in PHP does not look very different from C. In Figure 24.3, the variable y becomes a reference to the variable x. Once y has become an alias for x, the assignment to y (in line 3) also changes the value of x.
At first glance, the PHP code in Figure 24.3 is not very different from the similar C code in Figure 24.2. From the syntax it is easy to see that a reference to the variable x is taken. Thus, it is clear that x cannot be easily renamed. However,
the problem is actually with the variable y which contains the reference to the
variable x .
There is no type declaration to say that y in Figure 24.3 is a reference. In fact,
due to dynamic typing, PHP variables may be references at one point in the
program, and stop being references a moment later. Or at a given point in a
program a given variable may be a reference variable or a non-reference variable
depending upon the control flow that preceded that point. PHP’s dynamic typing
makes it difficult to simply identify when this occurs, and a sophisticated analysis
over a larger region of code is essential to building a more precise conservative
SSA form.
Fig. 24.4 Similar (a) PHP and (b) C functions with parameters
The size of this larger region of code is heavily influenced by the semantics
of function parameters in PHP. Consider the PHP function in Figure 24.4(a).
Superficially, it resembles the C function in Figure 24.4(b). In the C version,
we know that x and y are simple integer variables and that no pointer aliasing
relationship between them is possible. They are separate variables that can
be safely renamed. In fact, a relatively simple analysis can show that the assignment to x in line 2 of the C code can be optimized away because it is an
assignment to a dead variable.
In the PHP version in Figure 24.4(a), x may alias y upon function entry. This
can happen if x is a reference to y, or vice versa, or if both x and y are references
to a third variable. It is important to note, however, that the possibility of such
aliasing is not apparent from the function prototype or any type declarations.
Instead, whether the formal parameters x and/or y are references depends on
the types of actual parameters that are passed when the function is invoked. If
a reference is passed as a parameter to a function in PHP, the corresponding
formal parameter in the function also becomes a reference.
The addition operation in line 2 of Figure 24.4(a) may therefore change the
value of y, if x is a reference to y or vice versa. In addition, recall that dynamic
typing in PHP means that whether or not a variable contains a reference can
depend on control flow leading to different assignments. Therefore, on some
executions of a function the passed parameters may be references, whereas on
other executions they may not.
In the PHP version, there are no syntactic clues that variables may alias. Fur-
thermore, as we show in section 24.4, there are additional features of PHP that
can cause the values of variables to be changed without simple syntactic clues.
In order to be sure that no such features can affect a given variable, an analysis is
needed to detect such features. As a result, a simple conservative aliasing esti-
mate that does not take account of PHP’s references and other difficult features
— similar to C’s address-taken alias analysis — would need to place all variables
in the alias set. This would leave no variables available for conversion to SSA
form. Instead an interprocedural analysis is needed to track references between
functions.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24.3 Our whole-program analysis
PHP’s dynamic typing means that program analysis cannot be performed a
function at a time. As function signatures do not indicate whether parameters
are references, this information must be determined by inter-procedural analysis.
Furthermore, each function must be analyzed with full knowledge of its calling
context. This requires a whole-program analysis. We present an overview of the
analysis below. A full description is beyond the scope of this chapter.
Our analysis computes and stores three different kinds of results. Each kind of
result is stored at each point in the program.
The first models the alias relationships in the program in a points-to graph. The
graph contains variable names as nodes, and the edges between them indicate
aliasing relationships. An aliasing relationship indicates that two variables either
must-alias, or may-alias. Two unconnected nodes cannot alias. A points-to graph
is stored for each point in the program. Graphs are merged at CFG join points.
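A minimal sketch of merging such per-point graphs at a CFG join, assuming a simple mapping from each variable to the set of variables it may alias; this illustrates the conservative union rather than the actual data structures of the analysis.

    def merge_points_to(g1, g2):
        # At a join point, the merged graph contains every aliasing edge present
        # in either predecessor; an edge coming from only one side is at best a
        # may-alias edge after the merge.
        merged = {}
        for v in set(g1) | set(g2):
            merged[v] = g1.get(v, set()) | g2.get(v, set())
        return merged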
Secondly, our analysis also computes a conservative estimate of the types of
variables in the program. Since PHP is an object-oriented language, polymorphic
method calls are possible, and they must be analyzed. As such, the set of possible
1. The analysis is actually based on a variation, conditional constant propagation.
types of each variable is stored at each point in the program. This portion of the
analysis closely resembles using SCCP for type propagation.
Finally, like the SCCP algorithm, constants are identified and propagated
through the analysis of the program. Where possible, the algorithm resolves
branches statically using propagated constant values. This is particularly valuable
because our PHC ahead-of-time compiler for PHP creates many branches in the
intermediate representation during early stages of compilation. Resolving these
branches statically eliminates unreachable paths, leading to significantly more
precise results from the analysis algorithm.
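As a rough illustration of the constant-propagation component (a generic SCCP-style sketch, not the compiler's actual code), lattice values are merged at join points with a meet such as the following; a branch whose condition meets to a constant can then be resolved statically.

    TOP, BOTTOM = object(), object()   # "no information yet" and "not a constant"

    def meet(a, b):
        # Merge two lattice values at a CFG join point.
        if a is TOP:
            return b
        if b is TOP:
            return a
        return a if a == b else BOTTOM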
To build SSA form we need to be able to identify the set of points in a program
where a given variable is defined or used. Since we cannot easily identify these
sets due to potential aliasing, we build them as part of our program analysis. Using
our alias analysis, any variables which may be written to or read from during a statement's execution are added to a set of defs and uses for that statement.
These are then used during construction of the SSA form.
For an assignment by copy, $x = $y:
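As a loose, hypothetical illustration (the variable names and alias facts below are made up, and this is not the rule set used by the analysis), the sets recorded for such a copy might look like:

    # Hypothetical def/use sets for `$x = $y`, assuming the alias analysis
    # reports that $x may alias another variable $a at this program point:
    must_defs = {"x"}    # x is certainly overwritten by the copy
    may_defs  = {"a"}    # a is overwritten only if the may-alias with x holds
    uses      = {"y"}    # y is read; variables that may alias y may also be read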
24.3.4 HSSA
Once the set of locations where each variable is defined or used has been identi-
fied, we have the information needed to construct SSA. However, it is important
to note that due to potential aliasing and potential side effects of some difficult-
to-analyze PHP features (see section 24.4), many of the definitions we compute
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24.4 Other challenging PHP features
For simplicity, the description so far of our algorithm has only considered the problems arising from aliasing due to PHP reference variables. However, in the process of constructing SSA in our PHP compiler, we encountered several other PHP language features that make it difficult to identify all the points in a program where a variable
2. Or more precisely, a may definition means that there exists at least one possible execution of the program where the variable is defined at that point. Our algorithm computes a conservative approximation of may-definition information. Therefore, our algorithm reports a may definition in any case where the algorithm cannot prove that no such definition can exist on any possible execution.
may be defined. In this section we briefly describe these language features and
how they may be dealt with in order to conservatively identify all may-definitions
of all variables in the program.
1: $x = 5;
2: $var_name = readline( );
3: $$var_name = 6;
4: print $x;
Further Readings
The HSSA algorithm is described further in Chapter 16. Array SSA form — as
described in Chapter 17 — was also considered for our algorithm. While we
deemed HSSA a better match for the problem space, we believe we could also
have adapted Array SSA for this purpose.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Index
Page numbers are underlined in the index when they represent the definition
or the main source of information about whatever is being indexed. A page
number is given in italics when that page contains an instructive example, use,
or discussion of the concept in question.
γ-function, 8 analysis
φ reduction, reduction, 236 null pointer, 159, 161
φ removal, reduction, 236 range, 150, 157, 160
φentry -function, 175 analysis, demand-driven, 178
φentry -node, 113 analysis, symbolic, 177
φexit -function, 175 application binary interface, 252
φexit -node, 114 available, 127
φif -function, 8, 175, 235, 237 available, can be, 127
ψ-SSA form, 239
ψ-T-SSA, 192, 193 back edge, 54, 55
ψ-permutation, 190 back-edge, 100
ψ-projection, 189, 241 back-end, compiler, 12, 186
ψ-promotion, 190 backward analysis, 95, 97, 156
ψ-reduction, 189 basic-block early, 262
Φ-function, 123 basic-block exit, 257
φ-function, 7 basic-block late, 262
φ-function, as multiplexer, 7, 8, 97, bitsets, 104
224 block
φ-function, lowering, 250 sinking, 79
φ-function, parallel semantic of, 8 branch coalescing, 244
φ-function operands, non-split, 254 branch instruction problem, 252, 253
φ-function operands, split, 254
φ-web, 21, 36, 250, 254 C-SSA, 22, 191
φ-web, discovery, 256 calling conventions, 219, 250
φ-web, discovery., 36 CFG merge, 229
ψ-function, 186, 240 chain of recurrences, 115
ψ-web, 193–195 chaotic iteration, 157
σ-function, 154, 165, 224 chordal graph, 19
circuits, spatial combination of, 274
ψ-inlining, 189 circuits, temporal combination of,
274
affinity, 256 class inference, 158, 160
aggressive coalescing, 37 cmov, 223
Alias analysis, 287, 288 cmov, instruction, 223
α-renaming, 70 coalescing, 193, 239
live range, as a dominator sub-tree, parallel copy, 36, 154, 222, 224
261 parallel copy, sequentialization of,
live-in, 97, 98 38, 263
live-out, 97 parallel execution, 240
live-range, 18, 21, 28, 96, 97, 196 parallelisation, 183
live-range splitting, 28, 33, 59 parallelism, 233
liveness, 19, 95, 96, 194, 219 parallelization, 172, 174
φ-function, 97, 257, 262 parameter dropping, 80
liveness check, 104, 196, 259 partial product reduction tree, 278
load, dismissible, 225, 236 partial redundancy, 122
load, speculative, 236 partial redundancy elimination, 189
local variable, 39 partitioned lattice per variable, 152
lookup table, 279 partitioned variable problems, 152
loop header, 100 path duplication, 60, 237, 243
loop nesting forest, 45, 53, 87, 95, PDG, predicate nodes, 172
224 PDG, region nodes, 172
loop-closed, 83 PDG, statement nodes, 172
lost-copy problem, 85, 251, 252, 257 phi-lists, 227
phi-ops, 227
materialization, of copies, 262 phiweb, 122, 191
minimal SSA, 18 PHP, 285
minimal SSA form, 28, 35, 239 pin-φ-web, 255
minimal, making SSA, 40, 66 pining, 193
minimal, making SSA non-, 40 pinning, 222, 252
monadic IR, 90 PLV problems, 152
multiplexer, 276 PRE, see redundancy elimination, 122
multiplexer, logic circuit, 271 PRE, computational optimality, 125
multiplexing mode, semantic of φ- PRE, correctness, 125
operator, 257 PRE, lifetime optimality, 126
myths, 13 PRE, of loads, 133
PRE, of loads and stores, 132
name, see variable PRE, of stores), 133
non-strict, 18 PRE, safety, 125
normalized-ψ, 193 PRE, speculative, 129
null pointer analysis, 159, 161 predicate, 185, 186
predicate promotion, 226
operand, indirect, 221
predicated execution, 185, 223, 233,
operands, explicit, 220
239
operands, implicit, 220
predicated switching instruction, 226
operation, 221
predicated switching, SSA-, 226
operator, 221
predication, 189
outermost excluding loop
predication, full, 234
OLE, 102
predication, partial, 190, 235
outermost loop excluding, 102
program dependence graph, 172, 183,
parallel copies, 251, 252, 254 226, 276
1. A. V. Aho and S. C. Johnson. Optimal code generation for expression trees. J. ACM,
23(3):488–501, 1976. 278
2. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and
Tools. Addison-Wesley series in computer science / World student series edition. Addison-
Wesley, 1986. 42, 128, 253
3. F. E. Allen and J. Cocke. A program data flow analysis procedure. Communications of the
ACM, 19(3):137–146, 1976. 186
4. Frances E. Allen. Control flow analysis. SIGPLAN Not., 5:1–19, 1970. 186
5. Frances E. Allen. Control flow analysis. In Proceedings of a Symposium on Compiler
Optimization, pages 1–19, New York, NY, USA, 1970. ACM. 203
6. J. R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Conversion of control
dependence to data dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN
symposium on Principles of programming languages, POPL ’83, pages 177–189, New York,
NY, USA, 1983. ACM. 259
7. Bowen Alpern, Mark N. Wegman, and F. Kenneth Zadeck. Detecting equality of variables
in programs. In POPL ’88: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on
Principles of programming languages, pages 1–11, New York, NY, USA, 1988. ACM. 11
8. Bowen Alpern, Mark N. Wegman, and F. Kenneth Zadeck. Detecting equality of variables
in programs. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles
of programming languages, POPL ’88, pages 1–11. ACM, 1988. 158, 204, 314
9. Scott Ananian. The static single information form. Master’s thesis, MIT, September 1999.
187, 188
10. João M. P. Cardoso and Pedro C. Diniz. Compilation Techniques for Reconfigurable Architectures. Springer, 2008. 356
11. Andrew W. Appel. Compiling with continuations. Cambridge University Press, 1992. 70,
90
12. Andrew W. Appel. Modern Compiler Implementation in {C,Java,ML}. Cambridge Univer-
sity Press, 1997. 127, 314, 337
13. Andrew W. Appel. Modern Compiler Implementation in ML. Cambridge University Press,
1998. 90
14. Andrew W. Appel. SSA is functional programming. SIGPLAN Notices, 33(4):17–20, 1998.
90
15. P. Ashenden. The Designer’s Guide to VHDL. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 2001. 341, 356
16. David I. August, Daniel A. Connors, Scott A. Mahlke, John W. Sias, Kevin M. Crozier, Ben-
Chung Cheng, Patrick R. Eaton, Qudus B. Olaniran, and Wen-mei W. Hwu. Integrated
predicated and speculative execution in the impact epic architecture. In Proceedings
of the 25th annual international symposium on Computer architecture, ISCA ’98, pages
227–237, Washington, DC, USA, 1998. IEEE Computer Society. 259
17. David I. August, Wen-Mei W. Hwu, and Scott A. Mahlke. The partial reverse if-conversion
framework for balancing control flow and predication. Int. J. Parallel Program., 27:381–
423, October 1999. 295
18. John Aycock and Nigel Horspool. Simple generation of static single assignment form. In
Proceedings of the 9th International Conference in Compiler Construction, volume 1781 of
Lecture Notes in Computer Science, pages 110–125. Springer, 2000. 42, 66
19. Olaf Bachmann, Paul S. Wang, and Eugene V. Zima. Chains of recurrences a method to
expedite the evaluation of closed-form functions. In Proceedings of the international
symposium on Symbolic and algebraic computation, pages 242–249. ACM Press, 1994. 137
20. U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic Publishers,
Boston, 1988. 137, 138
21. Rajkishore Barik. Efficient optimization of memory accesses in parallel programs. PhD
thesis, Rice University, 2010. 337
22. Rajkishore Barik and Vivek Sarkar. Interprocedural Load Elimination for Dynamic Op-
timization of Parallel Programs. In PACT’09: Proceedings of the 18th International Con-
ference on Parallel Architectures and Compilation Techniques, pages 41–52, Washington,
DC, USA, Sep 2009. IEEE Computer Society. 247
23. Denis Barthou, Jean-François Collard, and Paul Feautrier. Fuzzy array dataflow analysis.
J. of Parallel and Distributed Computing, 40(2):210–226, 1997. 247
24. Samuel Bates and Susan Horwitz. Incremental program testing using program depen-
dence graphs. In POPL ’93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on
Principles of programming languages, pages 384–396, New York, NY, USA, 1993. ACM. 204
25. W. Baxter and H. R. Bauer, III. The program dependence graph and vectorization. In
POPL ’89: Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of
programming languages, pages 1–11, New York, NY, USA, 1989. ACM. 204
26. L. Belady. A Study of Replacement Algorithms for a Virtual-Storage Computer. IBM Systems Journal, 5:78–101, 1966. 339
27. Michael Bender and Martín Farach-Colton. The LCA problem revisited. In Gastón Gonnet,
Daniel Panario, and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, volume
1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000. 128
28. Nick Benton, Andrew Kennedy, and George Russell. Compiling Standard ML to Java Byte-
codes. In Proceedings of the third ACM SIGPLAN International Conference on Functional
Programming (ICFP ’98), pages 129–140. ACM Press, 1998. SIGPLAN Notices 34(1),January
1999. 90
29. Lennart Beringer. Functional elimination of phi-instructions. Electronic Notes in Theo-
retical Computer Science, 176(3):3–20, 2007. 91
30. Lennart Beringer, Kenneth MacKenzie, and Ian Stark. Grail: a functional form for im-
perative mobile code. Electronic Notes in Theoretical Computer Science, 85(1), 2003.
91
31. David A. Berson, Rajiv Gupta, and Mary Lou Soffa. Integrated instruction scheduling
and register allocation techniques. In Proceedings of the 11th International Workshop
on Languages and Compilers for Parallel Computing, LCPC ’98, pages 247–262. Springer-
Verlag, 1999. 339
32. Paul Biggar. Design and Implementation of an Ahead-of-Time Compiler for PHP. PhD
thesis, Trinity College Dublin, 2010. 369, 370
33. Gianfranco Bilardi and Keshav Pingali. A framework for generalized control dependence.
SIGPLAN Not., 31(5):291–300, May 1996. 57, 204
34. Jan Olaf Blech, Sabine Glesner, Johannes Leitner, and Steffen Mülling. Optimizing code
generation from ssa form: A comparison between two formal correctness proofs in is-
abelle/hol. Electronic Notes in Theoretical Computer Science, 141(2):33–51, 2005. 43
35. David S. Blickstein, Peter W. Craig, Caroline S. Davidson, R. Neil Faiman, Jr., Kent D.
Glossop, Richard B. Grove, Steven O. Hobbs, and William B. Noyce. The GEM optimizing
compiler system. Digital Technical Journal, 4(4):121–136, Fall 1992. 257, 260
36. Rastislav Bodík, Rajiv Gupta, and Vivek Sarkar. ABCD: Eliminating array bounds checks
on demand. In PLDI ’00: Proceedings of the Conference on Programming Language Design
and Implementation, pages 321–333, New York, NY, USA, 2000. ACM. 111
37. Rastislav Bodik, Rajiv Gupta, and Vivek Sarkar. ABCD: eliminating array bounds checks
on demand. In PLDI, pages 321–333. ACM, 2000. 187, 188
38. W. Böhm, J. Hammes, B. Draper, M. Chawathe, C. Ross, R. Rinker, and W. Najjar. Mapping
a single assignment programming language to reconfigurable systems. The Journal of
Supercomputing, 21(2):117–130, February 2002. 357
39. Benoit Boissinot, Florian Brandner, Alain Darte, Benoît Dupont de Dinechin, and Fabrice
Rastello. A non-iterative data-flow algorithm for computing liveness sets in strict ssa
programs. In Proceedings of the 9th Asian conference on Programming Languages and
Systems, APLAS’11, pages 137–154, Berlin, Heidelberg, 2011. Springer-Verlag. 254, 258
40. Benoit Boissinot, Philip Brisk, Alain Darte, and Fabrice Rastello. SSI properties revisited.
ACM Transactions on Embedded Computing Systems, 2012. Special Issue on Software and
Compilers for Embedded Systems. 127, 187
41. Benoit Boissinot, Alain Darte, Fabrice Rastello, Benoit Dupont de Dinechin, and
Christophe Guillon. Revisiting out-of-SSA translation for correctness, code quality and
efficiency. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code
Generation and Optimization (CGO’09), pages 114–125. IEEE Computer Society Press,
2009. 43, 254, 256, 259, 264, 265, 313
42. Benoit Boissinot, Sebastian Hack, Daniel Grund, Benoît Dupont de Dinechin, and Fabrice
Rastello. Fast Liveness Checking for SSA-Form Programs. In CGO ’08: Proc. of the sixth
annual IEEE/ACM international symposium on Code Generation and Optimization, pages
35–44, 2008. 254
43. Florent Bouchez. Étude des problèmes de spilling et coalescing liés à l’allocation de registres
en tant que deux phases distinctes. (A Study of Spilling and Coalescing in Register Allocation
as Two Separate Phases). PhD thesis, École normale supérieure de Lyon, France, 2009.
337
44. Florent Bouchez, Quentin Colombet, Alain Darte, Fabrice Rastello, and Christophe Guil-
lon. Parallel copy motion. In Proceedings of the 13th International Workshop on Software
& Compilers for Embedded Systems, SCOPES ’10, pages 1:1–1:10. ACM, 2010. 339
45. Florent Bouchez, Alain Darte, Christophe Guillon, and Fabrice Rastello. Register alloca-
tion: What does the NP-completeness proof of Chaitin et al. really prove? Or revisiting
register allocation: Why and how. In LCPC, volume 4382, pages 283–298. Springer, 2006.
337
46. Florent Bouchez, Alain Darte, and Fabrice Rastello. On the complexity of register coa-
lescing. In CGO ’07: Proceedings of the International Symposium on Code Generation and
Optimization, pages 102–114, Washington, DC, USA, March 2007. IEEE Computer Society
Press. 337, 340
47. Florent Bouchez, Alain Darte, and Fabrice Rastello. On the complexity of spill everywhere
under SSA form. In Santosh Pande and Zhiyuan Li, editors, Proceedings of the 2007 ACM
SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems
(LCTES’07), San Diego, California, USA, June 13-15, 2007, pages 103–112. ACM, 2007. 338,
339
48. Florent Bouchez, Alain Darte, and Fabrice Rastello. Advanced conservative and optimistic
register coalescing. In Proceedings of the 2008 International Conference on Compilers,
Architecture, and Synthesis for Embedded Systems, CASES 2008, Atlanta, GA, USA, October
19-24, 2008, pages 147–156, 2008. 340
49. Marc M. Brandis and Hanspeter Mössenböck. Single-pass generation of static single-
assignment form for structured languages. ACM Transactions on Programming Languages
and Systems, 16(6):1684–1698, Nov 1994. 42
50. Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon,
and Andreas Zwinkau. Simple and efficient construction of static single assignment form.
In Compiler Construction - 22nd International Conference, CC 2013, Held as Part of the
European Joint Conferences on Theory and Practice of Software, ETAPS 2013, Rome, Italy,
March 16-24, 2013. Proceedings, pages 102–122, 2013. 66
51. Matthias Braun and Sebastian Hack. Register Spilling and Live-Range Splitting for SSA-
Form Programs. In Compiler Construction 2009, volume 5501, pages 174–189. Springer,
2009. 339
52. Matthias Braun, Christoph Mallon, and Sebastian Hack. Preference-guided register as-
signment. In International Conference on Compiler Construction, pages 205–223. Springer,
2010. 338, 339
53. P. Briggs, K. Cooper, and L Simpson. Value numbering. Software Practice and Experience,
27(6):701–724, 1997. 158
54. Preston Briggs, Keith D. Cooper, Timothy J. Harvey, and L. Taylor Simpson. Practical
improvements to the construction and destruction of static single assignment form.
Software – Practice and Experience, 28(8):859–881, July 1998. 42, 91, 264, 265, 313
55. Preston Briggs, Keith D. Cooper, and Linda Torczon. Rematerialization. In PLDI ’92:
Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and
implementation, pages 311–321, New York, NY, USA, 1992. ACM. 203, 339
56. Philip Brisk, Foad Dabiri, Jamie Macbeth, and Majid Sarrafzadeh. Polynomial Time Graph
Coloring Register Allocation. In 14th International Workshop on Logic and Synthesis. ACM
Press, 2005. 337
57. Christian Bruel. If-Conversion SSA Framework for partially predicated VLIW architectures.
In ODES 4, pages 5–13. SIGPLAN, ACM and IEEE, March 2006. 261
58. Zoran Budimlic, Keith D. Cooper, Timothy J. Harvey, Ken Kennedy, Timothy S. Oberg, and
Steven W. Reeves. Fast copy coalescing and live-range identification. In International
Conference on Programming Language Design and Implementation (PLDI’02), pages
25–32. ACM Press, June 2002. 24, 265, 266, 314
59. M. Budiu and S. Goldstein. Compiling application-specific hardware. In Proc. of the Intl.
Conf. on Field Programmable Logic and Applications (FPL), pages 853–863, 2002. 356
60. T. Callahan, J. Hauser, and J. Wawrzynek. The Garp architecture and C compiler. Computer,
33(4):62–69, 2000. 356
61. Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H.
Anderson, Stephen Brown, and Tomasz Czajkowski. High-level synthesis for fpga-based
processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA International
Symposium on Field Programmable Gate Arrays, FPGA ’11, pages 33–36. ACM, Mar. 2011.
357
62. L. Carter, B. Simon, B. Calder, L. Carter, and J. Ferrante. Predicated static single assignment.
In Proc. of the 1999 Intl. Conf. on Parallel Architectures and Compilation Techniques
(PACT’99), page 245, Washington, DC, USA, 1999. IEEE Computer Society. 217, 356
63. G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W. Markstein.
Register allocation via graph coloring. Journal of Computer Languages, 6:45–57, 1981. 337
64. Gregory J. Chaitin. Register allocation & spilling via graph coloring. In SIGPLAN Symp.
on Compiler Construction (CC’82), pages 98–105. ACM Press, 1982. 314
65. Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins,
and Peter W. Markstein. Register allocation via coloring. Computer Languages, 6:47–57,
January 1981. 25, 313, 314
66. Manuel M. T. Chakravarty, Gabriele Keller, and Patryk Zadarnowski. A functional perspec-
tive on SSA optimisation algorithms. Electronic Notes in Theoretical Computer Science,
82(2), 2003. 91
67. Craig Chambers and David Ungar. Customization: optimizing compiler technology
for self, a dynamically-typed object-oriented programming language. SIGPLAN Not.,
24(7):146–160, 1989. 188
68. Shing-Chow Chan, G. R. Gao, B. Chapman, T. Linthicum, and A. Dasgupta. Open64
compiler infrastructure for emerging multicore/manycore architecture All Symposium
Tutorial. 2008 IEEE International Symposium on Parallel and Distributed Processing,
pages 1–1, 2008. 253
69. Barbara Chapman, Deepak Eachempati, and Oscar Hernandez. Experiences Developing
the OpenUH Compiler and Runtime Infrastructure. Int. J. Parallel Program., 41(6):825–
854, December 2013. 253
70. Jong-Deok Choi, Vivek Sarkar, and Edith Schonberg. Incremental computation of static single assignment form. In Compiler Construction, pages 223–237. Springer Berlin / Heidelberg, 1996. 66
71. Jong-Deok Choi, Ron Cytron, and Jeanne Ferrante. Automatic construction of sparse
data flow evaluation graphs. In POPL ’91: Proceedings of the 18th ACM SIGPLAN-SIGACT
symposium on Principles of programming languages, pages 55–66, New York, NY, USA,
1991. ACM. 24, 111, 187, 188
72. F. Chow, S. Chan, R. Kennedy, S. Liu, R. Lo, and P. Tu. A new algorithm for partial redun-
dancy elimination based on ssa form. In Proceedings of the ACM SIGPLAN ’97 Conference
on Programming Language Design and Implementation, pages 273–286, 1997. 157
73. Fred Chow, Sun Chan, Robert Kennedy, Shin-Ming Liu, Raymond Lo, and P. Tu. A new
algorithm for partial redundancy elimination based on ssa form. ACM SIGPLAN Notices,
32(5):273 – 286, 1997. 217
74. Fred Chow, Sun Chan, Shin Ming Liu, Raymond Lo, and Mark Streich. Effective represen-
tation of aliases and indirect memory operations in ssa form. In Compiler Construction,
pages 253–267. Springer Berlin Heidelberg, 1996. 230
75. Fred C. Chow. Minimizing register usage penalty at procedure calls. In Proceedings of the
ACM SIGPLAN ’88 Conference on Programming Language Design and Implementation,
pages 85–94, July 1988. 158
76. Fred C Chow and John L Hennessy. The priority-based coloring approach to register alloca-
tion. ACM Transactions on Programming Languages and Systems (TOPLAS), 12(4):501–536,
1990. 337
77. Weihaw Chuang, Brad Calder, and Jeanne Ferrante. Phi-predication for light-weight
if-conversion. In Proceedings of the International Symposium on Code Generation and
Optimization (CGO’03), CGO ’03, pages 179–190, Washington, DC, USA, 2003. IEEE Com-
puter Society. 260, 261, 295
78. C. Click. Global code motion global value numbering. In SIGPLAN International Con-
ference on Programming Languages Design and Implementation, pages 246 – 257, 1995.
217
79. Clifford Click. Combining Analyses, Combining Optimizations. PhD thesis, Rice University,
February 1995. 66
80. J. Cocke and J. Schwartz. Programming languages and their compilers. Technical report,
Courant Institute of Mathematical Sciences, New York University, April 1970. 158
81. Josep M. Codina, Jesús Sánchez, and Antonio González. A unified modulo scheduling
and register allocation technique for clustered processors. In Proceedings of the 2001
International Conference on Parallel Architectures and Compilation Techniques, PACT ’01,
pages 175–184. IEEE Computer Society, 2001. 339
82. Quentin Colombet. Decoupled (SSA-based) register allocators : from theory to practice,
coping with just-in-time compilation and embedded processors constraints. PhD thesis,
Ecole normale supérieure de lyon - ENS LYON, December 2012. 337
83. Quentin Colombet, Benoit Boissinot, Philip Brisk, Sebastian Hack, and Fabrice Rastello.
Graph-coloring and register allocation using repairing. In CASES, pages 45–54, 2011. 337,
338
84. Robert P. Colwell, Robert P. Nix, John J. O’Donnell, David B. Papworth, and Paul K. Rodman.
A vliw architecture for a trace scheduling compiler. In Proceedings of the second inter-
national conference on Architectual support for programming languages and operating
systems, ASPLOS-II, pages 180–192, Los Alamitos, CA, USA, 1987. IEEE Computer Society
Press. 257
85. Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. An empirical study of iterative
data-flow analysis. In CIC ’06: Proceedings of the Conference on Computing, pages 266–276,
Washington, DC, USA, 2006. IEEE Computer Society. 110
86. Keith D. Cooper, Timothy J. Harvey, and Ken Kennedy. An empirical study of iterative data-
flow analysis. In 15th International Conference on Computing (ICC’06), pages 266–276,
Washington, DC, USA, 2006. IEEE Computer Society. 127
87. Keith D Cooper and L Taylor Simpson. Live range splitting in a graph coloring register
allocator. In International Conference on Compiler Construction, pages 174–187. Springer,
1998. 340
88. Keith D. Cooper, L. Taylor Simpson, and Christopher A. Vick. Operator strength reduction.
ACM Trans. Program. Lang. Syst., 23(5):603–625, 2001. 203
89. Keith D. Cooper and Linda Torczon. Engineering a Compiler. Morgan Kaufmann, 2004.
128
90. Intel Corp. Arria10 device overview, 2017. 357
91. P. Cousot and N. Halbwachs. Automatic discovery of linear restraints among variables of
a program. In POPL, pages 84–96. ACM, 1978. 187
92. Ron Cytron and Jeanne Ferrante. What’s in a Name? Or the Value of Renaming for Paral-
lelism Detection and Storage Allocation. Proceedings of the 1987 International Conference
on Parallel Processing, pages 19–27, August 1987. 337
93. Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck.
An efficient method of computing static single assignment form. In POPL ’89: Proceedings
of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages,
pages 25–35, New York, NY, USA, 1989. ACM. 11, 42
94. Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck.
Efficiently computing static single assignment form and the control dependence graph.
ACM Transactions on Programming Languages and Systems, 13(4):451–490, 1991. 24, 42,
57, 66, 203, 264, 312
95. Olivier Danvy, Kevin Millikin, and Lasse R. Nielsen. On one-pass CPS transformations.
Journal of Functional Programming, 17(6):793–812, 2007. 90
96. Olivier Danvy and Ulrik Pagh Schultz. Lambda-dropping: transforming recursive equa-
tions into programs with block structure. Theoretical Computer Science, 248(1-2):243–287,
2000. 90
97. Alain Darte, Yves Robert, and Frederic Vivien. Scheduling and Automatic Parallelization.
Birkhauser Boston, 1st edition, 2000. 204
98. Dibyendu Das and U. Ramakrishna. A practical and fast iterative algorithm for Φ-function
computation using dj graphs. ACM Trans. Program. Lang. Syst., 27(3):426–440, May 2005.
58
99. B. Dupont de Dinechin, F. de Ferri, C. Guillon, and A. Stoutchinin. Code generator opti-
mizations for the st120 dsp-mcu core. In Proceedings of the 2000 international conference
on Compilers, architecture, and synthesis for embedded systems, CASES ’00, pages 93–102,
New York, NY, USA, 2000. ACM. 254, 255
100. Benoît Dupont de Dinechin. A unified software pipeline construction scheme for modulo
scheduled loops. In Parallel Computing Technologies, 4th International Conference, PaCT-
97, volume 1277 of Lecture Notes in Computer Science, pages 189–200, 1997. 263
101. Benoît Dupont de Dinechin. Extending modulo scheduling with memory reference
merging. In 8th International Conference on Compiler Construction, 8th International
Conference, CC’99, pages 274–287, 1999. 254
102. Benoît Dupont de Dinechin. Time-Indexed Formulations and a Large Neighborhood
Search for the Resource-Constrained Modulo Scheduling Problem. In 3rd Multidisci-
plinary International Scheduling conference: Theory and Applications (MISTA), 2007. 254
103. Benoît Dupont de Dinechin. Inter-Block Scoreboard Scheduling in a JIT Compiler for
VLIW Processors. In Euro-Par 2008, pages 370–381, 2008. 254
104. Benoît Dupont de Dinechin, Christophe Monat, Patrick Blouet, and Christian Bertin.
Dsp-mcu processor optimization for portable applications. Microelectron. Eng., 54(1-
2):123–132, December 2000. 254
105. François de Ferrière. Improvements to the psi-SSA representation. In Proceedings of the
10th International Workshop on Software and Compilers for Embedded Systems (SCOPES),
SCOPES ’07, pages 111–121. ACM, 2007. 217, 261, 262, 356
106. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large
clusters. Communications of the ACM, 51(1):107–113, 2008. 345, 357
107. J. B. Dennis. First version of a data flow procedure language. In Programming Symposium,
Proceedings Colloque sur la Programmation, pages 362–376, London, UK, 1974. Springer-
Verlag. 203
108. J. B. Dennis. Data flow supercomputers. Computer, 13(11):48–56, 1980. 203
109. D. M. Dhamdhere. E-path_pre: partial redundancy made easy. SIGPLAN Notices, 37(8):53–
65, 2002. 157
110. K. Drechsler and M. Stadel. A variation of Knoop, Rüthing and Steffen's lazy code motion.
SIGPLAN Notices, 28(5):29–38, 1993. 157
111. Evelyn Duesterwald, Rajiv Gupta, and Mary Lou Soffa. Reducing the cost of data flow
analysis by congruence partitioning. In CC ’94: Proceedings of the Conference on Compiler
Construction, pages 357–373, London, UK, 1994. Springer-Verlag. 111
112. Dietmar Ebner, Florian Brandner, Bernhard Scholz, Andreas Krall, Peter Wiedermann,
and Albrecht Kadlec. Generalized instruction selection using SSA -graphs. In Krisztián
Flautner and John Regehr, editors, Proceedings of the 2008 ACM SIGPLAN/SIGBED Con-
ference on Languages, Compilers, and Tools for Embedded Systems (LCTES’08), Tucson, AZ,
USA, June 12-13, 2008, pages 31–40. ACM, 2008. 279
113. Dietmar Ebner, Florian Brandner, Bernhard Scholz, Andreas Krall, Peter Wiedermann,
and Albrecht Kadlec. Generalized instruction selection using SSA-graphs. In LCTES ’08:
Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on languages, compilers, and
tools for embedded systems, pages 31–40, New York, NY, USA, 2008. ACM. 203
114. Erik Eckstein, Oliver König, and Bernhard Scholz. Code Instruction Selection Based on
SSA-Graphs. In Andreas Krall, editor, SCOPES, volume 2826 of Lecture Notes in Computer
Science, pages 49–65. Springer, 2003. 279
115. Maryam Emami, Rakesh Ghiya, and Laurie J. Hendren. Context-sensitive interprocedural
points-to analysis in the presence of function pointers. In Proceedings of the ACM SIGPLAN
1994 conference on Programming language design and implementation, PLDI ’94, pages
242–256, New York, NY, USA, 1994. ACM. 370
116. Pedro C. Diniz et al. DEFACTO compilation and synthesis system. Elsevier Journal on
Microprocessors and Microsystems, 29(2):51–62, 2005. 356
117. J. Fabri. Automatic Storage Optimization. Proc ACM SIGPLAN Symp. on Compiler Con-
struction, 1979. 337
118. Jesse Zhixi Fang. Compiler algorithms on if-conversion, speculative predicates assign-
ment and predicated code optimizations. In Proceedings of the 9th International Workshop
on Languages and Compilers for Parallel Computing, LCPC ’96, pages 135–153, London,
UK, UK, 1997. Springer-Verlag. 260, 261, 262
119. Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, and Fred Home-
wood. Lx: a technology platform for customizable vliw embedded processing. SIGARCH
Comput. Archit. News, 28(2):203–213, May 2000. 254, 259
120. Martin Farach and Vincenzo Liberatore. On local register allocation. In SODA ’98: Proceed-
ings of the ninth annual ACM-SIAM Symposium on Discrete Algorithms, pages 564–573,
Philadelphia, PA, USA, 1998. Society for Industrial and Applied Mathematics. 339
121. P. Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22:243–268, September 1988. 137
122. J. Ferrante, M. Mace, and B. Simons. Generating sequential code from parallel code.
In ICS ’88: Proceedings of the 2nd international conference on Supercomputing, pages
582–592, New York, NY, USA, 1988. ACM. 204
123. Jeanne Ferrante and Mary Mace. On linearizing parallel code. In POPL ’85: Proceedings
of the 12th ACM SIGACT-SIGPLAN symposium on Principles of programming languages,
pages 179–190, New York, NY, USA, 1985. ACM. 204
124. Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence graph
and its use in optimization. ACM Trans. on Programming Languages and Systems, 9(3):319–
349, 1987. 11, 203, 260
125. Stephen J. Fink, Kathleen Knobe, and Vivek Sarkar. Unified analysis of array and object
references in strongly typed languages. In SAS ’00: Proceedings of the 7th International
Symposium on Static Analysis, pages 155–174, London, UK, 2000. Springer-Verlag. 247
126. Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of
compiling with continuations. In Proceedings of the ACM SIGPLAN Conference on Pro-
gramming Language Design and Implementation (PLDI’93), pages 237–247. ACM Press,
1993. 90
127. Fănică Gavril. The intersection graphs of subtrees in trees are exactly the chordal graphs.
Journal of Combinatorial Theory, Series B, 16(1):47–56, 1974. 337
128. Thomas Gawlitza, Jerome Leroux, Jan Reineke, Helmut Seidl, Gregoire Sutre, and Reinhard
Wilhelm. Polynomial precise interval analysis revisited. Efficient Algorithms, 1:422 – 437,
2009. 188
129. Lal George and Andrew W. Appel. Iterated register coalescing. ACM Transactions on
Programming Languages and Systems, 18(3):300–324, May 1996. 337, 340
130. Lal George and Blu Matthias. Taming the ixp network processor. In PLDI, pages 26–37.
ACM, 2003. 187
131. M. Gerlek, M. Wolfe, and E. Stoltz. A Reference Chain Approach for Live Variables. Tech-
nical Report CSE 94-029, Oregon Graduate Institute of Science & Technology, 1994. 127
132. Michael P. Gerlek, Eric Stoltz, and Michael Wolfe. Beyond induction variables: Detect-
ing and classifying sequences using a demand-driven ssa form. ACM Transactions on
Programming Languages and Systems, 17(1):85–122, 1995. 138, 203
133. David M. Gillies, Dz-ching Roy Ju, Richard Johnson, and Michael Schlansker. Global
predicate analysis and its application to register allocation. In Proceedings of the 29th
annual ACM/IEEE international symposium on Microarchitecture, MICRO 29, pages
114–125, Washington, DC, USA, 1996. IEEE Computer Society. 256, 257
134. Sabine Glesner. An ASM semantics for SSA intermediate representations. In Wolf Zimmer-
mann and Bernhard Thalheim, editors, Abstract State Machines, volume 3052 of LNCS,
pages 144–160. Springer, 2004. 91
135. Ricardo E. Gonzalez. Xtensa: A configurable and extensible processor. IEEE Micro,
20(2):60–70, March 2000. 256
136. J. R. Goodman and W.-C. Hsu. Code scheduling and register allocation in large basic
blocks. In Proceedings of the 2nd international conference on Supercomputing, ICS ’88,
pages 442–452, New York, NY, USA, 1988. ACM. 263
137. Kronos Group. Opencl overview, 2018. 356
138. Daniel Grund and Sebastian Hack. A fast cutting-plane algorithm for optimal coalescing.
In Compiler Construction, 16th International Conference, CC 2007, volume 4420 of Lecture
Notes in Computer Science, pages 111–125. Springer, 2007. 340
139. Z. Guo, B. Buyukkurt, J. Cortes, A. Mitra, and W. Najjar. A compiler intermediate repre-
sentation for reconfigurable fabrics. Int. J. Parallel Program., 36(5):493–520, 2008. 356
140. Yuri Gurevich. Sequential abstract-state machines capture sequential algorithms. ACM
Transactions on Computational Logic (TOCL), 1(1):77–111, 2000. 91
141. Sebastian Hack. Register Allocation for Programs in SSA Form. PhD thesis, Universität
Karlsruhe, October 2007. 337
142. Sebastian Hack, Daniel Grund, and Gerhard Goos. Towards Register Allocation for Pro-
grams in SSA Form. Technical Report 2005-27, University of Karlsruhe, September 2005.
337
143. A. Hagiescu, W.-F. Wong, D. Bacon, and R. Rabbah. A computing origami: folding streams
in FPGAs. In Proc. of the 2009 ACM/IEEE Design Automation Conf. (DAC), pages 282–287,
2009. 357
144. Rebecca Hasti and Susan Horwitz. Using static single assignment form to improve flow-
insensitive pointer analysis. In Proceedings of the ACM SIGPLAN 1998 Conference on
Programming Language Design and Implementation, pages 97–105, 1998. 13
145. W. Havanki, S. Banerjia, and T. Conte. Treegion scheduling for wide issue processors.
High-Performance Computer Architecture, International Symposium on, 0:266, 1998. 263
146. Paul Havlak. Construction of thinned gated single-assignment form. In In Proc. 6rd
Workshop on Programming Languages and Compilers for Parallel Computing, pages
477–499. Springer Verlag, 1993. 204
147. Paul Havlak. Nesting of Reducible and Irreducible Loops. ACM Transactions on Program-
ming Languages and Systems, 19(4):557–567, 1997. 91, 128, 258
148. Matthew S. Hecht. Flow Analysis of Computer Programs. Elsevier, 1977. 186
149. Matthew S. Hecht and Jeffrey D. Ullman. Analysis of a simple algorithm for global data
flow problems. In POPL ’73: Proceedings of the Symposium on Principles of Programming
Languages, pages 207–217, New York, NY, USA, 1973. ACM. 110
150. A. Hormati, M. Kudlur, S. Mahlke, D. Bacon, and R. Rabbah. Optimus: Efficient Realiza-
tion of Streaming Applications on FPGAs. In Proc. of the 2008 Intl. Conf. on Compilers,
Architecture, and Synthesis for Embedded Systems (CASES), pages 41–50, 2008. 357
151. Wen-Mei W. Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter,
Roger A. Bringmann, Roland G. Ouellette, Richard E. Hank, Tokuzo Kiyohara, Grant E.
Haab, John G. Holm, and Daniel M. Lavery. The superblock: an effective technique for
vliw and superscalar compilation. J. Supercomput., 7(1-2):229–248, May 1993. 263
152. Xilinx Inc. Vivado high-level synthesis, 2014. 357
153. Xilinx Inc. Virtex ultrascale+ FPGA devices, 2016. 357
154. Margarida F. Jacome, Gustavo De Veciana, and Satish Pillai. Clustered vliw architectures
with predicated switching. In Proceedings of the 38th Design Automation Conference,
DAC, pages 696–701, 2001. 260, 295
155. Johan Janssen and Henk Corporaal. Making graphs reducible with controlled node
splitting. ACM Transactions on Programming Languages and Systems, 19:1031–1052,
November 1997. 128
156. Simon Holm Jensen, Anders Møller, and Peter Thiemann. Type analysis for javascript.
In Proceedings of the 16th International Symposium on Static Analysis, SAS ’09, pages
238–255, Berlin, Heidelberg, 2009. Springer-Verlag. 369
157. Neil Johnson and Alan Mycroft. Combined code motion and register allocation using
the value state dependence graph. In Proceedings of the 12th International Conference on
Compiler Construction (CC’03), pages 1–16, April 2003. 204
158. Neil E. Johnson. Code size optimization for embedded processors. Technical Report
UCAM-CL-TR-607, University of Cambridge, Computer Laboratory, November 2004. 204
159. R. Johnson, D. Pearson, and K. Pingali. The program structure tree. In PLDI, pages
171–185. ACM, 1994. 188
160. Richard Johnson and Keshav Pingali. Dependence-based program analysis. In PLDI ’93:
Proceedings of the Conference on Programming Language Design and Implementation,
pages 78–89, New York, NY, USA, 1993. ACM. 111
161. Richard Johnson and Keshav Pingali. Dependence-based program analysis. In PLDI,
pages 78–89. ACM, 1993. 188
162. Thomas Johnsson. Lambda lifting: Transforming programs to recursive equations. In Jean-
Pierre Jouannaud, editor, Functional Programming Languages and Computer Architecture,
Proceedings, volume 201 of LNCS, pages 190–203. Springer, 1985. 90
163. Nenad Jovanovic, Christopher Kruegel, and Engin Kirda. Pixy: A static analysis tool for
detecting web application vulnerabilities (short paper). In 2006 IEEE Symposium on
Security and Privacy, pages 258–263, 2006. 369, 370
164. John B. Kam and Jeffrey D. Ullman. Global data flow analysis and iterative algorithms.
Journal of the ACM (JACM), 23(1):158–171, 1976. 110, 127
165. John B. Kam and Jeffrey D. Ullman. Monotone data flow analysis frameworks. Acta
Informatica, 7(3):305–317, Sep 1977. 186
166. Daniel Kästner and Sebastian Winkel. ILP-based Instruction Scheduling for IA-64. In Pro-
ceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded
Systems, LCTES ’01, pages 145–154, New York, NY, USA, 2001. ACM. 254
167. Richard Kelsey. A correspondence between continuation passing style and static single
assignment form. In Intermediate Representations Workshop, pages 13–23, 1995. 90
168. Andrew Kennedy. Compiling with continuations, continued. In Ralf Hinze and Norman
Ramsey, editors, Proceedings of the 12th ACM SIGPLAN International Conference on
Functional Programming, ICFP 2007, Freiburg, Germany, October 1-3, 2007, pages 177–
190. ACM Press, 2007. 90
169. K. W. Kennedy. Node Listings applied to Data Flow Analysis. In 2nd ACM SIGACT-SIGPLAN
symposium on Principles of Programming Languages (POPL’75), pages 10–21. ACM, 1975.
127
170. R. Kennedy, S. Chan, S. Liu, R. Lo, P. Tu, and F. Chow. Partial redundancy elimination in
SSA form. ACM Trans. Program. Lang. Syst., 21(3):627–676, 1999. 157
171. R. Kennedy, F. Chow, P. Dahl, S. Liu, R. Lo, P. Tu, and M. Streich. Strength reduction via
SSAPRE. In Proceedings of the Seventh International Conference on Compiler Construction,
1998. 158
172. Robert Kennedy, Sun Chan, Shin-Ming Liu, Raymond Lo, Peng Tu, and Fred Chow. Partial
Redundancy Elimination in SSA Form. ACM Trans. Program. Lang. Syst., 21(3):627–676,
May 1999. 258, 264
173. Uday P. Khedker and Dhananjay M. Dhamdhere. Bidirectional data flow analysis: myths
and reality. SIGPLAN Not., 34(6):47–57, 1999. 187
174. Gary A. Kildall. A unified approach to global program optimization. In Proceedings of the
1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages
(POPL’73), pages 194–206. ACM Press, 1973. 126, 186
175. V. Kislenkov, V. Mitrofanov, and E. Zima. Multidimensional chains of recurrences. In
Proceedings of the 1998 international symposium on symbolic and algebraic computation,
pages 199–206. ACM Press, 1998. 137
176. Kathleen Knobe and Vivek Sarkar. Array SSA form and its use in Parallelization. Conf.
Rec. Twenty-fifth ACM Symposium on Principles of Programming Languages, San Diego,
California, January 1998. 247
177. Kathleen Knobe and Vivek Sarkar. Conditional constant propagation of scalar and array
references using array SSA form. In Giorgio Levi, editor, Lecture Notes in Computer Science,
1503, pages 33–56. Springer-Verlag, 1998. Proceedings from the 5th International Static
Analysis Symposium. 247
178. J. Knoop, O. Rüthing, and B. Steffen. Lazy code motion. In Proceedings of the ACM
SIGPLAN ’92 Conference on Programming Language Design and Implementation, pages
224–234, 1992. 157
179. J. Knoop, O. Rüthing, and B. Steffen. Lazy strength reduction. Journal of Programming
Languages, 1(1):71–91, 1993. 158
180. J. Knoop, O. Rüthing, and B. Steffen. Optimal code motion: theory and practice. ACM
Trans. Program. Lang. Syst., 16(4):1117–1155, 1994. 157
181. M. Lam. Software Pipelining: an Effective Scheduling Technique for VLIW Machines. In
PLDI ’88: Proc. of the ACM SIGPLAN 1988 conference on Programming Language Design
and Implementation, pages 318–328, 1988. 263
182. Peter Landin. A generalization of jumps and labels. Technical report, UNIVAC Systems
Programming Research, August 1965. Reprinted in Higher Order and Symbolic Computa-
tion, 11(2):125-143, 1998, with a foreword by Hayo Thielecke. 90
183. Christopher Lapkowski and Laurie J. Hendren. Extended SSA numbering: introducing SSA
properties to languages with multi-level pointers. In Proceedings of the 1996 conference of
the Centre for Advanced Studies on Collaborative research, CASCON ’96, pages 23–34. IBM
Press, 1996. 259
184. Chris Lattner and Vikram S. Adve. LLVM: A compilation framework for lifelong program
analysis & transformation. In Code Generation and Optimization, CGO 2004, pages 75–88,
Palo Alto, CA, March 2004. IEEE Computer Society. 253
185. Peeter Laud, Tarmo Uustalu, and Varmo Vene. Type systems equivalent to data-flow
analyses for imperative languages. Theoretical Computer Science, 364(3):292–310, 2006.
91
186. Alan C. Lawrence. Optimizing compilation with the Value State Dependence Graph.
Technical Report UCAM-CL-TR-705, University of Cambridge, Computer Laboratory,
December 2007. 204
187. E. Lee and D. Messerschmitt. Synchronous data flow. Proc. of the IEEE, 75(9):1235–1245,
1987. 357
188. J.-Y. Lee and I.-C. Park. Address code generation for DSP instruction-set architectures.
ACM Trans. Des. Autom. Electron. Syst., 8(3):384–395, 2003. 254, 256
189. Alexandre Lenart, Christopher Sadler, and Sandeep K. S. Gupta. SSA-based flow-sensitive
type analysis: combining constant and type propagation. In Proceedings of the 2000 ACM
symposium on Applied computing - Volume 2, SAC ’00, pages 813–817, New York, NY, USA,
2000. ACM. 370
190. Allen Leung and Lal George. Static single assignment form for machine code. In Proceedings
of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation,
PLDI ’99, pages 204–214, New York, NY, USA, 1999. ACM. 255, 265, 314
191. Rainer Leupers. Retargetable Code Generation for Digital Signal Processors. Kluwer
Academic Publishers, Norwell, MA, USA, 1997. 254
192. Rainer Leupers. Exploiting conditional instructions in code generation for embedded
VLIW processors. In Proceedings of the conference on Design, automation and test in Europe,
DATE ’99, New York, NY, USA, 1999. ACM. 260
193. Shin-Ming Liu, Raymond Lo, and Fred Chow. Loop induction variable canonicalization
in parallelizing compilers. Proceedings of the 1996 Conference on Parallel Architectures
and Compilation Techniques (PACT ’96), page 228, 1996. 138
194. Raymond Lo, Fred Chow, Robert Kennedy, Shin-Ming Liu, and Peng Tu. Register promo-
tion by sparse partial redundancy elimination of loads and stores. In Proceedings of the
ACM SIGPLAN ’98 Conference on Programming Language Design and Implementation,
pages 26–37. ACM, 1998. 158, 187
195. Francesco Logozzo and Manuel Fähndrich. Pentagons: a weakly relational abstract domain
for the efficient validation of array accesses. In SAC, pages 184–188. ACM, 2008. 187
196. P. Geoffrey Lowney, Stefan M. Freudenberger, Thomas J. Karzes, W. D. Lichtenstein,
Robert P. Nix, John S. O’Donnell, and John Ruttenberg. The multiflow trace schedul-
ing compiler. J. Supercomput., 7(1-2):51–142, May 1993. 257, 259, 260, 263
197. Guei-Yuan Lueh, Thomas Gross, and Ali-Reza Adl-Tabatabai. Fusion-based register
allocation. ACM Trans. Program. Lang. Syst., 22(3):431–470, May 2000. 337
198. LLVM Website. http://llvm.cs.uiuc.edu. 278
199. S. Mahlke, R. Ravindran, M. Schlansker, R. Schreiber, and T. Sherwood. Bitwidth cog-
nizant architecture synthesis of custom hardware accelerators. Computer-Aided Design
of Integrated Circuits and Systems, IEEE Transactions on, 20(11):1355–1371, 2001. 188
200. Scott A. Mahlke, William Y. Chen, Wen-mei W. Hwu, B. Ramakrishna Rau, and Michael S.
Schlansker. Sentinel scheduling for VLIW and superscalar processors. In Proceedings of
the fifth international conference on Architectural support for programming languages
and operating systems, ASPLOS-V, pages 238–247, New York, NY, USA, 1992. ACM. 259
201. Scott A. Mahlke, Richard E. Hank, James E. McCormick, David I. August, and Wen-Mei W.
Hwu. A comparison of full and partial predicated execution support for ILP processors.
In Proceedings of the 22nd annual international symposium on Computer architecture,
ISCA ’95, pages 138–150, New York, NY, USA, 1995. ACM. 257, 259
202. Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann.
Effective compiler support for predicated execution using the hyperblock. In Proceedings
of the 25th Annual International Symposium on Microarchitecture, MICRO 25, pages
45–54, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press. 262, 263, 295, 356
203. Yutaka Matsuno and Atsushi Ohori. A type system equivalent to static single assignment.
In Annalisa Bossi and Michael J. Maher, editors, Proceedings of the 8th International ACM
SIGPLAN Conference on Principles and Practice of Declarative Programming (PPDP’06),
pages 249–260. ACM Press, 2006. 91
204. Cathy May. The parallel assignment problem redefined. IEEE Transactions on Software
Engineering, 15(6):821–824, June 1989. 91, 314
205. David McAllester. On the complexity analysis of static analyses. Journal of the ACM,
49:512–537, July 2002. 127
206. P. Metzgen and D. Nancekievill. Multiplexer restructuring for FPGA implementation cost
reduction. In Proc. of the ACM/IEEE Design Automation Conf. (DAC), pages 421–426, 2005.
356
207. Antoine Miné. The octagon abstract domain. Higher Order Symbol. Comput., 19:31–100,
2006. 187
208. Eugenio Moggi. Notions of computation and monads. Information and Computation,
93(1):55–92, 1991. 90
209. E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies.
Communications of the ACM, 22(2):96–103, 1979. 157
210. Robert Morgan. Building an Optimizing Compiler, January 1998. 217
211. Hanspeter Mössenböck and Michael Pfeiffer. Linear scan register allocation in the context
of SSA form and register constraints. In R. Nigel Horspool, editor, Compiler Construction,
pages 229–246. Springer Berlin Heidelberg, 2002. 338
212. Rajeev Motwani, Krishna V. Palem, Vivek Sarkar, and Salem Reyen. Combining register
allocation and instruction scheduling. Technical report, Stanford University, Stanford,
CA, USA, 1995. 339
213. Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann,
1997. 253
214. B. Murphy, V. Menon, F. Schneider, T. Shpeisman, and A. Adl-Tabatabai. Fault-safe code
motion for type-safe languages. In Proceedings of the 6th annual IEEE/ACM international
symposium on Code generation and optimization, pages 144–154, 2008. 157
215. Mangala Gowri Nanda and Saurabh Sinha. Accurate interprocedural null-dereference
analysis for Java. In ICSE, pages 133–143, 2009. 188
216. Flemming Nielson, Hanne R. Nielson, and Chris Hankin. Principles of Program Analysis.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1999. 110
217. Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of program analysis.
Springer, 2005. 187
218. Cindy Norris and Lori L Pollock. A scheduler-sensitive global register allocator. In
Supercomputing’93. Proceedings, pages 804–813. IEEE, 1993. 339
219. D. Novillo. A propagation engine for GCC. In Proceedings of the 2005 GCC Developers’
Summit, pages 175–184, 2005. http://www.gccsummit.org/2005. 138
220. Diego Novillo. A propagation engine for GCC. In Proceedings of the GCC Developers’
Summit, pages 175–184, 2005. 110
221. D. Nuzman and R. Henderson. Multi-platform auto-vectorization. In CGO, 2006. 137
222. Ciaran O’Donnell. High level compiling for low level machines. PhD thesis, École Nationale
Supérieure des Télécommunications, 1994. Available from ftp.enst.fr. 90
223. Karl J. Ottenstein, Robert A. Ballance, and Arthur B. MacCabe. The program dependence
web: a representation supporting control-, data-, and demand-driven interpretation of
imperative languages. In PLDI ’90: Proceedings of the ACM SIGPLAN 1990 conference on
Programming language design and implementation, pages 257–271, New York, NY, USA,
1990. ACM. 11, 204
224. Karl J. Ottenstein and Linda M. Ottenstein. The program dependence graph in a software
development environment. In SDE 1: Proceedings of the first ACM SIGSOFT/SIGPLAN
software engineering symposium on Practical software development environments, pages
177–184, New York, NY, USA, 1984. ACM. 204
225. David A. Padua, editor. Encyclopedia of Parallel Computing. Springer, 2011. 167
226. V.K. Paleri, Y.N. Srikant, and P. Shankar. Partial redundancy elimination: a simple,
pragmatic and provably correct algorithm. Science of Computer Programming, 48(1):1–20,
2003. 157
227. P. Panda. SystemC: a modeling platform supporting multiple design abstractions. In Proc.
of the 14th Intl. Symp. on Systems Synthesis (ISSS’01), pages 75–80, New York, NY, USA,
2001. ACM. 356
248. Fabrice Rastello. On Sparse Intermediate Representations: Some Structural Properties and
Applications to Just In Time Compilation. Habilitation à diriger des recherches, ENS Lyon,
December 2012. 127, 128
249. Fabrice Rastello, François de Ferrière, and Christophe Guillon. Optimizing translation
out of SSA using renaming constraints. In International Symposium on Code Generation
and Optimization (CGO’04), pages 265–278. IEEE Computer Society Press, 2004. 254, 265,
314
250. B. Ramakrishna Rau. Iterative Modulo Scheduling. International Journal of Parallel
Programming, 24(1):3–65, 1996. 262
251. Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet,
Atanas Rountev, and P. Sadayappan. Register optimizations for stencils on GPUs. In
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, PPoPP ’18, pages 168–182. ACM, 2018. 339
252. John H. Reppy. Optimizing nested loops using local CPS conversion. Higher-Order and
Symbolic Computation, 15(2-3):161–180, 2002. 90
253. John C. Reynolds. Definitional interpreters for higher-order programming languages. In
Proceedings of the 25th ACM National Conference, pages 717–740, 1972. Reprinted in
Higher-Order and Symbolic Computation 11(4):363-397,1998. 90
254. John C. Reynolds. On the relation between direct and continuation semantics. In Proceed-
ings of the 2nd Colloquium on Automata, Languages and Programming, pages 141–156,
London, UK, 1974. Springer. 90
255. John C. Reynolds. The discoveries of continuations. Lisp and Symbolic Computation,
6(3-4):233–248, 1993. 90
256. Laurence Rideau, Bernard P. Serpette, and Xavier Leroy. Tilting at windmills with Coq:
Formal verification of a compilation algorithm for parallel moves. Journal of Automated
Reasoning, 40(4):307–326, 2008. 90
257. Andrei Alves Rimsa, Marcelo D’Amorim, and Fernando M. Q. Pereira. Tainted flow analysis
on e-SSA-form programs. In CC, pages 124–143. Springer, 2011. 188
258. B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Global value numbers and redundant
computations. In 5th ACM SIGPLAN-SIGACT symposium on Principles of Programming
Languages, pages 12–27. ACM Press, January 1988. 158, 314
259. Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Global value numbers and
redundant computations. In POPL ’88: Proceedings of the 15th ACM SIGPLAN-SIGACT
symposium on Principles of programming languages, pages 12–27, New York, NY, USA,
1988. ACM. 11
260. Erik Ruf. Optimizing sparse representations for dataflow analysis. In IR ’95: Proceedings
of the Workshop on Intermediate Representations, pages 50–61, New York, NY, USA, 1995.
ACM. 111
261. Barbara G. Ryder and Marvin C. Paull. Elimination Algorithms for Data Flow Analysis.
ACM Computing Surveys, 18(3):277–316, September 1986. 127
262. Vivek Sarkar and Rajkishore Barik. Extended linear scan: An alternate foundation for
global register allocation. In International Conference on Compiler Construction, pages
141–155. Springer, 2007. 337, 338
263. Vivek Sarkar and Stephen Fink. Efficient dependence analysis for java arrays. In Rizos
Sakellariou, John Gurd, Len Freeman, and John Keane, editors, Euro-Par 2001 Parallel
Processing, volume 2150 of Lecture Notes in Computer Science, pages 273–277. Springer
Berlin / Heidelberg, 2001. 247
264. Stefan Schäfer and Bernhard Scholz. Optimal chain rule placement for instruction
selection based on SSA graphs. In SCOPES ’07: Proceedings of the 10th international
workshop on Software & compilers for embedded systems, pages 91–100, New York, NY,
USA, 2007. ACM. 203, 279
265. Michael Schlansker, Scott Mahlke, and Richard Johnson. Control CPR: a branch height
reduction optimization for EPIC architectures. In Proceedings of the ACM SIGPLAN 1999
conference on Programming language design and implementation, PLDI ’99, pages 155–
168, New York, NY, USA, 1999. ACM. 263
266. Nat Seshan. High VelociTI processing. IEEE Signal Processing Magazine, pages 86–101,
1998. 257
267. Tom Shanley. x86 Instruction Set Architecture. MindShare Press, 2010. 256
268. Robert M. Shapiro and Harry Saint. The representation of algorithms. Technical Report
RADC-TR-69-313, Rome Air Development Center, September 1969. 42
269. B. Simons, D. Alpern, and J. Ferrante. A foundation for sequentializing parallel code. In
SPAA ’90: Proceedings of the second annual ACM symposium on Parallel algorithms and
architectures, pages 350–359, New York, NY, USA, 1990. ACM. 204
270. Jeremy Singer. Sparse bidirectional data flow analysis as a basis for type inference. In
APPSEM ’04: Web Proceedings of the Applied Semantics Workshop, 2004. 110
271. Jeremy Singer. Static Program Analysis Based on Virtual Register Renaming. PhD thesis,
University of Cambridge, 2005. 110, 127, 187, 188
272. Michael D. Smith, Norman Ramsey, and Glenn H. Holloway. A generalized algorithm
for graph-coloring register allocation. In William Pugh and Craig Chambers, editors,
Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and
Implementation 2004, Washington, DC, USA, June 9-11, 2004, pages 277–288. ACM, 2004.
338
273. Vugranam C. Sreedhar and Guang R. Gao. A Linear Time Algorithm for Placing φ-nodes. In
Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, pages 62–73, 1995. 42, 57
274. Vugranam C. Sreedhar, Guang R. Gao, and Yong-fong Lee. Incremental Computation
of Dominator Trees. In Papers from the 1995 ACM SIGPLAN workshop on Intermediate
representations, pages 1–12, New York, NY, USA, 1995. ACM. 66
275. Vugranam C. Sreedhar, Guang R. Gao, and Yong-fong Lee. A New Framework for Exhaus-
tive and Incremental Data Flow Analysis using DJ Graphs. In Proceedings of the ACM
SIGPLAN 1996 conference on Programming Language Design and Implementation, pages
278–290, New York, NY, USA, 1996. ACM. 66
276. Vugranam C. Sreedhar, Guang R. Gao, and Yong-Fong Lee. Identifying Loops Using DJ
Graphs. ACM Transactions on Programming Languages and Systems, 18(6):649–658, 1996.
91
277. Vugranam C. Sreedhar, Roy Dz-Ching Ju, David M. Gillies, and Vatsa Santhanam. Trans-
lating out of static single assignment form. In SAS ’99: Proc. of the 6th International
Symposium on Static Analysis, pages 194–210, London, UK, 1999. Springer-Verlag. 24,
218, 256, 258, 259, 264, 265, 313
278. R.M. Stallman and GCC Dev. Community. GCC 7.0 GNU Compiler Collection Internals.
Samurai Media Limited, 2017. 253
279. James Stanier. Removing and Restoring Control Flow with the Value State Dependence
Graph. PhD thesis, University of Sussex, School of Informatics, 2011. 204
280. James Stanier and Des Watson. Intermediate Representations in Imperative Compilers: A
Survey. ACM Comput. Surv., 45(3):26:1–26:27, July 2013. 253
281. Bjarne Steensgaard. Sequentializing program dependence graphs for irreducible pro-
grams. Technical Report MSR-TR-93-14, Microsoft Research, Redmond, WA, August 1993.
88
282. Mark Stephenson, Jonathan Babb, and Saman Amarasinghe. Bitwidth analysis with
application to silicon compilation. In PLDI, pages 108–120. ACM, 2000. 188
283. Eric Stoltz, Michael P. Gerlek, and Michael Wolfe. Extended SSA with factored use-def
chains to support optimization and parallelism. In Proceedings of the 27th Annual
Hawaii International Conference on System Sciences, pages 43–52, 1993. 203
284. Arthur Stoutchinin and François de Ferrière. Efficient Static Single Assignment Form
for Predication. In Proc. of the 34th annual ACM/IEEE international symposium on
Microarchitecture, MICRO 34, pages 172–181, 2001. 217, 254, 261, 262, 356
285. Artour Stoutchinin and Guang Gao. If-Conversion in SSA Form. In Marco Danelutto,
Marco Vanneschi, and Domenico Laforenza, editors, Euro-Par 2004 Parallel Processing,
volume 3149 of Lecture Notes in Computer Science, pages 336–345, 2004. 261
286. Zhendong Su and David Wagner. A class of polynomially solvable range constraints for
interval analysis without widenings. Theoretical Computer Science, 345(1):122–138,
2005. 188
287. Matthew John Surawski. Loop Optimizations for MLton. Master’s thesis, Department of
Computer Science, Rochester Institute of Technology, Rochester, New York, 2016. 91
288. Gerald Jay Sussman and Guy Lewis Steele Jr. Scheme: An interpreter for extended lambda
calculus. Technical Report AI Lab Memo AIM-349, MIT AI Lab, December 1975. Reprinted
in Higher-Order and Symbolic Computation 11(4):405-439, 1998. 90
289. David Tarditi, J. Gregory Morrisett, Perry Cheng, Christopher A. Stone, Robert Harper,
and Peter Lee. TIL: A type-directed optimizing compiler for ML. In Proceedings of the ACM
SIGPLAN’96 Conference on Programming Language Design and Implementation (PLDI),
Philadephia, Pennsylvania, May 21-24, 1996, pages 181–192, 1996. 90
290. Andre L. C. Tavares, Benoit Boissinot, Mariza A. S. Bigonha, Roberto Bigonha, Fernando
M. Q. Pereira, and Fabrice Rastello. A program representation for sparse dataflow analyses.
Science of Computer Programming, X(X):2–25, 201X. Invited paper with publication
expected for 2012. 187
291. André L. C. Tavares, Quentin Colombet, Mariza A. S. Bigonha, Christophe Guillon, Fer-
nando M. Q. Pereira, and Fabrice Rastello. Decoupled graph-coloring register allocation
with hierarchical aliasing. In Proceedings of the 14th International Workshop on Software
and Compilers for Embedded Systems, SCOPES ’11, pages 1–10. ACM, 2011. 338
292. D. Thomas and P. Moorby. The Verilog Hardware Description Language. Kluwer Academic
Publishers, Norwell, MA, USA, 1998. 341, 356
293. Sam Tobin-Hochstadt and Matthias Felleisen. The design and implementation of Typed
Scheme. In POPL, pages 395–406, 2008. 188
294. Andrew P. Tolmach and Dino Oliva. From ML to Ada: Strongly-typed language interoper-
ability via source translation. Journal of Functional Programming, 8(4):367–412, 1998.
90
295. Sid Touati and Christine Eisenbeis. Early Periodic Register Allocation on ILP Processors.
Parallel Processing Letters, 14(2):287–313, June 2004. 339
296. Omri Traub, Glenn Holloway, and Michael D. Smith. Quality and speed in linear-scan
register allocation. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming
Language Design and Implementation, PLDI ’98, pages 142–151. ACM, 1998. 337
297. J. Tripp, P. Jackson, and B. Hutchings. Sea cucumber: A synthesizing compiler for FPGAs.
In Proc. of the 12th Intl. Conf. on Field-Programmable Logic and Applications (FPL’02),
pages 875–885, London, UK, 2002. Springer-Verlag. 356
298. Peng Tu and David Padua. Efficient building and placing of gating functions. In PLDI ’95:
Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and
implementation, pages 47–55, New York, NY, USA, 1995. ACM. 204
299. Peng Tu and David Padua. Gated SSA-based demand-driven symbolic analysis for paral-
lelizing compilers. In Proceedings of the 9th International Conference on Supercomputing
(ICS’ 95), pages 414–423, New York, NY, USA, 1995. ACM. 204, 356
300. Eben Upton. Optimal sequentialization of gated data dependence graphs is NP-complete.
In PDPTA, pages 1767–1770, 2003. 204
301. Eben Upton. Compiling with data dependence graphs, 2006. 204
302. R. A. van Engelen. Efficient symbolic analysis for optimizing compilers. Proceedings of
the International Conference on Compiler Construction (ETAPS CC’01), pages 118–132,
2001. 137
303. Adriaan van Wijngaarden. Recursive definition of syntax and semantics. In T. B. Steel Jr.,
editor, Formal Language Description Languages for Computer Programming, pages 13–24.
North-Holland, 1966. 90
304. Thomas Vandrunen and Antony L. Hosking. Anticipation-based partial redundancy elimi-
nation for static single assignment form. Software — Practice and Experience, 34(15):1413–
1439, December 2004. 158
305. Thomas Vandrunen and Antony L. Hosking. Value-based partial redundancy elimination.
In Proceedings of the 13th International Conference on Compiler Construction, pages
167–184, 2004. 158
306. A. Verma, P. Brisk, and P. Ienne. Dataflow transformations to maximize the use of carry-
save representation in arithmetic circuits. IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, 27(10):1761–1774, October 2008. 356
307. Christopher P. Wadsworth. Continuations revisited. Higher-Order and Symbolic Compu-
tation, 13(1/2):131–133, 2000. 90
308. Mitchell Wand. Loops in Combinator-Based Compilers. Information and Control,
57(2/3):148–164, 1983. 255
309. Jian Wang, Andreas Krall, M. Anton Ertl, and Christine Eisenbeis. Software pipelining
with register allocation and spilling. In Proceedings of the 27th Annual International
Symposium on Microarchitecture, MICRO 27, pages 95–99. ACM, 1994. 339
310. Gary Wassermann and Zhendong Su. Sound and precise analysis of web applications
for injection vulnerabilities. In Proceedings of the 2007 ACM SIGPLAN conference on
Programming language design and implementation, PLDI ’07, pages 32–41, New York,
NY, USA, 2007. ACM. 370
311. Mark N. Wegman and F. Kenneth Zadeck. Constant propagation with conditional
branches. ACM Transactions on Programming Languages and Systems, 13(2):181–210,
1991. 91, 110, 217
312. Daniel Weise, Roger F. Crew, Michael Ernst, and Bjarne Steensgaard. Value dependence
graphs: representation without taxation. In POPL ’94: Proceedings of the 21st ACM
SIGPLAN-SIGACT symposium on Principles of programming languages, pages 297–310,
New York, NY, USA, 1994. ACM. 204
313. Christian Wimmer and Hanspeter Mössenböck. Optimized interval splitting in a linear
scan register allocator. In Proceedings of the 1st ACM/USENIX international conference
on Virtual execution environments, pages 132–141. ACM, 2005. 337
314. M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Read-
ing, MA., 1996. 137
315. Michael Wolfe. Beyond induction variables. In ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI ’92), pages 162–174. ACM, 1992. 137, 203,
217
316. Michael Wolfe. J+=j. SIGPLAN Not., 29(7):51–53, 1994. 24
317. Peng Wu, Albert Cohen, and David Padua. Induction variable analysis without idiom
recognition: Beyond monotonicity. In Proceedings of the 14th International Workshop on
Languages and Compilers for Parallel Computing, volume 2624 of Lecture Notes in Computer
Science, 2001. 137
318. J. Xue and Q. Cai. A lifetime optimal algorithm for speculative PRE. ACM Transactions on
Architecture and Code Optimization, 3(2):115–155, 2006. 157
319. J. Xue and J. Knoop. A fresh look at PRE as a maximum flow problem, 2006. 157
320. Byung-Sun Yang, Soo-Mook Moon, Seongbae Park, Junpyo Lee, SeungIl Lee, Jinpyo Park,
Yoo C Chung, Suhyun Kim, Kemal Ebcioglu, and Erik Altman. LaTTe: A Java VM just-in-
time compiler with fast and efficient register allocation. In Parallel Architectures and
Compilation Techniques, 1999. Proceedings. 1999 International Conference on, pages
128–138. IEEE, 1999. 337
321. J. Yeung, C. Tsang, K. Tsoi, B. Kwan, C. Cheung, A. Chan, and P. Leong. Map-reduce as
a programming model for custom computing machines. In Proc. of the 2008 16th Intl.
Symp. on Field-Programmable Custom Computing Machines (FCCM’08), pages 149–159,
Washington, DC, USA, 2008. IEEE Computer Society. 357
322. Cliff Young and Michael D. Smith. Better global scheduling using path profiles. In
Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture,
MICRO 31, pages 115–123, Los Alamitos, CA, USA, 1998. IEEE Computer Society Press.
263
323. Frank Kenneth Zadeck. Incremental Data Flow Analysis in a Structured Program Editor.
PhD thesis, Rice University, 1984. 187
324. H. Zhou, W.G. Chen, and F. Chow. An SSA-based algorithm for optimal speculative code
motion under an execution profile. In Proceedings of the ACM SIGPLAN ’11 Conference
on Programming Language Design and Implementation, 2011. 157
325. Hucheng Zhou, Wenguang Chen, and Fred C. Chow. An SSA-based algorithm for optimal
speculative code motion under an execution profile. In Proceedings of the 32nd ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2011),
pages 98–108. ACM, 2011. 230
326. Eugene V. Zima. On computational properties of chains of recurrences. In Proceedings of
the 2001 international symposium on symbolic and algebraic computation, pages 345–352.
ACM Press, 2001. 137