Formalizing The LLVM Intermediate Representation For Verified Program Transformations
Abstract

This paper presents Vellvm (verified LLVM), a framework for reasoning about programs expressed in LLVM’s intermediate representation and transformations that operate on it. Vellvm provides a mechanized formal semantics of LLVM’s intermediate representation, its type system, and properties of its SSA form. The framework is built using the Coq interactive theorem prover. It includes multiple operational semantics and proves relations among them to facilitate different reasoning styles and proof techniques.

To validate Vellvm’s design, we extract an interpreter from the Coq formal semantics that can execute programs from the LLVM test suite and thus be compared against LLVM reference implementations. To demonstrate Vellvm’s practicality, we formalize and verify a previously proposed transformation that hardens C programs against spatial memory safety violations. Vellvm’s tools allow us to extract a new, verified implementation of the transformation pass that plugs into the real LLVM infrastructure; its performance is competitive with the non-verified, ad-hoc original.

Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification - Correctness Proofs; F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs - Mechanical verification; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages - Operational semantics

General Terms Languages, Verification, Reliability

Keywords LLVM, Coq, memory safety

1. Introduction

Compilers perform their optimizations and transformations over an intermediate representation (IR) that hides details about the target execution platform. Rigorously proving properties about these IR transformations requires that the IR itself have a well-defined formal semantics. Unfortunately, the IRs used in mainstream production compilers generally do not. To address this deficiency, this paper formalizes both the static and dynamic semantics of the IR that forms the heart of the LLVM compiler infrastructure.

LLVM [19] (Low-Level Virtual Machine) uses a platform-independent SSA-based IR originally developed as a research tool for studying optimizations and modern compilation techniques [16]. The LLVM project has since blossomed into a robust, industrial-strength, and open-source compilation platform that competes with GCC in terms of compilation speed and performance of the generated code [16]. As a consequence, it has been widely used in both academia and industry.

An LLVM-based compiler is structured as a translation from a high-level source language to the LLVM IR (see Figure 1). The LLVM tools provide a suite of IR to IR translations, which provide optimizations, program transformations, and static analyses. The resulting LLVM IR code can then be lowered to a variety of target architectures, including x86, PowerPC, and ARM (either by static compilation or dynamic JIT-compilation). The LLVM project focuses on C and C++ front-ends, but many source languages, including Haskell, Scheme, Scala, Objective C and others have been ported to target the LLVM IR.

Figure 1. The LLVM compiler infrastructure. Source languages (C, C++, Haskell, ObjC, ObjC++, Scheme, Scala, ...) are translated to LLVM IR, which is processed by Optimizations/Transformations and Program analysis; the Code Generator/JIT then targets Alpha, ARM, PowerPC, Sparc, X86, Mips, ...

This paper introduces Vellvm (for verified LLVM), a framework that includes a formal semantics and associated tools for mechanized verification of LLVM IR code, IR to IR transformations, and analyses. The description of this framework in this paper is organized into two parts.

The first part formalizes the LLVM IR. It presents the LLVM syntax and static properties (Section 2), including a variety of well-formedness and structural properties about LLVM’s static single assignment (SSA) representation that are useful in proofs about LLVM code and transformation passes. Vellvm’s memory model (Section 3) is based on CompCert’s [18], extended to handle LLVM’s arbitrary bit-width integers, padding, and alignment issues. In developing the operational semantics (Section 4), a significant challenge is adequately capturing the nondeterminism that arises due to LLVM’s explicit undef value and its intentional underspecification of certain erroneous behaviors such as reading from uninitialized memory; this underspecification is needed to justify the correctness of aggressive optimizations. Vellvm therefore implements several related operational semantics, including a nondeterministic semantics and several deterministic refinements to facilitate different proof techniques and reasoning styles.

∗ This research was funded in part by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
POPL’12, January 25–27, 2012, Philadelphia, PA, USA.
Copyright © 2012 ACM 978-1-4503-1083-3/12/01...$10.00
The second part of the paper focuses on the utility of formalizing the LLVM IR. We describe Vellvm’s implementation in Coq [10] and validate its LLVM IR semantics by extracting an executable interpreter and comparing its behavior to that of the LLVM reference interpreter and compiled code (Section 5). The Vellvm framework provides support for moving code between LLVM’s IR representation and its Coq representation. This infrastructure, along with Coq’s facility for extracting executable code from constructive proofs, enables Vellvm users to manipulate LLVM IR code with high confidence in the results. For example, using this framework, we can extract verified LLVM transformations that plug directly into the LLVM compiler. We demonstrate the effectiveness of this technique by using Vellvm to implement a verified instance of SoftBound [21], an LLVM pass that hardens C programs against buffer overflows and other memory safety violations (Section 6).

To summarize, this paper and the Vellvm framework provide:

• A formalization of the LLVM IR, its static semantics, memory model, and several operational semantics;
• Metatheoretic results (preservation and progress theorems) relating the static and dynamic semantics;
• Coq infrastructure implementing the above, along with tools for interacting with the LLVM compiler;
• Validation of the semantics in the form of an extracted LLVM interpreter; and
• Demonstration of applying this framework to extract a verified transformation pass for enforcing spatial memory-safety.

2. Static Properties of the LLVM IR

The LLVM IR is a typed, static single assignment (SSA) [13] language that is a suitable representation for expressing many compiler transformations and optimizations. This section describes the syntax and basic static properties, emphasizing those features that are either unique to the LLVM or have non-trivial implications for the formalization. Vellvm’s formalization is based on the LLVM release version 2.6, and the syntax and semantics are intended to model the behavior as described in the LLVM Language Reference,¹ although we also used the LLVM IR reference interpreter and the x86 backend to inform our design.

¹ See http://llvm.org/releases/2.6/docs/LangRef.html

2.1 Language syntax

Figure 3 shows (a fragment of) the abstract syntax for the subset of the LLVM IR formalized in Vellvm. The metavariable id ranges over LLVM identifiers, written %X, %T, %a, %b, etc., which are used to name local types and temporary variables, and @a, @b, @main, etc., which name global values and functions.

Each source file is a module mod that includes data layout information layout (which defines sizes and alignments for types; see below), named types, and a list of prods that can be function declarations, function definitions, and global variables. Figure 2 shows a small example of LLVM syntax (its meaning is described in more detail in Section 3).

    %ST = type { i10 , [10 x i8*] }

    define %ST* @foo(i8* %ptr) {
    entry:
      %p = malloc %ST, i32 1
      %r = getelementptr %ST* %p, i32 0, i32 0
      store i10 648, %r ; decomposes as 136, 2
      %s = getelementptr %ST* %p, i32 0, i32 1, i32 0
      store i8* %ptr, %s
      ret %ST* %p
    }

Figure 2. An example use of LLVM’s memory operations. Here, %p is a pointer to a single-element array of structures of type %ST. Pointer %r indexes into the first component of the first element in the array, and has type i10*, as used by the subsequent store, which writes the 10-bit value 648. Pointer %s has type i8** and points to the first element of the nested array in the same structure.

Every LLVM expression has a type, which can easily be determined from type annotations that provide sufficient information to check an LLVM program for type compatibility. The LLVM IR is not a type-safe language, however, because its type system allows arbitrary casts, calling functions with incorrect signatures, accessing invalid memory, etc. The LLVM type system ensures only that the size of a runtime value in a well-formed program is compatible with the type of the value; a well-formed program can still be stuck (see Section 4.3).

Types typ include arbitrary bit-width integers i8, i16, i32, etc., or, more generally, isz where sz is a natural number. Types also include float, void, pointers typ∗, and arrays [ sz × typ ] that have a statically-known size sz. Anonymous structure types { $\overline{typ_j}^{j}$ } contain a list of types. Functions typ $\overline{typ_j}^{j}$ have a return type and a list of argument types. Here, $\overline{typ_j}^{j}$ denotes a list of typ components; we use similar notation for other lists throughout the paper. Finally, types can be named by identifiers id, which is useful to define recursive types.

The sizes and alignments for types, and endianness, are defined in layout. For example, int sz align0 align1 dictates that values with type isz are align0-byte aligned when they are within an aggregate and when used as an argument, and align1-byte aligned when emitted as a global.

Operations in the LLVM IR compute with values val, which are either identifiers id naming temporaries, or constants cnst computed from statically-known data, using the compile-time analogs of the commands described below. Constants include base values (i.e., integers or floats of a given bit width), and zero-values of a given type, as well as structures and arrays built from other constants.

To account for uninitialized variables and to allow for various program optimizations, the LLVM IR also supports a type-indexed undef constant. Semantically, undef stands for a set of possible bit patterns, and LLVM compilers are free to pick convenient values for each occurrence of undef to enable aggressive optimizations or program transformations. As described in Section 4, the presence of undef makes the LLVM operational semantics inherently nondeterministic.

All code in the LLVM IR resides in top-level functions, whose bodies are composed of blocks b. As in classic compiler representations, a basic block consists of a labeled entry point l, a series of φ nodes, a list of commands, and a terminator instruction. As is usual in SSA representations, the φ nodes join together values from a list of predecessor blocks of the control-flow graph: each φ node takes a list of (value, label) pairs that indicates the value chosen when control transfers from a predecessor block with the associated label. Block terminators (br and ret) branch to another block or return (possibly with a value) from the current function. Terminators also include the unreachable marker, indicating that control should never reach that point in the program.

The core of the LLVM instruction set is its commands (c), which include the usual suite of binary arithmetic operations (bop, e.g., add, lshr, etc.), memory accessors (load, store), heap operations (malloc and free), stack allocation (alloca), conversion operations among integers, floats and pointers (eop, trop, and cop), comparison over integers (icmp and select), and calls (call).
Modules     mod, P  ::= layout namedt prod
Layouts     layout  ::= bigendian | littleendian | ptr sz align0 align1 | int sz align0 align1
                      | float sz align0 align1 | aggr sz align0 align1 | stack sz align0 align1
Products    prod    ::= id = global typ const align | define typ id (arg){b} | declare typ id (arg)
Floats      fp      ::= float | double
Types       typ     ::= isz | fp | void | typ∗ | [ sz × typ ] | { $\overline{typ_j}^{j}$ } | typ $\overline{typ_j}^{j}$ | id
Values      val     ::= id | cnst
Binops      bop     ::= add | sub | mul | udiv | sdiv | urem | srem | shl | lshr | ashr | and | or | xor
Float ops   fbop    ::= fadd | fsub | fmul | fdiv | frem
Extension   eop     ::= zext | sext | fpext
Cast op     cop     ::= fptoui | ptrtoint | inttoptr | bitcast
Trunc op    trop    ::= truncint | truncfp
Constants   cnst    ::= isz Int | fp Float | typ ∗ id | (typ∗) null | typ zeroinitializer
                      | typ[ $\overline{cnst_j}^{j}$ ] | { $\overline{cnst_j}^{j}$ } | typ undef
                      | bop cnst1 cnst2 | fbop cnst1 cnst2 | trop cnst to typ | eop cnst to typ
                      | cop cnst to typ | getelementptr cnst $\overline{cnst_j}^{j}$ | select cnst0 cnst1 cnst2
                      | icmp cond cnst1 cnst2 | fcmp fcond cnst1 cnst2
Blocks      b       ::= l φ c tmn
φ nodes     φ       ::= id = phi typ $\overline{[val_j, l_j]}^{j}$
Tmns        tmn     ::= br val l1 l2 | br l | ret typ val | ret void | unreachable
Commands    c       ::= id = bop (int sz) val1 val2 | id = fbop fp val1 val2 | id = load (typ∗) val1 align
                      | store typ val1 val2 align | id = malloc typ val align | free (typ∗) val
                      | id = alloca typ val align | id = trop typ1 val to typ2 | id = eop typ1 val to typ2
                      | id = cop typ1 val to typ2 | id = icmp cond typ val1 val2 | id = select val0 typ val1 val2
                      | id = fcmp fcond fp val1 val2 | option id = call typ0 val0 param
                      | id = getelementptr (typ∗) val $\overline{val_j}^{j}$

Figure 3. Syntax for LLVM. Note that this figure omits some syntax definitions (e.g., cond, the comparison operators) for the sake of space; they are, of course, present in Vellvm’s implementation. Some other parts of the LLVM have been omitted from the Vellvm development; these are discussed in Section 5.
Note that a call site is allowed to ignore the return value of a function call. Finally, getelementptr computes pointer offsets into structured datatypes based on their types; it provides a platform- and layout-independent way of performing array indexing, struct field access, and pointer arithmetic.

2.2 Static semantics

Following the LLVM IR specification, Vellvm requires that every LLVM program satisfy certain invariants to be considered well formed: every variable in a function is well-typed, well-scoped, and assigned exactly once. At a minimum, any reasonable LLVM transformation must preserve these invariants; together they imply that the program is in SSA form [13].

All the components in the LLVM IR are annotated with types, so the typechecking algorithm is straightforward and determined only by local information. The only subtlety is that types themselves must be well formed. All typs except void and function types are considered to be first class, meaning that values of these types can be passed as arguments to functions. A set of first-class type definitions is well formed if there are no degenerate cycles in their definitions (i.e., every cycle through the definitions is broken by a pointer type). This ensures that the physical sizes of such typs are positive, finite, and known statically.

The LLVM IR has two syntactic scopes, a global scope and a function scope, and does not have nested local scopes. In the global scope, all named types, global variables and functions have different names, and are defined mutually. In the scope of a function fid in module mod, all the global identifiers in mod, the names of arguments, locally defined variables and block labels in the function fid must be unique, which enforces the single-assignment part of the SSA property.

The set of blocks making up a function constitutes a control-flow graph with a well-defined entry point. All instructions in the function must satisfy the SSA scoping invariant with respect to the control-flow graph: the instruction defining an identifier must dominate all the instructions that use it. Within a block, insn1 dominates insn2 if insn1 appears before insn2 in program order. A block labeled l1 dominates a block labeled l2 if every execution path from the program entry to l2 must go through l1.

The Vellvm formalization provides an implementation of this dominator analysis using a standard dataflow fixpoint computation [14]. It also proves that the implementation is correct, as stated in the following lemma, which is needed to establish preservation of the well-formedness invariants by the operational semantics (see Section 4).

LEMMA 1 (Dominator Analysis Correctness).
• The entry block of a function dominates itself.
• Given a block b2 that is an immediate successor of b1, all the strict dominators of b2 also dominate b1.

These well-formedness constraints must hold only of blocks that are reachable from a function’s entry point; unreachable code may contain ill-typed and ill-scoped instructions.

3. A Memory Model for Vellvm

3.1 Rationale

Understanding the semantics of LLVM’s memory operations is crucial for reasoning about LLVM programs. LLVM developers make many assumptions about the “legal” behaviors of such LLVM code, and they informally use those assumptions to justify the correctness of program transformations.

There are many properties expected of a reasonable implementation of the LLVM memory operations (especially in the absence of errors). For example, we can reasonably assume that the load instruction does not affect which memory addresses are allocated, or that different calls to malloc do not inappropriately reuse memory locations. Unfortunately, the LLVM Language Reference Manual does not enumerate all such properties, which should hold of any “reasonable” memory implementation.

On the other hand, details about the particular memory management implementation can be observed in the behavior of LLVM programs (e.g., you can print a pointer after casting it to an integer). For this reason, and also to address error conditions, the LLVM specification intentionally leaves some behaviors undefined. Examples include: loading from an unallocated address; loading with improper alignment; loading from properly allocated but uninitialized memory; and loading from properly initialized memory but with an incompatible type.

Because of the dependence on a concrete implementation of

Figure 4 (memory state diagram for the program in Figure 2). Allocated blocks: Blk ..., Blk 5, Blk 11, Blk ..., Blk 39; next block: Blk 40. Validity markers, in that order: valid, invalid (Blk 5), valid (Blk 11), valid, valid (Blk 39), invalid (Blk 40). Blk 11, offsets 32 to 39: mb(10,136), mb(10,2), muninit, muninit, mptr(b39,24,0), mptr(b39,24,1), mptr(b39,24,2), mptr(b39,24,3). Blk 39, offsets 20 to 27: muninit, muninit, muninit, muninit, mptr(b11,32,0), mptr(b11,32,1), mptr(b11,32,2), mptr(b11,32,3). Type annotations in the diagram: i10, {i10, i8*}, i8*, [10 x i8*] (Blk 11); i32, i16* (Blk 39).
is a byte-sized chunk of such a pointer where idx is an index identifying which byte the chunk corresponds to. Because Vellvm’s implementation assumes 32-bit pointers, four such cells are needed to encode one LLVM pointer, as shown in Figure 4. Loading a pointer succeeds only if the 4 bytes loaded are sequentially indexed from 0 to 3.

The last kind of cell is muninit, which represents uninitialized memory, layout padding, and bogus values that result from undefined computations (such as might arise from an arithmetic overflow).

Given this definition of memory cells, a memory state M = (N, B, C) includes the following components: N is the next fresh block to allocate; B maps a valid block identifier to the size of the block; C maps a block identifier and an offset within the block to a memory cell (if the location is valid). Initially, N is 1; B and C are empty. Figure 4 gives a concrete example of such a memory state for the program in Figure 2.

There are four basic operations over this byte-oriented memory state: alloc, mfree, mload, and mstore. alloc allocates a fresh memory block N with a given size, increments N, and fills the newly allocated memory cells with muninit. mfree simply removes the deallocated block from B, and its contents from C. Note that the memory model does not recycle block identifiers deallocated by a mfree operation, because this model assumes that a memory is of infinite size.

The mstore operation is responsible for breaking non-byte-sized basic values into chunks and updating the appropriate memory locations. Basic values are integers (with their bit-widths), floats, addresses, and padding.

Figure 5. Relations between different operational semantics: LLVM_Interp ≈ LLVM_D & LLVM*_DFn & LLVM*_DB, with LLVM_D ∈ LLVM_ND. Each equivalence or inclusion is justified by a proof in Vellvm.

This scheme induces a notion of dynamically-checked physical subtyping: it is permitted to read a structured value at a different type from the one at which it was written, so long as the basic types they flatten into agree. For non-structured data types such as integers, Vellvm’s implementation is conservative. For example, reading an integer with bit width two from the second byte of a 10-bit wide integer yields undef because the results are, in general, platform specific. Because of this dynamically-checked, physical subtyping, pointer-to-pointer casts can be treated as the identity. Similar ideas arise in other formalizations of low-level language semantics [24, 25].

The LLVM malloc and free operations are defined by alloc and mfree in a straightforward manner. As the LLVM IR does not explicitly distinguish the heap and stack and function calls are implementation-specific, the memory model defines the same semantics for stack allocation (alloca) and heap allocation (malloc); both of them allocate memory blocks in memory. However, the operational semantics (described next) maintains a list of blocks allocated by alloca for each function, and it deallocates them on return.
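The byte-oriented memory state described in this section can be illustrated with a small executable sketch. The Python below is not Vellvm's Coq development; the class `Memory` and the methods `mstore_int` and `mload_int` are hypothetical names chosen for illustration, only integer cells are modeled, and the low-order-byte-first layout is an assumption taken from the "648 decomposes as 136, 2" comment in Figure 2. A failed load returns `None` here, standing in for the cases where the text says a read yields undef.

```python
from dataclasses import dataclass, field

MUNINIT = ("muninit",)  # uninitialized memory or padding cell


@dataclass
class Memory:
    """Sketch of a memory state M = (N, B, C)."""
    next_block: int = 1                         # N: next fresh block
    sizes: dict = field(default_factory=dict)   # B: block -> size in bytes
    cells: dict = field(default_factory=dict)   # C: (block, offset) -> cell

    def alloc(self, size):
        """Allocate fresh block N, fill it with muninit, and increment N."""
        blk = self.next_block
        self.next_block += 1
        self.sizes[blk] = size
        for ofs in range(size):
            self.cells[(blk, ofs)] = MUNINIT
        return blk

    def mfree(self, blk):
        """Remove blk from B and its cells from C; identifiers are never reused."""
        if blk not in self.sizes:
            return False
        for ofs in range(self.sizes.pop(blk)):
            del self.cells[(blk, ofs)]
        return True

    def _in_bounds(self, blk, ofs, nbytes):
        return blk in self.sizes and 0 <= ofs and ofs + nbytes <= self.sizes[blk]

    def mstore_int(self, blk, ofs, bits, value):
        """Split an integer of width `bits` into byte cells ("mb", bits, byte)."""
        nbytes = (bits + 7) // 8
        if not self._in_bounds(blk, ofs, nbytes):
            return False
        value &= (1 << bits) - 1
        for i in range(nbytes):
            # Assumption: low-order byte first, as in "648 decomposes as 136, 2".
            self.cells[(blk, ofs + i)] = ("mb", bits, (value >> (8 * i)) & 0xFF)
        return True

    def mload_int(self, blk, ofs, bits):
        """Reassemble an integer; None marks a read that this sketch rejects."""
        nbytes = (bits + 7) // 8
        if not self._in_bounds(blk, ofs, nbytes):
            return None
        value = 0
        for i in range(nbytes):
            cell = self.cells[(blk, ofs + i)]
            if cell[0] != "mb" or cell[1] != bits:
                return None  # uninitialized cell or mismatched bit width
            value |= cell[2] << (8 * i)
        return value
```

For example, after `m = Memory(); b = m.alloc(4); m.mstore_int(b, 0, 10, 648)`, the first two cells hold `("mb", 10, 136)` and `("mb", 10, 2)`, and `m.mload_int(b, 1, 2)` is rejected because the cell at that offset was written as part of a 10-bit integer, mirroring the conservative treatment of non-structured types described above.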
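Configurations:
  Fun tables       θ ::= v ↦ id
  Globals          g ::= id ↦ v
  Configurations   config ::= mod, g, θ

Nondeterministic Machine States:
  Value sets       V ::= {v | Φ(v)}
  Locals           ∆ ::= id ↦ V
  Allocas          α ::= [] | blk, α
  Frames           Σ ::= fid, l, c, tmn, ∆, α
  Call stacks      $\overline{\Sigma}$ ::= [] | Σ, $\overline{\Sigma}$
  Program states   S ::= M, $\overline{\Sigma}$

Judgment form: $config \vdash S \longrightarrow S'$

NDS_CALL
\[
\frac{
  \begin{array}{c}
    eval_{ND}(g, \Delta, val) = \lfloor V \rfloor \qquad v \in V \qquad
    findfdef(mod, \theta, v) = \lfloor \texttt{define}\ typ\ fid'(arg)\{(l'\ []\ c'\ tmn'),\, b\} \rfloor \\
    initlocals(g, \Delta, arg, param) = \lfloor \Delta' \rfloor \qquad
    c_0 = (option\ id = \texttt{call}\ typ\ val\ param)
  \end{array}
}{
  mod, g, \theta \vdash M, ((fid, l, (c_0, c), tmn, \Delta, \alpha), \overline{\Sigma})
  \longrightarrow
  M, ((fid', l', c', tmn', \Delta', []), (fid, l, (c_0, c), tmn, \Delta, \alpha), \overline{\Sigma})
}
\]

NDS_MALLOC
\[
\frac{
  eval_{ND}(g, \Delta, val) = \lfloor V \rfloor \qquad v \in V \qquad
  c_0 = (id = \texttt{malloc}\ typ\ val\ align) \qquad
  malloc(M, typ, v, align) = \lfloor M', blk \rfloor
}{
  mod, g, \theta \vdash M, ((fid, l, (c_0, c), tmn, \Delta, \alpha), \overline{\Sigma})
  \longrightarrow
  M', ((fid, l, c, tmn, \Delta\{id \leftarrow \{blk.0\}\}, \alpha), \overline{\Sigma})
}
\]

NDS_ALLOCA
\[
\frac{
  eval_{ND}(g, \Delta, val) = \lfloor V \rfloor \qquad v \in V \qquad
  c_0 = (id = \texttt{alloca}\ typ\ val\ align) \qquad
  malloc(M, typ, v, align) = \lfloor M', blk \rfloor
}{
  mod, g, \theta \vdash M, ((fid, l, (c_0, c), tmn, \Delta, \alpha), \overline{\Sigma})
  \longrightarrow
  M', ((fid, l, c, tmn, \Delta\{id \leftarrow \{blk.0\}\}, (blk, \alpha)), \overline{\Sigma})
}
\]

NDS_BOP
\[
\frac{
  eval_{ND}(g, \Delta, val_1) = \lfloor V_1 \rfloor \qquad
  eval_{ND}(g, \Delta, val_2) = \lfloor V_2 \rfloor \qquad
  evalbop_{ND}(bop, sz, V_1, V_2) = V_3
}{
  mod, g, \theta \vdash M, ((fid, l, (id = bop\ (\texttt{int}\ sz)\ val_1\ val_2,\ c), tmn, \Delta, \alpha), \overline{\Sigma})
  \longrightarrow
  M, ((fid, l, c, tmn, \Delta\{id \leftarrow V_3\}, \alpha), \overline{\Sigma})
}
\]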
Nondeterminism shows up in two ways in the LLVM_ND semantics. First, stack frames bind local variables to sets of values V; second, the step relation itself may relate one state to many possible successors. The semantics teases apart these two kinds of nondeterminism because of the way that the undef value interacts with memory operations, as illustrated by the examples below.

From the LLVM Language Reference Manual: “Undefined values indicate to the compiler that the program is well defined no matter what value is used, giving the compiler more freedom to optimize.” Semantically, LLVM_ND treats undef as the set of all values of a given type. For some motivating examples, consider the following code fragments:

    (a) %z = xor i8 undef undef

    (b) %x = add i8 0 undef
        %z = xor i8 %x %x

    (c) %z = or i8 undef 1

    (d) br undef %l1 %l2

The value computed for %z in example (a) is the set of all 8-bit integers: because each occurrence of undef could take on any bit pattern, the set of possible results obtained by xoring them still includes all 8-bit integers. Perhaps surprisingly, example (b) computes the same set of values for %z: one might reason that no matter which value is chosen for undef, the result of xoring %x with itself would always be 0, and therefore %z should always be 0. However, while that answer is compatible with the LLVM language reference (and hence allowed by the nondeterministic semantics), it is also safe to replace code fragment (b) with %z = undef. The reason is that the LLVM IR adopts a liberal substitution principle: because %x = undef would be a legitimate replacement for the first assignment in (b), it is allowed to substitute undef for %x throughout, which reduces the assignment to %z to the same code as in (a).

Example (c) shows why the semantics needs arbitrary sets of values. Here, %z evaluates to the set of odd 8-bit integers, which is the result of oring 1 with each element of the set {0, ..., 255}. This code snippet could therefore not safely be replaced by %z = undef; however, it could be optimized to %z = 1 (or any other odd 8-bit integer).

Example (d) illustrates the interaction between the set semantics for local values and the nondeterminism of the step relation. The control state of the machine holds definite information, so when a branch occurs, there may be multiple successor states. Similarly, we choose to model memory cells as holding definite values, so when writing a set to memory, there is one successor state for each possible value that could be written. As an example of that interaction, consider the following example program, which was posted to the LLVMdev mailing list, that reads from an uninitialized memory location:

    %buf = alloca i32
    %val = load i32* %buf
    store i32 10, i32* %buf
    ret %val

The LLVM mem2reg pass optimizes this program to program (a) below, though according to the LLVM semantics, it would also be admissible to replace this program with option (b) (perhaps to expose yet more optimizations):

    (a) ret i32 10        (b) ret i32 undef
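The set-based reading of the undef code fragments (a) through (c) in this section can be checked with a short executable sketch. The Python below is an illustration only, not Vellvm's evalbop_ND: the names `ALL_I8`, `const`, and `evalbop` are hypothetical, values are restricted to 8-bit integers, and the only behavior taken from the text is that undef denotes every value of the type and that a binary operation combines all pairs drawn from its input sets.

```python
import operator

BITS = 8
MASK = (1 << BITS) - 1
ALL_I8 = frozenset(range(1 << BITS))  # undef at type i8: every bit pattern


def const(n):
    """A defined constant denotes a singleton value set."""
    return frozenset({n & MASK})


def evalbop(op, v1, v2):
    """Apply op to every pair drawn from the two input sets (8-bit wraparound)."""
    return frozenset(op(a, b) & MASK for a in v1 for b in v2)


# (a) %z = xor i8 undef undef
z_a = evalbop(operator.xor, ALL_I8, ALL_I8)

# (b) %x = add i8 0 undef
#     %z = xor i8 %x %x
x_b = evalbop(operator.add, const(0), ALL_I8)
z_b = evalbop(operator.xor, x_b, x_b)

# (c) %z = or i8 undef 1
z_c = evalbop(operator.or_, ALL_I8, const(1))

# A replacement is admissible in this sketch when its value set is contained
# in the original one: %z = 1 refines (c), but %z = undef does not.
one_refines_c = const(1) <= z_c
undef_refines_c = ALL_I8 <= z_c
```

In this sketch `z_a` and `z_b` both equal `ALL_I8`, while `z_c` is exactly the set of odd 8-bit integers, so `one_refines_c` is true and `undef_refines_c` is false. Treating both operands of the xor in (b) as independent draws from the set bound to %x is what makes (b) agree with (a).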
4.2 Nondeterministic operational semantics of the SSA form

The LLVM_ND semantics we have developed for Vellvm (and the others described below) is parameterized by a configuration, which is a triple of a module containing the code, a (partial) map g that gives the values of global constants, and a function pointer table θ that is a (partial) map from values to function identifiers (see the top of Figure 6). The globals and function pointer maps are initialized from the module definition when the machine is started.

The LLVM_ND rules relate machine states to machine states, where a machine state takes the form of a memory M (from Section 3) and a stack of evaluation frames. The frames keep track of the (sets of) values bound to locally-allocated temporaries and which instructions are currently being evaluated. Figure 6 shows a selection of evaluation rules from the development.

Most of the commands of the LLVM have a straightforward interpretation: the arithmetic, logic, and data manipulation instructions are all unsurprising. The eval_ND function computes a set of flattened values from the global state, the local state, and an LLVM val, looking up the meanings of variables in the local state as needed; similarly, evalbop_ND implements binary operations, computing the result set by combining all possible pairs drawn from its input sets. LLVM_ND’s malloc behaves as described in Section 3, while load uses the memory model’s ability to detect ill-typed and uninitialized reads and, in the case of such errors, yields undef as the result. Function calls push a new stack frame whose initial local bindings are computed from the function parameters. The α component of the stack frame keeps track of which blocks of memory are created by the alloca instruction (see rule NDS_ALLOCA); these are freed when the function returns (rule NDS_RET).

There is one other wrinkle in specifying the operational semantics when compared to a standard environment-passing call-by-value language. All of the φ instructions for a block must be executed atomically and with respect to the “old” local value mapping due to the possibility of self loops and dependencies among the φ nodes. For example, the well-formed code fragment below has a circular dependency between %x and %z:

    blk:
      %x = phi i32 [ %z, %blk ], [ 0, %pred ]
      %z = phi i32 [ %x, %blk ], [ 1, %pred ]
      %b = icmp leq %x %z
      br %b %blk %succ

If control enters this block from %pred, %x will map to 0 and %z to 1, which causes the conditional branch to succeed, jumping back to the label %blk. The new values of %x and %z should be 1 and 0, and not 1 and 1, as might be computed if they were handled sequentially. This update of the local state is handled by the computephinodes_ND function in the operational semantics, as shown, for example, in rule NDS_BR_TRUE.

4.3 Partiality, preservation, and progress

Throughout the rules the “lift” notation f(x) = ⌊v⌋ indicates that a partial function f is defined on x with value v. As seen by the frequent uses of lifting, both the nondeterministic and deterministic semantics are partial: the program may get stuck.

Some of this partiality is related to well-formedness of the SSA program. For example, eval_ND(g, ∆, %x) is undefined if %x is not bound in ∆. These kinds of errors are ruled out by the static well-formedness constraints imposed by the LLVM IR (Section 2).

In other cases, we have chosen to use partiality in the operational semantics to model certain failure modes for which the LLVM specification says that the behavior of the program is undefined. These include: (1) attempting to free memory via a pointer not returned from malloc or that has already been deallocated, (2) allocating a negative amount of memory, (3) calling load or store on a pointer with bad alignment or a deallocated address, (4) trying to call a non-function pointer, or (5) trying to execute the unreachable command. We model these events by stuck states because they correspond to fatal errors that will occur in any reasonable realization of the LLVM IR by translation to a target platform. Each of these errors is precisely characterized by a predicate over the machine state (e.g., BadFree(config, S)), and the “allowed” stuck states are defined to be the disjunction of these predicates:

    Stuck(config, S) = BadFree(config, S)
                     ∨ BadLoad(config, S)
                     ∨ ...
                     ∨ Unreachable(config, S)

To see that the well-formedness properties of the static semantics rule out all but these known error configurations, we prove the usual preservation and progress theorems for the LLVM_ND semantics.

THEOREM 2 (Preservation for LLVM_ND). If (config, S) is well formed and config ⊢ S ⟶ S', then (config, S') is well formed.

Here, well-formedness includes the static scoping, typing properties, and SSA invariants from Section 2 for the LLVM code, but also requires that the local mappings ∆ present in all frames of the call stack must be inhabited (each binding contains at least one value v) and that each defined variable that dominates the current continuation is in ∆’s domain.

To show that the ∆ bindings are inhabited after the step, we prove that (1) non-undef values V are singletons; (2) undefined values from constants typ undef contain all possible values of first class types typ; (3) undefined values from loading uninitialized memory or incompatible physical data contain at least paddings indicating errors; (4) evaluation of non-deterministic values by evalbop_ND returns non-empty sets of values given non-empty inputs.

The difficult part of showing that defined variables dominate their uses in the current continuation is proving that control transfers maintain the dominance property [20]. If a program branches from a block b1 to b2, the first command in b2 can use either the falling-through variables from b1, which must be defined in ∆ by Lemma 1, or the variables updated by the φs at the beginning of b2. This latter property requires a lemma showing that computephinodes_ND behaves as expected.

THEOREM 3 (Progress for LLVM_ND). If the pair (config, S) is well formed, then either S has terminated successfully, or Stuck(config, S), or there exists S' such that config ⊢ S ⟶ S'.

This theorem holds because in a well-formed machine state, eval_ND always returns a non-empty value set V; moreover, jump targets and internal functions are always present.

4.4 Deterministic refinements

Although the LLVM_ND semantics is useful for reasoning about the validity of LLVM program transformations, Vellvm also provides LLVM_D, a deterministic, small-step refinement, along with two large-step operational semantics LLVM*_DFn and LLVM*_DB.

These different deterministic semantics are useful for several reasons: (1) they provide the basis for testing LLVM programs with a concrete implementation of memory (see the discussion about Vellvm’s extracted interpreter in the next section), (2) proving that LLVM_D is an instance of LLVM_ND and relating the small-step rules to the large-step ones provides validation of all of the semantics (i.e., we found bugs in Vellvm by formalizing multiple semantics and trying to prove that they are related), and (3) the
small- and large-step semantics have different applications when reasoning about LLVM program transformations.

Unlike LLVM_ND, the frames for these semantics map identifiers to single values, not sets, and the operational rules call deterministic variants of the nondeterministic counterparts (e.g., eval instead of evalND). To resolve the nondeterminism from undef and faulty memory operations, these semantics fix a concrete interpretation as follows:

• undef is treated as a zeroinitializer
• Reading uninitialized memory returns zeroinitializer

These choices yield unrealistic behaviors compared to what one might expect from running an LLVM program against a C-style runtime system, but the cases where this semantics differs correspond to unsafe programs. There are still many programs, namely those compiled to LLVM from type-safe languages, whose behaviors under this semantics should agree with their realizations on target platforms. Despite these differences from LLVM_ND, LLVM_D also has the preservation and progress properties.

Big-step semantics. Vellvm also provides big-step operational semantics LLVM*_DFn, which evaluates a function call as one large step, and LLVM*_DB, which evaluates each sub-block (i.e., the code between two function calls) as one large step. Big-step semantics are useful because compiler optimizations often transform multiple instructions or blocks within a function in one pass. Such transformations do not preserve the small-step semantics, making it hard to create simulations that establish correctness properties.

As a simple application of the large-step semantics, consider trying to prove the correctness of a transformation that re-orders program statements that do not depend on one another. For example, the following two programs result in the same states if we consider their execution as one big step, although their intermediate states do not match in terms of the small-step semantics.

    (a) %x = add i32 %a, %b        (b) %y = load i32* %p
        %y = load i32* %p              %x = add i32 %a, %b

The proof of this claim in Vellvm uses the LLVM*_DB rules to hide the details about the intermediate states. To handle memory effects, we use a simulation relation that uses symbolic evaluation [22] to define the equivalence of two memory states. The memory contents are defined abstractly in terms of the program operations by recording the sequence of writes. Using this technique, we defined a simple translation validator to check whether the semantics of two programs are equivalent with respect to such re-orderings. For each pair of functions, the validator ensures that their control-flow graphs match and that all corresponding sub-blocks are equivalent in terms of their symbolic evaluation. This approach is similar to the translation validation used in prior work for verifying instruction scheduling optimizations [32].

Although this is a simple application of Vellvm’s large-step semantics, proving correctness of other program transformations, such as dead expression elimination and constant propagation, follows a similar pattern. The difference is that, rather than checking that two memories are syntactically equivalent according to the symbolic evaluation, we must check them with respect to a more semantic notion of equivalence [22].

Relationships among the semantics. Figure 5 illustrates how these various operational semantics relate to one another. Vellvm provides proofs that LLVM*_DB simulates LLVM*_DFn and that LLVM*_DFn simulates LLVM_D. In these proofs, simulation is taken to mean that the machine states are syntactically identical at corresponding points during evaluation. For example, the state at a function call of a program running on the LLVM*_DFn semantics matches the corresponding state at the function call reached in LLVM_D. Note that in the deterministic setting, one-direction simulation implies bisimulation [18]. Moreover, LLVM_D is a refinement instance of the nondeterministic LLVM_ND semantics.

These relations are useful because the large-step semantics induce different proof styles than the small-step semantics: in particular, the induction principles obtained from the large-step semantics allow one to gloss over insignificant details of the small-step semantics.

5. Vellvm Infrastructure and Validation

This section briefly describes the Coq implementation of Vellvm and its related tools for interacting with the LLVM infrastructure. It also describes how we validate the Vellvm semantics by extracting an executable interpreter and comparing its behavior to the LLVM reference interpreter.

5.1 The Coq development

Vellvm encodes the abstract syntax from Section 2 in an entirely straightforward way using Coq’s inductive datatypes (generated in a preprocessing step via the Ott [27] tool). The implementation uses Penn’s Metatheory library [4], which was originally designed for the locally nameless representation, to represent LLVM identifiers and to reason about their freshness.

The Coq representation deviates from the full LLVM language in only a few (mostly minor) ways. In particular, the Coq representation requires that some type annotations be in normal form (e.g., the type annotation on load must be a pointer), which simplifies type checking at the IR level. The Vellvm tool that imports LLVM bitcode into Coq provides such normalization, which simply expands definitions to reach the normal form. In total, the syntax and static semantics constitute about 2500 lines of Coq definitions and proof scripts.

Vellvm’s memory model implementation extends CompCert’s with approximately 5000 lines of code to support integers with arbitrary precision, padding, and an experimental treatment of casts that has not yet been needed for any of our proofs. On top of this extended memory model, all of the operational semantics and their metatheory have been proved in Coq. In total, the development represents approximately 32,000 lines of Coq code. Checking the entire Vellvm implementation using coqc takes about 13.5 minutes on a 1.73 GHz Intel Core i7 processor with 8 GB RAM. We expect that this codebase could be significantly reduced in size by refactoring the proof structure and making it more modular.

The LLVM distribution includes primitive OCaml bindings that are sufficient to generate LLVM IR code (“bitcode” in LLVM jargon) from OCaml. To convert between the LLVM bitcode representation and the extracted OCaml representation, we implemented a library consisting of about 5200 lines of OCaml-LLVM bindings. This library also supports pretty-printing of the ASTs; this code was also useful in the extracted interpreter.

Omitted details. This paper does not discuss all of the LLVM IR features that the Vellvm Coq development supports. Most of these features are uninteresting technically but necessary to support real LLVM code: (1) the LLVM IR provides aggregate data operations (extractvalue and insertvalue) for projecting and updating the elements of structures and arrays; (2) the operational semantics supports external function calls by assuming that their behavior is specified by axioms; the implementation applies these axioms to transition program states upon calling external functions; (3) the LLVM switch instruction, which is used to compile jump tables, is lowered to the normal branch instructions that Vellvm supports by an LLVM-supported pre-processing step.

Unsupported features. Some features of LLVM are not supported by Vellvm. First, LLVM provides intrinsic functions for extending LLVM or for representing functions that have well-known names and semantics and are required to follow certain restrictions (for example, functions from standard C libraries, handling variable argument functions, etc.). Second, LLVM functions, global variables, and parameters can be decorated with attributes that denote linkage type, calling conventions, data representation, etc., which provide more information to compiler transformations than what the LLVM type system provides. Vellvm does not statically check the well-formedness of these attributes, though they should be obeyed by any valid program transformation. Third, Vellvm does not support the invoke and unwind instructions, which are used to implement exception handling, nor does it support variable argument functions. Fourth, Vellvm does not support vector types, which allow for multiple primitive data values to be computed in parallel using a single instruction.

5.2 Extracting an interpreter

To test Vellvm’s operational semantics for the LLVM IR, we used Coq’s code extraction facilities to obtain an interpreter for executing the LLVM distribution’s regression test suite. Extracting such an interpreter is one of the main motivations for developing a deterministic semantics, because evaluation under the nondeterministic semantics cannot be directly compared against actual runs of LLVM IR programs.

Unfortunately, the small-step deterministic semantics LLVM_D is defined relationally in the logical fragment of Coq, which is convenient for proofs but cannot be used to extract code. Therefore, Vellvm provides yet another operational semantics, LLVM_Interp, which is a deterministic functional interpreter implemented in the computational fragment of Coq. LLVM_Interp is proved to be bisimilar to LLVM_D, so we can port results between the two semantics.

Although one could run this extracted interpreter directly, doing so is not efficient. First, integers with arbitrary bit-width are inductively defined in Coq. This yields easy proof principles, but does not give an efficient runtime representation; floating point operations are defined axiomatically. To remedy these problems, at extraction, we realize Vellvm’s integer and floating point values by efficient C++ libraries that are a standard part of the LLVM distribution. Second, the memory model implementation of Vellvm maintains memory blocks and their associated metadata as functional lists, and it converts between byte-list and value representations at each memory access. Using the extracted data structures directly incurs tremendous performance overhead, so we replaced the memory operations of the memory model with native implementations from the C standard library. A value v in local mappings δ is boxed, and it is represented by a reference to memory that stores its content.

Our implementation faithfully runs 134 out of the 145 tests from the LLVM regression suite that lli, the LLVM distribution interpreter, can run. The missing tests cover instructions (like variable arguments) that are not yet implemented in Vellvm.

Although replacing the Coq data structures by native ones invalidates the absolute correctness guarantees one would expect from an extracted interpreter, this exercise is still valuable. In the course of carrying out this experiment, we found one severe bug in the semantics: the br instruction inadvertently swapped the true and false branches.

6. Verified SoftBound

SoftBound [21] is a previously proposed program transformation that hardens C programs against spatial memory safety violations (e.g., buffer overflows, array indexing errors, and pointer arithmetic errors). SoftBound works by first compiling C programs into the LLVM IR, and then instrumenting the program with instructions that propagate and check per-pointer metadata. SoftBound maintains base and bound metadata with each pointer, shadowing loads and stores of pointers with parallel loads and stores of their associated metadata. This instrumentation ensures that each pointer dereferenced is within bounds and aborts the program otherwise.

The original SoftBound paper includes a mechanized proof that validates the correctness of this idea, but it is not complete. In particular, the proof is based on a subset of a C-like language with only straight-line commands and non-aggregate types, while a real SoftBound implementation needs to consider all of the LLVM IR shown in Figure 3, the memory model, and the operational semantics of the LLVM. Also, the original proof ensures the correctness only with respect to a specification that the SoftBound instrumentation must implement, but does not prove the correctness of the instrumentation pass itself. Moreover, the specification requires that every temporary must contain metadata, not just pointer temporaries.

Using Vellvm to verify SoftBound. This section describes how we use Vellvm to formally verify the correctness of the SoftBound instrumentation pass with respect to the LLVM semantics, demonstrating that the promised spatial memory safety property is achieved. Moreover, Vellvm allows us to extract a verified OCaml implementation of the transformation from Coq. The end result is a compiler pass that is formally verified to transform a program in the LLVM IR into a program augmented with sufficient checking code such that it will dynamically detect and prevent all spatial memory safety violations.

SoftBound is a good test case for the Vellvm framework. It is a non-trivial translation pass that nevertheless only inserts code, thereby making it easier to prove correct. SoftBound’s intended use is to prevent security vulnerabilities, so bugs in its implementation can potentially have severe consequences. Also, the existing SoftBound implementation already uses the LLVM.

Modifications to SoftBound since the original paper. As described in the original paper, SoftBound modifies function signatures to pass metadata associated with the pointer parameters or returned pointers. To improve the robustness of the tool, we transitioned to an implementation that instead passes all pointer metadata on a shadow stack. This has two primary advantages. The first is that this design simplifies the implementation while simultaneously better supporting indirect function calls (via function pointers) and more robustly handling improperly declared function prototypes. The second is that it also simplifies the proofs.

6.1 Formalizing SoftBound for the LLVM IR

The SoftBound correctness proof has the following high-level structure:

1. We define a nonstandard operational semantics SBspec for the LLVM IR. This semantics “builds in” the safety properties that should be enforced by a correct implementation of SoftBound. It uses meta-level data structures to implement the metadata and meta-level functions to define the semantics of the bounds checks.
2. We prove that an LLVM program P, when run on the SBspec semantics, has no spatial safety violations.
3. We define a translation pass SBtrans(−) that instruments the LLVM code to propagate metadata.
4. We prove that if SBtrans(P) = ⌊P′⌋, then P′, when run on LLVM_D, simulates P running on SBspec.

The SoftBound specification. Figure 7 gives the program configurations and representative rules for the SBspec semantics. SBspec behaves the same as the standard semantics except that it creates, propagates, and checks metadata of pointers in the appropriate instructions.
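As a rough illustration of what it means to create, propagate, and check per-pointer metadata, the following Python sketch models a flat memory with a shadow heap. It is not Vellvm’s Coq definition of SBspec: the names `Machine`, `malloc`, `store`, `load`, `check_bounds`, and `SpatialViolation` are hypothetical, addresses are plain integers, and the choices that uninitialized cells read as 0 and that missing metadata reads as `None` are assumptions made only for this example.

```python
from dataclasses import dataclass, field


class SpatialViolation(Exception):
    """Raised when an access falls outside a pointer's [base, bound) range."""


@dataclass
class Machine:
    memory: dict = field(default_factory=dict)  # address -> value, plays the role of M
    shadow: dict = field(default_factory=dict)  # address -> (base, bound), plays the role of MM
    next_free: int = 16                         # hypothetical bump allocator

    def malloc(self, size):
        """Allocate `size` cells and create metadata for the new pointer."""
        base = self.next_free
        self.next_free += size
        return base, (base, base + size)

    def check_bounds(self, ptr, md):
        """Abort (raise) unless base <= ptr < bound."""
        base, bound = md
        if not (base <= ptr < bound):
            raise SpatialViolation(f"address {ptr} is outside [{base}, {bound})")

    def store(self, ptr, md, value, value_md=None):
        """Checked store; a pointer value also stores its metadata in the shadow heap."""
        self.check_bounds(ptr, md)
        self.memory[ptr] = value
        if value_md is None:
            self.shadow.pop(ptr, None)   # non-pointer data carries no metadata
        else:
            self.shadow[ptr] = value_md

    def load(self, ptr, md):
        """Checked load; returns the value and any metadata shadowing this address."""
        self.check_bounds(ptr, md)
        # Assumption: uninitialized cells read as 0 and missing metadata as None.
        return self.memory.get(ptr, 0), self.shadow.get(ptr)


if __name__ == "__main__":
    m = Machine()
    buf, buf_md = m.malloc(4)                     # four-cell buffer
    cell, cell_md = m.malloc(1)                   # one cell that will hold a pointer
    m.store(cell, cell_md, buf, value_md=buf_md)  # pointer store shadows its metadata
    p, p_md = m.load(cell, cell_md)               # pointer load recovers the metadata
    m.store(p + 3, p_md, 42)                      # last cell of the buffer: allowed
    try:
        m.store(p + 4, p_md, 99)                  # one past the end: rejected
    except SpatialViolation as err:
        print("aborted:", err)
```

Pointer arithmetic (`p + 3`) reuses the metadata of the pointer it was derived from, so the bounds stay tied to the original allocation even after the pointer has been stored to and reloaded from memory.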
[Figure 7 panels: nondeterministic rules; deterministic configurations]
Deterministic configurations:
  Frames          σ̂ ::= fid, l, c, tmn, δ, µ, α
  Call stacks     σ̂ ::= [] | σ̂, σ̂
  Program states  ŝ ::= M, MM, σ̂
Figure 7. SBspec: The specification semantics for SoftBound. Differences from the LLVM_ND rules are highlighted.
A program state Ŝ is an extension of the standard program state S for maintaining metadata md, which is a pair defining the start and end address for a pointer: µ in each function frame Σ̂ maps temporaries of pointer type to their metadata; MM is the shadow heap that stores metadata for pointers in memory. Note that although the specification is nondeterministic, the metadata is deterministic. Therefore, a pointer loaded from uninitialized memory space can be undef, but it cannot have arbitrary md (which might not be valid).

SBspec is correct if a program P either aborts on detecting a spatial memory violation with respect to SBspec or preserves the LLVM semantics of the original program P; moreover, P is never stuck because of a spatial memory violation in SBspec (i.e., SBspec must catch all spatial violations).

DEFINITION 1 (Spatial safety). Accessing a memory location at the offset ofs of a block blk is spatially safe if blk is less than the next fresh block N, and ofs is within the bounds of blk:

    blk < N ∧ (B(blk) = ⌊size⌋ → 0 ≤ ofs < size)

The legal stuck states of SoftBound, Stuck_SB(config, Ŝ), include all legal stuck states of LLVM_ND (recall Section 4.3) except the states that violate spatial safety. The case when B does not map blk to some size indicates that blk is not valid and pointers into blk are dangling; this indicates a temporal safety error that is not prevented by SoftBound, and therefore it is included in the set of legal stuck states.

Because the program states of a program in the LLVM_ND semantics are identical to the corresponding parts in SBspec, it is easy to relate them: let Ŝ ⊇◦ S mean that the common parts of the SoftBound state Ŝ and S are identical. Because memory instructions in SBspec may abort without accessing memory, the first part of correctness is by a straightforward simulation relation between states of the two semantics.

THEOREM 4 (SBspec simulates LLVM_ND). If the state Ŝ ⊇◦ S, and config ⊢ Ŝ ⟶ Ŝ′, then there exists a state S′ such that config ⊢ S ⟶ S′ and Ŝ′ ⊇◦ S′.

The second part of the correctness is proved by the following preservation and progress theorems.

THEOREM 5 (Preservation for SBspec). If (config, Ŝ) is well formed, and config ⊢ Ŝ ⟶ Ŝ′, then (config, Ŝ′) is well formed.

Here, SBspec well-formedness strengthens the invariants for LLVM_ND by requiring that if any id defined in ∆ is of pointer type, then µ contains its metadata, and by adding a spatial safety invariant: all bounds in the µs of function frames and in MM must be memory ranges within which all memory addresses are spatially safe.

The interesting part is proving that the spatial safety invariant is preserved. It holds initially, because a program’s initial frame stack is empty, and we assume that MM is also empty. The other cases depend on the rules in Figure 7.

The rule SB MALLOC, which allocates v elements of type typ at a memory block blk, updates the metadata of id with the start address that is the beginning of blk, and the end address that is at the offset blk.(sizeof typ × v) in the same block. LLVM’s memory model ensures that the range of memory is valid.

The rule SB LOAD reads from a pointer val with runtime data v, finds the md of the pointer, and ensures that v is within the md via checkbounds. If val is an identifier, findbounds simply returns the identifier’s metadata from µ, which must be a spatially safe memory range. If val is a constant of pointer type, findbounds returns bounds as follows. For global pointers, findbounds returns bounds derived from their types because globals must be allocated before a program starts. For pointers converted from constant integers by inttoptr, it conservatively returns the bounds [null, null) to indicate a potentially invalid memory range. For a pointer cnst1 derived from another constant pointer cnst2 by bitcast or getelementptr, findbounds returns the bounds of cnst2 for cnst1. Note that {|v′|} denotes conversion from a deterministic value to a nondeterministic value.

If the load reads a pointer-typed value v from memory, the rule finds its metadata in MM and updates the local metadata mapping µ. If MM does not contain any metadata indexed by
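The spatial safety predicate of Definition 1 and the case analysis that the text gives for findbounds can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Coq development: pointers are modeled as (block, offset) pairs, `NULL` is assumed to be encoded as `(0, 0)`, values are hypothetical tagged tuples such as `("global", "@g")`, and `global_sizes` is a hypothetical table standing in for the bounds that the paper derives from the types of globals.

```python
NULL = (0, 0)  # assumed encoding of the null pointer as (block, offset)


def spatially_safe(blk, ofs, block_sizes, next_block):
    """Definition 1: blk < N and (B(blk) = size implies 0 <= ofs < size)."""
    if blk >= next_block:
        return False
    size = block_sizes.get(blk)  # None models B(blk) being undefined
    # An unmapped block is a temporal error, so the spatial condition holds vacuously.
    return size is None or 0 <= ofs < size


def find_bounds(val, mu, global_sizes):
    """Return (start, end) metadata for an identifier or a pointer constant."""
    kind = val[0]
    if kind == "id":          # identifier: look up its metadata in the frame's mu
        return mu[val[1]]
    if kind == "global":      # global: bounds follow from its statically known size
        blk, size = global_sizes[val[1]]
        return ((blk, 0), (blk, size))
    if kind == "inttoptr":    # integer cast to pointer: the empty range [null, null)
        return (NULL, NULL)
    if kind in ("bitcast", "gep"):  # derived constant: same bounds as its source
        return find_bounds(val[1], mu, global_sizes)
    raise ValueError(f"unsupported value: {val!r}")


def check_bounds(ptr, md):
    """True when ptr lies in the half-open range described by md."""
    (start_blk, start_ofs), (end_blk, end_ofs) = md
    blk, ofs = ptr
    return blk == start_blk == end_blk and start_ofs <= ofs < end_ofs
```

For example, with `global_sizes = {"@g": (1, 4)}`, the call `find_bounds(("gep", ("global", "@g"), 2), {}, global_sizes)` returns the bounds of `@g` itself, and `check_bounds(NULL, (NULL, NULL))` is false because the range [null, null) is empty.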
[Figure 8 panels: memory simulation, relating (MM, M) to M′ over allocated blocks and globals, and frame simulation, relating (Δ, μ) to Δ′, where vᵢ ≈○ vᵢ′]
Figure 8. Simulation relations of the SoftBound pass
[Plot: runtime overhead (0% to 250%) of the extracted and C++ SoftBound implementations on the benchmarks bh, bisort, mst, tsp, go, comp, art, equake, ammp, gzip, lbm, libq., and their mean]
mentation (rightmost bar of each benchmark) for various benchmarks from SPEC95, SPEC2000 and SPEC2006. Because of the check elimination optimization performed by the C++ implementation, the code is slightly faster, but overall the extracted implementation provides similar performance.

Bugs found in the original SoftBound implementation. In the course of formalizing the SoftBound transformation, we discovered two implementation bugs in the original C++ implementation of SoftBound. First, when one of the incoming values of a φ node with pointer type is undef, undef was propagated as its base and bound. Subsequent compiler transformations may instantiate the undefined base and bound with defined values that allow the checkbounds to succeed, which would lead to a memory violation. Second, the base and bound of the constant pointer (typ∗) null were set to (typ∗) null and (typ∗) null + sizeof(typ), allowing dereferences of null or of pointers pointing to an offset from null. Either of these bugs could have resulted in faulty checking and thus exposed the program to the spatial violations that SoftBound was designed to prevent. These bugs underscore the importance of a formally verified and extracted implementation.

prototype tool that applies their methodology to verification of the LLVM compiler. The LLVM-MD project [35] validates LLVM optimizations by symbolic evaluation. The Peggy tool performs translation validation for the LLVM compiler using a technique called equality saturation [28]. These applications are not fully certified.

8. Conclusion

Although we do not consider it in this paper, our intention is that the Vellvm framework will serve as a first step toward a fully-verified LLVM compiler, similar to Leroy et al.’s CompCert [18]. Our Coq development extends some of CompCert’s libraries, and our LLVM memory model is based on CompCert’s memory model. The focus of this paper is the LLVM IR semantics itself, the formalization of which is a necessary step toward a fully-verified LLVM compiler. Because much of the complexity of an LLVM-based compiler lies in the IR-to-IR transformation passes, formalizing correctness properties at this level stands to yield a significant payoff, as demonstrated by our SoftBound case study, even without fully verifying a compiler.
[9] A. Chlipala. A certified type-preserving compiler from lambda calculus to assembly language. In PLDI ’07: Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, 2007.
[10] The Coq Proof Assistant Reference Manual (Version 8.3pl1). The Coq Development Team, 2011.
[11] K. Crary. Toward a foundational typed assembly language. In POPL ’03: Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2003.
[12] K. Crary and R. Harper. Mechanized definition of Standard ML (alpha release), 2009. http://www.cs.cmu.edu/~crary/papers/2009/mldef-alpha.tar.gz.
[13] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., 13:451–490, 1991.
[14] G. A. Kildall. A unified approach to global program optimization. In POPL ’73: Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 1973.
[15] G. Klein and T. Nipkow. A machine-checked model for a Java-like language, virtual machine and compiler. ACM Trans. Program. Lang. Syst., 28:619–695, 2006.
[16] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO ’04: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.
[17] S. Lerner, T. Millstein, E. Rice, and C. Chambers. Automated soundness proofs for dataflow analyses and transformations via local rules. In POPL ’05: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2005.
[18] X. Leroy. A formally verified compiler back-end. Journal of Automated Reasoning, 43(4):363–446, 2009.
[19] The LLVM Reference Manual (Version 2.6). The LLVM Development Team, 2010. http://llvm.org/releases/2.6/docs/LangRef.html.
[20] V. S. Menon, N. Glew, B. R. Murphy, A. McCreight, T. Shpeisman, A.-R. Adl-Tabatabai, and L. Petersen. A verifiable SSA program representation for aggressive compiler optimization. In POPL ’06: Proceedings of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2006.
[21] S. Nagarakatte, J. Zhao, M. M. K. Martin, and S. Zdancewic. SoftBound: Highly compatible and complete spatial memory safety for C. In PLDI ’09: Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation, 2009.
[22] G. C. Necula. Translation validation for an optimizing compiler. In PLDI ’00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, 2000.
[23] NIST Juliet Test Suite for C/C++. NIST, 2010. http://samate.nist.gov/SRD/testCases/suites/Juliet-2010-12.c.cpp.zip.
[24] M. Nita and D. Grossman. Automatic transformation of bit-level C code to support multiple equivalent data layouts. In CC ’08: Proceedings of the 17th International Conference on Compiler Construction, 2008.
[25] M. Nita, D. Grossman, and C. Chambers. A theory of platform-dependent low-level software. In POPL ’08: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2008.
[26] A. Pnueli, M. Siegel, and E. Singerman. Translation validation. In TACAS ’98: Proceedings of the 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems, 1998.
[27] P. Sewell, F. Zappa Nardelli, S. Owens, G. Peskine, T. Ridge, S. Sarkar, and R. Strniša. Ott: Effective tool support for the working semanticist. In ICFP ’07: Proceedings of the 9th ACM SIGPLAN International Conference on Functional Programming, 2007.
[28] M. Stepp, R. Tate, and S. Lerner. Equality-Based translation validator for LLVM. In CAV ’11: Proceedings of the 23rd International Conference on Computer Aided Verification, 2011.
[29] S. Kundu, Z. Tatlock, and S. Lerner. Proving optimizations correct using parameterized program equivalence. In PLDI ’09: Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation, 2009.
[30] D. Syme. Reasoning with the formal definition of Standard ML in HOL. In Sixth International Workshop on Higher Order Logic Theorem Proving and its Applications, 1993.
[31] Z. Tatlock and S. Lerner. Bringing extensibility to verified compilers. In PLDI ’10: Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, 2010.
[32] J.-B. Tristan and X. Leroy. Formal verification of translation validators: a case study on instruction scheduling optimizations. In POPL ’08: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2008.
[33] J.-B. Tristan and X. Leroy. Verified validation of lazy code motion. In PLDI ’09: Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation, 2009.
[34] J.-B. Tristan and X. Leroy. A simple, verified validator for software pipelining. In POPL ’10: Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2010.
[35] J.-B. Tristan, P. Govereau, and G. Morrisett. Evaluating value-graph translation validation for LLVM. In PLDI ’11: Proceedings of the ACM SIGPLAN 2011 Conference on Programming Language Design and Implementation, 2011.
[36] A. Zaks and A. Pnueli. Program analysis for compiler validation. In PASTE ’08: Proceedings of the 8th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, 2008.
[37] L. Zhao, G. Li, B. De Sutter, and J. Regehr. ARMor: Fully verified software fault isolation. In EMSOFT ’11: Proceedings of the 9th ACM International Conference on Embedded Software, 2011.