Formalizing the LLVM Intermediate Representation

for Verified Program Transformations ∗

Jianzhou Zhao Santosh Nagarakatte Milo M. K. Martin Steve Zdancewic
Computer and Information Science Department, University of Pennsylvania

Abstract

This paper presents Vellvm (verified LLVM), a framework for reasoning about programs expressed in LLVM’s intermediate representation and transformations that operate on it. Vellvm provides a mechanized formal semantics of LLVM’s intermediate representation, its type system, and properties of its SSA form. The framework is built using the Coq interactive theorem prover. It includes multiple operational semantics and proves relations among them to facilitate different reasoning styles and proof techniques.

To validate Vellvm’s design, we extract an interpreter from the Coq formal semantics that can execute programs from the LLVM test suite and thus be compared against LLVM reference implementations. To demonstrate Vellvm’s practicality, we formalize and verify a previously proposed transformation that hardens C programs against spatial memory safety violations. Vellvm’s tools allow us to extract a new, verified implementation of the transformation pass that plugs into the real LLVM infrastructure; its performance is competitive with the non-verified, ad-hoc original.

Categories and Subject Descriptors D.2.4 [Software Engineering]: Software/Program Verification - Correctness Proofs; F.3.1 [Logics and Meanings of Programs]: Specifying and Verifying and Reasoning about Programs - Mechanical verification; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages - Operational semantics

General Terms Languages, Verification, Reliability

Keywords LLVM, Coq, memory safety

1. Introduction

Compilers perform their optimizations and transformations over an intermediate representation (IR) that hides details about the target execution platform. Rigorously proving properties about these IR transformations requires that the IR itself have a well-defined formal semantics. Unfortunately, the IRs used in main-stream production compilers generally do not. To address this deficiency, this paper formalizes both the static and dynamic semantics of the IR that forms the heart of the LLVM compiler infrastructure.

LLVM [19] (Low-Level Virtual Machine) uses a platform-independent SSA-based IR originally developed as a research tool for studying optimizations and modern compilation techniques [16]. The LLVM project has since blossomed into a robust, industrial-strength, and open-source compilation platform that competes with GCC in terms of compilation speed and performance of the generated code [16]. As a consequence, it has been widely used in both academia and industry.

An LLVM-based compiler is structured as a translation from a high-level source language to the LLVM IR (see Figure 1). The LLVM tools provide a suite of IR to IR translations, which provide optimizations, program transformations, and static analyses. The resulting LLVM IR code can then be lowered to a variety of target architectures, including x86, PowerPC, and ARM (either by static compilation or dynamic JIT-compilation). The LLVM project focuses on C and C++ front-ends, but many source languages, including Haskell, Scheme, Scala, Objective C and others have been ported to target the LLVM IR.

[Figure 1. The LLVM compiler infrastructure. Source languages (C, C++, Haskell, ObjC, ObjC++, Scheme, Scala, ...) are translated to the LLVM IR, which is subject to optimizations/transformations and program analysis, and is then handed to a code generator/JIT targeting Alpha, ARM, PowerPC, Sparc, X86, Mips, ...]

This paper introduces Vellvm (for verified LLVM), a framework that includes a formal semantics and associated tools for mechanized verification of LLVM IR code, IR to IR transformations, and analyses. The description of this framework in this paper is organized into two parts.

The first part formalizes the LLVM IR. It presents the LLVM syntax and static properties (Section 2), including a variety of well-formedness and structural properties about LLVM’s static single assignment (SSA) representation that are useful in proofs about LLVM code and transformation passes. Vellvm’s memory model (Section 3) is based on CompCert’s [18], extended to handle LLVM’s arbitrary bit-width integers, padding, and alignment issues. In developing the operational semantics (Section 4), a significant challenge is adequately capturing the nondeterminism that arises due to LLVM’s explicit undef value and its intentional underspecification of certain erroneous behaviors such as reading from uninitialized memory; this underspecification is needed to justify the correctness of aggressive optimizations. Vellvm therefore implements several related operational semantics, including a nondeterministic semantics and several deterministic refinements to facilitate different proof techniques and reasoning styles.

∗ This research was funded in part by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
POPL’12, January 25–27, 2012, Philadelphia, PA, USA.
Copyright © 2012 ACM 978-1-4503-1083-3/12/01...$10.00

The second part of the paper focuses on the utility of formalizing the LLVM IR. We describe Vellvm’s implementation in Coq [10] and validate its LLVM IR semantics by extracting an executable interpreter and comparing its behavior to that of the LLVM reference interpreter and compiled code (Section 5). The Vellvm framework provides support for moving code between LLVM’s IR representation and its Coq representation. This infrastructure, along with Coq’s facility for extracting executable code from constructive proofs, enables Vellvm users to manipulate LLVM IR code with high confidence in the results. For example, using this framework, we can extract verified LLVM transformations that plug directly into the LLVM compiler. We demonstrate the effectiveness of this technique by using Vellvm to implement a verified instance of SoftBound [21], an LLVM-pass that hardens C programs against buffer overflows and other memory safety violations (Section 6).

To summarize, this paper and the Vellvm framework provide:

• A formalization of the LLVM IR, its static semantics, memory model, and several operational semantics;
• Metatheoretic results (preservation and progress theorems) relating the static and dynamic semantics;
• Coq infrastructure implementing the above, along with tools for interacting with the LLVM compiler;
• Validation of the semantics in the form of an extracted LLVM interpreter; and
• Demonstration of applying this framework to extract a verified transformation pass for enforcing spatial memory-safety.

2. Static Properties of the LLVM IR

The LLVM IR is a typed, static single assignment (SSA) [13] language that is a suitable representation for expressing many compiler transformations and optimizations. This section describes the syntax and basic static properties, emphasizing those features that are either unique to the LLVM or have non-trivial implications for the formalization. Vellvm’s formalization is based on the LLVM release version 2.6, and the syntax and semantics are intended to model the behavior as described in the LLVM Language Reference,¹ although we also used the LLVM IR reference interpreter and the x86 backend to inform our design.

¹ See http://llvm.org/releases/2.6/docs/LangRef.html

2.1 Language syntax

Figure 3 shows (a fragment of) the abstract syntax for the subset of the LLVM IR formalized in Vellvm. The metavariable id ranges over LLVM identifiers, written %X, %T, %a, %b, etc., which are used to name local types and temporary variables, and @a, @b, @main, etc., which name global values and functions.

Each source file is a module mod that includes data layout information layout (which defines sizes and alignments for types; see below), named types, and a list of prods that can be function declarations, function definitions, and global variables. Figure 2 shows a small example of LLVM syntax (its meaning is described in more detail in Section 3).

Every LLVM expression has a type, which can easily be determined from type annotations that provide sufficient information to check an LLVM program for type compatibility. The LLVM IR is not a type-safe language, however, because its type system allows arbitrary casts, calling functions with incorrect signatures, accessing invalid memory, etc. The LLVM type system ensures only that the size of a runtime value in a well-formed program is compatible with the type of the value; a well-formed program can still be stuck (see Section 4.3).

    %ST = type { i10 , [10 x i8*] }

    define %ST* @foo(i8* %ptr) {
    entry:
      %p = malloc %ST, i32 1
      %r = getelementptr %ST* %p, i32 0, i32 0
      store i10 648, %r    ; decomposes as 136, 2
      %s = getelementptr %ST* %p, i32 0, i32 1, i32 0
      store i8* %ptr, %s
      ret %ST* %p
    }

Figure 2. An example use of LLVM’s memory operations. Here, %p is a pointer to a single-element array of structures of type %ST. Pointer %r indexes into the first component of the first element in the array, and has type i10*, as used by the subsequent store, which writes the 10-bit value 648. Pointer %s has type i8** and points to the first element of the nested array in the same structure.

Types typ include arbitrary bit-width integers i8, i16, i32, etc., or, more generally, isz where sz is a natural number. Types also include float, void, pointers typ∗, and arrays [ sz × typ ] that have a statically-known size sz. Anonymous structure types { \overline{typ_j}^j } contain a list of types. Functions typ \overline{typ_j}^j have a return type, and a list of argument types. Here, \overline{typ_j}^j denotes a list of typ components; we use similar notation for other lists throughout the paper. Finally, types can be named by identifiers id, which is useful to define recursive types.

The sizes and alignments for types, and endianness are defined in layout. For example, int sz align0 align1 dictates that values with type isz are align0-byte aligned when they are within an aggregate and when used as an argument, and align1-byte aligned when emitted as a global.

Operations in the LLVM IR compute with values val, which are either identifiers id naming temporaries, or constants cnst computed from statically-known data, using the compile-time analogs of the commands described below. Constants include base values (i.e., integers or floats of a given bit width), and zero-values of a given type, as well as structures and arrays built from other constants.

To account for uninitialized variables and to allow for various program optimizations, the LLVM IR also supports a type-indexed undef constant. Semantically, undef stands for a set of possible bit patterns, and LLVM compilers are free to pick convenient values for each occurrence of undef to enable aggressive optimizations or program transformations. As described in Section 4, the presence of undef makes the LLVM operational semantics inherently nondeterministic.

All code in the LLVM IR resides in top-level functions, whose bodies are composed of blocks b. As in classic compiler representations, a basic block consists of a labeled entry point l, a series of φ nodes, a list of commands, and a terminator instruction. As is usual in SSA representations, the φ nodes join together values from a list of predecessor blocks of the control-flow graph: each φ node takes a list of (value, label) pairs that indicates the value chosen when control transfers from a predecessor block with the associated label. Block terminators (br and ret) branch to another block or return (possibly with a value) from the current function. Terminators also include the unreachable marker, indicating that control should never reach that point in the program.

The core of the LLVM instruction set is its commands (c), which include the usual suite of binary arithmetic operations (bop, e.g., add, lshr, etc.), memory accessors (load, store), heap operations (malloc and free), stack allocation (alloca), conversion operations among integers, floats and pointers (eop, trop, and cop), comparison over integers (icmp and select), and calls (call).

Modules     mod, P ::= \overline{layout} \overline{namedt} \overline{prod}
Layouts     layout ::= bigendian | littleendian | ptr sz align0 align1 | int sz align0 align1
                     | float sz align0 align1 | aggr sz align0 align1 | stack sz align0 align1
Products    prod   ::= id = global typ const align | define typ id(\overline{arg}){\overline{b}}
                     | declare typ id(\overline{arg})
Floats      fp     ::= float | double
Types       typ    ::= isz | fp | void | typ∗ | [ sz × typ ] | { \overline{typ_j}^j } | typ \overline{typ_j}^j | id
Values      val    ::= id | cnst
Binops      bop    ::= add | sub | mul | udiv | sdiv | urem | srem | shl | lshr | ashr | and | or | xor
Float ops   fbop   ::= fadd | fsub | fmul | fdiv | frem
Extension   eop    ::= zext | sext | fpext
Cast op     cop    ::= fptoui | ptrtoint | inttoptr | bitcast
Trunc op    trop   ::= truncint | truncfp
Constants   cnst   ::= isz Int | fp Float | typ ∗ id | (typ∗) null | typ zeroinitializer
                     | typ[ \overline{cnst_j}^j ] | { \overline{cnst_j}^j } | typ undef
                     | bop cnst1 cnst2 | fbop cnst1 cnst2 | trop cnst to typ | eop cnst to typ
                     | cop cnst to typ | getelementptr cnst \overline{cnst_j}^j
                     | select cnst0 cnst1 cnst2 | icmp cond cnst1 cnst2 | fcmp fcond cnst1 cnst2
Blocks      b      ::= l \overline{φ} \overline{c} tmn
φ nodes     φ      ::= id = phi typ \overline{[val_j, l_j]}^j
Tmns        tmn    ::= br val l1 l2 | br l | ret typ val | ret void | unreachable
Commands    c      ::= id = bop (int sz) val1 val2 | id = fbop fp val1 val2 | id = load (typ∗) val1 align
                     | store typ val1 val2 align | id = malloc typ val align | free (typ∗) val
                     | id = alloca typ val align | id = trop typ1 val to typ2 | id = eop typ1 val to typ2
                     | id = cop typ1 val to typ2 | id = icmp cond typ val1 val2 | id = select val0 typ val1 val2
                     | id = fcmp fcond fp val1 val2 | option id = call typ0 val0 \overline{param}
                     | id = getelementptr (typ∗) val \overline{val_j}^j

Figure 3. Syntax for LLVM. Note that this figure omits some syntax definitions (e.g., cond, the comparison operators) for the sake of space; they are, of course, present in Vellvm’s implementation. Some other parts of the LLVM have been omitted from the Vellvm development; these are discussed in Section 5.
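
The φ-node form in Figure 3 pairs each incoming value with a predecessor label. As a rough illustration of the selection rule described in Section 2.1, the following Python sketch picks, for each φ node, the value associated with the block that control came from. The names `Phi`, `eval_value`, and `compute_phis` are hypothetical and are not taken from Vellvm; the sketch also assumes the usual SSA convention that all φ nodes of a block read the environment as it was on exit from the predecessor.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# A value is either the name of a temporary (e.g. "%i") or an integer constant.
Value = Union[str, int]


@dataclass(frozen=True)
class Phi:
    """id = phi typ [val_j, l_j]: one incoming value per predecessor label."""
    dest: str
    typ: str
    incoming: Tuple[Tuple[Value, str], ...]


def eval_value(env: Dict[str, int], val: Value) -> int:
    """Look up a temporary in the local environment, or return a constant."""
    return env[val] if isinstance(val, str) else val


def compute_phis(env: Dict[str, int], pred_label: str, phis: List[Phi]) -> Dict[str, int]:
    """Bind every phi destination to the value chosen for `pred_label`.

    Assumption: all phi nodes are evaluated against the incoming
    environment before any destination is updated.
    """
    selected = {}
    for phi in phis:
        matches = [v for (v, lbl) in phi.incoming if lbl == pred_label]
        if len(matches) != 1:
            raise ValueError(f"{phi.dest}: no unique entry for predecessor {pred_label}")
        selected[phi.dest] = eval_value(env, matches[0])
    new_env = dict(env)
    new_env.update(selected)
    return new_env


# Phi nodes of a hypothetical loop header with predecessors "entry" and "loop".
header_phis = [
    Phi("%i", "i32", ((0, "entry"), ("%i.next", "loop"))),
    Phi("%prev", "i32", ((0, "entry"), ("%i", "loop"))),
]

env = compute_phis({}, "entry", header_phis)   # first entry into the loop
env["%i.next"] = env["%i"] + 1                 # stands in for the loop body
env = compute_phis(env, "loop", header_phis)   # back edge
```

On the back edge, `%prev` receives the old value of `%i` rather than the freshly selected one, which is what evaluating the φ nodes against the incoming environment means here.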
Note that a call site is allowed to ignore the return value of a function call. Finally, getelementptr computes pointer offsets into structured datatypes based on their types; it provides a platform- and layout-independent way of performing array indexing, struct field access, and pointer arithmetic.

2.2 Static semantics

Following the LLVM IR specification, Vellvm requires that every LLVM program satisfy certain invariants to be considered well formed: every variable in a function is well-typed, well-scoped, and assigned exactly once. At a minimum, any reasonable LLVM transformation must preserve these invariants; together they imply that the program is in SSA form [13].

All the components in the LLVM IR are annotated with types, so the typechecking algorithm is straightforward and determined only by local information. The only subtlety is that types themselves must be well formed. All typs except void and function types are considered to be first class, meaning that values of these types can be passed as arguments to functions. A set of first-class type definitions is well formed if there are no degenerate cycles in their definitions (i.e., every cycle through the definitions is broken by a pointer type). This ensures that the physical sizes of such typs are positive, finite, and known statically.

The LLVM IR has two syntactic scopes (a global scope and a function scope) and does not have nested local scopes. In the global scope, all named types, global variables and functions have different names, and are defined mutually. In the scope of a function fid in module mod, all the global identifiers in mod, the names of arguments, locally defined variables and block labels in the function fid must be unique, which enforces the single-assignment part of the SSA property.

The set of blocks making up a function constitute a control-flow graph with a well-defined entry point. All instructions in the function must satisfy the SSA scoping invariant with respect to the control-flow graph: the instruction defining an identifier must dominate all the instructions that use it. Within a block, insn1 dominates insn2 if insn1 appears before insn2 in program order. A block labeled l1 dominates a block labeled l2 if every execution path from the program entry to l2 must go through l1.

The Vellvm formalization provides an implementation of this dominator analysis using a standard dataflow fixpoint computation [14]. It also proves that the implementation is correct, as stated in the following lemma, which is needed to establish preservation of the well-formedness invariants by the operational semantics (see Section 4).

LEMMA 1 (Dominator Analysis Correctness).
• The entry block of a function dominates itself.
• Given a block b2 that is an immediate successor of b1, all the strict dominators of b2 also dominate b1.

These well-formedness constraints must hold only of blocks that are reachable from a function’s entry point; unreachable code may contain ill-typed and ill-scoped instructions.

3. A Memory Model for Vellvm

3.1 Rationale

Understanding the semantics of LLVM’s memory operations is crucial for reasoning about LLVM programs. LLVM developers make many assumptions about the “legal” behaviors of such LLVM code, and they informally use those assumptions to justify the correctness of program transformations.

There are many properties expected of a reasonable implementation of the LLVM memory operations (especially in the absence of errors). For example, we can reasonably assume that the load instruction does not affect which memory addresses are allocated, or that different calls to malloc do not inappropriately reuse memory locations. Unfortunately, the LLVM Language Reference Manual does not enumerate all such properties, which should hold of any “reasonable” memory implementation.

On the other hand, details about the particular memory management implementation can be observed in the behavior of LLVM programs (e.g., you can print a pointer after casting it to an integer). For this reason, and also to address error conditions, the LLVM specification intentionally leaves some behaviors undefined. Examples include: loading from an unallocated address; loading with improper alignment; loading from properly allocated but uninitialized memory; and loading from properly initialized memory but with an incompatible type.

Because of the dependence on a concrete implementation of memory operations, which can be platform specific, there are many possible memory models for the LLVM. One of the challenges we encountered in formalizing the LLVM was finding a point in the design space that accurately reflects the intent of the LLVM documentation while still providing a useful basis for reasoning about LLVM programs.

In this paper we adopt a memory model that is based on the one implemented for CompCert [18]. This model allows Vellvm to accurately implement the LLVM IR and, in particular, detect the kind of errors mentioned above while simultaneously justifying many of the “reasonable” assumptions that LLVM programmers make. The nondeterministic operational semantics presented in Section 4 takes advantage of this precision to account for much of the LLVM’s under-specification.

Although Vellvm’s design is intended to faithfully capture the LLVM specification, it is also partly motivated by pragmatism: building on CompCert’s existing memory model allowed us to reuse a significant amount of their Coq infrastructure. A benefit of this choice is that our memory model is compatible with CompCert’s memory model (i.e., our memory model implements the CompCert Memory signature).

This Vellvm memory model inherits some features from the CompCert implementation: it is single threaded (in this paper we consider only single-threaded programs); it assumes that pointers are 32 bits wide and 4-byte aligned; and it assumes that the memory is infinite. Unlike CompCert, Vellvm’s model must also deal with arbitrary bit-width integers, padding, and alignment constraints that are given by layout annotations in the LLVM program, as described next.

3.2 LLVM memory commands

The LLVM supports several commands for working with heap-allocated data structures:

• malloc and alloca allocate array-structured regions of memory. They take a type parameter, which determines layout and padding of the elements of the region, and an integral size that specifies the number of elements; they return a pointer to the newly allocated region.
• free deallocates the memory region associated with a given pointer (which should have been created by malloc). Memory allocated by alloca is implicitly freed upon return from the function in which alloca was invoked.
• load and store respectively read and write LLVM values to memory. They take type parameters that govern the expected layout of the data being read/written.
• getelementptr indexes into a structured data type by computing an offset pointer from another given pointer based on its type and a list of indices that describe a path into the datatype.

Figure 2 gives a small example program that uses these operations. Importantly, the type annotations on these operations can be any first-class type, which includes arbitrary bit-width integers, floating point values, pointers, and aggregated types (arrays and structures). The LLVM IR semantics treats memory as though it is dynamically typed: the sizes, layout, and alignment of a value read via a load instruction must be consistent with that of the data that was stored at that address, otherwise the result is undefined.

[Figure 4 diagram: a row of memory blocks Blk ..., Blk 5, Blk 11, Blk ..., Blk 39 (allocated) and Blk 40 (next block), marked valid, invalid, valid, valid, valid, and invalid respectively. The cells shown are:

    Block 11                        Block 39
    offset  cell                    offset  cell
    32      mb(10,136)              20      muninit
    33      mb(10,2)                21      muninit
    34      muninit                 22      muninit
    35      muninit                 23      muninit
    36      mptr(b39,24,0)          24      mptr(b11,32,0)
    37      mptr(b39,24,1)          25      mptr(b11,32,1)
    38      mptr(b39,24,2)          26      mptr(b11,32,2)
    39      mptr(b39,24,3)          27      mptr(b11,32,3)

Type annotations in the diagram: i10, {i10, i8*}, i8*, and [10 x i8*] for block 11; i32 and i16* for block 39.]

Figure 4. Vellvm’s byte-oriented memory model. This figure shows (part of) a memory state that might be reached by calling the function foo from Figure 2. Blocks less than 40 were allocated; the next fresh block to allocate is 40. Block 5 is deallocated, and thus marked invalid to access; fresh blocks (≥ 40) are also invalid. Invalid memory blocks are gray, and valid memory blocks that are accessible are white. Block 11 contains data with structure type {i10, [10 x i8*]} but it might be read (due to physical subtyping) at the type {i10, i8*}. This type is flattened into two byte-sized memory cells for the i10 field, two uninitialized padding cells to adjust alignment, and four pointer memory cells for the first element of the array of 32-bit i8* pointers. Here, that pointer points to the 24th memory cell of block 39. Block 39 contains an uninitialized i32 integer represented by four muninit cells followed by a pointer that points to the 32nd memory cell of block 11.

This approach leads to a memory model structured in two parts: (1) a low-level byte-oriented representation that stores values of basic (non-aggregated) types along with enough information to indicate physical size, alignment, and whether or not the data is a pointer, and (2) an encoding that flattens LLVM-level structured data with first-class types into a sequence of basic values, computing appropriate padding and alignment from the type. The next two subsections describe these two parts in turn.

3.3 The byte-oriented representation

The byte-oriented representation is composed of blocks of memory cells. Each cell is a byte-sized quantity that describes the smallest chunk of contents that a memory operation can access. Cells come in several flavors:

    Memory cells  mc ::= mb(sz, byte) | mptr(blk, ofs, idx) | muninit

The memory cell mb(sz, byte) represents a byte-sized chunk of numeric data, where the LLVM-level bit-width of the integer is given by sz and whose contents is byte. For example, an integer with bit-width 32 is represented by four mb cells, each with size parameter 32. An integer with bit-width that is not divisible by 8 is encoded by the minimal number of bytes that can store the integer, i.e., an integer with bit-width 10 is encoded by two bytes, each with size parameter 10 (see Figure 4). Floating point values are encoded similarly.

Memory addresses are represented as a block identifier blk and an offset ofs within that block; the cell mptr(blk, ofs, idx)

is a byte-sized chunk of such a pointer where idx is an index identifying which byte the chunk corresponds to. Because Vellvm’s implementation assumes 32-bit pointers, four such cells are needed to encode one LLVM-pointer, as shown in Figure 4. Loading a pointer succeeds only if the 4 bytes loaded are sequentially indexed from 0 to 3.

The last kind of cell is muninit, which represents uninitialized memory, layout padding, and bogus values that result from undefined computations (such as might arise from an arithmetic overflow).

Given this definition of memory cells, a memory state M = (N, B, C) includes the following components: N is the next fresh block to allocate; B maps a valid block identifier to the size of the block; C maps a block identifier and an offset within the block to a memory cell (if the location is valid). Initially, N is 1; B and C are empty. Figure 4 gives a concrete example of such a memory state for the program in Figure 2.

There are four basic operations over this byte-oriented memory state: alloc, mfree, mload, and mstore. alloc allocates a fresh memory block N with a given size, increments N, and fills the newly allocated memory cells with muninit. mfree simply removes the deallocated block from B, and its contents from C. Note that the memory model does not recycle block identifiers deallocated by an mfree operation, because this model assumes that memory is of infinite size.

The mstore operation is responsible for breaking non-byte sized basic values into chunks and updating the appropriate memory locations. Basic values are integers (with their bit-widths), floats, addresses, and padding.

    Basic values  bv   ::= Int sz | Float | blk.ofs | pad sz
    Basic types   btyp ::= isz | fp | typ∗

mload is a partial function that attempts to read a value from a memory location. It is annotated by a basic type, and ensures compatibility between memory cells at the address it reads from and the given type. For example, memory cells for an integer with bit-width sz cannot be accessed as an integer type with a different bit-width; a sequence of bytes can be accessed as floating point values if they can be decoded as a floating point value; pointers stored in memory can only be accessed by pointer types. If an access is type incompatible, mload returns pad sz, which is an “error” value representing an arbitrary bit pattern with the bit-width sz of the type being loaded. mload is undefined in the case that the memory address is not part of a valid allocation block.

3.4 The LLVM flattened values and memory accesses

LLVM’s structured data is flattened to lists of basic values that indicate its physical representation:

    Flattened Values  v ::= bv | bv, v

A constant cnst is flattened into a list of basic values according to its annotated type. If the cnst is already of basic type, it flattens into the singleton list. Values of array type [ sz × typ ] are first flattened element-wise according to the representation given by typ and then padded by uninitialized values to match typ’s alignment requirements as determined by the module’s layout descriptor. The resulting list is then concatenated to obtain the appropriate flattened value. The case when a cnst is a structure type is similar.

The LLVM load instruction works by first flattening its type annotation typ into a list of basic types, and mapping mload across the list; it then merges the returned basic values into the final LLVM value. Storing an LLVM value to memory works by first flattening to a list of basic values and mapping mstore over the result.

                    LLVM_ND
                       ⊇
    LLVM_Interp  ≈  LLVM_D  ≳  LLVM∗_DFn  ≳  LLVM∗_DB

Figure 5. Relations between different operational semantics. Each equivalence or inclusion is justified by a proof in Vellvm.

This scheme induces a notion of dynamically-checked physical subtyping: it is permitted to read a structured value at a different type from the one at which it was written, so long as the basic types they flatten into agree. For non-structured data types such as integers, Vellvm’s implementation is conservative: for example, reading an integer with bit width two from the second byte of a 10-bit wide integer yields undef because the results are, in general, platform specific. Because of this dynamically-checked, physical subtyping, pointer-to-pointer casts can be treated as the identity. Similar ideas arise in other formalizations of low-level language semantics [24, 25].

The LLVM malloc and free operations are defined by alloc and mfree in a straightforward manner. As the LLVM IR does not explicitly distinguish the heap and stack and function calls are implementation-specific, the memory model defines the same semantics for stack allocation (alloca) and heap allocation (malloc): both of them allocate memory blocks in memory. However, the operational semantics (described next) maintains a list of blocks allocated by alloca for each function, and it deallocates them on return.

4. Operational Semantics

Vellvm provides several related operational semantics for the LLVM IR, as summarized in Figure 5. The most general is LLVM_ND, a small-step, nondeterministic evaluation relation given by rules of the form config ⊢ S ⟶ S′ (see Figure 6). This section first motivates the need for nondeterminism in understanding the LLVM semantics and then illustrates LLVM_ND by explaining some of its rules. Next, we introduce several equivalent deterministic refinements of LLVM_ND (LLVM_D, LLVM∗_DB, and LLVM∗_DFn), each of which has different uses, as described in Section 4.4. All of these operational semantics must handle various error conditions, which manifest as partiality in the rules. Section 4.3 describes these error conditions, and relates them to the static semantics of Section 2.

Vellvm’s operational rules are specified as transitions between machine states S of the form M, \overline{Σ}, where M is the memory and \overline{Σ} is a stack of frames. A frame keeps track of the current function fid and block label l, as well as the “continuation” sequence of commands \overline{c} to execute next ending with the block terminator tmn. The map ∆ tracks bindings for the local variables (which are not stored in M), and the list α keeps track of which memory blocks were created by the alloca instruction so that they can be marked as invalid when the function call returns.

4.1 Nondeterminism in the LLVM operational semantics

There are several sources of nondeterminism in the LLVM semantics: the undef value, which stands for an arbitrary (and ephemeral) bit pattern of a given type, and various memory errors, such as reading from an uninitialized location. Unlike the “fatal” errors, which are modeled by stuck states (see Section 4.3), we choose to model these behaviors nondeterministically because they correspond to choices that would be resolved by running the program with a concrete memory implementation. Moreover, the LLVM optimization passes use the flexibility granted by this underspecificity to justify aggressive optimizations.

Configurations:
  Fun tables θ ::= v ↦ id    Globals g ::= id ↦ v    Configurations config ::= mod, g, θ
Nondeterministic Machine States:
  Value sets V ::= {v | Φ(v)}    Locals ∆ ::= id ↦ V    Allocas α ::= [] | blk, α
  Frames Σ ::= fid, l, c, tmn, ∆, α    Call stacks Σ ::= [] | Σ, Σ    Program states S ::= M, Σ

config ⊢ S ⟶ S′

NDS CALL
  evalND(g, ∆, val) = ⌊V⌋    v ∈ V
  findfdef(mod, θ, v) = ⌊define typ fid′(arg){(l′ [] c′ tmn′), b}⌋
  initlocals(g, ∆, arg, param) = ⌊∆′⌋    c0 = (option id = call typ val param)
  ------------------------------------------------------------
  mod, g, θ ⊢ M, ((fid, l, (c0, c), tmn, ∆, α), Σ) ⟶ M, ((fid′, l′, c′, tmn′, ∆′, []), (fid, l, (c0, c), tmn, ∆, α), Σ)

NDS RET
  evalND(g, ∆, val) = ⌊V⌋    c0 = (option id = call typ val param)    freeallocas(M, α′) = ⌊M′⌋
  ------------------------------------------------------------
  mod, g, θ ⊢ M, ((fid′, l′, [], ret typ val, ∆′, α′), (fid, l, (c0, c), tmn, ∆, α), Σ) ⟶ M′, ((fid, l, c, tmn, ∆{id ← V}, α), Σ)

NDS BR TRUE
  evalND(g, ∆, val) = ⌊V⌋    true ∈ V
  findblock(mod, fid, l1) = (l1 φ1 c1 tmn1)    computephinodesND(g, ∆, l, l1, φ1) = ⌊∆′⌋
  ------------------------------------------------------------
  mod, g, θ ⊢ M, ((fid, l, [], br val l1 l2, ∆, α), Σ) ⟶ M, ((fid, l1, c1, tmn1, ∆′, α), Σ)

NDS MALLOC
  evalND(g, ∆, val) = ⌊V⌋    v ∈ V    c0 = (id = malloc typ val align)    malloc(M, typ, v, align) = ⌊M′, blk⌋
  ------------------------------------------------------------
  mod, g, θ ⊢ M, ((fid, l, (c0, c), tmn, ∆, α), Σ) ⟶ M′, ((fid, l, c, tmn, ∆{id ← {blk.0}}, α), Σ)

NDS ALLOCA
  evalND(g, ∆, val) = ⌊V⌋    v ∈ V    c0 = (id = alloca typ val align)    malloc(M, typ, v, align) = ⌊M′, blk⌋
  ------------------------------------------------------------
  mod, g, θ ⊢ M, ((fid, l, (c0, c), tmn, ∆, α), Σ) ⟶ M′, ((fid, l, c, tmn, ∆{id ← {blk.0}}, (blk, α)), Σ)

NDS BOP
  evalND(g, ∆, val1) = ⌊V1⌋    evalND(g, ∆, val2) = ⌊V2⌋    evalbopND(bop, sz, V1, V2) = V3
  ------------------------------------------------------------
  mod, g, θ ⊢ M, ((fid, l, (id = bop (int sz) val1 val2, c), tmn, ∆, α), Σ) ⟶ M, ((fid, l, c, tmn, ∆{id ← V3}, α), Σ)

Figure 6. LLVMND: Small-step, nondeterministic semantics of the LLVM IR (selected rules).
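Rule NDS BR TRUE in Figure 6 computes the new local map through computephinodesND, which must evaluate every φ node of the target block against the old bindings. The following Python sketch illustrates that parallel-update idea on the %x/%z example from Section 4.2. It is a simplification under stated assumptions: locals hold single values rather than value sets, and the function names and data layout (`compute_phinodes`, lists of `(dest, incoming)` pairs) are hypothetical, not taken from the Vellvm development.

```python
def compute_phinodes(delta, pred_label, phis):
    """Evaluate all phi nodes against the OLD local map, then commit at once.

    delta: dict mapping variable names to values (single values, for brevity).
    phis:  list of (dest, incoming), where incoming maps a predecessor label
           to an operand (a variable name or an integer constant).
    """
    def eval_operand(op):
        return delta[op] if isinstance(op, str) else op

    # Phase 1: read every incoming value using only the old bindings.
    updates = {dest: eval_operand(incoming[pred_label]) for dest, incoming in phis}
    # Phase 2: install all new bindings together.
    new_delta = dict(delta)
    new_delta.update(updates)
    return new_delta


def compute_phinodes_sequential(delta, pred_label, phis):
    """Incorrect variant: each phi sees bindings written by earlier phis."""
    new_delta = dict(delta)
    for dest, incoming in phis:
        op = incoming[pred_label]
        new_delta[dest] = new_delta[op] if isinstance(op, str) else op
    return new_delta


# %x = phi i32 [ %z, %blk ], [ 0, %pred ]
# %z = phi i32 [ %x, %blk ], [ 1, %pred ]
phis = [
    ("%x", {"%blk": "%z", "%pred": 0}),
    ("%z", {"%blk": "%x", "%pred": 1}),
]

entry = compute_phinodes({}, "%pred", phis)               # first entry from %pred
loop = compute_phinodes(entry, "%blk", phis)              # back edge, parallel update
wrong = compute_phinodes_sequential(entry, "%blk", phis)  # back edge, sequential update
```

With the parallel update the back edge swaps the two values, whereas the sequential variant lets the second φ node observe the freshly written %x, which is the behavior the semantics has to rule out.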

Nondeterminism shows up in two ways in the LLVMND semantics. First, stack frames bind local variables to sets of values V; second, the transition relation itself may relate one state to many possible successors. The semantics teases apart these two kinds of nondeterminism because of the way that the undef value interacts with memory operations, as illustrated by the examples below.
From the LLVM Language Reference Manual: “Undefined values indicate to the compiler that the program is well defined no matter what value is used, giving the compiler more freedom to optimize.” Semantically, LLVMND treats undef as the set of all values of a given type. For some motivating examples, consider the following code fragments:

    (a) %z = xor i8 undef undef

    (b) %x = add i8 0 undef
        %z = xor i8 %x %x

    (c) %z = or i8 undef 1

    (d) br undef %l1 %l2

The value computed for %z in example (a) is the set of all 8-bit integers: because each occurrence of undef could take on any bit pattern, the set of possible results obtained by xoring them still includes all 8-bit integers. Perhaps surprisingly, example (b) computes the same set of values for %z: one might reason that no matter which value is chosen for undef, the result of xoring %x with itself would always be 0, and therefore %z should always be 0. However, while that answer is compatible with the LLVM language reference (and hence allowed by the nondeterministic semantics), it is also safe to replace code fragment (b) with %z = undef. The reason is that the LLVM IR adopts a liberal substitution principle: because %x = undef would be a legitimate replacement for the first assignment in (b), it is allowed to substitute undef for %x throughout, which reduces the assignment to %z to the same code as in (a).
Example (c) shows why the semantics needs arbitrary sets of values. Here, %z evaluates to the set of odd 8-bit integers, which is the result of oring 1 with each element of the set {0, . . . , 255}. This code snippet could therefore not safely be replaced by %z = undef; however it could be optimized to %z = 1 (or any other odd 8-bit integer).
Example (d) illustrates the interaction between the set-semantics for local values and the nondeterminism of the transition relation. The control state of the machine holds definite information, so when a branch occurs, there may be multiple successor states. Similarly, we choose to model memory cells as holding definite values, so when writing a set to memory, there is one successor state for each possible value that could be written. As an example of that interaction, consider the following program, which was posted to the LLVMdev mailing list, that reads from an uninitialized memory location:

    %buf = alloca i32
    %val = load i32* %buf
    store i32 10, i32* %buf
    ret %val

The LLVM mem2reg pass optimizes this program to program (a) below; though according to the LLVM semantics, it would also be admissible to replace this program with option (b) (perhaps to expose yet more optimizations):

    (a) ret i32 10        (b) ret i32 undef
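The set-based reading of undef in examples (a) through (c) can be checked with a small Python sketch. It models a value set as a frozenset of integers and a binary operation as the combination of all pairs drawn from the two input sets, in the spirit of evalbopND. The helper names (`undef`, `const`, `evalbop_nd`) are assumptions made for this illustration and do not come from the Vellvm sources.

```python
import operator


def undef(bits):
    """undef at an integer type: the set of all bit patterns of that width."""
    return frozenset(range(2 ** bits))


def const(n):
    """A non-undef value is a singleton set."""
    return frozenset({n})


def evalbop_nd(op, bits, v1, v2):
    """Apply op to every pair drawn from the input sets, truncated to `bits` bits."""
    mask = 2 ** bits - 1
    return frozenset(op(a, b) & mask for a in v1 for b in v2)


ALL8 = undef(8)

# (a) %z = xor i8 undef undef
z_a = evalbop_nd(operator.xor, 8, undef(8), undef(8))

# (b) %x = add i8 0 undef ; %z = xor i8 %x %x
x_b = evalbop_nd(operator.add, 8, const(0), undef(8))
z_b = evalbop_nd(operator.xor, 8, x_b, x_b)

# (c) %z = or i8 undef 1
z_c = evalbop_nd(operator.or_, 8, undef(8), const(1))
```

Because the two uses of %x in (b) are drawn independently from the set bound to %x, `z_b` is the full set of 8-bit integers rather than {0}, while `z_c` contains only the 128 odd values.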

4.2 Nondeterministic operational semantics of the SSA form
The LLVMND semantics we have developed for Vellvm (and the others described below) is parameterized by a configuration, which is a triple of a module containing the code, a (partial) map g that gives the values of global constants, and a function pointer table θ that is a (partial) map from values to function identifiers (see the top of Figure 6). The globals and function pointer maps are initialized from the module definition when the machine is started.
The LLVMND rules relate machine states to machine states, where a machine state takes the form of a memory M (from Section 3) and a stack of evaluation frames. The frames keep track of the (sets of) values bound to locally-allocated temporaries and which instructions are currently being evaluated. Figure 6 shows a selection of evaluation rules from the development.
Most of the commands of the LLVM have a straightforward interpretation: the arithmetic, logic, and data manipulation instructions are all unsurprising. The evalND function computes a set of flattened values from the global state, the local state, and an LLVM val, looking up the meanings of variables in the local state as needed; similarly, evalbopND implements binary operations, computing the result set by combining all possible pairs drawn from its input sets. LLVMND’s malloc behaves as described in Section 3, while load uses the memory model’s ability to detect ill-typed and uninitialized reads and, in the case of such errors, yields undef as the result. Function calls push a new stack frame whose initial local bindings are computed from the function parameters. The α component of the stack frame keeps track of which blocks of memory are created by the alloca instruction (see rule NDS ALLOCA); these are freed when the function returns (rule NDS RET).
There is one other wrinkle in specifying the operational semantics when compared to a standard environment-passing call-by-value language. All of the φ instructions for a block must be executed atomically and with respect to the “old” local value mapping, due to the possibility of self loops and dependencies among the φ nodes. For example, the well-formed code fragment below has a circular dependency between %x and %z:

    blk:
      %x = phi i32 [ %z, %blk ], [ 0, %pred ]
      %z = phi i32 [ %x, %blk ], [ 1, %pred ]
      %b = icmp leq %x %z
      br %b %blk %succ

If control enters this block from %pred, %x will map to 0 and %z to 1, which causes the conditional branch to succeed, jumping back to the label %blk. The new values of %x and %z should be 1 and 0, and not 1 and 1, as might be computed if they were handled sequentially. This update of the local state is handled by the computephinodesND function in the operational semantics, as shown, for example, in rule NDS BR TRUE.

4.3 Partiality, preservation, and progress
Throughout the rules the “lift” notation f(x) = ⌊v⌋ indicates that a partial function f is defined on x with value v. As seen by the frequent uses of lifting, both the nondeterministic and deterministic semantics are partial: the program may get stuck.
Some of this partiality is related to well-formedness of the SSA program. For example, evalND(g, ∆, %x) is undefined if %x is not bound in ∆. These kinds of errors are ruled out by the static well-formedness constraints imposed by the LLVM IR (Section 2).
In other cases, we have chosen to use partiality in the operational semantics to model certain failure modes for which the LLVM specification says that the behavior of the program is undefined. These include: (1) attempting to free memory via a pointer not returned from malloc or that has already been deallocated, (2) allocating a negative amount of memory, (3) calling load or store on a pointer with bad alignment or a deallocated address, (4) trying to call a non-function pointer, or (5) trying to execute the unreachable command. We model these events by stuck states because they correspond to fatal errors that will occur in any reasonable realization of the LLVM IR by translation to a target platform. Each of these errors is precisely characterized by a predicate over the machine state (e.g., BadFree(config, S)), and the “allowed” stuck states are defined to be the disjunction of these predicates:

    Stuck(config, S) = BadFree(config, S)
                     ∨ BadLoad(config, S)
                     ∨ ...
                     ∨ Unreachable(config, S)

To see that the well-formedness properties of the static semantics rule out all but these known error configurations, we prove the usual preservation and progress theorems for the LLVMND semantics.

THEOREM 2 (Preservation for LLVMND). If (config, S) is well formed and config ⊢ S ⟶ S′, then (config, S′) is well formed.

Here, well-formedness includes the static scoping, typing properties, and SSA invariants from Section 2 for the LLVM code, but also requires that the local mappings ∆ present in all frames of the call stack must be inhabited (each binding contains at least one value v) and that each defined variable that dominates the current continuation is in ∆’s domain.
To show that the ∆ bindings are inhabited after the step, we prove that (1) non-undef values V are singletons; (2) undefined values from constants typ undef contain all possible values of first class types typ; (3) undefined values from loading uninitialized memory or incompatible physical data contain at least paddings indicating errors; (4) evaluation of non-deterministic values by evalbopND returns non-empty sets of values given non-empty inputs.
The difficult part of showing that defined variables dominate their uses in the current continuation is proving that control-transfers maintain the dominance property [20]. If a program branches from a block b1 to b2, the first command in b2 can use either the falling-through variables from b1, which must be defined in ∆ by Lemma 1, or the variables updated by the φs at the beginning of b2. This latter property requires a lemma showing that computephinodesND behaves as expected.

THEOREM 3 (Progress for LLVMND). If the pair (config, S) is well formed, then either S has terminated successfully or Stuck(config, S) or there exists S′ such that config ⊢ S ⟶ S′.

This theorem holds because in a well-formed machine state, evalND always returns a non-empty value set V; moreover jump targets and internal functions are always present.

4.4 Deterministic refinements
Although the LLVMND semantics is useful for reasoning about the validity of LLVM program transformations, Vellvm also provides LLVMD, a deterministic, small-step refinement, along with two large-step operational semantics LLVM∗DFn and LLVM∗DB.
These different deterministic semantics are useful for several reasons: (1) they provide the basis for testing LLVM programs with a concrete implementation of memory (see the discussion about Vellvm’s extracted interpreter in the next Section), (2) proving that LLVMD is an instance of the LLVMND and relating the small-step rules to the large-step ones provides validation of all of the semantics (i.e., we found bugs in Vellvm by formalizing multiple semantics and trying to prove that they are related), and (3) the
small- and large-step semantics have different applications when reasoning about LLVM program transformations.
Unlike LLVMND, the frames for these semantics map identifiers to single values, not sets, and the operational rules call deterministic variants of the nondeterministic counterparts (e.g., eval instead of evalND). To resolve the nondeterminism from undef and faulty memory operations, these semantics fix a concrete interpretation as follows:
• undef is treated as a zeroinitializer
• Reading uninitialized memory returns zeroinitializer
These choices yield unrealistic behaviors compared to what one might expect from running an LLVM program against a C-style runtime system, but the cases where this semantics differs correspond to unsafe programs. There are still many programs, namely those compiled to LLVM from type-safe languages, whose behaviors under this semantics should agree with their realizations on target platforms. Despite these differences from LLVMND, LLVMD also has the preservation and progress properties.

Big-step semantics. Vellvm also provides big-step operational semantics LLVM∗DFn, which evaluates a function call as one large step, and LLVM∗DB, which evaluates each sub-block (i.e., the code between two function calls) as one large step. Big-step semantics are useful because compiler optimizations often transform multiple instructions or blocks within a function in one pass. Such transformations do not preserve the small-step semantics, making it hard to create simulations that establish correctness properties.
As a simple application of the large-step semantics, consider trying to prove the correctness of a transformation that re-orders program statements that do not depend on one another. For example, the following two programs result in the same states if we consider their execution as one big step, although their intermediate states do not match in terms of the small-step semantics.

    (a) %x = add i32 %a, %b
        %y = load i32* %p

    (b) %y = load i32* %p
        %x = add i32 %a, %b

The proof of this claim in Vellvm uses the LLVM∗DB rules to hide the details about the intermediate states. To handle memory effects, we use a simulation relation that uses symbolic evaluation [22] to define the equivalence of two memory states. The memory contents are defined abstractly in terms of the program operations by recording the sequence of writes. Using this technique, we defined a simple translation validator to check whether the semantics of two programs are equivalent with respect to such re-orderings. For each pair of functions, the validator ensures that their control-flow graphs match, and that all corresponding sub-blocks are equivalent in terms of their symbolic evaluation. This approach is similar to the translation validation used in prior work for verifying instruction scheduling optimizations [32].
Although this is a simple application of Vellvm’s large-step semantics, proving correctness of other program transformations such as dead expression elimination and constant propagation follows a similar pattern; the difference is that, rather than checking that two memories are syntactically equivalent according to the symbolic evaluation, we must check them with respect to a more semantic notion of equivalence [22].

Relationships among the semantics. Figure 5 illustrates how these various operational semantics relate to one another. Vellvm provides proofs that LLVM∗DB simulates LLVM∗DFn and that LLVM∗DFn simulates LLVMD. In these proofs, simulation is taken to mean that the machine states are syntactically identical at corresponding points during evaluation. For example, the state at a function call of a program running on the LLVM∗DFn semantics matches the corresponding state at the function call reached in LLVMD. Note that in the deterministic setting, one-direction simulation implies bisimulation [18]. Moreover, LLVMD is a refinement instance of the nondeterministic LLVMND semantics.
These relations are useful because the large-step semantics induce different proof styles than the small-step semantics: in particular, the induction principles obtained from the large-step semantics allow one to gloss over insignificant details of the small-step semantics.

5. Vellvm Infrastructure and Validation
This section briefly describes the Coq implementation of Vellvm and its related tools for interacting with the LLVM infrastructure. It also describes how we validate the Vellvm semantics by extracting an executable interpreter and comparing its behavior to the LLVM reference interpreter.

5.1 The Coq development
Vellvm encodes the abstract syntax from Section 2 in an entirely straightforward way using Coq’s inductive datatypes (generated in a preprocessing step via the Ott [27] tool). The implementation uses Penn’s Metatheory library [4], which was originally designed for the locally nameless representation, to represent identifiers of the LLVM, and to reason about their freshness.
The Coq representation deviates from the full LLVM language in only a few (mostly minor) ways. In particular, the Coq representation requires that some type annotations be in normal form (e.g., the type annotation on load must be a pointer), which simplifies type checking at the IR level. The Vellvm tool that imports LLVM bitcode into Coq provides such normalization, which simply expands definitions to reach the normal form. In total, the syntax and static semantics constitute about 2500 lines of Coq definitions and proof scripts.
Vellvm’s memory model implementation extends CompCert’s with approximately 5000 lines of code to support integers with arbitrary precision, padding, and an experimental treatment of casts that has not yet been needed for any of our proofs. On top of this extended memory model, all of the operational semantics and their metatheory have been proved in Coq. In total, the development represents approximately 32,000 lines of Coq code. Checking the entire Vellvm implementation using coqc takes about 13.5 minutes on a 1.73 GHz Intel Core i7 processor with 8 GB RAM. We expect that this codebase could be significantly reduced in size by refactoring the proof structure and making it more modular.
The LLVM distribution includes primitive OCaml bindings that are sufficient to generate LLVM IR code (“bitcode” in LLVM jargon) from OCaml. To convert between the LLVM bitcode representation and the extracted OCaml representation, we implemented a library consisting of about 5200 lines of OCaml-LLVM bindings. This library also supports pretty-printing of the ASTs; this code was also useful in the extracted interpreter.

Omitted details. This paper does not discuss all of the LLVM IR features that the Vellvm Coq development supports. Most of these features are uninteresting technically but necessary to support real LLVM code: (1) the LLVM IR provides aggregate data operations (extractvalue and insertvalue) for projecting and updating the elements of structures and arrays; (2) the operational semantics supports external function calls by assuming that their behavior is specified by axioms; the implementation applies these axioms to transition program states upon calling external functions; (3) the LLVM switch instruction, which is used to compile jump tables, is lowered to the normal branch instructions that Vellvm supports by an LLVM-supported pre-processing step.

Unsupported features. Some features of LLVM are not supported by Vellvm. First, the LLVM provides intrinsic functions for extending LLVM or to represent functions that have well known names and semantics and are required to follow certain restrictions, for example, functions from standard C libraries, handling variable argument functions, etc. Second, the LLVM functions, global variables, and parameters can be decorated with attributes that denote linkage type, calling conventions, data representation, etc., which provide more information to compiler transformations than what the LLVM type system provides. Vellvm does not statically check the well-formedness of these attributes, though they should be obeyed by any valid program transformation. Third, Vellvm does not support the invoke and unwind instructions, which are used to implement exception handling, nor does it support variable argument functions. Fourth, Vellvm does not support vector types, which allow for multiple primitive data values to be computed in parallel using a single instruction.

5.2 Extracting an interpreter
To test Vellvm’s operational semantics for the LLVM IR, we used Coq’s code extraction facilities to obtain an interpreter for executing the LLVM distribution’s regression test suite. Extracting such an interpreter is one of the main motivations for developing a deterministic semantics, because the evaluation under the nondeterministic semantics cannot be directly compared against actual runs of LLVM IR programs.
Unfortunately, the small-step deterministic semantics LLVMD is defined relationally in the logical fragment of Coq, which is convenient for proofs, but cannot be used to extract code. Therefore, Vellvm provides yet another operational semantics, LLVMInterp, which is a deterministic functional interpreter implemented in the computational fragment of Coq. LLVMInterp is proved to be bisimilar to LLVMD, so we can port results between the two semantics.
Although one could run this extracted interpreter directly, doing so is not efficient. First, integers with arbitrary bit-width are inductively defined in Coq. This yields easy proof principles, but does not give an efficient runtime representation; floating point operations are defined axiomatically. To remedy these problems, at extraction, we realize Vellvm’s integer and floating point values by efficient C++ libraries that are a standard part of the LLVM distribution. Second, the memory model implementation of Vellvm maintains memory blocks and their associated metadata as functional lists, and it converts between byte-list and value representations at each memory access. Using the extracted data-structures directly incurs tremendous performance overhead, so we replaced the memory operations of the memory model with native implementations from the C standard library. A value v in local mappings δ is boxed, and it is represented by a reference to memory that stores its content.
Our implementation faithfully runs 134 out of the 145 tests from the LLVM regression suite that lli, the LLVM distribution interpreter, can run. The missing tests cover instructions (like variable arguments) that are not yet implemented in Vellvm.
Although replacing the Coq data-structures by native ones invalidates the absolute correctness guarantees one would expect from an extracted interpreter, this exercise is still valuable. In the course of carrying out this experiment, we found one severe bug in the semantics: the br instruction inadvertently swapped the true and false branches.

6. Verified SoftBound
SoftBound [21] is a previously proposed program transformation that hardens C programs against spatial memory safety violations (e.g., buffer overflows, array indexing errors, and pointer arithmetic errors). SoftBound works by first compiling C programs into the LLVM IR, and then instrumenting the program with instructions that propagate and check per-pointer metadata. SoftBound maintains base and bound metadata with each pointer, shadowing loads and stores of pointers with parallel loads and stores of their associated metadata. This instrumentation ensures that each pointer dereferenced is within bounds and aborts the program otherwise.
The original SoftBound paper includes a mechanized proof that validates the correctness of this idea, but it is not complete. In particular, the proof is based on a subset of a C-like language with only straight-line commands and non-aggregate types, while a real SoftBound implementation needs to consider all of the LLVM IR shown in Figure 3, the memory model, and the operational semantics of the LLVM. Also, the original proof ensures the correctness only with respect to a specification that the SoftBound instrumentation must implement, but does not prove the correctness of the instrumentation pass itself. Moreover, the specification requires that every temporary must contain metadata, not just pointer temporaries.

Using Vellvm to verify SoftBound. This section describes how we use Vellvm to formally verify the correctness of the SoftBound instrumentation pass with respect to the LLVM semantics, demonstrating that the promised spatial memory safety property is achieved. Moreover, Vellvm allows us to extract a verified OCaml implementation of the transformation from Coq. The end result is a compiler pass that is formally verified to transform a program in the LLVM IR into a program augmented with sufficient checking code such that it will dynamically detect and prevent all spatial memory safety violations.
SoftBound is a good test case for the Vellvm framework. It is a non-trivial translation pass that nevertheless only inserts code, thereby making it easier to prove correct. SoftBound’s intended use is to prevent security vulnerabilities, so bugs in its implementation can potentially have severe consequences. Also, the existing SoftBound implementation already uses the LLVM.

Modifications to SoftBound since the original paper. As described in the original paper, SoftBound modifies function signatures to pass metadata associated with the pointer parameters or returned pointers. To improve the robustness of the tool, we transitioned to an implementation that instead passes all pointer metadata on a shadow stack. This has two primary advantages. The first is that this design simplifies the implementation while simultaneously better supporting indirect function calls (via function pointers) and more robustly handling improperly declared function prototypes. The second is that it also simplifies the proofs.

6.1 Formalizing SoftBound for the LLVM IR
The SoftBound correctness proof has the following high-level structure:

1. We define a nonstandard operational semantics SBspec for the LLVM IR. This semantics “builds in” the safety properties that should be enforced by a correct implementation of SoftBound. It uses meta-level datastructures to implement the metadata and meta-level functions to define the semantics of the bounds checks.
2. We prove that an LLVM program P, when run on the SBspec semantics, has no spatial safety violations.
3. We define a translation pass SBtrans(−) that instruments the LLVM code to propagate metadata.
4. We prove that if SBtrans(P) = ⌊P′⌋, then P′, when run on LLVMD, simulates P running on SBspec.

The SoftBound specification. Figure 7 gives the program configurations and representative rules for the SBspec semantics. SBspec behaves the same as the standard semantics except that it creates, propagates, and checks metadata of pointers in the appropriate instructions.

Nondeterministic rules:
  Metadata md ::= [v1, v2)    Memory metadata MM ::= blk.ofs ↦ md    Frames Σ̂ ::= fid, l, c, tmn, ∆, µ, α
  Call stacks Σ̂ ::= [] | Σ̂, Σ̂    Local metadata µ ::= id ↦ md    Program states Ŝ ::= M, MM, Σ̂

SB MALLOC
  evalND(g, ∆, val) = ⌊V⌋    v ∈ V    c0 = (id = malloc typ val align)
  malloc(M, typ, v, align) = ⌊M′, blk⌋    µ′ = µ{id ← [blk.0, blk.(sizeof typ × v))}
  ------------------------------------------------------------
  mod, g, θ ⊢ M, MM, ((fid, l, (c0, c), tmn, ∆, µ, α), Σ̂) ⟶ M′, MM, ((fid, l, c, tmn, ∆{id ← {blk.0}}, µ′, α), Σ̂)

SB LOAD
  evalND(g, ∆, val) = ⌊V⌋    v ∈ V    c0 = (id = load (typ∗) val align)
  findbounds(g, µ, val) = ⌊md⌋    checkbounds(typ, v, md)    load(M, typ, v, align) = ⌊v′⌋
  if isPtrTyp typ then µ′ = µ{id ← findbounds(MM, v)} else µ′ = µ
  ------------------------------------------------------------
  mod, g, θ ⊢ M, MM, ((fid, l, (c0, c), tmn, ∆, µ, α), Σ̂) ⟶ M, MM, ((fid, l, c, tmn, ∆{id ← {|v′|}}, µ′, α), Σ̂)

SB STORE
  evalND(g, ∆, val1) = ⌊V1⌋    v1 ∈ V1    evalND(g, ∆, val2) = ⌊V2⌋    v2 ∈ V2
  c0 = (store typ val1 val2 align)    findbounds(g, µ, val2) = ⌊md⌋    checkbounds(typ, v2, md)
  store(M, typ, v1, v2, align) = ⌊M′⌋    if isPtrTyp typ then MM′ = MM{v2 ← md} else MM′ = MM
  ------------------------------------------------------------
  mod, g, θ ⊢ M, MM, ((fid, l, (c0, c), tmn, ∆, µ, α), Σ̂) ⟶ M′, MM′, ((fid, l, c, tmn, ∆, µ, α), Σ̂)

Deterministic configurations:
  Frames σ̂ ::= fid, l, c, tmn, δ, µ, α    Call stacks σ̂ ::= [] | σ̂, σ̂    Program states ŝ ::= M, MM, σ̂

Figure 7. SBspec: The specification semantics for SoftBound. Differences from the LLVMND rules are highlighted.
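To make the role of the metadata in rules SB MALLOC and SB LOAD more concrete, the following Python sketch models a much simplified version of the bounds check. It rests on several assumptions that are not part of SBspec: pointers are plain integer addresses rather than block/offset pairs, locals hold single values rather than value sets, the allocator is a bump pointer, and all names (`State`, `sb_malloc`, `sb_store`, `sb_load`, `SpatialViolation`) are hypothetical. Metadata for pointers stored in memory (MM) is left out entirely.

```python
class SpatialViolation(Exception):
    """Raised where the specification semantics would abort the program."""


def checkbounds(size, addr, md):
    """True if [addr, addr + size) lies inside the half-open range md = [base, bound)."""
    base, bound = md
    return base <= addr and addr + size <= bound


class State:
    def __init__(self):
        self.mem = {}        # M: address -> stored value
        self.delta = {}      # locals: id -> value (pointers are ints in this sketch)
        self.mu = {}         # local metadata: id -> (base, bound)
        self.next_free = 8   # bump allocator start (arbitrary choice for the sketch)


def sb_malloc(s, dest, elem_size, count):
    """Allocate count elements and record [base, base + elem_size * count) for dest."""
    base = s.next_free
    s.next_free += elem_size * count
    s.delta[dest] = base
    s.mu[dest] = (base, base + elem_size * count)


def sb_store(s, value, ptr_id, size):
    """Store through ptr_id after checking the access against its metadata."""
    addr = s.delta[ptr_id]
    if not checkbounds(size, addr, s.mu[ptr_id]):
        raise SpatialViolation(f"store of {size} bytes at address {addr}")
    s.mem[addr] = value


def sb_load(s, dest, ptr_id, size):
    """Load through ptr_id after checking the access against its metadata."""
    addr = s.delta[ptr_id]
    if not checkbounds(size, addr, s.mu[ptr_id]):
        raise SpatialViolation(f"load of {size} bytes at address {addr}")
    s.delta[dest] = s.mem.get(addr)  # None stands in for an uninitialized read


s = State()
sb_malloc(s, "%p", elem_size=4, count=4)   # metadata for %p is [8, 24)
sb_store(s, 10, "%p", size=4)
sb_load(s, "%v", "%p", size=4)             # in bounds

# A pointer one element past the end, keeping the bounds of %p (assumed here).
s.delta["%q"] = s.delta["%p"] + 16
s.mu["%q"] = s.mu["%p"]
try:
    sb_load(s, "%w", "%q", size=4)
    caught = False
except SpatialViolation:
    caught = True
```

The in-bounds load succeeds and the load through `%q` is rejected before memory is touched, which mirrors how the checkbounds premise guards the load and store premises in the rules.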

A program state Ŝ is an extension of the standard program state S for maintaining metadata md, which is a pair defining the start and end address for a pointer: µ in each function frame Σ̂ maps temporaries of pointer type to their metadata; MM is the shadow heap that stores metadata for pointers in memory. Note that although the specification is nondeterministic, the metadata is deterministic. Therefore, a pointer loaded from uninitialized memory space can be undef, but it cannot have arbitrary md (which might not be valid).
SBspec is correct if a program P must either abort on detecting a spatial memory violation with respect to the SBspec, or preserve the LLVM semantics of the original program P; and, moreover, P is not stuck by any spatial memory violation in the SBspec (i.e., SBspec must catch all spatial violations).

DEFINITION 1 (Spatial safety). Accessing a memory location at the offset ofs of a block blk is spatially safe if blk is less than the next fresh block N, and ofs is within the bounds of blk:

    blk < N ∧ (B(blk) = ⌊size⌋ → 0 ≤ ofs < size)

The legal stuck states of SoftBound, StuckSB(config, Ŝ), include all legal stuck states of LLVMND (recall Section 4.3) except the states that violate spatial safety. The case when B does not map blk to some size indicates that blk is not valid, and pointers into the blk are dangling; this indicates a temporal safety error that is not prevented by SoftBound and therefore it is included in the set of legal stuck states.
Because the program states of a program in the LLVMND semantics are identical to the corresponding parts in the SBspec, it is easy to relate them: let Ŝ ⊇◦ S mean that common parts of the SoftBound state Ŝ and S are identical. Because memory instructions in the SBspec may abort without accessing memory, the first part of correctness is by a straightforward simulation relation between states of the two semantics.

THEOREM 4 (SBspec simulates LLVMND). If the state Ŝ ⊇◦ S, and config ⊢ Ŝ ⟶ Ŝ′, then there exists a state S′, such that config ⊢ S ⟶ S′, and Ŝ′ ⊇◦ S′.

The second part of the correctness is proved by the following preservation and progress theorems.

THEOREM 5 (Preservation for SBspec). If (config, Ŝ) is well formed, and config ⊢ Ŝ ⟶ Ŝ′, then (config, Ŝ′) is well formed.

Here, SBspec well-formedness strengthens the invariants for LLVMND by requiring that if any id defined in ∆ is of pointer type, then µ contains its metadata and a spatial safety invariant: all bounds in µs of function frames and MM must be memory ranges within which all memory addresses are spatially safe.
The interesting part is proving that the spatial safety invariant is preserved. It holds initially, because a program’s initial frame stack is empty, and we assume that MM is also empty. The other cases depend on the rules in Figure 7.
The rule SB MALLOC, which allocates the number v of elements with typ at a memory block blk, updates the metadata of id with the start address that is the beginning of blk, and the end address that is at the offset blk.(sizeof typ × v) in the same block. LLVM’s memory model ensures that the range of memory is valid.
The rule SB LOAD reads from a pointer val with runtime data v, finds the md of the pointer, and ensures that v is within the md via checkbounds. If the val is an identifier, findbounds simply returns the identifier’s metadata from µ, which must be a spatially safe memory range. If val is a constant of pointer type, findbounds returns bounds as follows. For global pointers, findbounds returns bounds derived from their types because globals must be allocated before a program starts. For pointers converted from some constant integers by inttoptr, it conservatively returns the bounds [null, null) to indicate a potentially invalid memory range. For a pointer cnst1 derived from another constant pointer cnst2 by bitcast or getelementptr, findbounds returns the same bound of cnst2 for cnst1. Note that {|v′|} denotes conversion from a deterministic value to a nondeterministic value.
If the load reads a pointer-typed value v from memory, the rule finds its metadata in MM and updates the local metadata mapping µ. If MM does not contain any metadata indexed by

v, that means the pointer being loaded was not stored with valid bounds, so findbounds returns [null, null) to ensure the spatial safety invariant. Similarly, the rule SB STORE checks whether the address to be stored to is in bounds and, if storing a pointer, updates MM accordingly. SoftBound disallows dereferencing a pointer that was converted from an integer, even if that integer was originally obtained from a valid pointer. Following the same design choice, findbounds returns [null, null) for pointers cast from integers. checkbounds fails when a program accesses such pointers.

THEOREM 6 (Progress for SBspec). If Ŝ₁ is well-formed, then either Ŝ₁ is a final state, or Ŝ₁ is a legal stuck state, or there exists a Ŝ₂ such that config ⊢ Ŝ₁ ⟶ Ŝ₂.

This theorem holds because all the bounds in a well-formed SBspec state give memory ranges that are spatially safe, so if checkbounds succeeds, the memory access must be spatially safe.

The correctness of the SoftBound instrumentation. Given SBspec, we designed an instrumentation pass in Coq. For each function of an original program, the pass implements µ by generating two fresh temporaries for every temporary of pointer type to record its bounds. For manipulating metadata stored in MM, the pass axiomatizes a set of interfaces that manage a disjoint metadata space with specifications for their behaviors.

Figure 8 pictorially shows the simulation relations ≃◦ between an original program P in the semantics of SBspec and its transformed program P′ in the LLVM semantics. First, because P′ needs additional memory space to store metadata, we need a mapping mi that maps each allocated memory block in M to a memory block in M′ without overlap, but allows M′ to have additional blocks for metadata, as shown in dashed boxes. Note that we assume the two programs initialize globals identically. Second, basic values are related in terms of the mapping between blocks: pointers are related if they refer to corresponding memory locations; other basic values are related if they are the same. Two values are related if they are of the same length and the corresponding basic values are related.

Using the value simulations, ≃◦ defines a simulation for memory and stack frames. Given two related memory locations blk.ofs and blk′.ofs′, their contents in M and M′ must be related; if MM maps blk.ofs to the bound [v₁, v₂), then the additional metadata space in M′ must store v₁′ and v₂′ that relate to v₁ and v₂ for the location blk′.ofs′. For each pair of corresponding frames in the two stacks, ∆ and ∆′ must store related values for the same temporary; if µ maps a temporary id to the bound [v₁, v₂), then ∆′ must store the related bound in the fresh temporaries for the id.

Figure 8. Simulation relations of the SoftBound pass (two panels: memory simulation and frame simulation).

THEOREM 7. Given a state ŝ₁ of P with configuration config and a state s₁′ of P′ with configuration config′, if ŝ₁ ≃◦ s₁′, and config ⊢ ŝ₁ ⟶ ŝ₂, then there exists a state s₂′, such that config′ ⊢ s₁′ ⟶* s₂′, and ŝ₂ ≃◦ s₂′.

Here, config ⊢ ŝ₁ ⟶ ŝ₂ is a deterministic SBspec that, as in Section 4, is an instance of the non-deterministic SBspec.

The correctness of SoftBound

THEOREM 8 (SoftBound is correct). Let SBtrans(P) = ⌊P′⌋ denote that the SoftBound pass instruments a well-formed program P to be P′. A SoftBound instrumented program P′ either aborts on detecting spatial memory violations or preserves the LLVM semantics of the original program P. P′ is not stuck by any spatial memory violation.

6.2 Extracted verified implementation of SoftBound

The above formalism shows that the SoftBound transformation enforces the promised safety properties. In addition, the Vellvm framework allows us to extract a translator directly from the Coq code, resulting in a verified implementation of the SoftBound transformation. The extracted implementation uses the same underlying shadowspace implementation and wrapped external functions as the non-extracted SoftBound transformation written in C++. The only aspect not handled by the extracted transformation is initializing the metadata for pointers in the global segment that are non-NULL initialized (i.e., they point to another variable in the global segment). Without initialization, valid programs can be incorrectly rejected as erroneous. Thus, we reuse the code from the C++ implementation of SoftBound to properly initialize these variables.

Effectiveness. To measure the effectiveness of the extracted implementation of SoftBound versus the C++ implementation, we tested both implementations on the same programs. To test whether the implementations detect spatial memory safety violations, we used 1809 test cases from the NIST Juliet test suite of C/C++ codes [23]. We chose the test cases that exercise buffer overflows on both the heap and the stack. Both implementations of SoftBound correctly detected all the buffer overflows without any false violations. We also confirmed that both implementations properly detected the buffer overflow in the go SPEC95 benchmark. Finally, the extracted implementation is robust enough to successfully transform and execute (without false violations) several applications selected from the SPEC95, SPEC2000, and SPEC2006 suites (around 110K lines of C code in total).

Performance overheads. Unlike the C++ implementation of SoftBound, which removes some obviously redundant checks, the extracted implementation of SoftBound performs no SoftBound-specific optimizations. In both cases, the same suite of standard LLVM optimizations is applied post-transformation to reduce the overhead of the instrumentation. To determine the performance impact on the resulting program, Figure 9 reports the execution time overheads (lower is better) of extracted SoftBound (leftmost bar of each benchmark) and the C++ implementation (rightmost bar of each benchmark) for various benchmarks from SPEC95, SPEC2000 and SPEC2006. Because of the check elimination optimization performed by the C++ implementation, its code is slightly faster, but overall the extracted implementation provides similar performance.

Figure 9. Execution time overhead of the extracted and the C++ version of SoftBound (bar chart; y-axis: runtime overhead, 0% to 250%; series: Extracted, C++ SoftBound).

Bugs found in the original SoftBound implementation. In the course of formalizing the SoftBound transformation, we discovered two implementation bugs in the original C++ implementation of SoftBound. First, when one of the incoming values of a φ node with pointer type is an undef, undef was propagated as its base and bound. Subsequent compiler transformations may instantiate the undefined base and bound with defined values that allow the checkbounds to succeed, which would lead to a memory violation. Second, the base and bound of the constant pointer (typ∗) null were set to (typ∗) null and (typ∗) null + sizeof(typ), allowing dereferences of null or of pointers pointing to an offset from null. Either of these bugs could have resulted in faulty checking and thus exposed the program to the spatial violations that SoftBound was designed to prevent. These bugs underscore the importance of a formally verified and extracted implementation.

7. Related Work

Mechanized language semantics. There is a large literature on formalizing language semantics and reasoning about the correctness of language implementations. Prominent examples include: Foundational Proof Carrying Code [2], Foundational Typed Assembly Language [11], Standard ML [12, 30], and (a substantial subset of) Java [15].

Verified compilers. Compiler verification has a considerable history; see the bibliography [18] for a comprehensive overview. Other research has also used Coq for compiler verification tasks, including much recent work on compiling functional source languages to assembly [5, 8, 9].

Vellvm is closer in spirit to CompCert [18], which was the first fully-verified compiler to generate compact and efficient assembly code for a large fragment of the C language. CompCert also uses Coq. It formalizes the operational semantics of CompCert C, several intermediate languages used in the compilation, and assembly languages including PowerPC, ARM and x86. The latest version of CompCert also provides an executable reference interpreter for the semantics of CompCert C. Based on the formalized semantics, the CompCert project fully proves that all compiler phases produce programs that preserve the semantics of the original program. Optimization passes include local value numbering, constant propagation, coalescing graph coloring register allocation [6], and other back-end transformations. CompCert has also certified some advanced compiler optimizations [32–34] using translation validation [22, 26]. The XCERT project [29, 31] extends the CompCert compiler by a generic translation validator based on SMT solvers.

Other mechanization efforts. The verified software tool-chain project [3] assures that the machine-checked proofs claimed at the top of the tool-chain hold in the machine language program. Typed assembly languages [7] provide a platform for proving back-end optimizations. Similarly, the Verisoft project [1] also attempts to mathematically prove the correct functionality of systems in automotive engineering and security technology. ARMor [37] guarantees control flow integrity for application code running on embedded processors. The Rhodium project [17] uses a domain specific language to express optimizations via local rewrite rules and provides a soundness checker for optimizations.

Validating LLVM optimizations. The CoVac project [36] develops a methodology that adapts existing program analysis techniques to the setting of translation validation, and reports on a prototype tool that applies their methodology to verification of the LLVM compiler. The LLVM-MD project [35] validates LLVM optimizations by symbolic evaluation. The Peggy tool performs translation validation for the LLVM compiler using a technique called equality saturation [28]. These applications are not fully certified.

8. Conclusion

Although we do not consider it in this paper, our intention is that the Vellvm framework will serve as a first step toward a fully-verified LLVM compiler, similar to that of Leroy et al.'s CompCert [18]. Our Coq development extends some of CompCert's libraries and our LLVM memory model is based on CompCert's memory model. The focus of this paper is the LLVM IR semantics itself, the formalization of which is a necessary step toward a fully-verified LLVM compiler. Because much of the complexity of an LLVM-based compiler lies in the IR to IR transformation passes, formalizing correctness properties at this level stands to yield a significant payoff, as demonstrated by our SoftBound case study, even without fully verifying a compiler.

Acknowledgments

This research was funded in part by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. This research was funded in part by DARPA contract HR0011-10-9-0008 and ONR award N000141110596.

This material is based upon work supported by the National Science Foundation under Grant No. CNS-1116682, CCF-1065166, and CCF-0810947. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] E. Alkassar and M. A. Hillebrand. Formal functional verification of device drivers. In VSTTE '08: Proceedings of the 2nd International Conference on Verified Software: Theories, Tools, Experiments, 2008.
[2] A. W. Appel. Foundational proof-carrying code. In LICS '01: Proceedings of the 16th Annual IEEE Symposium on Logic in Computer Science, 2001.
[3] A. W. Appel. Verified software toolchain. In ESOP '11: Proceedings of the 20th European Conference on Programming Languages and Systems, 2011.
[4] B. Aydemir, A. Charguéraud, B. C. Pierce, R. Pollack, and S. Weirich. Engineering formal metatheory. In POPL '08: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2008.
[5] N. Benton and N. Tabareau. Compiling functional types to relational specifications for low level imperative code. In TLDI '09: Proceedings of the 4th International Workshop on Types in Language Design and Implementation, 2009.
[6] S. Blazy, B. Robillard, and A. W. Appel. Formal verification of coalescing graph-coloring register allocation. In ESOP '10: Proceedings of the 19th European Conference on Programming Languages and Systems, 2010.
[7] J. Chen, D. Wu, A. W. Appel, and H. Fang. A provably sound TAL for back-end optimization. In PLDI '03: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003.
[8] A. Chlipala. A verified compiler for an impure functional language. In POPL '10: Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2010.

[9] A. Chlipala. A certified type-preserving compiler from lambda calculus to assembly language. In PLDI '07: Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, 2007.
[10] The Coq Proof Assistant Reference Manual (Version 8.3pl1). The Coq Development Team, 2011.
[11] K. Crary. Toward a foundational typed assembly language. In POPL '03: Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2003.
[12] K. Crary and R. Harper. Mechanized definition of Standard ML (alpha release), 2009. http://www.cs.cmu.edu/~crary/papers/2009/mldef-alpha.tar.gz.
[13] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., 13:451–490, 1991.
[14] G. A. Kildall. A unified approach to global program optimization. In POPL '73: Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 1973.
[15] G. Klein and T. Nipkow. A machine-checked model for a Java-like language, virtual machine and compiler. ACM Trans. Program. Lang. Syst., 28:619–695, 2006.
[16] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, 2004.
[17] S. Lerner, T. Millstein, E. Rice, and C. Chambers. Automated soundness proofs for dataflow analyses and transformations via local rules. In POPL '05: Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2005.
[18] X. Leroy. A formally verified compiler back-end. Journal of Automated Reasoning, 43(4):363–446, 2009.
[19] The LLVM Reference Manual (Version 2.6). The LLVM Development Team, 2010. http://llvm.org/releases/2.6/docs/LangRef.html.
[20] V. S. Menon, N. Glew, B. R. Murphy, A. McCreight, T. Shpeisman, A.-R. Adl-Tabatabai, and L. Petersen. A verifiable SSA program representation for aggressive compiler optimization. In POPL '06: Proceedings of the 33rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2006.
[21] S. Nagarakatte, J. Zhao, M. M. K. Martin, and S. Zdancewic. SoftBound: Highly compatible and complete spatial memory safety for C. In PLDI '09: Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation, 2009.
[22] G. C. Necula. Translation validation for an optimizing compiler. In PLDI '00: Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, 2000.
[23] NIST Juliet Test Suite for C/C++. NIST, 2010. http://samate.nist.gov/SRD/testCases/suites/Juliet-2010-12.c.cpp.zip.
[24] M. Nita and D. Grossman. Automatic transformation of bit-level C code to support multiple equivalent data layouts. In CC '08: Proceedings of the 17th International Conference on Compiler Construction, 2008.
[25] M. Nita, D. Grossman, and C. Chambers. A theory of platform-dependent low-level software. In POPL '08: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2008.
[26] A. Pnueli, M. Siegel, and E. Singerman. Translation validation. In TACAS '98: Proceedings of the 4th International Conference on Tools and Algorithms for Construction and Analysis of Systems, 1998.
[27] P. Sewell, F. Zappa Nardelli, S. Owens, G. Peskine, T. Ridge, S. Sarkar, and R. Strniša. Ott: Effective tool support for the working semanticist. In ICFP '07: Proceedings of the 9th ACM SIGPLAN International Conference on Functional Programming, 2007.
[28] M. Stepp, R. Tate, and S. Lerner. Equality-Based translation validator for LLVM. In CAV '11: Proceedings of the 23rd International Conference on Computer Aided Verification, 2011.
[29] S. Kundu, Z. Tatlock, and S. Lerner. Proving optimizations correct using parameterized program equivalence. In PLDI '09: Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation, 2009.
[30] D. Syme. Reasoning with the formal definition of Standard ML in HOL. In Sixth International Workshop on Higher Order Logic Theorem Proving and its Applications, 1993.
[31] Z. Tatlock and S. Lerner. Bringing extensibility to verified compilers. In PLDI '10: Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, 2010.
[32] J.-B. Tristan and X. Leroy. Formal verification of translation validators: a case study on instruction scheduling optimizations. In POPL '08: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2008.
[33] J.-B. Tristan and X. Leroy. Verified validation of lazy code motion. In PLDI '09: Proceedings of the ACM SIGPLAN 2009 Conference on Programming Language Design and Implementation, 2009.
[34] J.-B. Tristan and X. Leroy. A simple, verified validator for software pipelining. In POPL '10: Proceedings of the 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2010.
[35] J.-B. Tristan, P. Govereau, and G. Morrisett. Evaluating value-graph translation validation for LLVM. In PLDI '11: Proceedings of the ACM SIGPLAN 2011 Conference on Programming Language Design and Implementation, 2011.
[36] A. Zaks and A. Pnueli. Program analysis for compiler validation. In PASTE '08: Proceedings of the 8th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, 2008.
[37] L. Zhao, G. Li, B. De Sutter, and J. Regehr. ARMor: Fully verified software fault isolation. In EMSOFT '11: Proceedings of the 9th ACM International Conference on Embedded Software, 2011.
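
As an informal illustration of Definition 1 and of the bounds checks discussed for the rules SB MALLOC and SB LOAD, the following Python sketch models a pointer as a (block, offset) pair and metadata as a half-open range [base, bound). It is not taken from the Coq development: the names Bounds, NULL_BOUNDS, spatially_safe, malloc_bounds, and the access_size parameter of checkbounds are assumptions made only for this example, as is the encoding of null as block 0 at offset 0.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical encoding: a pointer is a (block, offset) pair, like blk.ofs.
Pointer = Tuple[int, int]


@dataclass(frozen=True)
class Bounds:
    """Metadata md = [base, bound) attached to a pointer."""
    base: Pointer
    bound: Pointer


NULL: Pointer = (0, 0)            # assumed representation of null
NULL_BOUNDS = Bounds(NULL, NULL)  # the empty range [null, null)


def spatially_safe(blk: int, ofs: int, next_block: int,
                   block_sizes: Dict[int, int]) -> bool:
    """Definition 1: blk < N and (B(blk) = size implies 0 <= ofs < size)."""
    if blk >= next_block:
        return False
    size = block_sizes.get(blk)
    # An unmapped block makes the implication vacuously true: that case is
    # a temporal error, which SoftBound does not claim to prevent.
    return size is None or 0 <= ofs < size


def malloc_bounds(blk: int, elem_size: int, count: int) -> Bounds:
    """Bounds recorded by SB MALLOC: [blk.0, blk.(sizeof typ * v))."""
    return Bounds((blk, 0), (blk, elem_size * count))


def checkbounds(md: Bounds, ptr: Pointer, access_size: int) -> bool:
    """True only if the bytes [ptr, ptr + access_size) lie inside md."""
    blk, ofs = ptr
    same_block = md.base[0] == blk == md.bound[0]
    return same_block and md.base[1] <= ofs and ofs + access_size <= md.bound[1]
```

Under these assumptions, an allocation of ten 4-byte elements in block 3 yields the range [3.0, 3.40); a 4-byte access at offset 36 passes checkbounds, an access at offset 40 fails, and any access of positive size checked against NULL_BOUNDS fails, which mirrors the conservative treatment of pointers cast from integers.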
