PLDI Week 05 Lexing
Week 5: LLVMlite
Basics of Lexical Analysis
Ilya Sergey
ilyasergey.net/CS4212/
This Week
• LLVMlite Specification
• Overview of HW3
• HW3: LLVMlite
– Will be available on Canvas and GitHub on Thursday.
– Due: Sunday, 1 October 2023 at 23:59:59
So, LLVM
Low-Level Virtual Machine (LLVM)
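[Figure: LLVM architecture. Frontends like 'clang' produce LLVM's typed SSA IR; optimisations/transformations and analyses operate on the IR; a backend such as llc performs code generation, or the IR is run via a JIT.]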
Example LLVM Code
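• LLVM offers a textual representation of its IR
– files ending in .ll
factorial-pretty.ll (the LLVM counterpart of factorial64.c):
define @factorial(%n) {
  %1 = alloca
  %acc = alloca
  store %n, %1
  store 1, %acc
  br label %start
start:
  %3 = load %1
  %4 = icmp sgt %3, 0
  br %4, label %then, label %else
then:
  %6 = load %acc
  %7 = load %1
  %8 = mul %6, %7
  store %8, %acc
  %9 = load %1
  %10 = sub %9, 1
  store %10, %1
  br label %start
else:
  %12 = load %acc
  ret %12
}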
Real LLVM
factorial.ll
• Decorates values with type information: i64, i64*
• Has alignment annotations (padding for some specified number of bytes)
• Keeps track of entry edges for each block: preds = %5, %0
; Function Attrs: nounwind ssp
define i64 @factorial(i64 %n) #0 {
  %1 = alloca i64, align 8
  %acc = alloca i64, align 8
  store i64 %n, i64* %1, align 8
  store i64 1, i64* %acc, align 8
  br label %2
; <label>:5 ; preds = %2
  %6 = load i64* %acc, align 8
  %7 = load i64* %1, align 8
  %8 = mul nsw i64 %6, %7
  store i64 %8, i64* %acc, align 8
  %9 = load i64* %1, align 8
  %10 = sub nsw i64 %9, 1
  store i64 %10, i64* %1, align 8
  br label %2
; <label>:11 ; preds = %2
  %12 = load i64* %acc, align 8
  ret i64 %12
}
Example Control-flow Graph
define @factorial(%n) {
entry:
  %1 = alloca
  %acc = alloca
  store %n, %1
  store 1, %acc
  br label %start
start:
  %3 = load %1
  %4 = icmp sgt %3, 0
  br %4, label %then, label %else
then:
  %6 = load %acc
  %7 = load %1
  %8 = mul %6, %7
  store %8, %acc
  %9 = load %1
  %10 = sub %9, 1
  store %10, %1
  br label %start
else:
  %12 = load %acc
  ret %12
}
[Figure: the same function drawn as a control-flow graph with one node per block (entry, start, then, else) and edges entry → start, start → then, start → else, then → start.]
LL Basic Blocks and Control-Flow Graphs
type block = {
insns : (uid * insn) list;
term : (uid * terminator)
}
• A control flow graph is represented as a list of labeled basic blocks with these invariants:
– No two blocks have the same label
– All terminators mention only labels that are defined among the set of basic blocks
– There is a distinguished, unlabelled, entry block:
• Local variables:
– Defined by the instructions of the form %uid = …
– Must satisfy the static single assignment (SSA) invariant
• Each %uid appears on the left-hand side of an assignment only once in the entire control flow graph.
– The value of a %uid remains unchanged throughout its lifetime
– Analogous to “let %uid = e in …” in OCaml
• Full “SSA” allows richer use of local variables by taking the control flow into account
– phi functions (https://en.wikipedia.org/wiki/Static_single-assignment_form)
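The invariants above can be checked mechanically. The following is a minimal sketch, not the course's implementation: uid, lbl, insn, terminator and cfg are simplified, hypothetical stand-ins for the real LL types, and every uid in insns is assumed to name a definition.

```ocaml
(* Hypothetical, simplified stand-ins for the LL types. *)
type uid = string
type lbl = string

type insn = Alloca | Load of uid | Store of uid * uid | Other
type terminator = Ret | Br of lbl | Cbr of uid * lbl * lbl

type block = {
  insns : (uid * insn) list;
  term : uid * terminator;
}

(* Assumed shape: a distinguished, unlabelled entry block plus labelled blocks. *)
type cfg = block * (lbl * block) list

(* true iff the list contains no repeated element *)
let rec distinct = function
  | [] -> true
  | x :: xs -> (not (List.mem x xs)) && distinct xs

(* labels mentioned by the terminator of a block *)
let targets (b : block) : lbl list =
  match snd b.term with
  | Ret -> []
  | Br l -> [ l ]
  | Cbr (_, l1, l2) -> [ l1; l2 ]

let well_formed ((entry, labelled) : cfg) : bool =
  let labels = List.map fst labelled in
  let blocks = entry :: List.map snd labelled in
  let defined = List.concat (List.map (fun b -> List.map fst b.insns) blocks) in
  (* no two blocks have the same label *)
  distinct labels
  (* terminators mention only labels defined in the CFG *)
  && List.for_all
       (fun b -> List.for_all (fun l -> List.mem l labels) (targets b))
       blocks
  (* SSA: each uid is defined at most once in the whole CFG *)
  && distinct defined
```

The check is quadratic in the number of labels and uids, which is fine for illustration; a set or hash table would be the usual choice for large functions.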
LL Storage Model: alloca
• The alloca instruction allocates stack space and returns a reference to it.
– The returned reference is stored in a local:
%ptr = alloca typ
– The amount of space allocated is determined by the type
• The contents of the slot are accessed via the load and store instructions:
[Figure: memory layout of p with fields x and y.]
• Compiler needs to know the size of the struct at compile time to allocate the needed storage space.
• Compiler needs to know the shape of the struct at compile time to index into the structure.
Assembly-level Member Access
[Figure: square laid out in memory as consecutive fields ll.x, ll.y, lr.x, lr.y, ul.x, ul.y, ur.x, ur.y.]
Padding
[Figure: three layouts of a struct with fields x, a, b, y.]
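To make the role of size, shape and padding concrete, here is a small sketch of how field offsets could be computed at compile time. It assumes a C-like rule that is not stated on the slides: every field is aligned to its own size (a power of two) and the struct is padded to a multiple of its largest field. The names align_up and layout are hypothetical.

```ocaml
(* Round n up to the next multiple of align (align > 0). *)
let align_up (n : int) (align : int) : int = (n + align - 1) / align * align

(* Given field sizes in bytes, return (field offsets, total struct size).
   Assumption: natural alignment, i.e. each field is aligned to its size. *)
let layout (sizes : int list) : int list * int =
  let rev_offsets, end_pos =
    List.fold_left
      (fun (offs, pos) sz ->
        let off = align_up pos sz in
        (off :: offs, off + sz))
      ([], 0) sizes
  in
  let max_align = List.fold_left max 1 sizes in
  (List.rev rev_offsets, align_up end_pos max_align)
```

For example, under these assumptions a struct with a 4-byte x, 1-byte a and b, and a 4-byte y gets two bytes of padding before y, so that y starts at offset 8.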
Copy-in/Copy-out
When we do an assignment in C as in:
then we copy all of the elements out of the source and put them
in the target. Same as doing word-level operations:
• For really large copies, the compiler uses something like memcpy
(which is implemented using a loop in assembly).
C Procedure Calls
• Similarly, when we call a procedure, we copy arguments in, and copy results out.
– Caller sets aside extra space in its frame to store results that are bigger than will fit in %rax.
– We do the same with scalar values such as integers or doubles.
• Benefit: locality
• Problem: expensive for large records…
• Languages like Java and OCaml always pass non-word-sized objects by reference.
Representing Data Types
Working with Arrays
Arrays
void foo() {
  char buf[27];
[Figure: array arr in memory: Size=7, then A[0], A[1], A[2], A[3], A[4], A[5], A[6].]
• Other possibilities:
– Pascal: only permit statically known array sizes (very unwieldy in practice)
– What about multi-dimensional arrays?
Array Bounds Checks (Implementation)
• Example: Assume %rax holds the base pointer (arr) and %rcx holds the array index i.
To read a value from the array arr[i]:
char *p = "foo";
p[0] = 'b';
• In C: enum Day {sun, mon, tue, wed, thu, fri, sat} today;
• In OCaml: type day = Sun | Mon | Tue | Wed | Thu | Fri | Sat
• OCaml datatypes can also carry data: type foo = Bar of int | Baz of int * foo
switch (e) {
  case sun: s1; break;
  case mon: s2; break;
  …
  case sat: s3; break;
}
• Each $tag1…$tagN is just a constant int tag value.
• Note: ⟦break;⟧ (within the switch branches) is: br %merge
l1: %cmp1 = icmp eq %tag, $tag1
    br %cmp1 label %b1, label %l2
b1: ⟦s1⟧
    br label %l2
l2: %cmp2 = icmp eq %tag, $tag2
    br %cmp2 label %b2, label %l3
b2: ⟦s2⟧
    br label %l3
…
lN: %cmpN = icmp eq %tag, $tagN
    br %cmpN label %bN, label %merge
bN: ⟦sN⟧
    br label %merge
merge:
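The cascade above follows a regular pattern, so it can be generated by a short function. This is only a sketch that emits the pseudo-LLVM text used on the slide: compile_switch is a hypothetical name, and body i stands in for the compiled code ⟦si⟧ of branch i.

```ocaml
(* Emit the cascade of comparisons for a switch over [tags].
   [body i] is a hypothetical placeholder for the compiled branch i. *)
let compile_switch (tags : int list) (body : int -> string) : string list =
  let n = List.length tags in
  List.concat
    (List.mapi
       (fun k tag ->
         let i = k + 1 in
         (* the last test continues at merge instead of at a next test *)
         let next = if i = n then "merge" else Printf.sprintf "l%d" (i + 1) in
         [
           Printf.sprintf "l%d: %%cmp%d = icmp eq %%tag, %d" i i tag;
           Printf.sprintf "  br %%cmp%d label %%b%d, label %%%s" i i next;
           Printf.sprintf "b%d: %s" i (body i);
           Printf.sprintf "  br label %%%s" next;
         ])
       tags)
  @ [ "merge:" ]
```

The generated code is linear in the number of cases, which is why this scheme is only attractive for small switches.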
Alternatives for Switch Compilation
• Nested if-then-else works OK in practice if # of branches is small
– (e.g. < 16 or so).
• For more branches, use better data structures to organise the jumps:
– Create a table of pairs (v1, branch_label) and loop through
– Or, do binary search rather than linear search
– Or, use a hash table rather than binary search
– Code for each branch additionally must copy data from ⟦e⟧ to the variables bound in the patterns.
• There are many opportunities for optimisations, many papers about “pattern-match compilation”
– Many of these transformations can be done at the AST level
Good place for a break
Datatypes in LLVM IR
Structured Data in LLVM
• LLVM’s IR uses types to describe the structure of data.
t ::=
void
i1 | i8 | i64 N-bit integers
[<#elts> x t] arrays
fty function types
{t1, t2, … , tn} structures
t* pointers
%Tident named (identified) type
insn ::= …
| getelementptr t* %val, t1 idx1, t2 idx2 ,…
• The first index, i32 0, is a “step through” the pointer to, e.g., %square, with offset 0.
• To index into a deeply nested structure, one has to “follow the pointer” by
loading from the computed pointer
Compiling Data Structures via LLVM
define @foo() {
%1 = alloca %rect3 ; allocate a three-field record
%2 = bitcast %rect3* %1 to %rect2* ; safe cast
%3 = getelementptr %rect2* %2, i32 0, i32 1 ; allowed
…
}
LLVMlite Specification
https://ilyasergey.net/CS4212/hw03-llvmlite-spec.html
LLVMlite features
• A C-like “weak type system” to statically rule out some malformed programs.
• A variety of different kinds of integer values, pointers, function pointers, and
structured data including strings, arrays, and structs.
• Top-level mutually-recursive function definitions and function calls as primitives.
• An infinite number of “locals” (also known as “pseudo-registers”, “SSA variables”, or
“temporaries”) to hold intermediate results of computations.
• An abstract memory model that doesn't constrain the layout of data in memory.
• Dynamically allocated memory associated with a function invocation (in C, the stack).
• Static and dynamically (heap) allocated structured data.
• A control-flow graph representation of function bodies.
Syntax
Example
define i64 @fac(i64 %n) { ; (1) function definition, argument prefixed with %
%1 = icmp sle i64 %n, 0 ; (2) signed comparison, result assigned to %1
br i1 %1, label %ret, label %rec ; (3) “terminator”, marks the end of the block
ret: ; (4) label, indicates the beginning of the new block
ret i64 1 ; return the result (1)
rec: ; (5) another block
%2 = sub i64 %n, 1 ; (6) subtract 1 from %n, name result %2
%3 = call i64 @fac(i64 %2) ; (7) call function @fac, assign the result to %3
%4 = mul i64 %n, %3
ret i64 %4 ; (8) return result
}
In the function definition at (1), the i64 annotations declare the return type and the type of the argument %n.
LLVMlite types
• We use T to range over simple and aggregate (non-void, non-function) types, F to range over function types, and S to range over simple types.
Global Definitions
The next kind of top-level definition is global data:
@IDENT = global T G
where G ranges over global initializers, described in the following table, and T is the associated type. The global identifier @IDENT, when used in the program, has type T*. For example, the following program fragment has valid annotations:
@foo = global i64 42
@bar = global i64* @foo
@baz = global i64** @bar

Concrete Syntax            Type           Description
null                       T*             The null pointer constant.
[0-9]+                     i64            64-bit integer literal.
@IDENT                     T*             Global identifier. The type is always a pointer of the type associated with the global definition.
c"[A-z]*\00"               [ N x i8 ]     String literal. The size of the array N should be the length of the string in bytes, including the null terminator \00.
[ T G1, ..., T GN ]        [ N x T ]      Array literal.
{ T1 G1, ..., TN GN }      {T1,...,TN}    Struct literal.
bitcast (T1* G1 to T2*)    T2*            Bitcast.
Operands
We now turn to the parts of a function declaration. Each instruction in a function has zero or more operands which, for the purposes of determining the well-formedness of programs, are restricted to the following types.
Types of instructions
The following table describes the restrictions on the types that may appear as parameters of well-formed instructions, and the constraints on the operands and result of the instruction for the purposes of type-checking. We assume that named types have been replaced by their definitions.
For example, in the call instruction, each type parameter S1, ..., SN must be a simple type. When we type check a program containing this instruction, we must make sure that the operands OP2, ..., OPN have types S2, ..., SN.

Concrete Syntax                                      Operand → Result Types
%L = alloca S                                        - → S*
%L = load S* OP                                      S* → S
call void OP1(S2 OP2, ... ,SN OPN)                   void(S2, ..., SN)* x S2 x ... x SN → void
%L = getelementptr T1* OP1, i32 OP2, ..., i32 OPN    T1* x i64 x ... x i64 → GEPTY(T1, OP1, ..., OPN)*

• GEPTY is a partial function.
• When GEPTY is not defined, the corresponding instruction is malformed. This happens when, for example:
– The list of index operands provided is empty
– An operand used to index a struct is not a constant
– The type is not an aggregate and the list of indices is not empty
• Also notice that a GEP instruction that indexes beyond the size of an array is well-formed. The length information on array tags is only present to help the compiler lay out data in memory and is not verified statically.
Notes on GEP
• Real LLVM requires that constants appearing in getelementptr be declared with type i32:
• LLVMlite ignores the i32 annotation and treats these as i64 values
– we keep the i32 annotation in the syntax to retain compatibility with the clang compiler
– we assume the arguments of getelementptr always fall in the range [0, Int32.max_int].
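The conditions under which GEPTY is undefined can be read off a direct recursive definition. The sketch below is an illustration under assumptions, not the specification's definition: ty and operand are cut-down hypothetical types, named types are assumed to be already expanded, and an out-of-range constant struct index is treated as undefined (the text above does not say).

```ocaml
(* Hypothetical cut-down LLVMlite types. *)
type ty =
  | I1
  | I8
  | I64
  | Ptr of ty
  | Struct of ty list
  | Array of int * ty

(* An index operand is either a compile-time constant or a local. *)
type operand = Const of int | Local of string

(* gepty t idxs: the type reached from a t* pointer by following idxs,
   or None where GEPTY is undefined (the instruction is malformed). *)
let gepty (t : ty) (idxs : operand list) : ty option =
  let rec walk t = function
    | [] -> Some t
    | idx :: rest -> (
        match (t, idx) with
        | Struct ts, Const i when i >= 0 && i < List.length ts ->
            walk (List.nth ts i) rest
        | Struct _, _ -> None (* a struct index must be a (valid) constant *)
        | Array (_, elt), _ -> walk elt rest (* array length is not checked *)
        | _, _ -> None (* not an aggregate, but indices remain *))
  in
  match idxs with
  | [] -> None (* empty list of index operands *)
  | _first :: rest -> walk t rest (* first index steps through the pointer *)
```

The result type of the instruction is then a pointer to the type returned here, matching GEPTY(...)* in the table.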
Blocks, CFGs, and Function Definitions
• A block (or "basic" block) is just a sequence of instructions followed by a terminator:

Concrete Syntax                         Operand → Result Types
ret void                                - → -
ret S OP                                S → -
br label %LAB                           - → -
br i1 OP, label %LAB1, label %LAB2      i1 → -

• The body of a function is represented by a control flow graph (CFG).
• A CFG consists of a distinguished entry block and a sequence of blocks, each prefixed with a label LAB: .
• A function definition has a return type, the function name, a list of formal parameters and their types, and the body of the function.
• The full syntax of a function definition:
define [S|void] @IDENT(S1 OP, ... , SN OP) { BLOCK (LAB: BLOCK)...}
• Like global data definitions, the type of the defined global identifier @IDENT is S(S1, ... , SN)* or void(S1, ... , SN)*, a function pointer.
LLVMlite Semantics
Memory Model
• Memory values are represented as tree structures where the leaves are simple values (or strings) and finitely-branching nodes represent arrays and structs.
• The memory state of the LLVMlite machine is represented by a mapping between block identifiers and memory values.
• We will refer to a top-level memory value that is not a subtree of another as a memory block.
At this point, an illustration might be helpful:
{ bid0 -> node      bid1 -> node      bid2 -> node }
         / | \               |                 |
        L  L  L             node               L
                            /  \
                           L   node
                             / | | \
                            L  L  L* L
This approach might seem more complicated than our memory representation in X86lite, but it has some advantages as a specification. If instead, like in X86lite, we represented memory as an array of bytes, we would be forced to make decisions about how large each value in the language is and the relative position of values in memory. Using an unordered set of trees for the language specification lets us define how operations on structured data work while leaving such details up to the compiler.
In order to manipulate the simple values at the leaves of our memory blocks, we need to specify a
The simple values include:
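One way to picture this memory model is as an OCaml datatype. The sketch below is an assumption-laden illustration, not the interpreter's actual definitions: the type names, the reduced set of simple values (sval), and the representation of a pointer as a block identifier plus a path of child indices are all hypothetical.

```ocaml
type bid = int                 (* block identifier (hypothetical representation) *)
type path = int list           (* child indices from the root of a memory block *)
type ptr = bid * path

(* A reduced, assumed set of simple values at the leaves. *)
type sval = I64 of int | Undef

(* Memory values: leaves hold simple values, nodes model arrays and structs. *)
type mval = Leaf of sval | Node of mval list

module M = Map.Make (Int)

(* The memory state: a mapping from block identifiers to memory blocks. *)
type memory = mval M.t

(* Follow a path into a memory value; None if it leaves the tree. *)
let rec subtree (m : mval) (p : path) : mval option =
  match (p, m) with
  | [], _ -> Some m
  | i :: rest, Node kids when i >= 0 -> (
      match List.nth_opt kids i with
      | Some kid -> subtree kid rest
      | None -> None)
  | _ :: _, _ -> None

(* Resolve a pointer in a memory state. *)
let lookup (mem : memory) ((b, p) : ptr) : mval option =
  match M.find_opt b mem with
  | Some block -> subtree block p
  | None -> None
```

Because the trees carry no byte sizes or offsets, nothing in this representation commits to a particular data layout.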
Interpreter
• interp_call takes
• the global identifier of a function in an LLVMlite program,
• a list of (simple) values to serve as arguments, and
• an initial memory state; and
• returns the memory state after the function call has completed and the return value.
Some Instructions (see implementation)
Each instruction/terminator is followed by its behavior.
%L = BOP i64 OP1, OP2
  Update locals(%L) with the result of the computation.
%L = alloca S
  Allocate a slot in the current stack frame and return a pointer to it. This involves adding a subtree of undef to the root node of the memory block representing the frame at the next available index.
%L = load S* OP
  OP must be a pointer or undef. Find the value referenced by the pointer in the current memory state. Update locals(%L) with the result. If OP is not a valid pointer, either because it evaluates to undef, no memory value is associated with its block identifier, or its path does not identify a valid subtree, then the operation raises an error and the machine crashes. If the pointer is valid, but the value in memory is not a simple value of type S, the operation raises an error and the machine crashes.
store S OP1, S* OP2
  Update the memory state by setting the target of OP2 to the value of OP1. If OP2 is not a valid pointer, or if the target of OP2 is not a simple value in memory of type S, the operation raises an error and the machine crashes.
%L = icmp CND S OP1, OP2
  Update locals(%L) to 1 if the condition holds and 0 otherwise.
%L = call S1 OP1(S2 OP2, ... ,SN OPN)
  Evaluate all of the operands and use them to recursively invoke the interpreter through interp_call with the current memory state. If OP1 does not evaluate to a function pointer that identifies a function with return type S1 and argument types S2, ... , SN, then the operation raises an error and the machine crashes. Update locals(%L) to the result of interp_call and continue with the returned memory state.
call void OP1(S2 OP2, ... ,SN OPN)
  The same as a non-void call, but no locals are updated with the returned value.
%L = getelementptr T1* OP1, i64 OP2, ... , i64 OPN
  Create a new pointer by adding the first index operand OP2 to the last index of the pointer value of OP1 and then concatenating the remaining indices onto the path. If the target of the resulting pointer is not a valid memory value compatible with the type of %L, then update locals(%L) with the undef value. Otherwise, update locals(%L) with the new pointer. See the following section for a more detailed explanation.
%L = bitcast T1* OP to T2*
  Update locals(%L) with the value of OP.
GEP Indexing
The semantics of GEP and exactly when the resulting pointer is valid is the most complicated part of LLVMlite. Here we walk through a slightly more complex example.
%t1 = type { A, B, C }
%t2 = type [ 2 x %t1 ]
@pn1 = global %t2 [ {a0, b0, c0}, {a1, b1, c1} ]
; Memory:
; { ... bid0 -> root ... }
;                |
;                n1
;              /    \
;            n2      n3
;          / | \   / | \
;        a0 b0 c0 a1 b1 c1
...
%pn2 = getelementptr %t2* @pn1, i32 0, i32 0 ; %t1* -> n2
%pb1 = getelementptr %t1* %pn2, i32 1, i32 1 ; B* -> b1
• Start with the pointer pn1 = (bid0, 0) pointing to n1.
• The first GEP instruction above will compute the pointer (bid0, 0, 0), by first adding 0 to the last index of pn1 and then concatenating the rest of the indices to the end of the path.
• The next GEP instruction will compute the pointer (bid0, 0, 1, 1), which points to b1.
• Why? In LLVMlite, indexing into a sibling (rather than a child) of a node using GEP with a non-zero first index is only legal if sibling nodes are allocated as part of an array. In our example, n1 was allocated as [ 2 x %t1 ], so this is the case.
• In addition to the restricted use of the first index, the resulting path must target a subtree. If, instead, we tried to create a pointer off the end of the array, the resulting pointer would not target a valid memory value, and the GEP instruction would produce undef.
• Check out effective_tag in the interpreter code.
• Some examples in llprograms: gep3.ll, gep5.ll, gep6.ll
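The path arithmetic in this walkthrough is small enough to write down directly. The following is a sketch only: gep_path is a hypothetical helper, pointers are assumed to be a block identifier paired with a list of indices, and none of the validity checks (array siblings, target must be a subtree) are performed.

```ocaml
(* A pointer as (block identifier, path); an assumed representation. *)
type ptr = int * int list

(* Add the first index to the last index of the path, then append the
   remaining indices. Returns None if the path or index list is empty. *)
let gep_path ((bid, path) : ptr) (idxs : int list) : ptr option =
  match (List.rev path, idxs) with
  | last :: rev_init, first :: rest ->
      Some (bid, List.rev_append rev_init ((last + first) :: rest))
  | _ -> None
```

Taking bid0 to be 0, applying gep_path to (0, [0]) with indices [0; 0] yields the path of n2, and applying it again with [1; 1] yields the path of b1, as in the bullets above.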
Compiling LLVMlite to X86
Compiling LLVMlite Types to X86
• raw i8 values are not allowed (they must be manipulated via i8*)
• Option 1:
– Map each %uid to an x86 register
– Efficient!
– Difficult to do effectively: many %uid values, only 16 registers
– We will see how to do this later in the semester
• Option 2:
– Map each %uid to a stack-allocated space
– Less efficient!
– Simple to implement
• Globals
– must use %rip relative addressing
• Calls
– Follow the x86-64 System V (AMD64) ABI calling conventions
– Should interoperate with C programs
• getelementptr
– trickiest part
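Option 2 for locals (one stack slot per %uid) can be sketched in a few lines. The layout below is an assumption made for illustration, not the HW3 scheme: every local gets an 8-byte slot addressed relative to %rbp, and stack_layout and operand_of are hypothetical names.

```ocaml
(* Assign each uid an 8-byte slot below %rbp: -8, -16, -24, ... *)
let stack_layout (uids : string list) : (string * int) list =
  List.mapi (fun i u -> (u, -8 * (i + 1))) uids

(* Render the x86 memory operand for a uid; raises Not_found if absent. *)
let operand_of (layout : (string * int) list) (u : string) : string =
  Printf.sprintf "%d(%%rbp)" (List.assoc u layout)
```

With such a layout every use of a %uid becomes a load from its slot and every definition a store to it, which is simple but costs memory traffic that register allocation would avoid.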
Tour of HW3
• Using main.native
Today: Lexing
Source Code (Character stream):
if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
if ( b == 0 ) { a = 1 ; }
Parsing
Abstract Syntax Tree:
If (Eq (b, 0), Assn (a, 1), None)
Analysis & Transformation
Intermediate code:
l1:
  %cnd = icmp eq i64 %b, 0
  br i1 %cnd, label %l2, label %l3
l2:
  store i64 1, i64* %a
  br label %l3
l3:
Backend
Assembly Code:
l1:
  cmpq $0, %rax
  je l2
  jmp l3
l2:
  …
First Step: Lexical Analysis
if ( b == 0 ) { a = 1 ; }
See handlex.ml
https://github.com/cs4212/week-05-lexing
Lexing By Hand
• Useful extensions:
– “foo” Strings, equivalent to 'f''o''o'
– R+ One or more repetitions of R, equivalent to RR*
– R? Zero or one occurrences of R, equivalent to (ε|R)
– ['a'-'z'] One of a or b or c or … z, equivalent to (a|b|…|z)
– [^'0'-'9'] Any character except 0 through 9
– R as x Name the string matched by R as x
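These extensions add convenience, not expressive power: each one can be defined in terms of the core operators. The sketch below illustrates this with a hypothetical regex datatype and a naive backtracking matcher; it is not how lexer generators work internally, since they compile regular expressions to automata.

```ocaml
(* Core regular expressions. *)
type regex =
  | Eps                      (* ε, the empty string *)
  | Chr of char              (* a single character *)
  | Seq of regex * regex     (* R1R2 *)
  | Alt of regex * regex     (* R1|R2 *)
  | Star of regex            (* R* *)

let chars (s : string) : char list = List.init (String.length s) (String.get s)

(* "foo" is 'f''o''o' *)
let str (s : string) : regex =
  List.fold_right (fun c acc -> Seq (Chr c, acc)) (chars s) Eps

(* R+ is RR*, and R? is (ε|R) *)
let plus r = Seq (r, Star r)
let opt r = Alt (Eps, r)

(* ['a'-'z'] is (a|b|...|z); requires lo <= hi *)
let range (lo : char) (hi : char) : regex =
  if lo > hi then invalid_arg "range";
  let rec go c =
    if c = Char.code hi then Chr hi else Alt (Chr (Char.chr c), go (c + 1))
  in
  go (Char.code lo)

(* Backtracking matcher in continuation-passing style: [k] receives the
   characters left over after [r] has matched a prefix of [cs]. *)
let rec accept (r : regex) (cs : char list) (k : char list -> bool) : bool =
  match r with
  | Eps -> k cs
  | Chr c -> ( match cs with x :: rest when x = c -> k rest | _ -> false)
  | Seq (a, b) -> accept a cs (fun rest -> accept b rest k)
  | Alt (a, b) -> accept a cs k || accept b cs k
  | Star a ->
      k cs
      || accept a cs (fun rest ->
             (* demand progress so that nullable bodies cannot loop *)
             List.length rest < List.length cs && accept r rest k)

let matches (r : regex) (s : string) : bool =
  accept r (chars s) (fun rest -> rest = [])
```

For instance, plus (range '0' '9') describes non-empty digit strings, the shape of an integer literal token.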
Example Regular Expressions
• Regular expressions alone are ambiguous; we need a rule to choose between the options above
• Most languages choose “longest match”
– So the 2nd option above will be picked
– Note that only the first option is “correct” for parsing purposes
• Conflicts: arise due to two tokens whose regular expressions have a shared prefix
– Ties broken by giving some matches higher priority
– Example: keywords have priority over identifiers
– Usually specified by the order in which the rules appear in the lex input file
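Both rules show up even in a small hand-written lexer. The sketch below is an illustration written for this note, not the course's handlex.ml; the token type and function names are hypothetical. It prefers "==" over "=" (longest match) and turns the word "if" into a keyword only when it is the whole matched word (keyword priority on a tie).

```ocaml
type token =
  | IF | LPAREN | RPAREN | LBRACE | RBRACE | SEMI
  | EQEQ | EQ
  | IDENT of string
  | INT of int

let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
let is_digit c = c >= '0' && c <= '9'

(* Index just past the longest run of characters satisfying [p] from [i]. *)
let rec span p s i =
  if i < String.length s && p s.[i] then span p s (i + 1) else i

let lex (s : string) : token list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | '(' -> go (i + 1) (LPAREN :: acc)
      | ')' -> go (i + 1) (RPAREN :: acc)
      | '{' -> go (i + 1) (LBRACE :: acc)
      | '}' -> go (i + 1) (RBRACE :: acc)
      | ';' -> go (i + 1) (SEMI :: acc)
      | '=' ->
          (* longest match: prefer "==" over "=" *)
          if i + 1 < n && s.[i + 1] = '=' then go (i + 2) (EQEQ :: acc)
          else go (i + 1) (EQ :: acc)
      | c when is_alpha c ->
          (* longest match: consume the whole word first *)
          let j = span (fun c -> is_alpha c || is_digit c) s i in
          let word = String.sub s i (j - i) in
          (* priority: the keyword rule beats the identifier rule on a tie *)
          let tok = if word = "if" then IF else IDENT word in
          go j (tok :: acc)
      | c when is_digit c ->
          let j = span is_digit s i in
          go j (INT (int_of_string (String.sub s i (j - i))) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %C" c)
  in
  go 0 []
```

Note that "iffy" lexes as a single identifier rather than the keyword if followed by fy, because the longer match wins before priorities are consulted.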
Lexer Generators
• Reads a list of regular expressions: R1,…,Rn , one per token.
• Each token has an attached “action” Ai
(just a piece of code to run when the regular expression is matched)
lexlex.mll