PLDI Week 05 Lexing

This document discusses the basics of LLVM and its intermediate representation. It introduces the LLVM compiler infrastructure and describes how code is represented in LLVM's low-level IR format including basic blocks, control flow graphs, and the storage model for local variables and memory allocation.

Uploaded by

Victor Zhao

CS4212: Compiler Design

Week 5: LLVMlite
Basics of Lexical Analysis

Ilya Sergey

ilyasergey.net/CS4212/
This Week

• Working with data types in LLVM

• LLVMLite Specification

• Overview of HW3

• Lexical Analysis (basics)


Announcements

• HW3: LLVMlite
– Will be available on Canvas and GitHub on Thursday.
– Due: Sunday, 1 October 2023 at 23:59:59
So, LLVM
Low-Level Virtual Machine (LLVM)

• Open-Source Compiler Infrastructure


– see llvm.org for full documentation
• Created by Chris Lattner (advised by Vikram Adve) at UIUC
– LLVM: An infrastructure for Multi-stage Optimization, 2002
– LLVM: A Compilation Framework for Lifelong Program Analysis and Transformation, 2004
• 2005: Adopted by Apple (llvm-gcc later shipped with Xcode 3.1)
• Front ends:
– llvm-gcc (drop-in replacement for gcc)
– Clang: C, Objective-C, C++ compiler supported by Apple
– various languages: Swift, Ada, Scala, Haskell, …
• Back ends:
– x86 / Arm / Power / etc.
LLVM Compiler Infrastructure
[Lattner et al.]

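[Figure: LLVM compiler infrastructure. Front ends like 'clang' produce typed SSA IR; analysis and optimisations/transformations operate on the IR; the llc backend performs code generation, with a JIT as an alternative.]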
Example LLVM Code
• LLVM offers a textual representation of its IR
– files ending in .ll

factorial64.c

#include <stdio.h>
#include <stdint.h>

int64_t factorial(int64_t n) {
  int64_t acc = 1;
  while (n > 0) {
    acc = acc * n;
    n = n - 1;
  }
  return acc;
}

factorial-pretty.ll

define @factorial(%n) {
  %1 = alloca
  %acc = alloca
  store %n, %1
  store 1, %acc
  br label %start

start:
  %3 = load %1
  %4 = icmp sgt %3, 0
  br %4, label %then, label %else

then:
  %6 = load %acc
  %7 = load %1
  %8 = mul %6, %7
  store %8, %acc
  %9 = load %1
  %10 = sub %9, 1
  store %10, %1
  br label %start

else:
  %12 = load %acc
  ret %12
}
Real LLVM
• Decorates values with type information: i64, i64*, i1 (boolean)
• Permits numeric identifiers
• Has alignment annotations (padding for some specified number of bytes)
• Keeps track of entry edges for each block: preds = %5, %0

factorial.ll

; Function Attrs: nounwind ssp
define i64 @factorial(i64 %n) #0 {
  %1 = alloca i64, align 8
  %acc = alloca i64, align 8
  store i64 %n, i64* %1, align 8
  store i64 1, i64* %acc, align 8
  br label %2

; <label>:2 ; preds = %5, %0
  %3 = load i64* %1, align 8
  %4 = icmp sgt i64 %3, 0
  br i1 %4, label %5, label %11

; <label>:5 ; preds = %2
  %6 = load i64* %acc, align 8
  %7 = load i64* %1, align 8
  %8 = mul nsw i64 %6, %7
  store i64 %8, i64* %acc, align 8
  %9 = load i64* %1, align 8
  %10 = sub nsw i64 %9, 1
  store i64 %10, i64* %1, align 8
  br label %2

; <label>:11 ; preds = %2
  %12 = load i64* %acc, align 8
  ret i64 %12
}
Example Control-flow Graph
[Figure: the body of @factorial drawn as a control-flow graph. Each basic block is a node and the edges follow the terminators: entry → start, start → then, start → else, then → start.]

define @factorial(%n) {
entry:
  %1 = alloca
  %acc = alloca
  store %n, %1
  store 1, %acc
  br label %start

start:
  %3 = load %1
  %4 = icmp sgt %3, 0
  br %4, label %then, label %else

then:
  %6 = load %acc
  %7 = load %1
  %8 = mul %6, %7
  store %8, %acc
  %9 = load %1
  %10 = sub %9, 1
  store %10, %1
  br label %start

else:
  %12 = load %acc
  ret %12
}
LL Basic Blocks and Control-Flow Graphs

• LLVM enforces (some of) the basic block invariants syntactically.


• Representation in OCaml:

type block = {
insns : (uid * insn) list;
term : (uid * terminator)
}

• A control flow graph is represented as a list of labeled basic blocks with these invariants:
– No two blocks have the same label
– All terminators mention only labels that are defined among the set of basic blocks
– There is a distinguished, unlabelled, entry block:

type cfg = block * (lbl * block) list


LL Storage Model: Locals
• Several kinds of storage:
– Local variables (or temporaries): %uid
– Global declarations (e.g. for string constants): @gid
– Abstract locations: references to (stack-allocated) storage created by the alloca instruction
– Heap-allocated structures created by external calls (e.g. to malloc)

• Local variables:
– Defined by the instructions of the form %uid = …
– Must satisfy the static single assignment (SSA) invariant
• Each %uid appears on the left-hand side of an assignment only once in the entire control flow graph.
– The value of a %uid remains unchanged throughout its lifetime
– Analogous to “let %uid = e in …” in OCaml

• Intended to be an abstract version of machine registers.

• Full SSA allows richer use of local variables by taking control flow into account
– phi functions (https://en.wikipedia.org/wiki/Static_single-assignment_form)
LL Storage Model: alloca

• The alloca instruction allocates stack space and returns a reference to it.
– The returned reference is stored in a local:
%ptr = alloca typ
– The amount of space allocated is determined by the type

• The contents of the slot are accessed via the load and store instructions:

%acc = alloca i64 ; allocate a storage slot


store i64 4212, i64* %acc ; store the integer value 4212
%x = load i64, i64* %acc ; load the value 4212 into %x

• Gives an abstract version of stack slots


Structured Data
Compiling Structured Data
• Consider C-style structures like those below.

• How do we represent Point and Rect values?

struct Point { int x; int y; };

struct Rect { struct Point ll, lr, ul, ur; };

struct Rect mk_square(struct Point ll, int len) {


struct Rect square;
square.ll = square.lr = square.ul = square.ur = ll;
square.lr.x += len;
square.ul.y += len;
square.ur.x += len;
square.ur.y += len;
return square;
}
Representing Structs
struct Point { int x; int y;};

• Store the data using two contiguous words of memory.


• Represent a Point value p as the address of the first word.

p → [ x | y ]

struct Rect { struct Point ll, lr, ul, ur; };


• Store the data using 8 contiguous words of memory.

square → [ ll.x | ll.y | lr.x | lr.y | ul.x | ul.y | ur.x | ur.y ]

• Compiler needs to know the size of the struct at compile time to allocate the needed storage space.
• Compiler needs to know the shape of the struct at compile time to index into the structure.
Assembly-level Member Access
square → [ ll.x | ll.y | lr.x | lr.y | ul.x | ul.y | ur.x | ur.y ]

struct Point { int32 x; int32 y; };

struct Rect { struct Point ll, lr, ul, ur; };

• Consider: ⟦square.ul.y⟧ = (x86.operand, x86.insns)

• Assume that %rcx holds the base address of square

• Calculate the offset relative to the base pointer of the data:


– ul = sizeof(struct Point) + sizeof(struct Point)
– y = sizeof(int)

• So: ⟦square.ul.y⟧ = (ans, Movq 20(%rcx) ans)
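The offset arithmetic above can be checked in C with offsetof. This is a minimal sketch, assuming 32-bit fields (int32_t from <stdint.h>) as on the slide; the helper names rect_ul_y_offset and read_ul_y are hypothetical and not part of the course code.

```c
#include <stddef.h>
#include <stdint.h>

struct Point { int32_t x; int32_t y; };
struct Rect  { struct Point ll, lr, ul, ur; };

/* Hypothetical helper: byte offset of square.ul.y from the base of square.
   Skip ll and lr to reach ul, then skip x to reach y. */
size_t rect_ul_y_offset(void) {
    return offsetof(struct Rect, ul) + offsetof(struct Point, y);
}

/* Hypothetical helper: read square.ul.y through raw address arithmetic,
   in the spirit of the 20(%rcx) operand in the generated code. */
int32_t read_ul_y(const struct Rect *square) {
    const char *base = (const char *)square;
    return *(const int32_t *)(base + rect_ul_y_offset());
}
```

With 4-byte fields the offset is 8 + 8 + 4 = 20, which matches the displacement used in the Movq instruction.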


Padding & Alignment
• How to lay out non-homogeneous structured data?
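struct Example {
  int x;
  char a;
  char b;
  int y;
};

[Figure: three layouts of the fields x, a, b, y drawn against 32-bit boundaries, contrasting a packing in which y is not 32-bit aligned with one that inserts padding so that y is aligned.]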
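One way to see what a particular C compiler does with this struct is to ask it for the field offsets. The sketch below assumes a C11 compiler (for _Alignof); the function names are hypothetical. The exact amount of padding is decided by the platform ABI, so the code reports it rather than assuming a value.

```c
#include <stddef.h>

struct Example {
    int  x;
    char a;
    char b;
    int  y;
};

/* Hypothetical helper: number of padding bytes between the end of b
   and the start of y, as chosen by this compiler. */
size_t padding_before_y(void) {
    size_t end_of_b = offsetof(struct Example, b) + sizeof(char);
    return offsetof(struct Example, y) - end_of_b;
}

/* Hypothetical helper: nonzero if y starts on a boundary suitable for an int. */
int y_is_aligned(void) {
    return offsetof(struct Example, y) % _Alignof(int) == 0;
}
```

On a typical ABI with a 4-byte, 4-byte-aligned int, x occupies bytes 0 to 3, a and b occupy bytes 4 and 5, and two bytes of padding push y to offset 8.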
Copy-in/Copy-out
When we do an assignment in C as in:

struct Rect mk_square(struct Point ll, int elen) {
  struct Rect res;
  res.lr = ll;
  ...

then we copy all of the elements out of the source and put them
in the target. Same as doing word-level operations:

struct Rect mk_square(struct Point ll, int elen) {
  struct Rect res;
  res.lr.x = ll.x;
  res.lr.y = ll.y;
...

• For really large copies, the compiler uses something like memcpy
(which is implemented using a loop in assembly).
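The three ways of copying described above can be put side by side. This is an illustrative sketch only; the function names are hypothetical, and a real compiler chooses between field-wise moves and a memcpy-style loop on its own.

```c
#include <string.h>

struct Point { int x; int y; };
struct Rect  { struct Point ll, lr, ul, ur; };

/* Struct assignment: the compiler copies every field of ll. */
void set_lr_assign(struct Rect *res, struct Point ll) {
    res->lr = ll;
}

/* The same effect spelled out as word-level copies. */
void set_lr_fields(struct Rect *res, struct Point ll) {
    res->lr.x = ll.x;
    res->lr.y = ll.y;
}

/* The same effect via memcpy, as a compiler might do for large structs. */
void set_lr_memcpy(struct Rect *res, struct Point ll) {
    memcpy(&res->lr, &ll, sizeof ll);
}
```

All three leave res->lr holding the same two values as ll.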
C Procedure Calls
• Similarly, when we call a procedure, we copy arguments in, and copy results out.
– Caller sets aside extra space in its frame to store results that are bigger than will fit in %rax.
– We do the same with scalar values such as integers or doubles.

• Sometimes, this is termed "call-by-value".


– This is bad terminology.
– Copy-in/copy-out is more accurate.

• Benefit: locality
• Problem: expensive for large records…

• In C: can opt to pass pointers to structs: “call-by-reference”

• Languages like Java and OCaml always pass non-word-sized objects by reference.
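The difference between copying a struct in and passing a pointer to it shows up directly in what the caller observes afterwards. A minimal sketch, with hypothetical function names:

```c
struct Point { int x; int y; };

/* Copy-in: p is a private copy, so the caller's struct is untouched.
   Copy-out: the returned struct is copied back to the caller. */
struct Point shifted(struct Point p, int dx) {
    p.x += dx;
    return p;
}

/* Passing a pointer avoids the copy; the callee mutates the caller's struct. */
void shift_in_place(struct Point *p, int dx) {
    p->x += dx;
}
```

Calling shifted(a, 10) leaves a unchanged and returns a new value, while shift_in_place(&a, 10) updates a itself.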
Representing Data Types
Working with Arrays
Arrays
void foo() {
  char buf[27];

  buf[0] = 'a';
  buf[1] = 'b';
  ...
  buf[25] = 'z';
  buf[26] = 0;
}

is the same as:

void foo() {
  char buf[27];

  *(buf) = 'a';
  *(buf+1) = 'b';
  ...
  *(buf+25) = 'z';
  *(buf+26) = 0;
}

• Space is allocated on the stack for buf.


– Note: without the ability to allocate stack space dynamically (C’s alloca function),
the compiler needs to know the size of buf at compile time…

• buf[i] is really just


(base_of_array) + i * elt_size
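The address formula can be written out by hand in C. This sketch uses hypothetical helper names and assumes an ASCII-compatible character set so that the letters 'a' to 'z' are contiguous.

```c
#include <stddef.h>

/* Hypothetical helper: address of element i, computed as base + i * elt_size. */
char *elt_addr(void *base, size_t i, size_t elt_size) {
    return (char *)base + i * elt_size;
}

/* Fill buf with 'a'..'z' and a terminating 0 using only address arithmetic.
   Assumes an ASCII-compatible character set. */
void fill_alphabet(char buf[27]) {
    for (size_t i = 0; i < 26; i++)
        *elt_addr(buf, i, sizeof(char)) = (char)('a' + i);
    *elt_addr(buf, 26, sizeof(char)) = 0;
}
```

For an array of a wider element type, the same helper with elt_size = sizeof(long) yields the address that &xs[i] denotes.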
Multi-Dimensional Arrays

• In C, int M[4][3] yields an array with 4 rows and 3 columns.


• Laid out in row-major order:
M[0][0] M[0][1] M[0][2] M[1][0] M[1][1] M[1][2] M[2][0] …

• In Fortran, arrays are laid out in column major order.

M[0][0] M[1][0] M[2][0] M[3][0] M[0][1] M[1][1] M[2][1] …

• In ML and Java, there are no multi-dimensional arrays:


– (int array) array is represented as an array of pointers to arrays of ints.

• Why is knowing these memory layout strategies important?
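The two layouts differ only in how a pair (i, j) is turned into a linear index. The sketch below, with hypothetical names, writes both index functions for the 4 x 3 case and checks that C places M[i][j] at the row-major position.

```c
#include <stddef.h>

enum { ROWS = 4, COLS = 3 };

/* Row-major: element (i, j) lives at linear index i * COLS + j. */
size_t row_major_index(size_t i, size_t j) {
    return i * COLS + j;
}

/* Column-major (Fortran-style): element (i, j) lives at j * ROWS + i. */
size_t col_major_index(size_t i, size_t j) {
    return j * ROWS + i;
}

/* Nonzero if M[i][j] sits at the row-major byte offset from the start of M. */
int c_is_row_major(int M[ROWS][COLS], size_t i, size_t j) {
    size_t byte_offset = (size_t)((char *)&M[i][j] - (char *)M);
    return byte_offset == row_major_index(i, j) * sizeof(int);
}
```

One reason the layout matters: a loop that walks the array in the same order as the layout touches consecutive addresses, which is friendlier to caches than striding across rows.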


Array Bounds Checks
• Safe languages (e.g. Java, C#, ML but not C, C++) check array indices to
ensure that they’re in bounds.
– Compiler generates code to test that the computed offset is legal

• Needs to know the size of the array… where to store it?


– One answer: Store the size before the array contents.

[ Size=7 | A[0] | A[1] | A[2] | A[3] | A[4] | A[5] | A[6] ]   (arr points at A[0])

• Other possibilities:
– Pascal: only permit statically known array sizes (very unwieldy in practice)
– What about multi-dimensional arrays?
Array Bounds Checks (Implementation)
• Example: Assume %rax holds the base pointer (arr) and %rcx holds the array index i.
To read a value from the array arr[i]:

movq -8(%rax) %rdx // load size into rdx


cmpq %rdx %rcx // compare index to bound
j l __ok // jump if 0 <= i < size
callq __err_oob // test failed, call the error handler
__ok:
movq (%rax, %rcx, 8) dest // do the load from the array access

• Clearly more expensive: adds move, comparison & jump


– More memory traffic
– Hardware can improve performance: executing instructions in parallel, branch prediction
• These overheads are particularly bad in an inner loop
• Compiler optimisations can help remove the overhead
– e.g. In a for loop, if bound on index is known, only do the test once
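The same scheme can be sketched in C: the size lives one word before the contents, and every read goes through a check. All names here are hypothetical. Unlike the assembly above, this sketch uses a single unsigned comparison, which is one common way to reject both negative indices and indices past the end; it assumes the stored size is non-negative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constructor: allocate size elements plus one header word.
   The returned pointer points at element 0, just past the header. */
int64_t *arr_new(int64_t size) {
    int64_t *block = calloc((size_t)size + 1, sizeof(int64_t));
    if (block == NULL) return NULL;
    block[0] = size;      /* header word holding the size */
    return block + 1;
}

/* Checked read: returns 1 and writes *out on success, 0 if i is out of bounds.
   The unsigned comparison covers both i < 0 and i >= size. */
int arr_get(const int64_t *arr, int64_t i, int64_t *out) {
    int64_t size = arr[-1];                 /* load the size from the header */
    if ((uint64_t)i >= (uint64_t)size) return 0;
    *out = arr[i];
    return 1;
}

void arr_free(int64_t *arr) {
    if (arr != NULL) free(arr - 1);         /* free from the start of the block */
}
```

A caller would test the return value of arr_get where the assembly version calls the error handler.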
C-style Strings
• A string constant "foo" is represented as global data:
_string42: 102 111 111 0

• C uses null-terminated strings


• Strings are usually placed in the text segment so they are read only.
– allows all copies of the same string to be shared.

• Rookie mistake (in C): write to a string constant.

char *p = "foo";
p[0] = 'b';

Attempting to modify the string literal is undefined behaviour.


• Instead, must allocate space on the heap:
char *p = (char *)malloc(4 * sizeof(char));
strncpy(p, "foo", 4); /* include the null byte */
p[0] = 'b';
Tagged Datatypes
C-style Enumerations / ML-style datatypes

• In C: enum Day {sun, mon, tue, wed, thu, fri, sat} today;

• In OCaml: type day = Sun | Mon | Tue | Wed | Thu | Fri | Sat

• Associate an integer tag with each case: sun = 0, mon = 1, …


– C lets programmers choose the tags

• OCaml datatypes can also carry data: type foo = Bar of int | Baz of int * foo

• Representation: a foo value is a pointer to a pair: (tag, data)


• Example: tag(Bar) = 0, tag(Baz) = 1
⟦let f = Bar(3)⟧ = f → [ 0 | 3 ]

⟦let g = Baz(4, f)⟧ = g → [ 1 | 4 | f ]
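In C, the (tag, data) representation is usually written as a struct holding a tag and a union of payloads. The following is a sketch of that idea for the foo datatype; the type and function names are hypothetical, and a real OCaml runtime lays values out differently.

```c
#include <stdlib.h>

/* type foo = Bar of int | Baz of int * foo */
enum foo_tag { TAG_BAR = 0, TAG_BAZ = 1 };

struct foo {
    enum foo_tag tag;
    union {
        int bar;                                  /* payload of Bar */
        struct { int n; struct foo *rest; } baz;  /* payload of Baz */
    } data;
};

struct foo *mk_bar(int n) {
    struct foo *f = malloc(sizeof *f);
    if (f == NULL) return NULL;
    f->tag = TAG_BAR;
    f->data.bar = n;
    return f;
}

struct foo *mk_baz(int n, struct foo *rest) {
    struct foo *g = malloc(sizeof *g);
    if (g == NULL) return NULL;
    g->tag = TAG_BAZ;
    g->data.baz.n = n;
    g->data.baz.rest = rest;
    return g;
}

/* Sum of all ints stored in the value, dispatching on the tag. */
int foo_sum(const struct foo *v) {
    switch (v->tag) {
    case TAG_BAR: return v->data.bar;
    case TAG_BAZ: return v->data.baz.n + foo_sum(v->data.baz.rest);
    }
    return 0; /* not reached for well-formed values */
}
```

Building f = mk_bar(3) and g = mk_baz(4, f) mirrors the two boxes above: g holds tag 1, the integer 4, and a pointer to f.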


Switch Compilation

• Consider the C statement:

switch (e) {
case sun: s1; break;
case mon: s2; break;

case sat: s3; break;
}

• How to compile this?


– What happens if some of the break statements are omitted?
(Control falls through to the next branch.)
Cascading ifs and Jumps
⟦switch(e) {case tag1: s1; case tag2: s2; …}⟧ =

%tag = ⟦e⟧;
br label %l1

l1: %cmp1 = icmp eq %tag, $tag1
br %cmp1 label %b1, label %l2
b1: ⟦s1⟧
br label %l2

l2: %cmp2 = icmp eq %tag, $tag2
br %cmp2 label %b2, label %l3
b2: ⟦s2⟧
br label %l3

…

lN: %cmpN = icmp eq %tag, $tagN
br %cmpN label %bN, label %merge
bN: ⟦sN⟧
br label %merge

merge:

• Each $tag1…$tagN is just a constant int tag value.
• Note: ⟦break;⟧ (within the switch branches) is: br label %merge
Alternatives for Switch Compilation
• Nested if-then-else works OK in practice if # of branches is small
– (e.g. < 16 or so).

• For more branches, use better data structures to organise the jumps:
– Create a table of pairs (v1, branch_label) and loop through
– Or, do binary search rather than linear search
– Or, use a hash table rather than binary search

• One common case: the tags are dense in some range


[min…max]
– Let N = max – min + 1
– Create a branch table Branches[N] where Branches[i] = branch_label for tag min + i.
– Compute tag = ⟦e⟧ and then do an indirect jump: J Branches[tag – min]
• Common to use heuristics to combine these techniques.
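Standard C has no first-class labels, so the sketch below approximates a branch table with an array of function pointers: each entry stands in for a branch label, and the call through the table stands in for the indirect jump. All names are hypothetical. The table has max - min + 1 entries and is indexed by tag - min after a range check.

```c
#include <stddef.h>

enum Day { sun, mon, tue, wed, thu, fri, sat };

/* Stand-ins for the branch bodies s1, s2, ... */
static int on_weekend(void) { return 0; }
static int on_weekday(void) { return 1; }
static int on_default(void) { return -1; }

typedef int (*branch_fn)(void);

/* Branches[i] is the target for tag min + i; the tags sun..sat are dense. */
static const branch_fn Branches[] = {
    on_weekend, on_weekday, on_weekday, on_weekday,
    on_weekday, on_weekday, on_weekend
};

int dispatch(int tag) {
    const int min = sun, max = sat;
    if (tag < min || tag > max)       /* range check guards the table */
        return on_default();
    return Branches[tag - min]();     /* indirect jump through the table */
}
```

The cost of dispatch is one range check and one indexed load regardless of how many cases there are, which is the reason compilers prefer this form for dense tags.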
ML-style Pattern Matching

• ML-style match statements are like C’s switch statements except:
– Patterns can bind variables
– Patterns can nest

match e with
| Bar(z) -> e1
| Baz(y, Bar(w)) -> e2
| _ -> e3

• Compilation strategy:
– “Flatten” nested patterns into matches against one constructor at a time.
– Compile the match against the tags of the datatype as for C-style switches.
– Code for each branch additionally must copy data from ⟦e⟧ to the variables bound in the patterns.

match e with
| Bar(z) -> e1
| Baz(y, tmp) ->
  (match tmp with
   | Bar(w) -> e2
   | Baz(_, _) -> e3)

• There are many opportunities for optimisations, many papers about “pattern-match compilation”
– Many of these transformations can be done at the AST level
Good place for a break
Datatypes in LLVM IR
Structured Data in LLVM
• LLVM’s IR uses types to describe the structure of data.

t ::=
void
i1 | i8 | i64 N-bit integers
[<#elts> x t] arrays
fty function types
{t1, t2, … , tn} structures
t* pointers
%Tident named (identified) type

fty ::= Function Types


t (t1, .., tn) return, argument types

• <#elts> is an integer constant >= 0


• Structure types can be named at the top level:

%T1 = type {t1, t2, … , tn}


• Such structure types can be recursive
Example LL Types
• A static array of 4212 integers: [ 4212 x i64 ]

• A two-dimensional array of integers: [ 3 x [ 4 x i64 ] ]

• Structure for representing dynamically-allocated arrays with their length:


{ i64 , [0 x i64] }
– There is no array-bounds check; the static type information is only used for calculating pointer offsets.

• C-style linked lists (declared at the top level):


%Node = type { i64, %Node*}

• Structs from the C program shown earlier:


%Rect = type { %Point, %Point, %Point, %Point }
%Point = type { i64, i64 }
getelementptr
• LLVM provides the getelementptr instruction to compute pointer values
– Given a pointer and a “path” through the structured data pointed to by that pointer,
getelementptr computes an address
– This is the abstract analog of the X86 LEA (load effective address). It does not access memory.
– It is a “type indexed” operation, since the size computations depend on the type

insn ::= …
| getelementptr t* %val, t1 idx1, t2 idx2 ,…

• Example: access the x component of the first point of a rectangle:

%tmp1 = getelementptr %Rect* %square, i32 0, i32 0


%tmp2 = getelementptr %Point* %tmp1, i32 0, i32 0

• The first i32 0 is a “step through” the pointer (e.g., %square) with offset 0.

See “Why is the extra 0 index required?”: https://llvm.org/docs/GetElementPtr.html#why-is-the-extra-0-index-required


GEP Example*
struct RT {
  int A;
  int B[10][20];
  int C;
};
struct ST {
  struct RT X;
  int Y;
  struct RT Z;
};
int *foo(struct ST *s) {
  return &s[1].Z.B[5][13];
}

%RT = type { i32, [10 x [20 x i32]], i32 }
%ST = type { %RT, i32, %RT }
define i32* @foo(%ST* %s) {
entry:
  %arrayidx = getelementptr %ST* %s, i32 1, i32 2, i32 1, i32 5, i32 13
  ret i32* %arrayidx
}

1. %s is a pointer to an (array of) %ST structs, suppose the pointer value is ADDR
2. Compute the index of the 1st element by adding size_ty(%ST).
3. Compute the index of the Z field by adding size_ty(%RT) + size_ty(i32) to skip past X and Y.
4. Compute the index of the B field by adding size_ty(i32) to skip past A.
5. Index into the 2d array.

Final answer: ADDR + size_ty(%ST) + size_ty(%RT) + size_ty(i32)
              + size_ty(i32) + 5*20*size_ty(i32) + 13*size_ty(i32)

*adapted from the LLVM documentation: see http://llvm.org/docs/LangRef.html#getelementptr-instruction
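The five steps can be replayed in C by summing sizes and field offsets and comparing the total against the address the compiler itself produces for &s[1].Z.B[5][13]. This is a sketch; gep_offset is a hypothetical name, and offsetof is used in place of the size_ty sums so that any padding chosen by the compiler is accounted for.

```c
#include <stddef.h>

struct RT { int A; int B[10][20]; int C; };
struct ST { struct RT X; int Y; struct RT Z; };

/* What the C compiler computes for &s[1].Z.B[5][13]. */
int *foo(struct ST *s) {
    return &s[1].Z.B[5][13];
}

/* Hypothetical helper: the same byte offset built up step by step,
   mirroring the GEP indices i32 1, i32 2, i32 1, i32 5, i32 13. */
size_t gep_offset(void) {
    return 1 * sizeof(struct ST)       /* step to s[1]            */
         + offsetof(struct ST, Z)      /* field 2 of %ST          */
         + offsetof(struct RT, B)      /* field 1 of %RT          */
         + 5 * sizeof(int[20])         /* row 5 of B              */
         + 13 * sizeof(int);           /* column 13 of that row   */
}
```

Because every field here is an int or an array of int, offsetof(struct ST, Z) works out to sizeof(struct RT) + sizeof(int) and offsetof(struct RT, B) to sizeof(int), which are the terms in the final answer above.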
getelementptr

• GEP never dereferences the address it’s calculating:


– GEP only produces pointers by doing arithmetic
– It doesn’t actually traverse the links of a data structure

• To index into a deeply nested structure, one has to “follow the pointer” by
loading from the computed pointer
Compiling Data Structures via LLVM

1. Translate high level language types into an LLVM representation type.


– For some languages (e.g. C) this process is straightforward
• The translation simply uses platform-specific alignment and padding
– For other languages (e.g. OO languages) there might be a fairly complex
elaboration.
• e.g. for OCaml, array types might be translated to pointers to length-indexed structs.
⟦int array⟧ = {i32, [0 x i32]}*

2. Translate accesses of the data into getelementptr operations:


– e.g. for OCaml array size access:
⟦length a⟧ =
%1 = getelementptr {i32, [0 x i32]}* %a, i32 0, i32 0
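A C counterpart of the {i32, [0 x i32]} representation is a struct with a length field followed by a flexible array member. The sketch below uses hypothetical names; it also makes explicit that the getelementptr only computes the address of the length field, so reading the length takes a load from that address.

```c
#include <stdint.h>
#include <stdlib.h>

/* Counterpart of {i32, [0 x i32]}: a length followed by the elements.
   The flexible array member plays the role of [0 x i32]. */
struct int_array {
    int32_t len;
    int32_t elts[];
};

/* Hypothetical constructor: zero-initialised array of len elements. */
struct int_array *int_array_make(int32_t len) {
    struct int_array *a =
        calloc(1, sizeof *a + (size_t)len * sizeof(int32_t));
    if (a != NULL) a->len = len;
    return a;
}

/* length a: take the address of field 0, then load from it. */
int32_t int_array_length(const struct int_array *a) {
    const int32_t *len_ptr = &a->len;  /* address computation only */
    return *len_ptr;                   /* the actual memory access */
}
```

As with the LLVM type, nothing in the struct type itself records how many elements follow; only the len field does.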
Type Casting
• What if the LLVM IR’s type system isn’t expressive enough?
– e.g. if the source language has subtyping, perhaps due to inheritance
– e.g. if the source language has polymorphic/generic types

• LLVM IR provides a bitcast instruction


– This is a form of (potentially) unsafe cast. Misuse can cause serious bugs
(segmentation faults, or silent memory corruption)

%rect2 = type { i64, i64 } ; two-field record


%rect3 = type { i64, i64, i64 } ; three-field record

define @foo() {
%1 = alloca %rect3 ; allocate a three-field record
%2 = bitcast %rect3* %1 to %rect2* ; safe cast
%3 = getelementptr %rect2* %2, i32 0, i32 1 ; allowed

}
LLVMlite Specification

https://ilyasergey.net/CS4212/hw03-llvmlite-spec.html
LLVMlite features

• A C-like “weak type system” to statically rule out some malformed programs.
• A variety of different kinds of integer values, pointers, function pointers, and
structured data including strings, arrays, and structs.
• Top-level mutually-recursive function definitions and function calls as primitives.
• An infinite number of “locals” (also known as “pseudo-registers”, “SSA variables”, or
“temporaries”) to hold intermediate results of computations.
• An abstract memory model that doesn't constrain the layout of data in memory.
• Dynamically allocated memory associated with a function invocation (in C, the stack).
• Static and dynamically (heap) allocated structured data.
• A control-flow graph representation of function bodies.
Syntax
Example

define i64 @fac(i64 %n) { ; (1) function definition, argument prefixed with %
%1 = icmp sle i64 %n, 0 ; (2) signed comparison, result assigned to %1
br i1 %1, label %ret, label %rec ; (3) “terminator”, marks the end of the block
ret: ; (4) label, indicates the beginning of the new block
ret i64 1 ; return the result (1)
rec: ; (5) another block
%2 = sub i64 %n, 1 ; (6) subtract 1 from %n, name result %2
%3 = call i64 @fac(i64 %2) ; (7) call function @fac, assign the result for %3
%4 = mul i64 %n, %3
ret i64 %4 ; (8) return result
}

define i64 @main() { ; (9) call @fac with the argument 6


%1 = call i64 @fac(i64 6)
ret i64 %1
}

In the function definition at (1), the i64 annotations declare the return type and the …

LLVMlite types

We use T to range over simple and aggregate (non-void, non-function) types, F to range over function types, and S to range over simple types.

Concrete Syntax      Kind        Description
void                 void        Indicates the instruction does not return a usable value.
i1, i64              simple      1-bit (boolean) and 64-bit integer values.
T*                   simple      Pointer that can be dereferenced if its target is compatible with T.
i8*                  simple      Pointer to the first character in a null-terminated array of bytes.
                                 Note: i8* is a valid type, but just i8 is not. LLVMlite programs
                                 do not operate over byte-sized integer values.
F*                   simple      Function pointer.
S(S1, ..., SN)       function    A function from S1, ..., SN to S.
void(S1, ..., SN)    function    A function from S1, ..., SN to void.
{ T1, ..., TN }      aggregate   Tuple of values of types T1, ..., TN.
[ N x T ]            aggregate   Exactly N values of type T.
%NAME                *           Abbreviation defined by a top-level named type definition.
Named Types
• Simple types appear on the stack and as arguments to functions
• Aggregate types may only appear in global and heap-allocated data
• One can define abbreviations for types:

%IDENT = type T

Named type definitions define abbreviations for types in the scope of the entire compilation unit. The following specification assumes that these are replaced with their definitions.
Global Definitions

The next kind of top-level definition is global data:

@IDENT = global T G

where G ranges over global initializers, described in the following table, and T is the associated type. The global identifier @IDENT, when used in the program, has type T*. For example, the following program fragment has valid annotations:

@foo = global i64 42
@bar = global i64* @foo
@baz = global i64** @bar

Concrete Syntax            Type          Description
null                       T*            The null pointer constant.
[0-9]+                     i64           64-bit integer literal.
@IDENT                     T*            Global identifier. The type is always a T* pointer of the
                                         type associated with the global definition.
c"[A-z]*\00"               [ N x i8 ]    String literal. The size of the array N should be the length
                                         of the string in bytes, including the null terminator \00.
[ T G1, ..., T GN ]        [ N x T ]     Array literal.
{ T1 G1, ..., TN GN }      {T1,...,TN}   Struct literal.
bitcast (T1* G1 to T2*)    T2*           Bitcast.

Operands

We now turn to the parts of a function declaration. Each instruction in a function has zero or more operands which, for the purposes of determining the well-formedness of programs, are restricted to the following types.

Concrete Syntax   Type   Description
null              T*     The null pointer constant.
[0-9]+            i64    64-bit integer literal.
@IDENT            T*     Global identifier. The type can always be determined from the
                         global definitions and is always a pointer.
%IDENT            S      Local identifier: can only name values of simple type. The type is
                         determined by a local definition of %IDENT in scope.
Instructions and Terminators

The following table describes the restrictions on the types that may appear as parameters of well-formed instructions, and the constraints on the operands and result of the instruction for the purposes of type-checking. We assume that named types have been replaced by their definitions.

For example, in the call instruction, each type parameter S1, ..., SN must be a simple type. When we type check a program containing this instruction, we must make sure that the operands OP2, ..., OPN have types S2, ..., SN.

Types of instructions

Concrete Syntax                                       Operand → Result Types
%L = BOP i64 OP1, OP2                                 i64 x i64 → i64
%L = alloca S                                         - → S*
%L = load S* OP                                       S* → S
store S OP1, S* OP2                                   S x S* → void
%L = icmp CND S OP1, OP2                              S x S → i1
%L = call S1 OP1(S2 OP2, ..., SN OPN)                 S1(S2, ..., SN)* x S2 x ... x SN → S1
call void OP1(S2 OP2, ..., SN OPN)                    void(S2, ..., SN)* x S2 x ... x SN → void
%L = getelementptr T1* OP1, i32 OP2, ..., i32 OPN     T1* x i64 x ... x i64 → GEPTY(T1, OP1, ..., OPN)*
%L = bitcast T1* OP to T2*                            T1* → T2*

Getelementptr Well-Formedness and Result Type

• Let’s discuss the meaning of these types

The getelementptr instruction has some additional well-formedness requirements (see the specification). Operands after the first must all be constants, unless they are used to index into an array. LLVM actually requires the operands used to index into structs to be 32-bit integers. Rather than introducing 32-bit integers into our language, we will use our 64-bit constants and operands and assume the arguments of getelementptr always fall in the range [0, Int32.max_int].

GEP Type

In the table above, the result type of a getelementptr instruction is described using the GEPTY function, which is defined in pseudocode as follows:

GEPTY : T -> operand list -> T
GEPTY T operand::path' = GEPTY' T path'

GEPTY' : T -> operand list -> T
GEPTY' T [] = T
GEPTY' { T1, ..., TN } (Const m)::path' = GEPTY' Tm path' when m <= N
GEPTY' [ _ x T ] operand::path' = GEPTY' T path'

• GEPTY is a partial function.
• When GEPTY is not defined, the corresponding instruction is malformed.
• This happens when, for example:
– The list of index operands provided is empty
– An operand used to index a struct is not a constant
– The type is not an aggregate and the list of indices is not empty
• Also notice that a GEP instruction that indexes beyond the size of an array is well-formed. The length information on array tags is only present to help the compiler lay out data in memory and is not verified statically.
Notes on GEP

• Real LLVM requires that constants appearing in getelementptr be declared with type i32:

%struct = type { i64, [5 x i64], i64}

@gbl = global %struct {i64 1,


[5 x i64] [i64 2, i64 3, i64 4, i64 5, i64 6], i64 7}

define void @foo() {


%1 = getelementptr %struct* @gbl, i32 0, i32 0

}

• LLVMlite ignores the i32 annotation and treats these as i64 values
– we keep the i32 annotation in the syntax to retain compatibility with the clang compiler
– we assume the arguments of getelementptr always fall in the range [0, Int32.max_int].
Blocks, CFGs, and Function Definitions

• A block (or "basic" block) is just a sequence of instructions followed by a terminator:

Concrete Syntax                         Operand → Result Types
ret void                                - → -
ret S OP                                S → -
br label %LAB                           - → -
br i1 OP, label %LAB1, label %LAB2      i1 → -

• The body of a function is represented by a control flow graph (CFG).
• A CFG consists of a distinguished entry block and a sequence of blocks, each prefixed with a label LAB:.
• A function definition has a return type, the function name, a list of formal parameters and their types, and the body of the function.
• The full syntax of a function definition:

define [S|void] @IDENT(S1 OP, ... , SN OP) { BLOCK (LAB: BLOCK)...}

• Like global data definitions, the type of the defined global identifier @IDENT is S(S1, ... , SN)* or void(S1, ... , SN)*, a function pointer.
Semantics
LLVMlite Semantics

• Like for X86lite, we define the semantics of LLVMlite by describing the


execution of an abstract machine.
• LLVMlite machine explicitly differentiates between stack, heap, code, and
global memory (X86 was treating all of those uniformly).
• A definitional (reference) interpreter for LLVMlite is provided in HW3:
check llinterp.ml
• If you have a question about a detail of the semantics, you can simply run a
program through the interpreter!
registers.

Memory Model

Memory values will be represented as tree structures where the leaves are simple values (or strings) and finitely-branching nodes represent arrays and structs. The memory state of the LLVMlite machine is represented by a mapping between block identifiers and memory values. We will refer to a top-level memory value that is not a subtree of another as a memory block.

At this point, an illustration might be helpful:

{ bid0 -> node      bid1 -> node      bid2 -> node }
         / | \               |                 |
        L  L  L             node               L
                           /    \
                          L      node
                               / |  | \
                              L  L  L* L

The above diagram shows three memory values mapped to the block identifiers bid0, bid1, and bid2. One thing to notice is that every memory value contains at least one node. Even simple values, such as a single global i64, will be represented using a node with one leaf. The identifier bid2 is an example of how non-aggregate data will be represented in memory, while bid1 might be a structure having two fields. There's no deep reason for this; it's just a convenient invariant to represent the particular way LLVM computes pointers into structs.

In order to manipulate the simple values at the leaves of our memory blocks, we need to specify a path to a leaf. For example, to uniquely identify the leaf marked * we might provide the indices 0, 1, 2 along with the block identifier bid1. This means that we're selecting the 2nd child of the 1st child of the 0th child of the root node.

This approach might seem more complicated than our memory representation in X86lite, but it has some advantages as a specification. If instead, like in X86lite, we represented memory as an array of bytes, we would be forced to make decisions about how large each value in the language is and the relative position of values in memory. Using an unordered set of trees for the language specification lets us define how operations on structured data work while leaving such details up to the compiler.

• The simple values include:
– 1-bit (boolean) and 64-bit 2's complement signed integers
– Pointers to a subtree of a particular memory block, containing a block identifier and path
– A special undef value that represents an unusable value

So, a real piece of LLVMlite memory might look like:

{ bid0 -> node      bid1 -> node      bid2 -> node }
           |                  \                |
          node                node           undef
        /  |  \             /  |  \
  "foo" "bar" "baz" (Ptr Null) 4 (Ptr bid0, 0, 1)
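The tree-shaped memory described above can be sketched in OCaml roughly as follows. This is a minimal illustration, not the HW3 interpreter's actual definitions: the names sval, mval, subtree and deref are hypothetical, block identifiers are assumed to be strings, and string leaves are left out for brevity.

```ocaml
(* Hypothetical names throughout; a sketch of the tree memory model only. *)
type bid = string                 (* block identifier (representation assumed) *)
type path = int list              (* child indices, read from the root downwards *)

type sval =
  | VUndef                        (* the unusable value *)
  | VInt of int64                 (* i1 and i64 values *)
  | VPtr of bid * path            (* pointer to a subtree of a memory block *)

type mval =
  | Leaf of sval
  | Node of mval list             (* arrays and structs *)

(* The memory state maps block identifiers to memory blocks. *)
type memory = (bid * mval) list

(* Follow a path from a subtree; None if the path leaves the tree. *)
let rec subtree (m : mval) (p : path) : mval option =
  match m, p with
  | _, [] -> Some m
  | Leaf _, _ :: _ -> None
  | Node kids, i :: rest ->
    if i < 0 then None
    else (match List.nth_opt kids i with
        | Some kid -> subtree kid rest
        | None -> None)

(* Resolve a pointer (block identifier plus path) in a memory state. *)
let deref (mem : memory) ((b, p) : bid * path) : mval option =
  match List.assoc_opt b mem with
  | Some block -> subtree block p
  | None -> None

(* The bid1 block from the first diagram; the leaf marked * holds 22 here. *)
let example_mem : memory =
  [ ("bid1",
     Node [ Node [ Leaf (VInt 10L);
                   Node [ Leaf (VInt 20L); Leaf (VInt 21L);
                          Leaf (VInt 22L); Leaf (VInt 23L) ] ] ]) ]
```

With these definitions, deref example_mem ("bid1", [0; 1; 2]) selects the 2nd child of the 1st child of the 0th child of the root, which is the leaf marked * in the diagram.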
Interpreter
• interp_call takes
– the global identifier of a function in an LLVMlite program,
– a list of (simple) values to serve as arguments, and
– an initial memory state;
and returns the memory state after the function call has completed and the return value.

• interp_cfg does most of the work. It takes
– a control-flow graph,
– an initial locals map, and
– a memory state;
and evaluates the cfg, returning the new memory state and the return value of the function body.
Some Instructions
(see implementation)

Instructions are executed as follows (each instruction or terminator is followed by its behavior):

%L = BOP i64 OP1, OP2
    Update locals(%L) with the result of the computation.

%L = alloca S
    Allocate a slot in the current stack frame and return a pointer to it. This involves adding a subtree of undef to the root node of the memory block representing the frame at the next available index.

%L = load S* OP
    OP must be a pointer or undef. Find the value referenced by the pointer in the current memory state. Update locals(%L) with the result. If OP is not a valid pointer, either because it evaluates to undef, no memory value is associated with its block identifier, or its path does not identify a valid subtree, then the operation raises an error and the machine crashes. If the pointer is valid, but the value in memory is not a simple value of type S, the operation raises an error and the machine crashes.

store S OP1, S* OP2
    Update the memory state by setting the target of OP2 to the value of OP1. If OP2 is not a valid pointer, or if the target of OP2 is not a simple value in memory of type S, the operation raises an error and the machine crashes.

%L = icmp CND S OP1, OP2
    Update locals(%L) to 1 if the condition holds and 0 otherwise.

%L = call S1 OP1(S2 OP2, ... , SN OPN)
    Evaluate all of the operands and use them to recursively invoke the interpreter through interp_call with the current memory state. If OP1 does not evaluate to a function pointer that identifies a function with return type S1 and argument types S2, ... , SN, then the operation raises an error and the machine crashes. Update locals(%L) to the result of interp_call and continue with the returned memory state.

call void OP1(S2 OP2, ... , SN OPN)
    The same as a non-void call, but no locals are updated with the returned value.

%L = getelementptr T1* OP1, i64 OP2, ... , i64 OPN
    Create a new pointer by adding the first index operand OP2 to the last index of the pointer value of OP1 and then concatenating the remaining indices onto the path. If the target of the resulting pointer is not a valid memory value compatible with the type of %L, then update locals(%L) with the undef value. Otherwise, update locals(%L) with the new pointer. See the following section for a more detailed explanation.

%L = bitcast T1* OP to T2*
    Update locals(%L) with the value of OP.
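To make the alloca, load and store rules more concrete, here is a small OCaml sketch over a single memory block. It is only an approximation under stated assumptions: the names Crash, alloca, load and store are hypothetical (they are not taken from the HW3 interpreter), pointers are reduced to a path within one frame block, slots hold a single simple value, and the check against the type S is omitted.

```ocaml
(* Hypothetical sketch; type checking against S is deliberately left out. *)
type sval = VUndef | VInt of int64
type mval = Leaf of sval | Node of mval list

exception Crash of string         (* stands in for "the machine crashes" *)

(* alloca: add an undef slot to the frame's root at the next available index
   and return the updated frame together with the path of the new slot. *)
let alloca (frame : mval) : mval * int list =
  match frame with
  | Node kids -> (Node (kids @ [Leaf VUndef]), [List.length kids])
  | Leaf _ -> raise (Crash "alloca: frame is not a node")

(* load: the path must identify a leaf holding a simple value. *)
let rec load (m : mval) (p : int list) : sval =
  match m, p with
  | Leaf v, [] -> v
  | Node kids, i :: rest when i >= 0 && i < List.length kids ->
    load (List.nth kids i) rest
  | _ -> raise (Crash "load: invalid pointer")

(* store: rebuild the tree with the leaf at the given path replaced. *)
let rec store (m : mval) (p : int list) (v : sval) : mval =
  match m, p with
  | Leaf _, [] -> Leaf v
  | Node kids, i :: rest when i >= 0 && i < List.length kids ->
    Node (List.mapi (fun j k -> if j = i then store k rest v else k) kids)
  | _ -> raise (Crash "store: invalid pointer")
```

Memory is treated as an immutable value here, so store returns a new tree; this mirrors the way the interpreter is described as returning a new memory state.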
GEP Indexing

The semantics of GEP and exactly when the resulting pointer is valid is the most complicated part of LLVMlite. Here we walk through a slightly more complex example.

%t1 = type { A, B, C }
%t2 = type [ 2 x %t1 ]
@pn1 = global %t2 [ {a0, b0, c0}, {a1, b1, c1} ]
; Memory:
; { ... bid0 -> root ... }
;                |
;                n1
;              /    \
;            n2      n3
;          / | \    / | \
;        a0 b0 c0  a1 b1 c1
...
%pn2 = getelementptr %t2* pn1, i32 0, i32 0 ; %t1* -> n2
%pb1 = getelementptr %t1* pn2, i32 1, i32 1 ; B* -> b1

• Start with the pointer pn1 = (bid0, 0) pointing to n1.
• The first GEP instruction above will compute the pointer (bid0, 0, 0), by first adding 0 to the last index of pn1 and then concatenating the rest of the indices to the end of the path.
• The next GEP instruction will compute the pointer (bid0, 0, 1, 1), which points to b1.
• Why?
– In LLVMlite, indexing into a sibling (rather than a child) of a node using GEP with a non-zero first index is only legal if sibling nodes are allocated as part of an array.
– In our example, n1 was allocated as [ 2 x %t1 ], so this is the case.
– In addition to the restricted use of the first index, the resulting path must target a subtree. If, instead, we tried to create a pointer off the end of the array, the resulting …
• Check out effective_tag in the interpreter code.
• Some examples in llprograms: gep3.ll, gep5.ll, gep6.ll
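The path arithmetic in the example above can be sketched in a few lines of OCaml. The function name gep_path is hypothetical, and the sketch covers only the index computation; it does not model the validity checks (the array-sibling rule or whether the result targets a subtree).

```ocaml
(* Hypothetical helper: add the first GEP index to the last index of the
   base path, then append the remaining indices. Validity is not checked. *)
let gep_path (base : int list) (indices : int list) : int list =
  match List.rev base, indices with
  | last :: rev_prefix, first :: rest ->
    List.rev_append rev_prefix ((last + first) :: rest)
  | _ -> invalid_arg "gep_path: empty base path or no indices"

(* pn1 = (bid0, 0); the two GEP instructions from the example. *)
let pn2 = gep_path [0] [0; 0]      (* path of n2 *)
let pb1 = gep_path pn2 [1; 1]      (* path of b1 *)
```

Here pn2 evaluates to [0; 0] and pb1 to [0; 1; 1], matching the pointers (bid0, 0, 0) and (bid0, 0, 1, 1) in the walkthrough.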
Compiling LLVMlite to X86
Compiling LLVMlite Types to X86

• ⟦i1⟧, ⟦i64⟧, ⟦t*⟧ = quad word (8 bytes, 8-byte aligned)

• raw i8 values are not allowed (they must be manipulated via i8*)

• array and struct types are laid out sequentially in memory

• getelementptr computations must be relative to the LLVMlite size definitions
– i.e. ⟦i1⟧ = quad (quite wasteful!)
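The size rules above might be sketched as follows. This is an illustration under assumptions, not the HW3 code: the type ty and the functions size_ty and field_offset are hypothetical names, named types and i8 are left out, and no padding is inserted because every scalar already occupies 8 bytes.

```ocaml
(* Hypothetical, simplified LLVMlite types (no named types, no i8). *)
type ty =
  | I1
  | I64
  | Ptr of ty
  | Struct of ty list
  | Array of int * ty

(* Size in bytes: every scalar is a quad word; aggregates are sequential. *)
let rec size_ty (t : ty) : int =
  match t with
  | I1 | I64 | Ptr _ -> 8
  | Struct ts -> List.fold_left (fun acc t -> acc + size_ty t) 0 ts
  | Array (n, t) -> n * size_ty t

(* Byte offset of field i in a struct: the total size of the earlier fields.
   This is the quantity a compiled getelementptr adds for a struct index. *)
let rec field_offset (ts : ty list) (i : int) : int =
  match ts, i with
  | _, 0 -> 0
  | t :: rest, _ -> size_ty t + field_offset rest (i - 1)
  | [], _ -> invalid_arg "field_offset: index out of range"
```

For an array index the corresponding offset would be the index multiplied by size_ty of the element type.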
Compiling LLVM locals
• How do we manage storage for each %uid defined by an LLVM instruction?

• Option 1:
– Map each %uid to an x86 register
– Efficient!
– Difficult to do effectively: many %uid values, only 16 registers
– We will see how to do this later in the semester

• Option 2:
– Map each %uid to a stack-allocated space
– Less efficient!
– Simple to implement

• For HW3 we will follow Option 2
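One way Option 2 could look is sketched below. The names stack_layout and operand_of are hypothetical, and the layout (8-byte slots at negative offsets from %rbp, in order of definition) is an assumption for illustration rather than the layout HW3 requires.

```ocaml
(* Hypothetical sketch: give each uid its own 8-byte slot below %rbp. *)
let stack_layout (uids : string list) : (string * int) list =
  List.mapi (fun i uid -> (uid, -8 * (i + 1))) uids

(* Render the slot of a uid as an AT&T-style memory operand.
   Raises Not_found if the uid has no slot. *)
let operand_of (layout : (string * int) list) (uid : string) : string =
  Printf.sprintf "%d(%%rbp)" (List.assoc uid layout)

let layout = stack_layout ["cnd"; "x"; "y"]
```

Every use of a %uid then becomes a load from its slot and every definition a store to it, which is simple but generates much more memory traffic than register allocation.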


Other LLVMlite Features

• Globals
– must use %rip-relative addressing

• Calls
– Follow the x86-64 System V (AMD64) ABI calling conventions
– Should interoperate with C programs

• getelementptr
– trickiest part
Tour of HW3

• See HW3 description and README.md

• Main definitions: ll.ml

• Compiler in the pipeline: driver.ml and process_ll_file.

• Using main.native

• Compiling with clang


Lexing
Lexical analysis, tokens, regular expressions, automata
Compilation in a Nutshell

Source Code (character stream):
    if (b == 0) { a = 1; }

  ↓ Lexical Analysis

Token stream:
    if  (  b  ==  0  )  {  a  =  1  ;  }

  ↓ Parsing

Abstract Syntax Tree:
            If
          /  |  \
        Eq  Assn  None
       /  \  /  \
      b   0 a    1

  ↓ Analysis & Transformation

Intermediate code:
    l1:
      %cnd = icmp eq i64 %b, 0
      br i1 %cnd, label %l2, label %l3
    l2:
      store i64 1, i64* %a
      br label %l3
    l3:

  ↓ Backend

Assembly Code:
    l1:
      cmpq $0, %rax
      je l2
      jmp l3
    l2:
      ...
Today: Lexing

Of the compilation pipeline, today covers the first stage:

Source Code (character stream):
    if (b == 0) { a = 1; }

  ↓ Lexical Analysis

Token stream:
    if  (  b  ==  0  )  {  a  =  1  ;  }

First Step: Lexical Analysis

• Change the character stream "if (b == 0) { a = 0; }" into tokens:

    if  (  b  ==  0  )  {  a  =  0  ;  }

    IF; LPAREN; Ident("b"); EQEQ; Int(0); RPAREN; LBRACE; Ident("a");
    EQ; Int(0); SEMI; RBRACE

• Token: data type that represents indivisible "chunks" of text:
– Identifiers: a y11 elsex _100
– Keywords: if else while
– Integers: 2 200 -500 5L
– Floating point: 2.0 .02 1e5
– Symbols: + * ` { } ( ) ++ << >> >>>
– Strings: "x" "He said, \"Are you?\""
– Comments: (* CS4212: Project 1 … *) /* foo */

• Often delimited by whitespace (' ', \t, etc.)
– In some languages (e.g. Python or Haskell) whitespace is significant
Demo: Handlex

How hard can it be?

See handlex.ml

https://github.com/cs4212/week-05-lexing
Lexing By Hand

• How hard can it be?
– Tedious and painful!
• Problems:
– Precisely define tokens
– Matching tokens simultaneously
– Reading too much input (need look ahead)
– Error handling
– Hard to compose/interleave tokeniser code
– Hard to maintain
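To give a feel for these problems, here is a minimal hand-written lexer for the running example. It is a sketch written for this note, not the contents of handlex.ml; the token type, the helper span and the function lex are assumptions, and only a handful of token kinds are covered.

```ocaml
(* A sketch of a hand-written lexer; not the handlex.ml from the demo. *)
type token =
  | IF | LPAREN | RPAREN | LBRACE | RBRACE | EQEQ | EQ | SEMI
  | Ident of string
  | Int of int

let is_digit c = '0' <= c && c <= '9'
let is_alpha c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')
let is_ident_char c = is_alpha c || is_digit c || c = '_'

(* First index at or after i where the predicate p stops holding. *)
let rec span p s i =
  if i < String.length s && p s.[i] then span p s (i + 1) else i

let lex (s : string) : token list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc
      | '(' -> go (i + 1) (LPAREN :: acc)
      | ')' -> go (i + 1) (RPAREN :: acc)
      | '{' -> go (i + 1) (LBRACE :: acc)
      | '}' -> go (i + 1) (RBRACE :: acc)
      | ';' -> go (i + 1) (SEMI :: acc)
      | '=' ->
        (* One character of lookahead separates = from ==. *)
        if i + 1 < n && s.[i + 1] = '=' then go (i + 2) (EQEQ :: acc)
        else go (i + 1) (EQ :: acc)
      | c when is_digit c ->
        let j = span is_digit s i in
        go j (Int (int_of_string (String.sub s i (j - i))) :: acc)
      | c when is_alpha c ->
        (* Read the whole word first, then decide keyword vs identifier. *)
        let j = span is_ident_char s i in
        let word = String.sub s i (j - i) in
        go j ((if word = "if" then IF else Ident word) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c at %d" c i)
  in
  go 0 []
```

Even at this size the lookahead for == and the keyword check are woven into the control flow, which is the composition and maintenance problem listed above.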
A Principled Solution to Lexing
Regular Expressions
• Regular expressions precisely describe sets of strings.

• A regular expression R has one of the following forms:
– ε         Epsilon stands for the empty string
– 'a'       An ordinary character stands for itself
– R1 | R2   Alternatives, stands for choice of R1 or R2
– R1R2      Concatenation, stands for R1 followed by R2
– R*        Kleene star, stands for zero or more repetitions of R

• Useful extensions:
– “foo” Strings, equivalent to 'f''o''o'
– R+ One or more repetitions of R, equivalent to RR*
– R? Zero or one occurrences of R, equivalent to (ε|R)
– ['a'-'z'] One of a or b or c or … z, equivalent to (a|b|…|z)
– [^'0'-'9'] Any character except 0 through 9
– R as x Name the string matched by R as x
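The five core forms translate directly into an OCaml datatype, and several of the extensions can be defined in terms of them. The sketch below is one possible encoding; the matcher uses Brzozowski derivatives, a technique chosen here for brevity rather than the automata construction a lexer generator uses, and all names (regex, nullable, deriv, matches, plus, opt, range) are illustrative. The extra Empty case, which matches nothing, is needed only by the derivative computation.

```ocaml
type regex =
  | Empty                   (* matches no string; used by deriv only *)
  | Eps                     (* the empty string *)
  | Chr of char             (* a single character *)
  | Alt of regex * regex    (* R1 | R2 *)
  | Seq of regex * regex    (* R1R2 *)
  | Star of regex           (* R* *)

(* Does r accept the empty string? *)
let rec nullable = function
  | Empty | Chr _ -> false
  | Eps | Star _ -> true
  | Alt (r1, r2) -> nullable r1 || nullable r2
  | Seq (r1, r2) -> nullable r1 && nullable r2

(* deriv c r accepts exactly the strings s such that r accepts c followed by s. *)
let rec deriv c = function
  | Empty | Eps -> Empty
  | Chr d -> if c = d then Eps else Empty
  | Alt (r1, r2) -> Alt (deriv c r1, deriv c r2)
  | Seq (r1, r2) ->
    let left = Seq (deriv c r1, r2) in
    if nullable r1 then Alt (left, deriv c r2) else left
  | Star r as star -> Seq (deriv c r, star)

(* r matches s if, after consuming every character, the residue is nullable. *)
let matches (r : regex) (s : string) : bool =
  let cur = ref r in
  String.iter (fun c -> cur := deriv c !cur) s;
  nullable !cur

(* Extensions expressed through the core forms. *)
let plus r = Seq (r, Star r)                 (* R+ is RR* *)
let opt r = Alt (Eps, r)                     (* R? is (ε|R) *)
let range lo hi =                            (* ['a'-'z'] as a big alternative *)
  let rec go c acc =
    if c > Char.code hi then acc
    else go (c + 1) (Alt (acc, Chr (Char.chr c)))
  in
  go (Char.code lo) Empty

let digit = range '0' '9'
let int_lit = Seq (opt (Chr '-'), plus digit)   (* '-'?['0'-'9']+ *)
```

No simplification of intermediate expressions is performed, so this is suitable for short inputs only.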
Example Regular Expressions

• Recognise the keyword "if": "if"
• Recognise a digit: ['0'-'9']
• Recognise an integer literal: '-'?['0'-'9']+
• Recognise an identifier:
(['a'-'z']|['A'-'Z'])(['0'-'9']|'_'|['a'-'z']|['A'-'Z'])*

• In practice, it's useful to be able to name regular expressions:

let lowercase = ['a'-'z']
let uppercase = ['A'-'Z']
let character = uppercase | lowercase
How to Match?
• Consider the input string: ifx = 0
– Could lex as: if x = 0 or as: ifx = 0

• Regular expressions alone are ambiguous; we need a rule to choose between the options above
• Most languages choose “longest match”
– So the 2nd option above will be picked
– Note that only the first option is “correct” for parsing purposes

• Conflicts: arise due to two tokens whose regular expressions have a shared prefix
– Ties broken by giving some matches higher priority
– Example: keywords have priority over identifiers
– Usually specified by the order in which the rules appear in the lex input file
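The combination of "longest match" and rule priority can be sketched as follows. This is a hand-rolled illustration, not how ocamllex is implemented: the rule record, the matchers literal and run, and next_token are hypothetical names, and each rule is a simple prefix matcher rather than a compiled automaton.

```ocaml
type token = IF | Ident of string | Int of int

(* A rule pairs a prefix matcher with an action. The matcher returns the
   length of the longest prefix of s starting at i that it accepts. *)
type rule = {
  matcher : string -> int -> int option;
  action : string -> token;
}

let is_digit c = '0' <= c && c <= '9'
let is_alpha c = ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z')

(* Accept exactly the literal string lit. *)
let literal lit s i =
  let n = String.length lit in
  if i + n <= String.length s && String.sub s i n = lit then Some n else None

(* Accept one character satisfying first, then as many as satisfy rest. *)
let run first rest s i =
  if i < String.length s && first s.[i] then begin
    let j = ref (i + 1) in
    while !j < String.length s && rest s.[!j] do incr j done;
    Some (!j - i)
  end else None

(* Earlier rules have higher priority: the keyword comes before identifiers. *)
let rules = [
  { matcher = literal "if"; action = (fun _ -> IF) };
  { matcher = run is_alpha (fun c -> is_alpha c || is_digit c || c = '_');
    action = (fun lexeme -> Ident lexeme) };
  { matcher = run is_digit is_digit;
    action = (fun lexeme -> Int (int_of_string lexeme)) };
]

(* Choose the longest match; on a tie the earlier rule is kept.
   Returns the token and the position just after it. *)
let next_token (s : string) (i : int) : (token * int) option =
  let best =
    List.fold_left
      (fun best r ->
         match r.matcher s i, best with
         | Some n, Some (m, _) when n <= m -> best
         | Some n, _ -> Some (n, r)
         | None, _ -> best)
      None rules
  in
  match best with
  | Some (n, r) -> Some (r.action (String.sub s i n), i + n)
  | None -> None
```

On the input ifx = 0 the identifier rule wins because it matches three characters, while on if (b the keyword and identifier rules tie at two characters and the earlier keyword rule is chosen.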
Lexer Generators
• Reads a list of regular expressions: R1,…,Rn, one per token.
• Each token has an attached "action" Ai
(just a piece of code to run when the regular expression is matched)

(* character is the named expression defined earlier; the token constructors are assumed to be defined elsewhere *)
{ open Lexing }
let digit = ['0'-'9']
let whitespace = [' ' '\t' '\n']

rule token = parse
| '-'?digit+                         { Int (Int32.of_string (lexeme lexbuf)) }
| '+'                                { PLUS }
| "if"                               { IF }
| character (digit|character|'_')*   { Ident (lexeme lexbuf) }
| whitespace+                        { token lexbuf }

(left column: regular expressions; right column, in braces: the actions, each producing a token)

• Generates scanning code that:
1. Decides whether the input is of the form (R1|…|Rn)*
2. Whenever the scanner matches a (longest) token, it runs the associated action
Demo: Ocamllex

lexlex.mll
