PLDI Week 07 More Parsing
Week 7: Parsing
Ilya Sergey
[email protected]
ilyasergey.net/CS4212/
Where we are
• Earlier in the course:
• basics of x86
• LLVM
• Last week:
• Lexical Analysis
• This week:
• Algorithms for Parsing
• Parser Generation
• Next week:
• Types and Type Systems
Compilation in a Nutshell
Source Code (character stream):
  if (b == 0) { a = 1; }
↓ Lexical Analysis
Token stream:
  if ( b == 0 ) { a = 1 ; }
↓ Parsing
Abstract Syntax Tree:
  If(Eq(b, 0), Assn(a, 1), None)
↓ Analysis & Transformation
Intermediate code:
  l1:
    %cnd = icmp eq i64 %b, 0
    br i1 %cnd, label %l2, label %l3
  l2:
    store i64 1, i64* %a
    br label %l3
  l3:
↓ Backend
Assembly Code:
  l1:
    cmpq $0, %rax
    je l2
    jmp l3
  l2:
    …
This week: Parsing
Parsing: Finding Syntactic Structure
Source input:
  {
    if (b == 0) a = b;
    while (a != 1) {
      print_int(a);
      a = a - 1;
    }
  }
Abstract Syntax tree:
  Block
    If
      Bop (b == 0)
      …
      …
    While
      Bop (a != 1)
      Block
        Expr
          Call
          …
        …
Context-Free Grammars
• Here is a specification of the language of balanced parens:
S ⟼ (S)S
S ⟼ ε
• Idea: “derive” a string in the language by starting with S and rewriting according to the rules:
– Example: S ⟼ (S)S ⟼ ((S)S)S ⟼ ((ε)S)S ⟼ ((ε)S)ε ⟼ ((ε)ε)ε = (())
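The two rules for balanced parens can be turned into a small recognizer. The sketch below is only an illustration: the names parse_s and balanced are made up here, and the input is assumed to be a plain string of ‘(’ and ‘)’ characters.

```ocaml
(* Recognizer for S ⟼ (S)S | ε over character lists.
   parse_s consumes a prefix derivable from S and returns the remaining input,
   or None if a '(' was opened but never closed. *)
let rec parse_s (input : char list) : char list option =
  match input with
  | '(' :: rest ->
    (* Rule S ⟼ ( S ) S *)
    (match parse_s rest with
     | Some (')' :: rest') -> parse_s rest'
     | _ -> None)
  | _ ->
    (* Rule S ⟼ ε: consume nothing *)
    Some input

(* A string is in the language iff S derives all of it. *)
let balanced (s : string) : bool =
  parse_s (List.of_seq (String.to_seq s)) = Some []
```

For instance, balanced "(())" follows the derivation shown in the example above, while balanced "(()" fails because the outer ‘(’ is never closed.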
LL(1) Grammars
Grammar is the problem
• Not all grammars can be parsed “top-down” with only a single lookahead symbol.
• Top-down: starting from the start symbol (root of the parse tree) and going down
• LL(1) means
– Left-to-right scanning
– Left-most derivation,
– 1 lookahead symbol
• This grammar isn’t “LL(1)”:
S ⟼ E + S | E
E ⟼ number | ( S )
• Given only one lookahead symbol, ‘(’, it isn’t clear whether to pick S ⟼ E or S ⟼ E + S first.
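• For each production A ⟼ γ, construct the set of all input tokens that may appear first in strings that can be derived from γ
– Add the production ⟼ γ to the entry (A, token) for each such token.
• If γ can derive ε, construct the set of all input tokens that may follow the nonterminal A
– Add the production ⟼ γ to the entry (A, token) for each such token.
• Note: if there are two different productions for a given entry, the grammar is not LL(1)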
Example
T ⟼ S$
S ⟼ E S’
S’ ⟼ ε
S’ ⟼ + S
E ⟼ number | ( S )
• First(T) = First(S)
• First(S) = First(E)
• First(S’) = { + }
• First(E) = { number, ‘(’ }
• Follow(S’) = Follow(S)
• Follow(S) = { $, ‘)’ } ∪ Follow(S’)
Note: we want the least solution to this system of set equations… a fixpoint computation. More on these later in the course.
        number      +         (          )       $ (EOF)
T       ⟼ S$                  ⟼ S$
S       ⟼ E S’                ⟼ E S’
S’                  ⟼ + S                ⟼ ε     ⟼ ε
E       ⟼ num.                ⟼ ( S )
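The First equations in the example can be solved by starting from empty sets and re-applying the equations until nothing changes. The code below is a minimal sketch of that fixpoint iteration, not the course's implementation: the grammar encoding, the symbol type and all function names are assumptions made for illustration. Nullability (whether a nonterminal can derive ε) is tracked separately; Follow sets can be computed with the same kind of loop.

```ocaml
module SS = Set.Make (String)

type symbol = T of string | N of string  (* terminal / nonterminal *)

(* Example grammar; the empty right-hand side stands for ε. *)
let grammar : (string * symbol list) list = [
  ("T",  [N "S"; T "$"]);
  ("S",  [N "E"; N "S'"]);
  ("S'", []);
  ("S'", [T "+"; N "S"]);
  ("E",  [T "number"]);
  ("E",  [T "("; N "S"; T ")"]);
]
let nonterminals = ["T"; "S"; "S'"; "E"]

let lookup tbl a = try List.assoc a tbl with Not_found -> SS.empty

(* Can this sequence of symbols derive ε, given the current nullable set? *)
let rec is_nullable nullable = function
  | [] -> true
  | T _ :: _ -> false
  | N a :: rest -> SS.mem a nullable && is_nullable nullable rest

(* First set of a sequence of symbols under the current approximation. *)
let rec first_of nullable first = function
  | [] -> SS.empty
  | T t :: _ -> SS.singleton t
  | N a :: rest ->
    let fa = lookup first a in
    if SS.mem a nullable then SS.union fa (first_of nullable first rest) else fa

(* One round of re-applying every production to the current approximation. *)
let step (nullable, first) =
  let nullable' =
    List.fold_left (fun acc (a, rhs) ->
        if is_nullable nullable rhs then SS.add a acc else acc)
      nullable grammar in
  let first' =
    List.map (fun a ->
        let fa =
          List.fold_left (fun acc (b, rhs) ->
              if a = b then SS.union acc (first_of nullable first rhs) else acc)
            (lookup first a) grammar in
        (a, fa))
      nonterminals in
  (nullable', first')

(* Iterate until nothing changes: this yields the least solution. *)
let rec fix (nullable, first) =
  let (n', f') = step (nullable, first) in
  let same =
    SS.equal n' nullable
    && List.for_all (fun a -> SS.equal (lookup f' a) (lookup first a)) nonterminals in
  if same then (nullable, first) else fix (n', f')

let nullable_final, first_final = fix (SS.empty, [])
let first_set a = SS.elements (lookup first_final a)
```

Because the sets only grow and the set of tokens is finite, the iteration terminates; first_set "S'" contains only "+", as in the example.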
Converting the table to code
• Define n mutually recursive functions
– one for each nonterminal A: parse_A
– Assuming the stream of tokens is globally available, the type of parse_A is unit -> ast,
if A is not an auxiliary nonterminal
– Parse functions for auxiliary nonterminals (e.g. S’) take extra ast’s as inputs, one for each
nonterminal in the “factored” prefix.
• Each function “peeks” at the lookahead token and then follows the production rule in the
corresponding entry.
– Consume terminal tokens from the input stream
– Call parse_X to create sub-tree for nonterminal X
– If the rule ends in an auxiliary nonterminal, call it with appropriate ast’s.
(The auxiliary rule is responsible for creating the ast after looking at more input.)
– Otherwise, this function builds the ast tree itself and returns it.
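The recipe above can be sketched for the example grammar T ⟼ S$, S ⟼ E S’, S’ ⟼ + S | ε, E ⟼ number | ( S ). This is a hand-written illustration and not the course's ll1_parser.ml: the token and ast types, the global tokens reference, and the helpers peek, advance and expect are all assumptions.

```ocaml
type token = Num of int | Plus | LParen | RParen | EOF
type ast = Int of int | Add of ast * ast

(* The token stream is globally available, as assumed on the slide. *)
let tokens : token list ref = ref []

let peek () = match !tokens with [] -> EOF | t :: _ -> t
let advance () = match !tokens with [] -> () | _ :: ts -> tokens := ts
let expect t = if peek () = t then advance () else failwith "unexpected token"

(* T ⟼ S $ *)
let rec parse_t () : ast =
  match peek () with
  | Num _ | LParen -> let s = parse_s () in expect EOF; s
  | _ -> failwith "parse_t: no table entry"

(* S ⟼ E S' *)
and parse_s () : ast =
  match peek () with
  | Num _ | LParen -> let e = parse_e () in parse_s' e
  | _ -> failwith "parse_s: no table entry"

(* S' ⟼ + S | ε. Auxiliary nonterminal: takes the ast of the factored prefix E
   and is responsible for building the final tree. *)
and parse_s' (e : ast) : ast =
  match peek () with
  | Plus -> advance (); let s = parse_s () in Add (e, s)
  | RParen | EOF -> e
  | _ -> failwith "parse_s': no table entry"

(* E ⟼ number | ( S ) *)
and parse_e () : ast =
  match peek () with
  | Num n -> advance (); Int n
  | LParen -> advance (); let s = parse_s () in expect RParen; s
  | _ -> failwith "parse_e: no table entry"

let parse (ts : token list) : ast =
  tokens := ts;
  parse_t ()
```

Each match arm corresponds to one non-empty entry of the parse table, and the failwith arms are the empty entries. Since the grammar is right-recursive, 1 + 2 + 3 is parsed as Add (Int 1, Add (Int 2, Int 3)).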
Demo: LL(1) Parsing
• https://github.com/cs4212/week-06-parsing
• ll1_parser.ml
• Hand-generated LL(1) code for the table below.
        number      +         (          )       $ (EOF)
T       ⟼ S$                  ⟼ S$
S       ⟼ E S’                ⟼ E S’
S’                  ⟼ + S                ⟼ ε     ⟼ ε
E       ⟼ num.                ⟼ ( S )
LL(1) Summary
• Problems:
– Grammar must be LL(1)
– Can extend to LL(k) (it just makes the table bigger)
– Grammar cannot be left recursive (parser functions will loop!)
– There are CF grammars that cannot be transformed to LL(k)
[Figure: parse trees under construction, Top-down vs. Bottom-up]
Progress of Bottom-up Parsing
Grammar:
S ⟼ S + E | E
E ⟼ number | ( S )
Reductions                   Scanned                  Input Remaining
(1 + 2 + (3 + 4)) + 5 ⟻                               (1 + 2 + (3 + 4)) + 5
(E + 2 + (3 + 4)) + 5 ⟻      (1                       + 2 + (3 + 4)) + 5
(S + 2 + (3 + 4)) + 5 ⟻      (1                       + 2 + (3 + 4)) + 5
(S + E + (3 + 4)) + 5 ⟻      (1 + 2                   + (3 + 4)) + 5
(S + (3 + 4)) + 5 ⟻          (1 + 2                   + (3 + 4)) + 5
(S + (E + 4)) + 5 ⟻          (1 + 2 + (3              + 4)) + 5
(S + (S + 4)) + 5 ⟻          (1 + 2 + (3              + 4)) + 5
(S + (S + E)) + 5 ⟻          (1 + 2 + (3 + 4          )) + 5
(S + (S)) + 5 ⟻              (1 + 2 + (3 + 4          )) + 5
(S + E) + 5 ⟻                (1 + 2 + (3 + 4)         ) + 5
(S) + 5 ⟻                    (1 + 2 + (3 + 4)         ) + 5
E + 5 ⟻                      (1 + 2 + (3 + 4))        + 5
S + 5 ⟻                      (1 + 2 + (3 + 4))        + 5
S + E ⟻                      (1 + 2 + (3 + 4)) + 5
S
Rightmost derivation (reading the reductions from bottom to top)
Shift/Reduce Parsing
• Parser state:
– Stack of terminals and nonterminals.
– Unconsumed input is a string of terminals
– Current derivation step is stack + input
• Parsing is a sequence of shift and reduce operations:
• Shift: move look-ahead token to the stack
• Reduce: Replace symbols γ at top of stack with nonterminal X
such that X ⟼ γ is a production. (pop γ, push X)
• Goal: know what set of reductions are legal at any given point.
• Example grammar for non-empty tuples and identifiers:
S ⟼ ( L ) | id
L ⟼ S | L , S
• Example strings:
– x
– (x,y)
– ((((x))))
– (x, (y, z), w)
– (x, (y, (z, w)))
[Figure: parse tree for (x, (y, z), w)]
• Main idea: decide what to do based on a prefix α of the stack plus the look-ahead symbol.
– The prefix α is different for different possible reductions
since in productions X ⟼ γ and Y ⟼ β, γ and β might have different lengths.
• Main goal: know what set of reductions are legal at any point.
– How do we keep track?
LR(0) States
• An LR(0) state is a set of items keeping track of progress on possible
upcoming reductions.
• An LR(0) item is a production from the grammar with an extra separator “.” somewhere in the right-hand side
S ⟼ ( L ) | id
L ⟼ S | L , S
• Example items: S ⟼ .( L ) or S ⟼ (. L) or L ⟼ S.
• Intuition:
– Stuff before the ‘.’ is already on the stack (beginnings of possible γ’s to be reduced)
– Stuff after the ‘.’ is what might be seen next
– The prefixes α are represented by the state itself
Constructing the DFA: Start state & Closure
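• Idea of the closure: add the productions that may become applicable given the stack observed so far, i.e. for every item with the “.” directly before a nonterminal A, add an item A ⟼ .γ for each production A ⟼ γ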
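A minimal sketch of the closure computation for the tuple grammar follows. The item record, the grammar encoding and the augmented start production S' ⟼ S $ are assumptions made for this illustration; the lecture's own representation may differ.

```ocaml
type symbol = T of string | N of string  (* terminal / nonterminal *)

(* An item is a production plus the position of the "." in its right-hand side. *)
type item = { lhs : string; rhs : symbol list; dot : int }

(* Tuple grammar, with an assumed extra start production S' ⟼ S $. *)
let grammar : (string * symbol list) list = [
  ("S'", [N "S"; T "$"]);
  ("S",  [T "("; N "L"; T ")"]);
  ("S",  [T "id"]);
  ("L",  [N "S"]);
  ("L",  [N "L"; T ","; N "S"]);
]

(* The symbol right after the ".", if any. *)
let next_symbol (it : item) : symbol option = List.nth_opt it.rhs it.dot

(* Keep adding items A ⟼ .γ for every nonterminal A that appears right after
   a "." until no new item can be added. *)
let rec closure (items : item list) : item list =
  let added =
    List.concat_map (fun it ->
        match next_symbol it with
        | Some (N a) ->
          List.filter_map (fun (lhs, rhs) ->
              let cand = { lhs; rhs; dot = 0 } in
              if lhs = a && not (List.mem cand items) then Some cand else None)
            grammar
        | _ -> [])
      items
  in
  let added = List.sort_uniq compare added in
  if added = [] then items else closure (items @ added)

(* Start state: closure of the start item S' ⟼ .S $ *)
let start = { lhs = "S'"; rhs = [N "S"; T "$"]; dot = 0 }
let start_state = closure [start]
```

Here start_state contains S' ⟼ .S $, S ⟼ .( L ) and S ⟼ .id. The closure of S ⟼ (. L ) additionally pulls in both L productions and, through L ⟼ .S, both S productions.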
Example Parse Table
        Action table                                             Goto table
State   (          )          id         ,          $           S     L
1       s3                    s2                                 g4
2       S⟼id       S⟼id       S⟼id       S⟼id       S⟼id
3       s3                    s2                                 g7    g5
4                                                   DONE
5                  s6                    s8
6       S⟼(L)      S⟼(L)      S⟼(L)      S⟼(L)      S⟼(L)
7       L⟼S        L⟼S        L⟼S        L⟼S        L⟼S
8       s3                    s2                                 g9
9       L⟼L,S      L⟼L,S      L⟼L,S      L⟼L,S      L⟼L,S
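The example parse table can be driven by a small shift/reduce loop that keeps a stack of states. The following is a sketch under stated assumptions: the token type, the encoding of the action and goto tables as OCaml functions, and the name accepts are illustrative. It only recognizes input and does not build a tree.

```ocaml
type token = LParen | RParen | Id | Comma | EOF
type nonterm = S | L

type action =
  | Shift of int
  | Reduce of nonterm * int   (* left-hand side, length of the right-hand side *)
  | Accept
  | Error

(* Action table: states with a reduce action ignore the lookahead (LR(0)). *)
let action (state : int) (tok : token) : action =
  match state, tok with
  | (1 | 3 | 8), LParen -> Shift 3
  | (1 | 3 | 8), Id -> Shift 2
  | 2, _ -> Reduce (S, 1)        (* S ⟼ id *)
  | 4, EOF -> Accept             (* DONE *)
  | 5, RParen -> Shift 6
  | 5, Comma -> Shift 8
  | 6, _ -> Reduce (S, 3)        (* S ⟼ ( L ) *)
  | 7, _ -> Reduce (L, 1)        (* L ⟼ S *)
  | 9, _ -> Reduce (L, 3)        (* L ⟼ L , S *)
  | _ -> Error

(* Goto table: next state after pushing a nonterminal. *)
let goto (state : int) (nt : nonterm) : int =
  match state, nt with
  | 1, S -> 4
  | 3, S -> 7
  | 3, L -> 5
  | 8, S -> 9
  | _ -> failwith "goto: no entry"

let rec drop n xs = if n = 0 then xs else drop (n - 1) (List.tl xs)

(* The stack holds states, top first; EOF is appended to the input here. *)
let accepts (input : token list) : bool =
  let rec run stack input =
    match stack, input with
    | st :: _, tok :: rest ->
      (match action st tok with
       | Shift s -> run (s :: stack) rest
       | Reduce (nt, len) ->
         (* Pop one state per right-hand-side symbol, then follow the goto entry. *)
         let stack' = drop len stack in
         run (goto (List.hd stack') nt :: stack') input
       | Accept -> true
       | Error -> false)
    | _ -> false
  in
  run [1] (input @ [EOF])
```

For example, accepts [LParen; Id; Comma; Id; RParen] walks through states 3, 2, 7, 5, 8, 2, 9, 5, 6 and 4 before reaching DONE.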
• An LR(0) machine only works if states with reduce actions have a single reduce action.
– In such states, the machine always reduces (ignoring lookahead)
• With more complex grammars, the DFA construction will yield states with shift/reduce
and reduce/reduce conflicts:
OK               shift/reduce       reduce/reduce
S ⟼ ( L ).       S ⟼ ( L ).         S ⟼ L , S.
                 L ⟼ .L , S         S ⟼ , S.
left                     right
S ⟼ S + E | E            S ⟼ E + S | E
E ⟼ number | ( S )       E ⟼ number | ( S )
[Figure: hierarchy of grammar classes: LR(0) ⊂ SLR ⊂ LALR(1) ⊂ LR(1), with LL(1) shown alongside]
Parsing in OCaml via Menhir
Practical Issues
• https://github.com/cs4212/week-07-more-parsing
Shift/Reduce conflicts
• Conflict 1:
• Operator precedence (State 13)
• Resolving by changing the grammar (see good_parser.ml)
• Conflict 2:
• Parsing if-then-else statements
From Menhir Manual http://gallium.inria.fr/~fpottier/menhir/manual.pdf
5.3 Inlining
It is well-known that the following grammar of arithmetic expressions does not work as expected: that is, in
spite of the priority declarations, it has shift/reduce conflicts.
%token < int > INT
%token PLUS TIMES
%left PLUS
%left TIMES
%%
expression:
| i = INT { i }
| e = expression; o = op; f = expression { o e f }
op:
| PLUS { ( + ) }
| TIMES { ( * ) }
The trouble is, the precedence level of the production expression → expression op expression is undefined, and
there is no sensible way of defining it via a %prec declaration, since the desired level really depends upon the
symbol that was recognized by op: was it PLUS or TIMES?
The standard workaround is to abandon the definition of op as a separate nonterminal symbol, and to inline
its definition into the definition of expression, like this:
expression:
| i = INT { i }
| e = expression; PLUS; f = expression { e + f }
| e = expression; TIMES; f = expression { e * f }
This avoids the shift/reduce conflict, but gives up some of the original specification’s structure, which,
in realistic situations, can be damageable. Fortunately, Menhir offers a way of avoiding the conflict without
manually transforming the grammar, by declaring that the nonterminal symbol op should be inlined:
expression:
| i = INT { i }
| e = expression; o = op; f = expression { o e f }
%inline op:
| PLUS { ( + ) }
| TIMES { ( * ) }
The %inline keyword causes all references to op to be replaced with its definition. In this example, the definition
of op involves two productions, one that develops to PLUS and one that expands to TIMES, so every production
that refers to op is effectively turned into two productions, one that refers to PLUS and one that refers to TIMES.
After inlining, op disappears and expression has three productions: that is, the result of inlining is exactly the
manual workaround shown above.
In some situations, inlining can also help recover a slight efficiency margin. For instance, the definition:
Precedence and Associativity Declarations
• Parser generators like Menhir often support precedence and associativity declarations.
– Hints to the parser about how to resolve conflicts.
– See: good-parser.mly
• Pros:
– Avoids having to resolve those ambiguities by manually introducing extra nonterminals
(see parser.mly)
– Easier to maintain the grammar
• Cons:
– Can’t as easily re-use the same terminal (if associativity differs)
– Introduces another level of debugging
• Limits:
– Not always easy to disambiguate the grammar based on just precedence and associativity.
Conflict 2: Ambiguity in Real Languages
• Observation: An un-matched ‘if’ should not appear as the ‘then’ clause of a containing ‘if’.
S ⟼ M | U               // M = “matched”, U = “unmatched”
U ⟼ if (E) S            // Unmatched ‘if’
U ⟼ if (E) M else U     // Nested if is matched
M ⟼ if (E) M else M     // Matched ‘if’
M ⟼ X = E               // Other statements
• See: else-resolved-parser.mly
Alternative: Use { }
• Ambiguity arises because the ‘then’ branch is not well bracketed: