PLDI Week 07 More Parsing

The document discusses parsing in compiler design. It covers context-free grammars and how they are used to define languages. It also discusses LL(1) grammars and how predictive parsing works by using a parsing table to uniquely determine the production to apply based on the lookahead token.

Uploaded by

Victor Zhao

CS4212: Compiler Design

Week 7: Parsing

Ilya Sergey

ilyasergey.net/CS4212/
Where we are
• Before in the Course:
• basics of x86
• LLVM

• Last week:
• Lexical Analysis

• This week:
• Algorithms for Parsing
• Parser Generation

• Next week:
• Types and Type Systems
Compilation in a Nutshell
Source Code (character stream):
    if (b == 0) { a = 1; }
Lexical Analysis
Token stream:
    if ( b == 0 ) { a = 1 ; }
Parsing
Abstract Syntax Tree:
    If(Eq(b, 0), Assn(a, 1), None)
Analysis & Transformation
Intermediate code:
    l1:
      %cnd = icmp eq i64 %b, 0
      br i1 %cnd, label %l2, label %l3
    l2:
      store i64 1, i64* %a
      br label %l3
    l3:
Backend
Assembly Code:
    l1:
      cmpq $0, %rax
      je l2
      jmp l3
    l2:
      …
This week: Parsing
Source input:
    {
      if (b == 0) a = b;
      while (a != 1) {
        print_int(a);
        a = a - 1;
      }
    }
Abstract Syntax tree:
    Block
      If
        Bop (b == 0)
        …
        …
      While
        Bop (a != 1)
        Block
          Expr
            Call
              …
          …
Context-Free Grammars
• Here is a specification of the language of balanced parens:

    S ⟼ (S)S
    S ⟼ ε

Note: Once again we have to take care to distinguish meta-language elements (e.g. “S” and “⟼”) from object-language elements (e.g. “(“ ).*

• The definition is recursive – S mentions itself.

• Idea: “derive” a string in the language by starting with S and rewriting according to the rules:
– Example: S ⟼ (S)S ⟼ ((S)S)S ⟼ ((ε)S)S ⟼ ((ε)S)ε ⟼ ((ε)ε)ε = (())

• You can replace the “nonterminal” S by one of its definitions anywhere

• A context-free grammar accepts a string iff there is a derivation of that string from the start symbol

* And, since we’re writing this description in English, we are careful to distinguish the meta-meta-language (e.g. words) from the meta-language and object-language (e.g. symbols) by using quotes.
CFGs Mathematically
• A Context-free Grammar (CFG) consists of
– A set of terminals (e.g., a lexical token or ε)
– A set of nonterminals (e.g., S and other syntactic variables)
– A designated nonterminal called the start symbol
– A set of productions: LHS ⟼ RHS
• LHS is a nonterminal
• RHS is a string of terminals and nonterminals

• Example: The balanced parentheses language:

S ⟼ (S)S
S⟼ε

• How many terminals? How many nonterminals? Productions?
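The four components listed above can be written down directly as data. The following is a minimal sketch in OCaml; the names `symbol`, `grammar`, `balanced`, `terminals` and `nonterminals` are illustrative assumptions, not part of the course code, and ε is encoded as an empty right-hand side rather than as a terminal.

```ocaml
(* Hypothetical encoding of a CFG; all names here are illustrative. *)
type symbol =
  | T of char      (* terminal *)
  | N of string    (* nonterminal *)

type grammar = {
  start : string;                             (* the start symbol *)
  productions : (string * symbol list) list;  (* LHS, RHS; [] encodes ε *)
}

(* S ⟼ (S)S   and   S ⟼ ε *)
let balanced : grammar = {
  start = "S";
  productions = [
    ("S", [T '('; N "S"; T ')'; N "S"]);
    ("S", []);
  ];
}

(* Distinct terminals that occur on some right-hand side. *)
let terminals (g : grammar) : char list =
  List.concat_map
    (fun (_, rhs) -> List.filter_map (function T c -> Some c | N _ -> None) rhs)
    g.productions
  |> List.sort_uniq compare

(* Distinct nonterminals that occur as a left-hand side. *)
let nonterminals (g : grammar) : string list =
  List.map fst g.productions |> List.sort_uniq compare
```

Under this encoding, the balanced-parentheses grammar has two terminals, one nonterminal and two productions.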

LL & LR Parsing

Searching for derivations

Consider finding left-most derivations
• Look at only one input symbol at a time.

    S ⟼ E + S | E
    E ⟼ number | ( S )

Partly-derived String          Look-ahead   Parsed/Unparsed Input

S                              (            (1 + 2 + (3 + 4)) + 5
⟼ E + S                        (            (1 + 2 + (3 + 4)) + 5
⟼ (S) + S                      1            (1 + 2 + (3 + 4)) + 5
⟼ (E + S) + S                  1            (1 + 2 + (3 + 4)) + 5
⟼ (1 + S) + S                  2            (1 + 2 + (3 + 4)) + 5
⟼ (1 + E + S) + S              2            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + S) + S              (            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + E) + S              (            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + (S)) + S            3            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + (E + S)) + S        3            (1 + 2 + (3 + 4)) + 5
⟼ …
There is a problem
    S ⟼ E + S | E
    E ⟼ number | ( S )
• We want to decide which production to apply based on the look-ahead symbol.
• But, there is a choice:

    (1):      S ⟼ E ⟼ (S) ⟼ (E) ⟼ (1)
  vs.
    (1) + 2:  S ⟼ E + S ⟼ (S) + S ⟼ (E) + S ⟼ (1) + S ⟼ (1) + E ⟼ (1) + 2

• Given only one look-ahead symbol, ‘(‘, it isn’t clear whether to pick S ⟼ E or S ⟼ E + S first.
LL(1) Grammars
Grammar is the problem
• Not all grammars can be parsed “top-down” with only a single lookahead symbol.
• Top-down: starting from the start symbol (root of the parse tree) and going down
• LL(1) means
– Left-to-right scanning
– Left-most derivation,
– 1 lookahead symbol
• This grammar isn’t “LL(1)”:
    S ⟼ E + S | E
    E ⟼ number | ( S )

• Is it LL(k) for some k?

• What can we do?

Making a grammar LL(1)
• Problem: We can’t decide which S production to apply until we see the symbol after the first expression.

• Solution: “Left-factor” the grammar. There is a common prefix E for each choice of S, so add a new non-terminal S’ at the decision point:

    Before:                      After:
    S ⟼ E + S | E                S ⟼ E S’
    E ⟼ number | ( S )           S’ ⟼ ε
                                 S’ ⟼ + S
                                 E ⟼ number | ( S )

• Also need to eliminate left-recursion. Why?

• Consider:
    S ⟼ S + E | E
    E ⟼ number | ( S )
LL(1) Parse of the input string
• Look at only one input symbol at a time.

    S ⟼ E S’
    S’ ⟼ ε
    S’ ⟼ + S
    E ⟼ number | ( S )

Partly-derived String          Look-ahead   Parsed/Unparsed Input

S                              (            (1 + 2 + (3 + 4)) + 5
⟼ E S’                         (            (1 + 2 + (3 + 4)) + 5
⟼ (S) S’                       1            (1 + 2 + (3 + 4)) + 5
⟼ (E S’) S’                    1            (1 + 2 + (3 + 4)) + 5
⟼ (1 S’) S’                    +            (1 + 2 + (3 + 4)) + 5
⟼ (1 + S) S’                   2            (1 + 2 + (3 + 4)) + 5
⟼ (1 + E S’) S’                2            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 S’) S’                +            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + S) S’               (            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + E S’) S’            (            (1 + 2 + (3 + 4)) + 5
⟼ (1 + 2 + (S)S’) S’           3            (1 + 2 + (3 + 4)) + 5
Predictive Parsing

• Given an LL(1) grammar:

– For a given nonterminal, the look-ahead symbol uniquely determines the production to apply.
– Top-down parsing = predictive parsing
– Driven by a predictive parsing table:
nonterminal * input token → production

    S ⟼ E S’
    S’ ⟼ ε
    S’ ⟼ + S
    E ⟼ number | ( S )

        number      +         (          )       $ (EOF)
T       ⟼ S$                  ⟼ S$
S       ⟼ E S’                ⟼ E S’
S’                  ⟼ + S                ⟼ ε     ⟼ ε
E       ⟼ num.                ⟼ ( S )

• Note: it is convenient to add a special end-of-file token $ and a start symbol T (top-level) that requires $.
How do we construct the parse table?

• Consider a given production: A ⟼ γ

• Construct the set of all input tokens that may appear first in strings that can be derived from γ
– Add the production ⟼ γ to the entry (A, token) for each such token.

• If γ can derive ε (the empty string), then we construct the set of all input tokens that may follow the nonterminal A in the grammar.
– Add the production ⟼ γ to the entry (A, token) for each such token.

• Note: if there are two different productions for a given entry, the grammar is not LL(1)
Example
    T ⟼ S$
    S ⟼ E S’
    S’ ⟼ ε
    S’ ⟼ + S
    E ⟼ number | ( S )

• First(T) = First(S)
• First(S) = First(E)
• First(S’) = { + }
• First(E) = { number, ‘(‘ }

• Follow(S’) = Follow(S)
• Follow(S) = { $, ‘)’ } ∪ Follow(S’)

Note: we want the least solution to this system of set equations… a fixpoint computation. More on these later in the course.

        number      +         (          )       $ (EOF)
T       ⟼ S$                  ⟼ S$
S       ⟼ E S’                ⟼ E S’
S’                  ⟼ + S                ⟼ ε     ⟼ ε
E       ⟼ num.                ⟼ ( S )
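The set equations above can be solved by iterating until nothing changes. The following is a rough sketch in OCaml, not the course's implementation; the names `grammar`, `get`, `put`, `fix`, `nullable`, `first_of`, `first` and `follow` are all assumptions made for illustration, and sets are kept in simple association lists.

```ocaml
module SS = Set.Make (String)

type symbol = T of string | N of string

(* T ⟼ S$   S ⟼ E S'   S' ⟼ ε | + S   E ⟼ number | ( S ) *)
let grammar = [
  ("T",  [N "S"; T "$"]);
  ("S",  [N "E"; N "S'"]);
  ("S'", []);
  ("S'", [T "+"; N "S"]);
  ("E",  [T "number"]);
  ("E",  [T "("; N "S"; T ")"]);
]

(* Environments map nonterminals to token sets; missing keys mean "empty". *)
let get env x = try List.assoc x env with Not_found -> SS.empty
let put env x s = (x, s) :: List.remove_assoc x env

(* Apply [step] until the environment stops growing (the least fixpoint). *)
let rec fix step env =
  let env' = step env in
  if List.for_all (fun (x, s) -> SS.equal s (get env x)) env' then env'
  else fix step env'

(* True if every symbol of a right-hand side can derive ε. *)
let all_nullable nullable =
  List.for_all (function T _ -> false | N b -> SS.mem b nullable)

(* Nonterminals that can derive ε, also computed by iteration. *)
let nullable =
  let rec go set =
    let set' =
      List.fold_left
        (fun acc (a, rhs) -> if all_nullable acc rhs then SS.add a acc else acc)
        set grammar
    in
    if SS.equal set set' then set else go set'
  in
  go SS.empty

(* Tokens that can start a string of symbols, given First of nonterminals. *)
let rec first_of first = function
  | [] -> SS.empty
  | T t :: _ -> SS.singleton t
  | N b :: rest ->
    let s = get first b in
    if SS.mem b nullable then SS.union s (first_of first rest) else s

let first =
  fix (fun env ->
      List.fold_left
        (fun env (a, rhs) -> put env a (SS.union (get env a) (first_of env rhs)))
        env grammar)
    []

(* For A ⟼ α B β: Follow(B) ⊇ First(β), plus Follow(A) if β is nullable. *)
let follow =
  fix (fun env ->
      List.fold_left
        (fun env (a, rhs) ->
          let rec walk env = function
            | [] -> env
            | T _ :: rest -> walk env rest
            | N b :: rest ->
              let s = SS.union (get env b) (first_of first rest) in
              let s =
                if all_nullable nullable rest then SS.union s (get env a) else s
              in
              walk (put env b s) rest
          in
          walk env rhs)
        env grammar)
    []
```

With these definitions, `SS.elements (get follow "S'")` lists the tokens under which the table entry for S’ holds the ε production.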
Converting the table to code
• Define n mutually recursive functions
– one for each nonterminal A: parse_A
– Assuming the stream of tokens is globally available, the type of parse_A is unit -> ast,
if A is not an auxiliary nonterminal
– Parse functions for auxiliary nonterminals (e.g. S’) take extra ast’s as inputs, one for each
nonterminal in the “factored” prefix.

• Each function “peeks” at the lookahead token and then follows the production rule in the
corresponding entry.
– Consume terminal tokens from the input stream
– Call parse_X to create sub-tree for nonterminal X
– If the rule ends in an auxiliary nonterminal, call it with appropriate ast’s.
(The auxiliary rule is responsible for creating the ast after looking at more input.)
– Otherwise, this function builds the ast tree itself and returns it.
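The recipe above can be sketched for the left-factored sum grammar as follows. This is an illustrative sketch, not the contents of ll1_parser.ml: the `token` and `ast` types, the global `tokens` reference and the helper names `peek`, `advance` and `expect` are assumptions. Note how `parse_s'` takes the ast of the factored prefix E as an extra input.

```ocaml
(* Assumed token and ast types for the sum grammar. *)
type token = NUM of int | PLUS | LPAREN | RPAREN | EOF

type ast = Num of int | Add of ast * ast

exception Parse_error of string

(* The token stream is globally available, as assumed above. *)
let tokens : token list ref = ref []

let peek () = match !tokens with [] -> EOF | t :: _ -> t
let advance () = match !tokens with [] -> () | _ :: rest -> tokens := rest
let expect t =
  if peek () = t then advance () else raise (Parse_error "unexpected token")

(* T ⟼ S$ *)
let rec parse_t () : ast =
  match peek () with
  | NUM _ | LPAREN -> let s = parse_s () in expect EOF; s
  | _ -> raise (Parse_error "T: expected number or '('")

(* S ⟼ E S' *)
and parse_s () : ast =
  match peek () with
  | NUM _ | LPAREN -> let e = parse_e () in parse_s' e
  | _ -> raise (Parse_error "S: expected number or '('")

(* S' ⟼ + S | ε ; receives the ast already built for the prefix E *)
and parse_s' (e : ast) : ast =
  match peek () with
  | PLUS -> advance (); let s = parse_s () in Add (e, s)
  | RPAREN | EOF -> e
  | _ -> raise (Parse_error "S': expected '+', ')' or end of input")

(* E ⟼ number | ( S ) *)
and parse_e () : ast =
  match peek () with
  | NUM n -> advance (); Num n
  | LPAREN -> advance (); let s = parse_s () in expect RPAREN; s
  | _ -> raise (Parse_error "E: expected number or '('")

(* Entry point: install the token stream, then parse from T. *)
let parse (ts : token list) : ast =
  tokens := ts;
  parse_t ()
```

Each `match` mirrors one row of the parsing table: the patterns are the columns with an entry, and every other token is a parse error. Because the grammar is right-recursive, sums nest to the right.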
Demo: LL(1) Parsing

• https://github.com/cs4212/week-06-parsing
• ll1_parser.ml
• Hand-generated LL(1) code for the table below.

        number      +         (          )       $ (EOF)
T       ⟼ S$                  ⟼ S$
S       ⟼ E S’                ⟼ E S’
S’                  ⟼ + S                ⟼ ε     ⟼ ε
E       ⟼ num.                ⟼ ( S )
LL(1) Summary

• Top-down parsing that finds the leftmost derivation.

• Language Grammar ⇒ LL(1) grammar ⇒ prediction table ⇒ recursive-descent parser
• Great for simple hand-written implementation with fine-tuned error control (e.g., for editors)

• Problems:
– Grammar must be LL(1)
– Can extend to LL(k) (it just makes the table bigger)
– Grammar cannot be left recursive (parser functions will loop!)
– There are CF grammars that cannot be transformed to LL(k)

• Is there a better way?

LR Grammars
Bottom-up Parsing (LR Parsers)
• LR(k) parser:
– Left-to-right scanning
– Rightmost derivation
– k lookahead symbols

• LR grammars are more expressive than LL

– Can handle left-recursive (and right recursive) grammars; virtually all programming languages
– Easier to express programming language syntax (no left factoring)

• Technique: “Shift-Reduce” parsers

– Work bottom up instead of top down
– Construct right-most derivation of a program in the grammar
– Used by many parser generators (e.g. yacc, ocamlyacc, menhir, etc.)
– Better error detection/recovery
Top-down vs. Bottom up
• Consider the left-recursive grammar:
    S ⟼ S + E | E
    E ⟼ number | ( S )
• (1 + 2 + (3 + 4)) + 5

• What part of the tree must we know after scanning just “(1 + 2” ?
• In top-down, must be able to guess which productions to use…

Note: ‘(‘ has been scanned but not consumed. Processing it is still pending.

[Figure: two copies of the parse tree for (1 + 2 + (3 + 4)) + 5, labelled “Top-down” and “Bottom-up”.]
Progress of Bottom-up Parsing
    S ⟼ S + E | E
    E ⟼ number | ( S )

Input: (1 + 2 + (3 + 4)) + 5

Reductions (read from bottom to top: a rightmost derivation)
(1 + 2 + (3 + 4)) + 5 ⟻
(E + 2 + (3 + 4)) + 5 ⟻
(S + 2 + (3 + 4)) + 5 ⟻
(S + E + (3 + 4)) + 5 ⟻
(S + (3 + 4)) + 5 ⟻
(S + (E + 4)) + 5 ⟻
(S + (S + 4)) + 5 ⟻
(S + (S + E)) + 5 ⟻
(S + (S)) + 5 ⟻
(S + E) + 5 ⟻
(S) + 5 ⟻
E + 5 ⟻
S + 5 ⟻
S + E ⟻
S
Shift/Reduce Parsing
• Parser state:
– Stack of terminals and nonterminals.
– Unconsumed input is a string of terminals
– Current derivation step is stack + input
• Parsing is a sequence of shift and reduce operations:
• Shift: move look-ahead token to the stack
• Reduce: Replace symbols γ at top of stack with nonterminal X
such that X ⟼ γ is a production. (pop γ, push X)

Stack        Input                       Action

             (1 + 2 + (3 + 4)) + 5       shift (
(            1 + 2 + (3 + 4)) + 5        shift 1
(1           + 2 + (3 + 4)) + 5          reduce: E ⟼ number
(E           + 2 + (3 + 4)) + 5          reduce: S ⟼ E
(S           + 2 + (3 + 4)) + 5          shift +
(S +         2 + (3 + 4)) + 5            shift 2
(S + 2       + (3 + 4)) + 5              reduce: E ⟼ number
LR(0) Grammars

Simple LR parsing with no look-ahead.

LR Parser States

• Goal: know what set of reductions are legal at any given point.

• Idea: Summarise all possible stack prefixes α as a finite parser state.

– Parser state is computed by a DFA that reads the stack σ.
– Accept states of the DFA correspond to unique reductions that apply.

• Example: LR(0) parsing

– Left-to-right scanning, Right-most derivation, zero look-ahead tokens
– Too weak to handle many language grammars (e.g. the “sum” grammar)
– But, helpful for understanding how the shift-reduce parser works.
Example LR(0) Grammar: Tuples
• Example grammar for non-empty tuples and identifiers:

    S ⟼ ( L ) | id
    L ⟼ S | L , S

• Example strings:
– x
– (x,y)
– ((((x))))
– (x, (y, z), w)
– (x, (y, (z, w)))

[Figure: parse tree for (x, (y, z), w)]

Shift/Reduce Parsing
    S ⟼ ( L ) | id
    L ⟼ S | L , S
• Parser state:
– Stack of terminals and nonterminals.
– Unconsumed input is a string of terminals
– Current derivation step is stack + input
• Parsing is a sequence of shift and reduce operations:
• Shift: move look-ahead token to the stack: e.g.

Stack        Input               Action
             (x, (y, z), w)      shift (
(            x, (y, z), w)       shift x

• Reduce: Replace symbols γ at top of stack with nonterminal X such that X ⟼ γ is a production. (pop γ, push X): e.g.

Stack        Input               Action
(x           , (y, z), w)        reduce S ⟼ id
(S           , (y, z), w)        reduce L ⟼ S
Example Run
    S ⟼ ( L ) | id
    L ⟼ S | L , S

Stack          Input               Action
               (x, (y, z), w)      shift (
(              x, (y, z), w)       shift x
(x             , (y, z), w)        reduce S ⟼ id
(S             , (y, z), w)        reduce L ⟼ S
(L             , (y, z), w)        shift ,
(L,            (y, z), w)          shift (
(L, (          y, z), w)           shift y
(L, (y         , z), w)            reduce S ⟼ id
(L, (S         , z), w)            reduce L ⟼ S
(L, (L         , z), w)            shift ,
(L, (L,        z), w)              shift z
(L, (L, z      ), w)               reduce S ⟼ id
(L, (L, S      ), w)               reduce L ⟼ L, S
(L, (L         ), w)               shift )
(L, (L)        , w)                reduce S ⟼ ( L )
(L, S          , w)                reduce L ⟼ L, S
(L             , w)                shift ,
(L,            w)                  shift w
(L, w          )                   reduce S ⟼ id
(L, S          )                   reduce L ⟼ L, S
(L             )                   shift )
(L)                                reduce S ⟼ ( L )
S
Action Selection Problem
• Given a stack σ and a look-ahead symbol b, should the parser:
– Shift b onto the stack (new stack is σb)
– Reduce a production X ⟼ γ, assuming that σ = αγ (new stack is αX)?

• Sometimes the parser can reduce but shouldn’t

– For example, X ⟼ ε can always be reduced
– Sometimes the stack can be reduced in different ways (reduce/reduce conflict)

• Main idea: decide what to do based on a prefix α of the stack plus the look-ahead symbol.
– The prefix α is different for different possible reductions
since in productions X ⟼ γ and Y ⟼ β, γ and β might have different lengths.

• Main goal: know what set of reductions are legal at any point.
– How do we keep track?
LR(0) States
• An LR(0) state is a set of items keeping track of progress on possible upcoming reductions.
• An LR(0) item is a production from the language with an extra separator “.” somewhere in the right-hand-side

    S ⟼ ( L ) | id
    L ⟼ S | L , S

• Example items: S ⟼ .( L ) or S ⟼ (. L) or L ⟼ S.
• Intuition:
– Stuff before the ‘.’ is already on the stack (beginnings of possible γ’s to be reduced)
– Stuff after the ‘.’ is what might be seen next
– The prefixes α are represented by the state itself
Constructing the DFA: Start state & Closure
• Idea of the Closure: productions that can be applicable with the already observed stack

• First step: Add a new production S’ ⟼ S$ to the grammar

    S’ ⟼ S$
    S ⟼ ( L ) | id
    L ⟼ S | L , S

• Start state of the DFA = empty stack, so it contains the item:
    S’ ⟼ .S$
• Closure of a state:
– Adds items for all productions whose LHS nonterminal occurs in an item
in the state just after the ‘.’
– The added items have the ‘.’ located at the beginning (no symbols for
those items have been added to the stack yet)
– Note that newly added items may cause yet more items to be added to the
state… keep iterating until a fixed point is reached.
• Example: CLOSURE({S’ ⟼ .S$}) = {S’ ⟼ .S$, S ⟼ .(L), S⟼.id}

• Resulting “closed state” contains the set of all possible productions

that might be reduced next.
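The closure iteration described above can be sketched in OCaml as follows. The representation is an assumption chosen for illustration: an item is a production plus the position of the ‘.’, and the names `item`, `next_symbol`, `closure` and `start_state` are hypothetical.

```ocaml
type symbol = T of string | N of string

(* An item is a production with the '.' placed after [dot] RHS symbols. *)
type item = { lhs : string; rhs : symbol list; dot : int }

let grammar = [
  ("S'", [N "S"; T "$"]);
  ("S",  [T "("; N "L"; T ")"]);
  ("S",  [T "id"]);
  ("L",  [N "S"]);
  ("L",  [N "L"; T ","; N "S"]);
]

(* The symbol just after the '.', if there is one. *)
let next_symbol (it : item) : symbol option = List.nth_opt it.rhs it.dot

(* Keep adding X ⟼ .γ for every nonterminal X found right after a '.',
   until no new item appears (a fixed point). *)
let rec closure (items : item list) : item list =
  let items = List.sort_uniq compare items in
  let added =
    List.concat_map
      (fun it ->
        match next_symbol it with
        | Some (N x) ->
          List.filter_map
            (fun (l, r) ->
              if l = x then Some { lhs = l; rhs = r; dot = 0 } else None)
            grammar
        | _ -> [])
      items
  in
  let items' = List.sort_uniq compare (items @ added) in
  if items' = items then items else closure items'

(* CLOSURE({S' ⟼ .S$}) *)
let start_state = closure [ { lhs = "S'"; rhs = [N "S"; T "$"]; dot = 0 } ]
```

Here `start_state` contains the three items of the example: S’ ⟼ .S$, S ⟼ .( L ) and S ⟼ .id.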
Example: Constructing the DFA
    S’ ⟼ S$
    S ⟼ ( L ) | id
    L ⟼ S | L , S

• First, we construct a state with the initial item S’ ⟼ .S$

• Next, we take the closure of that state:

CLOSURE({S’ ⟼ .S$}) = {S’ ⟼ .S$, S ⟼ .( L ), S ⟼ .id}

• In the set of items, the nonterminal S appears after the ‘.’

• So we add items for each S production in the grammar
Example: Constructing the DFA
• Next we add the transitions:
• First, we see what terminals and nonterminals can appear after the ‘.’ in the source state.
– Outgoing edges have those labels.
• The target state (initially) includes all items from the source state that have the edge-label symbol after the ‘.’, but we advance the ‘.’ (to simulate shifting the item onto the stack)

    From {S’ ⟼ .S$, S ⟼ .( L ), S ⟼ .id}:
      on id:  {S ⟼ id.}
      on (:   {S ⟼ (. L )}
      on S:   {S’ ⟼ S.$}

• Finally, for each new state, we take the closure.
• Note that we have to perform two iterations to compute CLOSURE({S ⟼ ( . L )})
– First iteration adds L ⟼ .S and L ⟼ .L, S
– Second iteration adds S ⟼ .(L) and S ⟼ .id

    CLOSURE({S ⟼ (. L )}) = {S ⟼ (. L ), L ⟼ .S, L ⟼ .L, S, S ⟼ .( L ), S ⟼ .id}
Full DFA for the Example
States:
    1: S’ ⟼ .S$,  S ⟼ .( L ),  S ⟼ .id
    2: S ⟼ id.
    3: S ⟼ (. L ),  L ⟼ .S,  L ⟼ .L, S,  S ⟼ .( L ),  S ⟼ .id
    4: S’ ⟼ S.$
    5: S ⟼ ( L .),  L ⟼ L . , S
    6: S ⟼ ( L ).
    7: L ⟼ S.
    8: L ⟼ L, . S,  S ⟼ .( L ),  S ⟼ .id
    9: L ⟼ L, S.
Transitions:
    1 on id: 2     1 on (: 3     1 on S: 4
    3 on id: 2     3 on (: 3     3 on S: 7     3 on L: 5
    5 on ): 6      5 on ,: 8
    8 on id: 2     8 on (: 3     8 on S: 9
    4 on $: Done!

• Current state: run the DFA on the stack.
• If a reduce state is reached, reduce
• Otherwise, if the next token matches an outgoing edge, shift.
• If no such transition, it is a parse error.

Reduce state: ‘.’ at the end of the production
Using the DFA

• Run the parser stack through the DFA.

• The resulting state tells us which productions might be reduced next.
– If not in a reduce state, then shift the next symbol and transition according to DFA.
– If in a reduce state, X ⟼ γ with stack αγ, pop γ and push X.

• Optimisation: No need to re-run the DFA from beginning every step

– Store the state with each symbol on the stack: e.g. 1(3(3L5)6
– On a reduction X ⟼ γ, pop stack to reveal the state too:
e.g. From stack 1(3(3L5)6 reduce S ⟼ ( L ) to reach stack 1(3
– Next, push the reduction symbol: e.g. to reach stack 1(3S
– Then take just one step in the DFA to find next state: 1(3S7
Implementing the Parsing Table
• Represent the parser automaton as a table of shape:
state * (terminals + nonterminals)
• Entries for the “action table” specify two kinds of actions:
– Shift and goto state n
– Reduce using reduction X ⟼ γ
• First pop γ off the stack to reveal the state
• Look up X in the “goto table” and goto that state

            Terminal Symbols      Nonterminal Symbols
State       Action table          Goto table
Example Parse Table
     (          )          id         ,          $          S     L
1    s3                    s2                               g4
2    S⟼id       S⟼id       S⟼id       S⟼id       S⟼id
3    s3                    s2                               g7    g5
4                                                DONE
5               s6                    s8
6    S⟼(L)      S⟼(L)      S⟼(L)      S⟼(L)      S⟼(L)
7    L⟼S        L⟼S        L⟼S        L⟼S        L⟼S
8    s3                    s2                               g9
9    L⟼L,S      L⟼L,S      L⟼L,S      L⟼L,S      L⟼L,S

sx = shift and goto state x

gx = goto state x
Example
• Parse the token stream: (x, (y, z), w)$

Stack Stream Action (according to table)

ε1 (x, (y, z), w)$ s3
ε1(3 x, (y, z), w)$ s2
ε1(3x2 , (y, z), w)$ Reduce: S⟼id
ε1(3S , (y, z), w)$ g7 (from state 3 follow S)
ε1(3S7 , (y, z), w)$ Reduce: L⟼S
ε1(3L , (y, z), w)$ g5 (from state 3 follow L)
ε1(3L5 , (y, z), w)$ s8
ε1(3L5,8 (y, z), w)$ s3
ε1(3L5,8(3 y, z), w)$ s2
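The run above can be reproduced with a small table-driven driver. The following is a sketch under stated assumptions, not a generated parser: the type and function names (`token`, `nonterm`, `action`, `goto`, `parse`) are hypothetical, the entries are transcribed from the example parse table, and the stack stores only states, since the symbol pushed with each state is implied by it.

```ocaml
type token = LP | RP | ID | COMMA | EOF
type nonterm = S | L

type action =
  | Shift of int              (* sN: shift and go to state N *)
  | Reduce of nonterm * int   (* LHS and number of RHS symbols to pop *)
  | Done
  | Fail                      (* empty table entry: parse error *)

(* Action table, one case per non-empty entry. *)
let action (state : int) (tok : token) : action =
  match state, tok with
  | (1 | 3 | 8), LP -> Shift 3
  | (1 | 3 | 8), ID -> Shift 2
  | 2, _ -> Reduce (S, 1)      (* S ⟼ id *)
  | 4, EOF -> Done
  | 5, RP -> Shift 6
  | 5, COMMA -> Shift 8
  | 6, _ -> Reduce (S, 3)      (* S ⟼ ( L ) *)
  | 7, _ -> Reduce (L, 1)      (* L ⟼ S *)
  | 9, _ -> Reduce (L, 3)      (* L ⟼ L , S *)
  | _ -> Fail

(* Goto table. *)
let goto (state : int) (nt : nonterm) : int =
  match state, nt with
  | 1, S -> 4
  | 3, S -> 7
  | 3, L -> 5
  | 8, S -> 9
  | _ -> failwith "no goto entry"

(* Drop the first n elements of a list. *)
let rec drop n l =
  if n = 0 then l else match l with [] -> [] | _ :: t -> drop (n - 1) t

(* Returns true iff the token stream is accepted. Top of stack is the head. *)
let parse (input : token list) : bool =
  let rec loop stack input =
    let tok = match input with [] -> EOF | t :: _ -> t in
    match action (List.hd stack) tok with
    | Shift s ->
      loop (s :: stack) (match input with [] -> [] | _ :: rest -> rest)
    | Reduce (x, n) ->
      let stack = drop n stack in          (* pop γ to reveal a state *)
      loop (goto (List.hd stack) x :: stack) input
    | Done -> true
    | Fail -> false
  in
  loop [1] input
```

For example, the token stream for (x, (y, z), w)$ is `[LP; ID; COMMA; LP; ID; COMMA; ID; RP; COMMA; ID; RP; EOF]`.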
LR(0) Limitations

• An LR(0) machine only works if states with reduce actions have a single reduce action.
– In such states, the machine always reduces (ignoring lookahead)

• With more complex grammars, the DFA construction will yield states with shift/reduce
and reduce/reduce conflicts:

    OK                 shift/reduce         reduce/reduce
    S ⟼ ( L ).         S ⟼ ( L ).           S ⟼ L , S.
                       L ⟼ .L , S           S ⟼ , S.

• Such conflicts can often be resolved by using a look-ahead symbol: LR(1)

Examples

• Consider the left associative and right associative “sum” grammars:

    left                        right
    S ⟼ S + E | E               S ⟼ E + S | E
    E ⟼ number | ( S )          E ⟼ number | ( S )

• One is LR(0) the other isn’t… which is which and why?

• What kind of conflict do you get? Shift/reduce or Reduce/reduce?

• Ambiguities in associativity/precedence usually lead to shift/reduce conflicts.

Classification of Grammars

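[Figure: nested classes of grammars. LR(0) lies inside SLR, which lies inside LALR(1), which lies inside LR(1); LL(1) is drawn inside LR(1), overlapping the other classes.]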
Parsing in OCaml via Menhir
Practical Issues

• https://github.com/cs4212/week-07-more-parsing

• Dealing with source file location information

– In the lexer and parser
– In the abstract syntax

– See range.ml, ast.ml

– Check the parse tree (printing via driver.ml)

• Lexing comments / strings

Menhir output

• You can get verbose parser debugging information by doing:

– menhir --explain …
– or, if using ocamlbuild:
ocamlbuild -use-menhir -yaccflag --explain …

• The result is a <parsername>.conflicts file that contains a description of the conflicts

– The parser items of each state use the ‘.’ just as described above

• The flag --dump generates a full description of the automaton

• Example: see start_parser.mly

Shift/Reduce conflicts

• Conflict 1:
• Operator precedence (State 13)
• Resolving by changing the grammar (see good_parser.ml)

• Conflict 2:
• Parsing if-then-else statements
From Menhir Manual http://gallium.inria.fr/~fpottier/menhir/manual.pdf

5.3 Inlining
It is well-known that the following grammar of arithmetic expressions does not work as expected: that is, in
spite of the priority declarations, it has shift/reduce conflicts.
%token < int > INT
%token PLUS TIMES
%left PLUS
%left TIMES

expression:
| i = INT { i }
| e = expression; o = op; f = expression { o e f }
op:
| PLUS { ( + ) }
| TIMES { ( * ) }
The trouble is, the precedence level of the production expression → expression op expression is undefined, and
there is no sensible way of defining it via a %prec declaration, since the desired level really depends upon the
symbol that was recognized by op: was it PLUS or TIMES?
The standard workaround is to abandon the definition of op as a separate nonterminal symbol, and to inline
its definition into the definition of expression, like this:
expression:
| i = INT { i }
| e = expression; PLUS; f = expression { e + f }
| e = expression; TIMES; f = expression { e * f }
This avoids the shift/reduce conflict, but gives up some of the original specification’s structure, which,
in realistic situations, can be damageable. Fortunately, Menhir offers a way of avoiding the conflict without
manually transforming the grammar, by declaring that the nonterminal symbol op should be inlined:
expression:
| i = INT { i }
| e = expression; o = op; f = expression { o e f }
%inline op:
| PLUS { ( + ) }
| TIMES { ( * ) }
The %inline keyword causes all references to op to be replaced with its definition. In this example, the definition
of op involves two productions, one that develops to PLUS and one that expands to TIMES, so every production
that refers to op is effectively turned into two productions, one that refers to PLUS and one that refers to TIMES.
After inlining, op disappears and expression has three productions: that is, the result of inlining is exactly the
manual workaround shown above.
In some situations, inlining can also help recover a slight efficiency margin. For instance, the definition:
Precedence and Associativity Declarations

• Parser generators, like menhir, often support precedence and associativity declarations.
– Hints to the parser about how to resolve conflicts.
– See: good-parser.mly

• Pros:
– Avoids having to manually resolve those ambiguities by manually introducing extra nonterminals
(see parser.mly)
– Easier to maintain the grammar

• Cons:
– Can’t as easily re-use the same terminal (if associativity differs)
– Introduces another level of debugging

• Limits:
– Not always easy to disambiguate the grammar based on just precedence and associativity.
Conflict 2: Ambiguity in Real Languages

• Consider this grammar:

    S ⟼ if (E) S
    S ⟼ if (E) S else S
    S ⟼ X = E
    E ⟼ …

• Is this grammar OK?

• Consider how to parse:

    if (E1) if (E2) S1 else S2

• This is known as the “dangling else” problem.
• What should the “right” answer be?
• How do we change the grammar?
How to Disambiguate if-then-else
• Want to rule out:

if (E1) if (E2) S1 else S2

• Observation: An un-matched ‘if’ should not appear as the ‘then’ clause of a containing ‘if’.

S ⟼ M | U // M = “matched”, U = “unmatched”
U ⟼ if (E) S // Unmatched ‘if ’
U ⟼ if (E) M else U // Nested if is matched
M ⟼ if (E) M else M // Matched ‘if ’
M⟼X=E // Other statements

• See: else-resolved-parser.mly
Alternative: Use { }
• Ambiguity arises because the ‘then’ branch is not well bracketed:

if (E1) { if (E2) { S1 } } else S2 // unambiguous

if (E1) { if (E2) { S1 } else S2 } // unambiguous

• So: could just require brackets

– But requiring them for the else clause too leads to ugly code for chained if-statements:

    if (c1) {
      …
    } else {
      if (c2) {
      } else {
        if (c3) {
        } else {
        }
      }
    }

How about a compromise? Allow unbracketed else block only if the body is ‘if ’:

    if (c1) {
    } else if (c2) {
    } else if (c3) {
    } else {
    }

Benefits:
• Less ambiguous
• Easy to parse
• Enforces good style
HW4: Oat v.1
Oat
• Simple C-like Imperative Language
– supports 64-bit integers, arrays, strings
– top-level, mutually recursive procedures
– scoped local, imperative variables

• See examples in hw4programs folder

• How to design/specify such a language?

From Everand
Lisp Interpreter in Rust
Vishal Patil
1/5 (1)
Introduction to PHP, Part 2, Second Edition
From Everand
Introduction to PHP, Part 2, Second Edition
Adam Majczak
No ratings yet
Learn C++
From Everand
Learn C++
Durgesh
4.5/5 (9)

PLDI Week 07 More Parsing

Uploaded by

PLDI Week 07 More Parsing

Uploaded by

CS4212: Compiler Design

Note: Once again we have to take

• The definition is recursive – S mentions itself.

• You can replace the “nonterminal” S by one of its definitions anywhere

* And, since we’re writing this description in English, we are careful

• Example: The balanced parentheses language:

• How many terminals? How many nonterminals? Productions?
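
For the balanced-parentheses grammar S ⟼ (S)S and S ⟼ ε given earlier, there are two terminals, one nonterminal, and two productions. As a rough illustration of how such a grammar can be turned into a recognizer, here is a minimal Python sketch with one function per nonterminal. The function names `parse_balanced` and `parse_s` are illustrative choices, not taken from the slides.

```python
def parse_balanced(text: str) -> bool:
    """Return True iff text is derivable from S in  S -> ( S ) S | epsilon."""
    pos = 0  # index of the next unread character (the look-ahead)

    def parse_s() -> bool:
        nonlocal pos
        # Look-ahead '(' selects S -> ( S ) S; anything else selects S -> epsilon.
        if pos < len(text) and text[pos] == "(":
            pos += 1                      # consume '('
            if not parse_s():             # inner S
                return False
            if pos >= len(text) or text[pos] != ")":
                return False              # expected ')'
            pos += 1                      # consume ')'
            return parse_s()              # trailing S
        return True                       # S -> epsilon consumes nothing

    # Accept only if S matched and the whole input was consumed.
    return parse_s() and pos == len(text)
```

Calling `parse_balanced("(()())()")` accepts, while `parse_balanced("(()")` rejects because the closing parenthesis is missing. One character of look-ahead is enough to choose between the two productions here.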

Searching for derivations

Partly-derived String Look-ahead Parsed/Unparsed Input

(1) S ⟼ E ⟼ (S) ⟼ (E) ⟼ (1)

• Is it LL(k) for some k?

• What can we do?

• Solution: “Left-factor” the grammar. There is a common S prefix for each

• Also need to eliminate left-recursion. Why?

Partly-derived String Look-ahead Parsed/Unparsed Input

• Given an LL(1) grammar:

• Note: it is convenient to add a special end-of-file token $

• Consider a given production: A ⟼ γ

• If γ can derive ε (the empty string), then we construct the set
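
Building a predictive parsing table usually relies on two families of sets, commonly called FIRST and FOLLOW. The sketch below shows the standard textbook fixed-point computation of both, applied to the balanced-parentheses grammar. The dictionary encoding of the grammar, the function name `first_follow`, and the constants `EPS` and `END` are assumptions made for this illustration, not names from the slides.

```python
EPS = "ε"   # marker for the empty string
END = "$"   # end-of-file token

def first_follow(grammar, start):
    """grammar maps each nonterminal to a list of right-hand sides (tuples of symbols)."""
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)

    def first_of(seq):
        """FIRST of a symbol sequence; contains EPS iff the whole sequence can derive ε."""
        out = set()
        for sym in seq:
            if sym not in grammar:            # terminal: it starts the string
                out.add(sym)
                return out
            out |= first[sym] - {EPS}
            if EPS not in first[sym]:         # sym cannot vanish, stop here
                return out
        out.add(EPS)                          # every symbol could vanish
        return out

    changed = True
    while changed:                            # iterate until nothing grows
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                new = first_of(rhs)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym in grammar:        # nonterminal inside the right-hand side
                        tail = first_of(rhs[i + 1:])
                        add = tail - {EPS}
                        if EPS in tail:       # tail can vanish: inherit FOLLOW(nt)
                            add |= follow[nt]
                        if not add <= follow[sym]:
                            follow[sym] |= add
                            changed = True
    return first, follow

# S -> ( S ) S | ε
BALANCED = {"S": [("(", "S", ")", "S"), ()]}
first, follow = first_follow(BALANCED, "S")
```

For this grammar the result is FIRST(S) = {(, ε} and FOLLOW(S) = {), $}. The parser therefore picks S ⟼ (S)S on look-ahead `(` and S ⟼ ε on look-ahead `)` or `$`, so no table entry holds more than one production.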

• Top-down parsing that finds the leftmost derivation.

• Is there a better way?

• LR grammars are more expressive than LL

• Technique: “Shift-Reduce” parsers

• What part of the tree must we

Stack Input Action

Simple LR parsing with no look-ahead.

• Idea: Summarise all possible stack prefixes α as a finite parser state.

• Example: LR(0) parsing


• Sometimes the parser can reduce but shouldn’t

• First step: Add a new production

• Resulting “closed state” contains the set of all possible productions

• First, we construct a state with the initial item S’ ⟼ .S$

• Next, we take the closure of that state:

• In the set of items, the nonterminal S appears after the ‘.’

• Finally, for each new state, we take the closure.

S’ ⟼ S.$    L ⟼ S.    S ⟼ ( L ).

• If no such transition,

Done! Reduce state: ‘.’ at the

• Run the parser stack through the DFA.

• Optimisation: No need to re-run the DFA from beginning every step

Terminal Symbols Nonterminal Symbols

sx = shift and goto state x

Stack Stream Action (according to table)

• Such conflicts can often be resolved by using a look-ahead symbol: LR(1)

• Consider the left associative and right associative “sum” grammars:

• One is LR(0) the other isn’t… which is which and why?

• Ambiguities in associativity/precedence usually lead to shift/reduce conflicts.

• Dealing with source file location information

– See range.ml, ast.ml

• Lexing comments / strings

• You can get verbose parser debugging information by doing:

• The result is a <parsername>.conflicts file that contains a description of the error

• The flag --dump generates a full description of the automaton

• Example: see start_parser.mly

• Consider this grammar:

S ⟼ if (E) S

• Consider how to parse:

if (E1) if (E2) S1 else S2

if (E1) if (E2) S1 else S2

if (E1) { if (E2) { S1 } } else S2 // unambiguous

• So: could just require brackets

How about a compromise? Allow unbracketed else
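
One common way to resolve the ambiguity, which may or may not be the exact compromise intended here, is to attach each else to the nearest unmatched if. The Python sketch below does this greedily. It assumes a simplified token list in which each condition and each basic statement is a single token and parentheses are omitted. The name `parse_stmt` and the tuple-shaped trees are illustrative assumptions.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos).

    An if-statement becomes ("if", cond, then_branch, else_branch_or_None);
    any other token is treated as an atomic statement.
    """
    if tokens[pos] == "if":
        cond = tokens[pos + 1]                       # single-token condition
        then_branch, pos = parse_stmt(tokens, pos + 2)
        # Greedy choice: if an 'else' follows, it belongs to this innermost 'if'.
        if pos < len(tokens) and tokens[pos] == "else":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch, None), pos
    return tokens[pos], pos + 1                      # atomic statement

tree, end = parse_stmt(["if", "E1", "if", "E2", "S1", "else", "S2"])
```

Here `tree` is `("if", "E1", ("if", "E2", "S1", "S2"), None)`, so the else is paired with the inner if. In a shift-reduce parser the same outcome corresponds to preferring shift over reduce when the look-ahead is else.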

• See examples in hw4programs folder

• How to design/specify such a language?
