Chapter 6-1 Note
Bottom-Up Parsing (Shift-Reduce)
陳奇業, Department of Computer Science and Information Engineering, National Cheng Kung University
Overview
We study bottom-up (also called LR) parsers, whose operation can be
compared with top-down parsers as follows:
A bottom-up parser begins with the parse tree’s leaves and moves toward its root.
A top-down parser moves from the parse tree’s root toward its leaves.
A bottom-up parser traces a rightmost derivation in reverse. A top-down parser
traces a leftmost derivation.
A bottom-up parser uses a grammar rule to replace the rule’s right-hand side (RHS)
with its left-hand side (LHS). A top-down parser does the opposite, replacing a rule’s
LHS with its RHS.
E ⟹ T ⟹ T * F ⟹ T * id ⟹ F * id ⟹ id * id
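An Example
Grammar: S → a A B e, A → A b c | b, B → d
Input: a b b c d e
Rightmost derivation: S ⟹ a A B e ⟹ a A d e ⟹ a A b c d e ⟹ a b b c d e
LR parsing (rightmost derivation in reverse): a b b c d e, a A b c d e, a A d e, a A B e, S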
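[Figure: parse tree built by LR parsing. The root S has children a, A, B, e; that A has children A, b, c, and the inner A derives b; B derives d.]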
Overview
The style of parsing considered in this chapter is known by the following
names:
Bottom-up, because the parser works its way from the terminal symbols to the
grammar’s goal symbol
Shift-reduce, because the two most prevalent actions taken by the parser are to
shift symbols onto the parse stack and to reduce a string of such symbols located
at the top-of-stack to one of the grammar’s non-terminals
LR(k), because such parsers scan the input from the left (the “L” in LR) producing a
rightmost derivation (the “R” in LR) in reverse, using k symbols of lookahead
Handle Pruning
Bottom-up parsing during a left-to-right scan of the input constructs a
rightmost derivation in reverse.
Informally, a "handle" is a substring that matches the body of a production,
and whose reduction represents one step along the reverse of a rightmost
derivation.
Given a sentential form, the handle is defined as the sequence of symbols
that will next be replaced by reduction.
Handle Pruning
Formally, if S ⟹* αAw ⟹ αβw by rightmost derivation steps, then production A → β in the position following α is a handle of αβw. Notice
that the string w to the right of the handle must contain only terminal symbols.
For convenience, we refer to the body β rather than A → β as a handle.
Note we say "a handle" rather than "the handle," because the grammar
could be ambiguous, with more than one rightmost derivation of αβw.
If a grammar is unambiguous, then every right-sentential form of the
grammar has exactly one handle.
Handle Pruning
A rightmost derivation in reverse can be obtained by "handle pruning." That
is, we start with a string of terminals w to be parsed. If w is a sentence of the
grammar at hand, then let w = γ_n, where γ_n is the nth right-sentential form of some as
yet unknown rightmost derivation
S = γ_0 ⟹ γ_1 ⟹ γ_2 ⟹ ... ⟹ γ_{n-1} ⟹ γ_n = w
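Handle pruning can be sketched in a few lines. The example below assumes the usual expression grammar (E → E + T | T, T → T * F | F, F → ( E ) | id), and the handle positions are supplied by hand; the helper name `prune` is hypothetical. It only illustrates that pruning handles retraces a rightmost derivation backwards, not how a parser finds the handles.

```python
def prune(form, start, head, body):
    """Replace the handle `body`, found at index `start` of `form`, by `head`."""
    end = start + len(body)
    if form[start:end] != body:
        raise ValueError("handle does not match the production body")
    return form[:start] + [head] + form[end:]


# (position of handle, production head, production body), chosen by hand.
STEPS = [
    (0, "F", ["id"]),
    (0, "T", ["F"]),
    (2, "F", ["id"]),
    (0, "T", ["T", "*", "F"]),
    (0, "E", ["T"]),
]

forms = [["id", "*", "id"]]
for start, head, body in STEPS:
    forms.append(prune(forms[-1], start, head, body))

# Reading the forms backwards gives a rightmost derivation of id * id.
derivation = [" ".join(f) for f in reversed(forms)]
print(" => ".join(derivation))
```

Each intermediate list is a right-sentential form, so the printed chain starts at E and ends at id * id.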
Shift-Reduce Parsing
There are four actions a parser can make:
Shift. Shift the next input symbol onto the top of the stack.
Reduce. The right end of the string to be reduced must be at the top of the stack.
Locate the left end of the string within the stack and decide with what nonterminal
to replace the string.
Accept. Announce successful completion of parsing.
Error. Discover a syntax error and call an error recovery routine.
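The four actions can be seen at work by replaying a hand-written action list on a stack and an input buffer. This is a minimal sketch, assuming the expression grammar T → T * F | F, F → id, E → T and the input id * id; the function name `replay` and the script are illustrative, and a real parser would choose the actions from a table instead of being told them.

```python
def replay(tokens, script):
    """Replay a hand-written list of actions and record each configuration."""
    stack, rest = ["$"], list(tokens) + ["$"]
    trace = []
    for action in script:
        if action in ("shift", "accept"):
            label = action
        else:
            head, body = action
            label = f"reduce by {head} -> {' '.join(body)}"
        trace.append((" ".join(stack), " ".join(rest), label))

        if action == "shift":
            if rest == ["$"]:
                raise SyntaxError("error: nothing left to shift")
            stack.append(rest.pop(0))
        elif action == "accept":
            if len(stack) != 2 or rest != ["$"]:
                raise SyntaxError("error: accept in a non-final configuration")
            return trace
        else:
            head, body = action  # bodies are assumed non-empty here
            if stack[-len(body):] != list(body):
                raise SyntaxError("error: handle is not on top of the stack")
            del stack[-len(body):]
            stack.append(head)
    raise SyntaxError("error: script ended before accept")


SCRIPT = [
    "shift", ("F", ["id"]), ("T", ["F"]), "shift", "shift",
    ("F", ["id"]), ("T", ["T", "*", "F"]), ("E", ["T"]), "accept",
]
trace = replay(["id", "*", "id"], SCRIPT)
for row in trace:
    print("%-12s %-12s %s" % row)
```

Every reduction in the trace finds its handle at the top of the stack, which is what makes a plain stack sufficient.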
Stack Implementation of Bottom-Up Parsing
There is an important fact that justifies the use of a stack in shift-reduce
parsing: the handle will always eventually appear on top of the stack, never
inside.
Example (from the Dragon Book)
Shift-Reduce Parsing
The use of a stack in shift-reduce parsing is justified by an important fact:
the handle will always eventually appear on top of the stack, never inside.
This fact can be shown by considering the possible forms of two successive
steps in any rightmost derivation. In case (1), A is replaced by βBy, and then the
rightmost nonterminal B in the body βBy is replaced by γ. In case (2), A is again
expanded first, but this time the body is a string y of terminals only. The next
rightmost nonterminal B will be somewhere to the left of y.
Shift-Reduce Parsing
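In other words (all derivation steps are rightmost):
(1) S ⟹* αAz ⟹ αβByz ⟹ αβγyz
(2) S ⟹* αBxAz ⟹ αBxyz ⟹ αγxyz

Consider case (1) in reverse
STACK     INPUT    ACTION
$αβγ      yz$      reduce by B → γ
$αβB      yz$      shift
$αβBy     z$       reduce by A → βBy

Consider case (2)
STACK     INPUT    ACTION
$αγ       xyz$     reduce by B → γ
$αB       xyz$     shift
$αBx      yz$      shift
$αBxy     z$       reduce by A → y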
LR Parsers
Advantages:
LR parsers can be constructed to recognize virtually all programming-language constructs for which
context-free grammars can be written.
The LR-parsing method is the most general nonbacktracking shift-reduce parsing method
known, yet it can be implemented as efficiently as other, more primitive shift-reduce
methods.
The class of grammars that can be parsed by LR parsers is a proper superset of the class
of grammars that can be parsed by predictive parsers.
LR parsers can detect a syntax error as soon as it is possible to do so on a left-to-right scan of the input.
Drawbacks:
It is too much work to construct an LR parser by hand for a typical programming-language grammar, so a parser generator is normally used.
LR Parsing Engine
Structure of the LR Parsing Table
The parsing table consists of two parts: a parsing-action function ACTION and
a goto function GOTO.
1. The ACTION function takes as arguments a state i and a terminal a (or $, the input
endmarker). The value of ACTION[i, a] can have one of four forms:
(a) Shift j, where j is a state. The action taken by the parser effectively shifts input a to the stack,
but uses state j to represent a.
(b) Reduce A → β. The action of the parser effectively reduces β on the top of the stack to head A.
(c) Accept. The parser accepts the input and finishes parsing.
(d) Error. The parser discovers an error in its input and takes some corrective action.
2. We extend the GOTO function, defined on sets of items, to states: if GOTO[I_i, A] = I_j, then
GOTO also maps a state i and a nonterminal A to state j.
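The way ACTION and GOTO drive the parser can be sketched as follows. The toy grammar (S → ( S ) | x, augmented with S' → S), its state numbering, and the table entries are assumptions made for illustration and were worked out by hand; they are not taken from the slides. Shift entries push a state, reduce entries pop one state per body symbol and then consult GOTO.

```python
# Productions of a hypothetical toy grammar, numbered for the reduce entries.
PRODUCTIONS = {1: ("S", ["(", "S", ")"]), 2: ("S", ["x"])}

# ACTION[state, terminal]: ("s", j) = shift and go to state j,
# ("r", n) = reduce by production n, ("acc",) = accept.
# A missing entry means error.
ACTION = {
    (0, "("): ("s", 2), (0, "x"): ("s", 3),
    (1, "$"): ("acc",),
    (2, "("): ("s", 2), (2, "x"): ("s", 3),
    (3, ")"): ("r", 2), (3, "$"): ("r", 2),
    (4, ")"): ("s", 5),
    (5, ")"): ("r", 1), (5, "$"): ("r", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def lr_parse(tokens):
    """Return the production numbers used, in order of reduction."""
    stack = [0]                      # stack of states, start state at the bottom
    rest = list(tokens) + ["$"]      # input followed by the endmarker
    pos, output = 0, []
    while True:
        entry = ACTION.get((stack[-1], rest[pos]))
        if entry is None:
            raise SyntaxError(f"unexpected {rest[pos]!r} in state {stack[-1]}")
        if entry[0] == "s":
            stack.append(entry[1])
            pos += 1
        elif entry[0] == "r":
            head, body = PRODUCTIONS[entry[1]]
            del stack[len(stack) - len(body):]       # pop one state per body symbol
            stack.append(GOTO[stack[-1], head])      # then follow GOTO on the head
            output.append(entry[1])
        else:
            return output


print(lr_parse("( ( x ) )".split()))
```

The stack holds states only; the grammar symbols are implicit, since each state is entered on exactly one symbol.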
LR(k) Parsing
As is the case with LL parsers, LR parsers are parameterized by the number of
lookahead symbols that are consulted to determine the appropriate parser action.
An LR(k) parser can peek at the next k tokens.
This notion of “peeking” and the term LR(0) are confusing, because even an LR(0)
parser must refer to the next input token, for the purpose of indexing the parse
table to determine the appropriate action. The “0” in LR(0) refers not to the
lookahead at parse time, but rather to the lookahead used in constructing the
parse table.
At parse-time, LR(0) and LR(1) parsers index the parse table using one token of
lookahead; for k ≥ 2, an LR(k) parser uses k tokens of lookahead.
LR(k) Parsing
The number of columns in an LR(k) parse table grows dramatically with k.
For example, an LR(3) parse table is indexed by the parse state to select a
row, and by the next 3 input tokens to select a column.
If the terminal alphabet has n symbols, then the number of distinct three-token
sequences is n^3. More generally, an LR(k) table has n^k columns for a token
alphabet of size n.
To keep the size of parse tables within reason, most parser generators are
limited to one token of lookahead.
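The growth in columns is easy to check by enumeration. The sketch below assumes a five-symbol token alphabet chosen for illustration and lists every k-token lookahead sequence; the function name `lookahead_columns` is hypothetical.

```python
from itertools import product


def lookahead_columns(alphabet, k):
    """All distinct k-token sequences, one per column of an LR(k) table."""
    return list(product(alphabet, repeat=k))


alphabet = ["id", "+", "*", "(", ")"]   # assumed token alphabet, n = 5
for k in (1, 2, 3):
    columns = lookahead_columns(alphabet, k)
    print(k, len(columns), len(alphabet) ** k)
```

For each k the enumerated count equals n^k, so even this small alphabet needs 125 columns at k = 3.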
LR(k) Parsing
An LR(k) parser decides the next action by examining the tokens already shifted
and at most k lookahead tokens.
A grammar is LR(k) if, and only if, it is possible to construct an LR parse table
such that k tokens of lookahead allow the parser to recognize exactly those
strings in the grammar’s language.
LR(0) Table Construction
To keep track of the parser’s progress, we introduce the notion of an LR(0)
item—a grammar production with a bookmark that indicates the current
progress through the production’s RHS.
LR(0) Table Construction
Definition: An LR(0) item of a grammar G is a production of G with a dot (·) at some position of the right
side. For example, the production A → XYZ has 4 items: A → ·XYZ, A → X·YZ, A → XY·Z, A → XYZ·
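An item can be represented as a production plus a dot position. The sketch below uses a (head, body, dot) triple, which is one possible encoding and not the only one; the names `lr0_items` and `show` are illustrative.

```python
def lr0_items(head, body):
    """All LR(0) items of one production: one per dot position 0..len(body)."""
    return [(head, tuple(body), dot) for dot in range(len(body) + 1)]


def show(item):
    """Render an item with '.' standing for the dot."""
    head, body, dot = item
    return f"{head} -> {' '.join(body[:dot] + ('.',) + body[dot:])}"


items = lr0_items("A", ["X", "Y", "Z"])
for item in items:
    print(show(item))
```

A production with a body of length m always has m + 1 items, which gives the four items of A → XYZ.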
Closure of Item Sets
If I is a set of items for a grammar G, then CLOSURE(I) is the set of items
constructed from I by the two rules:
1. Initially, add every item in I to CLOSURE(I).
2. If A → α·Bβ is in CLOSURE(I) and B → γ is a production, then add the item
B → ·γ to CLOSURE(I), if it is not already there. Apply this rule until no more
new items can be added to CLOSURE(I).
Example
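Consider the augmented expression grammar
E' → E
E → E + T | T
T → T * F | F
F → ( E ) | id
If I is the set of one item {[E' → ·E]}, then CLOSURE(I) contains the set of items:
E' → ·E, E → ·E + T, E → ·T, T → ·T * F, T → ·F, F → ·( E ), F → ·id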
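The closure computation can be sketched with a worklist. The code assumes the augmented expression grammar E' → E, E → E + T | T, T → T * F | F, F → ( E ) | id, and encodes an item as a (head, body, dot) triple; both the encoding and the function names are illustrative choices.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """CLOSURE of a set of (head, body, dot) items."""
    result = set(items)
    worklist = list(items)
    while worklist:
        head, body, dot = worklist.pop()
        # Rule 2 applies only when the dot stands before a nonterminal B.
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)      # B -> . gamma
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def show(item):
    head, body, dot = item
    return f"{head} -> {' '.join(body[:dot] + ('.',) + body[dot:])}"


I = {("E'", ("E",), 0)}
for line in sorted(show(item) for item in closure(I)):
    print(line)
```

Starting from the single item E' → ·E, the loop adds one item per production of E, T, and F, seven items in total.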
LR(0) items
We divide all the sets of items of interest into two classes:
Kernel items: the initial item, S' → ·S, and all items whose dots are not at the left end.
Nonkernel items: all items with their dots at the left end, except for S' → ·S.
We now define a parser state as a set of LR(0) items.
The Function GOTO
GOTO(I, X) is defined to be the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I,
where I is a set of items and X is a grammar symbol.
Intuitively, the GOTO function is used to define the transitions in the LR(0)
automaton for a grammar.
The states of the automaton correspond to sets of items, and GOTO(I, X)
specifies the transition from the state for I under input X.
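GOTO can be sketched as "advance the dot over X, then close". The code below is self-contained and assumes the augmented expression grammar E' → E, E → E + T | T, T → T * F | F, F → ( E ) | id, with items encoded as (head, body, dot) triples; the encoding and names are illustrative.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """CLOSURE of a set of (head, body, dot) items."""
    result, worklist = set(items), list(items)
    while worklist:
        head, body, dot = worklist.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for gamma in GRAMMAR[body[dot]]:
                item = (body[dot], gamma, 0)
                if item not in result:
                    result.add(item)
                    worklist.append(item)
    return result


def goto(items, symbol):
    """GOTO(I, X): move the dot over X in every item that allows it, then close."""
    moved = {
        (head, body, dot + 1)
        for head, body, dot in items
        if dot < len(body) and body[dot] == symbol
    }
    return closure(moved)


# I = { E' -> E . , E -> E . + T }
I = {("E'", ("E",), 1), ("E", ("E", "+", "T"), 1)}
J = goto(I, "+")
print(len(J))
```

Here GOTO(I, +) is the closure of { E → E + ·T }, which adds the four items for T and F with the dot at the left end. An empty result means the automaton has no transition on that symbol.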
Characteristic Finite-State Machine (CFSM)
The basis for LR parsing is a deterministic finite automaton (DFA), called the
characteristic finite-state machine (CFSM).
A viable prefix of a right sentential form is any prefix that does not extend
beyond its handle.
Formally, a CFSM recognizes its grammar’s viable prefixes.
When the automaton arrives in a double-boxed state, it has processed a
viable prefix that ends with a handle.
In the rightmost derivation E' ⟹ E ⟹ E + n ⟹ n + n,
ε, E, E +, and E + n are all viable prefixes of the right-sentential form E + n.
Completing an LR(0) Parse Table
LR(0) Parse (from the Dragon Book)
Conflict Diagnosis
If we consider the possibilities for multiple table-cell entries, only the
following two cases are troublesome for LR(k) parsing:
shift/reduce conflicts exist in a state when table construction cannot use the next
k tokens to decide whether to shift the next input token or call for a reduction.
reduce/reduce conflicts exist when table construction cannot use the next k tokens
to distinguish between multiple reductions.
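For the k = 0 case, both kinds of conflict can be read directly off a state's item set. The sketch below assumes items encoded as (head, body, dot) triples and the nonterminals of the expression grammar; the name `lr0_conflicts` is hypothetical, and the accepting item is not given special treatment here.

```python
NONTERMINALS = {"E'", "E", "T", "F"}   # assumed nonterminal set


def lr0_conflicts(state):
    """List the LR(0) conflicts present in one set of (head, body, dot) items."""
    # Complete items (dot at the right end) call for a reduction.
    complete = [(head, body) for head, body, dot in state if dot == len(body)]
    # Items with the dot before a terminal call for a shift.
    shifts = {
        body[dot]
        for head, body, dot in state
        if dot < len(body) and body[dot] not in NONTERMINALS
    }
    found = []
    if complete and shifts:
        found.append("shift/reduce")
    if len(complete) > 1:
        found.append("reduce/reduce")
    return found


# { E -> T . , T -> T . * F }: reduce to E, or shift '*'?
state = {("E", ("T",), 1), ("T", ("T", "*", "F"), 1)}
print(lr0_conflicts(state))
```

With lookahead, the same state may stop being a conflict if the tokens that permit the reduction differ from the tokens that can be shifted.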
Conflict Diagnosis
Conflicts arise for one of the following reasons:
The grammar is ambiguous. No (deterministic) table-construction method can
resolve conflicts that arise due to ambiguity.
The grammar is not ambiguous, but the current table-building approach could not
resolve the conflict. In this case, the conflict might disappear if one or more of the
following approaches is taken:
The current table-construction method is given more lookahead.
A more powerful table-construction method is used.
Ambiguous Grammars
The parse tree that favors the reduction in State 5 corresponds to a left-associative grouping for
addition, while the shift corresponds to a right-associative grouping.
Ambiguous Grammars
[Figure: an ambiguous grammar and an unambiguous grammar, side by side]
Ambiguous Grammars
A statement beginning with p(i, j) would appear as the token stream id(id, id) to the
parser. After shifting the first three tokens onto the stack, a shift-reduce
parser would be in configuration
(stack) $ ... id ( id        , id ) ... $ (input buffer)
Make things as easy as possible for the parser. It should be left to the scanner to
determine whether id is a procedure or an array.
Grammars that are not LR(k)
SLR(k) Table Construction
The SLR(k) (Simple LR with k tokens of lookahead) method attempts to resolve
inadequate states using grammar analysis methods.
[Figure annotation: this grammar is LR(0)]
[Figure annotations: shift/reduce conflict (marked twice); Error!!]
SLR(k) Table Construction
With the item E → E + T· in State 6, reduction by E → E + T must be appropriate under some
conditions. If we examine the sentential forms E + T and E + T * num, we see that E → E + T must be
applied in State 6 when the next input symbol is + or $, but not *.
If the reduction to E can lead to a successful parse, then + (or $) can appear
next to E in some valid sentential form. An equivalent statement is + ∈ Follow(E).
For our example, States 1 and 6 are resolved by computing Follow(E) = { +, $ }.
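The lookahead sets that SLR consults before reducing are Follow sets, and they can be computed by a fixed-point iteration. The sketch below assumes a small expression grammar (Start → E $, E → E + T | T, T → T * num | num); this grammar is an assumption made for illustration, and the code relies on it having no empty productions.

```python
# Assumed grammar (no empty productions).
GRAMMAR = [
    ("Start", ["E", "$"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "num"]),
    ("T", ["num"]),
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def first_sets():
    """First sets, valid only because no production body is empty."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            x = body[0]
            new = first[x] if x in NONTERMINALS else {x}
            if not new <= first[head]:
                first[head] |= new
                changed = True
    return first


def follow_sets():
    """Follow sets by iterating until nothing changes."""
    first = first_sets()
    follow = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for head, body in GRAMMAR:
            for i, x in enumerate(body):
                if x not in NONTERMINALS:
                    continue
                if i + 1 < len(body):
                    nxt = body[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    new = follow[head]   # x ends the body: inherit Follow(head)
                if not new <= follow[x]:
                    follow[x] |= new
                    changed = True
    return follow


follow = follow_sets()
print(sorted(follow["E"]), sorted(follow["T"]))
```

Under this assumed grammar, a state holding both a completed item for E and an item that shifts * is no longer in conflict, because * is not in Follow(E).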
Exercises 4, 5, 6, 10