Bottom-Up Parsing
Bottom-up parsing has the following advantages over top-down parsing:
• Attribute computation is easy.
• Since choices are made only at the end of a rule, shared prefixes are
unproblematic. Because of this, there is usually no need to modify
grammar rules.
• The parser can be generated automatically.
One big disadvantage is the fact that bottom-up parsing does not
support left/right information flow. (For example, checking
symbol definitions.)
Shift/Reduce Parsing
Let G = (Σ, A, R, S) be an attribute grammar.
The shift/reduce parser operates on triples
(s, v, u) ∈ (Σ ⊗ A)∗ × (Σ ⊗ A)∗ × (Σ ⊗ A)∗ , where
• s ∈ (Σ ⊗ A)∗ is the stack,
• v ∈ (Σ ⊗ A)∗ is the lookahead,
• u ∈ (Σ ⊗ A)∗ is the input that has not yet been read.
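Such a configuration could be represented in C++ roughly as follows (a sketch, not code from the lecture; the type token is an assumption standing for a symbol from Σ together with its attribute):

#include <deque>
#include <vector>

struct configuration
{
   std::vector< token > stack;        // s: tokens already shifted (and reduced)
   std::deque< token > lookahead;     // v: tokens read, but not yet shifted
   std::deque< token > input;         // u: tokens not yet read
};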
Shift/Reduce Parsing
We write ⊢ for the transition relation of the parser.
The parser starts in a state of the form (ε, ε, u).
(Empty stack, empty lookahead, no input read.)
Read
A read means that the parser moves one unread token to the
lookahead:
(s, v, σ u) ⊢ (s, v σ, u), for σ ∈ Σ ⊗ A.
Shift
A shift means that the parser shifts one token from the lookahead to
the stack:
(s, σ v, u) ⊢ (s σ, v, u), for σ ∈ Σ ⊗ A.
Reduction
A reduction means that the parser replaces the right hand side of a
grammar rule by the left hand side. It uses the attribute function
of the grammar rule to compute the new attribute.
If (A → w1 · . . . · wn ) : f ∈ R, then
(s w1 . . . wn , v, u) ⊢ (s A, v, u),
where the attribute attached to A is f (a1 , . . . , an ), and a1 , . . . , an
are the attributes attached to w1 , . . . , wn .
Accept
The shift/reduce parser accepts its input if it is in a state of the
form (S, ε, ε).
This means that it has read all the input, has empty lookahead,
and it managed to rewrite the input to S.
In practice an EOF symbol is used. Let # ∉ Σ be a special EOF
symbol.
The shift/reduce parser accepts its input if it is in a state of the
form (S, #, ε): the complete input except for # has been rewritten
to S, and only # remains in the lookahead.
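As a small illustration (this example is not from the slides), take the grammar
with rules E → int : g(x) = x and E → E + E : f (x, y, z) = x + z, where E is
the start symbol and an int token carries its value as attribute (written in
brackets). On input 2 + 3, using the acceptance condition without the EOF
symbol, one possible run of the parser is:
(ε, ε, 2 + 3)
⊢ (ε, 2, + 3) (read)
⊢ (2, ε, + 3) (shift)
⊢ (E[2], ε, + 3) (reduce E → int)
⊢ (E[2], +, 3) (read)
⊢ (E[2] +, ε, 3) (shift)
⊢ (E[2] +, 3, ε) (read)
⊢ (E[2] + 3, ε, ε) (shift)
⊢ (E[2] + E[3], ε, ε) (reduce E → int)
⊢ (E[5], ε, ε) (reduce E → E + E, attribute 2 + 3 = 5)
The final state has the whole input rewritten to the start symbol, empty
lookahead, and no unread input, so the parser accepts.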
Making the Decisions
At each state, the parser has the following choices:
• If the top of the stack contains the right hand side of a rule, it
can reduce.
• If it didn’t reach the end of file, it can shift.
More than one reduction may be possible. Moreover, when a
reduction is possible, a shift may be possible as well. In order to
decide, the parser uses the lookahead.
A good parser makes its decisions as early as possible, that is,
with the smallest possible lookahead.
We will only consider parsers that use a lookahead of at most 1.
Parser Generation Tools/Practical Aspects
There exist many parser generation tools that support attribute
grammars (Yacc, Bison, Maphoon). The attribute functions are
usually represented by general C/C++ statements. In the code,
$1, $2, $3, . . . refer to the attributes of the first, second, etc.
token on the right hand side.
The notation $$ refers to the attribute of the token on the left
hand side.
A rule of the form A → A + B : f (x, y, z) = x + z is represented by:
A -> A + B // $$ = $1 + $3;
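For comparison (this fragment is not from the slides), the same rule written in Bison/Yacc syntax, assuming integer-valued attributes, would look roughly like:

A : A '+' B   { $$ = $1 + $3; }
  ;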
LALR parsing
LALR stands for look-ahead LR (left-to-right scan, rightmost
derivation). It is a technique for deciding when reductions have to
be made in shift/reduce parsing. Often, it can make the decisions
without using a lookahead. Sometimes, a lookahead of 1 is required.
Most parser generators (and in particular Bison and Yacc)
construct LALR parsers.
In LALR parsing, a deterministic finite automaton is used for
determining when reductions have to be made. This deterministic
finite automaton is usually called the prefix automaton. On the
following slides, I will explain how it is constructed.
Items
Let G = (Σ, R, S) be a context-free grammar.
Definition: Let A ∈ Σ, w1 , w2 ∈ Σ∗ . If A → w1 · w2 ∈ R, then
A → w1 . w2 is called an item.
An item is a rule with a dot added somewhere in the right hand
side.
The intuitive meaning of an item A → w1 . w2 is that w1 has been
read, and if w2 will also be read, then the rule A → w1 w2 can be
reduced.
Items
Let a → bBc be a rule. The following items can be constructed
from this rule:
a → . bBc, a → b . Bc, a → bB . c, a → bBc .
Operations on Itemsets (1)
Definition: An itemset is a set of items.
Because for a given grammar, there exists only a finite set of
possible items, the set of itemsets is also finite.
Let I be an itemset. The closure CLOS(I) of I is defined as the
smallest itemset J, s.t.
• I ⊆ J,
• If A → w1 . Bw2 ∈ J, and there exists a rule B → v ∈ R, then
B → . v ∈ J.
Operations on Itemsets (2)
Let I be an itemset, let α ∈ Σ be a symbol. The set TRANS(I, α)
is defined as
{A → w1 α . w2 | A → w1 . αw2 ∈ I }.
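A possible implementation sketch of itemsets, CLOS and TRANS (this is not code from the lecture; the representation of rules and items is an assumption):

#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct rule                    // A -> rhs. The attribute function is omitted here.
{
   std::string lhs;
   std::vector< std::string > rhs;
};

struct item                    // lhs -> rhs[0..dot) . rhs[dot..)
{
   size_t rulenr;              // index of the rule in the grammar
   size_t dot;                 // position of the dot

   bool operator < ( const item& other ) const
   {
      if( rulenr != other. rulenr ) return rulenr < other. rulenr;
      return dot < other. dot;
   }
};

using itemset = std::set< item >;

// CLOS(I): repeatedly add B -> . v for every item A -> w1 . B w2 in the set.
itemset clos( const std::vector< rule > & g, itemset i )
{
   bool changed = true;
   while( changed )
   {
      changed = false;
      for( const item& it : itemset(i))      // iterate over a copy of i
      {
         const rule& r = g[ it. rulenr ];
         if( it. dot < r. rhs. size( ))
         {
            const std::string& b = r. rhs[ it. dot ];
            for( size_t k = 0; k != g. size( ); ++ k )
               if( g[k]. lhs == b && i. insert( item{ k, 0 } ). second )
                  changed = true;
         }
      }
   }
   return i;
}

// TRANS(I, alpha): move the dot over alpha in every item that allows it.
itemset trans( const std::vector< rule > & g, const itemset& i,
               const std::string& alpha )
{
   itemset result;
   for( const item& it : i )
   {
      const rule& r = g[ it. rulenr ];
      if( it. dot < r. rhs. size( ) && r. rhs[ it. dot ] == alpha )
         result. insert( item{ it. rulenr, it. dot + 1 } );
   }
   return result;
}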
The Prefix Automaton
Let G = (Σ, R, S) be a grammar. The prefix automaton of G is the
deterministic finite automaton A = (Σ, Q, Qs , Qa , δ), that is the
result of the following algorithm:
• Start with A = (Σ, {CLOS(I)}, {CLOS(I)}, ∅, ∅), where
I = {Ŝ → . S #}, Ŝ ∉ Σ is a new start symbol, S is the
original start symbol of G, and # ∉ Σ is the EOF symbol.
• As long as there exists an I ∈ Q and an A ∈ Σ, s.t.
I′ = CLOS(TRANS(I, A)) ∉ Q, put
Q := Q ∪ {I′ }, δ := δ ∪ {(I, A, I′ )}.
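Continuing the sketch from above (again not the lecture's code), the construction loop could look roughly as follows. Note that this sketch also records transitions (I, A, I′) when I′ is already a known state:

#include <map>
#include <utility>

struct prefix_automaton
{
   std::vector< itemset > states;                          // Q
   std::map< std::pair< size_t, std::string >, size_t > delta;
};

// symbols must contain every grammar symbol (including the EOF symbol #).
prefix_automaton build( const std::vector< rule > & g,
                        const std::vector< std::string > & symbols,
                        const itemset& start )
{
   prefix_automaton a;
   a. states. push_back( clos( g, start ));

   for( size_t i = 0; i != a. states. size( ); ++ i )      // the tail acts as worklist
      for( const std::string& s : symbols )
      {
         itemset next = clos( g, trans( g, a. states[i], s ));
         if( next. empty( ))
            continue;

         size_t j = 0;                                     // look for an equal state
         while( j != a. states. size( ) && a. states[j] != next )
            ++ j;
         if( j == a. states. size( ))
            a. states. push_back( next );                  // new state
         a. delta[ { i, s } ] = j;
      }
   return a;
}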
The Prefix Automaton (2)
The prefix automaton may be big, but it can be easily computed.
Every context-free language has a prefix automaton, but not every
language can be parsed by an LALR parser, because of the look
ahead sets.
Theorem: Let G = (Σ, R, S) be a context-free grammar. Let L be
its associated language, i.e. L = {w ∈ Σ∗ | S ⇒∗ w}. Let L′ be the
language defined by
Parse Algorithm (1)
std::vector< state > states;
// Stack of states of the prefix automaton.
while( true )
{
Parse Algorithm (2)
decision = unknown;
Parse Algorithm (3)
Parse Algorithm (4)
if( decision == unknown &&
topstate has only one reduction R with
lookahead. front( ) &&
no shift is possible with lookahead. front( ))
{
decision = reduce(R);
}
if( decision == unknown &&
topstate has only a shift Q with
lookahead. front( ) &&
no reduction is possible with lookahead. front())
{
decision = shift(Q);
}
Parse Algorithm (5)
if( decision == unknown )
{
// Either we have a conflict, or the parser is
// stuck.
Parse Algorithm (6)
decision = reduce(R);
}
Parse Algorithm (7)
if( decision == shift(Q))
{
states. push_back( Q );
tokens. push_back( lookahead. front( ));
lookahead. pop_front( );
}
else
{
// decision has form reduce(R)
unsigned int n =
the length of the rhs of R.
Parse Algorithm (8)
token lhs =
compute_lhs( R,
tokens. begin( ) + tokens. size( ) - n,
tokens. begin( ) + tokens. size( ));
// this also computes the attribute.
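The remainder of the reduction step is not shown on the slides. A rough, hypothetical sketch of what it still has to do: pop the right hand side from both stacks, push the computed left hand side, and push the state that the prefix automaton reaches from the exposed top state when it reads the left hand side (here delta is a hypothetical transition lookup, not a function from the lecture):

tokens. erase( tokens. end( ) - n, tokens. end( ));
states. erase( states. end( ) - n, states. end( ));
tokens. push_back( lhs );
// goto step: follow the transition on (the symbol of) lhs:
states. push_back( delta( states. back( ), lhs ));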
Parse Algorithm (9)
// Unreachable.
Lookahead Sets
We have already seen lookahead sets in action.
If a state has more than one reduction, or a reduction and a shift,
the parser looks at the lookahead symbol, in order to decide what
to do next.
LA(I, A → w) ⊆ Σ is defined as a set of tokens. If the parser is in
state I, and the lookahead ∈ LA(I, A → w), then the parser can
reduce A → w.
When should a token σ be in LA(I, A → w)?
Lookahead Sets (2)
Definition:
s ∈ LA(I, A → w) if
1. A → w . ∈ I (obvious)
2. There exists a correct input word w1 s w2 #, such that
3. The parser reaches a state with state stack (. . . , I) and token
stack (. . . , w), the lookahead (of the parser) is s, and
4. the parser can reduce the rule A → w, after which
5. it can read the rest of the input w2 and reach an accepting
state.
Computing Look Ahead Sets
For every rule A → w of the grammar G, and all states I1 , I2 , I3 ,
such that A → . w ∈ I1 , A → w . ∈ I2 , there is a path from I1 to I2
in the prefix automaton that reads w, and there is a transition
from I1 to I3 that reads A, the following must hold:
• For every symbol σ ∈ Σ, for which a transition from I3 to some
other state is possible in the prefix automaton,
σ ∈ LA( I2 , A → w . ).
• For every item of form B → v . ∈ I3 ,
LA( I3 , B → v .) ⊆ LA( I2 , A → w .)
Compute the LA as the smallest such sets.
Computing Look Ahead Sets (2)
Example
S → Aa,
A → B,
A → Bb,
B → C,
B → Cc,
C → d.
The algorithm on the previous slides can sometimes compute
lookahead sets that are too big. You will see this in the exercises.
Computing the Lookahead Sets in the Correct Way
Definition: Let G = (Σ, R, S) be a grammar. An LR(1)-item (based
on G) is an object of form A → w1 . w2 /s, where (A → w1 w2 ) ∈ R,
and s ∈ Σ is a terminal symbol of G.
An LR(1)-itemset is a set of LR(1)-items.
The intuitive meaning of A → w1 . w2 /s is something like: ‘We
have read w1 , and are prepared to read w2 , followed by s.’
Closure of LR(1)-Itemsets
Let I be an LR(1)-itemset. The closure CLOS(I) of I is defined as
the smallest LR(1)-itemset J, s.t.
• I ⊆ J,
• If A → w1 . Bw2 /s ∈ J, and there exists a rule B → v ∈ R,
then for each terminal symbol s′ ∈ FIRST(w2 s), also
B → . v/s′ ∈ J.
(FIRST is defined in the slides on top-down parsing.)
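A sketch of this closure in the style of the earlier LR(0) sketch (not the lecture's code; first( ) is an assumed function computing FIRST of a string of symbols, as defined in the top-down parsing slides):

struct lr1_item
{
   size_t rulenr;               // index of the rule A -> w1 w2 in the grammar
   size_t dot;                  // position of the dot
   std::string la;              // the lookahead symbol s

   bool operator < ( const lr1_item& o ) const
   {
      if( rulenr != o. rulenr ) return rulenr < o. rulenr;
      if( dot != o. dot ) return dot < o. dot;
      return la < o. la;
   }
};

using lr1_itemset = std::set< lr1_item >;

// Assumed to exist: FIRST of a string of grammar symbols.
std::set< std::string > first( const std::vector< std::string > & w );

lr1_itemset clos( const std::vector< rule > & g, lr1_itemset i )
{
   bool changed = true;
   while( changed )
   {
      changed = false;
      for( const lr1_item& it : lr1_itemset(i))            // iterate over a copy
      {
         const rule& r = g[ it. rulenr ];
         if( it. dot >= r. rhs. size( ))
            continue;
         const std::string& b = r. rhs[ it. dot ];

         // w2 s : the rest of the right hand side, followed by the lookahead.
         std::vector< std::string > w2s( r. rhs. begin( ) + it. dot + 1,
                                         r. rhs. end( ));
         w2s. push_back( it. la );

         for( size_t k = 0; k != g. size( ); ++ k )
            if( g[k]. lhs == b )
               for( const std::string& s1 : first( w2s ))
                  if( i. insert( lr1_item{ k, 0, s1 } ). second )
                     changed = true;
      }
   }
   return i;
}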
Transitions of LR(1)-Itemsets
Let I be an LR(1)-itemset, let α ∈ Σ be a symbol. TRANS(I, α) is
defined as
{A → w1 α . w2 /s | A → w1 . αw2 /s ∈ I }.
Core of an LR(1)-Itemset
Let I be an LR(1)-itemset. The core of I, written as CORE(I) is
defined as
{A → w1 . w2 | ∃s ∈ Σ : A → w1 . w2 /s ∈ I}.
(The set of LR(0)-items that one obtains when one removes all the
lookaheads.)
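For example, CORE( { A → w1 . w2 /s1 , A → w1 . w2 /s2 } ) = { A → w1 . w2 }.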
Construction of the Prefix Automaton with LR(1)-Items
Let G = (Σ, R, S) be a grammar. The prefix automaton of G is the
deterministic finite automaton A = (Σ, Q, Qs , Qa , δ), that is the
result of the following algorithm:
• Start with A = (Σ, {CLOS(I)}, {CLOS(I)}, ∅, ∅), where
I = {Ŝ → . S/#}, Ŝ ∉ Σ is a new start symbol, S is the
original start symbol of G, and # ∉ Σ is the EOF symbol.
• As long as there exists an I ∈ Q and an A ∈ Σ, s.t.
I′ = CLOS(TRANS(I, A)), and there is no state I′′ ∈ Q with
CORE(I′′ ) = CORE(I′ ), set
Q := Q ∪ {I′ }, δ := δ ∪ {(I, A, I′ )}.
• If there already is a state I′′ ∈ Q with CORE(I′′ ) = CORE(I′ ),
then instead the items of I′ are added to I′′ , and the transition
is recorded as δ := δ ∪ {(I, A, I′′ )}.
Once the prefix automaton A = (Σ, Q, Qs , Qa , δ) has been
constructed, the lookahead sets can be obtained from the
LR(1)-items as follows:
If a state I contains items of the form A → w . /s′ , then the
lookahead set for reducing A → w equals
LA(I, A → w) = { s′ ∈ Σ | A → w . /s′ ∈ I }.
The construction on the previous slides is carried out automatically
by parser generators. Examples are Yacc, Bison, and also
Maphoon.
With a parser generator, it is easier to extend the language later.
Also, the parser generator automatically analyzes the language,
and shows where the conflicts are.
Top-down parsing (recursive descent) has the advantage that one
doesn’t need to study a tool, but it will be a lot harder to change
the language later. Developers often avoid using a parser generator,
and regret it later, when they have to change the language.