Lecture3 Java
CS414-2017-03
Parsing
David Galles
S′ → S$
S → AB
S → Ch
A → ef
A→ǫ
B →hg
C → DD
C → fi
D→g
ParseS:
use the rule S → AB on e, h
use the rule S → Ch on f, g
03-7: First sets
First(S ) is the set of all terminals that can start strings
derived from S (plus ǫ, if S can produce ǫ)
S′ → S$ First(S ′ ) =
S → AB First(S ) =
S → Ch First(A) =
A → ef First(B ) =
A→ǫ First(C ) =
B →hg First(D ) =
C → DD
C → fi
D→g
03-8: First sets
First(S ) is the set of all terminals that can start strings
derived from S (plus ǫ, if S can produce ǫ)
S′ → S$ First(S ′ ) = {e, f, g, h}
S → AB First(S ) = {e, f, g, h}
S → Ch First(A) = {e, ǫ}
A → ef First(B ) = {h}
A→ǫ First(C ) = {f, g}
B →hg First(D ) = {g}
C → DD
C → fi
D→g
03-9: First sets
We can expand the definition of First sets to include
strings of terminals and non-terminals
S′ → S$ First(aB ) =
S → AB First(BC ) =
S → Ch First(AbC ) =
A → ef First(AC ) =
A→ǫ First(abS ) =
B →hg First(DDA) =
C → DD
C → fi
D→g
03-10: First sets
We can expand the definition of First sets to include
strings of terminals and non-terminals
S′ → S$ First(aB ) = {a}
S → AB First(BC ) = {h}
S → Ch First(AbC ) = {e, b}
A → ef First(AC ) = {e, f, g}
A→ǫ First(abS ) = {a}
B →hg First(DDA) = {g}
C → DD
C → fi
D→g
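The answers above can be checked mechanically. The following is a minimal Java sketch, not part of the original slides: the class name FirstOfString, the method first, and the string "eps" (standing for ǫ) are assumptions made for illustration. The First sets of the non-terminals are copied from the earlier slide, and any symbol without an entry is treated as a terminal.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstOfString {
    static final String EPS = "eps"; // stands for the empty string

    // First sets of the non-terminals, copied from the slide above.
    static final Map<String, Set<String>> FIRST = Map.of(
        "S", Set.of("e", "f", "g", "h"),
        "A", Set.of("e", EPS),
        "B", Set.of("h"),
        "C", Set.of("f", "g"),
        "D", Set.of("g"));

    // First of a sequence of symbols; a symbol with no entry in FIRST is a terminal.
    static Set<String> first(List<String> symbols) {
        Set<String> result = new TreeSet<>();
        for (String symbol : symbols) {
            Set<String> f = FIRST.getOrDefault(symbol, Set.of(symbol));
            for (String t : f) {
                if (!t.equals(EPS)) result.add(t);
            }
            if (!f.contains(EPS)) return result; // this symbol cannot vanish, so stop here
        }
        result.add(EPS); // every symbol in the sequence can derive the empty string
        return result;
    }

    public static void main(String[] args) {
        System.out.println(first(List.of("A", "b", "C"))); // [b, e]
        System.out.println(first(List.of("A", "C")));      // [e, f, g]
        System.out.println(first(List.of("D", "D", "A"))); // [g]
    }
}
```

A is the only symbol here that can produce ǫ, which is why First(AbC) and First(AC) look past it to the next symbol.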
03-11: Calculating First sets
For each non-terminal S , set First(S ) = {}
For each rule of the form S → γ , add First(γ ) to
First(S )
Repeat until no changes are made
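The three steps above can be sketched in Java as a fixed-point loop. This is an illustrative sketch under assumptions, not the course's code: the class name FirstSets, the rule encoding (each rule is an array holding the left-hand side followed by the right-hand-side symbols, with an empty right-hand side meaning ǫ), and the string "eps" are all choices made for this example. Any symbol that never appears on a left-hand side is treated as a terminal.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class FirstSets {
    static final String EPS = "eps"; // stands for the empty string

    // {lhs, rhs symbols...}; a rule with no rhs symbols is an epsilon rule.
    static final String[][] RULES = {
        {"S'", "S", "$"}, {"S", "A", "B"}, {"S", "C", "h"},
        {"A", "e", "f"}, {"A"}, {"B", "h", "g"},
        {"C", "D", "D"}, {"C", "f", "i"}, {"D", "g"}};

    static Map<String, Set<String>> compute(String[][] rules) {
        Map<String, Set<String>> first = new TreeMap<>();
        for (String[] r : rules) first.put(r[0], new TreeSet<>()); // First(S) = {}
        boolean changed = true;
        while (changed) {                                          // repeat until no changes
            changed = false;
            for (String[] r : rules) {
                Set<String> target = first.get(r[0]);
                boolean allNullable = true;
                // Walk the right-hand side while every symbol so far can vanish.
                for (int i = 1; i < r.length && allNullable; i++) {
                    Set<String> f = first.containsKey(r[i]) ? first.get(r[i]) : Set.of(r[i]);
                    allNullable = f.contains(EPS);
                    for (String t : new ArrayList<>(f)) {
                        if (!t.equals(EPS) && target.add(t)) changed = true;
                    }
                }
                if (allNullable && target.add(EPS)) changed = true;
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(compute(RULES));
    }
}
```

On the example grammar this should reproduce the First sets listed earlier, for instance First(A) = {e, ǫ} and First(S) = {e, f, g, h}.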
03-12: Calculating First sets
Example:
S′ → S$
S → AB
S → Ch
A → ef
A→ǫ
B →hg
C → DD
C → fi
D→g
03-13: Follow Sets
Follow(S ) is the set of all terminals that can follow S in
a (partial) derivation.
S′ → S$ Follow(S ′ ) =
S → AB Follow(S ) =
S → Ch Follow(A) =
A → ef Follow(B ) =
A→ǫ Follow(C ) =
B →hg Follow(D ) =
C → DD
C → fi
D→g
03-14: Follow Sets
Follow(S ) is the set of all terminals that can follow S in
a (partial) derivation.
S′ → S$ Follow(S ′ ) = { }
S → AB Follow(S ) = {$}
S → Ch Follow(A) = {h}
A → ef Follow(B ) = {$}
A→ǫ Follow(C ) = {h}
B →hg Follow(D ) = {h, g}
C → DD
C → fi
D→g
03-15: Calculating Follow sets
For each non-terminal S , set Follow(S ) = {}
For each rule of the form S → γ
For each non-terminal S1 in γ , where γ is of the
form αS1 β
If First(β ) does not contain ǫ, add all
elements of First(β ) to Follow(S1 ).
If First(β ) does contain ǫ, add all elements of
First(β ) except ǫ to Follow(S1 ), and add all
elements of Follow(S ) to Follow(S1 ).
If any changes were made, repeat.
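A possible Java rendering of this procedure is sketched below. It is an illustration only: the class name FollowSets, the rule encoding, and the string "eps" for ǫ are assumptions of this example. To keep the sketch short, the First sets of the non-terminals are copied from the earlier slide rather than recomputed.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class FollowSets {
    static final String EPS = "eps"; // stands for the empty string

    // {lhs, rhs symbols...}; a rule with no rhs symbols is an epsilon rule.
    static final String[][] RULES = {
        {"S'", "S", "$"}, {"S", "A", "B"}, {"S", "C", "h"},
        {"A", "e", "f"}, {"A"}, {"B", "h", "g"},
        {"C", "D", "D"}, {"C", "f", "i"}, {"D", "g"}};

    // First sets of the non-terminals, copied from the earlier slide.
    static final Map<String, Set<String>> FIRST = Map.of(
        "S'", Set.of("e", "f", "g", "h"), "S", Set.of("e", "f", "g", "h"),
        "A", Set.of("e", EPS), "B", Set.of("h"),
        "C", Set.of("f", "g"), "D", Set.of("g"));

    // First(beta), where beta is the part of the rule starting at index 'from'.
    static Set<String> firstOf(String[] rule, int from) {
        Set<String> result = new TreeSet<>();
        for (int i = from; i < rule.length; i++) {
            Set<String> f = FIRST.getOrDefault(rule[i], Set.of(rule[i]));
            for (String t : f) if (!t.equals(EPS)) result.add(t);
            if (!f.contains(EPS)) return result;
        }
        result.add(EPS);
        return result;
    }

    static Map<String, Set<String>> compute() {
        Map<String, Set<String>> follow = new TreeMap<>();
        for (String nt : FIRST.keySet()) follow.put(nt, new TreeSet<>()); // Follow(S) = {}
        boolean changed = true;
        while (changed) {                                   // if any changes were made, repeat
            changed = false;
            for (String[] r : RULES) {
                for (int i = 1; i < r.length; i++) {
                    if (!FIRST.containsKey(r[i])) continue; // terminals have no Follow set
                    Set<String> target = follow.get(r[i]);
                    Set<String> beta = firstOf(r, i + 1);
                    if (beta.remove(EPS)) {                 // beta can vanish: add Follow(lhs)
                        changed |= target.addAll(new ArrayList<>(follow.get(r[0])));
                    }
                    changed |= target.addAll(beta);         // First(beta) without epsilon
                }
            }
        }
        return follow;
    }

    public static void main(String[] args) {
        System.out.println(compute());
    }
}
```

For the rule C → DD, the first D is followed by First(D) = {g} and the second D by Follow(C) = {h}, which gives Follow(D) = {h, g} as on the earlier slide.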
03-16: Calculating Follow sets
Example:
S′ → S$
S → AB
S → Ch
A → ef
A→ǫ
B →hg
C → DD
C → fi
D→g
03-17: Parse Tables
Each row in a parse table is labeled with a
non-terminal
Each column in a parse table is labeled with a
terminal
Each element in a parse table is either empty, or
contains a grammar rule
Rule S → γ goes in row S , in all columns of
First(γ ).
If First(γ ) contains ǫ, then the rule S → γ goes
in row S , in all columns of Follow(S ).
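The two placement rules can be sketched in Java as follows. This is a hypothetical helper, not code from the course: the names ParseTableBuilder, addRule, and build are invented for this example, "eps" stands for ǫ, and First(γ) and Follow(S) are passed in by hand for rows S and A of the running example grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ParseTableBuilder {
    static final String EPS = "eps"; // stands for the empty string

    // Adds rule "lhs -> rhs" to row lhs, in every column the rule belongs to.
    static void addRule(Map<String, Map<String, List<String>>> table,
                        String lhs, String rhs,
                        Set<String> firstOfRhs, Set<String> followOfLhs) {
        Set<String> columns = new TreeSet<>(firstOfRhs);
        // If First(rhs) contains epsilon, the rule also goes in all columns of Follow(lhs).
        if (columns.remove(EPS)) columns.addAll(followOfLhs);
        Map<String, List<String>> row = table.computeIfAbsent(lhs, k -> new TreeMap<>());
        for (String terminal : columns) {
            row.computeIfAbsent(terminal, k -> new ArrayList<>()).add(lhs + " -> " + rhs);
        }
    }

    // Rows S and A of the example grammar; the sets are read off the First/Follow slides.
    static Map<String, Map<String, List<String>>> build() {
        Map<String, Map<String, List<String>>> table = new TreeMap<>();
        addRule(table, "S", "A B", Set.of("e", "h"), Set.of("$"));
        addRule(table, "S", "C h", Set.of("f", "g"), Set.of("$"));
        addRule(table, "A", "e f", Set.of("e"), Set.of("h"));
        addRule(table, "A", "eps", Set.of(EPS), Set.of("h"));
        return table;
    }

    public static void main(String[] args) {
        build().forEach((nonTerminal, row) -> System.out.println(nonTerminal + ": " + row));
    }
}
```

Each cell holds a list of rules rather than a single rule, so a cell that ends up with more than one entry is visible instead of being silently overwritten.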
03-18: Parse Table Example
S′ → S$ First(S ′ ) = {e, f, g, h} Follow(S ′ ) = { }
S → AB First(S ) = {e, f, g, h} Follow(S ) = {$}
S → Ch First(A) = {e, ǫ} Follow(A) = {h}
A → ef First(B ) = {h} Follow(B ) = {$}
A→ǫ First(C ) = {f, g} Follow(C ) = {h}
B →hg First(D ) = {g} Follow(D ) = {h, g}
C → DD
C → fi
D→g
03-19: Parse Table Example
     e         f         g         h         i
S′   S′ → S$   S′ → S$   S′ → S$   S′ → S$
S    S → AB    S → Ch    S → Ch    S → AB
A    A → ef                        A → ǫ
B                                  B → hg
C              C → fi    C → DD
D                        D → g
03-20: Parse Table Example
void ParseA() {
    switch (currentToken) {
        case e:                 /* A → ef */
            checkToken(e);
            checkToken(f);
            break;
        case h:                 /* A → ǫ: epsilon case, consume nothing */
            break;
        default:
            error("Parse Error");
    }
}

void ParseC() {
    switch (currentToken) {
        case f:                 /* C → fi */
            checkToken(f);
            checkToken(i);
            break;
        case g:                 /* C → DD */
            ParseD();
            ParseD();
            break;
        default:
            error("Parse Error");
    }
}
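For reference, here is one way the whole parse table could be turned into a runnable Java recognizer. It is a sketch under assumptions: every token is a single character, the input ends with $, and the class name RecursiveDescent and the helper accepts are invented for this example.

```java
class RecursiveDescent {
    private final String input; // one character per token, ending with '$'
    private int pos = 0;

    RecursiveDescent(String input) { this.input = input; }

    private char current() { return pos < input.length() ? input.charAt(pos) : '\0'; }

    private RuntimeException error() {
        return new IllegalStateException("Parse Error at position " + pos);
    }

    private void checkToken(char expected) {
        if (current() != expected) throw error();
        pos++;
    }

    void parseSPrime() { parseS(); checkToken('$'); }          // S' -> S$

    void parseS() {
        switch (current()) {
            case 'e': case 'h': parseA(); parseB(); break;     // S -> AB
            case 'f': case 'g': parseC(); checkToken('h'); break; // S -> Ch
            default: throw error();
        }
    }

    void parseA() {
        switch (current()) {
            case 'e': checkToken('e'); checkToken('f'); break; // A -> ef
            case 'h': break;                                   // A -> epsilon
            default: throw error();
        }
    }

    void parseB() { checkToken('h'); checkToken('g'); }        // B -> hg

    void parseC() {
        switch (current()) {
            case 'f': checkToken('f'); checkToken('i'); break; // C -> fi
            case 'g': parseD(); parseD(); break;               // C -> DD
            default: throw error();
        }
    }

    void parseD() { checkToken('g'); }                         // D -> g

    // True if the whole input is a sentence of the grammar.
    static boolean accepts(String input) {
        RecursiveDescent parser = new RecursiveDescent(input);
        try {
            parser.parseSPrime();
            return parser.pos == input.length();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

The grammar generates only four strings, so accepts should return true for "efhg$", "hg$", "ggh$", and "fih$", and false for anything else.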
03-21: LL(1) Parser Example
Z′ → Z$
Z → XY Z | d
X →a|Y
Y →ǫ|c
(Initial Symbol = Z ′ )
03-22: LL(1) Parser Example
Z′ → Z$ First(Z ′ ) = {a, c, d}
Z → XY Z | d First(Z ) = {a, c, d}
X →a|Y First(X ) = {a, c, ǫ}
Y →ǫ|c First(Y ) = {c, ǫ}
(Initial Symbol = Z ′ )
03-23: LL(1) Parser Example
Z′ → Z$ First(Z ′ ) = {a, c, d} Follow(Z ′ ) = { }
Z → XY Z | d First(Z ) = {a, c, d} Follow(Z ) = {$}
X →a|Y First(X ) = {a, c, ǫ} Follow(X ) = {a, c, d}
Y →ǫ|c First(Y ) = {c, ǫ} Follow(Y ) = {a, c, d}
(Initial Symbol = Z ′ )
03-24: LL(1) Parser Example
Z′ → Z$ First(Z ′ ) = {a, c, d} Follow(Z ′ ) = { }
Z → XY Z | d First(Z ) = {a, c, d} Follow(Z ) = {$}
X →a|Y First(X ) = {a, c, ǫ} Follow(X ) = {a, c, d}
Y →ǫ|c First(Y ) = {c, ǫ} Follow(Y ) = {a, c, d}
     a            c            d
Z′   Z′ → Z$      Z′ → Z$      Z′ → Z$
Z    Z → XYZ      Z → XYZ      Z → XYZ
                               Z → d
X    X → a        X → Y        X → Y
     X → Y
Y    Y → ǫ        Y → c        Y → ǫ
                  Y → ǫ
03-26: non-LL(1) Grammars
Not all grammars can be parsed by an LL(1) parser
A grammar is LL(1) if the LL(1) parse table
contains no duplicate entries
Previous CFG is not LL(1)
No ambiguous grammar is LL(1)
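The duplicate-entry test can be sketched in Java. This is an illustration with invented names (LL1Check, conflicts, isLL1, Z_GRAMMAR). The column placements for the Z grammar are written out by hand from its First and Follow sets, and "eps" stands for ǫ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LL1Check {
    // Each entry is {row, rule, column1, column2, ...}: the columns where the rule is placed.
    static final String[][] Z_GRAMMAR = {
        {"Z", "Z -> X Y Z", "a", "c", "d"},
        {"Z", "Z -> d", "d"},
        {"X", "X -> a", "a"},
        {"X", "X -> Y", "a", "c", "d"},
        {"Y", "Y -> eps", "a", "c", "d"},
        {"Y", "Y -> c", "c"}};

    // Returns the cells, keyed as "row,column", that hold more than one rule.
    static Map<String, List<String>> conflicts(String[][] placements) {
        Map<String, List<String>> cells = new TreeMap<>();
        for (String[] p : placements) {
            for (int i = 2; i < p.length; i++) {
                cells.computeIfAbsent(p[0] + "," + p[i], k -> new ArrayList<>()).add(p[1]);
            }
        }
        cells.values().removeIf(rules -> rules.size() < 2); // keep only duplicate entries
        return cells;
    }

    // A grammar is LL(1) exactly when no cell holds more than one rule.
    static boolean isLL1(String[][] placements) {
        return conflicts(placements).isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(conflicts(Z_GRAMMAR));
    }
}
```

For the Z grammar the conflicting cells are (Z, d), (X, a), and (Y, c), so isLL1 returns false.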
03-27: LL(1) Parser Example
S ′ → S$
S → if E then S else S
S → begin L end
S → print(E )
L→ǫ
L → SL′
L′ →; SL′
L′ → ǫ
E → num = num
03-28: LL(1) Parser Example
Non-Terminal   First                    Follow
S′             {if, begin, print}       { }
S              {if, begin, print}       {$, else, end, ;}
L              {ǫ, if, begin, print}    {end}
L′             {ǫ, ;}                   {end}
E              {num}                    {then, )}
03-29: LL(1) Parser Example
      if                       then   else   begin             end      print
S′    S′ → S$                                S′ → S$                    S′ → S$
S     S → if E then S else S                 S → begin L end            S → print(E)
L     L → SL′                                L → SL′           L → ǫ    L → SL′
L′                                                             L′ → ǫ
E

      (      )      ;             num              =
S′
S
L
L′                  L′ → ; SL′
E                                 E → num = num
03-30: LL(1) Parser Example
S′ → S$
S → ABC
A→a
A→ǫ
B→b
B→ǫ
C→c
C→ǫ
03-31: LL(1) Parser Example
S → V = E;
S → V = identifier;
E →V
E → integer_literal
V → identifier
Structured variable definitions commonly have this problem
03-35: Removing Ambiguity
This grammar, for describing variable accesses, is
also ambiguous
Ambiguous in the same way as the expression CFG
Can be made unambiguous in a similar fashion
V →V .V
V → identifier
03-36: Removing Ambiguity
Some languages are inherently ambiguous
A language L is inherently ambiguous if:
Every CFG G that generates L is ambiguous
No programming languages are inherently
ambiguous
03-37: Left Recursion
S → Sα
S→β
Any CFG that contains these rules (where α and β
are any string of terminals and non-terminals) is
not LL(1)
Why?
03-38: Left Recursion
S → Sa
S→b
What should ParseS() do on a b?
03-39: Left Recursion
S → Sa
S→b
     First   Follow
S    {b}     {a}

     a       b
S            S → Sa
             S → b
03-40: Left Recursion
S → Sa
S→b
S ⇒b
S ⇒ S a ⇒ ba
S ⇒ S a ⇒ S aa ⇒ baa
S ⇒ S a ⇒ S aa ⇒ S aaa ⇒ baaa
S ⇒ S a ⇒ S aa ⇒ S aaa ⇒ S aaaa ⇒ baaaa
03-42: Removing Left Recursion
S → Sa
S → b
can be rewritten without left recursion as:
S → bS′
S′ → aS′
S′ → ǫ
03-44: Removing Left Recursion
S → Sa
S → b
can be rewritten in EBNF as:
S → b(a)*
03-46: Removing Left Recursion
In general, if you have rules of the form:
S → Sα
S→β
You can replace them with the CFG rules
S → βS ′
S ′ → αS ′
S′ → ǫ
or the EBNF rules
S → β(α)∗
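For the concrete case α = a and β = b, both forms translate directly into code. The Java sketch below is illustrative only: the class name BThenAs and the convention of returning the position after the match, or -1 on a parse error, are assumptions of this example.

```java
class BThenAs {
    // S -> b S'   Returns the position just after S, or -1 on a parse error.
    static int parseS(String in, int pos) {
        if (pos < in.length() && in.charAt(pos) == 'b') return parseSPrime(in, pos + 1);
        return -1;
    }

    // S' -> a S' | epsilon
    static int parseSPrime(String in, int pos) {
        if (pos < in.length() && in.charAt(pos) == 'a') return parseSPrime(in, pos + 1);
        return pos; // epsilon: consume nothing
    }

    // EBNF form S -> b (a)* : the recursion in S' becomes a loop.
    static int parseSLoop(String in, int pos) {
        if (pos >= in.length() || in.charAt(pos) != 'b') return -1;
        pos++;
        while (pos < in.length() && in.charAt(pos) == 'a') pos++;
        return pos;
    }

    // True if the whole input is derived from S.
    static boolean accepts(String in) {
        return parseS(in, 0) == in.length();
    }
}
```

Both versions read the b first and only then decide how many a's follow, which is what a left-recursive ParseS cannot do with one symbol of lookahead.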
03-47: Removing Left Recursion
What about:
S → Sα1
S → Sα2
S → β1
S → β2
03-48: Removing Left Recursion
What about:
S → Sα1
S → Sα2
S → β1
S → β2
S → BA
B → β1
B → β2
A → α1 A
A → α2 A
A → ǫ
We can use the same method for arbitrarily complex grammars
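The rewrite is mechanical enough to automate. The Java sketch below is a hypothetical helper, not code from the course. It emits the equivalent S → βi S′, S′ → αj S′ | ǫ form from the earlier slide rather than introducing separate B and A non-terminals. It handles only immediate left recursion, names the new non-terminal by appending ' to the original, and represents ǫ as an empty list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RemoveLeftRecursion {
    // Rewrites all alternatives of one non-terminal. Each alternative is a list of
    // symbols; an empty list stands for epsilon.
    static Map<String, List<List<String>>> rewrite(String nt, List<List<String>> alternatives) {
        String tail = nt + "'";
        List<List<String>> base = new ArrayList<>(); // S  -> beta S'
        List<List<String>> rest = new ArrayList<>(); // S' -> alpha S'
        for (List<String> alt : alternatives) {
            List<String> out = new ArrayList<>();
            if (!alt.isEmpty() && alt.get(0).equals(nt)) {
                out.addAll(alt.subList(1, alt.size())); // alpha: what follows the leading S
                out.add(tail);
                rest.add(out);
            } else {
                out.addAll(alt);                        // beta
                out.add(tail);
                base.add(out);
            }
        }
        rest.add(List.of());                            // S' -> epsilon
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        result.put(nt, base);
        result.put(tail, rest);
        return result;
    }

    public static void main(String[] args) {
        // S -> S a | b
        System.out.println(rewrite("S", List.of(List.of("S", "a"), List.of("b"))));
    }
}
```

For S → Sa, S → b this yields S → b S′ and S′ → a S′ | ǫ, matching the rewrite shown on the earlier slide.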
03-49: Left Factoring
Consider Fortran DO statements:
Fortran:
do var = initial, final
loop body
end do
Java Equivalent:
for (var=initial; var <= final; var++) {
loop body
}
03-50: Left Factoring
Consider Fortran DO statements:
Fortran:
do var = initial, final, inc
loop body
end do
Java Equivalent:
for (var=initial; var <= final; var+=inc) {
loop body
}
03-51: Left Factoring
CFG for Fortran DO statements:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
Is this Grammar LL(1)?
03-52: Left Factoring
CFG for Fortran DO statements:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
Is this Grammar LL(1)? No!
The problem is in the rules for L
Two rules for L that start out exactly the same
No way to know which rule to apply when
looking at just 1 symbol
03-53: Left Factoring
Factor out the similar sections from the rules:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
03-54: Left Factoring
Factor out the similar sections from the rules:
S → do L S
L → id = exp, exp L′
L′ → , exp
L′ → ǫ
We can also use EBNF:
S → do L S
L → id = exp, exp (, exp)?
03-55: Left Factoring
In general, if we have rules of the form:
S → α β1
S → α β2
...
S → α βn
We can left factor these rules to get:
S→αB
B → β1
B → β2
...
B → βn
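Left factoring can also be automated. The Java sketch below is illustrative, with invented names (LeftFactor, commonPrefix, factor). It assumes every alternative passed in shares a non-empty common prefix α, names the new non-terminal by appending ' to the original, and represents ǫ as an empty list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LeftFactor {
    // Longest sequence of symbols that every alternative starts with.
    static List<String> commonPrefix(List<List<String>> alternatives) {
        List<String> prefix = new ArrayList<>(alternatives.get(0));
        for (List<String> alt : alternatives) {
            int n = 0;
            while (n < prefix.size() && n < alt.size() && prefix.get(n).equals(alt.get(n))) n++;
            prefix = new ArrayList<>(prefix.subList(0, n));
        }
        return prefix;
    }

    // Turns S -> alpha beta1 | ... | alpha betaN into S -> alpha S', S' -> beta1 | ... | betaN.
    static Map<String, List<List<String>>> factor(String nt, List<List<String>> alternatives) {
        List<String> alpha = commonPrefix(alternatives);
        String tail = nt + "'";
        List<String> head = new ArrayList<>(alpha);
        head.add(tail);
        List<List<String>> rests = new ArrayList<>();
        for (List<String> alt : alternatives) {
            rests.add(new ArrayList<>(alt.subList(alpha.size(), alt.size()))); // beta_i
        }
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        result.put(nt, List.of(head));
        result.put(tail, rests);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(factor("L", List.of(
            List.of("id", "=", "exp", ",", "exp"),
            List.of("id", "=", "exp", ",", "exp", ",", "exp"))));
    }
}
```

Applied to the two rules for L in the Fortran DO grammar, this gives L → id = exp, exp L′ with L′ → ǫ and L′ → , exp.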
03-56: Building an LL(1) Parser
Create a CFG for the language
Remove ambiguity from the CFG, remove left
recursion, and left-factor it
Find First/Follow sets for all non-terminals
Build the LL(1) parse table
Use the parse table to create a suite of mutually
recursive functions
03-57: Building an LL(1) Parser
Create an EBNF for the language
Remove ambiguity from the EBNF, remove left
recursion, and left-factor it
Find First/Follow sets for all non-terminals
Build the LL(1) parse table
Use the parse table to create a suite of mutually
recursive functions
Or skip the last three steps and use a parser generator
tool that converts the EBNF into parsing functions
03-58: Structure JavaCC file foo.jj
options {
    /* Code to set various options flags */
}
PARSER_BEGIN(foo)
public class foo {
    /* Java code for the parser class (may be empty) */
}
PARSER_END(foo)
TOKEN_MGR_DECLS :
{
    /* Declarations used by lexical analyzer */
}
void nonTerminalName() :
{ /* Java Declarations */ }
{
    /* Rule definition */
}
For now, the Java Declarations section will be
empty (we will use it later on, when building parse
trees)
Non-terminals in JavaCC rules are followed by ()
Terminals in JavaCC rules are between < and >
03-60: JavaCC Rules
For example, the CFG rules:
S → while (E ) S
S → V = E;
Would be represented by the JavaCC rule:
void statement() :
{}
{
<WHILE> <LPAREN> expression() <RPAREN> statement()
| variable() <GETS> expression() <SEMICOLON>
}
03-61: Example JavaCC Files
We now examine a JavaCC file that parses prefix
expressions
Prefix expression examples:
Prefix expression    Equivalent infix expression
+34                  3 + 4
+-215                (2 - 1) + 5
+-34*56              (3 - 4) + (5 * 6)
+-*3456              ((3 * 4) - 5) + 6
+3-4*56              3 + (4 - (5 * 6))
Come up with a CFG for Prefix expressions
03-62: Example JavaCC Files
CFG for prefix expressions:
E→+EE
E→-EE
E→*EE
E→/EE
E → num
03-63: Example JavaCC Files
In JavaCC format:
void expression():
{}
{
<PLUS> expression() expression()
| <MINUS> expression() expression()
| <TIMES> expression() expression()
| <DIVIDE> expression() expression()
| <INTEGER_LITERAL>
}
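For comparison, a hand-written recursive-descent function with the same shape as the expression() rule might look like the Java sketch below. This is not JavaCC output. It assumes single-digit numbers with no whitespace, as in the examples above, and it evaluates the expression while parsing so that results can be checked against the infix column. The class name PrefixExpression and the method evaluate are invented for this example.

```java
class PrefixExpression {
    private final String input; // one character per token, single-digit numbers only
    private int pos = 0;

    PrefixExpression(String input) { this.input = input; }

    // E -> + E E | - E E | * E E | / E E | num
    int expression() {
        if (pos >= input.length()) {
            throw new IllegalStateException("Parse Error: unexpected end of input");
        }
        char token = input.charAt(pos++);
        if (token >= '0' && token <= '9') return token - '0';   // E -> num
        if ("+-*/".indexOf(token) < 0) {
            throw new IllegalStateException("Parse Error at '" + token + "'");
        }
        int left = expression();   // first operand
        int right = expression();  // second operand
        switch (token) {
            case '+': return left + right;
            case '-': return left - right;
            case '*': return left * right;
            default:  return left / right; // '/', integer division
        }
    }

    // Parses the whole input and returns its value.
    static int evaluate(String input) {
        PrefixExpression parser = new PrefixExpression(input);
        int value = parser.expression();
        if (parser.pos != input.length()) {
            throw new IllegalStateException("Parse Error: trailing input");
        }
        return value;
    }
}
```

With this sketch, evaluate("+-34*56") returns 29, the value of (3 - 4) + (5 * 6). One token is always enough to pick the alternative, since each alternative starts with a different terminal.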
void S():
{}
{ "a" "b" "c"
| "a" "d" "c"
}
Is this grammar LL(1)?
03-65: Lookahead
A LOOKAHEAD directive, placed at a choice point,
will allow JavaCC to use a lookahead > 1:
void S():
{}
{ LOOKAHEAD(2) "a" "b" "c"
| "a" "d" "c"
}
ParseS will look at the next two symbols, before
deciding which rule for S to use
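A hand-written analogue of this decision is sketched below in Java. It only approximates what the generated parser does and is not JavaCC's actual code: the class name TwoTokenLookahead and the helpers peek, match, and accepts are invented for this example.

```java
import java.util.List;

class TwoTokenLookahead {
    private final List<String> tokens;
    private int pos = 0;

    TwoTokenLookahead(List<String> tokens) { this.tokens = tokens; }

    // Token k positions ahead (k = 1 is the next unread token), or "" past the end.
    private String peek(int k) {
        int i = pos + k - 1;
        return i < tokens.size() ? tokens.get(i) : "";
    }

    private void match(String expected) {
        if (!peek(1).equals(expected)) {
            throw new IllegalStateException("Parse Error: expected " + expected);
        }
        pos++;
    }

    // S -> "a" "b" "c" | "a" "d" "c", choosing by the next two tokens.
    void parseS() {
        if (peek(1).equals("a") && peek(2).equals("b")) {
            match("a"); match("b"); match("c");
        } else {
            match("a"); match("d"); match("c");
        }
    }

    // True if the whole token list is derived from S.
    static boolean accepts(List<String> tokens) {
        TwoTokenLookahead parser = new TwoTokenLookahead(tokens);
        try {
            parser.parseS();
            return parser.pos == tokens.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

With one token of lookahead both alternatives look the same, because each starts with "a". The second token, "b" or "d", is what separates them.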
03-66: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.
void S():
{}
{
"A" (("B" "C") | ("B" "D"))
}
03-67: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.
void S():
{}
{
"A" ( LOOKAHEAD(2) ("B" "C") | ("B" "D"))
}
03-69: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.
void S():
{}
{
LOOKAHEAD(2) "A" (("B" "C") | ("B" "D"))
}
void S() :
{}
{
"a" "b" "c"
| "a" "b" "d"
}
03-72: JavaCC & Non-LL(k)
Infamous dangling else
void statement():
{}
{
<IF> expression() <THEN> statement()
| <IF> expression() <THEN> statement() <ELSE> statement()
| /* Other statement definitions */
}