
Lecture3 Java

The document discusses parsing and recursive descent parsers. It begins by explaining that parsing involves determining if a sequence of tokens forms a syntactically correct program according to a context-free grammar. A recursive descent parser is then introduced as a parser implemented as a suite of recursive functions, one for each non-terminal in the grammar. The document provides examples of grammars and walks through calculating first and follow sets and building parse tables for LL(1) parsing.

Uploaded by

Zerihun Bekele
Copyright
© © All Rights Reserved

Compilers

CS414-2017-03
Parsing
David Galles

Department of Computer Science
University of San Francisco
03-0: Parsing
Once we have broken an input file into a sequence
of tokens, the next step is to determine whether that
sequence of tokens forms a syntactically correct
program. This step is called parsing.
Parsing a sequence of tokens means determining whether the
string of tokens could be generated by a
Context-Free Grammar (CFG).
03-1: CFG Example
S → print(E);
S → while (E) S
S → { L }
E → identifier
E → integer_literal
L → S L
L → ε
Examples / Parse Trees
03-2: Recursive Descent Parser
Write Java code that repeatedly calls
getNextToken(), and determines if the stream of
returned tokens can be generated by the CFG
If so, end normally
If not, call an error function
03-3: Recursive Descent Parser
A Recursive Descent Parser is implemented as a suite
of recursive functions, one for each non-terminal in the
grammar:
ParseS will terminate normally if the next tokens in
the input stream can be derived from the
non-terminal S
ParseL will terminate normally if the next tokens
in the input stream can be derived from the
non-terminal L
ParseE will terminate normally if the next tokens
in the input stream can be derived from the
non-terminal E
03-4: Recursive Descent Parser
S → print(E);
S → while (E) S
S → { L }
E → identifier
E → integer_literal
L → S L
L → ε
Code for Parser.java.html on web browser
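The Parser.java file shown in the lecture is not included in these notes. As a stand-in, here is a minimal Java sketch of a recursive descent parser for the grammar above. The class name StmtParser, the whitespace-separated token format, and the helpers checkToken, error and accepts are assumptions made for this illustration, not the course's code.

```java
import java.util.Arrays;
import java.util.Iterator;

class StmtParser {
    private final Iterator<String> tokens;
    private String current;   // one symbol of lookahead; "$" marks end of input

    StmtParser(String input) {
        // Stand-in for the lexer: tokens are assumed to be separated by whitespace.
        tokens = Arrays.asList(input.trim().split("\\s+")).iterator();
        getNextToken();
    }

    private void getNextToken() { current = tokens.hasNext() ? tokens.next() : "$"; }

    private void error(String msg) { throw new IllegalStateException("Parse Error: " + msg); }

    private void checkToken(String expected) {
        if (!current.equals(expected)) error("expected " + expected + " but saw " + current);
        getNextToken();
    }

    private boolean startsStatement() {
        return current.equals("print") || current.equals("while") || current.equals("{");
    }

    void parseS() {
        switch (current) {
            case "print":   // S -> print ( E ) ;
                checkToken("print"); checkToken("("); parseE(); checkToken(")"); checkToken(";");
                break;
            case "while":   // S -> while ( E ) S
                checkToken("while"); checkToken("("); parseE(); checkToken(")"); parseS();
                break;
            case "{":       // S -> { L }
                checkToken("{"); parseL(); checkToken("}");
                break;
            default:
                error("a statement cannot start with " + current);
        }
    }

    void parseL() {
        // L -> S L when the lookahead can start a statement, otherwise L -> epsilon.
        if (startsStatement()) { parseS(); parseL(); }
    }

    void parseE() {
        // E -> identifier | integer_literal (keywords are not identifiers)
        boolean keyword = current.equals("print") || current.equals("while");
        if (!keyword && (current.matches("[A-Za-z_]\\w*") || current.matches("\\d+"))) getNextToken();
        else error("expected identifier or integer literal but saw " + current);
    }

    // Returns true if the whole input can be derived from S.
    static boolean accepts(String input) {
        try {
            StmtParser p = new StmtParser(input);
            p.parseS();
            p.checkToken("$");
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

For example, StmtParser.accepts("while ( x ) { print ( 1 ) ; }") ends normally, while StmtParser.accepts("print ( x )") reports a parse error because the semicolon is missing.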
03-5: LL(1) Parsers
These recursive descent parsers are also known as
LL(1) parsers, for Left-to-right, Leftmost derivation, with
1 symbol lookahead
The input file is read from left to right (starting with
the first symbol in the input stream, and
proceeding to the last symbol).
The parser ensures that a string can be derived by
the grammar by building a leftmost derivation.
Which rule to apply at each step is decided
after looking at just 1 symbol.
03-6: Building LL(1) Parsers

S′ → S$
S → AB
S → Ch
A → ef
A → ε
B → hg
C → DD
C → fi
D → g
ParseS uses the rule S → AB on e, h
and the rule S → Ch on f, g
03-7: First sets
First(S) is the set of all terminals that can start strings
derived from S (plus ε, if S can produce ε)
S′ → S$     First(S′) =
S → AB      First(S) =
S → Ch      First(A) =
A → ef      First(B) =
A → ε       First(C) =
B → hg      First(D) =
C → DD
C → fi
D → g
03-8: First sets
First(S) is the set of all terminals that can start strings
derived from S (plus ε, if S can produce ε)
S′ → S$     First(S′) = {e, f, g, h}
S → AB      First(S) = {e, f, g, h}
S → Ch      First(A) = {e, ε}
A → ef      First(B) = {h}
A → ε       First(C) = {f, g}
B → hg      First(D) = {g}
C → DD
C → fi
D → g
03-9: First sets
We can expand the definition of First sets to include
strings of terminals and non-terminals
S′ → S$     First(aB) =
S → AB      First(BC) =
S → Ch      First(AbC) =
A → ef      First(AC) =
A → ε       First(abS) =
B → hg      First(DDA) =
C → DD
C → fi
D → g
03-10: First sets
We can expand the definition of First sets to include
strings of terminals and non-terminals
S′ → S$     First(aB) = {a}
S → AB      First(BC) = {h}
S → Ch      First(AbC) = {e, b}
A → ef      First(AC) = {e, f, g}
A → ε       First(abS) = {a}
B → hg      First(DDA) = {g}
C → DD
C → fi
D → g
03-11: Calculating First sets
For each non-terminal S, set First(S) = {}
For each rule of the form S → γ, add First(γ) to
First(S)
Repeat until no changes are made
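The fixed-point loop above can be sketched in Java as follows. This is an illustration under stated assumptions, not the course's implementation: the class name FirstSets, the map-of-lists grammar encoding (an empty list stands for an ε rule), and the helper firstOf are all invented here. It needs Java 9 or later for List.of.

```java
import java.util.*;

class FirstSets {
    static final String EPS = "\u03b5";   // marker for epsilon inside a First set

    // First of a string of symbols, using the First sets computed so far.
    // Any symbol without an entry in `first` is treated as a terminal.
    static Set<String> firstOf(List<String> gamma, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String sym : gamma) {
            if (!first.containsKey(sym)) {          // terminal: it starts the string
                result.add(sym);
                return result;
            }
            for (String t : first.get(sym)) if (!t.equals(EPS)) result.add(t);
            if (!first.get(sym).contains(EPS)) return result;   // sym cannot vanish
        }
        result.add(EPS);                            // every symbol can produce epsilon
        return result;
    }

    static Map<String, Set<String>> compute(Map<String, List<List<String>>> rules) {
        Map<String, Set<String>> first = new TreeMap<>();
        for (String nt : rules.keySet()) first.put(nt, new TreeSet<>());   // First(S) = {}
        boolean changed = true;
        while (changed) {                           // repeat until no changes are made
            changed = false;
            for (String lhs : rules.keySet())
                for (List<String> gamma : rules.get(lhs))
                    if (first.get(lhs).addAll(firstOf(gamma, first))) changed = true;
        }
        return first;
    }

    // The example grammar from these slides.
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("S'", List.of(List.of("S", "$")));
        g.put("S", List.of(List.of("A", "B"), List.of("C", "h")));
        g.put("A", List.of(List.of("e", "f"), List.<String>of()));   // A -> ef | epsilon
        g.put("B", List.of(List.of("h", "g")));
        g.put("C", List.of(List.of("D", "D"), List.of("f", "i")));
        g.put("D", List.of(List.of("g")));
        return g;
    }
}
```

Calling FirstSets.compute(FirstSets.exampleGrammar()) should reproduce the sets shown earlier, for instance First(S) = {e, f, g, h} and First(A) = {e, ε}.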
03-12: Calculating First sets
Example:

S′ → S$
S → AB
S → Ch
A → ef
A → ε
B → hg
C → DD
C → fi
D → g
03-13: Follow Sets
Follow(S) is the set of all terminals that can follow S in
a (partial) derivation.
S′ → S$     Follow(S′) =
S → AB      Follow(S) =
S → Ch      Follow(A) =
A → ef      Follow(B) =
A → ε       Follow(C) =
B → hg      Follow(D) =
C → DD
C → fi
D → g
03-14: Follow Sets
Follow(S) is the set of all terminals that can follow S in
a (partial) derivation.
S′ → S$     Follow(S′) = { }
S → AB      Follow(S) = {$}
S → Ch      Follow(A) = {h}
A → ef      Follow(B) = {$}
A → ε       Follow(C) = {h}
B → hg      Follow(D) = {h, g}
C → DD
C → fi
D → g
03-15: Calculating Follow sets
For each non-terminal S, set Follow(S) = {}
For each rule of the form S → γ
  For each non-terminal S1 in γ, where γ is of the
  form αS1β
    If First(β) does not contain ε, add all
    elements of First(β) to Follow(S1).
    If First(β) does contain ε, add all elements of
    First(β) except ε to Follow(S1), and add all
    elements of Follow(S) to Follow(S1).
If any changes were made, repeat.
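A possible Java rendering of this procedure is sketched below. It assumes the First sets are already known and passes them in as a map; the class name FollowSets, the grammar encoding (an empty list stands for an ε rule) and the helper firstOf are assumptions of this sketch. When β is empty, First(β) is taken to be {ε}, so Follow(S) is added to Follow(S1).

```java
import java.util.*;

class FollowSets {
    static final String EPS = "\u03b5";   // marker for epsilon inside a First set

    // First of a symbol string; symbols without an entry in `first` are terminals.
    static Set<String> firstOf(List<String> beta, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String sym : beta) {
            if (!first.containsKey(sym)) { result.add(sym); return result; }
            for (String t : first.get(sym)) if (!t.equals(EPS)) result.add(t);
            if (!first.get(sym).contains(EPS)) return result;
        }
        result.add(EPS);   // beta is empty, or every symbol in it can vanish
        return result;
    }

    static Map<String, Set<String>> compute(Map<String, List<List<String>>> rules,
                                            Map<String, Set<String>> first) {
        Map<String, Set<String>> follow = new TreeMap<>();
        for (String nt : rules.keySet()) follow.put(nt, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String lhs : rules.keySet()) {
                for (List<String> gamma : rules.get(lhs)) {
                    for (int i = 0; i < gamma.size(); i++) {
                        String s1 = gamma.get(i);
                        if (!rules.containsKey(s1)) continue;   // terminals have no Follow set
                        Set<String> fb = firstOf(gamma.subList(i + 1, gamma.size()), first);
                        boolean betaCanVanish = fb.remove(EPS);
                        if (betaCanVanish) fb.addAll(follow.get(lhs));   // inherit Follow(S)
                        if (follow.get(s1).addAll(fb)) changed = true;
                    }
                }
            }
        }
        return follow;
    }

    // Follow sets for the example grammar, using the First sets listed in these slides.
    static Map<String, Set<String>> example() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("S'", List.of(List.of("S", "$")));
        g.put("S", List.of(List.of("A", "B"), List.of("C", "h")));
        g.put("A", List.of(List.of("e", "f"), List.<String>of()));
        g.put("B", List.of(List.of("h", "g")));
        g.put("C", List.of(List.of("D", "D"), List.of("f", "i")));
        g.put("D", List.of(List.of("g")));
        Map<String, Set<String>> first = Map.of(
            "S'", Set.of("e", "f", "g", "h"), "S", Set.of("e", "f", "g", "h"),
            "A", Set.of("e", EPS), "B", Set.of("h"),
            "C", Set.of("f", "g"), "D", Set.of("g"));
        return compute(g, first);
    }
}
```

In C → DD the first D is followed by First(D) = {g}, and the second D has nothing after it, so it inherits Follow(C) = {h}. That is how Follow(D) ends up as {h, g}.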
03-16: Calculating Follow sets
Example:

S′ → S$
S → AB
S → Ch
A → ef
A → ε
B → hg
C → DD
C → fi
D → g
03-17: Parse Tables
Each row in a parse table is labeled with a
non-terminal
Each column in a parse table is labeled with a
terminal
Each element in a parse table is either empty, or
contains a grammar rule
Rule S → γ goes in row S, in all columns of
First(γ).
If First(γ) contains ε, then the rule S → γ also goes
in row S, in all columns of Follow(S).
03-18: Parse Table Example
S′ → S$     First(S′) = {e, f, g, h}    Follow(S′) = { }
S → AB      First(S) = {e, f, g, h}     Follow(S) = {$}
S → Ch      First(A) = {e, ε}           Follow(A) = {h}
A → ef      First(B) = {h}              Follow(B) = {$}
A → ε       First(C) = {f, g}           Follow(C) = {h}
B → hg      First(D) = {g}              Follow(D) = {h, g}
C → DD
C → fi
D → g
03-19: Parse Table Example

      e          f          g          h          i
S′    S′ → S$    S′ → S$    S′ → S$    S′ → S$
S     S → AB     S → Ch     S → Ch     S → AB
A     A → ef                           A → ε
B                                      B → hg
C                C → fi     C → DD
D                           D → g
03-20: Parse Table Example
void ParseA() {
  switch(currentToken) {
    case e:
      checkToken(e);
      checkToken(f);
      break;
    case h:
      /* epsilon case */
      break;
    default:
      error("Parse Error");
  }
}

void ParseC() {
  switch(currentToken) {
    case f:
      checkToken(f);
      checkToken(i);
      break;
    case g:
      ParseD();
      ParseD();
      break;
    default:
      error("Parse Error");
  }
}
03-21: LL(1) Parser Example
Z′ → Z$
Z → XYZ | d
X → a | Y
Y → ε | c

(Initial Symbol = Z′)
03-22: LL(1) Parser Example
Z′ → Z$        First(Z′) = {a, c, d}
Z → XYZ | d    First(Z) = {a, c, d}
X → a | Y      First(X) = {a, c, ε}
Y → ε | c      First(Y) = {c, ε}

(Initial Symbol = Z′)
03-23: LL(1) Parser Example
Z′ → Z$        First(Z′) = {a, c, d}    Follow(Z′) = { }
Z → XYZ | d    First(Z) = {a, c, d}     Follow(Z) = {$}
X → a | Y      First(X) = {a, c, ε}     Follow(X) = {a, c, d}
Y → ε | c      First(Y) = {c, ε}        Follow(Y) = {a, c, d}

(Initial Symbol = Z′)
03-24: LL(1) Parser Example
Z′ → Z$        First(Z′) = {a, c, d}    Follow(Z′) = { }
Z → XYZ | d    First(Z) = {a, c, d}     Follow(Z) = {$}
X → a | Y      First(X) = {a, c, ε}     Follow(X) = {a, c, d}
Y → ε | c      First(Y) = {c, ε}        Follow(Y) = {a, c, d}

      a            c            d
Z′    Z′ → Z$      Z′ → Z$      Z′ → Z$
Z     Z → XYZ      Z → XYZ      Z → XYZ
                                Z → d
X     X → a        X → Y        X → Y
      X → Y
Y     Y → ε        Y → c        Y → ε
                   Y → ε
03-25: non-LL(1) Grammars
Not all grammars can be parsed by an LL(1) parser
A grammar is LL(1) if the LL(1) parse table
contains no duplicate entries
Previous CFG is not LL(1)
03-26: non-LL(1) Grammars
Not all grammars can be parsed by an LL(1) parser
A grammar is LL(1) if the LL(1) parse table
contains no duplicate entries
Previous CFG is not LL(1)
No ambiguous grammar is LL(1)
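The table-filling rules and the duplicate-entry test can be sketched together in Java. The sketch below takes the First and Follow sets as given (here, the ones listed for the Z grammar), fills each cell with a list of rules, and reports the grammar as LL(1) only if no cell holds more than one rule. ParseTableBuilder, build, isLL1 and zExample are names invented for this illustration.

```java
import java.util.*;

class ParseTableBuilder {
    static final String EPS = "\u03b5";   // marker for epsilon

    // First of a symbol string; symbols without an entry in `first` are terminals.
    static Set<String> firstOf(List<String> gamma, Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String sym : gamma) {
            if (!first.containsKey(sym)) { result.add(sym); return result; }
            for (String t : first.get(sym)) if (!t.equals(EPS)) result.add(t);
            if (!first.get(sym).contains(EPS)) return result;
        }
        result.add(EPS);
        return result;
    }

    // table.get(nonTerminal).get(terminal) is the list of rules placed in that cell.
    static Map<String, Map<String, List<String>>> build(
            Map<String, List<List<String>>> rules,
            Map<String, Set<String>> first, Map<String, Set<String>> follow) {
        Map<String, Map<String, List<String>>> table = new LinkedHashMap<>();
        for (String lhs : rules.keySet()) {
            Map<String, List<String>> row = new TreeMap<>();
            table.put(lhs, row);
            for (List<String> gamma : rules.get(lhs)) {
                String rule = lhs + " -> " + (gamma.isEmpty() ? EPS : String.join(" ", gamma));
                Set<String> columns = firstOf(gamma, first);               // columns of First(gamma)
                if (columns.remove(EPS)) columns.addAll(follow.get(lhs));  // plus Follow(lhs)
                for (String t : columns) row.computeIfAbsent(t, k -> new ArrayList<>()).add(rule);
            }
        }
        return table;
    }

    static boolean isLL1(Map<String, Map<String, List<String>>> table) {
        for (Map<String, List<String>> row : table.values())
            for (List<String> cell : row.values())
                if (cell.size() > 1) return false;   // duplicate entry
        return true;
    }

    // The Z grammar from these slides, with its First and Follow sets.
    static Map<String, Map<String, List<String>>> zExample() {
        Map<String, List<List<String>>> g = new LinkedHashMap<>();
        g.put("Z'", List.of(List.of("Z", "$")));
        g.put("Z", List.of(List.of("X", "Y", "Z"), List.of("d")));
        g.put("X", List.of(List.of("a"), List.of("Y")));
        g.put("Y", List.of(List.<String>of(), List.of("c")));
        Map<String, Set<String>> first = Map.of(
            "Z'", Set.of("a", "c", "d"), "Z", Set.of("a", "c", "d"),
            "X", Set.of("a", "c", EPS), "Y", Set.of("c", EPS));
        Map<String, Set<String>> follow = Map.of(
            "Z'", Set.<String>of(), "Z", Set.of("$"),
            "X", Set.of("a", "c", "d"), "Y", Set.of("a", "c", "d"));
        return build(g, first, follow);
    }
}
```

For the Z grammar, the cell in row X, column a receives both X → a and X → Y, so isLL1 returns false.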
03-27: LL(1) Parser Example
S′ → S$
S → if E then S else S
S → begin L end
S → print(E)

L → ε
L → SL′
L′ → ; SL′
L′ → ε

E → num = num
03-28: LL(1) Parser Example
Non-Terminal   First                   Follow
S′             {if, begin, print}      { }
S              {if, begin, print}      {$, else, end, ;}
L              {ε, if, begin, print}   {end}
L′             {ε, ;}                  {end}
E              {num}                   {then, )}
03-29: LL(1) Parser Example
      if                        then   else   begin             end       print
S′    S′ → S$                                 S′ → S$                     S′ → S$
S     S → if E then S else S                  S → begin L end             S → print(E)
L     L → SL′                                 L → SL′           L → ε     L → SL′
L′                                                              L′ → ε
E

      (      )      ;             num              =
S′
S
L
L′                  L′ → ; SL′
E                                 E → num = num
03-30: LL(1) Parser Example
S′ → S$
S → ABC
A → a
A → ε
B → b
B → ε
C → c
C → ε
03-31: LL(1) Parser Example

Non-terminal   First           Follow
S′             {a, b, c, $}    { }
S              {a, b, c, ε}    {$}
A              {a, ε}          {b, c, $}
B              {b, ε}          {c, $}
C              {c, ε}          {$}

      a          b          c          $
S′    S′ → S$    S′ → S$    S′ → S$    S′ → S$
S     S → ABC    S → ABC    S → ABC    S → ABC
A     A → a      A → ε      A → ε      A → ε
B                B → b      B → ε      B → ε
C                           C → c      C → ε
03-32: Creating an LL(1) CFG
Not all grammars are LL(1)
We can often modify a CFG that is not LL(1)
New CFG generates the same language as the
old CFG
New CFG is LL(1)
03-33: Creating an LL(1) CFG
Remove Ambiguity
No ambiguous grammar is LL(1)
A grammar is ambiguous if some string has two different
parse trees (two ways to generate the same string)
If there are two ways to generate a string α,
modify the CFG so that one of the ways to
generate the string is removed
03-34: Removing Ambiguity
Often a grammar is ambiguous when there is a
special case rule that can be generated by a
general rule
Solution: Remove the special case, let the general
case handle it

S → V = E;
S → V = identifier;
E → V
E → integer_literal
V → identifier
Structured variable definitions commonly have this problem
03-35: Removing Ambiguity
This grammar, for describing variable accesses, is
also ambiguous
Ambiguous in the same way as the expression
CFG
Can be made unambiguous in a similar fashion

V → V . V
V → identifier
03-36: Removing Ambiguity
Some languages are inherently ambiguous
A language L is inherently ambiguous if:
For each CFG G that generates L, G is
ambiguous
No programming languages are inherently
ambiguous
03-37: Left Recursion
S → Sα
S→β
Any CFG that contains these rules (where α and β
are any string of terminals and non-terminals) is
not LL(1)
Why?
03-38: Left Recursion
S → Sa
S→b
What should ParseS() do on a b?
03-39: Left Recursion
S → Sa
S→b
      First   Follow
S     {b}     {a}

      a       b
S             S → Sa
              S → b
03-40: Left Recursion
S → Sa
S→b

What strings can be derived from this grammar?


03-41: Left Recursion
S → Sa
S→b

What strings can be derived from this grammar?

S ⇒ b
S ⇒ Sa ⇒ ba
S ⇒ Sa ⇒ Saa ⇒ baa
S ⇒ Sa ⇒ Saa ⇒ Saaa ⇒ baaa
S ⇒ Sa ⇒ Saa ⇒ Saaa ⇒ Saaaa ⇒ baaaa
03-42: Removing Left Recursion

S → Sa
S→b

What strings can be derived from this grammar?


A b, followed by zero or more a’s.
What is a CFG for this language that does not use
Left Recursion?
03-43: Removing Left Recursion

S → Sa
S→b

What strings can be derived from this grammar?


A b, followed by zero or more a’s.
What is a CFG for this language that does not use
Left Recursion?

S → bS′
S′ → aS′
S′ → ε
03-44: Removing Left Recursion

S → Sa
S→b

What strings can be derived from this grammar?


A b, followed by zero or more a’s.
What is an EBNF for this language that does not
use Left Recursion?
03-45: Removing Left Recursion

S → Sa
S→b

What strings can be derived from this grammar?


A b, followed by zero or more a’s.
What is an EBNF for this language that does not
use Left Recursion?

S → b(a)*
03-46: Removing Left Recursion
In general, if you have rules of the form:
S → Sα
S → β
You can replace them with the CFG rules
S → βS′
S′ → αS′
S′ → ε
or the EBNF rules
S → β(α)*
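Both rewritten forms translate directly into code. The Java sketch below parses the language of S → Sa, S → b (a b followed by zero or more a's) twice: once with one function per non-terminal for S → bS′, S′ → aS′ | ε, and once with a loop for the EBNF rule S → b(a)*. The class name LeftRecursionDemo and the one-character-per-token input are assumptions of this sketch.

```java
class LeftRecursionDemo {
    private final String input;   // assumption: every character is one token
    private int pos = 0;

    LeftRecursionDemo(String input) { this.input = input; }

    // '$' stands for the end of the input.
    private char current() { return pos < input.length() ? input.charAt(pos) : '$'; }

    private void checkToken(char expected) {
        if (current() != expected) throw new IllegalStateException("Parse Error at position " + pos);
        pos++;
    }

    // S -> b S'
    void parseS() {
        checkToken('b');
        parseSPrime();
    }

    // S' -> a S' | epsilon
    void parseSPrime() {
        if (current() == 'a') {
            checkToken('a');
            parseSPrime();
        }
        // otherwise take the epsilon rule and consume nothing
    }

    // EBNF version: S -> b (a)*
    void parseSEbnf() {
        checkToken('b');
        while (current() == 'a') checkToken('a');
    }

    static boolean accepts(String s, boolean useEbnf) {
        try {
            LeftRecursionDemo p = new LeftRecursionDemo(s);
            if (useEbnf) p.parseSEbnf(); else p.parseS();
            p.checkToken('$');   // the whole input must be consumed
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Unlike a ParseS written for S → Sa, neither version calls itself before consuming a token, so the recursion always makes progress.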
03-47: Removing Left Recursion
What about:
S → Sα1
S → Sα2
S → β1
S → β2
03-48: Removing Left Recursion
What about:
S → Sα1
S → Sα2
S → β1
S → β2
S → BA
B → β1
B → β2
A → α1 A
A → α2 A
A → ε
We can use the same method for arbitrarily
complex grammars
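The transformation is mechanical enough to automate. The Java sketch below handles immediate left recursion for a single non-terminal. Note that it emits the S → βS′ form from the general rule rather than introducing separate B and A non-terminals, which generates the same language. The class name LeftRecursionRemover, the list-of-symbols encoding (an empty list stands for ε) and the primed name for the new non-terminal are assumptions, and every α is assumed to be non-empty.

```java
import java.util.*;

class LeftRecursionRemover {
    // Rewrites  S -> S a1 | ... | S am | b1 | ... | bn   as
    //           S  -> b1 S' | ... | bn S'
    //           S' -> a1 S' | ... | am S' | epsilon
    static Map<String, List<List<String>>> remove(String s, List<List<String>> productions) {
        String sPrime = s + "'";                      // hypothetical fresh non-terminal name
        List<List<String>> forS = new ArrayList<>();
        List<List<String>> forSPrime = new ArrayList<>();
        for (List<String> rhs : productions) {
            if (!rhs.isEmpty() && rhs.get(0).equals(s)) {
                // S -> S alpha  becomes  S' -> alpha S'
                List<String> alpha = new ArrayList<>(rhs.subList(1, rhs.size()));
                alpha.add(sPrime);
                forSPrime.add(alpha);
            } else {
                // S -> beta  becomes  S -> beta S'
                List<String> beta = new ArrayList<>(rhs);
                beta.add(sPrime);
                forS.add(beta);
            }
        }
        forSPrime.add(new ArrayList<>());             // S' -> epsilon (empty right-hand side)
        Map<String, List<List<String>>> result = new LinkedHashMap<>();
        result.put(s, forS);
        result.put(sPrime, forSPrime);
        return result;
    }
}
```

For example, remove("S", List.of(List.of("S", "a"), List.of("b"))) yields S → b S′ together with S′ → a S′ and S′ → ε.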
03-49: Left Factoring
Consider Fortran DO statements:

Fortran:
do var = initial, final
  loop body
end do
Java Equivalent:
for (var=initial; var <= final; var++) {
  loop body
}
03-50: Left Factoring
Consider Fortran DO statements:

Fortran:
do var = initial, final, inc
  loop body
end do
Java Equivalent:
for (var=initial; var <= final; var+=inc) {
  loop body
}
03-51: Left Factoring
CFG for Fortran DO statements:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
Is this Grammar LL(1)?
03-52: Left Factoring
CFG for Fortran DO statements:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
Is this Grammar LL(1)? No!
The problem is in the rules for L
Two rules for L that start out exactly the same
No way to know which rule to apply when
looking at just 1 symbol
03-53: Left Factoring
Factor out the similar sections from the rules:
S → do L S
L → id = exp, exp
L → id = exp, exp, exp
03-54: Left Factoring
Factor out the similar sections from the rules:
S → do L S
L → id = exp, exp L′
L′ → , exp
L′ → ǫ
We can also use EBNF:
S → do L S
L → id = exp, exp (, exp)?
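After left factoring, one token of lookahead is enough to decide whether the optional third expression is present. A minimal Java sketch of the L and L′ rules follows. It assumes whitespace-separated tokens and, purely for illustration, treats exp as a single identifier or integer token. The class name DoHeaderParser and the method countExpressions (which returns -1 on a parse error) are invented here.

```java
import java.util.Arrays;
import java.util.List;

class DoHeaderParser {
    private final List<String> tokens;
    private int pos = 0;

    DoHeaderParser(String input) { tokens = Arrays.asList(input.trim().split("\\s+")); }

    // "$" stands for the end of the input.
    private String current() { return pos < tokens.size() ? tokens.get(pos) : "$"; }

    private void checkToken(String expected) {
        if (!current().equals(expected))
            throw new IllegalStateException("Parse Error: expected " + expected + " but saw " + current());
        pos++;
    }

    private void parseId() {
        if (!current().matches("[A-Za-z]\\w*")) throw new IllegalStateException("Parse Error: id expected");
        pos++;
    }

    // Simplifying assumption: an expression is one identifier or integer token.
    private void parseExp() {
        if (!current().matches("\\w+")) throw new IllegalStateException("Parse Error: exp expected");
        pos++;
    }

    // L -> id = exp , exp L'      (returns the number of expressions parsed)
    int parseL() {
        parseId(); checkToken("="); parseExp(); checkToken(","); parseExp();
        return 2 + parseLPrime();
    }

    // L' -> , exp | epsilon       (the current token alone picks the rule)
    int parseLPrime() {
        if (current().equals(",")) {
            checkToken(",");
            parseExp();
            return 1;
        }
        return 0;
    }

    static int countExpressions(String header) {
        try {
            DoHeaderParser p = new DoHeaderParser(header);
            int n = p.parseL();
            p.checkToken("$");
            return n;
        } catch (IllegalStateException e) {
            return -1;
        }
    }
}
```

With the original, unfactored rules, ParseL would have to choose between the two alternatives while looking at id, where they are indistinguishable.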
03-55: Left Factoring
In general, if we have rules of the form:
S → α β1
S → α β2
...
S → α βn
We can left factor these rules to get:
S→αB
B → β1
B → β2
...
B → βn
03-56: Building an LL(1) Parser
Create a CFG for the language
Remove ambiguity from the CFG, remove left
recursion, and left-factor it
Find First/Follow sets for all non-terminals
Build the LL(1) parse table
Use the parse table to create a suite of mutually
recursive functions
03-57: Building an LL(1) Parser
Create an EBNF for the language
Remove ambiguity from the EBNF, remove left
recursion, and left-factor it
Find First/Follow sets for all non-terminals
Build the LL(1) parse table
Use the parse table to create a suite of mutually
recursive functions
Use a parser generator tool that converts the
EBNF into parsing functions
03-58: Structure JavaCC file foo.jj
options{
/* Code to set various options flags */
}

PARSER_BEGIN(foo)

public class foo {


/* This segment is often empty */
}

PARSER_END(foo)

TOKEN_MGR_DECLS :
{
/* Declarations used by lexical analyzer */
}

/* Token Rules & Actions */

/* JavaCC Rules and Actions -- EBNF for language*/


03-59: JavaCC Rules
JavaCC rules correspond to EBNF rules
JavaCC rules have the form:

void nonTerminalName() :
{ /* Java Declarations */ }
{ /* Rule definition */
}
For now, the Java Declarations section will be
empty (we will use it later on, when building parse
trees)
Non-terminals in JavaCC rules are followed by ()
Terminals in JavaCC rules are between < and >
03-60: JavaCC Rules
For example, the CFG rules:
S → while (E ) S
S → V = E;
Would be represented by the JavaCC rule:
void statement() :
{}
{
<WHILE> <LPAREN> expression() <RPAREN> statement()
| variable() <GETS> expression() <SEMICOLON>
}
03-61: Example JavaCC Files
We now examine a JavaCC file that parses prefix
expressions
Prefix expression examples:
Prefix expression    Equivalent infix expression
+ 3 4                3 + 4
+ - 2 1 5            (2 - 1) + 5
+ - 3 4 * 5 6        (3 - 4) + (5 * 6)
+ - * 3 4 5 6        ((3 * 4) - 5) + 6
+ 3 - 4 * 5 6        3 + (4 - (5 * 6))
Come up with a CFG for Prefix expressions
03-62: Example JavaCC Files
CFG for prefix expressions:
E → + E E
E → - E E
E → * E E
E → / E E
E → num
Pull up JavaCC file on other screen
03-63: Example JavaCC Files
In javacc format:

void expression():
{}
{
<PLUS> expression() expression()
| <MINUS> expression() expression()
| <TIMES> expression() expression()
| <DIVIDE> expression() expression()
| <INTEGER_LITERAL>
}

Put up driver file, do some examples
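The driver file and the full JavaCC source are not reproduced in these notes. As a rough hand-written counterpart, the Java sketch below follows the same grammar with a single recursive method and returns a fully parenthesized infix string. The class name PrefixParser, the method toInfix and the whitespace-separated token format are assumptions of this sketch. It is not what JavaCC generates.

```java
import java.util.Arrays;
import java.util.List;

class PrefixParser {
    private final List<String> tokens;
    private int pos = 0;

    PrefixParser(String input) { tokens = Arrays.asList(input.trim().split("\\s+")); }

    // E -> + E E | - E E | * E E | / E E | num
    String expression() {
        if (pos >= tokens.size()) throw new IllegalStateException("Parse Error: unexpected end of input");
        String tok = tokens.get(pos++);
        switch (tok) {
            case "+": case "-": case "*": case "/":
                // The operator has been consumed; parse its two operands in order.
                String left = expression();
                String right = expression();
                return "(" + left + " " + tok + " " + right + ")";
            default:
                if (!tok.matches("\\d+")) throw new IllegalStateException("Parse Error: " + tok);
                return tok;
        }
    }

    static String toInfix(String input) {
        PrefixParser p = new PrefixParser(input);
        String result = p.expression();
        if (p.pos != p.tokens.size()) throw new IllegalStateException("Parse Error: extra input");
        return result;
    }
}
```

For example, PrefixParser.toInfix("+ - 2 1 5") returns "((2 - 1) + 5)". Every alternative begins with a different token, so the grammar is LL(1) and the switch never needs more than one token of lookahead.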


03-64: Lookahead
Consider the following JavaCC fragment:

void S():
{}
{ "a" "b" "c"
| "a" "d" "c"
}
Is this grammar LL(1)?
03-65: Lookahead
A LOOKAHEAD directive, placed at a choice point,
will allow JavaCC to use a lookahead > 1:

void S():
{}
{ LOOKAHEAD(2) "a" "b" "c"
| "a" "d" "c"
}
ParseS will look at the next two symbols, before
deciding which rule for S to use
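In a hand-written parser, the same decision could look like the Java sketch below. Here peek(k) returns the k-th upcoming token, and the first alternative is chosen only when the next two tokens are a and b. Otherwise the parser falls through to the last alternative. The class name LookaheadDemo, the helper peek and the one-character-per-token input are assumptions of this sketch, which imitates the idea rather than JavaCC's generated code.

```java
class LookaheadDemo {
    private final String input;   // assumption: every character is one token
    private int pos = 0;

    LookaheadDemo(String input) { this.input = input; }

    // peek(1) is the next token, peek(2) the one after it; '$' marks end of input.
    private char peek(int k) {
        int i = pos + k - 1;
        return i < input.length() ? input.charAt(i) : '$';
    }

    private void checkToken(char expected) {
        if (peek(1) != expected) throw new IllegalStateException("Parse Error at position " + pos);
        pos++;
    }

    // S -> a b c | a d c
    void parseS() {
        // Both alternatives start with 'a', so the choice is made on two tokens.
        if (peek(1) == 'a' && peek(2) == 'b') {
            checkToken('a'); checkToken('b'); checkToken('c');
        } else {
            checkToken('a'); checkToken('d'); checkToken('c');
        }
    }

    static boolean accepts(String s) {
        try {
            LookaheadDemo p = new LookaheadDemo(s);
            p.parseS();
            p.checkToken('$');   // the whole input must be consumed
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

With only peek(1) available, both branches would look identical at the choice point, which is why the fragment is not LL(1).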
03-66: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.
void S():
{}
{
"A" (("B" "C") | ("B" "D"))
}
03-67: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.

void S():
{}
{
"A" ( LOOKAHEAD(2) ("B" "C") | ("B" "D"))
}
03-68: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.

void S():
{}
{
"A" (("B" "C") | ("B" "D"))
}
03-69: Lookahead
LOOKAHEAD directives are placed at “choice
points” – places in the grammar where there is
more than one possible rule that can match.

void S():
{}
{
LOOKAHEAD(2) "A" (("B" "C") | ("B" "D"))
}

This is not a valid use of lookahead: the grammar
will not be parsed correctly. Why not?
03-70: JavaCC & Non-LL(k)
JavaCC will produce a parser for grammars that
are not LL(1) (and even for grammars that are not
LL(k), for any k)
The parser that is produced is not guaranteed to
correctly parse the language described by the
grammar
A warning will be issued when JavaCC is run on a
non-LL(1) grammar
03-71: JavaCC & Non-LL(k)
What does JavaCC do for a non-LL(1) grammar?
The rule that appears first will be used

void S() :
{}
{
"a" "b" "c"
| "a" "b" "d"

}
03-72: JavaCC & Non-LL(k)
Infamous dangling else
void statement():
{}
{
<IF> expression() <THEN> statement()
| <IF> expression() <THEN> statement() <ELSE> statement()
| /* Other statement definitions */
}

Why doesn’t this grammar work?


03-73: JavaCC & Non-LL(k)
void statement() :
{}
{
<IF> expression() <THEN> statement() optionalelse()
| /* Other statement definitions */
}
void optionalelse() :
{}
{
<ELSE> statement()
| /* nothing */ { }
}

if <e> then <S>
if <e> then <S> else <S>
if <e> then if <e> then <S> else <S>
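To see how the optionalelse() version resolves the third example, here is a small Java sketch that mirrors the two rules by hand and returns a bracketed string showing which if each else attaches to. The class name DanglingElseDemo and the toy token set (if, e, then, s, else, separated by whitespace) are assumptions of this sketch. It is not JavaCC output.

```java
import java.util.Arrays;
import java.util.List;

class DanglingElseDemo {
    private final List<String> tokens;
    private int pos = 0;

    DanglingElseDemo(String input) { tokens = Arrays.asList(input.trim().split("\\s+")); }

    // "$" stands for the end of the input.
    private String current() { return pos < tokens.size() ? tokens.get(pos) : "$"; }

    private void checkToken(String expected) {
        if (!current().equals(expected))
            throw new IllegalStateException("Parse Error: expected " + expected + " but saw " + current());
        pos++;
    }

    // statement -> if e then statement optionalelse | s
    String statement() {
        if (current().equals("if")) {
            checkToken("if"); checkToken("e"); checkToken("then");
            String thenPart = statement();
            String elsePart = optionalElse();
            return "[if e then " + thenPart + elsePart + "]";
        }
        checkToken("s");
        return "s";
    }

    // optionalelse -> else statement | nothing
    // The else alternative is tried first, so an else is consumed as soon as it is seen.
    String optionalElse() {
        if (current().equals("else")) {
            checkToken("else");
            return " else " + statement();
        }
        return "";
    }

    static String parse(String input) {
        DanglingElseDemo p = new DanglingElseDemo(input);
        String tree = p.statement();
        p.checkToken("$");
        return tree;
    }
}
```

Parsing "if e then if e then s else s" gives "[if e then [if e then s else s]]": the else is taken by the innermost call that is still waiting in optionalElse, so it attaches to the nearest if.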
03-74: JavaCC & Non-LL(k)
void statement() :
{}
{
<IF> expression() <THEN> statement() optionalelse()
| /* Other statement definitions */
}
void optionalelse() :
{}
{
/* nothing */ { }
| <ELSE> statement()
}

What about this grammar?


03-75: JavaCC & Non-LL(k)
void statement() :
{}
{
<IF> expression() <THEN> statement() optionalelse()
| /* Other statement definitions */
}
void optionalelse() :
{}
{
/* nothing */ { }
| <ELSE> statement()
}

What about this grammar?


Doesn’t work! (why?)
03-76: JavaCC & Non-LL(k)
void statement() :
{}
{
<IF> expression() <THEN> statement() (<ELSE> statement())?
| /* Other statement definitions */
}

This grammar will also work correctly


Also produces a warning
03-77: JavaCC & Non-LL(k)
void statement() :
{}
{
<IF> expression() <THEN> statement()
(LOOKAHEAD(1) <ELSE> statement())?
| /* Other statement definitions */
}

This grammar also works correctly


Produces no warnings
(not because it is any more safe – if you include
a LOOKAHEAD directive, the system assumes
you know what you are doing)
03-78: Parsing Project
For your next project, you will write a parser for
simpleJava using JavaCC.
Provided Files:
ParseTest.java: A main program to test your
parser. You must use program as your starting
non-terminal for ParseTest to work correctly!
test*.sjava: Various simpleJava programs to
test your parser. Some have parsing errors,
some do not. Be sure to test your parser on
other test cases, too! These files are not meant
to be exhaustive!
03-79: Parsing Project “Gotchas”
Expressions can be tricky. Read the text for more
examples and suggestions
Structured variable accesses are similar to
expressions, and have some of the same issues
Avoid specific cases that can be handled by a
more general case
03-80: Parsing Project “Gotchas”
Procedure calls and assignment statements can
be tricky for LL(1) parsers. You may need to
left-factor, and/or use LOOKAHEAD directives
LOOKAHEAD directives are useful, but can be
dangerous (for instance, you will not get warnings
for the sections that use LOOKAHEAD.) Try
left-factoring, or other techniques, before resorting
to LOOKAHEAD.
This project is much more difficult than the lexical
analyzer. Start Early!
