2.2 - Syntax Analysis (Up to Top-Down Parsing)
Introduction
◦All programming languages have certain syntactic structures
◦We need to verify that the source code written for a language is
syntactically valid
◦The validity of the syntax is checked by the syntax analysis phase
◦Syntax is specified using a context-free grammar (CFG), commonly
written in Backus-Naur Form (BNF)
◦Parsing is the act of performing syntax analysis to verify an input
program's compliance with the source language
Introduction
◦The purpose of syntax analysis or parsing is to check that we have a
valid sequence of tokens
◦A by-product of this process is typically a tree that represents the
structure of the program
◦Parsing is the act of checking whether a grammar “accepts” an input
text as valid (according to the grammar rules)
◦It determines the exact correspondence between the text and the rules
of the given grammar
The Role of the Parser
◦Analyzes the context free syntax
◦Generates the parse tree
◦Provides the mechanism for context-sensitive analysis
◦Detects errors and tries to handle them
Types of Parser
◦There are three general types of parsers for grammars: universal,
top-down, and bottom-up
◦Universal Parser
◦Can parse any grammar
◦Too inefficient to use in production compilers
◦Examples of universal parsing methods: the CYK algorithm and Earley’s algorithm
◦The methods commonly used in compilers can be classified as being
either top-down or bottom-up
Types of Parser
◦Top-Down Parser
◦the parse tree is created top to bottom, starting from the root
◦LL for top-down parsing
◦Bottom-Up Parser
◦the parse tree is created bottom to top, starting from the leaves
◦LR for bottom-up parsing
Types of Parser
◦Both top-down and bottom-up parsers scan the input from left to
right (one symbol at a time)
◦Efficient top-down and bottom-up parsers can only be implemented
for sub-classes of context-free grammars
Error Handling
◦Every phase of the compiler is prone to errors
◦Syntax analysis and semantic analysis are the phases that are the most
common sources of errors
◦Syntactic errors include misplaced semicolons or extra or missing
braces; that is, “{”or “}”
◦As another example, in C or Java, the appearance of a case statement
without an enclosing switch is a syntactic error
◦If an error occurs, the process should not terminate; instead, it should
report the error and try to advance
Error Recovery Techniques
◦Panic Mode Recovery
◦With this method, on discovering an error, the parser discards input
symbols one at a time until one of a designated set of synchronizing
tokens is found, say a semicolon
◦It may skip other errors if there is more than one error in the sentence
◦It has the advantage of simplicity, and, unlike some methods, is
guaranteed not to go into an infinite loop
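◦As a rough sketch of the idea in Python (the synchronizing-token set, token representation, and function name below are illustrative assumptions, not from any particular compiler):

# Minimal sketch of panic-mode recovery. The synchronizing-token set
# and the token representation are illustrative assumptions.
SYNC_TOKENS = {";", "}"}  # designated synchronizing tokens

def panic_mode_recover(tokens, pos):
    """On an error at tokens[pos], discard symbols one at a time until a
    synchronizing token is found; return the position to resume parsing."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1  # discard the offending input symbol
    return min(pos + 1, len(tokens))  # resume just past the sync token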
Error Recovery Techniques
◦Phrase-Level Recovery
◦On discovering an error, a parser may perform local correction on
the remaining input; that is, it may replace a prefix of the remaining
input by some string that allows the parser to continue
◦A typical local correction is to replace a comma by a semicolon,
delete an extraneous semicolon, or insert a missing semicolon
◦We must be careful to choose replacements that do not lead to
infinite loops, as would be the case, for example, if we always
inserted something on the input ahead of the current input symbol
Error Recovery Techniques
◦Error Productions
◦By anticipating common errors that might be encountered, we can
augment the grammar for the language at hand with productions
that generate the erroneous constructs
◦A parser constructed from a grammar augmented by these error
productions detects the anticipated errors when an error production
is used during parsing
◦The parser can then generate appropriate error diagnostics about
the erroneous construct that has been recognized in the input
Error Recovery Techniques
◦Global Correction
◦We would like a compiler to make as few changes as possible in
processing an incorrect input string
◦There are algorithms for choosing a minimal sequence of changes to
obtain a globally least-cost correction
◦Given an incorrect input string X and grammar G, these algorithms
will find a parse tree for a related string Y, such that the number of
insertions, deletions, and changes of tokens required to transform X
into Y is as small as possible
Error Recovery Techniques
◦Unfortunately, these methods are in general too costly to implement
in terms of time and space, so these techniques are currently only of
theoretical interest
Context Free Grammar
◦Most programming languages have recursive structures that can be
defined by a context-free grammar (CFG)
◦A CFG can be defined as a 4-tuple (V, T, P, S), where
◦V → a finite set of variables or non-terminals used to define the
grammar; each denotes a combination of terminals, non-terminals, or both
◦T → the terminals, the basic symbols of the sentences; they are
indivisible units
◦P → the production rules, each defining a combination of terminals
and/or non-terminals for a particular non-terminal
◦S → a special non-terminal called the start symbol
◦Example: grammar for palindromes over the binary alphabet
S → 0S0 | 1S1
S → 0 | 1 | ε
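◦For instance, the even-length palindrome 0110 is derived as
S ⇒ 0S0 ⇒ 01S10 ⇒ 0110 (applying S → ε in the last step), while
S ⇒ 0S0 ⇒ 01S10 ⇒ 01110 (applying S → 1) gives an odd-length palindrome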
Derivation
◦Consider the grammar: E → E + E | E * E | (E) | -E | id
◦For id * id + id,
E ⇒ E * E ⇒ id * E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + id
◦This is a leftmost derivation: at each step, the leftmost non-terminal is replaced
Parse Tree
◦The pictorial representation of the derivation can be depicted using
the parse tree
◦In parse tree internal nodes represent non-terminals and the leaves
represent terminals
◦Consider the grammar above: E → E + E | E * E | (E) | -E | id
◦The string -(id + id) is a sentence of this grammar because there is a
derivation:
◦E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Parse Tree
[Figure: parse tree for -(id + id), and the sequence of parse trees
constructed step by step during the derivation]
Ambiguity
◦A grammar is said to be ambiguous if it can produce a sentence in
more than one way
◦If there is more than one parse tree for a sentence or derivation (left
or right) with respect to the given grammar, then the grammar is said
to be ambiguous
Ambiguity
◦Consider the grammar above: E → E + E | E * E | (E) | -E | id
◦Consider the string: id + id * id
◦It has two distinct leftmost derivations:
(a) E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
(b) E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
Eliminating Left Recursion
Input → Grammar G
Output → Equivalent grammar with no left recursion
Arrange the non-terminals in some order A1, A2, ……., An
for i=1 to n do
    for j=1 to i-1 do
        replace each production of the form Ai → Aj γ by the productions
        Ai → α1 γ |…..| αk γ,
        where Aj → α1|…..| αk are all the current Aj-productions
    end do
    eliminate the immediate left recursion among the Ai-productions
end do
Example
S → Aa | b
A → Ac | Sd | ε
◦Order the non-terminals as S, A; the S-productions have no immediate left recursion
◦For i=2, j=1, substituting S into A → Sd gives A → Ac | Aad | bd | ε
◦Eliminating the immediate left recursion among the A-productions yields:
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
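◦A minimal Python sketch of the immediate-left-recursion step (the grammar representation and helper name are my own assumptions, not from the slides):

# Sketch: eliminate immediate left recursion for one non-terminal.
# A grammar maps a non-terminal to a list of productions; each
# production is a list of symbols, and [] represents epsilon.
def eliminate_immediate(grammar, a):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | eps."""
    alphas = [p[1:] for p in grammar[a] if p and p[0] == a]     # A -> A alpha
    betas  = [p     for p in grammar[a] if not p or p[0] != a]  # A -> beta
    if not alphas:
        return  # no immediate left recursion to remove
    a2 = a + "'"  # fresh non-terminal A'
    grammar[a]  = [beta + [a2] for beta in betas]
    grammar[a2] = [alpha + [a2] for alpha in alphas] + [[]]  # [] is epsilon

# The A-productions after substituting S into A -> Sd:
g = {"A": [["A", "c"], ["A", "a", "d"], ["b", "d"], []]}
eliminate_immediate(g, "A")
print(g)  # A -> bdA' | A'   and   A' -> cA' | adA' | eps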
Recursive-Descent Parsing
◦Consider the grammar S → cAd, A → ab | a and the input string w = cad
◦Expanding S → cAd matches the first symbol c; the alternative A → ab
then fails (b does not match d), so the parser backtracks and tries A → a
◦The leaf ‘a’ matches the second symbol of w and the leaf d matches
the third symbol
◦Since we have produced a parse tree for w, we halt and announce
successful completion of parsing
Recursive-Descent Parsing
◦A left-recursive grammar can cause a recursive-descent parser, even
one with backtracking, to go into an infinite loop
◦That is, when we try to expand a nonterminal A, we may eventually
find ourselves again trying to expand A without having consumed any
input
Algorithm:
1. Use two pointers: iptr, pointing to the next input symbol to be read, and
optr, pointing to the current symbol of the output string (initially the
start symbol S).
2. If the symbol pointed to by optr is a non-terminal, expand it using its
first unexpanded production rule.
3. While the symbols pointed to by iptr and optr are the same, increment
both pointers.
4. The loop in the above step terminates when
a. there is a non-terminal at the output (case A)
b. the end of input is reached (case B)
c. the terminal symbols pointed to by iptr and optr do not match (case C)
5. If (A) is true, go to step 2
6. If (B) is true, terminate with success
7. If (C) is true, move both pointers back to the place of the last
non-terminal expansion (backtrack) and go to step 2
8. If there are no more unexpanded productions left and (B) is not true,
report an error
Example: parsing w = cad with S → cAd, A → ab | a

Input         Output         Rules Fired
(iptr)cad     (optr)S        [Rule 2: try S → cAd]
(iptr)cad     (optr)cAd      [Rule 3: match c]
c(iptr)ad     c(optr)Ad      [Rule 2: try A → ab]
c(iptr)ad     c(optr)abd     [Rule 3: match a]
ca(iptr)d     ca(optr)bd     [Rule 7: b ≠ d, backtrack]
c(iptr)ad     c(optr)Ad      [Rule 2: try A → a]
c(iptr)ad     c(optr)ad      [Rule 3: match a, then d]
cad(iptr)     cad(optr)      [Rule 6: end of input, success]
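◦The trace above can be reproduced with a tiny backtracking recursive-descent parser; the following Python sketch (my own illustration, not from the slides) encodes the grammar S → cAd, A → ab | a:

# Sketch: recursive-descent parsing with backtracking for
# S -> cAd, A -> ab | a. Lowercase symbols are terminals.
GRAMMAR = {"S": ["cAd"], "A": ["ab", "a"]}

def parse(symbol, w, pos):
    """Try to derive a prefix of w starting at pos from symbol; return the
    new position on success, or None to make the caller backtrack."""
    if symbol.islower():  # terminal: must match the current input symbol
        return pos + 1 if pos < len(w) and w[pos] == symbol else None
    for production in GRAMMAR[symbol]:  # try the alternatives in order
        p = pos
        for sym in production:
            p = parse(sym, w, p)
            if p is None:
                break  # this alternative failed; try the next one
        else:
            return p  # every symbol of the production matched
    return None  # all alternatives failed

w = "cad"
print(parse("S", w, 0) == len(w))  # True: cad is accepted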
Computing FIRST
◦FIRST(α) is the set of terminals that can begin the strings derivable
from α; if α ⇒* ε, then ε is also in FIRST(α)
◦Apply the following rules until no more terminals or ε can be added:
1. If X is ε then FIRST(X) = {ε}
2. If X is a terminal symbol then FIRST(X) = {X}
3. If X is a non-terminal symbol and X → ε is a production rule then
FIRST(X) = FIRST(X) ∪ {ε}
4. If X is a non-terminal symbol and X → Y1Y2 …Yn is a production rule then
a) if a terminal a is in FIRST(Y1) then FIRST(X) = FIRST(X) ∪ {a}
b) if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1 then
FIRST(X) = FIRST(X) ∪ {a}
c) if ε is in all FIRST(Yj) for j=1,...,n then FIRST(X) = FIRST(X) ∪ {ε}
Now, we can compute FIRST for any string X1X2…Xn as follows:
◦add to FIRST(X1X2…Xn) all the non-ε symbols of FIRST(X1)
◦if ε is in FIRST(X1), also add the non-ε symbols of FIRST(X2); if ε is in
both FIRST(X1) and FIRST(X2), also add the non-ε symbols of FIRST(X3), and so on
◦if ε is in FIRST(Xi) for all i=1,...,n, add ε to FIRST(X1X2…Xn)
Computing FOLLOW
◦FOLLOW(A) is the set of terminals that can appear immediately to the
right of A in some sentential form; apply the following rules until
nothing more can be added:
1. Put $ in FOLLOW(S), where S is the start symbol
2. If there is a production A → αBβ, then everything in FIRST(β) except ε
is in FOLLOW(B)
3. If there is a production A → αB, or a production A → αBβ where FIRST(β)
contains ε, then everything in FOLLOW(A) is in FOLLOW(B)
Example: FIRST and FOLLOW
◦Consider the grammar:
S → AB
A → Ca | ε
B → cB'
B' → aACB' | ε
C → b | ε
It helps to first compute the nullable set (i.e., the non-terminals X
such that X ⇒* ε), since the nullable status of various non-terminals
is needed when computing the first and follow sets:
nullable(G) = {A, B', C}
The first sets for each non-terminal are:
First(C) = {b, ε}
First(B') = {a, ε}
First(B) = {c}
First(A) = {b, a, ε}
To compute the follow sets, take each nonterminal and go through all the
right-side productions that the nonterminal is in, matching to the steps given
earlier:
Follow(S) = {$}
S doesn’t appear in the right hand side of any productions. We put $ in the
follow set because S is the start symbol.
Follow(B) = {$}
B appears on the right hand side of the S → AB production. Its follow set
is the same as S.
Follow(B') = {$}
B' appears on the right hand side of two productions. The B' → aACB'
production tells us its follow set includes the follow set of B', which is
tautological. From B → cB', we learn its follow set is the same as B.
Follow(C) = {a, $}
C appears in the right hand side of two productions. The production A →
Ca tells us a is in the follow set. From B' → aACB', we add First(B'),
which is just a again. Because B' is nullable, we must also add
Follow(B'), which is $.
Follow(A) = {c, b, a, $}
A appears in the right hand side of two productions. From S → AB we add
First(B), which is just c (B is not nullable, so nothing more comes from
this production). From B' → aACB', we add First(C), which is b. Since C
is nullable, we also include First(B'), which is a. B' is also nullable,
so we include Follow(B'), which adds $.
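◦These sets can also be computed mechanically by iterating to a fixed point; the Python sketch below (representation and variable names are my own) reproduces the sets above, with ε-membership in FIRST tracked by the nullable set:

# Sketch: fixed-point computation of nullable, FIRST and FOLLOW.
# Productions are lists of symbols; [] represents epsilon.
G = {"S":  [["A", "B"]],
     "A":  [["C", "a"], []],
     "B":  [["c", "B'"]],
     "B'": [["a", "A", "C", "B'"], []],
     "C":  [["b"], []]}

def is_term(x): return x not in G  # anything that is not a non-terminal

nullable = set()
first  = {n: set() for n in G}
follow = {n: set() for n in G}
follow["S"].add("$")  # rule 1: $ goes into FOLLOW of the start symbol

changed = True
while changed:
    changed = False
    for A, prods in G.items():
        for prod in prods:
            if all(s in nullable for s in prod) and A not in nullable:
                nullable.add(A); changed = True
            for s in prod:  # FIRST(A) grows from the production body
                f = {s} if is_term(s) else first[s]
                if not f <= first[A]:
                    first[A] |= f; changed = True
                if s not in nullable:
                    break
            for i, s in enumerate(prod):  # FOLLOW for each non-terminal
                if is_term(s):
                    continue
                trailer, rest_nullable = set(), True
                for t in prod[i + 1:]:
                    trailer |= {t} if is_term(t) else first[t]
                    if t not in nullable:
                        rest_nullable = False; break
                if rest_nullable:  # rule 3: everything in FOLLOW(A)
                    trailer |= follow[A]
                if not trailer <= follow[s]:
                    follow[s] |= trailer; changed = True

print(sorted(nullable))  # ['A', "B'", 'C']
print(follow["A"])       # {'c', 'b', 'a', '$'} (set order may vary)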
Predictive Parsing
◦A recursive-descent parser always chooses the first available
production whenever it encounters a non-terminal
◦This is inefficient and causes a lot of backtracking
◦It also suffers from the left-recursion problem
◦A recursive-descent parser can work efficiently only if there is no
need for backtracking
◦A predictive parser chooses each production in a way that avoids
backtracking and infinite looping
Predictive Parsing
◦A predictive parser is characterized by its ability to choose the
production to apply solely on the basis of the next input symbol and
the current non-terminal being processed
◦To enable this, the grammar must take a particular form called an LL(1)
grammar
◦The first "L" means we scan the input from left to right; the second
"L" means we create a leftmost derivation; and the 1 means one input
symbol of lookahead
◦An LL(1) grammar has no left-recursive productions and is left-factored
LL(1) Grammars
◦Predictive parsers, that is, recursive-descent parsers needing no
backtracking, can be constructed for a class of grammars called LL(1)
◦The predictive top-down techniques (either recursive-descent or
table-driven) require a grammar that is LL(1)
◦One fully general way to determine whether a grammar is LL(1) is to
build the table and see if there are conflicts
◦A grammar whose parsing table has no multiply defined entries is said
to be an LL(1) grammar
LL(1) Grammars
◦No ambiguous or left-recursive grammar is LL(1)
◦There are no general rules by which multiply defined entries can be
made single-valued without affecting the language recognized by the
grammar (i.e., there are no general rules to convert a non-LL(1)
grammar into an LL(1) grammar)
Properties of LL(1) Grammar
◦No Ambiguity and No Recursion
◦In any LL(1) grammar, if there exists a rule of the form A → α | β,
where α and β are distinct, then
1. For any terminal a, if a ∈ FIRST(α) then a ∉ FIRST(β) or vice-versa
2. Either α ⇒* ε or β ⇒* ε (or neither), but not both
3. If β ⇒* ε, then α does not derive any string beginning with a
terminal in FOLLOW(A); likewise, if α ⇒* ε, then β does not derive
any string beginning with a terminal in FOLLOW(A)
Constructing the LL(1) Parsing Table
Input: LL(1) Grammar G
Output: Parsing Table M
for each production rule A → α of grammar G
    for each terminal a in FIRST(α)
        add A → α to M[A,a]
    if ε in FIRST(α) then
        for each terminal a in FOLLOW(A)
            add A → α to M[A,a]
        if $ in FOLLOW(A) then
            add A → α to M[A,$]
make all undefined entries of the parsing table M as error
Example: Consider the grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

FIRST sets:
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FIRST(TE') = {(, id}    FIRST(+TE') = {+}    FIRST(FT') = {(, id}
FIRST(*FT') = {*}    FIRST((E)) = {(}    FIRST(id) = {id}    FIRST(ε) = {ε}

FOLLOW sets:
FOLLOW(E) = FOLLOW(E') = {$, )}
FOLLOW(T) = FOLLOW(T') = {+, $, )}
FOLLOW(F) = {+, *, $, )}
◦Final Parsing Table:

          id          +            *            (           )          $
E       E → TE'                               E → TE'
E'                  E' → +TE'                             E' → ε     E' → ε
T       T → FT'                               T → FT'
T'                  T' → ε       T' → *FT'                T' → ε     T' → ε
F       F → id                                F → (E)
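◦The table-construction algorithm above can be sketched in Python as follows (my own illustration, reusing this expression grammar and its FIRST/FOLLOW sets; a cell receiving two productions would signal a non-LL(1) grammar):

# Sketch: build the LL(1) parsing table M[A, a] for the expression grammar.
G = {"E":  [["T", "E'"]],
     "E'": [["+", "T", "E'"], []],
     "T":  [["F", "T'"]],
     "T'": [["*", "F", "T'"], []],
     "F":  [["(", "E", ")"], ["id"]]}
nullable = {"E'", "T'"}
first  = {"E": {"(", "id"}, "E'": {"+"}, "T": {"(", "id"},
          "T'": {"*"}, "F": {"(", "id"}}
follow = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def is_term(x): return x not in G

def first_of(alpha):
    """Return (FIRST of the string without eps, whether it derives eps)."""
    out = set()
    for s in alpha:
        out |= {s} if is_term(s) else first[s]
        if s not in nullable:
            return out, False
    return out, True

def build_table(G):
    M = {}
    for A, prods in G.items():
        for prod in prods:
            f, eps = first_of(prod)
            for a in f | (follow[A] if eps else set()):
                assert (A, a) not in M, "conflict: grammar is not LL(1)"
                M[A, a] = prod  # add A -> prod to M[A, a]
    return M

M = build_table(G)
print(M["E'", ")"])  # []  i.e. the entry E' -> ε on lookahead )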
Nonrecursive Predictive Parsing
◦A nonrecursive predictive parser can be built by maintaining a stack
explicitly, rather than implicitly via recursive calls
◦Nonrecursive predictive parsing is table-driven
◦The table-driven predictive parser has a stack, an input buffer, a
parsing table, and an output stream
◦The input buffer contains the sentence to be parsed, followed by $ as
an end marker
◦The stack contains symbols of the grammar
Nonrecursive Predictive Parsing
◦Initially, the stack contains the start symbol on top of $
◦When the stack is emptied (i.e., only $ is left on the stack), parsing
is complete
◦The parsing table is a two-dimensional array M[A,a] that records which
production to use upon seeing terminal a in the input and non-terminal A
on top of the stack
◦Each entry in the parsing table holds a production rule
◦Each column corresponds to a terminal or $, and each row to a
non-terminal symbol
Nonrecursive Predictive Parsing
◦Algorithm:
◦INPUT: A string w and a parsing table M for grammar G.
◦OUTPUT: If w is in L(G), a leftmost derivation of w; otherwise, an
error indication.
◦METHOD: Initially, the parser is in a configuration with w$ in the
input buffer and the start symbol S of G on top of the stack, above $.
The following algorithm uses the predictive parsing table M to
produce a predictive parse for the input:
set ip to point to the first symbol of w
let X be the top stack symbol
while X ≠ $ do
    let a be the symbol pointed to by ip
    if X = a then pop the stack and advance ip
    else if X is a terminal then error()
    else if M[X,a] is an error entry then error()
    else if M[X,a] = X → Y1Y2…Yk then
        output the production X → Y1Y2…Yk
        pop the stack
        push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top
    let X be the top stack symbol
Nonrecursive Predictive Parsing
Example 1: Consider the grammar
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
Nonrecursive Predictive Parsing
◦Its parsing table is the final parsing table constructed above
◦Moves made by a predictive parser on input id + id * id:

Stack       Input           Action
$E          id+id*id$       output E → TE'
$E'T        id+id*id$       output T → FT'
$E'T'F      id+id*id$       output F → id
$E'T'id     id+id*id$       match id
$E'T'       +id*id$         output T' → ε
$E'         +id*id$         output E' → +TE'
$E'T+       +id*id$         match +
$E'T        id*id$          output T → FT'
$E'T'F      id*id$          output F → id
$E'T'id     id*id$          match id
$E'T'       *id$            output T' → *FT'
$E'T'F*     *id$            match *
$E'T'F      id$             output F → id
$E'T'id     id$             match id
$E'T'       $               output T' → ε
$E'         $               output E' → ε
$           $               accept
◦Example 2:
Consider the grammar
S → aBa
B → bB | ε
         a            b           $
S     S → aBa
B     B → ε       B → bB
◦Given string w = abba
Stack     Input     Action
$S        abba$     output S → aBa
$aBa      abba$     match a
$aB       bba$      output B → bB
$aBb      bba$      match b
$aB       ba$       output B → bB
$aBb      ba$       match b
$aB       a$        output B → ε
$a        a$        match a
$         $         Accept; successful completion

Output:
S → aBa, B → bB, B → bB, B → ε
So, the leftmost derivation is
S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
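◦Finally, the table-driven algorithm can be sketched in a few lines of Python (my own illustration, not from the slides); running it on w = abba with the parsing table above reproduces exactly the moves and output shown:

# Sketch: nonrecursive predictive parsing of w = abba for the grammar
# S -> aBa, B -> bB | epsilon, using the parsing table above.
TABLE = {("S", "a"): ["a", "B", "a"],
         ("B", "b"): ["b", "B"],
         ("B", "a"): []}  # [] encodes B -> epsilon

def predictive_parse(w, start="S"):
    tokens = list(w) + ["$"]
    stack = ["$", start]  # top of the stack is the end of the list
    i = 0
    while stack[-1] != "$":
        X, a = stack[-1], tokens[i]
        if X == a:                        # terminal on top: match it
            stack.pop(); i += 1
            print("match", a)
        elif (X, a) in TABLE:             # non-terminal: consult the table
            body = TABLE[X, a]
            stack.pop()
            stack.extend(reversed(body))  # push the body, leftmost on top
            print("output", X, "->", "".join(body) or "ε")
        else:
            raise SyntaxError(f"no entry for ({X}, {a})")
    if tokens[i] != "$":
        raise SyntaxError("input not fully consumed")
    print("accept")

predictive_parse("abba")  # outputs S → aBa, B → bB, B → bB, B → ε, accept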