UNIT III
SYNTAX ANALYSIS
Need and Role of the Parser-Context Free Grammars -Top Down Parsing-
General Strategies-Recursive Descent Parser Predictive Parser-LL(1) Parser-Shift Reduce
Parser-LR Parser- LR (0)Item-Construction of SLR Parsing Table -Introduction to LALR
Parser - Error Handling and Recovery in Syntax Analyzer-YACC-Design of a syntax
Analyzer for a Sample Language .
The Role of the Parser
The parser obtains a string of tokens from the lexical Analyzer and verifies that
the string of token names can be generated by the grammar for the source language.
The parser reports any syntax errors and recovers from commonly occurring
errors so that it can continue processing the remainder of the program. Conceptually, for well-
formed programs, the parser constructs a parse tree and passes it to the rest of the
compiler for further processing.
The parser and the rest of the front end could well be implemented by a single module.
(Figure: position of the parser in the compiler model.)
Parsers can be classified into three broad categories. They are:
1. Universal Parsing
2. Top down parsing
3. Bottom up parsing
i. Universal Parsing:
Universal parsing methods, such as the Cocke-Younger-Kasami algorithm and
Earley's algorithm, can parse any grammar.
ii. Top down parsing:
These are the parsers which construct the parse tree from the root (the starting
nonterminal) to the leaves, in preorder, for the given input string.
The starting nonterminal is expanded to derive the given input string.
Top-down parsers handle LL grammars and are typically implemented by hand.
iii. Bottom up parsing:
These are the parsers which construct the parse tree from the leaves up to the root
(the starting nonterminal) for the given input string.
Parsing is done from the bottom to the top of the tree.
The input string is reduced to the starting nonterminal.
Bottom-up parsers handle LR grammars and are implemented by automated tools.
SYNTAX ERROR HANDLING:
Common programming errors can occur at many different levels.
Lexical errors include misspellings of identifiers, keywords, or operators;
syntactic errors include misplaced semicolons or extra or missing braces;
semantic errors include type mismatches between operators and operands; and
logical errors include mistakes such as an infinitely recursive call.
Error detection and recovery in a compiler is centered around the syntax analysis phase.
The reasons for this are:
i. Many errors are syntactic in nature or are exposed when the stream of tokens
coming from the lexical analyzer disobeys the grammatical rules defining the
programming language.
ii. Modern parsing methods are precise enough to detect the presence of
syntactic errors in programs very efficiently.
The goals of the error handler in a parser are:
i. It should report the presence of errors clearly and accurately.
ii. It should recover from each error quickly enough to be able to detect
subsequent errors.
iii. It should not significantly slow down the processing of correct programs.
Error-Recovery Strategies
The error-recovering strategies are:
1. Panic-mode
2. Phrase-level
3. Error-productions
4. Global-correction.
Panic-Mode Recovery
With this method, on discovering an error, the parser discards input symbols one
at a time until one of a designated set of synchronizing tokens is found.
The synchronizing tokens are usually delimiters, such as the semicolon or closing
brace }, whose role in the source program is clear and unambiguous.
The compiler designer must select the synchronizing tokens appropriate for the
source language. While panic-mode correction often skips a considerable amount
of input without checking it for additional errors, it has the advantage of
simplicity, and, unlike some methods to be considered later, is guaranteed not to
go into an infinite loop.
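The idea can be made concrete with a minimal Python sketch (illustrative only; the list-of-tokens representation, the synchronizing set, and the function name are assumptions, not part of the text):

EPSILON_NOTE = None  # placeholder module-level comment anchor
SYNC_TOKENS = {';', '}'}              # assumed synchronizing set

def panic_mode_recover(tokens, pos):
    # tokens: list of token strings; pos: index where the error was detected.
    # Returns the index just past the synchronizing token (or end of input).
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                      # discard one input symbol at a time
    return pos + 1 if pos < len(tokens) else pos

# Example: after an error at index 2, resume just past the next ';'.
print(panic_mode_recover(['x', '=', '@', '3', ';', 'y'], 2))   # -> 5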
Phrase-Level Recovery
On discovering an error, a parser may perform local correction on the
remaining input; that is, it may replace a prefix of the remaining input by
some string that allows the parser to continue.
A typical local correction is to replace a comma by a semicolon, delete an
extraneous semicolon, or insert a missing semicolon.
The choice of the local correction is left to the compiler designer.
Phrase-level replacement has been used in several error-repairing
compilers, as it can correct any input string. Its major drawback is the
difficulty it has in coping with situations in which the actual error has
occurred before the point of detection.
Error Productions
By anticipating common errors that might be encountered, we can
augment the grammar for the language at hand with productions that
generate the erroneous constructs.
A parser constructed from a grammar augmented by these error
productions detects the anticipated errors when an error production is used
during parsing. The parser can then generate appropriate error diagnostics
about the erroneous construct that has been recognized in the input.
Global Correction
Ideally, we would like a compiler to make as few changes as possible in
processing an incorrect input string. There are algorithms for choosing a
minimal sequence of changes to obtain a globally least-cost correction.
Given an incorrect input string x and grammar G, these algorithms will
find a parse tree for a related string y, such that the number of insertions,
deletions, and changes of tokens required to transform x into y is as small
as possible. Unfortunately, these methods are in general too costly to
implement in terms of time and space, so these techniques are currently
only of theoretical interest.
CONTEXT-FREE GRAMMARS
The Formal Definition of a Context-Free Grammar
A context-free grammar (grammar for short) consists of terminals, nonterminals, a start
symbol, and productions.
G = (V, T, P, S), where:
1. V: variables or nonterminals (A-Z)
2. T: terminals (0-9, a-z, +, -, /, * and all special characters)
3. P: productions of the form A → α
4. S: the starting symbol
Example:
E → E+E | E*E | a
In a grammar, one nonterminal is distinguished as the start symbol, and the set of
strings it denotes is the language generated by the grammar.
The productions of a grammar specify the manner in which the terminals and
nonterminals can be combined to form strings.
Each production consists of:
(a) A nonterminal called the head or left side of the production; this production
defines some of the strings denoted by the head.
(b) The arrow symbol (->)
(c) A body or right side consisting of zero or more terminals and nonterminals.
The components of the body describe one way in which strings of the nonterminal at the
head can be constructed.
Example: The grammar with the following productions defines simple arithmetic
expressions:
expr-> expr op expr
expr->( expr )
expr-> - expr
expr-> id
op -> +
op -> -
op -> /
op -> *
op -> ↑
In this grammar, the terminal symbols are:
id, +, -, *, /, ↑, (, )
The nonterminals are
expr and op
The start symbol is expr.
Notational Conventions
The following notational conventions for grammars are used:
1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, -, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0,1,. . . ,9.
(e) Boldface strings such as id or if, each of which represents a single terminal
symbol.
2. These symbols are nonterminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to
represent nonterminals for the constructs. For example, nonterminals for
expressions, terms, and factors are often represented by E, T, and F, respectively.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that
is, either nonterminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent strings of
terminals.
5. Lowercase Greek letters, α, β,γ for example, represent strings of grammar symbols.
6. A set of productions A -> α1, A -> α2, . . . , A-> αk with a common head A (call them
A-productions), may be written A -> α1| α2|… | αk.
7. Unless stated otherwise, the head of the first production is the start symbol.
Example: Using these conventions, the grammar with the following productions
expr-> expr op expr
expr->( expr )
expr-> - expr
expr-> id
op -> +
op -> -
op -> /
op -> *
op -> ↑
can be rewritten as follows:
E->E A E | ( E ) | -E | id
A-> + | - | * | / | ↑
The notational conventions tell us that E and A are nonterminals, with E the start
symbol. The remaining symbols are terminals.
Derivations of CFG:
The construction of a parse tree can be made precise by taking a derivational
view, in which productions are treated as rewriting rules.
Beginning with the start symbol, each rewriting step replaces a nonterminal by the
body of one of its productions.
The derivation is denoted by the symbol ⇒:
⇒ means "derives in one step."
⇒* means "derives in zero or more steps."
⇒+ means "derives in one or more steps."
Sentence:
If S ⇒* α, where S is the start symbol of a grammar G, we say that α is a sentential form
of G. Note that a sentential form may contain both terminals and nonterminals, and may
be empty. A sentence of G is a sentential form with no nonterminals. The language
generated by a grammar is its set of sentences.
Context-free language :
Thus, a string of terminals w is in L(G), the language generated by G, if and only if w is a
sentence of G (or S ⇒* w). A language that can be generated by a grammar is said to be a
context-free language.
If two grammars generate the same language, the grammars are said to be equivalent.
Two types of derivation:
1. In leftmost derivations, the leftmost nonterminal is replaced at each step. If α ⇒ β
is a step in which the leftmost nonterminal in α is replaced, we write α ⇒lm β.
2. In rightmost derivations, the rightmost nonterminal is always chosen and replaced
at each step.
Rightmost derivations are sometimes called canonical derivations.
Parse Trees and Derivations
A parse tree is a graphical representation of a derivation.
Each interior node of a parse tree represents the application of a production. The
interior node is labeled with the nonterminal A in the head of the production; the
children of the node are labeled, from left to right, by the symbols in the body of
the production by which this A was replaced during the derivation.
The leaves of a parse tree are labeled by nonterminals or terminals and, read from
left to right, constitute a sentential form, called the yield or frontier of the tree.
(Figure: an example parse tree.)
(Figure: the sequence of parse trees constructed during a derivation.)
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous. Put another way, an ambiguous grammar is one that produces more than one
leftmost derivation or more than one rightmost derivation for the same sentence.
Example:
The arithmetic expression grammar permits two distinct leftmost derivations for the
sentence id + id * id:
(Figure 4.5: the two leftmost derivations and the corresponding parse trees for id + id * id.)
Note
that the parse tree of Fig. 4.5(a) reflects the commonly assumed precedence of + and *,
while the tree of Fig. 4.5(b) does not. That is, it is customary to treat operator * as having
higher precedence than +, corresponding to the fact that we would normally evaluate an
expression like a + b * c as a + (b * c),rather than as (a + b) * c.
For most parsers, it is desirable that the grammar be made unambiguous, for if it is not,
we cannot uniquely determine which parse tree to select for a sentence. In other cases, it
is convenient to use carefully chosen ambiguous grammars, together with disambiguating
rules that "throw away" undesirable parse trees, leaving only one tree for each sentence.
Verifying the Language Generated by a Grammar
Although compiler designers rarely do so for a complete programming-language
grammar, it is useful to be able to reason that a given set of productions generates a
particular language. Troublesome constructs can be studied by writing a concise, abstract
grammar and studying the language that it generates. We shall construct such a grammar
for conditional statements below.
A proof that a grammar G generates a language L has two parts: show that every string
generated by G is in L, and conversely that every string in L can indeed be generated by
G.
Example 4.12 : Consider the following grammar:
S → ( S ) S | ε
It may not be initially apparent, but this simple grammar generates all strings of balanced
parentheses, and only such strings. To see why, we shall show first that every sentence
derivable from S is balanced, and then that every balanced string is derivable from S. To
show that every sentence derivable from S is balanced, we use an inductive proof on the
number of steps n in a derivation.
BASIS: The basis is n = 1. The only string of terminals derivable from S in one step is the
empty string, which surely is balanced.
INDUCTION: Now assume that all derivations of fewer than n steps produce balanced
sentences, and consider a leftmost derivation of exactly n steps. Such a derivation must
be of the form
S ⇒ ( S ) S ⇒* ( x ) S ⇒* ( x ) y
The derivations of x and y from S take fewer than n steps, so by the inductive hypothesis
x and y are balanced. Therefore, the string (x)y must be balanced.
That is, it has an equal number of left and right parentheses, and every prefix has at least
as many left parentheses as right.
Having thus shown that any string derivable from S is balanced, we must next show that
every balanced string is derivable from S. To do so, use induction on the length of a
string.
BASIS: If the string is of length 0, it must be ε, which is balanced.
INDUCTION: First, observe that every balanced string has even length. Assume that
every balanced string of length less than 2n is derivable from S,
and consider a balanced string w of length 2n, n ≥ 1. Surely w begins with a
left parenthesis. Let (x) be the shortest nonempty prefix of w having an equal
number of left and right parentheses. Then w can be written as w = (x) y where both x and
y are balanced. Since x and y are of length less than 2n, they are derivable from S by the
inductive hypothesis. Thus, we can find a derivation of the form
S ⇒ ( S ) S ⇒* ( x ) S ⇒* ( x ) y, proving that w = (x)y is
also derivable from S.
Context-Free Grammars Versus Regular Expressions
Every construct that can be described by a regular expression can be described by
a grammar, but not vice-versa. Alternatively, every regular language is a context-
free language, but not vice-versa.
For example, the regular expression (a|b)*abb and the grammar given later in this
section describe the same
language, the set of strings of a's and b's ending in abb. We can construct
mechanically a grammar to recognize the same language as a nondeterministic
finite automaton (NFA).
Writing a Grammar
Grammars are capable of describing most, but not all, of the syntax of
programming languages. For instance, the requirement that identifiers be
declared before they are used cannot be described by a context-free grammar.
Therefore, the sequences of tokens accepted by a parser form a superset of the
programming language; subsequent phases of the compiler must analyze the
output of the parser to ensure compliance with rules that are not checked by
the parser.
We then consider several transformations that could be applied to get a grammar more
suitable for parsing.
One technique can eliminate ambiguity in the grammar, and other techniques -
left-recursion elimination and left factoring - are useful for rewriting grammars
so they become suitable for top-down parsing.
Lexical Versus Syntactic Analysis:
Everything that can be described by a regular expression can also be described by
a grammar.
There are several reasons to use regular expressions to define the lexical syntax of a
language.
1. Separating the syntactic structure of a language into lexical and nonlexical parts
provides a convenient way of modularizing the front end of a compiler into two
manageable-sized components.
2. The lexical rules of a language are frequently quite simple, and to describe them
we do not need a notation as powerful as grammars.
3. Regular expressions generally provide a more concise and easier-to-understand
notation for tokens than grammars.
4. More efficient lexical analyzers can be constructed automatically from regular
expressions than from arbitrary grammars.
Regular Expressions Vs. Context-Free Grammars
Regular expressions are most useful for describing the structure of constructs such
as identifiers, constants, keywords, and white space.
Grammars, on the other hand, are most useful for describing nested structures
such as balanced parentheses, matching begin-end's, corresponding if-then-else's,
and so on. These nested structures cannot be described by regular expressions.
For example, the regular expression (a|b)*abb and the grammar
A0 → a A0 | b A0 | a A1
A1 → b A2
A2 → b A3
A3 → ε
describe the same language: the set of strings of a's and b's ending in abb.
Eliminating Ambiguity
An ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, we
shall eliminate the ambiguity from the following “dangling else” grammar.
stmt → if expr then stmt
| if expr then stmt else stmt
| other --------------------------> (2.5)
Here other stands for any other statement. According to this grammar, the
compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has only one parse tree, whereas
if E1 then if E2 then S1 else S2
has two parse trees.
In all programming languages with conditional statements of this form, the first
parse tree (else matches with closest if )is preferred. The general rule is, “match each
else with the closest previous unmatched then”. This disambiguating rule can be
incorporated directly into the grammar. So, the grammar is rewritten as,
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt |
if expr then matched_stmt else unmatched_stmt --->(2.6)
This grammar generates the same set of strings, but it allows only one parse tree for each
string, namely the one that associates each else with the closest previous unmatched then.
Elimination of Left Recursion
A grammar is left recursive if it has a nonterminal A such that there is a derivation
A ⇒+ Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars, so a
transformation is needed to eliminate left recursion.
Consider the left-recursive pair of productions
A → A α | β, where β does not start with A.
Then we can rewrite the rules to eliminate the left recursion as follows
A → β A'
A' → α A'| ε
Example:
E → E + T | T
After eliminating the left recursion:
E → T E'
E' → + T E'
E' → ε
Example 2:
Consider the grammar
S→E
E→T|E+T|E-T
T →F |T *F |T /F
Eliminate Left recursion from this grammar.
Solution:
After eliminating the left recursion
S→E
E → T E'
E' → + T E' | - T E' | ε
T → F T'
T' → * F T' | / F T' | ε
In general, for the left-recursive productions
A → A α1 | A α2 | … | A αn | β1 | β2 | … | βm
we can rewrite the rules to eliminate the left recursion as follows:
A → β1 A' | β2 A' | … | βm A'
A' → α1 A' | α2 A' | … | αn A' | ε
Consider another grammar used for arithmetic expressions.
E → E + T | T
T → T * F | F
F → id | ( E ) ---------------------------------> (2.8)
After eliminating the left recursion:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → id | ( E )
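As a concrete illustration, here is a small Python sketch of the rewriting rule for immediate left recursion; the dictionary-of-lists grammar representation and the function name are assumptions made for the example:

def eliminate_immediate_left_recursion(grammar, A):
    # A -> A a1 | ... | b1 | ... becomes A -> b1 A' | ... and
    # A' -> a1 A' | ... | epsilon, where [] encodes an epsilon body.
    recursive = [p[1:] for p in grammar[A] if p and p[0] == A]   # the alpha_i
    rest      = [p for p in grammar[A] if not p or p[0] != A]    # the beta_j
    if not recursive:
        return                                     # A is not left recursive
    A1 = A + "'"                                   # new nonterminal A'
    grammar[A]  = [beta + [A1] for beta in rest]            # A  -> beta A'
    grammar[A1] = [alpha + [A1] for alpha in recursive]     # A' -> alpha A'
    grammar[A1].append([])                                  # A' -> epsilon

g = {'E': [['E', '+', 'T'], ['T']]}
eliminate_immediate_left_recursion(g, 'E')
print(g)    # {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}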
Eliminating Left-Factoring
Another property required of a grammar to be suitable for top-down parsing is that the
grammar is left-factored. A left-factored grammar is one where for each nonterminal,
there are no two productions on that nonterminal which have a common nonempty prefix
of symbols on the right-hand side of the production.
If the production is of the form
A → α β1 | α β2 | … | α βn | γ
then the productions after left factoring are
A → α A' | γ
A' → β1 | β2 | … | βn
For example, here is a grammar:
A → a b c | a b d
Both productions share the common prefix a b on the right-hand side. After left
factoring, the grammar is
A → a b A'
A' → c | d
Example:
Eliminate Left Factoring for the grammar,
S → iEtS | iEtSeS | a
E → b
Solution:
α = iEtS, β1 = ε, β2 = eS, γ = a
Then the grammar after left factoring is
S → iEtSS' | a
S' → eS | ε
E → b
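A Python sketch of one round of left factoring follows; the list-of-symbols production representation and helper names are assumptions for illustration:

def common_prefix(p, q):
    # Longest common prefix of two productions (lists of symbols).
    n = 0
    while n < min(len(p), len(q)) and p[n] == q[n]:
        n += 1
    return p[:n]

def left_factor(grammar, A):
    # One round of left factoring on the A-productions.
    prods, best = grammar[A], []
    for i in range(len(prods)):
        for j in range(i + 1, len(prods)):
            pre = common_prefix(prods[i], prods[j])
            if len(pre) > len(best):
                best = pre
    if not best:
        return                                       # already left-factored
    A1 = A + "'"
    with_prefix    = [p[len(best):] for p in prods if p[:len(best)] == best]
    without_prefix = [p for p in prods if p[:len(best)] != best]
    grammar[A]  = without_prefix + [best + [A1]]     # A  -> gamma | alpha A'
    grammar[A1] = with_prefix                        # A' -> beta_i ([] = epsilon)

g = {'S': [['i', 'E', 't', 'S'], ['i', 'E', 't', 'S', 'e', 'S'], ['a']]}
left_factor(g, 'S')
print(g)    # S -> a | i E t S S'   and   S' -> epsilon | e S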
TOP-DOWN PARSING
Parsing is the process of determining if a string of tokens can be generated by a
grammar.
The two top-down parsing methods are:
Recursive descent parsing (with backtracking)
An efficient non-backtracking method called predictive parsing, used for
LL(1) grammars.
1. RECURSIVE DESCENT PARSING
Top-down parsing can be viewed as an attempt to find a leftmost
derivation for an input string. Equivalently, it is an attempt to construct
a parse tree for the input, starting from the root and creating the
nodes of the parse tree in preorder. A special case of recursive-descent
parsing that needs no backtracking is called predictive parsing.
A recursive-descent parsing program consists of a set of procedures,
one for each nonterminal. Execution begins with the procedure for
the start symbol, which halts and announces success if its procedure
body scans the entire input string. General recursive-descent may
require backtracking; that is, it may require repeated scans over the
input. However, backtracking is rarely needed to parse programming
language constructs, so backtracking parsers are not seen frequently.
Consider the grammar
S -> cAd
A -> ab | a -------------------------------> (2.12)
and the input string w=cad. To construct a parse tree for this string top-
down, we initially create a tree consisting of a single node labeled S. An
input pointer points to c, the first symbol of w. We can use the first
production for S to expand the tree and obtain the tree shown below.
The leftmost leaf, labeled c, matches the first symbol of w, so we can
now advance the input pointer to a, the second symbol of w, and
consider the next leaf, labeled A. We can then expand A using the first
alternative for A, obtaining leaves labeled a and b. We now have a match
for the second input symbol, a, so we advance the input pointer to d, the
third symbol, and compare d against the next leaf, labeled b. Since b does
not match d, we report failure and go back to A to see whether there is
another alternative for A that we have not tried but that might produce a
match; trying the second alternative, A → a, makes the leaves read c a d,
matching w, and the parse succeeds.
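A Python sketch of this backtracking parse for S → cAd, A → ab | a is shown below; the explicit position-returning style is one of several ways to implement backtracking and is an assumption of the example:

def parse_S(w, i):
    # Try S -> cAd at position i; return the new position, or None on failure.
    if i < len(w) and w[i] == 'c':
        j = parse_A(w, i + 1)
        if j is not None and j < len(w) and w[j] == 'd':
            return j + 1
    return None

def parse_A(w, i):
    if w[i:i+2] == 'ab':              # first alternative: A -> ab
        return i + 2
    if i < len(w) and w[i] == 'a':    # backtrack: reset to i and try A -> a
        return i + 1
    return None

w = 'cad'
print(parse_S(w, 0) == len(w))        # True: the whole input is matched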
2. Predictive Parsing or LL(1):
L → the input is scanned from Left to right
L → the parser produces a Leftmost derivation
1 → One input symbol of lookahead is used at each step
• No left-recursive or ambiguous grammar can be LL(1)
Steps:
1. Eliminate ambiguity and left recursion, and left-factor the grammar
2. Find First and Follow
3. Construct Predictive parsing Table
4. Parse the input string
Construction of Parsing Table:
Before constructing the parsing table, two functions must be computed to fill in the
entries of the table: the FIRST( ) and FOLLOW( ) functions.
These functions will indicate proper entries in the table for a grammar G.
Compute FIRST :
To compute FIRST(X) for all grammar symbols X, apply the following rules
until no more terminals or ε can be added to any FIRST set.
1. If X is terminal, then FIRST(X) is {X}.
2. If X is nonterminal and X → aα is a production, then add a to FIRST(X). If
X→ε is a production, then add ε to FIRST(X).
3. If X → Y1 Y2 … Yk is a production, then for all i such that all of Y1, … , Yi-1
are nonterminals and FIRST(Yj) contains ε for j = 1, 2, … , i-1 (i.e., Y1 Y2 … Yi-1 ⇒* ε),
add every non-ε symbol in FIRST(Yi) to FIRST(X). If ε is in FIRST(Yj) for all j
= 1, 2, … , k, then add ε to FIRST(X).
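These rules can be realized by iterating until no FIRST set grows, as in this Python sketch (the grammar representation, with [] for an ε-body and the EPS marker, is an assumption of the example):

EPS = 'eps'    # assumed marker for the empty string

def first_sets(grammar, terminals):
    first = {t: {t} for t in terminals}
    first.update({A: set() for A in grammar})
    changed = True
    while changed:                          # iterate until no FIRST set grows
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                before = len(first[A])
                nullable_prefix = True
                for Y in prod:
                    first[A] |= first[Y] - {EPS}    # rule 3, non-eps part
                    if EPS not in first[Y]:
                        nullable_prefix = False
                        break
                if nullable_prefix:         # every Y_i derives eps (or body empty)
                    first[A].add(EPS)
                if len(first[A]) != before:
                    changed = True
    return first

g = {'E':  [['T', "E'"]],
     "E'": [['+', 'T', "E'"], []],          # [] is the epsilon production
     'T':  [['F', "T'"]],
     "T'": [['*', 'F', "T'"], []],
     'F':  [['(', 'E', ')'], ['id']]}
print(first_sets(g, {'+', '*', '(', ')', 'id'})['E'])   # {'(', 'id'}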
Compute FOLLOW:
To compute FOLLOW(A) for all nonterminals A, apply the following rules until
nothing can be added to any FOLLOW set.
1. $ is in FOLLOW(S), where S is the start symbol.
2. If there is a production A → αBβ, where β ≠ ε, then everything in FIRST(β) except
ε is in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where
FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in
FOLLOW(B).
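A matching Python sketch for FOLLOW, reusing first_sets and the grammar representation from the FIRST sketch above, again iterating until the sets stop growing:

def follow_sets(grammar, first, S):
    follow = {A: set() for A in grammar}
    follow[S].add('$')                             # rule 1
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for prod in prods:
                for i, B in enumerate(prod):
                    if B not in grammar:           # only nonterminals get FOLLOW
                        continue
                    before = len(follow[B])
                    first_beta, nullable = set(), True
                    for Y in prod[i + 1:]:         # beta = what follows B
                        first_beta |= first[Y] - {EPS}
                        if EPS not in first[Y]:
                            nullable = False
                            break
                    follow[B] |= first_beta        # rule 2
                    if nullable:                   # rule 3: beta empty or beta =>* eps
                        follow[B] |= follow[A]
                    if len(follow[B]) != before:
                        changed = True
    return follow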
Construction of Predictive Parsing Table:
The following algorithm can be used to construct a predictive parsing table for a grammar
G
Constructing a predictive parsing table
Input: Grammar G
Output: Parsing table M
Method:
1. For each production A → α of the grammar, do step 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A,a].
3. If ε is in FIRST(α), add A → α to M[A,b] for each terminal b in FOLLOW(A). If ε is
in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$].
4. Make each undefined entry of M an error entry.
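Steps 2-4 translate directly into code. In this sketch, building on the first_sets and follow_sets sketches above, each table cell holds a list so that a multiply defined entry, i.e. a non-LL(1) grammar, stays visible:

def predictive_table(grammar, first, follow):
    M = {}
    for A, prods in grammar.items():
        for prod in prods:
            fa, nullable = set(), True          # FIRST(alpha) for body alpha
            for Y in prod:
                fa |= first[Y] - {EPS}
                if EPS not in first[Y]:
                    nullable = False
                    break
            for a in fa:
                M.setdefault((A, a), []).append(prod)     # step 2
            if nullable:                                  # eps in FIRST(alpha)
                for b in follow[A]:                       # step 3 ($ included)
                    M.setdefault((A, b), []).append(prod)
    return M    # undefined entries (step 4) are simply absent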
Predictive parsing program
The predictive parser has an input, a stack, a parsing table, and an output.
The input contains the string to be parsed, followed by $, the right endmarker.
The stack contains a sequence of grammar symbols, preceded by $, the bottom-of-
stack marker.
Initially the stack contains the start symbol of the grammar preceded by $.
The parsing table is a two dimensional array M[A,a], where A is a nonterminal,
and a is a terminal or the symbol $.
The parser is controlled by a program that behaves as follows:
The program determines X, the symbol on top of the stack, and a, the current input
symbol.
These two symbols determine the action of the parser.
There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next
input symbol.
3. If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This
entry will be either an X-production of the grammar or an error entry.
If M[X,a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with
U on top).
If M[X,a] = error, the parser calls an error recovery routine.
Predictive parsing program
repeat
begin
let X be the top stack symbol and a the next input symbol;
if X is a terminal or $ then
if X = a then
pop X from the stack and remove a from the input
else
ERROR( )
else /* X is a nonterminal */
if M[X,a] = X → Y1, Y2, … , Yk then
begin
pop X from the stack;
push Yk, Yk-1, … ,Y1 onto the stack, Y1 on top
end
else
ERROR( )
end
until
X = $ /* stack has emptied */
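A direct Python transcription of this driver is sketched below, using a table M built by the predictive_table sketch above (one production per defined entry for an LL(1) grammar); the dict encodings are assumptions of the example:

def predictive_parse(M, grammar, start, tokens):
    stack = ['$', start]
    input_ = tokens + ['$']
    i = 0
    while True:
        X, a = stack[-1], input_[i]
        if X == '$' and a == '$':
            return True                        # stack emptied: success
        if X not in grammar:                   # X is a terminal or $
            if X != a:
                raise SyntaxError('expected %s, saw %s' % (X, a))
            stack.pop(); i += 1                # match: pop X, advance input
        else:
            entry = M.get((X, a))
            if entry is None:
                raise SyntaxError('no rule for (%s, %s)' % (X, a))
            stack.pop()
            for sym in reversed(entry[0]):     # push Yk ... Y1, Y1 on top
                stack.append(sym)

f  = first_sets(g, {'+', '*', '(', ')', 'id'})
fo = follow_sets(g, f, 'E')
M  = predictive_table(g, f, fo)
print(predictive_parse(M, g, 'E', ['id', '+', 'id', '*', 'id']))   # True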
Example:
Consider the following grammar
E→E+T|T
T→T*F|F
F → ( E ) | id
Compute the FIRST and FOLLOW function for the above grammar.
Solution:
The grammar is left-recursive, so we first eliminate the left recursion. We get:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
Then:
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
        id           +             *             (            )           $
E       E → TE'                                  E → TE'
E'                   E' → +TE'                                E' → ε      E' → ε
T       T → FT'                                  T → FT'
T'                   T' → ε        T' → *FT'                  T' → ε      T' → ε
F       F → id                                   F → (E)
Since no table entry contains more than one production, the grammar is LL(1),
and predictive parsing will successfully handle any input string in the language.
STACK           INPUT           ACTION
$E              id+id*id$
$E'T            id+id*id$       E → TE'
$E'T'F          id+id*id$       T → FT'
$E'T'id         id+id*id$       F → id
$E'T'           +id*id$         match id
$E'             +id*id$         T' → ε
$E'T+           +id*id$         E' → +TE'
$E'T            id*id$          match +
$E'T'F          id*id$          T → FT'
$E'T'id         id*id$          F → id
$E'T'           *id$            match id
$E'T'F*         *id$            T' → *FT'
$E'T'F          id$             match *
$E'T'id         id$             F → id
$E'T'           $               match id
$E'             $               T' → ε
$               $               accepted
Example : Consider the grammar (the left-factored expression grammar obtained above)
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id -----------------------------------> (2.13)
Then,
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }
1. FIRST(F) = FIRST(T) = FIRST(E) = {(, id }. To see why, note that the two
productions for F have bodies that start with these two terminal symbols, id and
the left parenthesis. T has only one production, and its body starts with F. Since F
does not derive є, FIRST(T) must be the same as FIRST(F). The same argument
covers FIRST(E).
2. FIRST(E') = {+, ε}. The reason is that one of the two productions for E' has a
body that begins with terminal +, and the other's body is ε. Whenever a
nonterminal derives ε, we place ε in FIRST for that nonterminal.
3. FIRST(T') = {*, є }. The reasoning is analogous to that for FIRST(E').
4. FOLLOW(E) = FOLLOW(E') = { ), $ }. Since E is the start symbol,
FOLLOW(E) must contain $. The production body ( E ) explains why the right
parenthesis is in FOLLOW(E). For E', note that this nonterminal appears only at
the ends of bodies of E-productions. Thus, FOLLOW(E') must be the same as
FOLLOW(E).
5. FOLLOW(T) = FOLLOW(T') = {+, ), $ }. Notice that T appears in bodies only
followed by E'. Thus, everything except ε that is in FIRST(E') must be in
FOLLOW(T); that explains the symbol +. However, since FIRST(E') contains ε
(i.e., E' ⇒* ε), and E' is the entire string following T in the bodies of the E-
productions, everything in FOLLOW(E) must also be in FOLLOW(T). That
explains the symbol $ and the right parenthesis. As for T', since it appears only at
the ends of the T-productions, it must be that FOLLOW(T') = FOLLOW(T).
6. FOLLOW(F) = {+, *, ), $ }. The reasoning is analogous to that for T in point
(5).
Consider production E -> TE'. Since
FIRST(TE') = FIRST(T) = { (, id }
This production is added to M[E, ( ] and M[E, id].
Production E’ -> +TE’ is added to M[E', +] .
Production E' → ε is added to M[E', )] and M[E', $], since FOLLOW(E') = { ), $ }.
Parsing the input string id + id * id
Example : The following grammar, which abstracts the dangling-else problem, is
shown below:
S → iEtSS' | a
S' → eS | ε
E → b
The parsing table for this grammar appears below. The entry M[S', e]
contains both S' → eS and S' → ε, so the grammar is not LL(1).
BOTTOM-UP PARSING
A bottom-up parse corresponds to the construction of a parse tree for an
input string beginning at the leaves (the bottom) and working up towards
the root (the top).
This is nothing but reducing a string w to the start symbol of a grammar.
At each reduction step a particular substring matching the right side of a
production is replaced by the symbol on the left of that production and if
the substring is chosen correctly at each step, a rightmost derivation is
traced out in reverse.
Types of Bottom up parsing
1. Shift reduce parsing.
2. Operator Precedence Parsing
3. LR Parsing
a. SLR PARSING (or) LR(0) Parsing
b. CLR PARSING (or) LR(1) PARSING
c. LALR PARSING (or) LALR(1) PARSING
A general style of bottom-up parsing is known as shift reduce parsing.
Example : Consider the grammar
S -> aABe
A -> Abc | b
B -> d
The sentence abbcde can be reduced to S by the following steps
abbcde
aAbcde
aAde
aABe
S
We scan abbcde looking for a substring that matches the right side of some
production. The substrings b and d qualify. Let us choose the leftmost b and replace it by
A, the left side of the production A->b, thus we obtain the string aAbcde. Now the
substrings Abc, b and d match the right side of some production. Although b is the
leftmost substring that matches the right side of some production, we choose to replace
the substring Abc by A, the left side of the production A->Abc. We now obtain aAde.
Then replacing d by B, the left side of the production B->d, we obtain aABe.
S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
Handles
A "handle" is a substring that matches the body of a production, and
whose reduction represents one step along the reverse of a rightmost
derivation.
In many cases the leftmost substring β that matches the right side of the
production A->β is not a handle, because a reduction by the production A-
>β yields a string that cannot be reduced to the start symbol.
Formally, a handle of a right-sentential form γ is a production A → β and a position of γ
where the string β may be found and replaced by A to produce the previous right-sentential
form in a rightmost derivation of γ. That is, if S ⇒*rm αAw ⇒rm αβw, then production A → β
in the position following α is a handle of αβw. Notice that the string w to the right of the
handle must contain only terminal symbols.
Handle Pruning:
A rightmost derivation in reverse can be obtained by "handle pruning."
That is, we start with a string of terminals w to be parsed. If w is a
sentence of the grammar, then let w = γn, where γn is the nth right-
sentential form of some as yet unknown rightmost derivation
S = γ0 ⇒rm γ1 ⇒rm … ⇒rm γn = w.
To reconstruct this derivation in reverse order, we locate the handle βn in
γn and replace βn by the left side of some production An -> βn to obtain the
(n-1)st right sentential form γn-1.
We then repeat this process. That is, we locate the handle βn-1 in γn-1 and
reduce this handle to obtain the right-sentential form γn-2.
By continuing this process we produce a right-sentential form consisting
only of the start symbol S, then we halt and announce successful
completion of parsing.
The reverse of the sequence of productions used in the reductions is a
rightmost derivation for the input string.
SHIFT REDUCE PARSING
Shift-reduce parsing is a form of bottom-up parsing in which a stack holds
grammar symbols and an input buffer holds the rest of the string to be parsed.
The primary operations are shift and reduce, there are actually four possible
actions a shift-reduce parser can make:
(1) shift
(2) reduce
(3) accept
(4) error.
1. In a Shift action, the next input symbol is shifted onto the top of the stack.
2. In a Reduce action, the parser knows the right end of the handle is at the
top of the stack. It must then locate the left end of the handle within the
stack and decide with what non-terminal to replace the string.
3. In an Accept action, the parser announces successful completion of
parsing.
4. In an Error action, the parser discovers that a syntax error has occurred
and calls an error recovery routine.
The handle always appears at the top of the stack just before it is identified as the handle.
We use $ to mark the bottom of the stack and also the right end of the input. Initially, the
stack is empty, and the string w is on the input, as follows:
STACK INPUT
$ w$
During a left-to-right scan of the input string, the parser shifts zero or more input
symbols onto the stack, until a handle β is on top of the stack.
It then reduces β to the left side of the appropriate production.
The parser repeats this cycle until it has detected an error or until the stack
contains the start symbol and the input is empty:
STACK INPUT
$S $
Upon entering this configuration, the parser halts and announces successful completion of
parsing.
Example:
The actions a shift-reduce parser might take in parsing the input string id1 * id2 are shown
below, according to the expression grammar E → E+E | E*E | (E) | -E | id.
STACK           INPUT             ACTION
$               id1 * id2 $       shift
$ id1           * id2 $           reduce by E → id
$ E             * id2 $           shift
$ E *           id2 $             shift
$ E * id2       $                 reduce by E → id
$ E * E         $                 reduce by E → E*E
$ E             $                 accept
Viable Prefixes
The set of prefixes of right sentential forms that can appear on the stack of a shift-
reduce parser are called viable prefixes. Equivalently, a viable prefix is a prefix of a right-
sentential form that does not continue past the right end of the rightmost handle of that
sentential form.
Conflicts during Shift-Reduce Parsing
There are context-free grammars for which shift-reduce parsing cannot be
used.
Every shift-reduce parser for such a grammar can reach a configuration in
which the parser, knowing the entire stack contents and the next input
symbol, cannot decide whether to shift or to reduce (a shift/reduce
conflict), or cannot decide which of several reductions to make (a
reduce/reduce conflict).
We now give some examples of syntactic constructs that give rise to such
grammars. Technically, these grammars are not in the LR(k) class of
grammars. The k in LR(k) refers to the number of symbols of lookahead
on the input.
Example: An ambiguous grammar can never be LR. For example, consider the dangling-
else grammar:
stmt -> if expr then stmt | if expr then stmt else stmt | other
If we have a shift-reduce parser in configuration
STACK                          INPUT
... if expr then stmt          else ... $
we cannot tell whether if expr then stmt is the handle, no matter what appears
below it on the stack. Here there is a shift/reduce conflict. Depending on what follows
the else on the input, it might be correct to reduce if expr then stmt to stmt, or it might be
correct to shift else and then to look for another stmt to complete the alternative if expr
then stmt else stmt.
Another common setting for conflicts occurs when we know we have a handle,
but the stack contents and the next input symbol are insufficient to determine which
production should be used in a reduction.
Example:
Suppose we have a lexical analyzer that returns the token name id for all names,
regardless of their type. Suppose also that our language invokes procedures by giving
their names, with parameters surrounded by parentheses, and that arrays are referenced
by the same syntax. Since the
translation of indices in array references and parameters in procedure calls are different,
we want to use different productions to generate lists of actual parameters and indices.
Our grammar might therefore have (among others) distinct productions for the two kinds
of lists; since a procedure call and an array reference look the same to the parser, it cannot
decide which production to use in a reduction, giving a reduce/reduce conflict.
OPERATOR PRECEDENCE PARSING
There are two main categories of shift-reduce parsers
1. Operator-Precedence Parser
– simple, but only a small class of grammars.
2. LR-Parsers
– covers wide range of grammars.
1. SLR – simple LR parser
2. LR – most general LR parser
3. LALR – intermediate LR parser (lookahead LR parser)
– SLR, LR and LALR parsers work the same way; only their parsing tables are
different.
Operator-Precedence Parser
For small but important class of grammars efficient shift reduce parsers can
be built. Operator grammars have the property that no production right
side is empty or has two adjacent nonterminals. This property enables the
implementation of efficient operator-precedence parsers.
The following grammar for expressions
E -> EAE | (E) | -E | id
A -> + | - | * | / | ↑
is not an operator grammar, because the right side EAE has two
consecutive nonterminals. However, if we substitute for A each of its alternatives,
we obtain the following operator grammar.
E -> E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
Operator precedence parsing has a number of disadvantages.
• It is hard to handle tokens like the minus sign, which has two different
precedences (unary and binary).
• The relationship between a grammar for the language and the operator-
precedence parser is loose, so one cannot be sure the parser accepts
exactly the desired language.
• Only a small class of grammars can be parsed using operator precedence
techniques.
Operator precedence relations:
Three disjoint precedence relations, <·, =·, and ·>, hold between certain pairs of
terminals. These precedence relations have the following meanings:
Relation        Meaning
a <· b          a yields precedence to b
a =· b          a has the same precedence as b
a ·> b          a takes precedence over b
There are two common ways of determining what precedence relations should
hold between a pair of terminals. The first method is based on traditional notions
of associativity and precedence. For example, if * is to have higher precedence
than +, we make + <· * and * ·> +.
The second method is first to construct an unambiguous grammar which
reflects correct associativity and precedence in its parse trees.
For example, the following operator precedence relations can be introduced for
simple expressions:
        id    +     *     $
id            ·>    ·>    ·>
+       <·    ·>    <·    ·>
*       <·    ·>    ·>    ·>
$       <·    <·    <·
Then the string id + id* id with the precedence relations inserted is
$ <· id1 ·> + <· id2 ·> * <· id3 ·> $
Having the precedence relations allows us to identify handles as follows:
- scan the string from the left until the first ·> is seen
- scan backwards from that point, right to left, until a <· is seen
- everything between the two relations <· and ·> forms the handle
Note that not the entire sentential form needs to be scanned to find the handle.
$ <· id ·> + <· id ·> * <· id ·> $        reduce by E → id        $ id + id * id $
$ <· + <· id ·> * <· id ·> $              reduce by E → id        $ E + id * id $
$ <· + <· * <· id ·> $                    reduce by E → id        $ E + E * id $
$ <· + <· * ·> $                          reduce by E → E*E       $ E + E * E $
$ <· + ·> $                               reduce by E → E+E       $ E + E $
$ $                                       accept                  $ E $
If no precedence relation holds between a pair of terminals, then a
syntactic error has been detected and an error recovery routine must be invoked.
Creation of operator precedence table:
To create an operator precedence table , we have to find out leading and
trailing of every non-terminal.
LEADING
1. If there is a production A → γaβ, where γ is ε or a single non-
terminal, then a is in LEADING(A).
2. If there is a production A → Bα and a is in LEADING(B), then a is
also in LEADING(A).
TRAILING
1. If there is a production A → γaβ, where β is ε or a single non-
terminal, then a is in TRAILING(A).
2. If there is a production A → αB and a is in TRAILING(B), then a is also in TRAILING(A).
Rules for constructing the parsing table:
1. Set $ <· a for all a in LEADING(S), and set b ·> $ for all b in TRAILING(S),
where S is the start symbol of the grammar G.
2. For each production A → X1 X2 X3 … Xn do
a. If Xi and Xi+1 are both terminals, then Xi =· Xi+1.
b. If Xi and Xi+2 are terminals and Xi+1 is a nonterminal, then Xi =· Xi+2.
c. If Xi is a terminal and Xi+1 is a nonterminal, then Xi <· a for every a in
LEADING(Xi+1).
d. If Xi is a nonterminal and Xi+1 is a terminal, then a ·> Xi+1 for every a in
TRAILING(Xi).
Algorithm 2.5 : Operator Precedence Parsing Algorithm
set ip to point to the first symbol of w$ ;
repeat forever
if $ is on top of the stack and ip points to $ then
return
else
begin
let a be the topmost terminal symbol on the stack and let b be the
symbol
pointed to by ip;
if ( a <. b or a =· b ) then
begin /* SHIFT */
push b onto the stack;
advance ip to the next input symbol;
end
else if ( a .> b ) then /* REDUCE */
repeat
pop the stack
until the top of stack terminal is related by <. to the terminal most
recently popped
else error();
end
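This algorithm can be sketched in Python as follows; the relation table for id, +, * and $ from above is encoded as a dict, and a single placeholder 'E' stands for any nonterminal on the stack (both are assumptions of the example):

REL = {('id','+'):'>', ('id','*'):'>', ('id','$'):'>',
       ('+','id'):'<', ('+','+'):'>', ('+','*'):'<', ('+','$'):'>',
       ('*','id'):'<', ('*','+'):'>', ('*','*'):'>', ('*','$'):'>',
       ('$','id'):'<', ('$','+'):'<', ('$','*'):'<'}

def op_precedence_parse(tokens):
    stack, input_, i = ['$'], tokens + ['$'], 0
    while True:
        top = next(s for s in reversed(stack) if s != 'E')   # topmost terminal
        b = input_[i]
        if top == '$' and b == '$':
            return True
        rel = REL.get((top, b))
        if rel in ('<', '='):                 # SHIFT
            stack.append(b); i += 1
        elif rel == '>':                      # REDUCE: pop back past the <.
            while True:
                popped = stack.pop()
                if popped == 'E':
                    continue                  # nonterminals pop freely
                t = next(s for s in reversed(stack) if s != 'E')
                if REL.get((t, popped)) == '<':
                    break
            stack.append('E')                 # the handle becomes a nonterminal
        else:
            raise SyntaxError('no relation between %s and %s' % (top, b))

print(op_precedence_parse(['id', '+', 'id', '*', 'id']))    # True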
Operator-Precedence Relations from Associativity and Precedence
We use the associativity and precedence relations among operators.
1. If operator θ1 has higher precedence than operator θ2, make θ1 ·> θ2 and
θ2 <· θ1.
2. If operator θ1 and operator θ2 have equal precedence:
if they are left-associative, make θ1 ·> θ2 and θ2 ·> θ1;
if they are right-associative, make θ1 <· θ2 and θ2 <· θ1.
3. For all operators θ: θ <· id, id ·> θ, θ <· (, ( <· θ, θ ·> ), ) ·> θ, θ ·> $, and $ <· θ.
Also, let
( =· )    $ <· (    id ·> )    ) ·> $    ( <· (
$ <· id   id ·> $   ) ·> )    ( <· id
Example:
The operator-precedence relations for the grammar
E → E+E | E-E | E*E | E/E | E^E | (E) | -E | id
are:
        +     -     *     /     ^     id    (     )     $
+       ·>    ·>    <·    <·    <·    <·    <·    ·>    ·>
-       ·>    ·>    <·    <·    <·    <·    <·    ·>    ·>
*       ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
/       ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
^       ·>    ·>    ·>    ·>    <·    <·    <·    ·>    ·>
id      ·>    ·>    ·>    ·>    ·>                ·>    ·>
(       <·    <·    <·    <·    <·    <·    <·    =·
)       ·>    ·>    ·>    ·>    ·>                ·>    ·>
$       <·    <·    <·    <·    <·    <·    <·
Precedence Functions
Compilers using operator precedence parsers do not need to store the table of
precedence relations. The table can be encoded by two precedence functions f
and g that map terminal symbols to integers.
For symbols a and b:
f(a) < g(b) whenever a <· b
f(a) = g(b) whenever a =· b
f(a) > g(b) whenever a ·> b
The precedence table above has the following pair of precedence functions:
        +     -     *     /     ^     (     )     id    $
f       2     2     4     4     4     0     6     6     0
g       1     1     3     3     5     5     0     5     0
Algorithm 2.6 : Constructing Precedence Functions
Input: an operator precedence matrix.
Output: precedence functions representing the input matrix, or an indication that
none exist.
Method:
1. Create symbols fa and ga for each a that is a terminal or $.
2. Partition the created symbols into as many groups as possible, in
such a way that if a =· b, then fa and gb are in the same group.
3. Create a directed graph whose nodes are the groups found in (2). For
any a and b, if a <· b, place an edge from the group of gb to the
group of fa. If a ·> b, place an edge from the group of fa to that of gb.
4. If the graph constructed in (3) has a cycle, then no precedence
functions exist. If there are no cycles, let f(a) be the length of the
longest path beginning at the group of fa, and let g(a) be the length of the
longest path beginning at the group of ga.
For example, for the earlier table over id, +, * and $, the relations and the resulting
precedence functions are:
        id    +     *     $                +    *    id   $
id            ·>    ·>    ·>          f    2    4    4    0
+       <·    ·>    <·    ·>          g    1    3    5    0
*       <·    ·>    ·>    ·>
$       <·    <·    <·
(Figure: the directed graph whose node groups are fid, gid, f+, g+, f*, g*, f$ and g$; the length of the longest path leaving each group gives the corresponding function value above.)
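Algorithm 2.6 can be sketched in Python using longest paths in the graph. This simplified version reuses the REL dict from the earlier sketch, omits the group-merging step for =· pairs (none occur in this table), and leaves cycle detection to Python's recursion limit; it reproduces the f and g values above:

import functools

def precedence_functions(terminals, rel):
    nodes = [('f', a) for a in terminals] + [('g', a) for a in terminals]
    edges = {n: [] for n in nodes}
    for (a, b), r in rel.items():
        if r == '<':
            edges[('g', b)].append(('f', a))   # g(b) must exceed f(a)
        elif r == '>':
            edges[('f', a)].append(('g', b))   # f(a) must exceed g(b)

    @functools.lru_cache(maxsize=None)
    def longest(n):                            # longest path leaving node n
        return max((1 + longest(m) for m in edges[n]), default=0)

    f = {a: longest(('f', a)) for a in terminals}
    g = {a: longest(('g', a)) for a in terminals}
    return f, g

f, g = precedence_functions(['id', '+', '*', '$'], REL)
print(f)    # {'id': 4, '+': 2, '*': 4, '$': 0}
print(g)    # {'id': 5, '+': 1, '*': 3, '$': 0}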
Disadvantages of Operator Precedence Parsing
It cannot handle the unary minus (the lexical analyzer should handle
the unary minus).
Small class of grammars.
Difficult to decide which language is recognized by the grammar.
Advantages of Operator Precedence Parsing:
simple
powerful enough for expressions in programming languages
Error Recovery in Operator-Precedence Parsing
Error Cases:
1. No relation holds between the terminal on the top of stack and the
next input symbol.
2. A handle is found (reduction step), but there is no production with
this handle as a right side
Error Recovery:
1. Each empty entry is filled with a pointer to an error routine.
2. The error routine decides which right-hand side the popped handle "looks
like" and tries to recover from that situation.
Handling Errors During Reductions
This handles errors of type 2. As there is no production to reduce by, the routine
displays an error diagnostic message. To handle the error, the routine should decide which
production's right side the popped handle looks like. For example, suppose abc is popped, and there
is no production right side consisting of a, b and c together with zero or more non-
terminals. Then we might consider whether deletion of one of a, b, and c yields a legal
right side.
For example, if there is a right side aEcE, we might issue the error diagnostic
Illegal b on line
if there is a right side abEdc, we might issue the error diagnostic
missing d on line
We may also find there is a right side with the proper sequence of terminals but the
wrong pattern of nonterminals. For example, if abc is popped off the stack with no
nonterminals surrounding it, and abc is not a right side but aEbc is, we might issue the
diagnostic: missing E on line
Handling Shift/Reduce Errors
When consulting the precedence matrix to decide whether to shift or
reduce, we may find that no relation holds between the terminal on top of the stack
and the current input symbol.
To recover, we must modify (insert/change)
1. Stack or
2. Input or
3. Both
Example
        id    (     )     $
id      e3    e3    ·>    ·>
(       <·    <·    =·    e4
)       e3    e3    ·>    ·>
$       <·    <·    e2    e1
e1: called when the whole expression is missing
insert id onto the input
issue diagnostic: "missing operand"
e2: called when the expression begins with a right parenthesis
delete ) from the input
issue diagnostic: "unbalanced right parenthesis"
e3: called when id or ) is followed by id or (
insert + onto the input
issue diagnostic: "missing operator"
e4: called when the expression ends with a left parenthesis
pop ( from the stack
issue diagnostic: "missing right parenthesis"
LR PARSERS
LR parser is a bottom-up syntax analysis technique that can be used to parse a
large class of context-free grammars.
This technique is called as LR(k) parsing,
o the “L” is for left-to-right scanning of the input,
o the “R” for constructing a right most derivation in reverse, and
o “k” for the number of input symbols of lookahead that are used in making
parsing decisions.
o When (k) is omitted, k is assumed to be 1.
LR parsing is attractive for a variety of reasons:
LR parsers can be constructed to recognize virtually all programming language
constructs for which context-free grammars can be written.
The LR-parsing method is the most general non backtracking shift-reduce parsing
method known, yet it can be implemented as efficiently as other, more primitive
shift-reduce methods (see the bibliographic notes).
The class of grammars that can be parsed using LR methods is a proper superset
of the class of grammars that can be parsed with predictive parsers.
An LR parser can detect a syntactic error as soon as it is possible to do so on a
left-to-right scan of the input.
The principal drawback of the LR method is that it is too much work to construct an LR
parser by hand for a typical programming-language grammar. A specialized tool, an LR
parser generator like YACC is available.
The LR Parsing Algorithm:
It consists of an input, an output, a stack, a driver program, and a parsing table
that has two parts , action and goto. The driver program is the same for all LR parsers,
only the parsing table changes from one parser to another. The parsing program reads
characters from an input buffer one at a time. The program uses a stack to store a string
of the form s0 X1 s1 X2 s2 … Xm sm, where sm is on top. Each Xi is a grammar symbol and
each si is a symbol called a state. Each state symbol summarizes the information
contained in the stack below it, and the combination of the state symbol on top of the
stack and the current input symbol are used to index the parsing table and to determine
the shift-reduce parsing decision.
The parsing table consists of two parts, a parsing action function action and a
goto function goto. The program driving the LR parser behaves as follows. It determines
sm, the state currently on top of the stack, and ai, the current input symbol. It then consults
action[sm, ai], the parsing action table entry for state sm and input ai, which can have one of
four values:
1. Shift
If action[sm, ai] = shift s, the parser shifts the next input symbol and the state s onto the
stack:
(s0 X1 s1 … Xm sm, ai ai+1 … an $) -> (s0 X1 s1 … Xm sm ai s, ai+1 … an $)
Here the parser has shifted both the current input symbol ai and the next state s,
which is given in action[sm, ai], onto the stack; ai+1 becomes the current input symbol.
2. Reduce
If action[sm, ai] = reduce A → β, then the parser executes a reduce move:
(s0 X1 s1 … Xm sm, ai ai+1 … an $) -> (s0 X1 s1 … Xm-r sm-r A s, ai ai+1 … an $)
where s = goto[sm-r, A] and r is the length of β, the right side of the production.
Here the parser first popped 2r symbols off the stack (r state symbols and r grammar
symbols), exposing state sm-r. The parser then pushed both A, the left side of the
production, and s, the entry for goto[sm-r, A], onto the stack. The current input symbol is
not changed in a reduce move.
3. Accept
If action[sm, ai] = accept, parsing is successfully completed.
4. Error
If action[sm, ai] = error, the parser has detected an error (an empty entry in the action
table).
(Figure: model of an LR parser. The input buffer holds a1 … ai … an $; the stack holds s0 X1 s1 … Xm sm with sm on top; the LR parsing program consults a parsing table whose ACTION part is indexed by states and terminals (including $), each entry one of the four actions, and whose GOTO part is indexed by states and nonterminals, each entry a state number; the parse is produced as output.)
Algorithm 2.7: LR Parsing Algorithm
Input: An input string w and an LR parsing table with functions action and goto for a
grammar.
Output: If w is in L(G), a bottom-up parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state and w$ in the
input buffer. The parser then executes the program below until an accept or error is
encountered.
Set ip to point to the first symbol of w$
Repeat forever begin
Let s be the state on top of the stack and a the symbol pointed to by ip;
if action[s, a] = shift s' then begin
push a then s' on top of the stack;
advance ip to the next input symbol
end
else if action[s, a] = reduce A → β then begin
pop 2*|β| symbols off the stack;
let s' be the state now on top of the stack;
push A then goto[s', A] on top of the stack;
output the production A → β
end
else if action[s, a] = accept then
return
else error()
end
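The driver can be sketched in Python as below. One common simplification, used here, is to keep only states on the stack (the grammar symbols are implicit), so a reduce pops |β| stack entries instead of 2|β|; the ACTION/GOTO encoding as dicts is likewise an assumption of the example:

def lr_parse(action, goto_table, tokens):
    stack = [0]                                 # s0 on the stack
    input_ = tokens + ['$']
    i = 0
    while True:
        s, a = stack[-1], input_[i]
        act = action.get((s, a))
        if act is None:
            raise SyntaxError('error in state %d on %s' % (s, a))
        if act[0] == 'shift':
            stack.append(act[1]); i += 1        # push state s', advance input
        elif act[0] == 'reduce':
            A, beta = act[1], act[2]            # production A -> beta
            del stack[len(stack) - len(beta):]  # pop |beta| states
            stack.append(goto_table[(stack[-1], A)])
            print(A, '->', ' '.join(beta) or 'eps')   # output the production
        elif act[0] == 'accept':
            return True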
Example :
The Parsing action and goto function of an LR parsing table for the following grammar is
shown below
(1) E->E+T
(2) E->T
(3) T->T*F
(4) T->F
(5) F->(E)
(6) F->id
The codes for the actions are:
1. si means shift and stack state i,
2. rj means reduce by the production numbered j,
3. acc means accept,
4. blank means error.
On input id*id+id, the sequence of stack and input contents is shown below. At
line (1) the LR parser is in state 0 with id the first input symbol. The action in row 0 and
column id of the action field is s5, meaning shift and cover the stack with state 5.
LR Grammars
A grammar for which we can construct a parsing table is said to be an LR
grammar. An LR parser does not have to scan the entire stack to know when the handle
appears on top. Rather, the state symbol on top of the stack contains all the information it
needs.
Difference between LL and LR grammar
For a grammar to be LR(k), we must be able to recognize the occurrence of the right
side of a production, having seen all of what is derived from that right side, with k symbols of
lookahead. In an LL(k) grammar we must be able to recognize the use of a production
seeing only the first k symbols of what its right side derives. Thus LR grammars can
describe more languages than LL grammars.
Methods of LR Parsing
There are three methods of LR Parsing
1. Simple LR (SLR) Parsing
2. Canonical LR (CLR) Parsing
3. Look Ahead LR (LALR) Parsing
SLR PARSER
The Simple LR(SLR) parsing is the weakest of three methods of parsing in terms
of the number of grammars for which it succeeds, but it is the easiest method to
implement.
An LR(0) item of a grammar G is a production of G with dot at some position of
the right side. Thus, production A->XYZ yields the four items
A->.XYZ
A->X.YZ
A->XY.Z
A->XYZ.
The production A → ε generates only one item, A → · . An item can be represented
by a pair of integers, the first giving the number of the production and the second the position of
the dot. If G is a grammar with start symbol S, then G', the augmented grammar for G, is G
with a new start symbol S' and the production S' → S. The purpose of this new starting
production is to indicate to the parser when it should stop parsing and announce
acceptance of the input. That is, acceptance occurs when and only when the parser is
about to reduce by S' → S.
The closure operation:
If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0)
items constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α·Bβ is in closure(I) and B → γ is a production rule of G, then
add the item B → ·γ to closure(I), if it is not already there. We apply this rule until no
more new LR(0) items can be added to closure(I).
The function closure can be computed as below
function closure(I)
begin
J := I;
repeat
for each item A → α·Bβ in J and each production
B → γ of G such that B → ·γ is not in J do
add B → ·γ to J
until no more items can be added to J;
return J
end
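In Python, representing an LR(0) item as a triple (head, body, dot position) — an assumed encoding — closure can be sketched as:

def closure(items, grammar):
    J = set(items)
    changed = True
    while changed:
        changed = False
        for (A, body, dot) in list(J):
            if dot < len(body) and body[dot] in grammar:   # dot before nonterminal B
                B = body[dot]
                for gamma in grammar[B]:
                    item = (B, tuple(gamma), 0)            # B -> .gamma
                    if item not in J:
                        J.add(item); changed = True
    return frozenset(J)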
If one B-production is added to the closure of I with the dot at the left end, then
all B-productions will similarly be added to the closure. The sets of items divide into
two classes:
1. Kernel items, which include the initial item S' → ·S and all items whose
dots are not at the left end.
2. Nonkernel items, which have their dots at the left end (all but S' → ·S).
The GOTO Operation
goto(I, X) is defined to be the closure of the set of all items [A → αX·β] such
that [A → α·Xβ] is in I.
The Sets-Of-Items Construction:
procedure items(G')
begin
C := { closure({ [S' → ·S] }) };
repeat
for each set of items I in C and each grammar symbol X
such that goto(I, X) is not empty and not in C do
add goto(I, X) to C
until no more sets of items can be added to C
end
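Continuing the sketch, goto and the sets-of-items construction build on the closure function above; "S'" is the assumed name of the added start symbol:

def goto(I, X, grammar):
    moved = {(A, body, dot + 1)
             for (A, body, dot) in I
             if dot < len(body) and body[dot] == X}
    return closure(moved, grammar) if moved else frozenset()

def canonical_collection(grammar, start):
    I0 = closure({("S'", (start,), 0)}, grammar)           # S' -> .S
    C = [I0]
    symbols = set(grammar) | {s for ps in grammar.values() for p in ps for s in p}
    changed = True
    while changed:
        changed = False
        for I in list(C):
            for X in symbols:
                J = goto(I, X, grammar)
                if J and J not in C:
                    C.append(J); changed = True
    return C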
SLR Parsing Tables
Given a grammar G, we augment G to produce G', and from G' we
construct C, the canonical collection of sets of items for G'. We construct action,
the parsing action function, and goto, the goto function, from C using the
following algorithm. It requires us to know FOLLOW(A) for each non-
terminal A of the grammar.
Algorithm 2.8 – Constructing an SLR parsing Table
INPUT: An augmented grammar G'
OUTPUT: The SLR-parsing table functions ACTION and GOTO for G'
METHOD:
1. Construct C = {I0, I1, . . . , In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii . The parsing actions for state i are determined as
follows:
(a) If [A → α·aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift
j." Here a must be a terminal.
(b) If [A → α·] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in
FOLLOW(A); here A may not be S'.
(c) If [S' → S·] is in Ii, then set ACTION[i, $] to "accept."
If any conflicting actions are generated by the above rules, we say the grammar
is not SLR(1). The algorithm fails to produce a parser in this case.
3. The goto transitions for state i are constructed for all nonterminals A
using the rule: If GOTO(Ii,A) = Ij, then goto[i, A] = j
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set of items
containing [S' -> .S].
Example : Let us consider the grammar
(1) E -> E+T (4) T -> F
(2) E -> T (5) F -> (E)
(3) T -> T*F (6) F -> id
In the above grammar, FIRST and FOLLOW are:

Non-terminal    FIRST       FOLLOW
E               ( , id      +, ), $
T               ( , id      +, *, ), $
F               ( , id      +, *, ), $
The augmented grammar is
E' -> E
E -> E+T
E -> T
T -> T*F
T -> F
F -> (E)
F -> id
I0 = closure({E' -> .E}):
    E' -> .E
    E -> .E+T
    E -> .T
    T -> .T*F
    T -> .F
    F -> .(E)
    F -> .id
I1 = goto(I0, E):
    E' -> E.
    E -> E.+T
I2 = goto(I0, T):
    E -> T.
    T -> T.*F
I3 = goto(I0, F):
    T -> F.
I4 = goto(I0, ( ):
    F -> (.E)
    E -> .E+T
    E -> .T
    T -> .T*F
    T -> .F
    F -> .(E)
    F -> .id
I5 = goto(I0, id):
    F -> id.
I6 = goto(I1, +):
    E -> E+.T
    T -> .T*F
    T -> .F
    F -> .(E)
    F -> .id
I7 = goto(I2, *):
    T -> T*.F
    F -> .(E)
    F -> .id
I8 = goto(I4, E):
    F -> (E.)
    E -> E.+T
I9 = goto(I6, T):
    E -> E+T.
    T -> T.*F
I10 = goto(I7, F):
    T -> T*F.
I11 = goto(I8, ) ):
    F -> (E).
The remaining gotos lead back to states already constructed:
goto(I4, T) = I2, goto(I4, F) = I3, goto(I4, ( ) = I4, goto(I4, id) = I5,
goto(I6, F) = I3, goto(I6, ( ) = I4, goto(I6, id) = I5,
goto(I7, ( ) = I4, goto(I7, id) = I5, goto(I8, +) = I6, goto(I9, *) = I7.
Creation of SLR Parsing table:
1. To fill the shift action
Take all gotos in the LR(0) items constructed. If goto(Ii, x) = Ij, where Ii and Ij are
states and x is a terminal, then fill action[i, x] = shift j.
Example: Now consider I1
E‟ -> E.
E -> E.+T
The second item yields action[1,+] = shift 6 (i.e., S6) because goto(I1, +)=I6.
Now consider I2
E -> T.
T -> T. * F
The second item yields action[2,*] = shift 7 (i.e., S7) because goto(I2, *)=I7.
[GOTO graph of the canonical LR(0) collection: states I0 through I11 with edges
labeled id, +, *, (, ), E, T and F, as given by the goto transitions listed above.]
2. To fill the reduce action
Find the completed items, i.e., items in which the dot '.' is at the right end. If
Ii contains a production with the dot at the right end, A -> α., then fill action[i, a]
= reduce A -> α for every a in FOLLOW(A).
Example: Now consider I2
E -> T.
T -> T. * F
The first item has the dot at the right end. Since FOLLOW(E) = {+, ), $}, we fill
action[2,+] = action[2,)] = action[2,$] = reduce E -> T (i.e., r2).
3. To fill the accept action
If [S' -> S.] is in Ii, then set action[i, $] to accept.
Example: Now consider I1
E‟ -> E.
E -> E.+T
The first item contains [E' -> E.], so action[1,$] = accept.
4. To fill the error action
All entries not defined are made “error”.
state    id     +      *      (      )      $      E    T    F
0        s5                   s4                   1    2    3
1               s6                          acc
2               r2     s7            r2     r2
3               r4     r4            r4     r4
4        s5                   s4                   8    2    3
5               r6     r6            r6     r6
6        s5                   s4                        9    3
7        s5                   s4                             10
8               s6                   s11
9               r1     s7            r1     r1
10              r3     r3            r3     r3
11              r5     r5            r5     r5
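To see the table in action, the following C sketch drives a parse of id*id+id with it.
The token codes, the packed ACTION encoding (positive = shift to that state, negative
= reduce by that production, ACC = accept, 0 = error) and all identifiers are
conveniences invented for this sketch, not part of any standard:

#include <stdio.h>

#define ACC 100
enum { ID, PLUS, STAR, LP, RP, END, NTERM };   /* terminal indices */

static const int action[12][NTERM] = {
    /*        id   +    *    (    )    $  */
    /* 0*/ {  5,   0,   0,   4,   0,   0  },
    /* 1*/ {  0,   6,   0,   0,   0,  ACC },
    /* 2*/ {  0,  -2,   7,   0,  -2,  -2  },
    /* 3*/ {  0,  -4,  -4,   0,  -4,  -4  },
    /* 4*/ {  5,   0,   0,   4,   0,   0  },
    /* 5*/ {  0,  -6,  -6,   0,  -6,  -6  },
    /* 6*/ {  5,   0,   0,   4,   0,   0  },
    /* 7*/ {  5,   0,   0,   4,   0,   0  },
    /* 8*/ {  0,   6,   0,   0,  11,   0  },
    /* 9*/ {  0,  -1,   7,   0,  -1,  -1  },
    /*10*/ {  0,  -3,  -3,   0,  -3,  -3  },
    /*11*/ {  0,  -5,  -5,   0,  -5,  -5  },
};
/* GOTO indexed by nonterminal: 0 = E, 1 = T, 2 = F */
static const int go[12][3] = {
    {1,2,3},{0,0,0},{0,0,0},{0,0,0},{8,2,3},{0,0,0},
    {0,9,3},{0,0,10},{0,0,0},{0,0,0},{0,0,0},{0,0,0},
};
static const int rlen[7] = { 0, 3, 1, 3, 1, 3, 1 };  /* body lengths, prods 1..6 */
static const int rlhs[7] = { 0, 0, 0, 1, 1, 2, 2 };  /* head: 0=E, 1=T, 2=F */

int main(void) {
    int input[] = { ID, STAR, ID, PLUS, ID, END };   /* id*id+id$ */
    int stack[64] = { 0 }, top = 0, ip = 0;
    for (;;) {
        int a = action[stack[top]][input[ip]];
        if (a == ACC)   { puts("accept"); return 0; }
        else if (a > 0) { stack[++top] = a; ip++; }            /* shift  */
        else if (a < 0) {                                      /* reduce */
            top -= rlen[-a];                   /* pop |body| states      */
            stack[top + 1] = go[stack[top]][rlhs[-a]];
            top++;
            printf("reduce by production %d\n", -a);
        }
        else { puts("error"); return 1; }
    }
}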
Shift/reduce and reduce/reduce conflicts
• If a state does not know whether it will make a shift operation or reduction
for a terminal, we say that there is a shift/reduce conflict.
• If a state does not know whether it will make a reduction operation using
the production rule i or j for a terminal, we say that there is a
reduce/reduce conflict.
• If the SLR parsing table of a grammar G has a conflict, we say that the
grammar is not an SLR grammar.
Conflict Example – shift/reduce conflict
Consider the grammar
S -> L=R
S -> R
L -> *R
L -> id
R -> L
The canonical collection of sets of items is
I0: S' -> .S
    S -> .L=R
    S -> .R
    L -> .*R
    L -> .id
    R -> .L
I1: S' -> S.
I2: S -> L.=R
    R -> L.
I3: S -> R.
I4: L -> *.R
    R -> .L
    L -> .*R
    L -> .id
I5: L -> id.
I6: S -> L=.R
    R -> .L
    L -> .*R
    L -> .id
I7: L -> *R.
I8: R -> L.
I9: S -> L=R.
Consider the set of items I2. The first item in this set makes action[2, =] be
"shift 6." Since FOLLOW(R) contains =, the second item sets action[2, =] to
reduce R -> L. Since there is both a shift and a reduce entry in action[2, =], state
2 has a shift/reduce conflict on input symbol =.
Conflict Example – reduce/reduce conflict
Consider the grammar
S -> AaAb
S -> BbBa
A -> є
B -> є
Its initial set of items is
I0: S' -> .S
    S -> .AaAb
    S -> .BbBa
    A -> .
    B -> .
Here FOLLOW(A) = {a, b} and FOLLOW(B) = {a, b}. Since the completed items
[A -> .] and [B -> .] are both in I0, action[0, a] calls for both reduce A -> є and
reduce B -> є, and likewise action[0, b]; state 0 therefore has a reduce/reduce
conflict on both a and b.
CANONICAL LR PARSER
In the SLR method, state i calls for reduction by A -> α if the set of items Ii
contains item [A -> α.] and a is in FOLLOW(A). In some situations, however, when
state i appears on top of the stack, the viable prefix βα on the stack is such that βα
cannot be followed by a in any right-sentential form. Thus, the reduction by A -> α
would be invalid on input a.
Let us reconsider the non-SLR grammar discussed above. In state 2 we had
item R -> L., which could correspond to A -> α above, and a could be the = sign,
which is in FOLLOW(R). Thus, the SLR parser calls for reduction by R -> L in state 2
with = as the next input (the shift action is also called for, because of item S -> L.=R
in state 2). However, there is no right-sentential form of the grammar that begins
R = … . Thus state 2, which is the state corresponding to viable prefix L only,
should not really call for reduction of that L to R.
It is possible to carry more information in the state that will allow us to rule
out some of these invalid reductions by A->α. By splitting states when necessary,
we can arrange to have each state of an LR parser indicate exactly
which input symbols can follow a handle α for which there is a possible reduction
to A.
The extra information is incorporated into the state by redefining items to include
a terminal symbol as a second component. The general form of an item
becomes [A ->α.β, a], where A ->αβ is a production and a is a terminal or the right
endmarker $. We call such an object an LR(1) item. The 1 refers to the length of
the second component, called the lookahead of the item. The lookahead has no
effect in an item of the form [A->α.β, a], where β is not є, but an item of the form
[A->α.,a] calls for a reduction by A->α only if the next input symbol is a. Thus, we
are compelled to reduce by A ->α only on those input symbols a for which [A->α.,
a] is an LR(1) item in the state on top of the stack. The set of such a's will always
be a subset of FOLLOW(A).
Formally, we say LR(1) item [A -> α.β, a] is valid for a viable prefix γ if
there is a rightmost derivation S =>* δAw => δαβw, where
1. γ = δα, and
2. either a is the first symbol of w, or w is є and a is $.
Let us consider the grammar
S->BB
B->aB | b
There is a rightmost derivation S =>* aaBab => aaaBab. We see that item
[B -> a.B, a] is valid for the viable prefix γ = aaa by letting δ = aa, A = B, w = ab,
α = a, and β = B in the above definition. There is also a rightmost derivation
S =>* BaB => BaaB. From this derivation we see that item [B -> a.B, $] is valid for
the viable prefix Baa.
Constructing LR(1) Sets of Items
The method for building the collection of sets of valid LR(1) items is
essentially the same as the one for building the canonical collection of sets of
LR(0) items. We need only to modify the two procedures CLOSURE and GOTO.
Algorithm 2.9: Construction of the sets of LR(1) items
Input: An Augmented grammar G‟
Output: The sets of LR(1) items that are the set of items valid for one or more
viable prefixes of G‟.
Method: The procedures closure and goto and the main routine items for
constructing the sets of items.
function closure(I)
begin
    repeat
        for each item [A -> α.Bβ, a] in I,
            each production B -> γ in G',
            and each terminal b in FIRST(βa)
            such that [B -> .γ, b] is not in I do
                add [B -> .γ, b] to I
    until no more items can be added to I;
    return I
end
function goto(I, X)
begin
    let J be the set of items [A -> αX.β, a] such that
        [A -> α.Xβ, a] is in I;
    return closure(J)
end
procedure items(G')
begin
    C := { closure({ [S' -> .S, $] }) };
repeat
for each set of items I in C and each grammar symbol X
such that goto(I,X) is not empty and not in C do
add goto(I,X) to C;
until no more sets of items can be added to C;
end
Algorithm 2.10 : Construction of canonical-LR parsing tables.
INPUT: An augmented grammar G‟.
OUTPUT: The canonical-LR parsing table functions action and goto for G'.
METHOD:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for G'.
2. State i of the parser is constructed from Ii. The parsing action for state i is
determined as follows.
(a) If [A -> α.aβ, b] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a
must be a terminal.
(b) If [A -> α., a] is in Ii and A ≠ S', then set action[i, a] to "reduce A -> α."
(c) If [S' -> S., $] is in Ii, then set action[i, $] to "accept."
If any conflicting actions result from the above rules, we say the grammar is not
LR(1). The algorithm fails to produce a parser in this case.
3. The goto transitions for state i are determined as follows: If goto(Ii,A) = Ij , then
goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set containing item
[S' -> .S, $].
Example:
Consider the grammar
S->CC
C->cC
C->d
The augmented grammar is
S' -> S
with the productions numbered
(1) S -> CC
(2) C -> cC
(3) C -> d
We begin by computing the closure of {[S' -> .S, $]}. To close, we match the
item [S' -> .S, $] with the item [A -> α.Bβ, a] in the procedure closure. That is,
A = S', α = є, B = S, β = є and a = $. Function closure tells us to add [B -> .γ, b] for each
production B -> γ and terminal b in FIRST(βa). In terms of the present grammar,
B -> γ must be S -> CC, and since β is є and a is $, b may only be $. Thus we add
[S -> .CC, $].
We continue to compute the closure by adding all items [C -> .γ, b] for b in
FIRST(C$). That is, matching [S -> .CC, $] against [A ->α.Bβ, a] , we have A = S,
α=є , B = C, β = C, and a = $. Since C does not derive the empty string,
FIRST(C$) = FIRST(C). Since FIRST(C) contains terminals c and d, we add items
[C -> .cC, c], [C -> .cC, d], [C -> .d, c] and [C -> .d, d]. None of the new items has
a nonterminal immediately to the right of the dot, so we have completed I0, our
first set of LR(1) items.
I0: S' -> .S , $
    S -> .CC , $
    C -> .cC , c/d
    C -> .d , c/d
I1 = goto(I0, S): S' -> S. , $
I2 = goto(I0, C): S -> C.C , $
    C -> .cC , $
    C -> .d , $
I3 = goto(I0, c): C -> c.C , c/d
    C -> .cC , c/d
    C -> .d , c/d
I4 = goto(I0, d): C -> d. , c/d
I5 = goto(I2, C): S -> CC. , $
I6 = goto(I2, c): C -> c.C , $
    C -> .cC , $
    C -> .d , $
I7 = goto(I2, d): C -> d. , $
I8 = goto(I3, C): C -> cC. , c/d
I9 = goto(I6, C): C -> cC. , $
goto(I3, c) = I3, goto(I3, d) = I4, goto(I6, c) = I6, goto(I6, d) = I7.
The difference between I3 and I6 is only in the second component. The first
component is the same.
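A C sketch of the LR(1) closure for this grammar may make the lookahead
computation concrete. The grammar is hard-coded ('Z' stands in for S'), FIRST is
trivial here since FIRST(S) = FIRST(C) = {c, d} and no nonterminal derives є, and
every name in the sketch is invented for illustration:

#include <stdio.h>
#include <string.h>

/* Productions: 0: S'->S, 1: S->CC, 2: C->cC, 3: C->d */
static const char lhs[]  = { 'Z', 'S', 'C', 'C' };
static const char *rhs[] = { "S", "CC", "cC", "d" };
#define NPROD 4

typedef struct { int prod, dot; char la; } Item;   /* LR(1) item */

static int contains(const Item *I, int n, Item it) {
    for (int k = 0; k < n; k++)
        if (I[k].prod == it.prod && I[k].dot == it.dot && I[k].la == it.la)
            return 1;
    return 0;
}

/* first(X, a): FIRST of the string Xa, where X is the symbol that
   follows B in the item body ('\0' when B is last). */
static void first(char X, char a, char *out) {
    if (X == 'S' || X == 'C') strcpy(out, "cd");        /* FIRST(X)   */
    else if (X == '\0')       { out[0] = a; out[1] = 0; } /* FIRST(a) */
    else                      { out[0] = X; out[1] = 0; } /* terminal */
}

static int closure(Item *I, int n) {
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int k = 0; k < n; k++) {
            const char *r = rhs[I[k].prod];
            char B = r[I[k].dot];
            if (B != 'S' && B != 'C') continue;   /* dot not before a nonterminal */
            char la[4];
            first(r[I[k].dot + 1], I[k].la, la);  /* FIRST(beta a) */
            for (int p = 0; p < NPROD; p++) {
                if (lhs[p] != B) continue;
                for (char *b = la; *b; b++) {
                    Item it = { p, 0, *b };
                    if (!contains(I, n, it)) { I[n++] = it; changed = 1; }
                }
            }
        }
    }
    return n;
}

int main(void) {
    Item I0[32] = { { 0, 0, '$' } };   /* [S' -> .S, $] */
    int n = closure(I0, 1);            /* yields the six items of I0 */
    for (int k = 0; k < n; k++)
        printf("[%c -> %.*s.%s , %c]\n", lhs[I0[k].prod],
               I0[k].dot, rhs[I0[k].prod], rhs[I0[k].prod] + I0[k].dot,
               I0[k].la);
    return 0;
}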
The GOTO Graph
The GOTO graph for this grammar connects the sets I0 through I9 by the goto
transitions listed above. Every SLR(1) grammar is an LR(1) grammar, but for an
SLR(1) grammar the canonical LR parser may have more states than the SLR parser
for the same grammar. The grammar of the previous example is SLR and has an SLR
parser with seven states, compared with the ten states of the canonical LR parser.
Canonical Parsing table construction:
1. To Fill the shift action
Take all gotos in the LR(1) items constructed. If goto(Ii, x) = Ij, where Ii and Ij
are states and x is a terminal, then fill action[i, x] = shift j.
Example: Now consider I0
S' -> .S, $
S -> .CC, $
C -> .cC , c/d
C -> .d , c/d
The Third item yields action[0,c] = shift 3 (i.e., S3) because goto(I0, c)=I3.
The fourth item yields action[0,d] = shift 4 (i.e., S4) because goto(I0, d) = I4.
2. To fill the reduce action
Find the completed items, i.e., items in which the dot '.' is at the
right end. If Ii contains a production with the dot at the right end (i.e., [A -> α., a]
with A ≠ S'), then fill action[i, a] = reduce A -> α.
Example: Now consider I4
C->d., c/d
The item has the dot at the right end, so action[4, c] = r3 and action[4, d] = r3.
3. To fill the accept action
If [S' -> S., $] is in Ii, then set action[i, $] to accept.
Example: Now consider I1
S‟->S., $
The item contains [S' -> S., $], so action[1,$] = accept.
4. To fill the error action
All entries not defined are made "error". The resulting canonical LR parsing table is:

state    c      d      $      S    C
0        s3     s4            1    2
1                      acc
2        s6     s7                 5
3        s3     s4                 8
4        r3     r3
5                      r1
6        s6     s7                 9
7                      r3
8        r2     r2
9                      r2
LALR PARSER
The last parsing method, the LALR (lookahead-LR) technique, is often used in
practice because the tables obtained by it are considerably smaller than the
canonical LR tables. The SLR and LALR tables for a grammar always have the
same number of states, whereas the canonical LR table typically has many more.
For a language like Pascal, the SLR and LALR tables have hundreds of states,
but the canonical LR table has several thousand.
Let us again consider the grammar
S->CC
C->cC
C->d
whose sets of LR(1) items were listed above.
Take a pair of similar looking states, such as I4 and I7. Each of these states has
only items with first component C->d.. In I4 the lookaheads are c or d; in I7, $ is
the only lookahead.
To see the difference between the roles of I4 and I7 in the parser, note that
the grammar generates the regular language c*dc*d. When reading an input
cc...cdcc...cd, the parser shifts the first group of c's and their following d onto the
stack, entering state 4 after reading the d. The parser then calls for a reduction by
C->d, provided the next input symbol is c or d. The requirement that c or d follow
makes sense, since these are the symbols that could begin strings in c*d. If $
follows the first d, we have an input like ccd, which is not in the language, and
state 4 correctly declares an error if $ is the next input. The parser enters state 7
after reading the second d. Then, the parser must see $ on the input, or it started
with a string not of the form c*dc*d. It thus makes sense that state 7 should reduce
by C-> d on input $ and declare error on inputs c or d.
Let us now replace I4 and I7 by I47 , the union of I4 and I7 , consisting of the set of
three items represented by [C -> d., c/d/$]. The goto's on d to I4 or I7 from I0, I2, I3
and I6 now enter I47. The action of state 47 is to reduce on any input. The revised
parser behaves essentially like the original, although it might reduce d to C in
circumstances where the original would declare error, for example, on input like
ccd or cdcdc. The error will eventually be caught; in fact, it will be caught before
any more input symbols are shifted.
Generally, we can look for sets of LR(1) items having the same core, that is,
the same set of first components, and we may merge these sets with common cores
into one set of items.
For example, in the listing above, I4 and I7 form such a pair, with core {C ->
d.}. Similarly, I3 and I6 form another pair, with core {C -> c.C, C -> .cC, C -> .d}.
There is one more pair, I8 and I9, with common core {C -> cC.}.
The merging of states with common cores can never produce a shift/reduce
conflict that was not present in one of the original states, because shift actions
depend only on the core, not on the lookahead; such a conflict can arise only if the
grammar is not LR(1) in the first place. It is possible, however, that a merger will
produce a reduce/reduce conflict.
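The notion of a common core is easy to state in code. In the following C sketch
(names invented; items encoded as in the earlier LR(1) sketch, with productions
numbered 0: S'->S, 1: S->CC, 2: C->cC, 3: C->d), two item sets share a core when
they contain the same (production, dot) pairs, lookaheads ignored:

#include <stdio.h>

typedef struct { int prod, dot; char la; } Item;   /* LR(1) item */

static int core_has(const Item *I, int n, int prod, int dot) {
    for (int k = 0; k < n; k++)
        if (I[k].prod == prod && I[k].dot == dot) return 1;
    return 0;
}

/* Each (prod, dot) pair of one set must occur in the other. */
static int same_core(const Item *I, int n, const Item *J, int m) {
    for (int k = 0; k < n; k++)
        if (!core_has(J, m, I[k].prod, I[k].dot)) return 0;
    for (int k = 0; k < m; k++)
        if (!core_has(I, n, J[k].prod, J[k].dot)) return 0;
    return 1;
}

int main(void) {
    Item I4[] = { { 3, 1, 'c' }, { 3, 1, 'd' } };  /* C -> d. , c/d */
    Item I7[] = { { 3, 1, '$' } };                 /* C -> d. , $   */
    printf("same core: %s\n", same_core(I4, 2, I7, 1) ? "yes" : "no");
    return 0;
}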
Algorithm 2.11 : LALR table construction.
INPUT: An augmented grammar G'.
OUTPUT: The LALR parsing-table functions ACTION and GOT0 for G'.
METHOD:
VI SEM CS6660-COMPILER DESIGN
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.
2. For each core present among the set of LR(1) items, find all sets having that
core, and replace these sets by their union.
3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The parsing actions
for state i are constructed from Ji in the same manner as in Algorithm 2.10. If
there is a parsing action conflict, the algorithm fails to produce a parser, and the
grammar is said not to be LALR(1).
4. The GOTO table is constructed as follows. If J is the union of one or more sets
of LR(1) items, that is, J = I1 U I2 U ... U Ik, then the cores of GOTO(I1, X),
GOTO(I2, X), ..., GOTO(Ik, X) are the same, since I1, I2, ..., Ik all have the same
core. Let K be the union of all sets of items having the same core as GOTO(I1, X).
Then GOTO(J, X) = K.
Example: Again consider the grammar discussed above, whose sets of LR(1) items
were listed earlier. As we mentioned, there are three pairs of sets of items
that can be merged.
I3 and I6 are replaced by their union.
I36: C->c.C, c/d/$
C->.cC, c/d/$
C-> .d, c/d/$
I4 and I7 are replaced by their union:
I47: C->d., c/d/$
and I8 and I9 are replaced by their union:
I89: C->cC., c/d/$
The LALR action and goto functions for the condensed sets of items are shown below:

state    c       d       $      S    C
0        s36     s47            1    2
1                        acc
2        s36     s47                 5
36       s36     s47                 89
47       r3      r3      r3
5                        r1
89       r2      r2      r2
Syntax Error Handling
Common programming errors can occur at many different levels.
Lexical errors include misspellings of identifiers, keywords, or operators -
e.g., the use of an identifier elipsesize instead of ellipsesize – and missing quotes
around text intended as a string.
Syntactic errors include misplaced semicolons or extra or missing braces, that is,
an extra or missing "{" or "}". As another example, in C or Java, the appearance of
a case statement without an enclosing switch is a syntactic error.
Semantic errors include type mismatches between operators and operands.
An example is a return statement in a Java method with result type void.
Logical errors can be anything from incorrect reasoning on the part of the
programmer to the use in a C program of the assignment operator = instead of the
comparison operator ==.
The precision of parsing methods allows syntactic errors to be detected very
efficiently. Several parsing methods, such as the LL and LR methods, detect an
error as soon as possible; that is, when the stream of tokens from the lexical
analyzer cannot be parsed further according to the grammar for the language.
More precisely, they have the viable-prefix property, meaning that they detect that
an error has occurred as soon as they see a prefix of the input that cannot be
completed to form a string in the language.
Another reason for emphasizing error recovery during parsing is that many errors
appear syntactic, whatever their cause, and are exposed when parsing cannot
continue. A few semantic errors, such as type mismatches, can also be detected
efficiently; however, accurate detection of semantic and logical errors at compile
time is in general a difficult task.
The error handler in a parser has goals that are simple to state but challenging to
realize:
Report the presence of errors clearly and accurately.
Recover from each error quickly enough to detect subsequent errors.
Add minimal overhead to the processing of correct programs.
Error-Recovery Strategies
Once an error is detected, how should the parser recover?
The simplest approach is for the parser to quit with an informative error message
when it detects the first error.
Additional errors are often uncovered if the parser can restore itself to a state
where processing of the input can continue with reasonable hopes that the further
processing will provide meaningful diagnostic information.
If errors pile up, it is better for the compiler to give up after exceeding some error
limit than to produce an annoying avalanche of "spurious" errors.
The common error-recovery strategies are:
a. panic-mode
b. phrase-level
c. error-productions and
d. global-correction.
Panic-Mode Recovery
With this method, on discovering an error, the parser discards input symbols one
at a time until one of a designated set of synchronizing tokens is found. The
synchronizing tokens are usually delimiters, such as semicolon, whose role in the
source program is clear and unambiguous. The compiler designer must select the
synchronizing tokens appropriate for the source language.
Phrase-Level Recovery
On discovering an error, a parser may perform local correction on the remaining
input; that is, it may replace a prefix of the remaining input by some string that
allows the parser to continue. A typical local correction is to replace a comma by a
semicolon, delete an extraneous semicolon, or insert a missing semicolon.
Error Productions
By anticipating common errors that might be encountered, we can augment the
grammar for the language at hand with productions that generate the erroneous
constructs. A parser constructed from a grammar augmented by these error
productions detects the anticipated errors when an error production is used during
parsing. The parser can then generate appropriate error diagnostics about the
erroneous construct that has been recognized in the input.
Global Correction
Ideally, we would like a compiler to make as few changes as possible in
processing an incorrect input string. There are algorithms for choosing a minimal
sequence of changes to obtain a globally least-cost correction. Given an incorrect
input string x and grammar G, these algorithms will find a parse tree for a related
string y, such that the number of insertions, deletions, and changes of tokens
required to transform x into y is as small as possible.
YACC
YACC ("yet another compiler compiler") is the standard parser generator for the Unix
operating system. An open-source program, Yacc generates code for the parser in the C
programming language. The original version of Yacc was written by Stephen Johnson at
American Telephone and Telegraph (AT&T).
A translator can be constructed using Yacc as follows. First, a file translate.y containing
a Yacc specification of the translator is prepared. The UNIX
system command
yacc translate.y
transforms the file translate.y into a C program called y.tab.c using the LALR method.
The program y.tab.c is a representation of an LALR parser written in C, along with other
C routines that the user may have prepared. By compiling y.tab.c along with the ly
library that contains the LR parsing program, using the command
cc y.tab.c -ly
we obtain the desired object program a.out that performs the translation specified by the
original Yacc program. If other procedures are needed, they can be compiled or loaded
with y.tab.c.
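For instance, when the lexical analyzer is generated by Lex from a hypothetical
source file scanner.l, a typical command sequence is
lex scanner.l
yacc translate.y
cc y.tab.c lex.yy.c -ly -ll
where -ll supplies the Lex library routines, just as -ly supplies the Yacc library.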
A Yacc source program has three parts: a declarations part, translation rules, and
supporting C-routines, separated by %%.
The Yacc Specification
Yacc has its own specification language. A Yacc specification is structured along the
same lines as a Lex specification.
%{
/* C declarations and includes */
%}
/* Yacc token and type declarations */
%%
/* Yacc specification
in the form of grammar rules like this:
*/
symbol : symbols tokens
{ $$ = my_c_code($1); }
;
%%
/* C language program (the rest) */
Example: To illustrate how to prepare a Yacc source program, let us construct a simple
desk calculator that reads an arithmetic expression, evaluates it, and then prints its
numeric value. We shall build the desk calculator starting with the following grammar for
arithmetic expressions:
E -> E + T / T
T -> T * F / F
F -> ( E ) / digit
The token digit is a single digit between 0 and 9.
The Declarations Part
There are two sections in the declarations part of a Yacc program; both are optional. In
the first section, we put ordinary C declarations, delimited by %{ and %}.
Here we place declarations of any temporaries used by the translation rules or procedures
of the second and third sections.
This section contains only the include-statement
#include <ctype.h>
that causes the C preprocessor to include the standard header file <ctype.h> that contains
the predicate isdigit.
Also in the declarations part are declarations of grammar tokens. The statement
%token DIGIT
declares DIGIT to be a token.
Yacc specification of a simple desk calculator
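A minimal version of that specification, assuming main and yyerror are supplied by
the ly library mentioned earlier, is the following sketch:

%{
#include <ctype.h>
#include <stdio.h>
%}
%token DIGIT
%%
line   : expr '\n'         { printf("%d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       | DIGIT
       ;
%%
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';   /* attribute value of the DIGIT token */
        return DIGIT;
    }
    return c;               /* '+', '*', '(', ')', '\n' return as themselves */
}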
The Translation Rules Part
In the part of the Yacc specification after the first %% pair, we put the translation rules.
Each rule consists of a grammar production and the associated semantic action. A set of
productions that we have been writing as
<head> -> <body>1 | <body>2 | ... | <body>n
would be written in Yacc as
<head> : <body>1 { <semantic action>1 }
       | <body>2 { <semantic action>2 }
       ...
       | <body>n { <semantic action>n }
       ;
In a Yacc production, unquoted strings of letters and digits not declared to be tokens are
taken to be nonterminals. Alternative bodies can be separated by a vertical bar, and a
semicolon follows each head with its alternatives and their semantic actions. The first
head is taken to be the start symbol.
A Yacc semantic action is a sequence of C statements. In a semantic action, the symbol
$$ refers to the attribute value associated with the nonterminal of the head, while $i refers
to the value associated with the ith grammar symbol (terminal or nonterminal) of the
body. The semantic action is performed whenever we reduce by the associated
production, so normally the semantic action computes a value for $$ in terms of the $i's.
In the Yacc specification, we have written the two E-productions
E -> E + T / T
and their associated semantic actions as:
expr : expr '+' term { $$ = $1 + $3; }
| term
;
We have added a new starting production to the Yacc specification:
line : expr '\n' { printf("%d\n", $1); }
The Supporting C-Routines Part
The third part of a Yacc specification consists of supporting C-routines. A lexical
analyzer by the name yylex() must be provided. Using Lex to produce yylex() is a
common choice; other procedures, such as error recovery routines, may be added as
necessary.
Yacc specification for a more advanced desk calculator
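A sketch of such a specification, assuming it uses Yacc precedence declarations
(%left, %right) in place of the stratified grammar and computes with double values
via a NUMBER token, is:

%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double      /* attribute values are doubles */
%}
%token NUMBER
%left '+' '-'
%left '*' '/'
%right UMINUS
%%
lines : lines expr '\n'     { printf("%g\n", $2); }
      | lines '\n'
      | /* empty */
      ;
expr  : expr '+' expr       { $$ = $1 + $3; }
      | expr '-' expr       { $$ = $1 - $3; }
      | expr '*' expr       { $$ = $1 * $3; }
      | expr '/' expr       { $$ = $1 / $3; }
      | '(' expr ')'        { $$ = $2; }
      | '-' expr %prec UMINUS { $$ = -$2; }
      | NUMBER
      ;
%%
int yylex(void) {
    int c;
    while ((c = getchar()) == ' ')
        ;                   /* skip blanks */
    if (c == '.' || isdigit(c)) {
        ungetc(c, stdin);   /* put the first character back */
        scanf("%lf", &yylval);
        return NUMBER;
    }
    return c;
}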