AUTOMATA THEORY AND COMPILER DESIGN - 21CS51
MODULE 3
Context Free Grammars: Definition and designing CFGs, Derivations Using a Grammar, Parse Trees,
Ambiguity and Elimination of Ambiguity, Elimination of Left Recursion, Left Factoring.
FORMAL DEFINITIONS:
• There is a finite set of symbols that form the strings, i.e. there is a finite alphabet. The alphabet
symbols are called terminals.
• There is a finite set of variables, sometimes called non-terminals or syntactic categories. Each
variable represents a language (i.e. a set of strings).
– In the palindrome example, the only variable is P.
Leftmost Derivation
• In the previous example we used a derivation called a leftmost derivation. We can specifically
denote a leftmost derivation using the subscript “lm”, as in ⇒lm or ⇒*lm.
• A leftmost derivation is one in which, at every step, we replace the leftmost variable of the
current string by one of its production bodies, working our way from left to right.
Obtain a leftmost derivation for the string aaabbabbbba using the following grammar:
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b
Rightmost Derivation
• Not surprisingly, we also have a rightmost derivation, which we can specifically denote via
⇒rm or ⇒*rm.
• A rightmost derivation is one in which, at every step, we replace the rightmost variable by one of
its production bodies, working our way from right to left.
Rightmost Derivation Example
• a*(a+b1) was already shown previously using a leftmost derivation.
• The rightmost derivation makes the replacements in a different order:
– E ⇒rm E*E ⇒rm E*(E) ⇒rm E*(E+E) ⇒rm E*(E+I) ⇒rm E*(E+ID) ⇒rm E*(E+I1) ⇒rm
E*(E+L1) ⇒rm E*(E+b1) ⇒rm E*(I+b1) ⇒rm E*(L+b1) ⇒rm E*(a+b1) ⇒rm I*(a+b1) ⇒rm
L*(a+b1) ⇒rm a*(a+b1)
• Any derivation has an equivalent leftmost and rightmost derivation. That is, for a variable A and a
terminal string w, A ⇒* w iff A ⇒*lm w and A ⇒*rm w.
Parse trees
– A parse tree is a top-down representation of a derivation. It is a good way to visualize the
derivation process.
• Let G = (V,T,P,S) be a CFG. A tree is a derivation tree (parse tree) for G if it has the following properties:
– The root has label S.
– Every vertex has a label in (V ∪ T ∪ {ε}).
– Every leaf node has a label from T ∪ {ε} and every interior node has a label from V.
• If a vertex is labelled A and X1, X2, X3, X4, …, Xn are all the children of A from left to right, then
A → X1 X2 X3 X4 … Xn must be a production in P.
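• Sample parse tree for the palindrome CFG P → ε | 0 | 1 | 0P0 | 1P1 and the string 1110111:
[Figure: parse tree with root P. The production P → 1P1 is applied three times, each nested inside the previous one, and the innermost P derives 0.]
[Figure: parse tree for a*(a+b1), with root E expanding to E * E.]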
• The yield of the parse tree is the string that results when we concatenate the leaves from left to
right (e.g., doing a leftmost depth first search).
– The yield is always a string that is derived from the root and is guaranteed to be a string in
the language L.
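The yield can be computed with a short recursive walk. The following is a minimal sketch, not part of the original notes: the tuple representation (label, children) and the names leaf and tree_yield are assumptions made for illustration, and the tree is the one for 1110111 assuming the palindrome productions P → 1P1 and P → 0.

```python
def leaf(label):
    """A leaf is a node with no children (hypothetical helper)."""
    return (label, [])


def tree_yield(node):
    """Concatenate the leaf labels from left to right (depth-first)."""
    label, children = node
    if not children:
        # An ε leaf contributes the empty string to the yield.
        return "" if label == "ε" else label
    return "".join(tree_yield(child) for child in children)


# Parse tree for 1110111: P -> 1P1 three times, then P -> 0.
innermost = ("P", [leaf("0")])
level3 = ("P", [leaf("1"), innermost, leaf("1")])
level2 = ("P", [leaf("1"), level3, leaf("1")])
root = ("P", [leaf("1"), level2, leaf("1")])

assert tree_yield(root) == "1110111"
```

Visiting the children in order is what makes the result the left-to-right concatenation of the leaves.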
• E ⇒ E*E ⇒ E+E*E
Examples
Is the following grammar ambiguous?
– S → AS | ε
– A → A1 | 0A1 | 01
Inherent Ambiguity
A CFL L is said to be inherently ambiguous if all its grammars are ambiguous.
Example:
Consider the grammar for the string aabbccdd:
S → AB | C
A → aAb | ab
B → cBd | cd
C → aCd | aDd
D → bDc | bc
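One way to see that aabbccdd has more than one parse tree is to count them by brute force. The sketch below is an illustration rather than a method from these notes. It reads the productions as S → AB | C, A → aAb | ab, B → cBd | cd, C → aCd | aDd, D → bDc | bc, and it assumes a grammar with no ε-productions and no cyclic unit productions. The names GRAMMAR and count_parse_trees are hypothetical.

```python
from functools import lru_cache

GRAMMAR = {
    "S": [("A", "B"), ("C",)],
    "A": [("a", "A", "b"), ("a", "b")],
    "B": [("c", "B", "d"), ("c", "d")],
    "C": [("a", "C", "d"), ("a", "D", "d")],
    "D": [("b", "D", "c"), ("b", "c")],
}


def count_parse_trees(text, start="S"):
    """Count the parse trees with root `start` whose yield is `text`."""

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of trees rooted at `symbol` that yield text[i:j].
        if symbol not in GRAMMAR:  # terminal symbol
            return 1 if text[i:j] == symbol else 0
        return sum(split(body, i, j) for body in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def split(body, i, j):
        # Ways to divide text[i:j] among the symbols of `body`, in order.
        if not body:
            return 1 if i == j else 0
        head, rest = body[0], body[1:]
        # Every symbol derives at least one character (no ε-productions),
        # so leave at least one character for each remaining symbol.
        last = j - len(rest)
        return sum(trees(head, i, k) * split(rest, k, j)
                   for k in range(i + 1, last + 1))

    return trees(start, 0, len(text))


assert count_parse_trees("aabbccdd") == 2  # one tree via S -> AB, one via S -> C
assert count_parse_trees("aabbcd") == 1    # only S -> AB applies
```

A count greater than one for some string shows that this particular grammar is ambiguous. It says nothing by itself about inherent ambiguity, which is a property of the language.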
Removing Ambiguity
• No algorithm can tell us whether an arbitrary CFG is ambiguous in the first place.
2. Bottom-up parsers, which build parse trees from the leaves and work up to the root.
Therefore there are two types of parsing methods: top-down parsing and bottom-up parsing.
There are three general types of parsers for grammars: universal, top-down and bottom-up.
Logic errors occur when programs operate incorrectly but do not terminate abnormally (or crash).
Unexpected or undesired outputs or other behavior may result from a logic error, even if it is not
immediately recognized as such.
A run-time error is an error that takes place during the execution of a program and usually happens
because of adverse system parameters or invalid input data. The lack of sufficient memory to run an
Finding or reporting an error: the viable-prefix property of a parser allows early
detection of syntax errors.
Goal: detect an error as soon as possible, without consuming unnecessary input.
How: detect an error as soon as the prefix of the input does not match a prefix of any string in the
language.
Example: for(;) is reported as an error, because a for header needs two semicolons inside the parentheses.
Error Recovery –
The basic requirement for the compiler is to simply stop and issue a message, and cease compilation.
There are some common recovery methods that are as follows.
1. Panic mode recovery:
This is the easiest method of error recovery, and it also prevents the parser from entering infinite
loops while recovering from an error. The parser discards input symbols one at a time until one of a
designated set of synchronizing tokens (typically statement or expression terminators, such as end
or a semicolon) is found. This is adequate when the presence of multiple errors in the same
statement is rare. Example: consider the erroneous expression (1 + + 2) + 3. Panic-mode recovery:
skip ahead to the next integer and then continue. Bison: use the special terminal error to describe
how much input to skip.
E->int|E+E|(E)|error int|(error)
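The skipping step of panic-mode recovery can be sketched in a few lines. This is an illustrative sketch only, not how Bison implements its error token. The function name skip_to_sync and the idea of passing the synchronizing set as a predicate are assumptions. It is applied to the expression (1 + + 2) + 3, where the error is noticed at the second +.

```python
def skip_to_sync(tokens, pos, is_sync):
    """Discard tokens one at a time, starting at `pos`, until a
    synchronizing token is found. Returns the position at which parsing
    can resume and the list of discarded tokens."""
    skipped = []
    while pos < len(tokens) and not is_sync(tokens[pos]):
        skipped.append(tokens[pos])
        pos += 1
    return pos, skipped


tokens = ["(", "1", "+", "+", "2", ")", "+", "3"]
# After "( 1 +" the parser expects an operand but sees "+" at index 3.
# Here the synchronizing tokens are the integers, as in the example above.
resume_at, skipped = skip_to_sync(tokens, 3, str.isdigit)

assert (resume_at, skipped) == (4, ["+"])
assert tokens[resume_at] == "2"
```

The loop always advances through the input, which is why panic mode cannot loop forever.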
3. Error productions:
If the common mistakes that programmers make are known in advance, the grammar can be augmented
with error productions that generate these erroneous constructs. When such a production is
used, an appropriate error message can be generated during parsing, and parsing can
continue. Example: writing 5x instead of 5*x.
4. Global correction:
In order to recover from erroneous input, the parser analyzes the whole program and tries to find the
closest error-free match for it. The closest match is the one that needs the fewest insertions,
deletions, and changes of tokens. This method is not practical due to its high time and space
complexity.
CONTEXT-FREE GRAMMARS:
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The
non-terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals
and non-terminals can be combined to form strings. Each production consists of a non-terminal
called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called
the right side of the production.
One of the non-terminals is designated as the start symbol (S), from which derivations begin.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the
start symbol) by the right side of a production for that non-terminal.
Example
We take the problem of the palindrome language, which cannot be described by means of a regular
expression. That is, L = { w | w = w^R } is not a regular language. But it can be described by means
of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
Derivation
A derivation is a sequence of production rule applications that produces the input string. During parsing,
we take two decisions for some sentential form of input: which non-terminal to replace, and which
production rule to replace it with.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential form.
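The difference between the two orders can be made concrete with a small sketch. It is only an illustration: the grammar E → E + E | E * E | id and the names GRAMMAR and derive are assumptions, and the caller supplies the production bodies to apply at each step.

```python
GRAMMAR = {"E": [["E", "+", "E"], ["E", "*", "E"], ["id"]]}


def derive(start, bodies, leftmost=True):
    """Apply the given production bodies in order and return every
    sentential form, replacing the leftmost (or rightmost) non-terminal."""
    form = [start]
    steps = [" ".join(form)]
    for body in bodies:
        # Positions of all non-terminals in the current sentential form.
        positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
        i = positions[0] if leftmost else positions[-1]
        if body not in GRAMMAR[form[i]]:
            raise ValueError(f"{form[i]} has no production with body {body}")
        form[i:i + 1] = body  # replace the chosen non-terminal by the body
        steps.append(" ".join(form))
    return steps


left = derive("E", [["E", "*", "E"], ["E", "+", "E"], ["id"], ["id"], ["id"]])
right = derive("E", [["E", "*", "E"], ["id"], ["E", "+", "E"], ["id"], ["id"]],
               leftmost=False)

assert left[2] == "E + E * E"   # left-sentential form
assert right[2] == "E * id"     # right-sentential form
assert left[-1] == right[-1] == "id + id * id"
```

Both derivations use the same productions and end in the same string. Only the order in which the non-terminals are replaced differs.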
Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from
the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this by
an example from the last topic.
Step 1:
E → E * E
Step 2:
E → E + E * E
Step 3:
E → id + E * E
Step 4:
E → id + id * E
Step 5:
E → id + id * id
In a parse tree:
All leaf nodes are terminals.
All interior nodes are non-terminals.
In-order traversal gives original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent nodes.
Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at
least one string.
Example
E→E+E
Example
Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
For example, (id + id) + id
Operations like Exponentiation are right associative, i.e., the order of evaluation in the same expression
will be:
id op (id op id)
For example, id ^ (id ^ id)
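Python's own operators follow the same conventions, so the grouping can be checked directly. In this small illustration the numbers are arbitrary, and Python writes exponentiation as ** rather than ^.

```python
# Subtraction is left associative: id op id op id groups as (id op id) op id.
left_grouped = 8 - 3 - 2
assert left_grouped == (8 - 3) - 2
assert left_grouped != 8 - (3 - 2)

# Exponentiation is right associative: id op id op id groups as id op (id op id).
right_grouped = 2 ** 3 ** 2
assert right_grouped == 2 ** (3 ** 2)
assert right_grouped != (2 ** 3) ** 2
```

Subtraction and exponentiation are used because the two groupings give different values, which makes the associativity visible.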
Precedence
If two different operators share a common operand, the precedence of operators decides which will take
the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and another
matched-stmt non-alternative-stmt
TOP-DOWN PARSING
A program that performs syntax analysis is called a parser. A syntax analyzer takes tokens as input
and outputs an error message if the program syntax is wrong. The parser uses symbol look-ahead and
an approach called top-down parsing without backtracking. Top-down parsers check whether a
string can be generated by a grammar by creating a parse tree starting from the initial symbol and
working down. Bottom-up parsers, however, check whether a string can be generated from a
grammar by creating a parse tree from the leaves and working up. Early parser generators such as
YACC create bottom-up parsers, whereas many Java parser generators such as JavaCC create
top-down parsers.
Example of top-down parser:
Consider the grammar
The above algorithm is non-deterministic. General recursive descent may require backtracking.
Back-tracking
Top-down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of a CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser will behave like this:
It starts with S from the production rules and matches its yield to the left-most letter of the input, i.e.
‘r’. The very first production of S (S → rXd) matches it. So the top-down parser advances to the next
input letter (i.e. ‘e’). The parser tries to expand the non-terminal ‘X’ and checks its first production from the left
(X → oa). It does not match the next input symbol. So the top-down parser backtracks to obtain the
next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
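The walk-through above can be reproduced with a small backtracking recognizer. This is a minimal sketch under the assumption that every grammar symbol is a single character, so a production body can be stored as a plain string. The names GRAMMAR, derive and accepts, and the trace messages, are hypothetical.

```python
GRAMMAR = {
    "S": ["rXd", "rZd"],
    "X": ["oa", "ea"],
    "Z": ["ai"],
}


def derive(symbols, text, pos, trace):
    """Try to match the symbol string `symbols` against text[pos:].
    Returns True only if all of the remaining input is consumed."""
    if not symbols:
        return pos == len(text)
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        # Non-terminal: try each production in order, backtracking on failure.
        for body in GRAMMAR[head]:
            trace.append(f"try {head} -> {body}")
            if derive(body + rest, text, pos, trace):
                return True
            trace.append(f"backtrack from {head} -> {body}")
        return False
    # Terminal: it must equal the next input letter.
    return (pos < len(text) and text[pos] == head
            and derive(rest, text, pos + 1, trace))


def accepts(text):
    trace = []
    return derive("S", text, 0, trace), trace


ok, trace = accepts("read")
assert ok
assert trace == ["try S -> rXd", "try X -> oa",
                 "backtrack from X -> oa", "try X -> ea"]
```

The recorded trace matches the description: X → oa is tried first and fails on ‘e’, then the parser backtracks and succeeds with X → ea.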
Example − Write down the algorithm using recursive procedures to implement the following
grammar.
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → ∗FT′ | ε
F → (E) | id
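One possible answer, written as a sketch rather than a reference solution, uses one procedure per non-terminal. It assumes that E′ also has the alternative E′ → ε (as T′ does), that the input is already split into tokens, and that $ marks the end of the input. The class name Parser, the method names and the helper accepts are hypothetical.

```python
class Parser:
    """Recursive-descent recognizer: one method per non-terminal."""

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # $ is the end marker
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, found {self.peek()}")
        self.pos += 1

    def E(self):            # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):      # E' -> + T E' | ε
        if self.peek() == "+":
            self.match("+")
            self.T()
            self.E_prime()
        # otherwise choose E' -> ε and consume nothing

    def T(self):            # T -> F T'
        self.F()
        self.T_prime()

    def T_prime(self):      # T' -> * F T' | ε
        if self.peek() == "*":
            self.match("*")
            self.F()
            self.T_prime()

    def F(self):            # F -> ( E ) | id
        if self.peek() == "(":
            self.match("(")
            self.E()
            self.match(")")
        else:
            self.match("id")


def accepts(tokens):
    parser = Parser(tokens)
    try:
        parser.E()
    except SyntaxError:
        return False
    return parser.peek() == "$"  # all input must be consumed


assert accepts(["id", "+", "id", "*", "id"])
assert not accepts(["id", "+"])
```

One token of look-ahead is enough to choose a production in every procedure, so this parser never backtracks.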
Example:
(1) A → Aα | β
(2) S → Aα | β
A → Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents a
string of grammar symbols (terminals and non-terminals).
(2) is an example of indirect left recursion.
A top-down parser will first try to parse A, which in turn yields a string beginning with A itself, so the
parser may go into a loop forever.
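Immediate left recursion of form (1) is usually removed with the standard rewrite A → βA′, A′ → αA′ | ε. The sketch below applies that rewrite to one non-terminal. It assumes that production bodies are lists of symbols and that no β alternative is ε. The function name and the convention of naming the new non-terminal A' are illustrative choices.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε."""
    # The α parts: bodies that start with the head itself, head removed.
    alphas = [body[1:] for body in bodies if body[0] == head]
    # The β parts: bodies that do not start with the head.
    betas = [body for body in bodies if body[0] != head]
    if not alphas:
        return {head: bodies}  # nothing to do
    new_head = head + "'"
    return {
        head: [beta + [new_head] for beta in betas],
        new_head: [alpha + [new_head] for alpha in alphas] + [["ε"]],
    }


# E -> E + T | T  becomes  E -> T E',  E' -> + T E' | ε
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
assert result == {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
}
```

The rewritten grammar generates the same strings, but the recursion is now on the right, so a top-down parser consumes a token before it recurses.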
Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.
START
Arrange non-terminals in some order like A1, A2, A3,…, An
Example
Example
Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make
one production for each common prefix, and the rest of each alternative is added by new productions.
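A single left-factoring pass can be sketched as follows. This is an illustration under stated assumptions: bodies are non-empty lists of symbols, alternatives are grouped by their first symbol, and new non-terminals are named by appending primes. The names common_prefix and left_factor_once are hypothetical, and the example grammar S → iEtS | iEtSeS | a is the usual if-then-else abstraction.

```python
from collections import defaultdict


def common_prefix(bodies):
    """Longest sequence of symbols shared by the start of all bodies."""
    prefix = []
    for column in zip(*bodies):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


def left_factor_once(head, bodies):
    """A -> α β1 | α β2 | γ  becomes  A -> α A' | γ,  A' -> β1 | β2."""
    groups = defaultdict(list)
    for body in bodies:
        groups[body[0]].append(body)  # group alternatives by first symbol
    result = {head: []}
    primes = 0
    for group in groups.values():
        if len(group) == 1:
            result[head].append(group[0])  # no common prefix to factor out
            continue
        prefix = common_prefix(group)
        primes += 1
        new_head = head + "'" * primes
        result[head].append(prefix + [new_head])
        # What remains of each alternative after the prefix (ε if nothing).
        result[new_head] = [body[len(prefix):] or ["ε"] for body in group]
    return result


factored = left_factor_once("S", [["i", "E", "t", "S"],
                                  ["i", "E", "t", "S", "e", "S"],
                                  ["a"]])
assert factored == {
    "S": [["i", "E", "t", "S", "S'"], ["a"]],
    "S'": [["ε"], ["e", "S"]],
}
```

The function performs one pass only. If the alternatives of a new non-terminal still share a prefix, the pass has to be repeated on them.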
Example
For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we
add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing
can be added to any FOLLOW set.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything
in FOLLOW(A) is in FOLLOW(B).
Consider the following example to understand the concept of First and Follow.
Find the first and follow of all nonterminals in the Grammar-
E -> TE'
E'-> +TE' | ε
T -> FT'
T'-> *FT' | ε
F -> (E) | id
          E         E'        T          T'         F
FIRST     {(,id}    {+,ε}     {(,id}     {*,ε}      {(,id}
FOLLOW    {),$}     {),$}     {+,),$}    {+,),$}    {+,*,),$}
For example, id and left parenthesis are added to FIRST(F) by rule 3 in the definition of FIRST with i=1
in each case, since FIRST(id) = {id} and FIRST('(') = {(} by rule 1. Then by rule 3 with i=1, the
production T -> FT' implies that id and left parenthesis belong to FIRST(T) also.
To compute FOLLOW, we put $ in FOLLOW(E) by rule 1 for FOLLOW. By rule 2 applied
to production F -> (E), right parenthesis is also in FOLLOW(E). By rule 3 applied to
production E -> TE', $ and right parenthesis are in FOLLOW(E').
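The rules can be run to a fixed point mechanically. The following is a minimal sketch for the expression grammar above, writing the empty string as ε. The dictionary layout and the function names first_of_string, compute_first and compute_follow are assumptions made for illustration.

```python
EPS = "ε"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST of a sequence Y1 Y2 ... Yk of grammar symbols."""
    result = set()
    for sym in symbols:
        # For a terminal a (and for ε itself), FIRST(a) = {a}.
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result       # this symbol cannot vanish, so stop here
    result.add(EPS)             # every symbol can derive ε
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:              # repeat until nothing can be added
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                new = first_of_string(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")      # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue
                    rest = first_of_string(body[i + 1:], first)
                    new = rest - {EPS}          # rule 2
                    if EPS in rest:
                        new |= follow[head]     # rule 3
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
assert FIRST["E"] == {"(", "id"} and FIRST["E'"] == {"+", EPS}
assert FOLLOW["F"] == {"+", "*", ")", "$"}
```

The computed sets agree with the FIRST and FOLLOW table given for this grammar.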
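Calculate the first and follow functions for the given grammar-
S → (L) | a
L → SL’
L’ → ,SL’ | ε
The first and follow functions are as follows (the comma terminal is written ','):
FIRST(S) = { ( , a }
FIRST(L) = { ( , a }
FIRST(L’) = { ',' , ε }
FOLLOW(S) = { $ , ',' , ) }
FOLLOW(L) = { ) }
FOLLOW(L’) = { ) }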
                                   INPUT SYMBOLS
Non-Terminal   id         +           *           (          )         $
E              E→TE’      -           -           E→TE’      -         -
E’             -          E’→+TE’     -           -          E’→ε      E’→ε
T              T→FT’      -           -           T→FT’      -         -
T’             -          T’→ε        T’→*FT’     -          T’→ε      T’→ε
F              F→id       -           -           F→(E)      -         -
The main difficulty in using predictive parsing is in writing a grammar for the source
language such that a predictive parser can be constructed from the grammar. Although left
We can resolve this ambiguity by choosing S' -> eS. This choice corresponds to associating an else with
the closest previous then.
Problems with top down parser
The various problems associated with top down parser are:
Ambiguity in the grammar, Left recursion, Non-left factored grammar, Backtracking
“if” from the lexical analyzer, we cannot tell whether to use the first
production or to use the second production to expand the non- terminal S.
prefix. That is, left
factoring is a must for parsing with a top-down parser.
A grammar in which no two productions for any non-terminal A have a common prefix of
symbols on the right-hand side of the A-productions is called a left-factored grammar.
Backtracking: backtracking is necessary for a top-down parser for the following reasons:
The table-driven parser in Fig. 4.19 has an input buffer, a stack containing a sequence of grammar
symbols, a parsing table constructed by Algorithm 4.31, and an output stream. The input buffer contains
the string to be parsed, followed by the endmarker $. We reuse the symbol $ to mark the bottom of the
stack, which initially contains the start symbol of the grammar on top of $.
The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the current
input symbol. If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a] of the
parsing table M. (Additional code could be executed here, for example, code to construct a node in a parse
tree.) Otherwise, it checks for a match between the terminal X and current input symbol a.
The behavior of the parser can be described in terms of its configurations, which give the stack contents
and the remaining input. The next algorithm describes how configurations are manipulated.
                                   INPUT SYMBOLS
Non-Terminal   id         +           *           (          )         $
E              E→TE’      -           -           E→TE’      -         -
E’             -          E’→+TE’     -           -          E’→ε      E’→ε
T              T→FT’      -           -           T→FT’      -         -
T’             -          T’→ε        T’→*FT’     -          T’→ε      T’→ε
F              F→id       -           -           F→(E)      -         -
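The stack-and-table mechanism described above can be sketched directly from this parsing table. The code is an illustration, not a transcription of the textbook algorithm: the dictionary encoding of M, the function name predictive_parse and the format of the output are assumptions. An empty list stands for an ε body.

```python
# Parsing table M[X, a]; a missing key is an error entry.
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def predictive_parse(tokens, start="E"):
    """Return the productions used, in order (a leftmost derivation)."""
    buffer = list(tokens) + ["$"]   # input followed by the endmarker
    stack = ["$", start]            # top of the stack is the end of the list
    pos = 0
    output = []
    while stack[-1] != "$":
        top, current = stack[-1], buffer[pos]
        if top in NONTERMINALS:
            body = TABLE.get((top, current))
            if body is None:
                raise SyntaxError(f"no entry M[{top}, {current}]")
            stack.pop()
            stack.extend(reversed(body))  # leftmost body symbol ends on top
            output.append(f"{top} -> {' '.join(body) or 'ε'}")
        elif top == current:
            stack.pop()                   # terminal matches the input symbol
            pos += 1
        else:
            raise SyntaxError(f"expected {top}, found {current}")
    if buffer[pos] != "$":
        raise SyntaxError(f"unexpected {buffer[pos]}")
    return output


steps = predictive_parse(["id", "+", "id", "*", "id"])
assert steps[0] == "E -> T E'"
assert len(steps) == 11
```

Pushing the body in reverse keeps its leftmost symbol on top of the stack, which is why the productions come out in leftmost-derivation order.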
Note that the sentential forms in this derivation correspond to the input that has already been matched (in
column MATCHED) followed by the stack contents. The matched input is shown only to highlight
the correspondence. For the same reason, the top of the stack is to the left; when we consider bottom-up
parsing, it will be more natural to show the top of the stack to the right. The input pointer points to the
leftmost symbol of the string in the INPUT column.
The stack of a nonrecursive predictive parser makes explicit the terminals and nonterminals that
the parser hopes to match with the remainder of the input. We shall therefore refer to symbols
on the parser stack in the following discussion. An error is detected during predictive
parsing when the terminal on top of the stack does not match the next input symbol or when
nonterminal A is on top of the stack, a is the next input symbol, and the parsing table entry
M[A,a] is empty.
Panic-mode error recovery is based on the idea of skipping symbols on the input until a token
in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of
synchronizing set. The sets should be chosen so that the parser recovers quickly from errors that
are likely to occur in practice. Some heuristics are as follows:
• As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing set for
nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is
likely that parsing can continue.
• It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if
semicolons terminate statements, as in C, then keywords that begin statements may not
appear in the FOLLOW set of the nonterminal generating expressions. A missing semicolon after an
assignment may therefore result in the keyword beginning the next statement being skipped.
Often, there is a hierarchical structure on constructs in a language; e.g., expressions
appear within statements, which appear within blocks, and so on. We can add to the synchronizing
set of a lower construct the symbols that begin higher constructs. For example, we might add
keywords that begin statements to the synchronizing sets for the non-terminals generating expressions.
• If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be
possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
• If a nonterminal can generate the empty string, then the production deriving ε can be
used as a default. Doing so may postpone some error detection, but cannot cause an error to
be missed. This approach reduces the number of nonterminals that have to be considered
during error recovery.
• If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue parsing. In
effect, this approach takes the synchronizing set of a token to consist of all other tokens.
                                   INPUT SYMBOLS
Non-Terminal   id         +           *           (          )         $
E              E→TE’      -           -           E→TE’      SYNCH     SYNCH
E’             -          E’→+TE’     -           -          E’→ε      E’→ε
T              T→FT’      SYNCH       -           T→FT’      SYNCH     SYNCH
T’             -          T’→ε        T’→*FT’     -          T’→ε      T’→ε
F              F→id       SYNCH       SYNCH       F→(E)      SYNCH     SYNCH
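A table with SYNCH entries can drive panic-mode recovery as in the sketch below. The recovery policy is one reasonable reading of the heuristics, labelled here as an assumption: an empty entry skips the input symbol, a SYNCH entry pops the nonterminal, and an unmatched terminal on top of the stack is popped. The names TABLE, SYNCH and parse_with_recovery, and the wording of the messages, are hypothetical.

```python
SYNCH = "synch"
TABLE = {
    "E":  {"id": ("T", "E'"), "(": ("T", "E'"), ")": SYNCH, "$": SYNCH},
    "E'": {"+": ("+", "T", "E'"), ")": (), "$": ()},
    "T":  {"id": ("F", "T'"), "(": ("F", "T'"),
           "+": SYNCH, ")": SYNCH, "$": SYNCH},
    "T'": {"+": (), "*": ("*", "F", "T'"), ")": (), "$": ()},
    "F":  {"id": ("id",), "(": ("(", "E", ")"),
           "+": SYNCH, "*": SYNCH, ")": SYNCH, "$": SYNCH},
}


def parse_with_recovery(tokens):
    """Parse with panic-mode recovery and return the list of error messages."""
    tokens = list(tokens) + ["$"]
    stack = ["$", "E"]
    pos = 0
    errors = []
    while stack[-1] != "$":
        top, tok = stack[-1], tokens[pos]
        if top not in TABLE:                 # terminal on top of the stack
            if top == tok:
                pos += 1
            else:
                # Pop the terminal as if it had been inserted into the input.
                errors.append(f"missing {top!r} before position {pos}")
            stack.pop()
            continue
        entry = TABLE[top].get(tok)
        if entry is None and tok != "$":
            errors.append(f"skipped {tok!r} at position {pos}")
            pos += 1                         # empty entry: skip the input symbol
        elif entry is None or entry == SYNCH:
            errors.append(f"popped {top} at position {pos}")
            stack.pop()                      # synch entry: give up on this nonterminal
        else:
            stack.pop()
            stack.extend(reversed(entry))    # normal expansion
    if tokens[pos] != "$":
        errors.append(f"unexpected input from position {pos}")
    return errors


assert parse_with_recovery(["id", "+", "id", "*", "id"]) == []
assert parse_with_recovery(["id", "*", "+", "id"]) == ["popped F at position 2"]
```

In the second call F is on top of the stack when + arrives. The entry M[F, +] is SYNCH, so F is popped and parsing resumes with T’ and the remaining input + id.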