Unit-2 F&CD
SYNTAX ANALYSIS
LEXICAL ANALYSIS: OVERVIEW OF LEXICAL ANALYSIS
To identify the tokens we need some method of describing the possible tokens that can appear in the input stream. For this purpose we introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
Secondly, having decided what the tokens are, we need some mechanism to recognize these in the input stream. This is done by the token recognizers, which are designed using transition diagrams and finite automata.
❖ ROLE AND RESPONSIBILITY OF LEXICAL ANALYZER
LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
Figure 1.12: Interactions between the lexical analyzer and the parser
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA returns to the parser a representation for the token it has found. The representation will be an integer code if the token is a simple construct such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.
❖ LEXICAL ANALYSIS VS PARSING:
Lexical analysis:
A scanner simply turns an input string (say a file) into a list of tokens. These tokens represent things like identifiers, parentheses, operators etc. The lexical analyzer (the "lexer") parses individual symbols from the source code file into tokens; from there, the "parser" proper turns those whole tokens into sentences of the grammar.

Parsing:
A parser converts this list of tokens into a tree-like object that represents how the tokens fit together to form a cohesive whole (sometimes referred to as a sentence). A parser does not give the nodes any meaning beyond structural cohesion; the next thing to do is to extract meaning from this structure (sometimes called contextual analysis).
Example: [table of sample tokens and their descriptions omitted]
Every regular grammar is also context-free, but there exist constructs, such as nested matching constructs, which are beyond the scope of regular grammars. CFG is a helpful tool in describing the syntax of programming languages.
Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies
used in parsing technology.
A context-free grammar has four components:
A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The non-
terminals define sets of strings that help define the language generated by the grammar.
A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which strings
are formed.
A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
One of the non-terminals is designated as the start symbol (S), from where the production begins.
The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the start symbol) by the right side of a production for that non-terminal.
Example
We take the problem of the palindrome language, which cannot be described by means of a regular expression. That is, L = { w | w = w^R } is not a regular language. But it can be described by means of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }
P = { Q → Z | Q → N | Q → 0 | Q → 1 | Q → ε | Z → 0Q0 | N → 1Q1 }
S = Q
This grammar describes the palindrome language over {0, 1}, generating strings such as 1001, 11100111, 00100, 1010101, 11111, etc. (The productions Q → 0 and Q → 1 are needed for the odd-length palindromes.)
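As a quick check, the language can be mirrored by a small recursive recognizer (a Python sketch, not part of the original notes; it accepts exactly the strings Q can derive):

# Recursive recognizer for the palindrome grammar above:
# Q -> Z | N | 0 | 1 | epsilon,  Z -> 0Q0,  N -> 1Q1
def derives_Q(w: str) -> bool:
    if w in ("", "0", "1"):                      # Q -> epsilon | 0 | 1
        return True
    if len(w) >= 2 and w[0] == w[-1] and w[0] in "01":
        return derives_Q(w[1:-1])                # Q -> 0Q0 or 1Q1
    return False

for s in ["1001", "11100111", "00100", "1010101", "11111", "10"]:
    print(s, derives_Q(s))                       # "10" is rejected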
Syntax Analyzers
A syntax analyzer or parser takes the input from a lexical analyzer in the form of a token stream. The parser analyzes the source code (token stream) against the production rules to detect any errors in the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a parse tree as the output of the phase.
Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use error-recovery strategies, which we will learn later in this chapter.
❖ Derivation
A derivation is basically a sequence of production rule applications used to obtain the input string. During parsing, we take two decisions for some sentential form of the input:
• Deciding which non-terminal is to be replaced.
• Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace, and with which production, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
The left-most derivation of the input string is:
Step 1:
E ⇒ E * E
Step 2:
⇒ E + E * E
Step 3:
⇒ id + E * E
Step 4:
⇒ id + id * E
Step 5:
⇒ id + id * id
Notice that at every step the left-most non-terminal is rewritten, so this is a left-most derivation of id + id * id.
❖ Ambiguity
A grammar is ambiguous if some string has more than one parse tree (equivalently, more than one left-most or right-most derivation). Consider the productions:
E → E + E
E → E – E
E → id
For the string id + id – id, the above grammar generates two parse trees: one grouping the operands as (id + id) – id and the other as id + (id – id).
❖ Left Recursion
A grammar becomes left-recursive if it has any non-terminal A whose derivation contains A itself as the left-most symbol. For example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is a non-terminal symbol and α represents a string of terminals and non-terminals.
(2) is an example of indirect left recursion: S derives a string beginning with A, which in turn derives a string beginning with S.
A top-down parser will first try to expand A, which in turn can yield a string beginning with A itself, and the parser may go into a loop forever.
Eliminating Left Recursion
Immediate left recursion, A => Aα | β, is removed by rewriting the productions as:
A => βA'
A' => αA' | ε
For indirect left recursion, the offending non-terminal's productions are first substituted in (turning the indirect recursion into immediate left recursion), and then the rewriting above is applied.
Example
The production set
S => Aα | β
A => Sd
after applying the above algorithm, should become
S => Aα | β
A => Aαd | βd
and then, remove immediate left recursion using the first technique.
A => βdA'
A' => αdA' | ε
Now none of the productions has either direct or indirect left recursion.
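The rewriting for immediate left recursion is mechanical enough to code directly (a Python sketch under my own encoding: a grammar maps each non-terminal to a list of alternatives, each alternative a tuple of symbols; the function name is illustrative):

def remove_immediate_left_recursion(nt, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn  as
       A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    others = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}            # no immediate left recursion
    new_nt = nt + "'"
    return {
        nt: [beta + (new_nt,) for beta in others],
        new_nt: [alpha + (new_nt,) for alpha in recursive] + [()],  # () is epsilon
    }

# A -> A a | b   becomes   A -> b A'  and  A' -> a A' | epsilon
print(remove_immediate_left_recursion("A", [("A", "a"), ("b",)]))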
❖ Left Factoring
If more than one production rule of a grammar has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.
If a top-down parser encounters productions like
A ⟹ αβ | αγ | …
then it cannot determine which production to follow to parse the string, as both alternatives begin with the same terminal (or non-terminal) string α. To remove this confusion, we use a technique called left factoring.
Left factoring transforms the grammar to make it suitable for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions. The above productions can be rewritten as
A => αA'
A' => β | γ | …
Now the parser has only one production per prefix, which makes it easier to take decisions.
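A minimal sketch of one left-factoring step (Python, illustrative; it factors out the longest prefix shared by all alternatives, while a full algorithm would also group alternatives that share prefixes pairwise):

def common_prefix(alternatives):
    """Longest leading sequence of symbols shared by every alternative."""
    prefix = []
    for symbols in zip(*alternatives):
        if all(s == symbols[0] for s in symbols):
            prefix.append(symbols[0])
        else:
            break
    return tuple(prefix)

def left_factor(nt, alternatives):
    alpha = common_prefix(alternatives)
    if not alpha:
        return {nt: alternatives}             # nothing shared: leave unchanged
    new_nt = nt + "'"
    return {
        nt: [alpha + (new_nt,)],
        new_nt: [alt[len(alpha):] for alt in alternatives],  # () stands for epsilon
    }

# A -> a b | a c   becomes   A -> a A'  and  A' -> b | c
print(left_factor("A", [("a", "b"), ("a", "c")]))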
Example:
Consider the grammar and eliminate the left recursion:
E ⟶ E + T / T ——- (1)
T ⟶ T * F / F ——- (2)
First, let's take production (1) and rewrite it using the rule A ⟶ βA', A' ⟶ αA' / Є:
E ⟶ E + T / T
becomes
E ⟶ TE' —— (3)
E' ⟶ +TE' / Є —— (4)
Productions 3 and 4 are the left-recursion-free equivalents of production (1). Similarly,
T ⟶ T * F / F
becomes
T ⟶ FT' —– (5)
T' ⟶ *FT' / Є —— (6)
Productions 5 and 6 are the left-recursion-free equivalents of production (2).
The final productions after eliminating left recursion are:
E ⟶ TE'
E' ⟶ +TE' / Є
T ⟶ FT'
T' ⟶ *FT' / Є
❖ Follow Function
Follow(X) is the set of terminal symbols that can appear immediately to the right of the non-terminal X in some sentential form. It is found by inspecting what appears after X on the right-hand sides of the productions.
For example, if the productions are
E -> TE', F -> (E)/id
the non-terminal E occurs on a right-hand side only in F -> (E)/id, where it is followed by ')'. Therefore Follow(E) = { ) }, as ')' is the terminal that appears immediately after E on the right-hand side of a production. (If E were the start symbol, the end-marker $ would also be added.)
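In general, Follow is computed together with First as a fixpoint over all productions (a Python sketch under my own grammar encoding; "" stands for ε):

def first_sets(grammar, terminals):
    """FIRST for every symbol; grammar maps non-terminals to lists of
       alternatives, each alternative a tuple of symbols."""
    first = {t: {t} for t in terminals}
    first.update({A: set() for A in grammar})
    changed = True
    while changed:
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                before = len(first[A])
                nullable = True
                for X in alt:
                    first[A] |= first[X] - {""}
                    if "" not in first[X]:
                        nullable = False
                        break
                if nullable:
                    first[A].add("")          # every symbol (or none) was nullable
                changed |= len(first[A]) != before
    return first

def first_of_seq(seq, first):
    """FIRST of a symbol sequence, plus whether the whole sequence is nullable."""
    out = set()
    for X in seq:
        out |= first[X] - {""}
        if "" not in first[X]:
            return out, False
    return out, True

def follow_sets(grammar, terminals, start):
    first = first_sets(grammar, terminals)
    follow = {A: set() for A in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, alts in grammar.items():
            for alt in alts:
                for i, X in enumerate(alt):
                    if X not in grammar:
                        continue              # FOLLOW is defined for non-terminals
                    f, nullable = first_of_seq(alt[i + 1:], first)
                    add = f | (follow[A] if nullable else set())
                    if not add <= follow[X]:
                        follow[X] |= add
                        changed = True
    return follow

g = {"E": [("T", "E'")], "E'": [("+", "T", "E'"), ()],
     "T": [("F", "T'")], "T'": [("*", "F", "T'"), ()],
     "F": [("(", "E", ")"), ("id",)]}
print(follow_sets(g, {"+", "*", "(", ")", "id"}, "E"))  # FOLLOW(E) = {')', '$'}, etc.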
Top-Down Parsing
The top-down parsing technique parses the input and starts constructing a parse tree from the root node, gradually moving down to the leaf nodes. Top-down parsers are broadly classified into recursive-descent parsers (which may use backtracking) and predictive parsers.
Recursive descent is a top-down parsing technique that constructs the parse tree from the top while the input is read from left to right. It uses procedures for every terminal and non-terminal entity. This technique recursively parses the input to make a parse tree, which may or may not require backtracking; the grammar associated with it (if not left-factored) cannot avoid backtracking. A form of recursive-descent parsing that does not require any backtracking is known as predictive parsing. This parsing technique is regarded as recursive because it uses a context-free grammar, which is recursive in nature.
• BACK-TRACKING
Top-down parsers start from the root node (start symbol) and match the input string against the production rules to replace them (if matched). To understand this, take the following example CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e. 'r'. The very first production of S (S → rXd) matches it, so the top-down parser advances to the next input letter (i.e. 'e'). The parser tries to expand the non-terminal 'X' and checks its first production from the left (X → oa). It does not match the next input symbol, so the top-down parser backtracks to obtain the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
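This behaviour maps directly onto a backtracking recursive-descent parser (a Python sketch with the example grammar hard-coded; function and variable names are my own):

# S -> rXd | rZd ,  X -> oa | ea ,  Z -> ai
GRAMMAR = {
    "S": [["r", "X", "d"], ["r", "Z", "d"]],
    "X": [["o", "a"], ["e", "a"]],
    "Z": [["a", "i"]],
}

def parse(symbol, inp, pos):
    """Try to derive inp starting at pos from symbol; return the new
       position on success, or None on failure."""
    if symbol not in GRAMMAR:                    # terminal: must match literally
        return pos + 1 if pos < len(inp) and inp[pos] == symbol else None
    for production in GRAMMAR[symbol]:           # try alternatives in order,
        p = pos                                  # backtracking on failure
        for sym in production:
            p = parse(sym, inp, p)
            if p is None:
                break
        else:
            return p
    return None

print(parse("S", "read", 0) == len("read"))      # True: "read" is accepted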
• PREDICTIVE PARSER
A predictive parser is a recursive-descent parser which has the capability to predict which production is to be used to replace the input string. The predictive parser does not suffer from backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input symbols. To make the parser backtracking-free, the predictive parser puts some constraints on the grammar and accepts only a class of grammars known as LL(k) grammars.
Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree. Both the stack and the input contain an end symbol $ to denote that the stack is empty and the input is consumed. The parser refers to the parsing table to take any decision on the input and stack element combination.
In recursive-descent parsing, the parser may have more than one production to choose from for a single instance of input, whereas in a predictive parser each step has at most one production to choose. There might be instances where no production matches the input string, causing the parsing procedure to fail.
• LL PARSER
An LL parser accepts LL grammars. LL grammars are a subset of the context-free grammars, with some restrictions applied to obtain a simplified version that permits an easy implementation. An LL grammar can be implemented by means of both algorithms, namely recursive-descent or table-driven.
An LL parser is denoted LL(k):
The first L in LL(k) stands for scanning the input from left to right.
The second L in LL(k) stands for left-most derivation.
k itself represents the number of look-ahead symbols. Generally k = 1, so LL(k) may also be written as LL(1).
LL PARSING ALGORITHM
Given below is an algorithm for LL(1) Parsing:
Input:
string ω
parsing table M for grammar G
Output:
If ω is in L(G) then left-most derivation of ω,
error otherwise.
repeat
    let X be the top stack symbol and a the symbol pointed to by ip.
    if X ∈ Vt or X = $
        if X = a
            POP X and advance ip.
        else
            error()
        endif
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 ... Yk
            POP X
            PUSH Yk, Yk-1, ... Y1 /* Y1 on top */
            Output the production X → Y1 Y2 ... Yk
        else
            error()
        endif
    endif
until X = $ /* empty stack */
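The same driver can be written out concretely (a Python sketch; the table below is the usual LL(1) table for the left-factored expression grammar derived earlier, transcribed by hand):

# M[(non-terminal, lookahead)] = right-hand side to push ([] is ε)
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "*"): ["*", "F", "T'"],
    ("T'", "+"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    tokens = tokens + ["$"]
    stack, ip = ["$", "E"], 0
    while stack[-1] != "$":
        X, a = stack[-1], tokens[ip]
        if X not in NONTERMINALS:                # terminal on top: must match input
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            stack.pop()
            ip += 1
        elif (X, a) in TABLE:                    # expand X using the table entry
            stack.pop()
            stack.extend(reversed(TABLE[X, a]))  # push so that Y1 ends up on top
            print(X, "->", " ".join(TABLE[X, a]) or "ε")
        else:
            raise SyntaxError(f"no table entry for ({X}, {a})")
    if tokens[ip] != "$":
        raise SyntaxError("input not fully consumed")

ll1_parse(["id", "+", "id", "*", "id"])          # prints the left-most derivation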
LL(1) GRAMMAR
The above algorithm can be applied to any grammar G to produce a parsing table M. For some grammars, for example if G is left-recursive or ambiguous, M will have at least one multiply-defined entry. A grammar whose parsing table has no multiply-defined entries is said to be LL(1). It can be shown that the above algorithm can be used to produce for every LL(1) grammar G a parsing table M that parses all and only the sentences of G. LL(1) grammars have several distinctive properties. No ambiguous or left-recursive grammar can be LL(1). There remains a question of what should be done in the case of multiply-defined entries.
One easy solution is to eliminate all left recursion and apply left factoring, hoping to produce a grammar whose parsing table has no multiply-defined entries. Unfortunately, there are some grammars for which no amount of alteration will yield an LL(1) grammar. In general, there are no universal rules by which multiply-defined entries can be made single-valued without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for the source language such that a predictive parser can be constructed from the grammar. Although left recursion elimination and left factoring are easy to do, they make the resulting grammar hard to read and difficult to use for translation purposes. To alleviate some of this difficulty, a common organization for a parser in a compiler is to use a predictive parser for control constructs and operator precedence for expressions. However, if an LR parser generator is available, one can get all the benefits of predictive parsing and operator precedence automatically.
BOTTOM-UP PARSER
Bottom-up parsing starts from the leaf nodes of a tree and works in an upward direction till it reaches the root node. Here, we start from a sentence and then apply production rules in reverse in order to reach the start symbol. The main bottom-up techniques discussed below are shift-reduce parsing and LR parsing (SLR, CLR and LALR).
❖ SHIFT-REDUCE PARSING
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known
as shift-step and reduce-step.
• Shift step: The shift step refers to the advancement of the input pointer to the next input symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is treated as a single node of the parse tree.
• Reduce step: When the parser finds a complete grammar rule (RHS) on the stack and replaces it with its (LHS), it is known as a reduce step. This occurs when the top of the stack contains a handle. To reduce, a POP function is performed on the stack, which pops off the handle and replaces it with the LHS non-terminal symbol.
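The two moves can be watched in action with a deliberately naive driver (a Python sketch; it reduces greedily whenever the top of the stack matches some right-hand side, which happens to find the handles for this particular grammar but is not a general strategy — real shift-reduce parsers consult a parsing table):

# Grammar: S -> A A ,  A -> a A | b   (also used in the LR examples below)
PRODUCTIONS = [("S", ("A", "A")), ("A", ("a", "A")), ("A", ("b",))]

def shift_reduce(tokens):
    stack, i = [], 0
    while True:
        for lhs, rhs in PRODUCTIONS:             # reduce if a handle is on top
            n = len(rhs)
            if len(stack) >= n and tuple(stack[-n:]) == rhs:
                del stack[-n:]
                stack.append(lhs)
                print("reduce", lhs, "->", " ".join(rhs), "   stack:", stack)
                break
        else:
            if i < len(tokens):                  # otherwise shift the next symbol
                stack.append(tokens[i])
                i += 1
                print("shift", tokens[i - 1], "   stack:", stack)
            else:
                break                            # no move left
    return stack == ["S"]

print(shift_reduce(list("aabb")))                # True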
❖ INTRODUCTION TO LR PARSER
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free grammars, which makes it the most general non-backtracking shift-reduce technique in practical use. LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the construction of a right-most derivation in reverse, and k denotes the number of lookahead symbols used to make decisions.
There are three widely used algorithms available for constructing an LR parser:
1. SLR(1) – Simple LR
2. LALR(1) – Look-ahead LR
3. CLR(1) – Canonical LR
All three use the same driver, with a stack, input, output and parsing table; in every type of LR parsing the input, output and stack are the same, but the parsing table is different.
LR (1) Parsing
Various steps involved in LR (1) parsing:
• For the given input string, write a context-free grammar.
• Check the grammar for ambiguity.
• Add an augment production to the given grammar.
• Create the canonical collection of LR items.
• Draw the DFA (deterministic finite automaton) over the item sets.
• Construct the LR parsing table.
Augment Grammar
The augmented grammar G` is generated by adding one more production to the given grammar G. It helps the parser to identify when to stop parsing and announce the acceptance of the input.
Example
Given grammar
S → AA
A → aA | b
The augmented grammar G` is represented by
S`→ S
S → AA
A → aA | b
Canonical Collection of LR(0) items
An LR(0) item is a production of G with a dot (•) at some position on the right-hand side of the production. LR(0) items are useful for indicating how much of the input has been scanned up to a given point in the process of parsing. In an LR(0) table, the reduce entry is placed across the entire row (the reduction is made regardless of the next input symbol).
Example
Given grammar:
S → AA
A → aA | b
➢ Add Augment Production and insert '•' symbol at the first position for every production in G
S` → •S
S → •AA
A → •aA
A → •b
I0 State:
Add the augment production to the I0 state and compute the closure.
I0 = Closure (S` → •S)
Add all productions starting with S to the I0 state, because the "•" is followed by the non-terminal S. So the I0 state becomes
I0 = S` → •S
S → •AA
Add all productions starting with A to the modified I0 state, because the "•" is followed by the non-terminal A. So the I0 state becomes
I0 = S` → •S
S → •AA
A → •aA
A → •b
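The closure step is mechanical and easy to code (a Python sketch; an LR(0) item is encoded as a (lhs, rhs, dot-position) triple, an encoding of my own):

GRAMMAR = {
    "S'": [("S",)],                   # augment production S' -> S
    "S": [("A", "A")],
    "A": [("a", "A"), ("b",)],
}

def closure(items):
    """Close a set of LR(0) items: whenever the dot precedes a
       non-terminal B, add B -> •γ for every production of B."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return items

I0 = closure({("S'", ("S",), 0)})
for lhs, rhs, dot in sorted(I0):
    print(lhs, "->", " ".join(rhs[:dot]) + "•" + " ".join(rhs[dot:]))
# prints S' -> •S, S -> •AA, A -> •aA, A -> •b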
Drawing DFA:
The DFA contains 7 states, I0 to I6. [Figure: DFA of the LR(0) item sets omitted]
LR(0) Table
➢ If a state goes to some other state on a terminal, it corresponds to a shift move.
➢ If a state goes to some other state on a variable (non-terminal), it corresponds to a goto move.
➢ If a state contains a final item, write the reduce entry across the entire row.
Explanation:
I0 on S goes to I1, so write it as 1.
I0 on A goes to I2, so write it as 2.
I2 on A goes to I5, so write it as 5.
I3 on A goes to I6, so write it as 6.
I0, I2 and I3 on a go to I3, so write it as S3, which means shift 3.
I0, I2 and I3 on b go to I4, so write it as S4, which means shift 4.
I4, I5 and I6 all contain final items, because the • is at the right-most end. So mark each of them with its production number.
I4 contains the final item A → b•, which corresponds to production number 3, so write r3 in the entire row.
I5 contains the final item S → AA•, which corresponds to production number 1, so write r1 in the entire row.
I6 contains the final item A → aA•, which corresponds to production number 2, so write r2 in the entire row.
LR PARSING ALGORITHM
Here we describe a skeleton algorithm of an LR parser:

token = next_token()
repeat forever
    s = top of stack
    if action[s, token] = "shift s'" then
        PUSH token
        PUSH s'
        token = next_token()
    else if action[s, token] = "reduce A ::= β" then
        POP 2 * |β| symbols
        s' = top of stack
        PUSH A
        PUSH goto[s', A]
    else if action[s, token] = "accept" then
        return
    else
        error()
SIMPLE LR (SLR)
The parsing table has two fields associated with each state in the DFA, known as action and goto. These are computed using the following algorithm:
1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
- If [A → α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here, a is required to be a terminal.
- If [A → α.] is in Ii, then set action[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.
- If [S' → S.] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by the above rules, we say the grammar is not SLR(1). The algorithm fails to produce a parser in this case.
3. The goto transitions for state i are constructed for all non-terminals A using the rule: if goto(Ii, A) = Ij, then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error".
5. The initial state of the parser is the one constructed from the set containing the item [S' → .S].
6. Number the productions in the grammar from 1 onwards and use the production number when making a reduce entry.
Example: Consider the expression grammar, with the productions numbered for the reduce entries:
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → ( E )
(6) F → id
augmented with E' → E. The canonical collection of LR(0) items is:
I0:
E' → .E
E → .E + T
E → .T
T → .T * F
T → .F
F → .( E )
F → .id
I1:
E' → E.
E → E. + T
I2:
E → T.
T → T. * F
I3:
T → F.
I4:
F → (. E )
E → .E + T
E → .T
T → .T * F
T → .F
F → .( E )
F → .id
I5:
F → id.
I6:
E → E + .T
T → .T * F
T → .F
F → .( E )
F → .id
I7:
T → T * .F
F → .( E )
F → .id
I8:
F → ( E. )
E → E. + T
I9:
E → E + T.
T → T. * F
I10:
T → T * F.
I11:
F → ( E ).
ACTION and GOTO table:

STATE | id |  +  |  *  |  (  |  )  |  $  ||  E |  T |  F
  0   | s5 |     |     | s4  |     |     ||  1 |  2 |  3
  1   |    | s6  |     |     |     | acc ||    |    |
  2   |    | r2  | s7  |     | r2  | r2  ||    |    |
  3   |    | r4  | r4  |     | r4  | r4  ||    |    |
  4   | s5 |     |     | s4  |     |     ||  8 |  2 |  3
  5   |    | r6  | r6  |     | r6  | r6  ||    |    |
  6   | s5 |     |     | s4  |     |     ||    |  9 |  3
  7   | s5 |     |     | s4  |     |     ||    |    | 10
  8   |    | s6  |     |     | s11 |     ||    |    |
  9   |    | r1  | s7  |     | r1  | r1  ||    |    |
 10   |    | r3  | r3  |     | r3  | r3  ||    |    |
 11   |    | r5  | r5  |     | r5  | r5  ||    |    |

Here, si means shift to state i, ri means reduce by production number i, acc means accept, and a blank entry means error.
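The table can be exercised with a small LR driver (a Python sketch; the dictionaries below transcribe the table above, and the stack holds states only):

ACTION = {
    (0, "id"): ("s", 5), (0, "("): ("s", 4),
    (1, "+"): ("s", 6), (1, "$"): ("acc",),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "id"): ("s", 5), (4, "("): ("s", 4),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "id"): ("s", 5), (6, "("): ("s", 4),
    (7, "id"): ("s", 5), (7, "("): ("s", 4),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}
# production number -> (left-hand side, length of right-hand side)
PROD = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

def lr_parse(tokens):
    tokens = tokens + ["$"]
    stack, ip = [0], 0                       # stack of states only
    while True:
        act = ACTION.get((stack[-1], tokens[ip]))
        if act is None:
            raise SyntaxError(f"error at token {tokens[ip]!r}")
        if act[0] == "acc":
            return True
        if act[0] == "s":                    # shift: push the target state
            stack.append(act[1])
            ip += 1
        else:                                # reduce by production act[1]
            lhs, rhs_len = PROD[act[1]]
            del stack[-rhs_len:]
            stack.append(GOTO[stack[-1], lhs])
            print("reduce by production", act[1])

print(lr_parse(["id", "+", "id", "*", "id"]))   # True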
❖ CLR (1) PARSING
CLR refers to canonical lookahead. CLR(1) parsing uses the canonical collection of LR(1) items to build the parsing table; an LR(1) item is an LR(0) item together with a lookahead terminal. Consider again the grammar S → AA, A → aA | b.
Add the augment production, insert the '•' symbol at the first position of every production in G, and also add the lookahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add the augment production to the I0 state and compute the closure.
I0 = Closure (S` → •S)
Add all productions starting with S to the I0 state, because the "•" is followed by the non-terminal S. So the I0 state becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A to the modified I0 state, because the "•" is followed by the non-terminal A. So the I0 state becomes
I0 = S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I4 contains the final item (A → b•, a/b), so action{I4, a} = r3 and action{I4, b} = r3.
I5 contains the final item (S → AA•, $), so action{I5, $} = r1.
I7 contains the final item (A → b•, $), so action{I7, $} = r3.
I8 contains the final item (A → aA•, a/b), so action{I8, a} = r2 and action{I8, b} = r2.
I9 contains the final item (A → aA•, $), so action{I9, $} = r2.
Note that, unlike in the LR(0)/SLR table, each reduce entry is placed only in the columns of the lookahead symbols, not across the entire row.
❖ LALR PARSER
➢ LALR refers to look-ahead LR. To construct the LALR(1) parsing table, we use the canonical collection of LR(1) items.
➢ In LALR(1) parsing, the LR(1) items which have the same productions but different lookaheads are combined to form a single set of items.
➢ LALR(1) parsing is the same as CLR(1) parsing; the only difference is in the parsing table.
Example
LALR ( 1 ) Grammar
S → AA
A → aA
A→b
Add the augment production, insert the '•' symbol at the first position of every production in G, and also add the lookahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add the augment production to the I0 state and compute the closure.
I0 = Closure (S` → •S)
Add all productions starting with S to the I0 state, because the "•" is followed by the non-terminal S. So the I0 state becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A to the modified I0 state, because the "•" is followed by the non-terminal A. So the I0 state becomes
I0 = S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
If we analyze the LR(0) items of I3 and I6, they are the same and differ only in their lookaheads:
I3 = { A → a•A, a/b ; A → •aA, a/b ; A → •b, a/b }
I6 = { A → a•A, $ ; A → •aA, $ ; A → •b, $ }
Clearly I3 and I6 are the same in their LR(0) items but differ in their lookaheads, so we can combine them into a single state called I36.
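Merging states with a common LR(0) core can be sketched as follows (Python, illustrative; here an LR(1) item is a (lhs, rhs, dot, lookahead) tuple, an encoding of my own):

from collections import defaultdict

def merge_by_core(states):
    """states: dict mapping a state name to its set of LR(1) items.
       States whose LR(0) cores coincide are merged, unioning lookaheads."""
    groups = defaultdict(list)
    for name, items in states.items():
        core = frozenset((l, r, d) for (l, r, d, la) in items)  # drop lookaheads
        groups[core].append(name)
    merged = {}
    for names in groups.values():
        union = set().union(*(states[n] for n in names))
        merged["/".join(sorted(names))] = union
    return merged

I3 = {("A", ("a", "A"), 1, "a"), ("A", ("a", "A"), 1, "b"),
      ("A", ("a", "A"), 0, "a"), ("A", ("a", "A"), 0, "b"),
      ("A", ("b",), 0, "a"), ("A", ("b",), 0, "b")}
I6 = {("A", ("a", "A"), 1, "$"), ("A", ("a", "A"), 0, "$"),
      ("A", ("b",), 0, "$")}
print(list(merge_by_core({"I3": I3, "I6": I6})))   # ['I3/I6']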
DRAWING DFA:
[Figure: DFA with the merged state I36 omitted]
❖ THE DANGLING ELSE AND CONFLICT RESOLUTION
The dangling else is a problem in computer programming in which an optional else clause in an if-then(-else) statement makes nested conditionals ambiguous. Formally, the context-free grammar of the language is ambiguous, meaning there is more than one correct parse tree.
In many programming languages one may write conditionally executed code in two forms: the if-then form and the if-then-else form (the else clause is optional).
Here we have a shift-reduce conflict. Consider an ambiguous expression grammar with productions such as E ::= E + E | E * E | ( E ) | id, and a state whose first two items allow either reducing by E ::= E * E or shifting. If we have a*b+c and we have parsed a*b, do we reduce using E ::= E * E or do we shift more symbols? In the former case we get the parse tree (a*b)+c; in the latter case we get a*(b+c). To resolve this conflict, we can specify that * has higher precedence than +. The precedence of a grammar production is equal to the precedence of the right-most token on the rhs of the production. For example, the precedence of the production E ::= E * E is equal to the precedence of the operator *, the precedence of the production E ::= ( E ) is equal to the precedence of the token ), and the precedence of the production E ::= if E then E else E is equal to the precedence of the token else. The idea is that if the lookahead has higher precedence than the production currently used, we shift. For example, if we are parsing E + E using the production rule E ::= E + E and the lookahead is *, we shift *. If the lookahead has the same precedence as that of the current production and is left associative, we reduce, otherwise we shift. The above grammar is valid if we define the precedence and associativity of all the operators. Thus, it is very important when you write a parser using CUP or any other LALR(1) parser generator to specify associativities and precedences for most tokens (especially for those used as operators). Note: you can explicitly define the precedence of a rule in CUP using the %prec directive:
E ::= MINUS E %prec UMINUS
where UMINUS is a pseudo-token that has higher precedence than TIMES, MINUS etc., so that -1*2 is equal to (-1)*2, not to -(1*2).
Another thing we can do when specifying an LALR(1) grammar for a parser generator is error recovery. All the entries in the ACTION and GOTO tables that have no content correspond to syntax errors. The simplest thing to do in case of error is to report it and stop the parsing, but we would like to continue parsing to find more errors. This is called error recovery. Consider the grammar:
S ::= L = E ;
| { SL } ;
| error ;
SL ::= S ;
| SL S ;
The special token error indicates to the parser what to do in case of invalid syntax for S (an invalid statement). In this case, it reads all the tokens from the input stream until it finds the first semicolon. The way the parser handles this is to first push an error state onto the stack. In case of an error, the parser pops elements from the stack until it finds an error state where it can proceed. Then it discards tokens from the input until a restart is possible. Inserting error-handling productions in the proper places in a grammar to do good error recovery is considered very hard.
❖ YACC
By compiling y.tab.c along with the ly library that contains the LR parsing program, using the command
cc y.tab.c -ly
we obtain the desired object program a.out that performs the translation specified by the original Yacc program. If other procedures are needed, they can be compiled or loaded with y.tab.c, just as with any C program. A Yacc source program has three parts:
declarations
%%
translation rules
%%
supporting C routines