
Chapter - 3

Syntax Analysis
The Role of the Syntax Analyzer or Parser:
Parsing:
Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the
basic concepts used in the construction of a parser.
We have seen that a lexical analyzer can identify tokens with the help of regular
expressions and pattern rules. But a lexical analyzer cannot check the syntax of a given sentence
due to the limitations of regular expressions. Regular expressions cannot check for balanced
tokens, such as parentheses. Therefore, this phase uses a context-free grammar (CFG), which is
recognized by push-down automata.
In the compiler model, the parser obtains a string of tokens from the lexical analyzer and verifies
that the string can be generated by the grammar for the source language. We expect the parser to
report any syntax errors in an intelligible fashion. It should also recover from commonly
occurring errors so that it can continue processing the remainder of its input. The syntax of a
programming language is described by a context-free grammar (CFG).

The output of the parser is some representation of the parse tree for the stream of tokens
produced by the lexical analyzer. There are a number of tasks that might be conducted during
parsing, such as collecting information about various tokens into the symbol table,
performing type checking and other kinds of semantic analysis, and generating
intermediate code.
A parser is the compiler component that structures the stream of tokens coming from the
lexical analysis phase. A parser takes input in the form of a sequence of tokens and produces
output in the form of a parse tree.
Types of Parsing:
The methods commonly used in compilers are classified as being either top down parsing or
bottom-up parsing. As indicated by their names,
I. Top-down parsers build parse trees from the top (root) to the bottom (leaves);
II. Bottom-up parsers start from the leaves and work up to the root.
In both cases, the input to the parser is scanned from left to right, one symbol at a
time.

Dr Krishna Page 1
The most efficient top-down and bottom-up methods work only on subclasses of grammars, but
several of these subclasses, such as the LL and LR grammars, are expressive enough to
describe most syntactic constructs in programming languages. Parsers implemented by hand
often work with LL grammars.
Context Free Grammar:
A context-free grammar is a 4-tuple G= (N, T, P, S) where
a) N is a finite set called the variables or Non terminals
b) T is a finite set, disjoint from N, called the terminals
c) P is a finite set of rules or productions, with each rule being a variable and a string of
variables and terminals, and
d) S∈N is the start variable or symbol.
G = (N, T, P, S)
Where:
N = { Q, Z, N }
T = { 0, 1 }
P = { Q → Z | N | 0 | 1 | ℇ , Z → 0Q0 , N → 1Q1 }
S = { Q }
This grammar describes the language of palindromes over {0, 1}, such as: 1001, 11100111, 00100,
1010101, 11111, etc. (The productions Q → 0 and Q → 1 are needed for the odd-length strings.)
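The structure of this grammar can be mirrored directly in code. The sketch below is an illustrative helper (not part of the chapter), and it assumes the grammar includes the single-symbol productions Q → 0 and Q → 1 so that odd-length palindromes such as 00100 are derivable; it peels matching symbols off both ends exactly as Z → 0Q0 and N → 1Q1 do.

```python
def derives_palindrome(s: str) -> bool:
    """Check membership in the palindrome grammar by mirroring its rules:
    Q -> epsilon | 0 | 1 ends the recursion, while Z -> 0Q0 and
    N -> 1Q1 peel one matching symbol off each end of the string."""
    if s in ("", "0", "1"):                # Q -> epsilon | 0 | 1
        return True
    if s[0] == s[-1] and s[0] in "01":     # Z -> 0Q0  or  N -> 1Q1
        return derives_palindrome(s[1:-1])
    return False
```

Each recursive call strips one matched pair, so the recursion depth is at most half the string length.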
Derivation: Derivation is the process of applying a sequence of production rules in order to
derive a string of the corresponding language. It means the sequence of replacing non-terminal
symbols. E.g. id+id+id:
E ⇒ E+E ⇒ E+E+E ⇒ id+E+E ⇒ id+id+E ⇒ id+id+id
Generally, a derivation step has the form αAβ ⇒ αγβ, where A → γ is a production of the grammar.
⇒ means one derivation step
⇒* means zero or more derivation steps
⇒+ means one or more derivation steps
Derivation type:
1. Left Most Derivation: If we choose the leftmost non-terminal in each derivation step,
the derivation is said to be a leftmost derivation (LMD).
Example: Production rules:
S → S + S
S → S - S
S → a | b | c
Input:
a - b + c

The left-most derivation is:
S ⇒ S + S
⇒ S - S + S
⇒ a - S + S
⇒ a - b + S
⇒ a - b + c
2. Right Most Derivation: If we choose the rightmost non-terminal in each derivation
step, the derivation is known as a rightmost derivation (RMD).
Example: Production rules:
S → S + S
S → S - S
S → a | b | c
Input:
a - b + c
The right-most derivation is:
S ⇒ S - S
⇒ S - S + S
⇒ S - S + c
⇒ S - b + c
⇒ a - b + c
Example:
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
The left-most derivation is:
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Notice that the left-most non-terminal is always processed first.
The right-most derivation is:
E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id
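The "replace the leftmost non-terminal at each step" rule is mechanical enough to replay in code. The following is a minimal sketch (an illustrative helper, not part of the chapter) that assumes the only non-terminal is the single character E and treats sentential forms as plain strings:

```python
def leftmost_derivation(rhs_choices):
    """Replay a leftmost derivation for E -> E+E | E*E | id:
    at each step the leftmost occurrence of E is replaced by the
    chosen right-hand side."""
    sentential = "E"
    steps = [sentential]
    for rhs in rhs_choices:
        i = sentential.index("E")                      # leftmost E
        sentential = sentential[:i] + rhs + sentential[i + 1:]
        steps.append(sentential)
    return steps

steps = leftmost_derivation(["E*E", "E+E", "id", "id", "id"])
# steps: ['E', 'E*E', 'E+E*E', 'id+E*E', 'id+id*E', 'id+id*id']
```

The sequence of sentential forms reproduces the leftmost derivation of id + id * id shown above.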
Parse Tree or Derivation Tree: It is a graphical representation of derivation. It is convenient to
see how strings are derived from the start symbol. The start symbol of the derivation becomes the
root of the parse tree.
It consists of three kinds of nodes:

Root node: the start symbol, which has no parent node
Inner nodes: non-terminals
Leaf nodes: terminals, i.e., nodes that have no children
We take the left-most derivation of a + b * c (written with id for each identifier):
E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id

Ambiguity:
A grammar that has more than one leftmost or rightmost derivation for the same sentence is
called an ambiguous grammar.

Example

E→E+E
E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees:

N.B. For parsing, the grammar must be unambiguous so that the parser can make a unique
selection. We should eliminate ambiguity from grammars during the design phase of the compiler.
Left Recursion:
A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation
contains ‘A’ itself as the left-most symbol. Left-recursive grammar is considered to be a
problematic situation for top-down parsers. Top-down parsers start parsing from the Start
symbol, which in itself is non-terminal. So, when the parser encounters the same non-terminal in
its derivation, it becomes hard for it to judge when to stop parsing the left non-terminal and it
goes into an infinite loop.
Example:
(1) A => Aα | β
(2) S => Aα | β
    A => Sd
(1) is an example of immediate left recursion, where A is any non-terminal symbol, α is a string
of grammar symbols, and β is an alternative that does not start with A. (2) is an example of
indirect left recursion.

A top-down parser will first expand A, which in turn yields a string beginning with A
itself, so the parser may go into a loop forever.
Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes
immediate left recursion.

Example

The production set
S => Aα | β
A => Sd
after applying the above algorithm, should become
S => Aα | β
A => Aαd | βd
and then, after removing the immediate left recursion in A
using the first technique:
A => βdA'
A' => αdA' | ε
Now none of the productions has either direct or indirect left recursion.
Example: Consider the following grammar and eliminate left recursion-
E→E+T/T
T→TxF/F
F → id
The grammar after eliminating left recursion is-
E → TE’
E’ → +TE’ /∈
T → FT’
T’ → xFT’ / ∈
F → id
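The transformation of A => Aα | β into A => βA', A' => αA' | ε is mechanical, so it can be coded directly. Below is a minimal sketch (function and variable names are illustrative); productions are represented as lists of grammar symbols:

```python
def remove_left_recursion(nt, prods):
    """Transform  A -> A a1 | ... | A am | b1 | ... | bn   into
         A  -> b1 A' | ... | bn A'
         A' -> a1 A' | ... | am A' | ε
    `prods` is a list of right-hand sides, each a list of symbols."""
    alphas = [p[1:] for p in prods if p[0] == nt]    # A-recursive tails
    betas  = [p     for p in prods if p[0] != nt]    # non-recursive alternatives
    if not alphas:                                   # no immediate left recursion
        return {nt: prods}
    new = nt + "'"
    return {
        nt:  [b + [new] for b in betas],
        new: [a + [new] for a in alphas] + [["ε"]],
    }

g = remove_left_recursion("E", [["E", "+", "T"], ["T"]])
# g == {"E": [["T", "E'"]], "E'": [["+", "T", "E'"], ["ε"]]}
```

Applying it to E → E + T / T reproduces the answer above: E → TE', E' → +TE' / ε. Indirect left recursion (as in S → Aα, A → Sd) still requires substituting first, as the worked example shows.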
Consider the following grammar and eliminate left recursion-
S → (L) / a
L→L,S/S
The grammar after eliminating left recursion is-
S → (L) / a
L → SL’
L’ → ,SL’ / ∈
Left Factoring:
If two or more production rules of a grammar have a common prefix string, then the top-down
parser cannot decide which of the productions it should take to parse the string in
hand.
Example
If a top-down parser encounters a production like
A ⟹ αβ1 | αβ2 | … | γ1 | γ2 | …
Then it cannot determine which production to follow to parse the string as both
productions are starting from the same terminal (or non-terminal). To remove this confusion, we
use a technique called left factoring.
Left factoring transforms the grammar to make it useful for top-down parsers. In this
technique, we make one production for each common prefix, and the rest of each alternative is
moved into new productions.

Example

The above productions can be written as
A => αA'|γ1|γ2
A'=> β1|β2 …
Now the parser has only one production per prefix which makes it easier to take decisions.
Problem-01:
Do left factoring in the following grammar-
S → iEtS / iEtSeS / a
E→b
Solution-
The left factored grammar is-
S → iEtSS’ / a
S’ → eS / ∈
E→b
Problem-02:
Do left factoring in the following grammar-
A → aAB / aBc / aAc
Solution-
Step-01:
A → aA’
A’ → AB / Bc / Ac
Again, this is a grammar with common prefixes.
Step-02:
A → aA’
A’ → AD / Bc
D→B/c
This is a left factored grammar.
Problem-03:
Do left factoring in the following grammar-
S → bSSaaS / bSSaSb / bSb / a
Solution-
Step-01:
S → bSS’ / a
S’ → SaaS / SaSb / b
Again, this is a grammar with common prefixes.
Step-02:
S → bSS’ / a
S’ → SaA / b
A → aS / Sb
This is a left factored grammar.
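The factoring steps used in these problems can also be automated. The sketch below (names are illustrative) groups alternatives by their first symbol and pulls each group's longest common prefix out into a fresh primed non-terminal. It performs one round only, so, as in Problem-02, it may need to be repeated when the remaining suffixes still share prefixes:

```python
from collections import defaultdict

def longest_common_prefix(prods):
    prefix = []
    for column in zip(*prods):           # symbols at the same position
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def left_factor(nt, prods):
    """One round of left factoring over productions given as symbol lists."""
    groups = defaultdict(list)
    for p in prods:
        groups[p[0]].append(p)
    rules, fresh = {nt: []}, nt
    for group in groups.values():
        if len(group) == 1:              # unique first symbol: keep as-is
            rules[nt].append(group[0])
            continue
        prefix = longest_common_prefix(group)
        fresh += "'"                     # fresh primed non-terminal
        rules[nt].append(prefix + [fresh])
        rules[fresh] = [p[len(prefix):] or ["ε"] for p in group]
    return rules

g = left_factor("S", [["i","E","t","S"], ["i","E","t","S","e","S"], ["a"]])
# g == {"S": [["i","E","t","S","S'"], ["a"]], "S'": [["ε"], ["e","S"]]}
```

On Problem-01 this reproduces S → iEtSS' / a with S' → ε / eS.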
First and Follow Sets:
An important part of parser table construction is to create the first and follow sets. These sets
give the terminals that can appear at particular positions in derivations. They are used to build the
parsing table, i.e., to decide which production rule α to place in entry T[A, t].
First Set:

This set is created to know what terminal symbol is derived in the first position by a nonterminal.
For example,
α → tβ
That is, α derives t (a terminal) in the very first position, so t ∈ FIRST(α).
Algorithm for calculating the First set. Look at the definition of the FIRST(α) set:
• if α is a terminal, then FIRST(α) = { α }.
• if α is a non-terminal and α → ℇ is a production, then ℇ ∈ FIRST(α).
• if A → BC, then FIRST(A) = FIRST(B) if FIRST(B) does not contain ℇ; if FIRST(B)
contains ℇ, then FIRST(A) = (FIRST(B) – { ℇ }) ∪ FIRST(C).
The First set can be seen as: FIRST(α) = { t | α ⇒* tβ } ∪ { ℇ | α ⇒* ℇ }

Follow Set:
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in
production rules. We do not consider what the non-terminal can generate but instead, we see
what would be the next terminal symbol that follows the productions of a non-terminal.
Algorithm for calculating the Follow set:
• if S is the start symbol, then $ ∈ FOLLOW(S).
• if A → αBβ, then FOLLOW(B) includes FIRST(β) – { ℇ }.
• if A → αBβ, where β ⇒* ℇ, then FOLLOW(B) also includes FOLLOW(A).
• if A → αB, then FOLLOW(B) = FOLLOW(A).

The Follow set can be seen as: FOLLOW(A) = { t | S ⇒* αAtβ }


Problem-01:
Calculate the first and follow functions for the given grammar-

S → aBDh
B → cC
C → bC / ∈
D → EF
E →g/∈
F →f/∈
Solution-
The first and follow functions are as follows-
First Functions-
• First(S) = { a }
• First(B) = { c }
• First(C) = { b , ∈ }
• First(D) = { First(E) – ∈ } ∪ First(F) = { g , f , ∈ }
• First(E) = { g , ∈ }
• First(F) = { f , ∈ }
Follow Functions-
• Follow(S) = { $ }
• Follow(B) = { First(D) – ∈ } ∪ First(h) = { g , f , h }

• Follow(C) = Follow(B) = { g , f , h }
• Follow(D) = First(h) = { h }
• Follow(E) = { First(F) – ∈ } ∪ Follow(D) = { f , h }
• Follow(F) = Follow(D) = { h }
Problem-02:
Calculate the first and follow functions for the given grammar-
S → A
A → aB / Ad
B → b
C → g
Solution-
We have-
• The given grammar is left recursive.
• So, we first remove left recursion from the given grammar.
After eliminating left recursion, we get the following grammar-
S → A
A → aBA’
A’ → dA’ / ∈
B → b
C → g
Now, the first and follow functions are as follows-
First Functions-
• First(S) = First(A) = { a }
• First(A) = { a }
• First(A’) = { d , ∈ }
• First(B) = { b }
• First(C) = { g }
Follow Functions-
• Follow(S) = { $ }
• Follow(A) = Follow(S) = { $ }
• Follow(A’) = Follow(A) = { $ }
• Follow(B) = { First(A’) – ∈ } ∪ Follow(A) = { d , $ }
• Follow(C) = NA
Problem-03:
Calculate the first and follow functions for the given grammar-
S → (L) / a
L → SL’
L’ → ,SL’ / ∈
Solution-
The first and follow functions are as follows-
First Functions-
• First(S) = { ( , a }
• First(L) = First(S) = { ( , a }
• First(L’) = { , , ∈ }

Follow Functions-
• Follow(S) = { $ } ∪ { First(L’) – ∈ } ∪ Follow(L) ∪ Follow(L’) = { $ , , , ) }
• Follow(L) = { ) }
• Follow(L’) = Follow(L) = { ) }
Problem-04:
Calculate the first and follow functions for the given grammar-
E→E+T/T
T→TxF/F
F → (E) / id
Solution-
We have-
• The given grammar is left recursive.
• So, we first remove left recursion from the given grammar.
After eliminating left recursion, we get the following grammar-
E → TE’
E’ → + TE’ / ∈
T → FT’
T’ → x FT’ / ∈
F → (E) / id
Now, the first and follow functions are as follows-
First Functions-
• First(E) = First(T) = First(F) = { ( , id }
• First(E’) = { + , ∈ }
• First(T) = First(F) = { ( , id }
• First(T’) = { x , ∈ }
• First(F) = { ( , id }
Follow Functions-
• Follow(E) = { $ , ) }
• Follow(E’) = Follow(E) = { $ , ) }
• Follow(T) = { First(E’) – ∈ } ∪ Follow(E) ∪ Follow(E’) = { + , $ , ) }
• Follow(T’) = Follow(T) = { + , $ , ) }
• Follow(F) = { First(T’) – ∈ } ∪ Follow(T) ∪ Follow(T’) = { x , + , $ , ) }
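The answers in these problems come from iterating the FIRST/FOLLOW rules until nothing changes. The fixed-point computation can be sketched in Python; the grammar below is the left-recursion-free grammar of Problem-04, with ∈ written as in the text and x as the multiplication symbol (names of the helper functions are illustrative):

```python
from collections import defaultdict

EPS = "∈"
grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["x", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMS = set(grammar)

def first_of(seq, first):
    """FIRST of a sequence of grammar symbols."""
    out = set()
    for sym in seq:
        f = first[sym] if sym in NONTERMS else {sym}
        out |= f - {EPS}
        if EPS not in f:          # this symbol cannot vanish
            return out
    out.add(EPS)                  # every symbol could derive ∈
    return out

def compute_first():
    first = defaultdict(set)
    changed = True
    while changed:                # iterate to a fixed point
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                f = first_of(rhs, first)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first

def compute_follow(first, start="E"):
    follow = defaultdict(set)
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in NONTERMS:
                        continue
                    f = first_of(rhs[i + 1:], first)
                    add = (f - {EPS}) | (follow[A] if EPS in f else set())
                    if not add <= follow[B]:
                        follow[B] |= add
                        changed = True
    return follow

first = compute_first()
follow = compute_follow(first)
```

Running this reproduces the sets above, e.g. Follow(T) = { + , $ , ) } and Follow(F) = { x , + , $ , ) }.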
Top Down Parsing:
When the parser starts constructing the parse tree from the start symbol and then tries to
transform the start symbol to the input, it is called top-down parsing.
We have learnt above that the top-down parsing technique parses the input
and constructs a parse tree from the root node, gradually moving down to the leaf nodes.
The two broad types of top-down parsing, discussed below, are back-tracking parsing and
predictive parsing (recursive descent and LL(1)).

Back-tracking:
Top- down parsers start from the root node (start symbol) and match the input string
against the production rules to replace them (if matched). To understand this, take the following
example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser will behave like this:
It will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. ‘r’. The first production of S (S → rXd) matches it, so the top-down parser
advances to the next input letter (i.e. ‘e’). The parser tries to expand the non-terminal ‘X’ and
checks its first production from the left (X → oa). It does not match the next input symbol, so the
top-down parser backtracks to try the next production rule of X, (X → ea).
Now the parser matches all the input letters in an ordered manner. The string is accepted.
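This try-and-backtrack behaviour can be sketched as a brute-force recognizer. The function below (an illustrative sketch, not an efficient parser) expands the leftmost symbol of the current sentential form: for a non-terminal it tries each alternative in turn, and it backtracks whenever a terminal fails to match the input.

```python
GRAMMAR = {
    "S": [["r", "X", "d"], ["r", "Z", "d"]],
    "X": [["o", "a"], ["e", "a"]],
    "Z": [["a", "i"]],
}

def derives(sentential, s):
    """Try to rewrite the sentential form into the letter list s,
    backtracking over production choices when a match fails."""
    if not sentential:
        return not s                           # both exhausted: success
    head, rest = sentential[0], sentential[1:]
    if head in GRAMMAR:                        # expand a non-terminal,
        return any(derives(rhs + rest, s)      # trying each alternative
                   for rhs in GRAMMAR[head])
    return bool(s) and s[0] == head and derives(rest, s[1:])
```

For read, the call derives(["S"], list("read")) first tries X → oa, fails on ‘e’, backtracks, and succeeds via X → ea, exactly as described above.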

Recursive Descent Parsing:


Recursive descent is a top-down parsing technique that constructs the parse tree from the
top and the input is read from left to right. It uses procedures for every terminal and non-terminal
entity. This parsing technique recursively parses the input to make a parse tree, which may or
may not require back-tracking. But the grammar associated with it (if not left factored) cannot
avoid back-tracking. A form of recursive-descent parsing that does not require any back-tracking
is known as predictive parsing.
This parsing technique is regarded as recursive because it uses context-free grammar, which is
recursive in nature.
Algorithm for Recursive Descent Parsing:
1. If the symbol is a non-terminal, then call the corresponding procedure.
2. If the symbol is a terminal, compare it with the current input symbol; if they match,
advance the input.
3. If a non-terminal has several productions, write code for all of its alternatives.
4. No parsing table or auxiliary variables are needed.
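The steps above can be sketched as one procedure per non-terminal. The recognizer below is a minimal Python sketch for the left-factored expression grammar from earlier in the chapter (E → TE', E' → +TE' / ∈, T → FT', T' → xFT' / ∈, F → (E) / id); the class and method names are illustrative.

```python
class ParseError(Exception):
    pass

class RDParser:
    """Recursive-descent recognizer: one procedure per non-terminal."""

    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # end marker
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, t):                # terminal: compare and advance
        if self.peek() != t:
            raise ParseError(f"expected {t!r}, got {self.peek()!r}")
        self.pos += 1

    def E(self):                       # E -> T E'
        self.T(); self.Ep()

    def Ep(self):                      # E' -> + T E' | ∈
        if self.peek() == "+":
            self.match("+"); self.T(); self.Ep()

    def T(self):                       # T -> F T'
        self.F(); self.Tp()

    def Tp(self):                      # T' -> x F T' | ∈
        if self.peek() == "x":
            self.match("x"); self.F(); self.Tp()

    def F(self):                       # F -> ( E ) | id
        if self.peek() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

    def parse(self):
        self.E(); self.match("$")
        return True
```

Because the grammar is left-factored and has no left recursion, each procedure decides what to do from a single look-ahead token and no back-tracking is needed, which is exactly the predictive style discussed next.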

Predictive Parser (or) Non Recursive Descent Parsing (or) LL(1) Parser:
Predictive parser is a recursive descent parser, which has the capability to predict which
production is to be used to replace the input string. The predictive parser does not suffer from
backtracking.
To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to
the next input symbols. To make the parser back-tracking free, the predictive parser puts some
constraints on the grammar and accepts only a class of grammar known as LL(k) grammar.

Predictive parsing uses a stack and a parsing table to parse the input and generate a parse
tree. Both the stack and the input contain an end symbol $ to denote that the stack is empty and
the input is consumed. The parser refers to the parsing table to take any decision on the current
input symbol and stack-top combination.
An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but
with some restrictions imposed to obtain a simplified form that is easy to implement. An LL
grammar can be implemented by means of two algorithms, namely recursive-descent and
table-driven parsing.
An LL parser is denoted LL(k). The first L in LL(k) stands for scanning the input from left to
right, the second L stands for producing a left-most derivation, and k represents the number of
look-ahead symbols. Generally k = 1, so LL(k) is usually written as LL(1).

Problem:
Consider the following grammar with the following set of terminal symbols {+,-,*,/,(,),id}:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | id
1. Rewrite the grammar so that it can be analyzed by an LL(1) predictive parser.
2. Compute the FIRST and FOLLOW sets for the non-terminal symbols.
3. Build the parse table for the predictive parser.
4. Process the input phrase a*(b+c) detailing the contents of the stack, the input and each
action performed by the parser.

Solution:
The grammar can be parsed by an LL(1) parser if it does not have left recursion and no
ambiguity is present (i.e., the LOOKAHEADs for all productions of each non-terminal are
disjoint).
A simple inspection of the grammar shows that left recursion is present in both the E and T
rules. Also, there are left corners (common prefixes) that may hide ambiguity.
The first step, then, is to rewrite the grammar so that left recursion is eliminated:
E -> T E'
E' -> + T E' | - T E' | ε
T -> F T'
T' -> * F T' | / F T' | ε
F -> ( E ) | id
The new grammar still has left corners, but a cursory inspection shows that they are not
problematic. Nevertheless, they should be eliminated:
E -> ( E ) T' E' | id T' E'
E' -> + T E' | - T E' | ε
T -> ( E ) T' | id T'
T' -> * F T' | / F T' | ε
F -> ( E ) | id
FIRST(E) = FIRST(( E ) T' E') ∪ FIRST(id T' E') = {(, id}
FIRST(E') = FIRST(+ T E') ∪ FIRST(- T E') ∪ {ε} = {+, -, ε}
FIRST(T) = FIRST(( E ) T') ∪ FIRST(id T') = {(, id}
FIRST(T') = FIRST(* F T') ∪ FIRST(/ F T') ∪ {ε} = {*, /, ε}
FIRST(F) = FIRST(( E )) ∪ FIRST(id) = {(, id}

FOLLOW(E) = {$} ∪ {)} = {), $}
FOLLOW(E') = FOLLOW(E) = {), $}
FOLLOW(T) = (FIRST(E') \ {ε}) ∪ FOLLOW(E') = {+, -, ), $}
FOLLOW(T') = FOLLOW(T) = {+, -, ), $}
FOLLOW(F) = (FIRST(T') \ {ε}) ∪ FOLLOW(T') = {*, /, +, -, ), $}

For building the parse table, we will consider the FIRST and FOLLOW sets above.

       |  +           |  -           |  *           |  /           |  id            |  (                |  )        |  $
E      |              |              |              |              |  E -> id T'E'  |  E -> ( E ) T'E'  |           |
E'     |  E' -> +TE'  |  E' -> -TE'  |              |              |                |                   |  E' -> ε  |  E' -> ε
T      |              |              |              |              |  T -> id T'    |  T -> ( E ) T'    |           |
T'     |  T' -> ε     |  T' -> ε     |  T' -> *FT'  |  T' -> /FT'  |                |                   |  T' -> ε  |  T' -> ε
F      |              |              |              |              |  F -> id       |  F -> ( E )       |           |

The following table shows the parsing of the a*(b+c) input sequence.
STACK INPUT ACTION
E$ a*(b+c)$ E -> id T' E'
idT'E'$ a*(b+c)$ -> id ≡ a
T'E'$ *(b+c)$ T' -> * F T'
*FT'E'$ *(b+c)$ -> *
FT'E'$ (b+c)$ F -> ( E )
(E)T'E'$ (b+c)$ -> (
E)T'E'$ b+c)$ E -> id T' E'
idT'E')T'E'$ b+c)$ -> id ≡ b
T'E')T'E'$ +c)$ T' -> ε
E')T'E'$ +c)$ E' -> + T E'
+TE')T'E'$ +c)$ -> +
TE')T'E'$ c)$ T -> id T'
idT'E')T'E'$ c)$ -> id ≡ c
T'E')T'E'$ )$ T' -> ε
E')T'E'$ )$ E' -> ε
)T'E'$ )$ -> )
T'E'$ $ T' -> ε
E'$ $ E' -> ε
$ $ ACCEPT
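The stack/table loop shown in the trace can be driven directly from the parse table. The sketch below encodes the table as a dictionary (an empty right-hand side stands for an ε-entry) and recognizes token sequences; note that a, b, c lex to the token id:

```python
NONTERMS = {"E", "E'", "T", "T'", "F"}

# TABLE[(nonterminal, lookahead)] = right-hand side to push
TABLE = {
    ("E", "id"): ["id", "T'", "E'"],   ("E", "("): ["(", "E", ")", "T'", "E'"],
    ("E'", "+"): ["+", "T", "E'"],     ("E'", "-"): ["-", "T", "E'"],
    ("E'", ")"): [],                   ("E'", "$"): [],
    ("T", "id"): ["id", "T'"],         ("T", "("): ["(", "E", ")", "T'"],
    ("T'", "*"): ["*", "F", "T'"],     ("T'", "/"): ["/", "F", "T'"],
    ("T'", "+"): [], ("T'", "-"): [], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],               ("F", "("): ["(", "E", ")"],
}

def ll1_parse(tokens):
    """Drive the predictive-parsing stack loop shown in the trace."""
    stack, tokens, i = ["$", "E"], tokens + ["$"], 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            rhs = TABLE.get((top, tokens[i]))
            if rhs is None:
                return False             # empty table cell: syntax error
            stack.extend(reversed(rhs))  # push RHS, leftmost symbol on top
        elif top == tokens[i]:
            i += 1                       # terminal matched: advance input
        else:
            return False
    return i == len(tokens)

# a*(b+c) tokenizes as: id * ( id + id )
```

Calling ll1_parse(["id", "*", "(", "id", "+", "id", ")"]) walks through exactly the stack contents listed in the trace and accepts the input.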
