Syntax Analysis
• The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and verifies that the string
can be generated by the grammar for the source language. It reports any syntax errors in the program. It also
recovers from commonly occurring errors so that it can continue processing its input.
1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.
Issues :
Parser cannot detect errors such as:
1. Variable re-declaration
2. Use of a variable before it has been initialized
3. Data type mismatch in an operation
The above issues are handled by Semantic Analysis phase.
Context free grammar (CFG)
• Context free grammar is a formal grammar which is used to generate all possible strings in a given formal
language.
• Context free grammar G can be defined by four tuples as:
G= (V, T, P, S)
Where,
• G describes the grammar,
• T describes a finite set of terminal symbols (terminals are the basic symbols and cannot be derived further)
• V describes a finite set of non-terminal symbols (whatever appears on the left-hand side of a production is a non-terminal symbol)
• P describes a set of production rules (each production consists of a non-terminal, called the left side, followed by an arrow, followed by a string of non-terminals and/or terminals, called the right side)
• S is the start symbol.
• Example of terminal and non-terminal symbols:
• Terminal: { (, ), *, + }    Non-terminal: { S, A }
Contd….
• Example: identify the terminal and non-terminal symbols in a given grammar
Derivation
• Derivation is a sequence of applications of production rules. It is used to obtain the input string from the start symbol through these production rules. During parsing we have to take two decisions:
• We have to decide which non-terminal is to be replaced.
• We have to decide the production rule by which that non-terminal will be replaced.
• There are two standard orders for deciding which non-terminal to replace:
1. Leftmost derivation.
2. Rightmost derivation.
Contd…
• Leftmost derivation: In a leftmost derivation, at each step the leftmost non-terminal is expanded by substituting one of its productions to derive the string.
• Rightmost derivation: In a rightmost derivation, at each step the rightmost non-terminal is expanded by substituting one of its productions to derive the string.
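• As a small illustration (using a hypothetical grammar E → E + E | id, not one taken from these slides), the string id + id can be derived in both orders:
Leftmost derivation:  E ⇒ E + E ⇒ id + E ⇒ id + id   (the leftmost E is expanded first)
Rightmost derivation: E ⇒ E + E ⇒ E + id ⇒ id + id   (the rightmost E is expanded first)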
Parse tree
• A parse tree is the pictorial representation of the derivation process. In parsing, the string is derived from the start symbol, and the root of the parse tree is that start symbol.
• While constructing a parse tree, the following rules must be followed:
• The root node must be the start symbol.
• All interior nodes must be non-terminals.
• All leaf nodes must be terminals.
• Example: a parse tree is built step by step, starting from the start symbol at the root and ending with the input string at the leaves (construction steps shown in the accompanying figures).
Ambiguity in Grammar
• A grammar is said to be ambiguous if there exists more than one leftmost derivation or more than
one rightmost derivation or more than one parse tree for the given input string. If the grammar is
not ambiguous, then it is called unambiguous.
• If the grammar has ambiguity, then it is not good for compiler construction. No method can
automatically detect and remove the ambiguity, but we can remove ambiguity by re-writing the
whole grammar without ambiguity.
• Example: Check whether the given grammar is ambiguous or not?
E→E+E
E→E-E
E → id
• String: id + id - id
Contd….
• First Leftmost derivation
E→E+E
→ id + E
→ id + E - E
→ id + id - E
→ id + id - id
• Second Leftmost derivation
E→E-E
→E+E-E
→ id + E - E
→ id + id - E
→ id + id - id
• Since there are two leftmost derivations for the single string "id + id - id", the grammar G is ambiguous.
How to remove ambiguity?
We can remove ambiguity on the basis of the following two properties:
I. Precedence – If different operators are used, we consider the precedence of the operators. The three important characteristics are (see the layered grammar sketched after this list):
The level at which a production appears denotes the priority of the operator used in it.
Productions at higher levels have operators with lower priority. In the parse tree, the nodes at the top levels, close to the root node, contain the lower-priority operators.
Productions at lower levels have operators with higher priority. In the parse tree, the nodes at the lower levels, close to the leaf nodes, contain the higher-priority operators.
II. Associativity – If operators of the same precedence appear in the productions, then we have to consider their associativity.
If the associativity is left to right, then we introduce left recursion in the production. The parse tree will also be left recursive and grow on the left side.
+, -, *, / are left-associative operators.
If the associativity is right to left, then we introduce right recursion in the production. The parse tree will also be right recursive and grow on the right side.
^ is a right-associative operator.
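• As an illustration of both properties (assuming the usual convention that * has higher precedence than + and both are left-associative), the ambiguous grammar E → E + E | E * E | id can be rewritten in layered, left-recursive form as sketched here:
E → E + T | T      (lower-priority operator at the higher level, left recursive)
T → T * F | F      (higher-priority operator at the lower level, left recursive)
F → id
The same layered grammar, extended with parentheses, reappears in the left-recursion example later in this section.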
Example:
• Consider the ambiguous grammar E -> E-E | id
• Say we want to derive the string id-id-id. Let's take a single value id=3 to get more insight. The result should be:
(3-3)-3 = -3
• Since the operators have the same priority, we need to consider associativity, which is left to right.
Contd….
• So, to make the above grammar unambiguous, simply make the grammar left recursive, introducing a new non-terminal, say P, in place of the right-hand non-terminal. The grammar becomes:
E -> E – P | P
P -> id
• The above grammar is now unambiguous and will contain only one Parse Tree for the above
expression as shown below –
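• For example, under this grammar the string id - id - id now has only one leftmost derivation:
E ⇒ E – P ⇒ E – P – P ⇒ P – P – P ⇒ id – P – P ⇒ id – id – P ⇒ id – id – id
so the subtraction groups as (id – id) – id, which matches left-to-right associativity.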
Example:
• Consider the grammar shown below, which has two different operators :
E -> E + E | E * E | id
• Clearly, the above grammar is ambiguous, as we can draw two parse trees for the string "id+id*id", as shown below.
• Left recursion: A grammar in the form G = (V, T, S, P) is said to be in left recursive form if it has
the production rules of the form
A → A α |β
• Problem with left recursion: If left recursion is present in a grammar then, during parsing in the syntax analysis phase of compilation, there is a chance that the parser will enter an infinite loop. This is because each time A is expanded, it can produce another A without consuming any input.
• Left recursive grammar is not suitable for Top down parsers.
• This is because it makes the parser enter into an infinite loop.
• To avoid this situation, it is converted into its equivalent right recursive grammar.
• This is done by eliminating left recursion from the left recursive grammar.
Contd…
A → A α | β
• The above grammar is left recursive because the non-terminal on the left side of the production also appears as the first symbol on the right side of the production. The left recursion can be eliminated by replacing this pair of productions with
A → βA′
A′ → αA′ | ϵ
Example:
• Consider the following grammar.
E → E + T|T
T → T * F|F
F → (E)|id
• Eliminate immediate left recursion from the Grammar.
• Here, E → E + T | T is left recursive with
A = E, α = +T, β = T
A → Aα | β is changed to A → βA′ and A′ → αA′ | ε in order to eliminate the left recursion.
A → βA′ means E → TE′
A′ → αA′ | ε means E′ → +TE′ | ε
• The next production is T → T ∗ F | F, again of the form A → Aα | β with
• A = T, α = ∗F, β = F
• A → βA′ means T → FT′
• A′ → αA′ | ε means T′ → ∗FT′ | ε
• The production F → (E) | id does not have any left recursion.
Contd…
• After eliminating left recursion, we get the following grammar:
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → ∗FT′ | ε
F → (E) | id
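• The rewriting step can also be expressed compactly in code. The following is a minimal sketch only (the representation of a grammar as a non-terminal name plus a list of right-hand sides, and all function names, are my own illustration, not something from these slides):

    # Sketch: eliminate immediate left recursion.
    # A -> A a1 | A a2 | b1 | b2  becomes  A -> b1 A' | b2 A'  and  A' -> a1 A' | a2 A' | eps
    def eliminate_immediate_left_recursion(head, bodies):
        """head: non-terminal name, bodies: list of right-hand sides (lists of symbols)."""
        recursive = [b[1:] for b in bodies if b and b[0] == head]      # the alpha parts
        non_recursive = [b for b in bodies if not b or b[0] != head]   # the beta parts
        if not recursive:
            return {head: bodies}                                      # nothing to do
        new_head = head + "'"                                          # e.g. E gives E'
        return {
            head: [beta + [new_head] for beta in non_recursive],                 # A  -> beta A'
            new_head: [alpha + [new_head] for alpha in recursive] + [["eps"]],   # A' -> alpha A' | eps
        }

    # Example: E -> E + T | T  gives  E -> T E'  and  E' -> + T E' | eps
    print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))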
Left factoring
• Left factoring is used to transform a grammar whose productions share common prefixes into an equivalent left-factored grammar, removing the uncertainty for a top-down parser. In left factoring, we factor out the common prefixes from the production rules.
• In general, a pair of productions A → αβ1 / αβ2 with a common prefix α is replaced by
A → αA’
A’ → β1 / β2
• For example, the dangling-else style grammar
S → iEtS / iEtSeS / a
E → b
has the common prefix iEtS in its S-productions. Left factoring it gives:
S → iEtSS’
S’ → eS / ∈
E → b
Do left factoring in the following grammar-
• A → aAB / aBc / aAc
• Step-01:
A → aA’
A’ → AB / Bc / Ac
• Again, this is a grammar with a common prefix: A is the common prefix of AB and Ac, so we factor once more.
A’ → AD / Bc
D → B / c
• This is a left factored grammar.
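• One step of left factoring can be sketched in code in the same style as the earlier sketch. This is a rough illustration only (same list-of-symbols representation, names are my own); it factors one common prefix per call, so it is applied repeatedly, exactly as in the two steps above:

    # Sketch: one step of left factoring.
    # A -> a b1 | a b2 | c   becomes   A -> a A' | c   and   A' -> b1 | b2
    def left_factor_once(head, bodies):
        """head: non-terminal name, bodies: list of right-hand sides (lists of symbols)."""
        groups = {}
        for body in bodies:                       # group alternatives by their first symbol
            groups.setdefault(body[0] if body else "eps", []).append(body)
        for first, alts in groups.items():
            if len(alts) < 2:
                continue                          # no common prefix in this group
            prefix = alts[0]                      # longest common prefix of the grouped alternatives
            for alt in alts[1:]:
                i = 0
                while i < len(prefix) and i < len(alt) and prefix[i] == alt[i]:
                    i += 1
                prefix = prefix[:i]
            new_head = head + "'"
            tails = [alt[len(prefix):] or ["eps"] for alt in alts]
            others = [b for b in bodies if b not in alts]
            return {head: [prefix + [new_head]] + others, new_head: tails}
        return {head: bodies}                     # already left factored

    # Example from above: A -> aAB | aBc | aAc  gives  A -> aA'  and  A' -> AB | Bc | Ac
    print(left_factor_once("A", [["a", "A", "B"], ["a", "B", "c"], ["a", "A", "c"]]))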
FIRST AND FOLLOW
• Rules for calculating First:
1. If A → a, where a is a terminal, then First(A) = {a}
2. If A → ε, then First(A) = {ε}
3. If A → BC, then First(A) = First(B), if First(B) does not contain ε;
   if First(B) contains ε, then First(A) = (First(B) – {ε}) ∪ First(C)
• Rules for calculating Follow:
1. If S is the start symbol, then Follow(S) = { $ }.
2. If A → αBβ, then Follow(B) = First(β), if First(β) does not contain ε
3. If A → αB, then Follow(B) = Follow(A)
4. If A → αBβ, where First(β) contains ε, then Follow(B) = (First(β) – {ε}) ∪ Follow(A)
• If X is a grammar symbol, then FIRST(X) will be:
• If X is a terminal symbol, then FIRST(X) = {X}
• If X → ε, then FIRST(X) = {ε}
• If X is a non-terminal and X → aα, then FIRST(X) = {a}
• If X → Y1 Y2 Y3, then FIRST(X) will be computed as follows:
• (a) If Y1 is a terminal, then FIRST(X) = FIRST(Y1 Y2 Y3) = {Y1}
• (b) If Y1 is a non-terminal and Y1 does not derive the empty string, i.e., FIRST(Y1) does not contain ε, then FIRST(X) = FIRST(Y1 Y2 Y3) = FIRST(Y1)
• (c) If FIRST(Y1) contains ε, then FIRST(X) = FIRST(Y1 Y2 Y3) = (FIRST(Y1) − {ε}) ∪ FIRST(Y2 Y3)
• Similarly, FIRST(Y2 Y3) = {Y2} if Y2 is a terminal; otherwise, if Y2 is a non-terminal, then FIRST(Y2 Y3) = FIRST(Y2), provided FIRST(Y2) does not contain ε.
• If FIRST(Y2) contains ε, then FIRST(Y2 Y3) = (FIRST(Y2) − {ε}) ∪ FIRST(Y3)
• Similarly, this method is repeated for the remaining grammar symbols Y4, Y5, …, Yk.
• Computation of FOLLOW
• Follow (A) is defined as the collection of terminal symbols that can occur immediately to the right of A in some sentential form.
• FOLLOW(A) = {a|S ⇒* αAaβ where α, β can be any strings}
• Rules to find FOLLOW
• If S is the start symbol, FOLLOW (S) ={$}
• If a production is of the form A → αBβ, β ≠ ε:
• (a) If FIRST(β) does not contain ε, then FOLLOW(B) = {FIRST(β)}
• or
• (b) If FIRST(β) contains ε (i.e., β ⇒* ε), then FOLLOW(B) = (FIRST(β) − {ε}) ∪ FOLLOW(A)
• (because when β derives ε, the terminals that can follow A will also follow B)
• If a production is of the form A → αB, then FOLLOW(B) = {FOLLOW(A)}.
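• The rules above can be turned into a small fixed-point computation. The sketch below is illustrative only (function and variable names are my own, not part of the slides); it computes FIRST and FOLLOW for the left-recursion-free expression grammar obtained earlier:

    # Sketch: compute FIRST and FOLLOW sets by iterating until nothing changes.
    EPS, END = "eps", "$"
    GRAMMAR = {                                    # non-terminal -> list of right-hand sides
        "E":  [["T", "E'"]],
        "E'": [["+", "T", "E'"], [EPS]],
        "T":  [["F", "T'"]],
        "T'": [["*", "F", "T'"], [EPS]],
        "F":  [["(", "E", ")"], ["id"]],
    }
    NONTERMINALS = set(GRAMMAR)

    def first_of_string(symbols, first):
        """FIRST of a string of grammar symbols, given the FIRST sets computed so far."""
        result = set()
        for sym in symbols:
            if sym not in NONTERMINALS:            # terminal (or eps): it starts the string
                result.add(sym)
                return result
            result |= first[sym] - {EPS}
            if EPS not in first[sym]:
                return result
        result.add(EPS)                            # every symbol can derive eps
        return result

    def compute_first_follow(grammar, start):
        first = {nt: set() for nt in grammar}
        follow = {nt: set() for nt in grammar}
        follow[start].add(END)                     # rule 1: $ follows the start symbol
        changed = True
        while changed:
            changed = False
            for head, bodies in grammar.items():
                for body in bodies:
                    size = len(first[head])        # FIRST(head) grows from FIRST of the body
                    first[head] |= first_of_string(body, first)
                    changed |= len(first[head]) != size
                    for i, sym in enumerate(body): # FOLLOW rules for each non-terminal in the body
                        if sym not in NONTERMINALS:
                            continue
                        rest = first_of_string(body[i + 1:], first)
                        size = len(follow[sym])
                        follow[sym] |= rest - {EPS}
                        if EPS in rest:            # the tail can vanish: FOLLOW(head) flows in
                            follow[sym] |= follow[head]
                        changed |= len(follow[sym]) != size
        return first, follow

    first, follow = compute_first_follow(GRAMMAR, "E")
    print(first)    # FIRST(E) = {'(', 'id'}, FIRST(E') = {'+', 'eps'}, ...
    print(follow)   # FOLLOW(E) = {')', '$'}, FOLLOW(T') = {'+', ')', '$'}, ...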
LL(1) parser (or) predictive parser (or) non-recursive descent parser
• LL(1) Parsing: Here the first L represents that scanning of the input is done from left to right, the second L shows that this parsing technique uses the leftmost derivation, and the 1 represents the number of lookahead symbols, that is, how many input symbols the parser looks at when making a decision.
Example: shift-reduce parsing of the string id – id x id (the reductions use the productions E → id, E → E x E and E → E – E):
Stack            Input             Action
$                id – id x id $    Shift
$ id             – id x id $       Reduce E → id
$ E              – id x id $       Shift
$ E –            id x id $         Shift
$ E – id         x id $            Reduce E → id
$ E – E          x id $            Shift
$ E – E x        id $              Shift
$ E – E x id     $                 Reduce E → id
$ E – E x E      $                 Reduce E → E x E
$ E – E          $                 Reduce E → E – E
$ E              $                 Accept
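• As a rough illustration of the table-driven LL(1) scheme described above (a sketch only: the LL(1) table below is hand-built for the usual expression grammar E → TE′, E′ → +TE′ | ε, T → FT′, T′ → *FT′ | ε, F → (E) | id, and the names are my own), a predictive parser is simply a stack plus a table lookup keyed by the top-of-stack non-terminal and one lookahead token:

    # Sketch: a non-recursive predictive (LL(1)) parser driver.
    EPS, END = "eps", "$"
    NONTERMINALS = {"E", "E'", "T", "T'", "F"}
    TABLE = {                                       # (non-terminal, lookahead) -> production body
        ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
        ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [EPS], ("E'", END): [EPS],
        ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
        ("T'", "*"): ["*", "F", "T'"], ("T'", "+"): [EPS], ("T'", ")"): [EPS], ("T'", END): [EPS],
        ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
    }

    def ll1_parse(tokens):
        tokens = tokens + [END]
        stack = [END, "E"]                          # start symbol on top of $
        pos = 0
        while stack:
            top = stack.pop()
            lookahead = tokens[pos]
            if top == EPS:
                continue                            # nothing to match
            if top not in NONTERMINALS:             # terminal: must match the input
                if top != lookahead:
                    raise SyntaxError(f"expected {top!r}, got {lookahead!r}")
                pos += 1
                continue
            body = TABLE.get((top, lookahead))      # one symbol of lookahead decides
            if body is None:
                raise SyntaxError(f"no rule for ({top!r}, {lookahead!r})")
            print(f"{top} -> {' '.join(body)}")     # trace the leftmost derivation
            stack.extend(reversed(body))            # push RHS so its first symbol is on top
        return True

    ll1_parse(["id", "+", "id", "*", "id"])         # prints the productions used, then succeeds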
Operator precedence parsing
• Operator precedence parsing is a kind of shift-reduce parsing method. It is applied to a small class of grammars called operator grammars.
• A grammar is said to be an operator precedence grammar if it has two properties:
• No R.H.S. of any production contains ε.
• No two non-terminals are adjacent.
• There are three operator precedence relations:
• a ⋗ b means that terminal "a" has higher precedence than terminal "b".
• a ⋖ b means that terminal "a" has lower precedence than terminal "b".
• a ≐ b means that terminals "a" and "b" have the same precedence.
Precedence table:
Contd….
• Step 1: Check whether the given grammar is an operator grammar.
• Step 2: Construct the operator precedence relation table.
• Step 3: Parse the given input string.
• Step 4: Generate the parse tree.
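• As an illustration (assuming the usual conventions that * has higher precedence than +, both are left-associative, and id and $ are terminals), the precedence relations for a grammar such as E → E + E | E * E | id would be:

           +     *     id    $
      +    ⋗     ⋖     ⋖     ⋗
      *    ⋗     ⋗     ⋖     ⋗
      id   ⋗     ⋗           ⋗
      $    ⋖     ⋖     ⋖

The row symbol is the terminal nearest the top of the stack and the column symbol is the incoming terminal; ⋖ tells the parser to shift and ⋗ tells it to reduce (the blank entries are errors, and $ against $ means accept).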
LR parsing table
• An LR parser is a bottom-up parser for context-free grammars that is widely used by compilers for programming languages and other associated tools. An LR parser reads its input from left to right and produces a rightmost derivation in reverse. It is called a bottom-up parser because it attempts to reduce the input towards the top-level grammar productions by building up from the leaves. LR parsers are the most powerful of all deterministic parsers used in practice.
• Description of LR parser :
In the term LR(k) parser, L refers to left-to-right scanning, R refers to the rightmost derivation in reverse, and k refers to the number of unconsumed "look-ahead" input symbols that are used in making parser decisions. Typically k is 1 and is often omitted. A context-free grammar is called LR(k) if an LR(k) parser exists for it. The parser reduces the sequence of tokens from left to right; reading the sequence of reductions backwards gives a rightmost derivation.
1. Initially the stack is empty, and we are ultimately looking to reduce by the rule S’ → S$.
2. A "." in a rule indicates how much of that rule has already been recognized (i.e., is already on the stack).
3. A dotted item, or simply an item, is a production rule with a dot indicating how much of the RHS has been recognized so far.
The closure of an item is used to see which production rules can be used to expand the current structure. It is calculated as follows:
• Rules for computing the closure:
1. The given item itself is added to the closure as the first element.
2. If an item of the form A → α . B β is present in the closure, where the symbol B after the dot is a non-terminal, add B's production rules with the dot placed before the first symbol of their right-hand sides.
3. Repeat step (2) for the new items added in step (2).
• LR parser algorithm :
The LR parsing algorithm is the same for all LR parsers, but the parsing table is different for each parser. The parser consists of the following components.
1. Input Buffer –
It contains the given string, and it ends with a $ symbol.
2. Stack –
The combination of the state symbol on top of the stack and the current input symbol is used to index the parsing table in order to take the parsing decisions.
• Parsing Table :
The parsing table is divided into two parts: the Action table and the Goto table. The Action table gives the action to perform for the given current state and the current terminal in the input stream. There are four kinds of entries in the Action table, as follows.
1. Shift n – the present terminal is removed from the input stream, and state n is pushed onto the stack, becoming the new present state.
2. Reduce m – the rule number m is written to the output stream; as many symbols as appear on the right-hand side of rule m are popped from the stack; and the non-terminal on the left-hand side of rule m, together with the state now on top of the stack, is used to look up a new state in the Goto table, which is pushed onto the stack and becomes the new current state.
3. Accept – the input has been parsed successfully.
4. Error – no valid action exists for the current state and input symbol, so a syntax error is reported.
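• To make the shift/reduce/accept mechanics concrete, here is a minimal sketch of the driver loop with a tiny hand-built SLR(1) table for the toy grammar (1) E → E + id, (2) E → id (the table, state numbers and names are my own illustration, not taken from these slides):

    # Sketch: the LR driver loop over a hand-built ACTION/GOTO table.
    END = "$"
    RULES = {1: ("E", 3), 2: ("E", 1)}             # rule number -> (LHS, length of RHS)
    ACTION = {                                     # (state, terminal) -> action
        (0, "id"): ("shift", 2),
        (1, "+"):  ("shift", 3), (1, END): ("accept", None),
        (2, "+"):  ("reduce", 2), (2, END): ("reduce", 2),
        (3, "id"): ("shift", 4),
        (4, "+"):  ("reduce", 1), (4, END): ("reduce", 1),
    }
    GOTO = {(0, "E"): 1}                           # (state, non-terminal) -> state

    def lr_parse(tokens):
        tokens = tokens + [END]
        states = [0]                               # stack of states; state 0 is the start state
        pos = 0
        while True:
            kind, arg = ACTION.get((states[-1], tokens[pos]), ("error", None))
            if kind == "shift":                    # consume the terminal, push the new state
                states.append(arg)
                pos += 1
            elif kind == "reduce":                 # pop |RHS| states, then consult GOTO
                lhs, length = RULES[arg]
                del states[len(states) - length:]
                states.append(GOTO[(states[-1], lhs)])
                print(f"reduce by rule {arg} ({lhs})")   # rule number goes to the output stream
            elif kind == "accept":
                return True
            else:
                raise SyntaxError(f"unexpected {tokens[pos]!r} in state {states[-1]}")

    lr_parse(["id", "+", "id", "+", "id"])         # prints the reductions, then accepts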
• LR parser diagram :
Syntax directed Definition
• Parser uses a CFG(Context-free-Grammar) to validate the input string and produce output for the next phase
of the compiler. Output could be either a parse tree or syntax tree. Now to interleave semantic analysis with
the syntax analysis phase of the compiler, we use Syntax Directed definition.
• In a syntax directed definition, the grammar is augmented with some informal notations, and these notations are known as semantic rules.
• After implementing the Semantic Analysis, the source program is modified to an intermediate form.
• There is some information that is required by the Intermediate code generation phase to convert the
semantically checked parse tree into intermediate code. But this information or attributes of variables cannot
be represented alone by Context- Free Grammar.
• So, some semantic actions or rules have to be attached with Context-Free grammar which helps the
intermediate code generation phase to generate intermediate code.
• So, Attaching attributes to the variables of the context Free Grammar and defining semantic rules (meaning)
of each production of grammar is called Syntax Directed Definition.
• It is a kind of notation in which each production of Context-Free Grammar is related with a set of semantic
rules or actions, and each grammar symbol is related to a set of Attributes.
• Attributes may be number, strings, references, datatypes etc.
1. Synthesized Attributes : If a node takes value from its children then it is synthesized attribute. A
Synthesized attribute is an attribute of the non-terminal on the left-hand side of a production. Synthesized
attributes represent information that is being passed up the parse tree. The attribute can take value only
from its children (Variables in the RHS of the production). The non-terminal concerned must be in the
head (LHS) of production. For e.g. let’s say A -> BC is a production of a grammar, and A’s attribute is
dependent on B’s attributes or C’s attributes then it will be synthesized attribute.
2. Inherited Attributes: If a node takes its value from its parent or siblings, it is an inherited attribute. An attribute of a non-terminal on
the right-hand side of a production is called an inherited attribute. The attribute can take value either from
its parent or from its siblings (variables in the LHS or RHS of the production). The non-terminal
concerned must be in the body (RHS) of production. For example, let’s say A -> BC is a production of a
grammar and B’s attribute is dependent on A’s attributes or C’s attributes then it will be inherited attribute
because A is a parent here, and C is a sibling.
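• A standard illustration (the usual desk-calculator style SDD, given here as an example of the two attribute kinds rather than an example from these slides):
Production        Semantic rule              Attribute kind
E → E1 + T        E.val = E1.val + T.val     E.val is synthesized (computed from the children)
T → F T'          T'.inh = F.val             T'.inh is inherited (passed in from a sibling)
T' → * F T1'      T1'.inh = T'.inh * F.val   inherited
T' → ε            T'.syn = T'.inh            synthesized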
Types of syntax directed definition
I. S-attributed SDD :
• If an SDD uses only synthesized attributes, it is called an S-attributed SDD.
• S-attributed SDDs are evaluated in bottom-up parsing, as the values of the parent nodes depend upon the
values of the child nodes.
• Semantic actions are placed at the rightmost position of the RHS.
II. L-attributed SDD:
• If an SDD uses both synthesized attributes and inherited attributes, with the restriction that an inherited attribute can inherit values only from its left siblings or its parent node, it is called an L-attributed SDD.
• Attributes in L-attributed SDDs are evaluated in a depth-first, left-to-right manner.
• Semantic actions can be placed anywhere in the RHS.
• Example: S -> ABC. Here B can obtain its value either from the parent S or from its left sibling A, but it cannot inherit from its right sibling C. The same goes for A and C: A can only get its value from its parent, while C can get its value from S, A and B as well, because C is the rightmost symbol in the given production.
Dependency graph
• It represents the flow of information among the attributes in a parse tree.
• Dependency graphs are useful for determining the evaluation order of the attributes in a parse tree.
• While an annotated parse tree shows the values of attributes, a dependency graph determines how those values can be
computed.
• These are used to show the flow of information among attribute instances within a parse tree.
• In the graph, an edge from one attribute instance to another means that the value of the first attribute is required to compute
the value of the second.
• Edges are used to express constraints that are implied by the language semantic rules.
• For each node in the parse tree, for example a node X, the dependency graph will have a node associated with X.
• If a semantic rule associated with a production P defines the value of a synthesized attribute A.b in terms of the value of X.c, then the dependency graph has an edge from X.c to A.b. In more detail, at every node N labeled A where this production P is applied, we create an edge to attribute b at N from attribute c at the child of N that corresponds to this instance of the symbol X in the body of the production.
• If a semantic rule associated with a production P defines the value of an inherited attribute B.c in terms of the value of X.a, the dependency graph contains an edge from X.a to B.c. For each node N that is labeled B and corresponds to an occurrence of B in the body of production P, we create an edge to attribute c at N from attribute a at the node M that corresponds to this occurrence of X. Keep in mind that M can be either the parent or a sibling of N.
Intermediate code generator
• Intermediate codes are machine-independent codes, but they are close to machine instructions. The
given program in a source language is converted to an equivalent program in an intermediate
language by the intermediate code generator.
• Forms of intermediate code :
• Syntax tree or abstract syntax tree
• Postfix notation
• Three address code
• Syntax tree: A syntax tree is nothing more than a condensed form of a parse tree. The operator and keyword nodes of the parse tree are moved to their parents, and a chain of single productions is replaced by a single link. In a syntax tree, the internal nodes are operators and the leaf nodes are operands. To form a syntax tree, put parentheses in the expression; that way it is easy to recognize which operand should come first.
Postfix notation
• In postfix notation, the operator comes after its operands, i.e., the operator follows the operands it applies to.
• Example
• Postfix notation for the expression (a+b) * (c+d) is ab+ cd+ *
• Postfix notation for the expression (a*b) - (c+d) is ab* cd+ -
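• Since postfix notation is exactly the post-order traversal of a syntax tree, the two forms are easy to relate in code. A minimal sketch (the tuple representation of the tree and the function name are my own, chosen to match the (a+b) * (c+d) example above):

    # Sketch: a post-order walk of a syntax tree yields postfix notation.
    def to_postfix(node):
        if isinstance(node, str):                  # a leaf is an operand
            return node
        op, left, right = node                     # an internal node is (operator, left, right)
        return f"{to_postfix(left)} {to_postfix(right)} {op}"

    # Syntax tree for (a+b) * (c+d): the parentheses disappear into the shape of the tree
    tree = ("*", ("+", "a", "b"), ("+", "c", "d"))
    print(to_postfix(tree))                        # prints: a b + c d + *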
Three address code