4 - Syntax Analyzer (CFG)
CS440 Compilers 1
Syntax Analyzer (Contd.)
• The syntax of a programming language is described by a Context-Free
Grammar (CFG). We will use BNF (Backus-Naur Form) notation in the
description of CFGs.
• The syntax analyzer (parser) checks whether a given source program
satisfies the rules implied by a CFG or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise, the parser reports error messages.
• A Context-Free Grammar (CFG)
– Gives a precise syntactic specification of a programming language.
– The design of the grammar is an initial phase of the design of a compiler.
– A grammar can be directly converted into a parser by some tools.
Parser
Parsers (cont.)
• We categorize the parsers into two groups:
1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers scan the input from left to right
(one symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for
sub-classes of CFG.
– LL for top-down parsing
– LR for bottom-up parsing
Grammar
• A set of rules by which valid sentences in a language are constructed.
• Example of English grammar:
– sentence –> <subject> <verb-phrase> <object>
– subject –> This | Computers | I
– verb-phrase –> <adverb> <verb> | <verb>
– adverb –> never
– verb –> is | run | am | tell
– object –> the <noun> | a <noun> | <noun>
– noun –> university | world | lecturer | lies
• Using the above rules or productions, we can derive simple sentences
such as these:
– This is a university.
– Computers run the world.
– I am the lecturer.
– I never tell lies.
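The productions above can also be expanded mechanically. The following is a minimal sketch, assuming a dictionary encoding of the rules (the names GRAMMAR, expand and SENTENCES are illustrative, not part of the slides). Because this small grammar has no recursion, every sentence it generates can be enumerated.

```python
from itertools import product

# Each nonterminal maps to its alternatives; anything that is not a key is a word.
GRAMMAR = {
    "sentence": [["subject", "verb-phrase", "object"]],
    "subject": [["This"], ["Computers"], ["I"]],
    "verb-phrase": [["adverb", "verb"], ["verb"]],
    "adverb": [["never"]],
    "verb": [["is"], ["run"], ["am"], ["tell"]],
    "object": [["the", "noun"], ["a", "noun"], ["noun"]],
    "noun": [["university"], ["world"], ["lecturer"], ["lies"]],
}

def expand(symbol):
    """Return every word sequence derivable from `symbol` (grammar is non-recursive)."""
    if symbol not in GRAMMAR:                 # terminal word
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every choice for each symbol on the right-hand side.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

SENTENCES = {" ".join(words) for words in expand("sentence")}
```

The four example sentences are all in SENTENCES, together with many grammatical but meaningless ones such as "Computers never am a lies": a grammar describes structure, not meaning.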
Some Important Terms
Some Important Terms (Contd.)
Formal Grammar
• We formally define a grammar as a 4-tuple (S, P, N, T).
• S is the start symbol and S ∈ N
• P is the set of productions
• N and T are the nonterminal and terminal alphabets.
• A sentence is a string of symbols in T derived from S using one or
more applications of productions in P.
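As an illustration, the 4-tuple can be written down as a small data structure. This is only a sketch under assumptions: the class name Grammar and its field names are hypothetical, and productions are restricted to the context-free shape (a single nonterminal on the left-hand side).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    start: str                 # S
    productions: tuple         # P, as (left-hand side, right-hand side tuple) pairs
    nonterminals: frozenset    # N
    terminals: frozenset       # T

    def __post_init__(self):
        if self.start not in self.nonterminals:
            raise ValueError("the start symbol S must be a member of N")
        if self.nonterminals & self.terminals:
            raise ValueError("N and T must be disjoint")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals or any(s not in symbols for s in rhs):
                raise ValueError(f"production uses unknown symbols: {lhs} -> {rhs}")

# Illustrative grammar with the productions A -> a and A -> aA.
G = Grammar(
    start="A",
    productions=(("A", ("a",)), ("A", ("a", "A"))),
    nonterminals=frozenset({"A"}),
    terminals=frozenset({"a"}),
)
```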
4 Types of formal grammars (Chomsky Hierarchy)
Types of formal grammars (Contd.)
• Type 2: context-free grammars:
– Productions are of the form X -> v, where v is an arbitrary string of symbols in V (terminals
and nonterminals) and X is a single nonterminal.
– Wherever you find X, you can replace it with v (regardless of context).
• Type 3: regular grammars:
– Productions are of the form X -> a, X -> aY, or X -> ε, where X and Y are nonterminals
and a is a terminal.
– That is, the left-hand side must be a single nonterminal and the right-hand side can be
empty, a single terminal, or a single terminal followed by a single nonterminal.
– These grammars are the most limited in terms of expressive power.
• Note: Every type 3 grammar is a type 2 grammar, and every type 2 is a
type 1 and so on.
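The two production shapes can be checked mechanically. The sketch below assumes productions are stored as (left-hand side tuple, right-hand side tuple) pairs; the function names are illustrative. It tests the form of the productions only, not whether the language itself happens to be regular.

```python
def is_context_free(productions, nonterminals):
    """Type 2: every left-hand side is a single nonterminal."""
    return all(len(lhs) == 1 and lhs[0] in nonterminals for lhs, _ in productions)

def is_regular(productions, nonterminals):
    """Type 3: X -> a, X -> aY or X -> ε, with X, Y nonterminals and a a terminal."""
    def rhs_ok(rhs):
        if len(rhs) == 0:                      # X -> ε
            return True
        if rhs[0] in nonterminals:             # must start with a terminal
            return False
        return len(rhs) == 1 or (len(rhs) == 2 and rhs[1] in nonterminals)
    return is_context_free(productions, nonterminals) and all(
        rhs_ok(rhs) for _, rhs in productions)

# A -> a | aA has the type 3 form; Exp -> Exp+Exp | num is only type 2.
A_PLUS = [(("A",), ("a",)), (("A",), ("a", "A"))]
EXPR = [(("Exp",), ("Exp", "+", "Exp")), (("Exp",), ("num",))]
```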
Context-Free Grammars (CFG)
• A CFG recursively defines several sets of strings.
• Each set is denoted by a name, which is called a nonterminal.
• The set of nonterminals is disjoint from the set of terminals.
• One of the nonterminals is chosen to denote the language
described by the grammar. This is called the start symbol of
the grammar.
• Like regular expressions, CFGs describe sets of strings, i.e.,
languages.
• A CFG also defines the structure of the strings in the language it
defines.
• Symbols in the alphabet are called terminals or leaves.
CFG (Contd.)
CFG (Contd.)
Examples
• A -> a
• says that the set denoted by the nonterminal A contains the one-
character string a.
• A -> aA
• says that the set denoted by A contains all strings formed by
putting an a in front of a string taken from the set denoted by A.
• Together, these two productions indicate that A contains all non-
empty sequences of a’s and is hence (in the absence of other
productions) equivalent to the regular expression a+.
CFG (Contd.)
• We can define a grammar equivalent to the regular expression
a* by the two productions
– B ->
– B -> aB
– where the first production indicates that the empty string is
part of the set B.
– Productions with empty right-hand sides are called empty
productions.
– Empty productions are sometimes written with an ε on the
right-hand side instead of leaving it empty.
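The claimed equivalences with a+ and a* can be spot-checked by enumerating short strings. This is a sketch under assumptions: symbols are single characters, dictionary keys are the nonterminals, and the helper names language and matches_regex are illustrative. It compares strings only up to a length bound, so it is a sanity check and not a proof.

```python
import re
from collections import deque

def language(productions, start, max_len):
    """Terminal strings with at most max_len symbols derivable from start."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the left-most nonterminal, if any.
        idx = next((i for i, ch in enumerate(form) if ch in productions), None)
        if idx is None:
            found.add(form)
            continue
        for rhs in productions[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(ch not in productions for ch in new)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

A_PLUS = {"A": ["a", "aA"]}   # A -> a, A -> aA
A_STAR = {"B": ["", "aB"]}    # B -> (empty), B -> aB

def matches_regex(regex, productions, start, max_len=5):
    """Compare the grammar with a regular expression on strings of a's up to max_len."""
    expected = {"a" * n for n in range(max_len + 1) if re.fullmatch(regex, "a" * n)}
    return language(productions, start, max_len) == expected
```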
CFG (Contd.)
• Simple expression grammar
Exp -> Exp+Exp
Exp -> Exp-Exp
Exp -> Exp*Exp
Exp -> Exp/Exp
Exp -> num
Exp -> (Exp)
Syntactic Categories
• A syntactic category is a sub-language that embodies a particular
concept.
• Normally writing a grammar for a programming language starts by
dividing the constructs of the language into different syntactic
categories.
• Examples of common syntactic categories in programming languages:
1. Expressions: used to express calculation of values.
2. Statements: express actions that occur in a particular sequence.
3. Declarations: express properties of names used in other parts of
the program.
Derivations
• The basic idea of derivation is to consider productions as rewrite rules.
• Whenever we have a nonterminal, we can replace it by the right-hand
side of any production in which the nonterminal appears on the left-
hand side.
Example: E ⇒ E+E : E+E derives from E
– we can replace E by E+E
– To be able to do this, we need a production rule E -> E+E in our grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id (derivation of id+id from E)
• We can do this anywhere in a sequence of symbols (terminals and
nonterminals) and repeat doing so until we have only terminals left.
• The resulting sequence of terminals is a string in the language defined
by the grammar.
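A single derivation step is just a list rewrite. The sketch below replays the derivation of id+id, assuming the productions E -> E+E and E -> id (the second is implied by the example) and a hypothetical helper named rewrite.

```python
def rewrite(form, index, lhs, rhs):
    """Replace the nonterminal `lhs` at position `index` of `form` by `rhs`."""
    if form[index] != lhs:
        raise ValueError(f"symbol at position {index} is {form[index]!r}, not {lhs!r}")
    return form[:index] + rhs + form[index + 1:]

# E => E+E => id+E => id+id
step0 = ["E"]
step1 = rewrite(step0, 0, "E", ["E", "+", "E"])   # E -> E+E
step2 = rewrite(step1, 0, "E", ["id"])            # E -> id on the left E
step3 = rewrite(step2, 2, "E", ["id"])            # E -> id on the remaining E
```

After the last step no nonterminal is left, so step3 is a string of the language.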
Derivations (Contd.)
Derivation (Contd.)
Left-most and Right-most derivations
• At each derivation step, we can choose any of the non-terminals in the sentential form
of G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, the derivation
is called a left-most derivation. If we always choose the right-most non-terminal, it is
called a right-most derivation.
• Example: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
Left-Most and Right-Most Derivations (Contd.)
Left-Most Derivation
E ⇒_lm -E ⇒_lm -(E) ⇒_lm -(E+E) ⇒_lm -(id+E) ⇒_lm -(id+id)
Right-Most Derivation
E ⇒_rm -E ⇒_rm -(E) ⇒_rm -(E+E) ⇒_rm -(E+id) ⇒_rm -(id+id)
• We will see that the top-down parsers try to find the left-most derivation
of the given source program.
• We will see that the bottom-up parsers try to find the right-most
derivation of the given source program in the reverse order.
Left-most derivation Example
• Given the grammar T -> R, T -> aTc, R -> ε, R -> RbR, a left-most derivation of aabbbcc is:
1. T
2. aTc
3. aaTcc
4. aaRcc
5. aaRbRcc
6. aaRbRbRcc
7. aabRbRcc
8. aabRbRbRcc
9. aabbRbRcc
10. aabbbRcc
11. aabbbcc
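The eleven sentential forms above can be reproduced by always rewriting the left-most nonterminal. This is a minimal sketch, assuming single-character symbols; the names leftmost_step, STEPS and derive are illustrative, and STEPS lists the production chosen at each step.

```python
NONTERMINALS = {"T", "R"}

def leftmost_step(form, lhs, rhs):
    """Rewrite the left-most nonterminal of `form`, which must be `lhs`, to `rhs`."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            if symbol != lhs:
                raise ValueError(f"left-most nonterminal is {symbol}, not {lhs}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to rewrite")

# Production applied at each step; "" is the empty right-hand side of R -> ε.
STEPS = [("T", "aTc"), ("T", "aTc"), ("T", "R"), ("R", "RbR"), ("R", "RbR"),
         ("R", ""), ("R", "RbR"), ("R", ""), ("R", ""), ("R", "")]

def derive(start, steps):
    """Return the list of sentential forms produced by the given steps."""
    forms = [start]
    for lhs, rhs in steps:
        forms.append(leftmost_step(forms[-1], lhs, rhs))
    return forms

FORMS = derive("T", STEPS)
```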
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
[Figure: the parse tree for -(id+id) built step by step, one tree per step of the derivation E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)]
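A parse tree can be represented directly as nested data. The sketch below encodes the tree for -(id+id) as (symbol, children) pairs, which is an assumed representation, and reads the derived sentence back from the leaves.

```python
# A node is (symbol, children); a leaf (terminal) has an empty child list.
def leaf(symbol):
    return (symbol, [])

TREE = ("E", [leaf("-"),
              ("E", [leaf("("),
                     ("E", [("E", [leaf("id")]), leaf("+"), ("E", [leaf("id")])]),
                     leaf(")")])])

def frontier(node):
    """Concatenate the leaves from left to right (the derived sentence)."""
    symbol, children = node
    if not children:
        return symbol
    return "".join(frontier(child) for child in children)

def inner_symbols(node):
    """Collect the symbols of all inner nodes; they should all be nonterminals."""
    symbol, children = node
    if not children:
        return []
    return [symbol] + [s for child in children for s in inner_symbols(child)]
```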
Ambiguity
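• A grammar is ambiguous if it produces more than one parse tree for some sentence.
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Figure: parse tree for this derivation of id+id*id, with E * E at the root and E + E as its left subtree]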
Ambiguity (cont.)
• For most parsers, the grammar must be unambiguous.
• unambiguous grammar ⇒ unique selection of the parse tree for a sentence
• We should eliminate the ambiguity in the grammar during the design
phase of the compiler.
• An unambiguous grammar should be written to eliminate the ambiguity.
• We have to prefer one of the parse trees of a sentence (generated by an
ambiguous grammar) to disambiguate that grammar.
• In most (but not all) cases, an ambiguous grammar can be rewritten to
an unambiguous grammar that generates the same set of strings, or
external rules can be applied to decide which of the many possible
syntax trees is the “right one”.
Ambiguity (cont.)
• How do we know if a grammar is ambiguous?
• If we can find a string and show two alternative syntax trees for it, this is
a proof of ambiguity.
• It may, however, be hard to find such a string and, when the grammar is
unambiguous, even harder to show that this is the case.
• In fact, the problem is formally undecidable, i.e., there is no method that
for all grammars can answer the question “Is this grammar ambiguous?”.
• But in many cases it is not difficult to detect and prove ambiguity. For
example, any grammar that has a production of the form N -> NαN is ambiguous.
– N is a nonterminal and α is any sequence of grammar symbols.
– For example, T -> R, T -> aTc, R -> ε, R -> RbR is an ambiguous grammar, and
T -> R, T -> aTc, R -> ε, R -> bR is an unambiguous version of it.
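Productions of the form N -> NαN also occur in the expression grammar (E -> E+E, E -> E*E). As a small illustration, the sketch below counts the parse trees of a token sequence under the reduced grammar E -> E+E | E*E | id, which is an assumption made here to keep the example short. A count above 1 for any string proves ambiguity.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):                      # trees deriving tokens[i:j] from E
        if j - i == 1:
            return int(tokens[i] == "id")
        total = 0
        for k in range(i + 1, j - 1):     # tokens[k] is the operator at the root
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

For id+id*id the count is 2: one tree has + at the root and the other has * at the root.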
Ambiguity (cont.)
[Figure: two alternative parse trees, labeled 1 and 2, rooted at stmt for a conditional statement involving E2, S1 and S2]
Ambiguity (cont.)
• Ambiguous grammar
Stat -> id := Exp
Stat -> Stat ; Stat
Stat -> if Exp then Stat else Stat
Stat -> if Exp then Stat
To make it unambiguous, we introduce two nonterminals: one for matched (i.e. with
else-part) conditionals and one for unmatched (i.e. without else-part)
conditionals.
• Unambiguous grammar for statements
Stat -> Stat2 ; Stat
Stat -> Stat2
Stat2 -> Matched
Stat2 -> Unmatched
Matched -> if Exp then Matched else Matched
Matched -> id := Exp
Unmatched -> if Exp then Matched else Unmatched
Unmatched -> if Exp then Stat2
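The effect of the rewrite can be checked on a nested conditional by counting parse trees under both grammars. The counter below is a generic sketch that assumes a grammar without empty productions and without cyclic unit productions; Exp is treated as a single terminal token to keep the example small, and the names count_trees, AMBIGUOUS, UNAMBIGUOUS and NESTED are illustrative.

```python
from functools import lru_cache

def count_trees(grammar, start, tokens):
    """Number of parse trees of `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):             # trees for tokens[i:j] rooted at symbol
        if symbol not in grammar:         # terminal: must match exactly one token
            return int(j == i + 1 and tokens[i] == symbol)
        return sum(split(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def split(rhs, i, j):                 # ways for rhs to cover tokens[i:j]
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        if j - i < len(rhs):              # every symbol needs at least one token
            return 0
        total = 0
        for k in range(i + 1, j - len(rhs) + 2):
            left = derive(rhs[0], i, k)
            if left:
                total += left * split(rhs[1:], k, j)
        return total

    return derive(start, 0, len(tokens))

AMBIGUOUS = {"Stat": [("id", ":=", "Exp"), ("Stat", ";", "Stat"),
                      ("if", "Exp", "then", "Stat", "else", "Stat"),
                      ("if", "Exp", "then", "Stat")]}
UNAMBIGUOUS = {
    "Stat": [("Stat2", ";", "Stat"), ("Stat2",)],
    "Stat2": [("Matched",), ("Unmatched",)],
    "Matched": [("if", "Exp", "then", "Matched", "else", "Matched"),
                ("id", ":=", "Exp")],
    "Unmatched": [("if", "Exp", "then", "Matched", "else", "Unmatched"),
                  ("if", "Exp", "then", "Stat2")],
}
NESTED = "if Exp then if Exp then id := Exp else id := Exp".split()
```

With the ambiguous grammar the nested statement has two parse trees (the else can attach to either if); with the rewritten grammar only the tree that attaches the else to the nearest if remains.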
Ambiguity – Operator Precedence
• Ambiguous grammars (because of ambiguous operators) can be disambiguated
according to the precedence and associativity rules.
• An operator is left-associative if an expression must be evaluated from left to right,
i.e., as (a+b)+c.
Ambiguous:
E -> E opt. E
E -> num
Left-associative:
E -> E opt. E'
E -> E'
E' -> num
[Figure: parse tree of the example 2+3*4 under the left-associative grammar: the root E has children E, * and E' (Num(4)); that inner E has children E, + and E' (Num(3)); the innermost E derives E' (Num(2))]
• Similarly, an operator is right-associative if an expression must be evaluated from
right to left, i.e., as a+(b+c).
• An operator is non-associative if expressions of the form “a opt. b opt. c” are illegal.
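The grouping imposed by the two kinds of associativity can be made visible by building nested tuples. This is a sketch that assumes a well-formed token list alternating between numbers and operators; the right-associative grammar in the second docstring is the mirror image of the left-associative one and is an assumption, not taken from the slides.

```python
def parse_left(tokens):
    """E -> E opt E' | E',  E' -> num: operators group to the left."""
    tree = tokens[0]
    for i in range(1, len(tokens), 2):
        # Everything parsed so far becomes the left operand of the next operator.
        tree = (tree, tokens[i], tokens[i + 1])
    return tree

def parse_right(tokens):
    """E -> E' opt E | E',  E' -> num: operators group to the right."""
    if len(tokens) == 1:
        return tokens[0]
    # The rest of the input becomes the right operand of the first operator.
    return (tokens[0], tokens[1], parse_right(tokens[2:]))
```

Since both grammars use a single precedence level, parse_left groups 2+3*4 as (2+3)*4, matching the tree in the figure; handling precedence needs one nonterminal per precedence level.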
Left Recursion
• A grammar is left-recursive if it has a non-terminal A such
that there is a derivation
A ⇒+ Aα for some string α.
• Top-down parsing techniques cannot handle left-recursive
grammars.
• So, we have to convert our left-recursive grammar into an
equivalent grammar which is not left-recursive.
• The left-recursion may appear in a single step of the derivation
(immediate left-recursion), or may appear in more than one
step of the derivation.
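The problem for top-down parsing shows up as soon as a left-recursive rule is turned into a recursive procedure. The sketch below uses the hypothetical rule E -> E + num | num (not from the slides): the procedure calls itself before consuming any input, which in Python ends in a RecursionError. The second function parses the equivalent grammar E -> num E', E' -> + num E' | ε.

```python
def parse_E_left_recursive(tokens, pos=0):
    """Direct translation of E -> E + num | num into a recursive procedure."""
    # Alternative E -> E + num: parse an E first, at the same position.
    # No input is consumed before the recursive call, so the recursion never bottoms out.
    end = parse_E_left_recursive(tokens, pos)
    if end is not None and tokens[end:end + 2] == ["+", "num"]:
        return end + 2
    return pos + 1 if tokens[pos:pos + 1] == ["num"] else None

def parse_E(tokens, pos=0):
    """Grammar without left recursion: E -> num E',  E' -> + num E' | ε."""
    if tokens[pos:pos + 1] != ["num"]:
        return None
    pos += 1
    while tokens[pos:pos + 2] == ["+", "num"]:   # E' -> + num E'
        pos += 2
    return pos                                   # E' -> ε

def recurses_forever(tokens):
    """True if the left-recursive procedure exhausts Python's recursion limit."""
    try:
        parse_E_left_recursive(tokens)
    except RecursionError:
        return True
    return False
```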
How to eliminate immediate left recursion
• There is a simple technique for rewriting the grammar to move
the recursion to the other side.
• For example, consider this left-recursive rule:
X –> Xa | Xb | AB | C | DEF
• To convert the rule, we introduce a new nonterminal X' that
we append to the end of all non-left-recursive productions for
X.
• The re-written productions are:
X –> ABX' | CX' | DEFX'
X' –> aX' | bX' | ε
(The expansion for the new nonterminal is basically the reverse of the original left-
recursive rule.)
How to eliminate immediate left recursion (contd.)
A -> Aα | β     where β does not start with A
⇓ eliminate immediate left recursion
A -> βA'
A' -> αA' | ε     (an equivalent grammar)
In general,
A -> Aα1 | ... | Aαm | β1 | ... | βn     where β1 ... βn do not start with A
⇓ eliminate immediate left recursion
A -> β1A' | ... | βnA'
A' -> α1A' | ... | αmA' | ε     (an equivalent grammar)
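The general rule translates almost line by line into code. The following is a minimal sketch, assuming each alternative is a tuple of symbols, the empty tuple stands for ε, and the new nonterminal is named by appending a prime; the function name is illustrative. It does not handle the degenerate production A -> A.

```python
def eliminate_immediate(nt, alternatives):
    """Rewrite A -> Aα1 | ... | Aαm | β1 | ... | βn without immediate left recursion."""
    alphas = [alt[1:] for alt in alternatives if alt[:1] == (nt,)]   # tails of A αi
    betas = [alt for alt in alternatives if alt[:1] != (nt,)]        # the βi
    if not alphas:                       # nothing to do
        return {nt: list(alternatives)}
    new_nt = nt + "'"
    return {
        nt: [beta + (new_nt,) for beta in betas],                    # A  -> βi A'
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [()],      # A' -> αi A' | ε
    }

# The rule X -> Xa | Xb | AB | C | DEF from the earlier slide.
X_RULE = [("X", "a"), ("X", "b"), ("A", "B"), ("C",), ("D", "E", "F")]
X_FIXED = eliminate_immediate("X", X_RULE)
```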
Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
- for j from 1 to i-1 do {
replace each production
Ai -> Aj γ
by
Ai -> α1 γ | ... | αk γ
where Aj -> α1 | ... | αk
}
- eliminate immediate left-recursions among Ai productions
}
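The loop above can be sketched in a few lines. Assumptions: the grammar is a dictionary from nonterminal to a list of alternatives (tuples of symbols, the empty tuple for ε), new nonterminals are named with a prime, and the grammar has no cycles or ε-productions that would break the method. The helper and variable names are illustrative.

```python
def eliminate_immediate(nt, alts):
    """A -> Aα | β  becomes  A -> βA',  A' -> αA' | ε."""
    alphas = [alt[1:] for alt in alts if alt[:1] == (nt,)]
    betas = [alt for alt in alts if alt[:1] != (nt,)]
    if not alphas:
        return {nt: list(alts)}
    new = nt + "'"
    return {nt: [b + (new,) for b in betas],
            new: [a + (new,) for a in alphas] + [()]}

def eliminate_left_recursion(grammar, order):
    """Apply the algorithm with the nonterminals arranged as in `order`."""
    g = {nt: list(alts) for nt, alts in grammar.items()}   # work on a copy
    for i, ai in enumerate(order):
        for aj in order[:i]:
            replaced = []
            for alt in g[ai]:
                if alt[:1] == (aj,):
                    # Ai -> Aj γ  becomes  Ai -> α1 γ | ... | αk γ
                    replaced.extend(alpha + alt[1:] for alpha in g[aj])
                else:
                    replaced.append(alt)
            g[ai] = replaced
        g.update(eliminate_immediate(ai, g[ai]))
    return g

# S -> Aa | b,  A -> Ac | Sd | f
EXAMPLE = {"S": [("A", "a"), ("b",)], "A": [("A", "c"), ("S", "d"), ("f",)]}
ORDER_S_A = eliminate_left_recursion(EXAMPLE, ["S", "A"])
ORDER_A_S = eliminate_left_recursion(EXAMPLE, ["A", "S"])
```

The two orders give different but equivalent grammars, so the chosen order of nonterminals affects the shape of the result.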
Eliminate Left-Recursion -- Example
S -> Aa | b
A -> Ac | Sd | f
- Order of non-terminals: S, A
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A -> Sd with A -> Aad | bd
So, we will have A -> Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A -> bdA' | fA'
A' -> cA' | adA' | ε
Eliminate Left-Recursion – Example2
S -> Aa | b
A -> Ac | Sd | f
- Order of non-terminals: A, S
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A -> SdA' | fA'
A' -> cA' | ε
for S:
- Replace S -> Aa with S -> SdA'a | fA'a
So, we will have S -> SdA'a | fA'a | b
- Eliminate the immediate left-recursion in S
S -> fA'aS' | bS'
S' -> dA'aS' | ε
Left-Factoring (cont.)
• In general,
A -> αβ1 | αβ2     where α is non-empty and the first symbols
of β1 and β2 (if they have one) are different.
• When processing α, we cannot know whether to expand
A to αβ1 or
A to αβ2
Left-Factoring -- Algorithm
• For each non-terminal A with two or more alternatives (production
rules) with a common non-empty prefix, say
A -> αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A -> αA' | γ1 | ... | γm
A' -> β1 | ... | βn
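One possible way to apply this rule repeatedly is sketched below. Assumptions: alternatives are tuples of symbols (the empty tuple is ε), alternatives are grouped by their first symbol, the longest common prefix of each group is factored out, and new nonterminals get one more prime each time (A', A''); the input grammar is assumed not to contain primed names already. The function names are illustrative.

```python
def common_prefix(alts):
    """Longest sequence of symbols shared by the start of every alternative."""
    prefix = []
    for symbols in zip(*alts):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)

def left_factor(grammar):
    result, primes = {}, {}
    pending = [(nt, list(alts)) for nt, alts in grammar.items()]
    while pending:
        nt, alts = pending.pop(0)
        groups = {}                       # first symbol -> alternatives, in order
        for alt in alts:
            groups.setdefault(alt[:1], []).append(alt)
        new_alts = []
        for first, group in groups.items():
            if first == () or len(group) == 1:
                new_alts.extend(group)    # nothing to factor
                continue
            prefix = common_prefix(group)
            base = nt.rstrip("'")
            primes[base] = primes.get(base, 0) + 1
            fresh = base + "'" * primes[base]
            new_alts.append(prefix + (fresh,))                    # A  -> α A'
            pending.append((fresh, [alt[len(prefix):] for alt in group]))  # A' -> βi
        result[nt] = new_alts
    return result

# A -> abB | aB | cdg | cdeB | cdfB   and   A -> ad | a | ab | abc | b
EXAMPLE1 = {"A": [("a", "b", "B"), ("a", "B"), ("c", "d", "g"),
                  ("c", "d", "e", "B"), ("c", "d", "f", "B")]}
EXAMPLE2 = {"A": [("a", "d"), ("a",), ("a", "b"), ("a", "b", "c"), ("b",)]}
FACTORED1 = left_factor(EXAMPLE1)
FACTORED2 = left_factor(EXAMPLE2)
```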
Left-Factoring – Example1
A -> abB | aB | cdg | cdeB | cdfB
⇓
A -> aA' | cdg | cdeB | cdfB
A' -> bB | B
⇓
A -> aA' | cdA''
A' -> bB | B
A'' -> g | eB | fB
Left-Factoring – Example2
A -> ad | a | ab | abc | b
⇓
A -> aA' | b
A' -> d | ε | b | bc
⇓
A -> aA' | b
A' -> d | ε | bA''
A'' -> ε | c