4 - Syntax Analyzer (CFG)

The document discusses syntax analysis and parsing. It defines that the purpose of syntax analysis is to recombine tokens into a syntactic structure, typically a parse tree. The syntax of a programming language is described by a context-free grammar. A parser checks if a program satisfies the rules of the grammar, and if so it creates a parse tree; otherwise it returns errors.

Uploaded by

Ganesh Kumar

Syntax Analyzer or Parser

• The purpose of syntax analysis (also known as parsing) is to recombine the tokens created by the lexical analyzer.
• Syntax Analyzer creates the syntactic structure of the given source
program.
• This syntactic structure is mostly a syntax or parse tree (typically a data
structure).
• The leaves of this tree are the tokens found by the lexical analysis
• If the leaves are read from left to right, the sequence is the same as in the
input text.
• In addition to finding the structure of the input text, the syntax analysis
must also reject invalid texts by reporting syntax errors.

CS440 Compilers 1
Syntax Analyzer (Contd.)
• The syntax of a programming language is described by a Context-Free
Grammar (CFG). We will use BNF (Backus-Naur Form) notation in the
description of CFGs.
• The syntax analyzer (parser) checks whether a given source program
satisfies the rules implied by a CFG or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
• A Context-Free Grammar (CFG)
– Gives a precise syntactic specification of a programming language.
– The design of the grammar is an initial phase of the design of a compiler.
– A grammar can be directly converted into a parser by some tools.

Parser

• Parser works on a stream of tokens.


• The smallest item is a token.

[Figure: source program → Lexical Analyzer → token → Parser → parse tree; the Parser asks the Lexical Analyzer for each token with "get next token".]

Parsers (cont.)
• We categorize the parsers into two groups:

1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse tree is created bottom to top; starting from the leaves
• Both top-down and bottom-up parsers scan the input from left to right
(one symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for
sub-classes of CFG.
– LL for top-down parsing
– LR for bottom-up parsing

Grammar
• A set of rules by which valid sentences in a language are constructed.
• Example of English grammar:
– sentence –> <subject> <verb-phrase> <object>
– subject –> This | Computers | I
– verb-phrase –> <adverb> <verb> | <verb>
– adverb –> never
– verb –> is | run | am | tell
– object –> the <noun> | a <noun> | <noun>
– noun –> university | world | lecturer | lies
• Using the above rules or productions, we can derive simple sentences
such as these:
– This is a university.
– Computers run the world.
– I am the lecturer.
– I never tell lies.
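
The productions above can be expanded mechanically. The following is a minimal sketch, assuming the grammar is stored as a Python dictionary; the names GRAMMAR, expand and SENTENCES are illustrative and not part of the slides. It enumerates every sentence the rules derive.

```python
from itertools import product

# Productions of the English example; keys are nonterminals,
# every other symbol is a terminal (an actual word).
GRAMMAR = {
    "sentence": [["subject", "verb-phrase", "object"]],
    "subject": [["This"], ["Computers"], ["I"]],
    "verb-phrase": [["adverb", "verb"], ["verb"]],
    "adverb": [["never"]],
    "verb": [["is"], ["run"], ["am"], ["tell"]],
    "object": [["the", "noun"], ["a", "noun"], ["noun"]],
    "noun": [["university"], ["world"], ["lecturer"], ["lies"]],
}

def expand(symbol):
    """Return every list of words derivable from `symbol`."""
    if symbol not in GRAMMAR:          # terminal: stands for itself
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every choice for each symbol on the right-hand side.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

SENTENCES = {" ".join(words) for words in expand("sentence")}
print(len(SENTENCES))
print("I never tell lies" in SENTENCES)
print("Computers am a lies" in SENTENCES)
```

The last line shows that the grammar also accepts sentences that make no sense: a grammar describes structure only, not meaning.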

Some Important Terms

• Nonterminal: a grammar symbol that can be replaced/expanded to a sequence of symbols.
• Terminal: an actual word in a language; these are the symbols in a
grammar that cannot be replaced by anything else. "terminal" is supposed
to conjure up the idea that it is a dead-end—no further expansion is
possible.
• Production: a grammar rule that describes how to replace/exchange
symbols. The general form of a production for a nonterminal is:
– X –>Y1Y2Y3...Yn
– The nonterminal X is declared equivalent to the concatenation of the symbols
Y1Y2Y3...Yn.
– The production means that anywhere where we encounter X, we may replace it by
the string Y1Y2Y3...Yn.

Some Important Terms (Contd.)

• Derivation or parse: a sequence of applications of the rules of a grammar that produces a finished string of terminals.
• Start symbol: A grammar has a single nonterminal (the start symbol)
from which all sentences derive:
– S –> X1X2X3...Xn
– All sentences are derived from S by successive replacement using the productions
of the grammar.
• Null symbol (ε): it is sometimes useful to specify that a symbol
can be replaced by nothing at all. To indicate this, we use the null
symbol ε, e.g., A –> B | ε.
• BNF: a way of specifying programming languages using formal grammars and production rules with a particular form of notation (Backus-Naur Form).

Formal Grammar
• We formally define a grammar as a 4-tuple (S, P, N, T).
• S is the start symbol and S ∈ N
• P is the set of productions
• N and T are the nonterminal and terminal alphabets.
• A sentence is a string of symbols in T derived from S using one or
more applications of productions in P.
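
One possible way to hold the 4-tuple in a program is sketched below. The class Grammar, the function is_well_formed and the sample production B –> b are assumptions made for illustration. The check also assumes context-free productions (a single nonterminal on the left) and disjoint N and T, which this slide does not require by itself.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    start: str                  # S
    productions: tuple          # P, as (lhs, rhs) pairs; rhs is a tuple of symbols
    nonterminals: frozenset     # N
    terminals: frozenset        # T

def is_well_formed(g: Grammar) -> bool:
    """Check the side conditions assumed for the 4-tuple."""
    if g.start not in g.nonterminals:            # S must be in N
        return False
    if g.nonterminals & g.terminals:             # N and T must be disjoint
        return False
    symbols = g.nonterminals | g.terminals
    for lhs, rhs in g.productions:
        # Context-free shape: one nonterminal on the left, known symbols on the right.
        if lhs not in g.nonterminals or any(s not in symbols for s in rhs):
            return False
    return True

# A -> B | ε (the empty tuple stands for ε), plus the assumed rule B -> b.
G = Grammar(
    start="A",
    productions=(("A", ("B",)), ("A", ()), ("B", ("b",))),
    nonterminals=frozenset({"A", "B"}),
    terminals=frozenset({"b"}),
)
print(is_well_formed(G))
```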

4 Types of formal grammars (Chomsky Hierarchy)

• American linguist Noam Chomsky proposed four types of formal grammars, called the Chomsky Hierarchy.
• Type 0: free or unrestricted grammars:
– These are the most general.
– Productions are of the form u –> v
– where both u and v are arbitrary strings of symbols in V (the set of all grammar symbols, V = N ∪ T), with u non-null.
– There are no restrictions on what appears on the left- or right-hand side, other than that the left-hand side must be non-empty.
• Type 1: context-sensitive grammars:
– Productions are of the form uXw –> uvw
– where u, v and w are arbitrary strings of symbols in V, with v non-null, and X a single
nonterminal.
– In other words, X may be replaced by v but only when it is surrounded by u and w. (i.e.,
in a particular context).

Types of formal grammars (Contd.)
• Type 2: context-free grammars:
– Productions are of the form X –> v where v is an arbitrary string of symbols in V, and X is a single nonterminal.
– Wherever you find X, you can replace it with v (regardless of context).
• Type 3: regular grammars:
– Productions are of the form X –> a, X –> aY, or X –> ε where X and Y are nonterminals and a is a terminal.
– That is, the left-hand side must be a single nonterminal and the right-hand side can be either empty, a single terminal by itself, or a single terminal followed by a single nonterminal.
– These grammars are the most limited in terms of expressive power.
• Note: Every type 3 grammar is a type 2 grammar, and every type 2 is a
type 1 and so on.

Context-Free Grammars (CFG)
• A CFG recursively defines several sets of strings.
• Each set is denoted by a name, which is called a nonterminal.
• The set of nonterminals is disjoint from the set of terminals.
• One of the nonterminals is chosen to denote the language described by the grammar. This is called the start symbol of the grammar.
• Like regular expressions, CFGs describe sets of strings, i.e., languages.
• A CFG also defines the structure of the strings in the language it defines.
• Symbols in the alphabet are called terminals or leaves.
CFG (Contd.)

• The sets are described by a number of productions or rules.
• Each production describes some of the possible strings that are contained in the set denoted by a nonterminal.
• A production has the form
– N –> X1 . . . Xn
– where N is a nonterminal and X1 . . .Xn are zero or more symbols,
each of which is either a terminal or a nonterminal.
– The above notation can be described as the set denoted by N
contains strings that are obtained by concatenating strings from the
sets denoted by X1 . . .Xn.

CFG (Contd.)
Examples
• A –> a
• says that the set denoted by the nonterminal A contains the one-character string a.
• A –> aA
• says that the set denoted by A contains all strings formed by putting an a in front of a string taken from the set denoted by A.
• Together, these two productions indicate that A contains all non-empty sequences of a's and is hence (in the absence of other productions) equivalent to the regular expression a+.
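
The claim that these two productions describe the same set as a+ can be checked for short strings. This is a minimal sketch, assuming single-character symbols with uppercase letters as nonterminals; PRODUCTIONS and language_up_to are illustrative names. It enumerates every string of bounded length that the grammar derives and compares the result with the regular expression.

```python
import re
from collections import deque

PRODUCTIONS = {"A": ["a", "aA"]}   # A -> a | aA; uppercase = nonterminal

def language_up_to(start, max_len):
    """Collect all terminal strings of length <= max_len derivable from `start`."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Locate the left-most nonterminal, if any.
        pos = next((i for i, ch in enumerate(form) if ch in PRODUCTIONS), None)
        if pos is None:
            found.add(form)        # only terminals left: a string of the language
            continue
        for rhs in PRODUCTIONS[form[pos]]:
            new = form[:pos] + rhs + form[pos + 1:]
            # No production here shrinks a form, so longer forms can be pruned.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

strings = language_up_to("A", 5)
print(sorted(strings, key=len))
print(all(re.fullmatch(r"a+", s) for s in strings))
```

The length-based pruning is only valid because this grammar has no empty productions; a grammar with ε on a right-hand side would need a different bound.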

CFG (Contd.)
• We can define a grammar equivalent to the regular expression
a* by the two productions
– B –>
– B –> aB
– where the first production indicates that the empty string is part of the set B.
– Productions with empty right-hand sides are called empty productions.
– Empty productions are sometimes written with an ε on the right-hand side instead of leaving it empty.

CFG (Contd.)
• Simple expression grammar
Exp –> Exp+Exp
Exp –> Exp-Exp
Exp –> Exp*Exp
Exp –> Exp/Exp
Exp –> num
Exp –> (Exp)

• Simple Statement Grammar
Stat –> id := Exp
Stat –> Stat ; Stat
Stat –> if Exp then Stat else Stat
Stat –> if Exp then Stat

Syntactic Categories
• A syntactic category is a sub-language that embodies a particular
concept.
• Normally writing a grammar for a programming language starts by
dividing the constructs of the language into different syntactic
categories.
• Examples of common syntactic categories in programming languages:
1. Expressions: used to express calculation of values.
2. Statements: express actions that occur in a particular sequence.
3. Declarations: express properties of names used in other parts of
the program.

Derivations
• The basic idea of derivation is to consider productions as rewrite rules
• Whenever we have a nonterminal, we can replace it by the right-hand side of any production in which the nonterminal appears on the left-hand side.
Example: E ⇒ E+E : E+E derives from E
– we can replace E by E+E
– To be able to do this, we have to have a production rule E –> E+E in our grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id (derivation of id+id from E)
• We can do this anywhere in a sequence of symbols (terminals and
nonterminals) and repeat doing so until we have only terminals left.
• The resulting sequence of terminals is a string in the language defined
by the grammar.

Derivations (Contd.)

• Three rules of derivation
1. αAβ ⇒ αγβ (if there is a production rule A –> γ in our grammar, where α and β are arbitrary strings of terminal and non-terminal symbols)
2. α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)
3. α ⇒ γ (if there is a β such that α ⇒ β and β ⇒ γ)
• Derivation notations
⇒ : derives in one step OR production unspecified
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps

Derivation (Contd.)

• Derivation can be used to formally define the language that a CFG generates as follows.
• Given a context-free grammar G with start symbol S, terminal symbols T and productions P, the language L(G) that G generates is defined to be the set of strings of terminal symbols that can be obtained by derivation from S using the productions P, i.e., the set {w ∈ T* | S ⇒+ w}.

Left-most and Right-most derivations

• At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, the derivation is called a left-most derivation.
• Example: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

• If we always choose the right-most non-terminal in each derivation step, the derivation is called a right-most derivation.
• Example: E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
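
The two examples differ only in which nonterminal is rewritten at each step. The sketch below makes that explicit; it assumes the grammar E –> E+E | (E) | -E | id behind the examples, and the helper derive and the list CHOICES are illustrative names. The helper does not check that each chosen right-hand side is a real production.

```python
# Assumed grammar behind the example: E -> E+E | (E) | -E | id
NONTERMINALS = {"E"}

def derive(start, choices, leftmost=True):
    """Apply each right-hand side in `choices` to the left-most (or
    right-most) nonterminal and return every sentential form produced."""
    form = [start]
    steps = ["".join(form)]
    for rhs in choices:
        positions = [i for i, sym in enumerate(form) if sym in NONTERMINALS]
        pos = positions[0] if leftmost else positions[-1]
        form[pos:pos + 1] = rhs              # rewrite one nonterminal
        steps.append("".join(form))
    return steps

CHOICES = [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]]
print(" => ".join(derive("E", CHOICES, leftmost=True)))
print(" => ".join(derive("E", CHOICES, leftmost=False)))
```

Both calls use the same productions in the same order and end in the same string; only the intermediate sentential forms differ.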

Left-Most and Right-Most Derivations (Contd.)
Left-Most Derivation
E ⇒lm -E ⇒lm -(E) ⇒lm -(E+E) ⇒lm -(id+E) ⇒lm -(id+id)

Right-Most Derivation
E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id)

• We will see that the top-down parsers try to find the left-most derivation
of the given source program.
• We will see that the bottom-up parsers try to find the right-most
derivation of the given source program in the reverse order.

Left-most derivation Example
• Given a grammar T –> R, T –> aTc, R –> ε, R –> RbR
1. T
2. aTc
3. aaTcc
4. aaRcc
5. aaRbRcc
6. aaRbRbRcc
7. aabRbRcc
8. aabRbRbRcc
9. aabbRbRcc
10. aabbbRcc
11. aabbbcc

Leftmost derivation of the string aabbbcc using the above grammar


Parse tree construction rules
• The root of the tree is the start symbol of the grammar
• Whenever we rewrite a nonterminal we add as its children the symbols
on the right-hand side of the production that was used.
• The leaves of the tree are terminals which, when read from left to right,
form the derived string.

• If a nonterminal is rewritten using an empty production, an ε is shown as its child.
• When evaluating an expression, the subexpressions represented by
subtrees of the syntax tree are evaluated before the topmost operator is
applied.
• When we write such a syntax tree, the order of derivation is irrelevant
but the choice of production for rewriting each nonterminal matters.

Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.

[Figure: the parse tree grown step by step alongside the derivation E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id); each step adds the children of the nonterminal that was rewritten.]
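
A parse tree is typically just a small recursive data structure. The following sketch, in which the class Node and the function leaves are illustrative names, builds the final tree for -(id+id) by hand under the assumed grammar E –> E+E | (E) | -E | id, and reads its leaves from left to right.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                                   # grammar symbol at this node
    children: list = field(default_factory=list)  # empty for a leaf

def leaves(node):
    """Return the leaf symbols of the tree, read from left to right."""
    if not node.children:
        return [node.symbol]
    return [leaf for child in node.children for leaf in leaves(child)]

# Parse tree of -(id+id) for the assumed grammar E -> E+E | (E) | -E | id.
tree = Node("E", [
    Node("-"),
    Node("E", [
        Node("("),
        Node("E", [Node("E", [Node("id")]), Node("+"), Node("E", [Node("id")])]),
        Node(")"),
    ]),
])
print("".join(leaves(tree)))
```

The inner nodes are all nonterminals and the leaves are all terminals, and the leaves spell out the derived string.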

Ambiguity

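• A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar.

E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Figure: parse tree with E + E at the root; the left E derives id and the right E expands to E * E, so the string is grouped as id+(id*id).]

E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Figure: parse tree with E * E at the root; the left E expands to E + E and the right E derives id, so the string is grouped as (id+id)*id.]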
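
For this small grammar the number of parse trees can be counted directly. The sketch below assumes the grammar E –> E+E | E*E | id and a token list as input; count_parse_trees is an illustrative name. It tries every operator as the top-level one and multiplies the counts of the two sides.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):                      # trees deriving tokens[lo:hi] from E
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):     # position of the top-level operator
            if tokens[k] in ("+", "*"):
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))
print(count_parse_trees(["id", "+", "id"]))
```

A count above one for any input shows that the grammar is ambiguous; a count of zero means the input is not in the language.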

Ambiguity (cont.)
• For most parsers, the grammar must be unambiguous.
• unambiguous grammar
⇒ unique selection of the parse tree for a sentence
• We should eliminate the ambiguity in the grammar during the design
phase of the compiler.
• An unambiguous grammar should be written to eliminate the ambiguity.
• We have to prefer one of the parse trees of a sentence (generated by an
ambiguous grammar) to disambiguate that grammar.
• In most (but not all) cases, an ambiguous grammar can be rewritten to
an unambiguous grammar that generates the same set of strings, or
external rules can be applied to decide which of the many possible
syntax trees is the “right one”
Ambiguity (cont.)
• How do we know if a grammar is ambiguous?
• If we can find a string and show two alternative syntax trees for it, this is
a proof of ambiguity.
• It may, however, be hard to find such a string and, when the grammar is
unambiguous, even harder to show that this is the case.
• In fact, the problem is formally undecidable, i.e., there is no method that
for all grammars can answer the question “Is this grammar ambiguous?”.
• But in many cases it is not difficult to detect and prove ambiguity. For example, any grammar that has a production of the form N –> NαN is ambiguous.
– N is a nonterminal and α is any sequence of grammar symbols.
– For example, T –> R, T –> aTc, R –> ε, R –> RbR is an ambiguous grammar, and T –> R, T –> aTc, R –> ε, R –> bR is an unambiguous version of the same.

Ambiguity (cont.)

stmt –> if expr then stmt |
        if expr then stmt else stmt | otherstmts

if E1 then if E2 then S1 else S2

[Figure: two parse trees for this statement. In tree 1 the else belongs to the outer if: if E1 then (if E2 then S1) else S2. In tree 2 the else belongs to the inner if: if E1 then (if E2 then S1 else S2).]
Ambiguity (cont.)
• Ambiguous grammar
Stat –> id := Exp
Stat –> Stat ; Stat
Stat –> if Exp then Stat else Stat
Stat –> if Exp then Stat
To make it unambiguous we make two nonterminals: one for matched (i.e. with else-part) conditionals and one for unmatched (i.e. without else-part) conditionals.
• Unambiguous grammar for statements
Stat –> Stat2 ; Stat
Stat –> Stat2
Stat2 –> Matched
Stat2 –> Unmatched
Matched –> if Exp then Matched else Matched
Matched –> id := Exp
Unmatched –> if Exp then Matched else Unmatched
Unmatched –> if Exp then Stat2
Ambiguity – Operator Precedence
• Ambiguous grammars (because of ambiguous operators) can be disambiguated
according to the precedence and associativity rules.
• An operator is left-associative if an expression must be evaluated from left to right,
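i.e., as (a+b)+c.
Ambiguous:
E –> E opt. E
E –> num
Left-associative:
E –> E opt. E'
E –> E'
E' –> num
[Figure: parse tree of the example 2+3*4 under the left-associative grammar. The root E expands to E * E', where E' derives Num(4); the inner E expands to E + E', where E' derives Num(3); the innermost E derives E' and then Num(2). The expression is grouped as (2+3)*4.]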
• Similarly an operator is right-associative if an expression must be evaluated from
right to left, i.e., as a+(b+c).
• An operator is non-associative if expressions of the form “a opt. b opt. c” are illegal
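
The difference between the two groupings shows up as soon as the operator is not associative in the mathematical sense, such as subtraction. This is a small sketch with illustrative helper names eval_left_assoc and eval_right_assoc; it evaluates a list of operands joined by one binary operator under each convention.

```python
from functools import reduce

def eval_left_assoc(operands, op):
    """Group as ((a op b) op c) ..., i.e. evaluate from left to right."""
    return reduce(op, operands)

def eval_right_assoc(operands, op):
    """Group as a op (b op (c ...)), i.e. evaluate from right to left."""
    *rest, last = operands
    return reduce(lambda acc, x: op(x, acc), reversed(rest), last)

def minus(a, b):
    return a - b

print(eval_left_assoc([10, 4, 3], minus))    # (10-4)-3
print(eval_right_assoc([10, 4, 3], minus))   # 10-(4-3)
```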

Left Recursion
• A grammar is left recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α
• Top-down parsing techniques cannot handle left-recursive
grammars.
• So, we have to convert our left-recursive grammar into an
equivalent grammar which is not left-recursive.
• The left-recursion may appear in a single step of the derivation
(immediate left-recursion), or may appear in more than one
step of the derivation.

How to eliminate immediate left recursion
• There is a simple technique for rewriting the grammar to move
the recursion to the other side.
• For example, consider this left-recursive rule:
X –> Xa | Xb | AB | C | DEF
• To convert the rule, we introduce a new nonterminal X' that
we append to the end of all non-left-recursive productions for
X.
• The re-written productions are:
X –> ABX' | CX' | DEFX'
X' –> aX' | bX' | ε
(The expansion for the new nonterminal is basically the reverse of the original left-
recursive rule.)
How to eliminate immediate left recursion(contd.)
A –> Aα | β   where β does not start with A
⇓ eliminate immediate left recursion
A –> βA'
A' –> αA' | ε   (an equivalent grammar)

In general,
A –> Aα1 | ... | Aαm | β1 | ... | βn   where β1 ... βn do not start with A
⇓ eliminate immediate left recursion
A –> β1A' | ... | βnA'
A' –> α1A' | ... | αmA' | ε   (an equivalent grammar)
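
The general rule translates almost line by line into code. The sketch below assumes each alternative is a tuple of symbols and that the empty tuple stands for ε; eliminate_immediate and EPSILON are illustrative names, and the new nonterminal is simply named by appending a prime, on the assumption that this name is not already in use.

```python
EPSILON = ()   # the empty right-hand side

def eliminate_immediate(nt, alternatives):
    """Rewrite A -> Aα1|...|Aαm|β1|...|βn as A -> βiA', A' -> αjA' | ε.

    Each alternative is a tuple of symbols. Returns a dict of the new rules."""
    alphas = [alt[1:] for alt in alternatives if alt[:1] == (nt,)]
    betas = [alt for alt in alternatives if alt[:1] != (nt,)]
    if not alphas:                       # no immediate left recursion
        return {nt: list(alternatives)}
    new_nt = nt + "'"                    # assumed to be an unused name
    return {
        nt: [beta + (new_nt,) for beta in betas],
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [EPSILON],
    }

# The rule X -> Xa | Xb | AB | C | DEF from the earlier example.
rules = eliminate_immediate("X", [
    ("X", "a"), ("X", "b"), ("A", "B"), ("C",), ("D", "E", "F"),
])
for lhs, alts in rules.items():
    print(lhs, "->", " | ".join(" ".join(alt) or "ε" for alt in alts))
```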

Eliminate Left-Recursion -- Algorithm
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
- for j from 1 to i-1 do {
replace each production
Ai –> Aj γ
by
Ai –> α1 γ | ... | αk γ
where Aj –> α1 | ... | αk
}
- eliminate immediate left-recursions among Ai productions
}
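
The algorithm above can be sketched as follows. The grammar is assumed to be a dictionary from each nonterminal to a list of tuples of symbols, with the empty tuple standing for ε; the function names are illustrative, and the sample grammar S –> Aa | b, A –> Ac | Sd | f is chosen here for demonstration. As is usual for this algorithm, the sketch assumes the input has no cycles and no ε-productions.

```python
def eliminate_immediate(nt, alts):
    """A -> Aα | β  becomes  A -> βA',  A' -> αA' | ε (ε is the empty tuple)."""
    alphas = [a[1:] for a in alts if a[:1] == (nt,)]
    betas = [a for a in alts if a[:1] != (nt,)]
    if not alphas:
        return {nt: alts}
    new_nt = nt + "'"                        # assumed to be an unused name
    return {nt: [b + (new_nt,) for b in betas],
            new_nt: [a + (new_nt,) for a in alphas] + [()]}

def eliminate_left_recursion(grammar, order):
    """`grammar` maps each nonterminal to a list of tuples of symbols;
    `order` is the chosen ordering A1 ... An of the nonterminals."""
    result = {nt: list(alts) for nt, alts in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for alt in result[ai]:
                if alt[:1] == (aj,):         # Ai -> Aj γ
                    expanded += [delta + alt[1:] for delta in result[aj]]
                else:
                    expanded.append(alt)
            result[ai] = expanded
        result.update(eliminate_immediate(ai, result[ai]))
    return result

G = {"S": [("A", "a"), ("b",)], "A": [("A", "c"), ("S", "d"), ("f",)]}
for lhs, alts in eliminate_left_recursion(G, ["S", "A"]).items():
    print(lhs, "->", " | ".join(" ".join(alt) or "ε" for alt in alts))
```

Passing a different ordering, for example ["A", "S"], gives a different but equivalent grammar, because the result depends on the order in which the nonterminals are processed.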

Eliminate Left-Recursion -- Example
S –> Aa | b
A –> Ac | Sd | f
- Order of non-terminals: S, A
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A –> Sd with A –> Aad | bd
So, we will have A –> Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A –> bdA' | fA'
A' –> cA' | adA' | ε

So, the resulting equivalent grammar which is not left-recursive is:
S –> Aa | b
A –> bdA' | fA'
A' –> cA' | adA' | ε

Eliminate Left-Recursion – Example2
S –> Aa | b
A –> Ac | Sd | f

- Order of non-terminals: A, S

for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A –> SdA' | fA'
A' –> cA' | ε

for S:
- Replace S –> Aa with S –> SdA'a | fA'a
So, we will have S –> SdA'a | fA'a | b
- Eliminate the immediate left-recursion in S
S –> fA'aS' | bS'
S' –> dA'aS' | ε

So, the resulting equivalent grammar which is not left-recursive is:
S –> fA'aS' | bS'
S' –> dA'aS' | ε
A –> SdA' | fA'
A' –> cA' | ε
Left-Factoring
• A predictive parser (a top-down parser without backtracking) insists
that the grammar must be left-factored.

grammar ⇒ a new equivalent grammar suitable for predictive parsing

stmt –> if expr then stmt else stmt |
        if expr then stmt

• when we see if, we cannot know which production rule to choose to re-write stmt in the derivation.

Left-Factoring (cont.)
• In general,
A –> αβ1 | αβ2   where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.
• when processing α we cannot know whether to expand
A to αβ1 or
A to αβ2

• But, if we re-write the grammar as follows
A –> αA'
A' –> β1 | β2   so, we can immediately expand A to αA'

Left-Factoring -- Algorithm
• For each non-terminal A with two or more alternatives (production rules) with a common non-empty prefix, say
A –> αβ1 | ... | αβn | γ1 | ... | γm

convert it into

A –> αA' | γ1 | ... | γm
A' –> β1 | ... | βn
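
The factoring step can be repeated until no nonterminal has two alternatives that start with the same symbol. This is a minimal sketch under a few assumptions: alternatives are tuples of symbols, the empty tuple stands for ε, alternatives are grouped by their first symbol, and new nonterminals are named by appending primes. The names common_prefix and left_factor and the sample grammar are illustrative.

```python
from collections import defaultdict

def common_prefix(alts):
    """Longest sequence of symbols that starts every alternative in `alts`."""
    prefix = []
    for symbols in zip(*alts):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)

def left_factor(grammar):
    """Repeat the factoring step until no two alternatives of a
    nonterminal begin with the same symbol. ε is the empty tuple."""
    result = {nt: list(alts) for nt, alts in grammar.items()}
    work = list(result)
    while work:
        nt = work.pop()
        groups = defaultdict(list)
        for alt in result[nt]:
            groups[alt[:1]].append(alt)      # group by first symbol
        new_alts = []
        for first, group in groups.items():
            if first == () or len(group) < 2:
                new_alts += group            # nothing to factor
                continue
            alpha = common_prefix(group)
            new_nt = nt + "'"
            while new_nt in result:          # pick an unused name: A', A'', ...
                new_nt += "'"
            result[new_nt] = [alt[len(alpha):] for alt in group]
            new_alts.append(alpha + (new_nt,))
            work.append(new_nt)              # the new rules may need factoring too
        result[nt] = new_alts
    return result

G = {"A": [tuple("abB"), tuple("aB"), tuple("cdg"), tuple("cdeB"), tuple("cdfB")]}
for lhs, alts in left_factor(G).items():
    print(lhs, "->", " | ".join("".join(alt) or "ε" for alt in alts))
```

Grouping by first symbol means the alternatives of a nonterminal may come out in a different order than they went in; the set of rules is what matters.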

Left-Factoring – Example1
A –> abB | aB | cdg | cdeB | cdfB
⇓
A –> aA' | cdg | cdeB | cdfB
A' –> bB | B
⇓
A –> aA' | cdA''
A' –> bB | B
A'' –> g | eB | fB

Left-Factoring – Example2
A –> ad | a | ab | abc | b
⇓
A –> aA' | b
A' –> d | ε | b | bc
⇓
A –> aA' | b
A' –> d | ε | bA''
A'' –> ε | c
