Chapter 3- Syntax Analysis
❖ Discuss the different derivation formats: Leftmost derivation, Rightmost derivation and
Non-Leftmost, Non-Rightmost derivations
❖ Discuss ambiguous grammars and how to deal with ambiguity in CFGs.
❖ The parser is expected to report any syntax errors and to recover from commonly occurring
errors to continue processing the remainder of the program.
❖ Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the
rest of the compiler for further processing. In fact, the parse tree need not be constructed
explicitly, since checking and translation actions can be interspersed with parsing. Thus, the
parser and the rest of the front end could well be implemented by a single module.
❖ The parser obtains a string of tokens from the lexical analyzer, as shown in the above Figure
and verifies that the string of token names can be generated by the grammar for the source
language.
❖ A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming
language.
• From certain classes of grammars, we can construct automatically an efficient parser
that determines the syntactic structure of a source program.
• As a side benefit, the parser-construction process can reveal syntactic ambiguities and
trouble spots that might have slipped through the initial design phase of a language.
• The structure imparted to a language by a properly designed grammar is useful for
translating source programs into correct object code and for detecting errors.
❖ A grammar allows a language to be evolved or developed iteratively, by adding new constructs
to perform new tasks.
❖ These new constructs can be integrated more easily into an implementation that follows the
grammatical structure of the language.
❖ There are three general types of parsers for grammars: universal, top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's
algorithm can parse any grammar (Read more on these).
❖ These general methods are, however, too inefficient to use in production compilers.
❖ The methods commonly used in compilers can be classified as being either top-down or
bottom-up.
• Top-Down Methods: - As implied by their names, top-down methods build parse trees
from the top (root) to the bottom (leaves).
• Bottom-up methods: - start from the leaves and work their way up to the root to build
the parse tree.
• In either case, the input to the parser is scanned from left to right, one symbol at a time.
❖ The most efficient top-down and bottom-up methods work only for sub-classes of grammars,
but several of these classes, particularly, LL and LR grammars, are expressive enough to
describe most of the syntactic constructs in modern programming languages.
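As an illustration of the top-down idea (a minimal sketch only, not tied to LL or LR machinery), a recursive-descent parser for the toy grammar S → ( S ) S | ε can be written as:

```python
# Minimal recursive-descent (top-down) parser for the toy grammar
#   S -> ( S ) S | epsilon
# An illustrative sketch only; real parsers add error recovery.

def parse(tokens):
    pos = 0

    def S():
        nonlocal pos
        # Alternative S -> ( S ) S, chosen when the lookahead is '('
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1                      # consume '('
            S()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')' at position %d" % pos)
            pos += 1                      # consume ')'
            S()
        # Alternative S -> epsilon: consume nothing

    S()
    if pos != len(tokens):
        raise SyntaxError("trailing input at position %d" % pos)
    return True

print(parse(list("(()())")))   # prints True: the string is balanced
```

Each non-terminal becomes one procedure, and the parse tree is built (implicitly here) from the root downward, exactly as the top-down description above states.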
Error Handling
Common Programming Errors include:
• Lexical errors, syntactic errors, semantic errors, and logical errors. The type of error handled in this phase of compilation is the syntactic error.
Compiler Design
❖ A context-free grammar (CFG) consists of four components:
1. A set of terminals T
o Terminals are the basic symbols from which strings are formed; in a compiler, they are the token names produced by the lexical analyzer.
2. A set of non-terminals N
o Non-terminals are syntactic variables that denote sets of strings.
o The sets of strings denoted by non-terminals help define the language generated by the grammar.
o Non-terminals impose a hierarchical structure on the language that is key to syntax analysis and translation.
3. A set of productions P
o Each production consists of a non-terminal (the head), an arrow, and a sequence of terminals and/or non-terminals (the body).
4. A special non-terminal S ∈ N, which is the start symbol. The productions for the start symbol are listed first.
❖ Just as regular expressions generate strings of characters, CFGs generate strings of tokens.
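The four components of a CFG can be sketched directly as data; the encoding below (names and layout are illustrative, not from the text) also checks that every production is well-formed:

```python
# A CFG as a 4-tuple (terminals, non-terminals, productions, start symbol).
# Purely illustrative encoding; any container types would do.
terminals = {"id", "+", "*", "(", ")"}
nonterminals = {"E"}
productions = {"E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["id"]]}
start = "E"

grammar = (terminals, nonterminals, productions, start)

# Sanity checks: the start symbol is a non-terminal, heads are
# non-terminals, and bodies use only known symbols.
assert start in nonterminals
for head, bodies in productions.items():
    assert head in nonterminals
    for body in bodies:
        assert all(s in terminals or s in nonterminals for s in body)
print("grammar is well-formed")
```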
4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly empty) strings of
terminals.
5. Lowercase Greek letters α, β, γ, for example, represent (possibly empty) strings of grammar
symbols.
❖ Thus, a generic production can be written as A → α, where A is the head and α the
body.
6. A set of productions A → α₁, A → α₂, A → α₃, …, A → αₖ with a common head A (call them
A-productions), may be written A → α₁ | α₂ | α₃ | … | αₖ. Call α₁, α₂, α₃, …, αₖ the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.
Example 2: - Using these conventions, the grammar of Example 1 can be rewritten concisely as:
E→ E+T|E-T|T
T→ T*F|T/F|F
F → ( E ) | id
Derivations
❖ A derivation is a description of how a string is generated from the start symbol of a
grammar.
❖ The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting
step replaces a nonterminal by the body of one of its productions.
❖ For a general definition of derivation, consider a nonterminal A in the middle of a sequence
of grammar symbols, as in 𝛼𝐴𝛽, where 𝛼 𝑎𝑛𝑑 𝛽 are arbitrary strings of grammar symbols.
o Suppose 𝐴 → 𝛾 is a production. Then, we write 𝛼𝐴𝛽 ⇒ 𝛼𝛾𝛽.
o The symbol ⇒ means, "derives in one step."
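The one-step rewriting αAβ ⇒ αγβ can be sketched as a single list operation (the helper name is illustrative, not from the text):

```python
# One derivation step alpha A beta => alpha gamma beta: replace the
# non-terminal at a given position by the body of one of its productions.
# The helper name derive_step is illustrative.

def derive_step(form, index, body):
    """Rewrite form[index] (a non-terminal) by the symbols in body."""
    return form[:index] + body + form[index + 1:]

# ( E )  =>  ( E + T )  using the production E -> E + T
print(derive_step(["(", "E", ")"], 1, ["E", "+", "T"]))
# prints ['(', 'E', '+', 'T', ')']
```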
❖ Example 3: Use the CFG below to perform the derivations in example 4 & 5.
Terminals = { id, num, while, do, print, >, ;, {, }, (, ) }
Non-Terminals = { S, E, B, L }
Rules = (1) S → print(E);
(2) S → while (B) do S
(3) S → { L }
(4) E → id
(5) E → num
(6) B → E > E
(7) L → S
(8) L → SL
Start Symbol = S
Leftmost Derivations
A string of terminals and non-terminals α that can be derived from the initial symbol of the
grammar is called a sentential form
Thus the strings “{ SL }”, “while(id>E) do S”, and “print(E);” of the above example
are all sentential forms.
A derivation is “leftmost” if, at each step in the derivation, the leftmost non-terminal is
selected for replacement.
A sentential form that occurs in a leftmost derivation is called a left-sentential form.
Example 4: We can use leftmost derivations to generate while (id > num) do print(id); from
the above CFG (example 3) as follows:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(id>E) do S
⇒ while(id>num) do S
⇒ while(id>num) do print(E);
⇒ while(id>num) do print(id);
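The leftmost derivation above can be replayed mechanically; the sketch below encodes the grammar of Example 3 and always rewrites the leftmost non-terminal (the encoding and helper names are illustrative):

```python
# Replay Example 4: a leftmost derivation always rewrites the leftmost
# non-terminal. Grammar encoding and helper names are illustrative.

RULES = {
    "S": [["print", "(", "E", ")", ";"],
          ["while", "(", "B", ")", "do", "S"],
          ["{", "L", "}"]],
    "E": [["id"], ["num"]],
    "B": [["E", ">", "E"]],
    "L": [["S"], ["S", "L"]],
}

def leftmost_derive(start, choices):
    """Apply each chosen alternative at the leftmost non-terminal."""
    form = [start]
    for nt, alt in choices:                 # ("S", 1) = use S's 2nd alternative
        i = next(k for k, s in enumerate(form) if s in RULES)
        assert form[i] == nt                # the leftmost non-terminal is nt
        form = form[:i] + RULES[nt][alt] + form[i + 1:]
    return form

# S => while(B) do S => while(E>E) do S => ... => while(id>num) do print(id);
steps = [("S", 1), ("B", 0), ("E", 0), ("E", 1), ("S", 0), ("E", 0)]
print(" ".join(leftmost_derive("S", steps)))
# prints: while ( id > num ) do print ( id ) ;
```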
Rightmost Derivations
A derivation is “rightmost” if, at each step in the derivation, the rightmost non-terminal is selected for replacement.
Example 5: Generate while (num > num) do print(id); from CFG in example 3
S ⇒ while(B) do S
⇒ while(B) do print(E);
⇒ while(B) do print(id);
⇒ while(E>E) do print(id);
⇒ while(E>num) do print(id);
⇒ while(num>num) do print(id);
Non-Leftmost, Non-Rightmost Derivations
Some derivations are neither leftmost nor rightmost, such as:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(E>E) do print(E);
⇒ while(E>id) do print(E);
⇒ while(num>id) do print(E);
⇒ while(num>id) do print(num);
CFG Shorthand
We can combine two rules of the form S → α and S → β to get the single rule S → α│β
Example 6: CFG in example 3 can be shortened as follows
Terminals = { id, num, while, do, print, >, ;, {, }, (, ) }
Non-Terminals = { S, E, B, L }
Rules = S → print(E); | while (B) do S | { L }
E → id | num
B→E>E
L → S | SL
Start Symbol = S
Parse Trees
➢ A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace non-terminals.
Each interior node of a parse tree represents the application of a production.
The interior node is labeled with the nonterminal A in the head of the production; the
children of the node are labeled, from left to right, by the symbols in the body of the
production by which this A was replaced during the derivation.
➢ We start with the initial symbol S of the grammar as the root of the tree
The children of the root are the symbols that were used to rewrite the initial symbol in
the derivation.
The internal nodes of the parse tree are non-terminals
The children of each internal node N are the symbols on the right-hand side of a rule
that has N as the left-hand side (e.g. B → E > E where E > E is the right-hand side and
B is the left-hand side of the rule)
➢ Terminals are leaves of the tree.
Example 7: Derive the string -( id + id ) using the grammar E → E + E | E * E | ( E ) | - E | id:
E ⇒ -E ⇒ -( E ) ⇒ -( E + E ) ⇒ -( id + E ) ⇒ -( id + id )
Example 8: Derive the string id + id * id using the same grammar:
E ⇒ E + E ⇒ E + E * E ⇒ E + id * E ⇒ E + id * id ⇒ id + id * id
[Figure: the two parse trees, a) and b), for the string in Example 8]
Ambiguous Grammars
A grammar is ambiguous if there is at least one string derivable from the grammar that has
more than one parse tree (equivalently, more than one leftmost derivation or more than one
rightmost derivation).
The string in example 8 above has two different parse trees (parse trees a and b), so the
grammar that generates it is ambiguous.
Ambiguous grammars are bad, because the parse trees don’t tell us the exact meaning of the
string.
For example, if we see the example 8 again, in figure a, the string means id + (id * id),
but in fig. b, the string means (id + id) * id. This is why we call it “ambiguous”.
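Ambiguity can also be checked mechanically by counting the distinct leftmost derivations of one string. The sketch below assumes a simplified fragment of the grammar behind Example 8 (E → E + E | E * E | id, an assumption, since the full grammar also has parentheses and minus); two derivations for id + id * id confirm the ambiguity:

```python
# Count distinct leftmost derivations of TARGET. A count above 1 shows the
# grammar is ambiguous for that string. Simplified fragment of Example 8's
# grammar (an assumption): E -> E + E | E * E | id.

RULES = {"E": [["E", "+", "E"], ["E", "*", "E"], ["id"]]}
TARGET = ["id", "+", "id", "*", "id"]

def count_leftmost(form):
    if len(form) > len(TARGET):           # forms never shrink in this grammar
        return 0
    for i, sym in enumerate(form):
        if sym in RULES:                  # leftmost non-terminal found
            if form[:i] != TARGET[:i]:    # terminal prefix already mismatches
                return 0
            return sum(count_leftmost(form[:i] + body + form[i + 1:])
                       for body in RULES[sym])
    return 1 if form == TARGET else 0     # all terminals: exact match?

print(count_leftmost(["E"]))   # prints 2: two parse trees, so ambiguous
```

The two derivations counted correspond exactly to parse trees a and b: one groups the string as id + (id * id), the other as (id + id) * id.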
We need to change the grammar to fix this problem. How? We may rewrite the grammar as follows:
Terminals = {id, +, -, *, /, (, )}
Non-Terminals = {E, T, F }
Start Symbol = E
Rules = E → E + T
E→ E-T
E→ T
T→ T*F
T→ T/F
F → id
F → (E)
A parse tree for id * (id + id)
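A quick way to see that the rewritten grammar gives each string a single structure is a recursive-descent sketch of it (left recursion is turned into iteration here, an implementation choice, not a change to the language; node and function names are illustrative):

```python
# Recursive-descent sketch of the rewritten (unambiguous) grammar:
#   E -> E + T | E - T | T     T -> T * F | T / F | F     F -> id | ( E )
# Left recursion becomes iteration; nodes are (op, left, right) tuples.

def parse_E(toks, pos=0):
    node, pos = parse_T(toks, pos)
    while pos < len(toks) and toks[pos] in "+-":
        op = toks[pos]
        right, pos = parse_T(toks, pos + 1)
        node = (op, node, right)          # + and - group at the E level
    return node, pos

def parse_T(toks, pos):
    node, pos = parse_F(toks, pos)
    while pos < len(toks) and toks[pos] in "*/":
        op = toks[pos]
        right, pos = parse_F(toks, pos + 1)
        node = (op, node, right)          # * and / bind tighter, at the T level
    return node, pos

def parse_F(toks, pos):
    if toks[pos] == "(":
        node, pos = parse_E(toks, pos + 1)
        return node, pos + 1              # skip the closing ')'
    return toks[pos], pos + 1             # an id

tree, _ = parse_E(["id", "+", "id", "*", "id"])
print(tree)   # prints ('+', 'id', ('*', 'id', 'id')): the only tree, * under +
```

Parsing id * ( id + id ) with parse_E yields ('*', 'id', ('+', 'id', 'id')), matching the parse tree described above: the parenthesized sum sits below the multiplication.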
Review Exercises
Note: attempt all questions individually.
Submit your answer on [email protected]
Due date: January 29, 2025 G.C.
Which of the following strings are derivable from the grammar? Give the parse tree for
each derivable string.
i. ab
ii. aabbb
iii. aba
iv. aaabb
v. aaaabb
vi. aabb
3. Show that the following CFGs are ambiguous by giving two parse trees for the same string.
3.1) Terminals = { a, b }
     Non-Terminals = { S, T }
     Start Symbol = S
     Rules = S → STS
             S → b
             T → aT
             T → ε
3.2) Terminals = { if, then, else, print, id }
     Non-Terminals = { S, T }
     Start Symbol = S
     Rules = S → if id then S T
             S → print id
             T → else S
             T → ε