
Compiler Design

Chapter Three: Syntax Analysis


The objectives of this chapter are as follows:
❖ Explain the basic role of the parser (syntactic analyzer).

❖ Describe Context-Free Grammars (CFGs) and their representation format.

❖ Discuss the different derivation formats: Leftmost derivation, Rightmost derivation and
Non-Leftmost, Non-Rightmost derivations

❖ Be familiar with CFG shorthand techniques.

❖ Describe Parse Tree and its structure.

❖ Discuss ambiguous grammars and how to deal with ambiguity from CFGs.

❖ Explain the Extended Backus-Naur Form (EBNF).

The Role of the Parser


❖ The parser obtains a string of tokens from the lexical analyzer, as shown in the figure below,
and verifies that the string of token names can be generated by the grammar for the source
language.

❖ The parser is expected to report any syntax errors and to recover from commonly occurring
errors to continue processing the remainder of the program.

❖ Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the
rest of the compiler for further processing. In fact, the parse tree need not be constructed
explicitly, since checking and translation actions can be interspersed with parsing. Thus, the
parser and the rest of the front end could well be implemented by a single module.

Fig. Position of parser in compiler model

❖ Therefore, the parser performs context-free syntax analysis, guides context-sensitive analysis,
constructs an intermediate representation, produces meaningful error messages, and attempts
error correction.
❖ A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming
language.
• From certain classes of grammars, we can construct automatically an efficient parser
that determines the syntactic structure of a source program.
• As a side benefit, the parser-construction process can reveal syntactic ambiguities and
trouble spots that might have slipped through the initial design phase of a language.
• The structure imparted to a language by a properly designed grammar is useful for
translating source programs into correct object code and for detecting errors.
❖ A grammar allows a language to be evolved or developed iteratively, by adding new constructs
to perform new tasks.
❖ These new constructs can be integrated more easily into an implementation that follows the
grammatical structure of the language.
❖ There are three general types of parsers for grammars: universal, top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's
algorithm can parse any grammar (Read more on these).
❖ These general methods are, however, too inefficient to use in production compilers.
❖ The methods commonly used in compilers can be classified as being either top-down or
bottom-up.
• Top-Down Methods: - As implied by their names, top-down methods build parse trees
from the top (root) to the bottom (leaves).
• Bottom-up methods: - start from the leaves and work their way up to the root to build
the parse tree.
• In either case, the input to the parser is scanned from left to right, one symbol at a time.
❖ The most efficient top-down and bottom-up methods work only for sub-classes of grammars,
but several of these classes, particularly, LL and LR grammars, are expressive enough to
describe most of the syntactic constructs in modern programming languages.

Error Handling
Common programming errors include:
• Lexical errors, syntactic errors, semantic errors, and logical errors. The type of error
handled in this phase of compilation is the syntactic error.

Error handler goals


• Report the presence of errors clearly and accurately
• Recover from each error quickly enough to detect subsequent errors
• Add minimal overhead to the processing of correct programs


Common error-recovery strategies include:


1. Panic-mode recovery: - Discard input symbols one at a time until one of a designated set of
synchronizing tokens is found.
2. Phrase-level recovery: - Replace a prefix of the remaining input with some string that allows
the parser to continue.
3. Error productions: - Augment the grammar with productions that generate the erroneous
constructs.
4. Global correction: - Choose a minimal sequence of changes to obtain a globally least-cost
correction.
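As an illustration, strategy 1 (panic-mode recovery) can be sketched in Python. This is a minimal sketch; the token list and the synchronizing set `SYNC_TOKENS` are illustrative assumptions, not part of the notes.

```python
# Minimal sketch of panic-mode recovery (token list and synchronizing set
# are hypothetical). On a syntax error, the parser discards input symbols
# one at a time until a synchronizing token appears, then resumes past it.
SYNC_TOKENS = {";", "}"}  # hypothetical synchronizing set

def panic_mode_recover(tokens, pos):
    """Return the position just after the next synchronizing token at or
    after `pos`, or len(tokens) if none is found."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1  # discard one input symbol
    return pos + 1 if pos < len(tokens) else pos

# Example: an error detected at position 2 skips ahead to just past the ';'.
tokens = ["id", "=", "@", "id", "+", "id", ";", "id"]
print(panic_mode_recover(tokens, 2))  # 7
```

The trade-off is that everything between the error and the synchronizing token is discarded, so errors in the skipped region go unreported.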

Context-Free Grammars (CFGs)


❖ A CFG is used as a tool to describe the syntax of a programming language.

❖ A CFG includes 4 components:

1. A set of terminals T, which are the tokens of the language


o Terminals are the basic symbols from which strings are formed.
o The term "token name" is a synonym for "terminal".

2. A set of non-terminals N
o Non-terminals are syntactic variables that denote sets of strings.
o The sets of strings denoted by non-terminals help define the language generated by
the grammar.
o Non-terminals impose a hierarchical structure on the language that is key to syntax
analysis and translation

3. A set of rewriting rules R.


o The left-hand side (head) of each rewriting rule is a single non-terminal.
o The right-hand side (body) of each rewriting rule is a string of terminals and/or non-
terminals

4. A special non-terminal S ∈ N, which is the start symbol. The productions for the start
symbol are listed first.

❖ Just as regular expressions generate strings of characters, CFGs generate strings of tokens.

❖ A string of tokens is generated by a CFG in the following way:


1. The initial string is the start symbol S.
2. While there are non-terminals left in the string:
a. Pick any non-terminal A in the string.
b. Replace a single occurrence of A in the string with the right-hand side of any
rule that has A as the left-hand side.


3. Repeat step 2 until all symbols in the string are terminals.

Example 1: A grammar that defines simple arithmetic expressions:


Terminals = {id, +, -, *, /, (, )}
Non-Terminals = {expression, term, factor}
Start Symbol = expression
Rules = expression → expression + term
→ expression - term
→ term
term → term * factor
→ term / factor
→ factor
factor → ( expression )
→ id
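The string-generation procedure above can be sketched in Python using the Example 1 grammar. The dictionary representation and the `generate` helper are illustrative assumptions, not a prescribed implementation.

```python
import random

# Sketch of the generation procedure, applied to the Example 1 grammar.
# Representation (assumed for illustration): a dict mapping each
# non-terminal to the list of its rule bodies.
GRAMMAR = {
    "expression": [["expression", "+", "term"],
                   ["expression", "-", "term"],
                   ["term"]],
    "term": [["term", "*", "factor"],
             ["term", "/", "factor"],
             ["factor"]],
    "factor": [["(", "expression", ")"], ["id"]],
}

def generate(start, max_steps=200):
    """Repeatedly replace some non-terminal occurrence with the body of
    one of its rules, until only terminals remain (or we give up)."""
    string = [start]
    for _ in range(max_steps):
        nonterms = [i for i, sym in enumerate(string) if sym in GRAMMAR]
        if not nonterms:
            return " ".join(string)      # all symbols are terminals
        i = random.choice(nonterms)      # step a: pick any non-terminal
        body = random.choice(GRAMMAR[string[i]])
        string[i:i + 1] = body           # step b: replace it with a rule body
    return None  # derivation did not finish within max_steps

random.seed(0)
print(generate("expression"))
```

Because the grammar is recursive, a random derivation need not terminate quickly; the `max_steps` bound is only a guard for the sketch.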
Notational Conventions
1. These symbols are terminals:
A. Lowercase letters early in the alphabet, such as a, b, c.
B. Operator symbols such as +, *, and so on.
C. Punctuation symbols such as parentheses, comma, and so on.
D. The digits 0, 1, ... ,9.
E. Boldface strings such as id or if, each of which represents a single terminal symbol.

2. These symbols are non-terminals:


A. Uppercase letters early in the alphabet, such as A, B, C.
B. The letter S, which, when it appears, is usually the start symbol.
C. Lowercase, italic names such as expr or stmt.
D. Uppercase letters may be used to represent non-terminals for particular constructs.
For example, the non-terminals for expressions, terms, and factors are often represented
by E, T, and F, respectively.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is,
either non-terminals or terminals.

4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly empty) strings of
terminals.

5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar
symbols.
❖ Thus, a generic production can be written as A → α, where A is the head and α the
body.
6. A set of productions A → α₁, A → α₂, A → α₃, …, A → αₖ with a common head A (call them
A-productions) may be written A → α₁|α₂|α₃|...|αₖ. Call α₁, α₂, α₃, ..., αₖ the alternatives for A.

7. Unless stated otherwise, the head of the first production is the start symbol.


Example 2: - Using these conventions, the grammar of Example 1 can be rewritten concisely as:
E→ E+T|E-T|T
T→ T*F|T/F|F
F → ( E ) | id

Derivations
❖ A derivation is a description of how a string is generated from the start symbol of a
grammar.
❖ The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting
step replaces a nonterminal by the body of one of its productions.
❖ For a general definition of derivation, consider a nonterminal A in the middle of a sequence
of grammar symbols, as in 𝛼𝐴𝛽, where 𝛼 𝑎𝑛𝑑 𝛽 are arbitrary strings of grammar symbols.
o Suppose 𝐴 → 𝛾 is a production. Then, we write 𝛼𝐴𝛽 ⇒ 𝛼𝛾𝛽.
o The symbol ⇒ means, "derives in one step."
❖ Example 3: Use the CFG below to perform the derivations in example 4 & 5.
Terminals = {id, num, while, do, print, >, {, }, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = (1) S → print(E);
(2) S → while (B) do S
(3) S → { L }
(4) E → id
(5) E → num
(6) B → E > E
(7) L → S
(8) L → SL
Start Symbol = S
Leftmost Derivations
 A string of terminals and non-terminals α that can be derived from the initial symbol of the
grammar is called a sentential form
 Thus the strings “{ SL }”, “while(id>E) do S”, and “print(id);” of the above example
are all sentential forms.
 A derivation is “leftmost” if, at each step in the derivation, the leftmost non-terminal in
the sentential form is the one selected for replacement.
 A sentential form that occurs in a leftmost derivation is called a left-sentential form.
Example 4: We can use leftmost derivations to generate while (id > num) do print(id); from
the above CFG (example 3) as follows:
S → while(B) do S
→ while(E>E) do S
→ while(id>E) do S
→ while(id>num) do S
→ while(id>num) do print(E);
→ while(id>num) do print(id);
Rightmost Derivations
 A rightmost derivation chooses the rightmost non-terminal to replace at each step.
Example 5: Generate while (num > num) do print(id); from CFG in example 3
S → while(B) do S
→ while(B) do print(E);
→ while(B) do print(id);
→ while(E>E) do print(id);
→ while(E>num) do print(id);
→ while(num>num) do print(id);
Non-Leftmost, Non-Rightmost Derivations
 Some derivations are neither leftmost nor rightmost, such as:
S → while(B) do S
→ while(E>E) do S
→ while(E>E) do print(E);
→ while(E>id) do print(E);
→ while(num>id) do print(E);
→ while(num>id) do print(num);
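A single rewriting step of any of these kinds can be sketched in Python using the grammar of example 3. The list-based sentential form and the `rewrite` helper are illustrative assumptions.

```python
# Illustrative sketch of one derivation step (grammar of example 3).
# A sentential form is a list of symbols; NONTERMS identifies non-terminals.
NONTERMS = {"S", "E", "B", "L"}

def rewrite(form, head, body, leftmost=True):
    """Replace the leftmost (or rightmost) non-terminal in `form` with
    `body`. The chosen non-terminal must equal `head`, the rule's LHS."""
    positions = [i for i, sym in enumerate(form) if sym in NONTERMS]
    i = positions[0] if leftmost else positions[-1]
    assert form[i] == head, "selected non-terminal does not match rule head"
    return form[:i] + body + form[i + 1:]

# First two steps of the leftmost derivation in example 4:
form = ["S"]
form = rewrite(form, "S", ["while", "(", "B", ")", "do", "S"])
form = rewrite(form, "B", ["E", ">", "E"])  # leftmost: B precedes the second S
print(" ".join(form))  # while ( E > E ) do S
```

Passing `leftmost=False` would instead pick the rightmost non-terminal, giving the rightmost derivation of example 5.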
CFG Shorthand
 We can combine two rules of the form S → α and S → β to get the single rule S → α | β
Example 6: CFG in example 3 can be shortened as follows
Terminals = {id, num, while, do, print, >, {, }, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = S → print(E); | while (B) do S | { L }
E → id | num
B→E>E
L → S | SL
Start Symbol = S

Parse Trees
➢ A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace non-terminals.
 Each interior node of a parse tree represents the application of a production.
 The interior node is labeled with the nonterminal A in the head of the production; the
children of the node are labeled, from left to right, by the symbols in the body of the
production by which this A was replaced during the derivation.
➢ We start with the initial symbol S of the grammar as the root of the tree
 The children of the root are the symbols that were used to rewrite the initial symbol in
the derivation.
 The internal nodes of the parse tree are non-terminals
 The children of each internal node N are the symbols on the right-hand side of a rule
that has N as the left-hand side (e.g. B → E > E where E > E is the right-hand side and
B is the left-hand side of the rule)
➢ Terminals are leaves of the tree.
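These points can be illustrated with a minimal tree structure in Python. The `Node` class is an assumption for the sketch, not a prescribed representation.

```python
# Minimal parse-tree sketch (the Node class is a hypothetical representation).
# An interior node holds the non-terminal at the head of a production; its
# children are the symbols of the body; leaves are terminals.
class Node:
    def __init__(self, symbol, children=None):
        self.symbol = symbol
        self.children = children or []

    def leaves(self):
        """Return the terminal symbols at the leaves, left to right
        (the string derived by the tree)."""
        if not self.children:
            return [self.symbol]
        out = []
        for child in self.children:
            out.extend(child.leaves())
        return out

# Tree for B => E > E => id > E => id > num, using the rules of example 3:
tree = Node("B", [Node("E", [Node("id")]),
                  Node(">"),
                  Node("E", [Node("num")])])
print(" ".join(tree.leaves()))  # id > num
```

Reading the leaves left to right recovers the derived string, which is why the parse tree "filters out" the order in which productions were applied.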
Examples 7 and 8 below use the grammar E → E + E | E * E | ( E ) | - E | id.
Example 7: - ( id + id )
E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )

Example 8: id + id * id
E ⇒ E + E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

Fig. Two parse trees, (a) and (b), for the string in example 8

Ambiguous Grammars
 A grammar is ambiguous if there is at least one string derivable from the grammar that has
more than one different parse tree, or more than one leftmost derivation, or more than one
rightmost derivation
 The string in example 8 has two parse trees, (a) and (b), so the grammar used there is
ambiguous.
 Ambiguous grammars are bad, because the parse trees don’t tell us the exact meaning of the
string.
 For example, looking at example 8 again: in parse tree (a) the string means id + (id * id),
but in parse tree (b) it means (id + id) * id. This is why we call the grammar “ambiguous”.
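To make the difference concrete, suppose the three ids take the values 2, 3, and 4 (hypothetical numbers, chosen only for illustration): the two parse trees evaluate to different results.

```python
# Illustrative: the two parse trees for id + id * id mean different things.
# Suppose the ids are 2, 3, and 4 (hypothetical values).
a, b, c = 2, 3, 4
tree_a = a + (b * c)   # parse tree (a): * groups tighter, 2 + (3 * 4)
tree_b = (a + b) * c   # parse tree (b): + groups tighter, (2 + 3) * 4
print(tree_a, tree_b)  # 14 20
```

A compiler must produce exactly one of these values, which is why an ambiguous grammar is unusable as-is for translation.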

We need to change the grammar to fix this problem. How? We may rewrite the grammar as follows:
Terminals = {id, +, -, *, /, (, )}
Non-Terminals = {E, T, F }
Start Symbol = E
Rules = E → E + T
E→ E-T
E→ T
T→ T*F
T→ T/F
T→ F
F → id
F → (E)
Fig. A parse tree for id * (id + id)

Review Exercises
Note: attempt all questions individually.
Submit your answer on [email protected]
Due date: January 29, 2025 G.C.

1. Consider the context-free grammar: S → S S + | S S * | a and the string aa + a*.


a) Give a leftmost derivation for the string.
b) Give a rightmost derivation for the string.
c) Give a parse tree for the string.
d) Is the grammar ambiguous or unambiguous? Justify your answer.
e) Describe the language generated by this grammar.

2. Consider the following grammar


Terminals = { a, b }
Non-Terminals = {S, T, F }
Start Symbol = S
Rules = S→ TF
T→ T T T
T→ a
F→ aFb
F→ b

Which of the following strings are derivable from the grammar? Give the parse tree for each
derivable string.
i. ab iv. aaabb
ii. aabbb v. aaaabb
iii. aba vi. aabb

3. Show that the following CFGs are ambiguous by giving two parse trees for the same string?
3.1) Terminals = { a, b }
Non-Terminals = { S, T }
Start Symbol = S
Rules = S → S T S
S → b
T → a T
T → ε
3.2) Terminals = { if, then, else, print, id }
Non-Terminals = { S, T }
Start Symbol = S
Rules = S → if id then S T
S → print id
T → else S
T → ε

4. Construct a CFG for each of the following:


a. All integers with sign (Example: +3, -3)
b. The set of all strings over { (, ), [, ] } which form balanced parentheses. That is, (), ()(),
((()())()), [()()] and ([()[]()]) are in the language but )( , ][ , (() and ([ are not.
c. The set of all strings over {num, +, -, *, /} which are legal binary post-fix expressions.
Thus num num +, num num num + *, and num num - num * are all in the language, while
num *, num * num, and num num num - are not in the language.
d. Are your CFGs in a, b and c ambiguous?
