[Week 4] Syntax Analysis (CFG)
[Week 4] Syntax Analysis (CFG)
Week Four:
Parsing (Syntax Analysis)
The CSC318 Team [2020|2021]
[email protected]
Recommended Texts for this Course
1. Watson, Des (2017). A Practical Approach to Compiler Construction, Springer. [MainText]
2. Campbell, Bill; Iyer, Swami; and Akbal-Delibas, Bahar (2013). Introduction to Compiler
Construction in a Java World, Taylor & Francis Group.
3. Seidl, Helmut; Wilhelm, Reinhard; and Hack, Sebastian (2012). Compiler Design: Analysis and
Transformation, Springer
4. Grune, Dick; Reeuwijk, Kees van; Bal, Henri E; Jacobs, Ceriel J.H; and Langendoen, Koen (2012).
Modern Compiler Design, Second Edition, Springer.
5. Reis, Anthony Dos (2011). Compiler Construction Using Java, JavaCC, and Yacc, Wiley
2
Parsing
v Once we have identified the tokens in our program,
we then want to determine its syntactic structure.
v That is, we want to put the tokens together to make the larger
syntactic entities: expressions, statements, methods, and
classdefinitions.
v The Abstract Syntax Tree, (AST) captures the essential syntax of our program.
v The AST makes that syntax, which is implicit in the program text,explicit.
ü This is the purpose of Parsing.
5
Context-Free Grammars and Languages
Backus–Naur Form (BNF) and Its Extensions
v The grammars that we use to describe programming languages are inherently recursive
and so are best described by what we call context-free grammars.
Ø if (E) S (1)
§ is also a statement.
§ Or, we can read the rule as, a statement S can be written as an if, followed by a
parenthesized expression E and finally a statement S.
§ That is, we can read the “::=“ as ‘can be written as a' .
6
§ There are abbreviations possible in the notation.
§ For example, we can write
§ Curly braces denote the Kleene closure, indicating that the phrase may appear zero or
more times
§ For example,
• E ::= T {+ T} (5)
which says that an expression E may be written as a term T , followed by zero or more
occurrences of + followed by a term T :
T
T+T
T+T+T
• . .. 8
§ Finally, one may use the alternation sign | inside
right-hand sides, using parentheses for grouping.
• For example,
E ::= T {(+ | -) T}
T+T-T+T
9
Grammar and the Language It Describes
• A context-free grammar is a tuple, G = (N,T,S,P ), where
• N is a set of non-terminal symbols, sometimes called non-terminals;
• T is a set of terminal symbols, sometimes called terminals;
• S ∈ N is a designated non-terminal, called the start symbol,; and
• P is a set of production rules, sometimes called productions or rules.
• For example, a context-free grammar that describes (very) simple arithmetic
expressions is G = (N, T, S, P ).
• where N = {E, T, F} is the set of non-terminals, T = {+, *, (, ), id} is the set of terminals,
S = E is the start symbol, and the Production set, P is is as follows:
10
• A little less formally, we can denote the same grammar simply
as a sequence of production rules.
• For example, the grammar below denotes the same grammar
as does the immediate preceding one given.
E ::= E + T
E ::= T
T ::= T * F
T ::= F
F ::= (E)
F ::= id
11
• We may surmise that the symbols (here E, T , and F ) that are defined by at least one production
rule are non-terminals.
• Conventionally, the first defined non-terminal (that is, E here) is our start symbol.
• Those symbols that are not defined (here +,*, (, ),and id) are terminals.
• The start symbol is important to us because it is from this symbol, using the production rules,
that can generate strings in a language.
• For example, because we designate E to be the start symbol in the grammar above,
we can record a sequence of applications of the production rules, starting from E to
the sentence id + id * id as follows:
E ⇒E + T
⇒T +T
⇒F + T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
12
• When a string of symbols derives another string of symbols in one step, that is, by a single
application of a production rule, we say that first string directly derives the second string.
• For example,
E directly derives
E+T
E + T directly derives
T+T
T + T directly derives
F + T. and so on . ..
• When one string can be rewritten as another string, using zero or more production
rules from the grammar, we say the first string derives the second string.
• For this derives relation, we usually use the symbol ⇒∗.
• For example,
E ⇒∗ E (in zero steps)
E ⇒∗ id + F * F
T + T ⇒∗ id + id * id 13
• We say the language L(G) that is described by a grammar G consists of all the strings
(sentences) comprised of only terminal symbols that can be derived from the start symbol.
• A little more formally, we can express the language L(G) for a grammar G with start symbol S
and terminals T as
E ⇒∗ id + id * id
E ⇒∗ id
E ⇒∗ (id + id) * id
so, L(G) includes each of id + id * id,
id
(id + id) * id
• and infinitely more finite sentences. 14
Top-Down and Bottom-Up parsing Parsing
Approaches
• There are two major parsing approaches: top-down and bottom-up.
• In top-down parsing, you start with the start symbol, S and apply the production rules
until you arrive at the desired/specified string.
• In bottom-up parsing, you start with the specified string and reduce it to the start
symbol, S by applying the production rules backwards.
• As an example, let’s trace through the two approaches on this simple grammar that
recognizes strings consisting of any number of a’s followed by at least one (and
possibly more) b’s:
• S –> AB
• A –> aA | ε
• B –> b | bB
Here is a top-down parse of aaab. >>>
15
top-down parse
• We begin with the start symbol and at each step, expand one of the remaining
nonterminals by replacing it with the right side of one of its productions.
• We repeat until only terminals remain.
• The top-down parse produces a leftmost derivation of the sentence.
Start Symbol, S Rules:
AB S –> AB
aAB A –> aA
aaAB A –> aA
aaaAB A –> aA
aaaεB A –> ε
aaab B –> b
16
bottom-up parse
• A bottom-up parse works in reverse.
• We begin with the sentence of terminals and each step applies a production in reverse, replacing
a substring that matches the right side with the nonterminal on the left.
• We continue until we have substituted our way back to the start symbol.
The string • aaab The Rules:
(insert ε)
• aaaεb
A –> ε
• aaaAb
A –> aA
• aaAb
A –> aA
• aAb A –> aA
• Ab B –> b
• AB S –> AB
• S
17
Derivations (cont’d)
• We are interested in languages, those strings of terminals that can be derived from a grammar’s start symbol.
• There are two kinds of derivation that will be important to us when we go about parsing these languages: left-
most derivations and right-most derivations.
• A left-most derivation is a derivation in which at each step, the next string is derived by applying a production rule for
rewriting the left-most non-terminal.
• For example, we have already seen a left-most derivation of id + id * id:
E⇒E+ T
⇒T+T
⇒F+T
⇒ id + T
⇒ id + T * F
⇒ id + F * F
⇒ id + id * F
⇒ id + id * id
• Here we have underlined the left-most non-terminal in each string of symbols in the derivation to show that it is indeed
a left-most derivation.
18
right-most derivation
• A right-most derivation is a derivation in which at each step, the next string is derived
by applying a production rule for rewriting the right-most non-terminal.
19
• We use the term sentential form to refer to any string of (terminal and non-terminal)
symbols that can be derived from
the start symbol.
• So, for example in the previous derivation,
E
E+T
E+T*F
E + T * id
E + F * id
E + id * id
T + id * id
F + id * id
id + id * id
are all sentential forms.
ØClearly, any sentential form consisting solely of terminal symbols is a sentence in the
language. 20
Parse tree
• An alternative representation of a derivation is
the parse tree, a tree that illustrates the
derivation and the structure of an input string
(at the leaves) from a start symbol (at the root).
22
Ambiguous Grammars and Unambiguous Grammars
• Given a grammar G, if there exists a sentence s in L(G) for which
there are more than one left-most derivations in G (or equivalently,
either more than one right-most derivations, or more than one
parse tree for s in G), we say that the sentence s is ambiguous.
23
Example
• Consider the grammar, G:
E ::= E + E | E * E | (E) | id
25
right-most derivations
• It is also the case that the sentence has two
right-most derivations in the grammar:
E ⇒E + E
⇒ E + E *E
⇒ E + E * id
⇒ E + id * id
⇒ id + id *id
27
In-Class Exercise 5
• Produce another right-most derivation for the
same sentence.
28
In-Class Exercise 6
• Produce the two parse trees for the sentence to
demonstrate ambiguity.
30
• Clearly, we would rather have unambiguous grammars
describe our programming language, because
ambiguity in the parsing can lead to ambiguity in
assigning semantics (meaning) to programs.
S ::= if (E) S
| if (E) S else S
|s
E ::= e
• Here, the token e represents an arbitrary expression and the s represents a (non- conditional)
statement.
• Consider the following sentence form (string of terminals): if (e) if (e) s else s
• If we look at this sentence carefully, we see that it nests one conditional statement within another.
• One might ask: To which 'if' does the 'else' belong?
• We cannot know; the sentence is ambiguous.
33
Two left-most derivations for a conditional
statement
• More formally, there exist two left- most derivations for this sentence:
Derivation 1: Derivation 2:
34
Two parse trees for if (e) if (e) s else s
35
• In general, there is no algorithm that can tell us whether
or not an arbitrary grammar is ambiguous.
Ø Top-down parsing:
§ where the parser begins at the start symbol, S and, step-by-step derives the
input sentence and producing either a parse tree or (more likely) an Abstract
Syntax Tree (AST) from the root down to its leaves.
Ø Bottom-up parsing:
§ where the parser begins at the input sentence and, scanning it from left to right,
applies the production rules, replacing righthand sides with lefthand sides, in a
step-by-step fashion for reducing the sentence to obtain the start symbol.
§ Here the Parse Tree or Abstract Syntax Tree is built from the bottom leaves up to the top
root.
38
Top-Down Deterministic Parsing
• There are two popular top-down deterministic parsing strategies:
1. Parsing by recursive descent and
2. LL(1) parsing.
• Both Parsers scan the input from left to right, looking at and scanning just one symbol at a time.
• In both cases, the parser starts with the grammar’s start symbol, S as an initial goal in the
parsing process;
§ that is, the goal is to scan the input sentence and parse the syntactic entity represented by the
start symbol.
• The start symbol is then rewritten, using a BNF rule replacing the symbol with the right-
hand side sequence of symbols.
39
Parsing
• Parsing = process of determining if a string of tokens can be
generated by a grammar
• For any CF grammar there is a parser that takes at most O(n3) time to
parse a string of n tokens
• Top-down parsing “constructs” a parse tree from the root to the leaves
• Bottom-up parsing “constructs” a parse tree from the leaves to the root
4400
Predictive Parsing
• Recursive descent parsing is a top-down parsing method
§ Every nonterminal has one (recursive) procedure responsible for parsing the
nonterminal’s syntactic category of input tokens
§ When a nonterminal has multiple productions, each production is
implemented in a branch of a selection statement based on input look-ahead
information
41
41
Example Predictive Parser (Grammar)
type ® simple
| ^ id
| array [ simple ] of type
simple ® integer
| char
| num dotdot num
42
Example Predictive Parser (Program Code)
procedure match(t : token);
begin
if lookahead = t then procedure simple();
lookahead := nexttoken() begin
else error() if lookahead = ‘integer’ then
end; match(‘integer’)
else if lookahead = ‘char’ then
procedure type(); match(‘char’)
begin else if lookahead = ‘num’ then
if lookahead in { ‘integer’, ‘char’, ‘num’ } then match(‘num’);
simple() match(‘dotdot’);
else if lookahead = ‘^’ then match(‘num’)
match(‘^’); match(id) else error()
else if lookahead = ‘array’ then end;
match(‘array’); match(‘[‘); simple();
match(‘]’); match(‘of’); type()
else error()
end;
43
Example Predictive Parser (Execution Step 1)
Check lookahead
type()
and call match
match(‘array’)
lookahead
44
Example Predictive Parser (Execution Step 2)
type()
match(‘array’) match(‘[’)
lookahead
45
Example Predictive Parser (Execution Step 3)
type()
match(‘num’)
lookahead
46
Example Predictive Parser (Execution Step 4)
type()
match(‘num’) match(‘dotdot’)
lookahead
47
Example Predictive Parser (Execution Step 5)
type()
lookahead
48
Example Predictive Parser (Execution Step 6)
type()
lookahead
49
Example Predictive Parser (Execution Step 7)
type()
lookahead
50
Example Predictive Parser (Execution Step 8)
type()
match(‘integer’)
lookahead
LECTURE 4: FURTHER READING & ASSIGNMENT
49
Homework
1. Consider the grammar:
A →aAbA|c
Give a leftmost derivation of the string aacbcbc.
2. Prove which one of the following strings is recognized by this grammar.
S →aSbb|ε
i. aabb
ii. bb
iii. aabbbb
iv. aaaabb
3. Given the following expression grammar, produce a Syntax tree for 2+3*4.
Exp → Exp+Exp2
Exp → Exp-Exp2
Exp → Exp2
Exp2 → Exp2*Exp3
Exp2 → Exp2/Exp3
Exp2 → Exp3
Exp3 → num
Exp3 → (Exp)
52
52