[Week 3] Syntax Analysis (Derivation)
Week Three:
Syntax Analysis and Context-Free Grammar (CFG)
Recap: From Description to Implementation
Tokens Versus Terminals
❖ In a compiler, the lexical analyzer reads the characters of the source program, groups them into
lexically meaningful units called lexemes, and produces as output tokens representing these
lexemes. Each token consists of a token name and an optional attribute value.
❖ The token names are abstract symbols that are used by the parser for syntax analysis.
❖ Often, we shall call these token names terminals, since they appear as terminal symbols in
the grammar for a programming language.
❖ The attribute value, if present, is a pointer to the symbol table that contains additional
information about the token.
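The split between token name and attribute value can be sketched as follows. This is a minimal illustration, not a prescribed design: the `Token` class, the list-based symbol table, and the helper `make_id_token` are all hypothetical names, and the "pointer" is modeled as an index into the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    name: str                 # token name: the terminal the parser sees
    attribute: Optional[int]  # index into the symbol table, or None


def make_id_token(lexeme, symbol_table):
    """Return an 'id' token whose attribute points at the lexeme's table entry.

    The symbol table is modeled as a plain list of dicts (an assumption made
    for this sketch); a new entry is appended the first time a lexeme is seen.
    """
    for index, entry in enumerate(symbol_table):
        if entry["lexeme"] == lexeme:
            return Token("id", index)
    symbol_table.append({"lexeme": lexeme})
    return Token("id", len(symbol_table) - 1)


table = []
first = make_id_token("count", table)
second = make_id_token("total", table)
again = make_id_token("count", table)
```

The parser only inspects `name`, so `first` and `second` look like the same terminal to it; later phases follow `attribute` into the table to tell the two identifiers apart.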
In-Class Exercise 1
int main ()
{
int a = 0;
cout << a << endl;
return 1;
}
• Given the sample code above, list and describe the tokens.
Tree Terminology
❖ Tree data structures figure prominently in compiling.
• A tree consists of one or more nodes. Nodes may have labels, which typically will be grammar
symbols. When we draw a tree, we often represent the nodes by these labels only.
• Exactly one node is the root. All nodes except the root have a unique parent; the root has no
parent. When we draw trees, we place the parent of a node above that node and draw an edge
between them. The root is then the highest (top) node.
• If node N is the parent of node M, then M is a child of N. The children of one node are called
siblings. They have an order, from the left, and when we draw trees, we order the children of a given
node in this manner.
• A node with no children is called a leaf. Other nodes (those with one or more children) are
interior nodes.
• A descendant of a node N is either N itself, a child of N, a child of a child of N, and so on, for any
number of levels. We say node N is an ancestor of node M if M is a descendant of N.
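The terminology above maps directly onto a small data structure. The sketch below is one possible encoding; the `Node` class and its method names are illustrative choices, not part of the course material.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label                # typically a grammar symbol
        self.parent = None                # only the root keeps parent = None
        self.children = list(children)    # ordered from the left
        for child in self.children:
            child.parent = self

    def is_leaf(self):
        return not self.children

    def siblings(self):
        """The other children of this node's parent, in left-to-right order."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def descendants(self):
        """This node itself, its children, their children, and so on."""
        yield self
        for child in self.children:
            yield from child.descendants()

    def is_ancestor_of(self, other):
        return any(d is other for d in self.descendants())


# A small tree for "9 - 5": list -> list - digit, list -> digit
nine = Node("9")
inner = Node("list", [Node("digit", [nine])])
five = Node("digit", [Node("5")])
root = Node("list", [inner, Node("-"), five])
```

Because a node counts as its own descendant under the definition above, `root.is_ancestor_of(root)` is true in this encoding.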
Definition of Grammars
A context-free grammar (CFG) has four components
1. A set of terminal symbols, sometimes referred to as "tokens." The terminals are the elementary symbols
of the language defined by the grammar.
2. A set of nonterminals, sometimes called "syntactic variables." Each nonterminal represents a set of
strings of terminals, in a manner we shall describe.
3. A set of productions, where each production consists of a nonterminal, called the head or left side of the
production, an arrow, and a sequence of terminals and/or nonterminals, called the body or right side of
the production. The intuitive intent of a production is to specify one of the written forms of a construct; if
the head nonterminal represents a construct, then the body represents a written form of the construct.
4. A designation of one of the nonterminals as the start symbol.
Syntax Definition
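❖ A context-free grammar is a 4-tuple with
➢ A set of tokens (terminal symbols), T
➢ A set of nonterminals, N
➢ A set of productions, P, each of the form
X → Y1 Y2 … Yn
where X is a nonterminal and each Yi is a terminal or a nonterminal
➢ A designated start symbol, S, which is one of the nonterminals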
In English:
❖ An integer is an arithmetic expression.
❖ If exp1 and exp2 are arithmetic expressions, then so are the following:
• exp1 - exp2
• exp1 / exp2
• ( exp1 )
The corresponding CFG (token names on the left; the shorthand we'll use on the right):
• exp → INTLITERAL            E → intlit
• exp → exp MINUS exp         E → E - E
• exp → exp DIVIDE exp        E → E / E
• exp → LPAREN exp RPAREN     E → ( E )
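To make the 4-tuple view concrete, the expression grammar can be written down as data and checked for well-formedness. This is a sketch under assumptions: the `Grammar` class, its field names, and the `check` method are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset
    nonterminals: frozenset
    productions: tuple   # (head, body) pairs; body is a tuple of symbols
    start: str

    def check(self):
        """Raise ValueError if the four components do not fit together."""
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a nonterminal")
        if self.terminals & self.nonterminals:
            raise ValueError("terminals and nonterminals must be disjoint")
        symbols = self.terminals | self.nonterminals
        for head, body in self.productions:
            if head not in self.nonterminals:
                raise ValueError(f"head {head!r} is not a nonterminal")
            for sym in body:
                if sym not in symbols:
                    raise ValueError(f"unknown symbol {sym!r} in body")
        return True


# The shorthand grammar: E -> intlit | E - E | E / E | ( E )
EXPR = Grammar(
    terminals=frozenset({"intlit", "-", "/", "(", ")"}),
    nonterminals=frozenset({"E"}),
    productions=(
        ("E", ("intlit",)),
        ("E", ("E", "-", "E")),
        ("E", ("E", "/", "E")),
        ("E", ("(", "E", ")")),
    ),
    start="E",
)
```

Calling `EXPR.check()` returns `True`; a production that mentions a symbol outside T and N would raise `ValueError`.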
Example of a Grammar
❖ Context-free grammar for simple expressions:
list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Derivation
❖ Given a CF grammar we can determine the set of all strings (sequences of tokens) generated
by the grammar using derivation.
• In each step, we replace one nonterminal in the current sentential form with one of the
right-hand sides of a production for that nonterminal.
Derivation for the Example Grammar
list
⇒ list + digit
⇒ list - digit + digit
⇒ digit - digit + digit
⇒ 9 - digit + digit
⇒ 9 - 5 + digit
⇒ 9 - 5 + 2
This is an example of a leftmost derivation, because we replaced the leftmost nonterminal
in each step.
Likewise, a rightmost derivation replaces the rightmost nonterminal in each step.
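The leftmost derivation of 9 - 5 + 2 can be replayed mechanically. The sketch below assumes the list grammar contains the productions list → list + digit, list → list - digit, and list → digit that the derivation uses; the names `GRAMMAR`, `leftmost_step`, and `derive` are illustrative.

```python
GRAMMAR = {
    "list": [["list", "+", "digit"], ["list", "-", "digit"], ["digit"]],
    "digit": [[str(d)] for d in range(10)],
}


def leftmost_step(form, body):
    """Replace the leftmost nonterminal in `form` with the production body `body`."""
    for i, sym in enumerate(form):
        if sym in GRAMMAR:
            if body not in GRAMMAR[sym]:
                raise ValueError(f"{sym} -> {' '.join(body)} is not a production")
            return form[:i] + body + form[i + 1:]
    raise ValueError("sentential form contains no nonterminal")


def derive(start, bodies):
    """Return every sentential form of the leftmost derivation, start first."""
    forms = [[start]]
    for body in bodies:
        forms.append(leftmost_step(forms[-1], body))
    return forms


# The production bodies chosen at each step of the derivation above.
STEPS = [["list", "+", "digit"], ["list", "-", "digit"], ["digit"],
         ["9"], ["5"], ["2"]]

if __name__ == "__main__":
    for form in derive("list", STEPS):
        print(" ".join(form))
```

A rightmost derivation would differ only in scanning `form` from the right when looking for the nonterminal to replace.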
Derivation: An example
❖ CFG:
E → id
E → E + E
E → E * E
E → ( E )
❖ Derivation:
E ⇒ E + E
⇒ E * E + E
⇒ id * E + E
⇒ id * id + E
⇒ id * id + id
• Is the string id * id + id in the language defined by the grammar?
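A membership question like this one can be answered by searching for a derivation. The sketch below does a breadth-first search over leftmost sentential forms; it relies on the observation that no production of this particular grammar shortens a sentential form, so any form longer than the target can be discarded. The function name `derives` is an illustrative choice, and this brute-force search is not how practical parsers work.

```python
from collections import deque

PRODUCTIONS = {
    "E": [("id",), ("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")")],
}


def derives(target, start="E"):
    """Return True if `target` (a sequence of terminals) is derivable from `start`."""
    target = tuple(target)
    seen = {(start,)}
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        # Locate the leftmost nonterminal, if there is one.
        for i, sym in enumerate(form):
            if sym in PRODUCTIONS:
                break
        else:
            continue  # only terminals, but not the target
        # The terminals to the left of it can never change again.
        if form[:i] != target[:i]:
            continue
        for body in PRODUCTIONS[sym]:
            new_form = form[:i] + body + form[i + 1:]
            if len(new_form) <= len(target) and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return False
```

For example, `derives(["id", "*", "id", "+", "id"])` succeeds, which agrees with the derivation shown on the slide.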
Syntax Analyzer (Parser)
Parse Trees
[Figure: parse tree for 9 - 5 + 2 with interior nodes labeled list and digit]
The sequence of leaves, 9 - 5 + 2, is called the yield of the parse tree.
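Reading off the yield is a left-to-right walk over the leaves. The sketch below encodes the parse tree for 9 - 5 + 2 as nested tuples, assuming the productions list → list + digit, list → list - digit, and list → digit; the tuple encoding and the name `tree_yield` are illustrative choices.

```python
# Interior nodes are (label, [children]); leaves are plain strings.
TREE = ("list", [
    ("list", [
        ("list", [("digit", ["9"])]),
        "-",
        ("digit", ["5"]),
    ]),
    "+",
    ("digit", ["2"]),
])


def tree_yield(node):
    """Return the leaves of the tree rooted at `node`, from left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    leaves = []
    for child in children:
        leaves.extend(tree_yield(child))
    return leaves
```

Here `" ".join(tree_yield(TREE))` gives back the string the tree was built for.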
Ambiguity
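A grammar is ambiguous if some string has more than one parse tree. For example, a grammar that
does not distinguish between lists and digits,
string → string + string | string - string | 0 | 1 | … | 9
is ambiguous, because more than one parse tree represents the string 9-5+2.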
Ambiguity (cont’d)
[Figure: two parse trees for 9 - 5 + 2, both rooted at string. One groups the input as (9 - 5) + 2, the other as 9 - (5 + 2).]
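The ambiguity can be checked by enumerating every parse tree of 9 - 5 + 2. The sketch below assumes the grammar string → string + string | string - string | digit; the function names `parses` and `evaluate` are illustrative.

```python
def parses(tokens):
    """Return all parse trees of `tokens` as nested (left, op, right) tuples."""
    if len(tokens) == 1 and tokens[0].isdigit():
        return [tokens[0]]
    trees = []
    # Try every operator position as the root of the tree.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("+", "-"):
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    trees.append((left, tokens[i], right))
    return trees


def evaluate(tree):
    """Evaluate a parse tree produced by `parses`."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    lhs, rhs = evaluate(left), evaluate(right)
    return lhs + rhs if op == "+" else lhs - rhs


TREES = parses(["9", "-", "5", "+", "2"])
```

`TREES` holds two trees, and they evaluate to different numbers, which is why the ambiguity matters for more than the shape of the tree.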
Associativity of Operators
• Left-associative operators have left-recursive productions
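[Figure: annotated parse tree for 9 - 5 + 2, with attribute values such as expr.t = “95-2+” at the root and term.t = “9” at the leftmost term]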
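Left associativity and the postfix string 95-2+ can be reproduced with a short translator. The sketch assumes the left-recursive productions expr → expr + term | expr - term | term with term a single digit, and implements the left recursion as a loop that extends the result on the right; `to_postfix` is a hypothetical name.

```python
def to_postfix(text):
    """Translate a +/- expression over single digits into postfix notation."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"digit expected at token {pos}")
        digit = tokens[pos]
        pos += 1
        return digit

    # expr -> term, then repeatedly expr -> expr op term.
    # Each iteration makes everything read so far the left operand,
    # which is exactly left associativity.
    result = term()
    while pos < len(tokens):
        op = tokens[pos]
        if op not in ("+", "-"):
            raise SyntaxError(f"operator expected at token {pos}")
        pos += 1
        result = result + term() + op
    return result
```

With this grouping, `to_postfix("9 - 5 + 2")` yields the same string as the expr.t attribute at the root of the annotated tree.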
In-Class Exercise
The Structure of a Modern Compiler
Source code passes through the following phases, in order, on its way to machine code:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. IR Generation
5. IR Optimization
6. Code Generation
7. Optimization
Running example: the source fragment below is followed through the phases.
while (y < z)
{
    int x = a + b;
    y += x;
}
Lexical Analysis turns the fragment into the token stream:
T_While
T_LeftParen
T_Identifier y
T_Less
T_Identifier z
T_RightParen
T_OpenBrace
T_Int
T_Identifier x
T_Assign
T_Identifier a
T_Plus
T_Identifier b
T_Semicolon
T_Identifier y
T_PlusAssign
T_Identifier x
T_Semicolon
T_CloseBrace
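The token stream for the while fragment can be reproduced with a small regex-based scanner. This is a sketch only: the token names are taken from the slide, but the regular expressions, the `tokenize` function, and the decision to print the lexeme only for identifiers are assumptions.

```python
import re

# Order matters: keywords before identifiers, "+=" before "+".
TOKEN_SPEC = [
    ("T_While", r"\bwhile\b"),
    ("T_Int", r"\bint\b"),
    ("T_Identifier", r"[A-Za-z_]\w*"),
    ("T_PlusAssign", r"\+="),
    ("T_Plus", r"\+"),
    ("T_Less", r"<"),
    ("T_Assign", r"="),
    ("T_LeftParen", r"\("),
    ("T_RightParen", r"\)"),
    ("T_OpenBrace", r"\{"),
    ("T_CloseBrace", r"\}"),
    ("T_Semicolon", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return the token stream of `source` as a list of display strings."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup
        if kind == "T_Identifier":
            tokens.append(f"{kind} {match.group()}")
        elif kind != "SKIP":
            tokens.append(kind)
        pos = match.end()
    return tokens


SOURCE = """while (y < z)
{
    int x = a + b;
    y += x;
}"""
```

`tokenize(SOURCE)` produces nineteen entries, from `T_While` through `T_CloseBrace`, in the same order as the slide.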
Syntax Analysis builds a syntax tree from the token stream:
[Figure: syntax tree rooted at While. Its children are < (with operands y and z) and Sequence. Sequence has two = children: the first assigns a + b to x, the second assigns y + x to y.]
Semantic Analysis annotates the syntax tree with types:
[Figure: the same syntax tree with type annotations. While and Sequence are void; the + nodes and the identifier leaves a, b, x, y, and z are int.]
IR Generation translates the tree into intermediate code (loop body and back-edge test shown):
Loop: x = a + b
      y = x + y
      _t1 = y < z
      if _t1 goto Loop
IR Optimization, Code Generation, and Optimization then process this code in turn.
Assignment
❖ Consider the grammar G = < {S}, {a, b}, P, S > with
productions
P: S → aSbS | bSaS | ε