A Simple One-Pass Compiler
Chapter 2
CSE309N
The Entire Compilation Process
Grammars for Syntax Definition
Syntax-Directed Translation
Parsing - Top Down & Predictive
Pulling Together the Pieces
The Lexical Analysis Process
Symbol Table Considerations
A Brief Look at Code Generation
Concluding Remarks/Looking Ahead
Overview
Grammars for Syntax Definition
A Context-free Grammar (CFG) Is Utilized to
Describe the Syntactic Structure of a Language
A CFG Is Characterized By:
1. A Set of Tokens or Terminal Symbols
2. A Set of Non-terminals
3. A Set of Production Rules
Each Rule Has the Form
NT → {T, NT}*
4. A Non-terminal Designated As the Start Symbol
Grammars for Syntax Definition
Example CFG
Information
A string of tokens is a sequence of zero or more tokens.
The string containing zero tokens, written as ε, is called the
empty string.
A grammar derives strings by beginning with the start symbol and
repeatedly replacing a non-terminal by the right side of a
production for that non-terminal.
The token strings that can be derived from the start symbol form
the language defined by the grammar.
Grammars are Used to Derive Strings:
⇒ 9 - 5 + digit    P4 : digit → 5
⇒ 9 - 5 + 2        P4 : digit → 2
A More Complex Grammar
Defining a Parse Tree
A parse tree pictorially shows how the start symbol of a
grammar derives a string in the language.
More Formally, a Parse Tree for a CFG Has the Following
Properties:
Root Is Labeled With the Start Symbol
Leaf Node Is a Token or ε
Interior Node Is a Non-Terminal
If A → x1 x2 … xn, Then A Is an Interior Node; x1, x2, …, xn Are
Children of A and May Be Non-Terminals or Tokens
Other Important Concepts
Ambiguity
Two derivations (Parse Trees) for the same token string.
[Figure: two parse trees for 9 - 5 + 2, one grouping it as (9 - 5) + 2 and the other as 9 - (5 + 2)]
Grammar:
string → string + string | string - string | 0 | 1 | … | 9
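To see why two parse trees for one token string are a problem, the two groupings of 9 - 5 + 2 can be evaluated directly. This is only an illustrative sketch: the function names are made up here and each function hard-codes one of the two trees.

```c
/* Each function hard-codes one of the two parse trees for 9 - 5 + 2
 * (hypothetical names, not taken from the slides). */

/* Tree 1: the subtraction is grouped first, (9 - 5) + 2. */
int eval_left_grouping(void)
{
    return (9 - 5) + 2;
}

/* Tree 2: the addition is grouped first, 9 - (5 + 2). */
int eval_right_grouping(void)
{
    return 9 - (5 + 2);
}
```

The first grouping evaluates to 6 and the second to 2, so a translator driven by an ambiguous grammar could assign either meaning to the same input.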
Other Important Concepts
Associativity of Operators
Left vs. Right
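[Figure: parse trees contrasting a left-associative grammar (non-terminal list) with a right-associative grammar (non-terminal right)]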
Other Important Concepts
Operator Precedence
What does 9+5*2 mean?
Typically the precedence order is ( * / ) above ( + - ), so 9+5*2 means 9+(5*2).
stmt → id := expr
     | if expr then stmt
     | if expr then stmt else stmt
     | while expr do stmt
     | begin opt_stmts end
Ambiguous Grammar?
Syntax-Directed Translation
Associate Attributes With Grammar Rules and Translate as Parsing occurs
The translation will follow the parse tree structure (and as a result the
structure and form of the parse tree will affect the translation).
First example: Inductive Translation.
Infix to Postfix Notation Translation for Expressions
Translation defined inductively as: Postfix(E) where E is an
Expression.
Rules
1. If E is a variable or constant then Postfix(E) = E
2. If E is E1 op E2 then Postfix(E)
= Postfix(E1 op E2) = Postfix(E1) Postfix(E2) op
3. If E is (E1) then Postfix(E) = Postfix(E1)
Examples
Postfix( ( 9 - 5 ) + 2 )
= Postfix( ( 9 - 5 ) ) Postfix( 2 ) +
= Postfix( 9 - 5 ) Postfix( 2 ) +
= Postfix( 9 ) Postfix( 5 ) - Postfix( 2 ) +
= 9 5 - 2 +
Postfix( 9 - ( 5 + 2 ) )
= Postfix( 9 ) Postfix( ( 5 + 2 ) ) -
= Postfix( 9 ) Postfix( 5 + 2 ) -
= Postfix( 9 ) Postfix( 5 ) Postfix( 2 ) + -
= 9 5 2 + -
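The inductive rules for Postfix(E) map naturally onto a recursive function over an expression tree. The following is a minimal sketch, assuming a hypothetical struct Expr node layout and single-character operands; neither comes from the slides. Parentheses (rule 3) need no code because they only shape the tree.

```c
#include <stdio.h>
#include <string.h>

/* Expression tree node (hypothetical layout, not from the slides).
 * op == 0 marks a leaf that holds a single-character operand. */
struct Expr {
    char op;                   /* '+', '-', ... or 0 for a leaf */
    char leaf;                 /* operand character when op == 0 */
    const struct Expr *left;   /* E1 */
    const struct Expr *right;  /* E2 */
};

/* Append one character to the NUL-terminated string out (capacity cap). */
static void append(char *out, size_t cap, char c)
{
    size_t n = strlen(out);
    if (n + 1 < cap) {
        out[n] = c;
        out[n + 1] = '\0';
    }
}

/* Append Postfix(e) to out, following the inductive rules. */
void postfix(const struct Expr *e, char *out, size_t cap)
{
    if (e->op == 0) {              /* Rule 1: variable or constant */
        append(out, cap, e->leaf);
        return;
    }
    postfix(e->left, out, cap);    /* Rule 2: Postfix(E1) */
    postfix(e->right, out, cap);   /*         Postfix(E2) */
    append(out, cap, e->op);       /*         op          */
}
```

Building the tree for ( 9 - 5 ) + 2 and calling postfix on its root with an empty buffer yields the string 95-2+, matching the first worked example.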
Syntax-Directed Definition
Each Production Has a Set of Semantic Rules
Each Grammar Symbol Has a Set of Attributes
For the Following Example, String Attribute “t” is
Associated With Each Grammar Symbol
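[Figure: annotated parse tree for 9 - 5 + 2. The root has expr.t = 95-2+; its children have expr.t = 95- and term.t = 2; below them expr.t = 9 and term.t = 5; at the bottom term.t = 9]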
A depth-first traversal starts at the root and recursively visits the
children of each node in left-to-right order.
The semantic rules at a given node are evaluated once all
descendants of that node have been visited.
A parse tree showing all the attribute values at each node
is called an annotated parse tree.
Translation Schemes
Semantic Actions Are Embedded into the Right Sides of
the Productions.
A translation scheme is like a syntax-directed definition except the
order of evaluation of the semantic rules is explicitly shown.
expr → expr + term {print(‘+’)}
expr → expr - term {print(‘-’)}
expr → term
term → 0 {print(‘0’)}
term → 1 {print(‘1’)}
…
term → 9 {print(‘9’)}
[Figure: parse tree for 9 - 5 + 2 with the actions attached as extra leaves: {print(‘9’)}, {print(‘5’)}, {print(‘-’)}, {print(‘2’)}, {print(‘+’)}]
Parsing
Parsing is the process of determining if a
string of tokens can be generated by a grammar.
Top-down:
• starts at root
• proceeds towards leaves
Bottom-up:
• starts at leaves
• proceeds towards root
Parsing – Top-Down & Predictive
Top-Down Parsing
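Parse tree / derivation of a token string begins at the start symbol
Grammar (type is the start symbol):
type → simple
     | ↑ id
     | array [ simple ] of type
simple → integer
       | char
       | num dotdot num
Suppose input is:
array [ num dotdot num ] of integer
Parsing would begin with
type → ???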
Top-Down Parse (type = start symbol)
[Figure: the lookahead symbol is array, so the parser selects type → array [ simple ] of type and builds the children array, [, simple, ], of, type under the root]
Top-Down Parse (type = start symbol)
The selection of a production for a non-terminal may involve
trial and error.
[Figure: the completed parse tree: type → array [ simple ] of type, with simple → num dotdot num and the inner type → simple → integer]
Top-Down Process
Recursive Descent or Predictive Parsing
Parser Operates by Attempting to Match Tokens in the
Input Stream
Utilize both Grammar and Input Below to Motivate Code
for Algorithm
Top-Down Algorithm (Continued)
procedure type ;
begin
    if lookahead is in { integer, char, num } then simple
    else if lookahead = ‘↑’ then begin match(‘↑’) ; match(id) end
    else if lookahead = array then begin
        match(array) ; match(‘[’) ; simple ; match(‘]’) ; match(of) ; type
    end
    else error
end ;
procedure simple ;
begin
    if lookahead = integer then match(integer)
    else if lookahead = char then match(char)
    else if lookahead = num then begin
        match(num) ; match(dotdot) ; match(num)
    end
    else error
end ;
Tracing
Input: array [ num dotdot num ] of integer
To initialize the parser:
set global variable : lookahead = array
call procedure: type
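The trace can be reproduced with a small C rendering of the two procedures. This is a sketch under stated assumptions: the token codes, the match helper, the failed flag, and the parse_type driver are illustrative names introduced here, and the input is a pre-tokenized array terminated by DONE.

```c
#include <stddef.h>

/* Token codes for the type grammar (illustrative names). */
enum Token { ARRAY, LBRACKET, RBRACKET, OF, INTEGER, CHAR, NUM,
             DOTDOT, UPARROW, ID, DONE };

static const enum Token *input;  /* token stream being parsed */
static size_t pos;               /* index of the lookahead token */
static int failed;               /* set to 1 on the first syntax error */

static enum Token lookahead(void) { return input[pos]; }
static void error(void) { failed = 1; }

/* Consume the lookahead if it equals t; otherwise report an error. */
static void match(enum Token t)
{
    if (failed) return;
    if (lookahead() == t) {
        if (t != DONE) pos++;    /* never step past the end marker */
    } else {
        error();
    }
}

static void simple(void)
{
    if (failed) return;
    switch (lookahead()) {
    case INTEGER: match(INTEGER); break;
    case CHAR:    match(CHAR);    break;
    case NUM:     match(NUM); match(DOTDOT); match(NUM); break;
    default:      error();
    }
}

static void type(void)
{
    if (failed) return;
    switch (lookahead()) {
    case INTEGER: case CHAR: case NUM:
        simple();
        break;
    case UPARROW:
        match(UPARROW); match(ID);
        break;
    case ARRAY:
        match(ARRAY); match(LBRACKET); simple();
        match(RBRACKET); match(OF); type();
        break;
    default:
        error();
    }
}

/* Returns 1 if tokens (terminated by DONE) form a valid type, else 0. */
int parse_type(const enum Token *tokens)
{
    input = tokens;
    pos = 0;
    failed = 0;
    type();
    match(DONE);
    return !failed;
}
```

Calling parse_type on the tokens ARRAY, LBRACKET, NUM, DOTDOT, NUM, RBRACKET, OF, INTEGER, DONE follows the trace: type sees array, simple consumes num dotdot num, and the recursive call to type consumes integer.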
Limitations
Can we apply the previous technique to every
grammar?
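NO:
type → simple
     | array [ simple ] of type
simple → integer
       | array digit
digit → 0|1|2|3|4|5|6|7|8|9
Here the lookahead array fits both productions for type, because simple can also begin with array, so one lookahead token cannot select the production.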
Designing a Predictive Parser
Consider A → α
FIRST(α) = set of leftmost tokens that appear in α or in
strings generated by α.
E.g. FIRST(type) = {↑, array, integer, char, num}
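FIRST sets can be computed by iterating to a fixed point. The sketch below does this for the type grammar only, and it relies on a simplifying assumption: the grammar has no ε-productions, so only the leading symbol of each right side matters. The symbol codes, struct Production, and compute_first are names invented for this illustration.

```c
/* Terminals come first, then non-terminals (illustrative codes). */
enum {
    ARRAY_T, LBRACKET_T, RBRACKET_T, OF_T, INTEGER_T, CHAR_T, NUM_T,
    DOTDOT_T, UPARROW_T, ID_T,
    NTERMINALS,
    TYPE_N = NTERMINALS, SIMPLE_N,
    NSYMBOLS
};

/* One production, reduced to its left side and the first symbol of its
 * right side. This is enough only because no production derives the
 * empty string (an assumption of this sketch). */
struct Production { int lhs; int lead; };

static const struct Production grammar[] = {
    { TYPE_N,   SIMPLE_N  },  /* type -> simple */
    { TYPE_N,   UPARROW_T },  /* type -> uparrow id */
    { TYPE_N,   ARRAY_T   },  /* type -> array [ simple ] of type */
    { SIMPLE_N, INTEGER_T },  /* simple -> integer */
    { SIMPLE_N, CHAR_T    },  /* simple -> char */
    { SIMPLE_N, NUM_T     },  /* simple -> num dotdot num */
};

/* Fill first[] with one bitmask per symbol: bit t is set when terminal t
 * is in FIRST(symbol). Repeats until no set grows any further. */
void compute_first(unsigned first[NSYMBOLS])
{
    int i, changed;
    const int nprods = (int)(sizeof grammar / sizeof grammar[0]);

    for (i = 0; i < NSYMBOLS; i++)
        first[i] = (i < NTERMINALS) ? (1u << i) : 0u;

    do {
        changed = 0;
        for (i = 0; i < nprods; i++) {
            unsigned before = first[grammar[i].lhs];
            first[grammar[i].lhs] |= first[grammar[i].lead];
            if (first[grammar[i].lhs] != before)
                changed = 1;
        }
    } while (changed);
}
```

After the call, the mask for TYPE_N has exactly the bits for ↑, array, integer, char, and num set, which agrees with FIRST(type) as given. A grammar with ε-productions would need the fuller algorithm that looks past nullable symbols.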
Problems with Top Down Parsing
Left Recursion in CFG May Cause Parser to Loop Forever.
Indeed:
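For the production A → Aα we write the program
procedure A
{
    if lookahead belongs to FIRST(Aα) then
        call the procedure A
}
The procedure calls itself without consuming any input, so the lookahead never changes and the recursion never ends.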
Dealing with Left recursion
Solution: Algorithm to Remove Left Recursion:
BASIC IDEA
A → Aα | β becomes
A → βR
R → αR | ε
What happens to semantic actions?
Comparing Grammars with Left Recursion
Notice Location of Semantic Actions in Tree
[Figure: parse tree for 9 - 5 + 2 under the left-recursive grammar; the actions {print(‘9’)}, {print(‘5’)}, {print(‘-’)}, {print(‘2’)}, {print(‘+’)} appear as leaves of the tree]
Comparing Grammars without Left Recursion
Now, Notice Location of Semantic Actions in Tree
for Revised Grammar
What is Order of Processing in this Case?
[Figure: parse tree for the revised grammar; expr has the children term and rest, and actions such as {print(‘2’)} sit inside the rest subtrees]
Procedures for the Non-terminals expr, term, and rest
void expr()
{
    term(); rest();
}
void rest()
{
    if (lookahead == '+') {
        match('+'); term(); putchar('+'); rest();
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); rest();
    }
    else ;   /* rest → ε: do nothing */
}
Procedures for the Non-terminals expr, term, and rest (2)
void term()
{
    if (isdigit(lookahead)) {
        putchar(lookahead); match(lookahead);
    }
    else error();
}
Optimizing the translator
Tail recursion
When the last statement executed in a procedure body is a
recursive call of the same procedure, the call is said to be tail
recursive. Such a call can be replaced by a jump to the start of the procedure.
void rest()
{
L:  if (lookahead == '+') {
        match('+'); term(); putchar('+'); goto L;
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); goto L;
    }
    else ;
}
Optimizing the translator
void expr()
{
    term();
    while (1)
        if (lookahead == '+') {
            match('+'); term(); putchar('+');
        }
        else if (lookahead == '-') {
            match('-'); term(); putchar('-');
        }
        else break;
}
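Putting the optimized expr together with term and match gives a complete infix-to-postfix translator. The version below is a sketch rather than the slides' program: it reads from a string and writes into a caller-supplied buffer instead of using stdin and putchar, and the names translate, emit, src, dst, and failed are assumptions made for this example. White space is not handled.

```c
#include <ctype.h>
#include <stddef.h>

static const char *src;  /* remaining input; *src plays the role of lookahead */
static char *dst;        /* next free position in the output buffer */
static char *dst_end;    /* last position, reserved for the terminating NUL */
static int failed;       /* set to 1 on the first error */

/* Stands in for putchar: append c to the output buffer. */
static void emit(char c)
{
    if (dst < dst_end) *dst++ = c;
    else failed = 1;
}

/* Advance past the lookahead if it equals t. */
static void match(char t)
{
    if (!failed && *src == t) src++;
    else failed = 1;
}

/* term -> digit { print(digit) } */
static void term(void)
{
    if (!failed && isdigit((unsigned char)*src)) {
        emit(*src);
        match(*src);
    } else {
        failed = 1;
    }
}

/* expr -> term rest, with the tail recursion in rest turned into a loop. */
static void expr(void)
{
    term();
    while (!failed) {
        if (*src == '+')      { match('+'); term(); emit('+'); }
        else if (*src == '-') { match('-'); term(); emit('-'); }
        else break;
    }
}

/* Translate infix (single digits, + and -) into postfix in out.
 * Returns 1 on success and 0 on a syntax error or a full buffer. */
int translate(const char *infix, char *out, size_t cap)
{
    if (cap == 0) return 0;
    src = infix;
    dst = out;
    dst_end = out + cap - 1;
    failed = 0;
    expr();
    if (*src != '\0') failed = 1;  /* leftover input is an error */
    *dst = '\0';
    return !failed;
}
```

With a 16-byte buffer, translate("9-5+2", buf, sizeof buf) succeeds and leaves 95-2+ in buf, while an input such as "9-" is rejected because term finds no digit after the operator.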
Lexical Analysis
A lexical analyzer reads and converts the input into a stream of tokens to
be analyzed by the parser.
Functional Responsibilities
1. White Space and Comments Are Filtered Out
blanks, new lines, tabs are removed
modifying the grammar to incorporate white space into
the syntax would be difficult to implement
Functional Responsibilities (2)
2. Constants
• The job of collecting digits into integers is generally given to a
lexical analyzer because numbers can be treated as single units
during translation.
• Let num be the token representing an integer.
• The value of the integer will be passed along as an attribute of the
token num.
• Example:
31 + 28 + 59
<num, 31> <+, > <num, 28> <+, > <num, 59>
NB: The 2nd component of each tuple, the attribute, plays no role
during parsing but is needed during translation.
Functional Responsibilities (3)
Solution
1. Keywords are reserved.
2. The character string forms an identifier
only if it is not a keyword.
Interface to the Lexical Analyzer
[Figure: Input → (read character) → Lexical Analyzer → (pass token and its attributes) → Parser; the lexical analyzer can also push a character back to the input]
1. Read characters from input
2. Groups them into lexemes
3. Passes the token together with attribute values to the later stage
Why push back?
The read and push-back part is implemented with a buffer
The Lexical Analysis Process
A Graphical Depiction
[Figure: lexan(), the lexical analyzer, uses getchar() to read a character, pushes back c using ungetc(c, stdin), returns the token to the caller, and sets tokenval to the attribute value]
Example of a Lexical Analyzer
function lexan: integer ;    [ Returns an integer encoding of token ]
var lexbuf : array[ 0 .. 100 ] of char ;
    c : char ;
begin
    loop begin
        read a character into c ;
        if c is a blank or a tab then
            do nothing
        else if c is a newline then
            lineno := lineno + 1
        else if c is a digit then begin
            set tokenval to the value of this and following digits ;
            return NUM
        end
Algorithm for Lexical Analyzer
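A minimal C sketch of such an analyzer follows. It covers only the cases visible in the pseudocode (blanks, tabs, newlines, digits) and adds two assumptions of its own: end of input returns a DONE token, and any other character is returned as its own token with tokenval set to NONE. It reads from a string and peeks at the next character, where a stdin-based version would read with getchar() and push back with ungetc(c, stdin). The token code values and the lex_init helper are hypothetical.

```c
#include <ctype.h>

#define NUM  256    /* token code for an integer (value is an assumption) */
#define DONE 257    /* token code for end of input (assumption) */
#define NONE (-1)   /* attribute value for tokens that carry none */

static const char *inp;  /* next unread character */
int lineno = 1;          /* current line number */
int tokenval = NONE;     /* attribute of the most recent token */

/* Point the analyzer at a NUL-terminated input string. */
void lex_init(const char *text)
{
    inp = text;
    lineno = 1;
    tokenval = NONE;
}

/* Return the next token; for NUM, tokenval holds the integer value. */
int lexan(void)
{
    for (;;) {
        int c = (unsigned char)*inp;
        if (c == '\0')
            return DONE;             /* stay on the terminator */
        inp++;
        if (c == ' ' || c == '\t') {
            /* skip blanks and tabs */
        } else if (c == '\n') {
            lineno++;
        } else if (isdigit(c)) {
            /* collect this digit and the following digits */
            tokenval = c - '0';
            while (isdigit((unsigned char)*inp))
                tokenval = tokenval * 10 + (*inp++ - '0');
            return NUM;
        } else {
            tokenval = NONE;
            return c;                /* single-character token */
        }
    }
}
```

After lex_init("31 + 28 + 59"), successive calls to lexan return NUM with tokenval 31, then '+', then NUM with tokenval 28, then '+', then NUM with tokenval 59, and finally DONE. Identifiers, keywords, and overflow of large constants are left out of this sketch.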
Symbol Table Considerations
OPERATIONS: Insert (string, token_ID)
Lookup (string)
NOTICE: Reserved words are placed into
symbol table for easy lookup
Attributes may be associated with each entry, e.g.,
Semantic Actions, Typing Info (id → integer), etc.
ARRAY symtable
index   lexptr   token   attributes
0
1       →        div
2       →        mod
3       →        id
4       →        id
ARRAY lexemes (each lexptr points to a lexeme stored in this array)
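The two operations and the two arrays in the figure can be sketched in C as follows. The array sizes, the token code values, and the choice to return 0 when the table is full are assumptions made for this example; slot 0 of symtable is left unused so that 0 can mean "not found".

```c
#include <string.h>

#define SYMMAX 100   /* capacity of symtable (assumed) */
#define STRMAX 999   /* capacity of lexemes (assumed) */

#define DIV 257      /* token codes (values are assumptions) */
#define MOD 258
#define ID  259

struct entry {
    char *lexptr;    /* points at the lexeme inside lexemes[] */
    int token;       /* token code stored with the lexeme */
};

static char lexemes[STRMAX];           /* all lexemes, each NUL-terminated */
static int lastchar = -1;              /* last used index in lexemes */
static struct entry symtable[SYMMAX];
static int lastentry = 0;              /* last used index in symtable */

/* Return the index of the entry for s, or 0 if s is not in the table. */
int lookup(const char *s)
{
    int p;
    for (p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Store s with token code tok; return the new index, or 0 if full. */
int insert(const char *s, int tok)
{
    int len = (int)strlen(s);
    if (lastentry + 1 >= SYMMAX) return 0;        /* symtable full */
    if (lastchar + len + 1 >= STRMAX) return 0;   /* lexemes full */
    lastentry++;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar += len + 1;                          /* lexeme plus its NUL */
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}
```

Reserved words are entered first, for example insert("div", DIV) and insert("mod", MOD), which places them in slots 1 and 2 as in the figure. A later lookup("div") then returns 1, so the lexical analyzer can tell the keyword from an ordinary identifier by the stored token code.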
Abstract Stack Machines
The front end of a compiler constructs an intermediate representation of
the source program from which the back end generates the target
program.
One popular form of intermediate representation is code for an abstract
stack machine.
Instructions
Instructions fall into three classes.
1. Integer arithmetic
2. Stack manipulation
3. Control flow
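[Figure: snapshot of the stack machine after the first four instructions]
INSTRUCTIONS         STACK          DATA
1  push 5            16             1:  0
2  rvalue 2          7   ← top      2:  11
3  +                                3:  7
4  rvalue 3                         4:  . . .
5  *        ← pc
6  . . .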
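A tiny interpreter makes the snapshot concrete. This sketch supports only the four instructions that appear in the figure, plus a HALT marker that is an assumption added here to end the program; the names enum Op, struct Instr, and run are illustrative. Data addresses are used as array indices, so index 0 is unused.

```c
#include <assert.h>

/* Instruction set: only what the figure uses, plus HALT (an assumption). */
enum Op { PUSH, RVALUE, ADD, MUL, HALT };

struct Instr {
    enum Op op;
    int arg;   /* constant for PUSH, data address for RVALUE, else unused */
};

#define STACK_MAX 64

/* Execute code until HALT and return the value left on top of the stack. */
int run(const struct Instr *code, const int *data)
{
    int stack[STACK_MAX];
    int top = -1;   /* index of the top element; -1 means empty */
    int pc = 0;     /* index of the next instruction */

    while (code[pc].op != HALT) {
        const struct Instr in = code[pc++];
        switch (in.op) {
        case PUSH:                 /* push a constant */
            assert(top + 1 < STACK_MAX);
            stack[++top] = in.arg;
            break;
        case RVALUE:               /* push the contents of a data location */
            assert(top + 1 < STACK_MAX);
            stack[++top] = data[in.arg];
            break;
        case ADD:                  /* replace the top two values by their sum */
            assert(top >= 1);
            stack[top - 1] += stack[top];
            top--;
            break;
        case MUL:                  /* replace the top two values by their product */
            assert(top >= 1);
            stack[top - 1] *= stack[top];
            top--;
            break;
        default:
            break;
        }
    }
    assert(top >= 0);
    return stack[top];
}
```

With data locations 2 and 3 holding 11 and 7, running push 5, rvalue 2, +, rvalue 3 leaves 16 and 7 on the stack, as in the snapshot. Executing the fifth instruction, *, then leaves 112.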
L-value and R-value
Stack manipulation
Translation of Expressions
[Figure: successive snapshots of the stack while the expression is evaluated]
Translation of Expressions (3)
[Figure: further snapshots of the stack while the expression is evaluated]
Translation of Expressions (4)
[Figure: final stack snapshots, together with the data area holding day, y, m, and d]
Control Flow
Translation of statements
stmt → if expr then stmt1
       { out := newlabel;
         stmt.t := expr.t || ‘gofalse’ out || stmt1.t || ‘label’ out }
Code layout:
    code for expr
    gofalse out
    code for stmt1
    label out
Translation of statements (2)
The End