
A Simple One - Pass Compiler

This chapter discusses the basics of compilers including context-free grammars for defining syntax, parsing, syntax-directed translation, and code generation. It introduces grammars and how they are used to derive strings and parse trees. Issues like ambiguity, associativity of operators, and precedence are covered. Syntax-directed translation is introduced as a way to associate attributes with grammar rules to perform translations during parsing based on the parse tree structure.

Copyright
© Attribution Non-Commercial (BY-NC)

CSE309N

Chapter 2
A Simple One – Pass Compiler

Dewan Tanvir Ahmed


Computer Science & Engineering
Bangladesh University of Engineering and Technology

The Entire Compilation Process
• Grammars for Syntax Definition
• Syntax-Directed Translation
• Parsing - Top Down & Predictive
• Pulling Together the Pieces
• The Lexical Analysis Process
• Symbol Table Considerations
• A Brief Look at Code Generation
• Concluding Remarks/Looking Ahead

Overview

A programming language can be defined by describing:

1. The syntax of the language
   a. What its programs look like
   b. We use a CFG or BNF (Backus-Naur Form)
2. The semantics of the language
   a. What its programs mean
   b. Difficult to describe
   c. We use informal descriptions and suggestive examples

Grammars for Syntax Definition
• A Context-free Grammar (CFG) Is Utilized to Describe the Syntactic Structure of a Language
• A CFG Is Characterized By:
1. A Set of Tokens or Terminal Symbols
2. A Set of Non-terminals
3. A Set of Production Rules
   Each Rule Has the Form
   NT → {T, NT}*
4. A Non-terminal Designated As the Start Symbol

Grammars for Syntax Definition
Example CFG

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(the "|" means OR)
(So we could have written
list → list + digit | list - digit | digit )

Information
• A string of tokens is a sequence of zero or more tokens.
• The string containing zero tokens, written as ε, is called the empty string.
• A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the right side of a production for that non-terminal.
• The token strings that can be derived from the start symbol form the language defined by the grammar.

Grammars are Used to Derive Strings:

Using the CFG defined on the earlier slide, we can derive the string 9 - 5 + 2 as follows:

list ⇒ list + digit             P1 : list → list + digit
     ⇒ list - digit + digit     P2 : list → list - digit
     ⇒ digit - digit + digit    P3 : list → digit
     ⇒ 9 - digit + digit        P4 : digit → 9
     ⇒ 9 - 5 + digit            P4 : digit → 5
     ⇒ 9 - 5 + 2                P4 : digit → 2

Grammars are Used to Derive Strings:

This derivation could also be represented via a Parse Tree
(parents on left, children on right)

list ⇒ list + digit
     ⇒ list - digit + digit
     ⇒ digit - digit + digit
     ⇒ 9 - digit + digit
     ⇒ 9 - 5 + digit
     ⇒ 9 - 5 + 2

list
    list
        list
            digit
                9
        -
        digit
            5
    +
    digit
        2

A More Complex Grammar

block → begin opt_stmts end
opt_stmts → stmt_list | ε
stmt_list → stmt_list ; stmt | stmt

What is this grammar for ?
What does "ε" represent ?
What kind of production rule is this ?

Defining a Parse Tree
• A parse tree pictorially shows how the start symbol of a grammar derives a string in the language.
• More Formally, a Parse Tree for a CFG Has the Following Properties:
  • Root Is Labeled With the Start Symbol
  • Leaf Node Is a Token or ε
  • Interior Node Is a Non-Terminal
  • If A → x1x2…xn, Then A Is an Interior Node; x1x2…xn Are Children of A and May Be Non-Terminals or Tokens

Other Important Concepts
Ambiguity
Two derivations (Parse Trees) for the same token string, 9 - 5 + 2.
(Children are indented under their parent, listed left to right.)

Tree 1, grouping (9 - 5) + 2:

string
    string
        string
            9
        -
        string
            5
    +
    string
        2

Tree 2, grouping 9 - (5 + 2):

string
    string
        9
    -
    string
        string
            5
        +
        string
            2

Grammar:
string → string + string | string - string | 0 | 1 | … | 9

Why is this a Problem ?
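
One way to see the problem: the two parse trees group the operands differently, so the same token string 9 - 5 + 2 would be given two different values. A short worked check:

```latex
\begin{align*}
(9 - 5) + 2 &= 4 + 2 = 6 \\
9 - (5 + 2) &= 9 - 7 = 2
\end{align*}
```

A translator driven by the parse tree needs exactly one tree per string, otherwise the meaning of the program depends on which tree the parser happens to build.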

Other Important Concepts
Associativity of Operators
Left vs. Right
(Children are indented under their parent, listed left to right.)

Left-associative, 9 - 5 + 2:

list
    list
        list
            digit
                9
        -
        digit
            5
    +
    digit
        2

list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | … | 9

Right-associative, a = b = c:

right
    letter
        a
    =
    right
        letter
            b
        =
        right
            letter
                c

right → letter = right | letter
letter → a | b | c | … | z

Embedding Associativity
• The language of arithmetic expressions with + -
• (ambiguous) grammar that does not enforce associativity
string → string + string | string - string | 0 | 1 | … | 9

• non-ambiguous grammar enforcing left associativity (parse tree will grow to the left)
string → string + digit | string - digit | digit
digit → 0 | 1 | 2 | … | 9

• non-ambiguous grammar enforcing right associativity (parse tree will grow to the right)
string → digit + string | digit - string | digit
digit → 0 | 1 | 2 | … | 9

Other Important Concepts
Operator Precedence

What does 9 + 5 * 2 mean?

Typically the precedence order is (highest first):
( )
* /
+ -

This can be incorporated into a grammar via rules:

expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )
digit → 0 | 1 | 2 | 3 | … | 9

Precedence Achieved by:
• expr & term for each precedence level
• Rules for each are left recursive or associate to the left
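
To see how one non-terminal per precedence level makes * and / bind tighter than + and -, here is a minimal C sketch of an evaluator that follows the expr / term / factor rules. The names `cursor` and `evaluate` are hypothetical, the input is assumed to contain no white space, and the left-recursive rules are written as loops (which still group operators to the left). Error handling is deliberately omitted.

```c
#include <ctype.h>

/* Hypothetical helper state: a cursor into the input string. */
static const char *cursor;

static int expr(void);

/* factor -> digit | ( expr ) */
static int factor(void) {
    if (*cursor == '(') {
        cursor++;                          /* consume '(' */
        int v = expr();
        if (*cursor == ')') cursor++;      /* consume ')' */
        return v;
    }
    if (isdigit((unsigned char)*cursor)) return *cursor++ - '0';
    return 0;                              /* no error handling in this sketch */
}

/* term -> term * factor | term / factor | factor, written as a loop */
static int term(void) {
    int v = factor();
    while (*cursor == '*' || *cursor == '/') {
        char op = *cursor++;
        int rhs = factor();
        if (op == '*') v *= rhs;
        else if (rhs != 0) v /= rhs;       /* division by zero is skipped here */
    }
    return v;
}

/* expr -> expr + term | expr - term | term, written as a loop */
static int expr(void) {
    int v = term();
    while (*cursor == '+' || *cursor == '-') {
        char op = *cursor++;
        int rhs = term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

/* Evaluates a digit expression such as "9+5*2". */
int evaluate(const char *input) {
    cursor = input;
    return expr();
}
```

Because expr only ever sees complete terms, `evaluate("9+5*2")` multiplies before it adds, while `evaluate("(9+5)*2")` lets the parentheses override that order.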


Syntax for Statements

stmt → id := expr
     | if expr then stmt
     | if expr then stmt else stmt
     | while expr do stmt
     | begin opt_stmts end

Ambiguous Grammar?

Syntax-Directed Translation
• Associate Attributes With Grammar Rules and Translate as Parsing occurs
• The translation will follow the parse tree structure (and as a result the structure and form of the parse tree will affect the translation).
• First example: Inductive Translation.
• Infix to Postfix Notation Translation for Expressions
  • Translation defined inductively as: Postfix(E) where E is an Expression.

Rules
1. If E is a variable or constant then Postfix(E) = E
2. If E is E1 op E2 then Postfix(E)
   = Postfix(E1 op E2) = Postfix(E1) Postfix(E2) op
3. If E is (E1) then Postfix(E) = Postfix(E1)

Examples

Postfix( ( 9 - 5 ) + 2 )
= Postfix( ( 9 - 5 ) ) Postfix( 2 ) +
= Postfix( 9 - 5 ) Postfix( 2 ) +
= Postfix( 9 ) Postfix( 5 ) - Postfix( 2 ) +
= 9 5 - 2 +

Postfix( 9 - ( 5 + 2 ) )
= Postfix( 9 ) Postfix( ( 5 + 2 ) ) -
= Postfix( 9 ) Postfix( 5 + 2 ) -
= Postfix( 9 ) Postfix( 5 ) Postfix( 2 ) + -
= 9 5 2 + -
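
The inductive rules can be sketched directly as a recursive C function over an expression tree. The `struct Expr` layout and the function name `postfix` are assumptions made for this illustration; operands are single characters, and rule 3 needs no case of its own because parentheses only decide the shape of the tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical expression node: op == 0 marks a variable or constant leaf. */
struct Expr {
    char op;                          /* '+', '-', ... or 0 for a leaf */
    char leaf;                        /* the operand character when op == 0 */
    const struct Expr *left, *right;  /* sub-expressions E1 and E2 */
};

/* Appends Postfix(e) to out, a NUL-terminated buffer with enough room. */
void postfix(const struct Expr *e, char *out) {
    size_t n;
    if (e->op == 0) {                 /* rule 1: Postfix(E) = E */
        n = strlen(out);
        out[n] = e->leaf;
        out[n + 1] = '\0';
        return;
    }
    postfix(e->left, out);            /* rule 2: Postfix(E1) ... */
    postfix(e->right, out);           /* ... Postfix(E2) ... */
    n = strlen(out);
    out[n] = e->op;                   /* ... followed by op */
    out[n + 1] = '\0';
}
```

Building the tree for ( 9 - 5 ) + 2 and calling `postfix` on its root with an empty buffer should leave the characters 9 5 - 2 + in the buffer, matching the first derivation above.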

Syntax-Directed Definition
• Each Production Has a Set of Semantic Rules
• Each Grammar Symbol Has a Set of Attributes
• For the Following Example, String Attribute "t" is Associated With Each Grammar Symbol

expr → expr - term | expr + term | term
term → 0 | 1 | 2 | 3 | … | 9

• recall: What is a Derivation for 9 + 5 - 2?

list ⇒ list - digit ⇒ list + digit - digit ⇒ digit + digit - digit
     ⇒ 9 + digit - digit ⇒ 9 + 5 - digit ⇒ 9 + 5 - 2

Syntax-Directed Definition (2)
• Each Production Rule of the CFG Has a Semantic Rule

Production                Semantic Rule
expr → expr1 + term       expr.t := expr1.t || term.t || '+'
expr → expr1 - term       expr.t := expr1.t || term.t || '-'
expr → term               expr.t := term.t
term → 0                  term.t := '0'
term → 1                  term.t := '1'
…                         …
term → 9                  term.t := '9'

(expr1 denotes the occurrence of expr on the right side of the production; || is string concatenation.)

• Note: Semantic Rules for expr define t as a "synthesized attribute", i.e., the various copies of t obtain their values from "children t's"

Semantic Rules are Embedded in Parse Tree

Annotated parse tree for 9 - 5 + 2 (children are indented under their parent, listed left to right):

expr.t = 95-2+
    expr.t = 95-
        expr.t = 9
            term.t = 9
                9
        -
        term.t = 5
            5
    +
    term.t = 2
        2

• The attributes are evaluated by a depth-first traversal: it starts at the root and recursively visits the children of each node in left-to-right order
• The semantic rules at a given node are evaluated once all descendants of that node have been visited.
• A parse tree showing all the attribute values at each node is called an annotated parse tree.

Translation Schemes
Embedded Semantic Actions into the right sides of the productions.

expr → expr + term {print('+')}
     | expr - term {print('-')}
     | term
term → 0 {print('0')}
term → 1 {print('1')}
…
term → 9 {print('9')}

A translation scheme is like a syntax-directed definition except the order of evaluation of the semantic rules is explicitly shown.

Parse tree for 9 - 5 + 2 with the actions as extra leaves (children are indented under their parent, listed left to right):

expr
    expr
        expr
            term
                9
                {print('9')}
        -
        term
            5
            {print('5')}
        {print('-')}
    +
    term
        2
        {print('2')}
    {print('+')}
Parsing
Parsing is the process of determining if a string of tokens can be generated by a grammar.

The parser must be capable of constructing the parse tree.

Two types of parser
Top-down:
• starts at root
• proceeds towards leaves

Bottom-up:
• starts at leaves
• proceeds towards root

Parsing – Top-Down & Predictive
• Top-Down Parsing: the parse tree / derivation of a token string occurs in a top down fashion.
• For Example, Consider (type is the start symbol):

type → simple
     | ↑ id
     | array [ simple ] of type
simple → integer
       | char
       | num dotdot num

Suppose input is :
array [ num dotdot num ] of integer
Parsing would begin with
type → ???

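Top-Down Parse (type = start symbol)
(In the trees, children are indented under their parent, listed left to right.)

Input : array [ num dotdot num ] of integer

Step 1. The lookahead symbol is array. Which production expands type?

type
    ?

The production type → array [ simple ] of type is chosen:

type
    array
    [
    simple
    ]
    of
    type

Step 2. The lookahead symbol is num, so simple is expanded by simple → num dotdot num:

type
    array
    [
    simple
        num
        dotdot
        num
    ]
    of
    type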
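Top-Down Parse (type = start symbol)
(In the trees, children are indented under their parent, listed left to right.)

Input : array [ num dotdot num ] of integer

Step 3. The lookahead symbol is integer, so the remaining type is expanded by type → simple:

type
    array
    [
    simple
        num
        dotdot
        num
    ]
    of
    type
        simple

Step 4. simple is expanded by simple → integer:

type
    array
    [
    simple
        num
        dotdot
        num
    ]
    of
    type
        simple
            integer

The selection of a production for a non-terminal may involve trial and error.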

Top-Down Process
Recursive Descent or Predictive Parsing
• Parser Operates by Attempting to Match Tokens in the Input Stream
• Utilize both Grammar and Input Below to Motivate Code for Algorithm

array [ num dotdot num ] of integer

type → simple
     | ↑ id
     | array [ simple ] of type
simple → integer
       | char
       | num dotdot num

procedure match ( t : token ) ;
begin
    if lookahead = t then
        lookahead := nexttoken
    else error
end ;

Top-Down Algorithm (Continued)
procedure type ;
begin
    if lookahead is in { integer, char, num } then simple
    else if lookahead = '↑' then begin match ( '↑' ) ; match ( id ) end
    else if lookahead = array then begin
        match ( array ) ; match ( '[' ) ; simple ; match ( ']' ) ; match ( of ) ; type
    end
    else error
end ;
procedure simple ;
begin
    if lookahead = integer then match ( integer )
    else if lookahead = char then match ( char )
    else if lookahead = num then begin
        match ( num ) ; match ( dotdot ) ; match ( num )
    end
    else error
end ;

Tracing
Input: array [ num dotdot num ] of integer
To initialize the parser:
    set global variable : lookahead = array
    call procedure: type

Procedure call to type with lookahead = array results in the actions:
    match ( array ) ; match ( '[' ) ; simple ; match ( ']' ) ; match ( of ) ; type

Procedure call to simple with lookahead = num results in the actions:
    match ( num ) ; match ( dotdot ) ; match ( num )

Procedure call to type with lookahead = integer results in the actions:
    simple

Procedure call to simple with lookahead = integer results in the actions:
    match ( integer )
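
The procedures match, type and simple traced above can be sketched in C as follows. This is an illustration under assumptions, not the course's reference code: the token names in `enum Token`, the `ok` error flag (used instead of an error procedure that aborts) and the entry point `parse_type` are all hypothetical, and the token stream is passed in as an array rather than produced by a lexical analyzer.

```c
#include <stdbool.h>

/* Token codes; the names are assumptions chosen to mirror the grammar. */
enum Token { T_ARRAY, T_LBRACKET, T_RBRACKET, T_OF, T_INTEGER, T_CHAR,
             T_NUM, T_DOTDOT, T_UPARROW, T_ID, T_END };

static const enum Token *input;   /* remaining tokens, terminated by T_END */
static enum Token lookahead;
static bool ok;                   /* cleared on the first syntax error */

static void match(enum Token t) {
    if (ok && lookahead == t) lookahead = *++input;   /* advance to next token */
    else ok = false;
}

/* simple -> integer | char | num dotdot num */
static void simple(void) {
    if (!ok) return;
    if (lookahead == T_INTEGER) match(T_INTEGER);
    else if (lookahead == T_CHAR) match(T_CHAR);
    else if (lookahead == T_NUM) { match(T_NUM); match(T_DOTDOT); match(T_NUM); }
    else ok = false;
}

/* type -> simple | ^ id | array [ simple ] of type */
static void type(void) {
    if (!ok) return;              /* stop descending once an error was seen */
    if (lookahead == T_INTEGER || lookahead == T_CHAR || lookahead == T_NUM) {
        simple();
    } else if (lookahead == T_UPARROW) {
        match(T_UPARROW); match(T_ID);
    } else if (lookahead == T_ARRAY) {
        match(T_ARRAY); match(T_LBRACKET); simple();
        match(T_RBRACKET); match(T_OF); type();
    } else {
        ok = false;
    }
}

/* Returns true if the whole token array is derivable from type. */
bool parse_type(const enum Token *tokens) {
    input = tokens;
    lookahead = *input;
    ok = true;
    type();
    return ok && lookahead == T_END;
}
```

Each non-terminal becomes one function and each terminal becomes a call to `match`, so the sequence of calls for the traced input is the same as the sequence of actions listed above.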

Limitations
• Can we apply the previous technique to every grammar?
• NO:

type → simple
     | array [ simple ] of type
simple → integer
       | array digit
digit → 0|1|2|3|4|5|6|7|8|9

consider the string "array 6"

the predictive parser starts with type and lookahead = array
apply production type → simple (followed by simple → array digit) OR type → array [ simple ] of type ??

When to Use ε-Productions

The recursive descent parser will use ε-productions as a default when no other production can be used.

stmt → begin opt_stmts end
opt_stmts → stmt_list | ε

While parsing opt_stmts, if the lookahead symbol is not in FIRST(stmt_list), then the ε-production is used.

Designing a Predictive Parser
• Consider A → α
• FIRST(α) = set of leftmost tokens that appear in α or in strings generated by α.
• E.g. FIRST(type) = { ↑, array, integer, char, num }
• Consider productions of the form A → α, A → β: the sets FIRST(α) and FIRST(β) should be disjoint
• Then we can implement predictive parsing
  • Starting with A → ? we find to which FIRST(α) set the lookahead symbol belongs and we use this production.
  • Any non-terminal results in the corresponding procedure call
  • Terminals are matched.

Problems with Top Down Parsing
• Left Recursion in CFG May Cause Parser to Loop Forever.
• Indeed:
  • In the production A → Aα we write the program
    procedure A
    {
        if lookahead belongs to First(Aα) then
            call the procedure A
    }
• Solution: Remove Left Recursion...
  • without changing the Language defined by the Grammar.

Dealing with Left recursion
• Solution: Algorithm to Remove Left Recursion:
BASIC IDEA
A → Aα | β becomes
A → βR
R → αR | ε

Example, before:
expr → expr + term | expr - term | term
term → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

After:
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
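
To check that the rewrite does not change the language, write α for the part that follows the recursive non-terminal and β for the non-recursive alternative, and compare a derivation of the same string in both grammars. A short worked case with two copies of α:

```latex
\begin{align*}
A &\Rightarrow A\alpha \Rightarrow A\alpha\alpha \Rightarrow \beta\alpha\alpha
  && \text{(left-recursive: } A \to A\alpha \mid \beta\text{)} \\
A &\Rightarrow \beta R \Rightarrow \beta\alpha R \Rightarrow \beta\alpha\alpha R
   \Rightarrow \beta\alpha\alpha
  && \text{(rewritten: } A \to \beta R,\ R \to \alpha R \mid \epsilon\text{)}
\end{align*}
```

Both grammars generate β followed by zero or more copies of α. The difference is that the rewritten grammar produces β first, so a top-down parser can consume input before it recurses.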

What happens to semantic actions?

Left-recursive translation scheme:

expr → expr + term {print('+')}
     | expr - term {print('-')}
     | term
term → 0 {print('0')}
term → 1 {print('1')}
…
term → 9 {print('9')}

After removing left recursion:

expr → term rest
rest → + term {print('+')} rest
     | - term {print('-')} rest
     | ε
term → 0 {print('0')}
term → 1 {print('1')}
…
term → 9 {print('9')}

Comparing Grammars with Left Recursion
• Notice Location of Semantic Actions in Tree
• What is Order of Processing?

Tree for 9 - 5 + 2 (children are indented under their parent, listed left to right):

expr
    expr
        expr
            term
                9
                {print('9')}
        -
        term
            5
            {print('5')}
        {print('-')}
    +
    term
        2
        {print('2')}
    {print('+')}

Comparing Grammars without Left Recursion
• Now, Notice Location of Semantic Actions in Tree for Revised Grammar
• What is Order of Processing in this Case?

Tree for 9 - 5 + 2 (children are indented under their parent, listed left to right):

expr
    term
        9
        {print('9')}
    rest
        -
        term
            5
            {print('5')}
        {print('-')}
        rest
            +
            term
                2
                {print('2')}
            {print('+')}
            rest
                ε

Procedure for the Non terminals expr, term, and rest

void expr()                 /* expr -> term rest */
{
    term(); rest();
}

void rest()                 /* rest -> + term rest | - term rest | ε */
{
    if (lookahead == '+') {
        match('+'); term(); putchar('+'); rest();
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); rest();
    }
    else ;                  /* ε-production: do nothing */
}

Procedure for the Non terminals expr, term, and rest (2)

void term()                 /* term -> 0 | 1 | … | 9 */
{
    if (isdigit(lookahead)) {
        putchar(lookahead); match(lookahead);
    }
    else error();
}

Optimizing the translator
Tail recursion
When the last statement executed in a procedure body is a recursive call of the same procedure, the call is said to be tail recursive. Such a call can be replaced by a jump to the beginning of the procedure.

void rest()
{
L:  if (lookahead == '+') {
        match('+'); term(); putchar('+'); goto L;
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); goto L;
    }
    else ;
}

Optimizing the translator

void expr()                 /* rest() has been folded into expr() as a loop */
{
    term();
    while (1)
        if (lookahead == '+') {
            match('+'); term(); putchar('+');
        }
        else if (lookahead == '-') {
            match('-'); term(); putchar('-');
        }
        else break;
}
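
Putting the pieces together, the following is a self-contained C sketch of the infix-to-postfix translator for single digits with + and -. It departs from the slides in ways that are assumptions of this example: it reads from a string and writes to a caller-supplied buffer instead of using standard input and putchar, it records errors in an `ok` flag instead of calling error(), and the entry point `to_postfix` is a hypothetical name.

```c
#include <ctype.h>
#include <stdbool.h>

static const char *in;     /* current input position */
static char *out;          /* next free position in the output buffer */
static int lookahead;      /* current input character, 0 at end of input */
static bool ok;            /* cleared on the first syntax error */

static void match(int t) {
    if (ok && lookahead == t) lookahead = (unsigned char)*++in;
    else ok = false;
}

/* term -> 0 {print('0')} | ... | 9 {print('9')} */
static void term(void) {
    if (ok && isdigit(lookahead)) {
        *out++ = (char)lookahead;      /* the print action */
        match(lookahead);
    } else {
        ok = false;
    }
}

/* expr -> term rest, with rest written as a loop */
static void expr(void) {
    term();
    while (ok && (lookahead == '+' || lookahead == '-')) {
        int op = lookahead;
        match(op);
        term();
        *out++ = (char)op;             /* print the operator after its operands */
    }
}

/* Translates infix to postfix; out_buf needs strlen(infix) + 1 bytes.
   Returns false if the input is not a valid expression. */
bool to_postfix(const char *infix, char *out_buf) {
    in = infix;
    out = out_buf;
    ok = true;
    lookahead = (unsigned char)*in;
    expr();
    *out = '\0';
    return ok && lookahead == '\0';
}
```

For the input string 9-5+2 the buffer ends up holding 95-2+, the same translation the syntax-directed definition produces.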

Lexical Analysis

A lexical analyzer reads and converts the input into a stream of tokens to be analyzed by the parser.

A sequence of input characters that comprises a single token is called a lexeme.

Functional Responsibilities
1. White Space and Comments Are Filtered Out
   • blanks, new lines, tabs are removed
   • modifying the grammar to incorporate white space into the syntax is difficult to implement

Functional Responsibilities (2)

Constants
• The job of collecting digits into integers is generally given to a lexical analyzer because numbers can be treated as single units during translation.
• Let num be the token representing an integer.
• The value of the integer will be passed along as an attribute of the token num
• Example:
  31 + 28 + 59

  <num, 31> <+, > <num, 28> <+, > <num, 59>

NB: The 2nd components of the tuples, the attributes, play no role during parsing, but are needed during translation
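
A minimal C sketch of this digit-collecting step follows. It is an illustration only: it scans a string through a cursor argument rather than standard input, and the token codes `NUM` and `DONE`, the value of `NONE`, and the exact signature of `lexan` are assumptions of this example.

```c
#include <ctype.h>

/* Token codes above the character range; the values are assumptions. */
enum { NUM = 256, DONE = 257 };
enum { NONE = -1 };               /* marks "no attribute" */

int tokenval;                     /* attribute of the most recent token */

/* Returns the next token from *src and advances *src past its lexeme. */
int lexan(const char **src) {
    const char *p = *src;
    int token;

    while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* strip white space */

    if (*p == '\0') {
        tokenval = NONE;
        token = DONE;
    } else if (isdigit((unsigned char)*p)) {
        tokenval = 0;             /* collect this and the following digits */
        while (isdigit((unsigned char)*p))
            tokenval = tokenval * 10 + (*p++ - '0');
        token = NUM;
    } else {
        tokenval = NONE;          /* single-character token such as '+' */
        token = (unsigned char)*p++;
    }
    *src = p;
    return token;
}
```

Calling `lexan` repeatedly on the example input yields NUM with tokenval 31, then '+', NUM with 28, '+', NUM with 59, and finally DONE. The parser only looks at the returned token; tokenval is there for the translation.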
Functional Responsibilities (3)

Recognizing Identifiers and Keywords

Compilers use identifiers as names of
• Variables
• Arrays
• Functions
A grammar for a language treats an identifier as a token
Example:
    credit = asset + goodwill;
The lexical analyzer would convert it to
    id = id + id ;

Functional Responsibilities (3)

Recognizing Identifiers and Keywords (2)

Languages use fixed character strings ( if, while, extern ) to identify certain constructs. We call them keywords.
A mechanism is needed for deciding when a lexeme forms a keyword and when it forms an identifier.

Solution
1. Keywords are reserved.
2. The character string forms an identifier only if it is not a keyword.

Interface to the Lexical Analyzer

Input  --- read character --->  Lexical Analyzer  --- pass token and its attributes --->  Parser
Input  <-- push back character ---  Lexical Analyzer

The lexical analyzer:
1. Reads characters from input
2. Groups them into lexemes
3. Passes the tokens together with attribute values to the later stage

Why push back?

This part (reading and pushing back characters) is implemented with a buffer

The Lexical Analysis Process
A Graphical Depiction

lexan ( ), the lexical analyzer:
• uses getchar ( ) to read character
• pushes back c using ungetc (c , stdin)
• returns token to caller
• sets global variable tokenval to attribute value

Example of a Lexical Analyzer
function lexan : integer ;   { returns an integer encoding of token }
var lexbuf : array [ 0 .. 100 ] of char ;
    c : char ;
begin
    loop begin
        read a character into c ;
        if c is a blank or a tab then
            do nothing
        else if c is a newline then
            lineno := lineno + 1
        else if c is a digit then begin
            set tokenval to the value of this and following digits ;
            return NUM
        end

Algorithm for Lexical Analyzer

        else if c is a letter then begin
            place c and successive letters and digits into lexbuf ;
            p := lookup ( lexbuf ) ;
            if p = 0 then
                p := insert ( lexbuf, ID ) ;
            tokenval := p ;
            return the token field of table entry p
        end
        else begin
            set tokenval to NONE ;   { there is no attribute }
            return integer encoding of character c
        end
    end
end

Note: Insert / Lookup operations occur against the Symbol Table !

Symbol Table Considerations
OPERATIONS: Insert (string, token_ID)
            Lookup (string)
NOTICE: Reserved words are placed into symbol table for easy lookup
Attributes may be associated with each entry, i.e.,
• Semantic Actions
• Typing Info: id → integer
• etc.

ARRAY symtable
index   lexptr (points into lexemes)   token   attributes
0
1       "div"                          div
2       "mod"                          mod
3       "count"                        id
4       "i"                            id

ARRAY lexemes
d i v EOS m o d EOS c o u n t EOS i EOS
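
The two-array layout can be sketched in C as below. The capacities, the token codes and the choice to leave entry 0 unused (so that `lookup` can return 0 for "not found") are assumptions of this example; EOS is represented by the C string terminator '\0'.

```c
#include <string.h>

#define SYMMAX 100                 /* capacities are arbitrary for this sketch */
#define STRMAX 999

enum { DIV = 257, MOD = 258, ID = 259 };   /* token codes are assumptions */

struct entry {
    const char *lexptr;            /* points into lexemes */
    int token;
};

char lexemes[STRMAX];              /* all lexemes, each followed by EOS */
int lastchar = -1;                 /* last used position in lexemes */
struct entry symtable[SYMMAX];
int lastentry = 0;                 /* entry 0 is unused: 0 means "not found" */

/* Returns the index of the entry for s, or 0 if s is not in the table. */
int lookup(const char *s) {
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Adds s with token tok and returns its index, or 0 if the table is full. */
int insert(const char *s, int tok) {
    int len = (int)strlen(s);
    if (lastentry + 1 >= SYMMAX || lastchar + len + 1 >= STRMAX)
        return 0;
    lastentry++;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar += len + 1;           /* lexeme plus its terminating EOS */
    strcpy(&lexemes[lastchar - len], s);
    return lastentry;
}
```

Reserved words are handled by inserting them before any input is read, for example `insert("div", DIV)` and `insert("mod", MOD)`. A later `lookup` of the same lexeme then returns an entry whose token field is the keyword rather than ID.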
Abstract Stack Machines
The front end of a compiler constructs an intermediate representation of
the source program from which the back end generates the target
program.
One popular form of intermediate representation is code for an abstract
stack machine.

I will show you how code will be generated for it.

The properties of the machine


1. Instruction memory
2. Data memory
3. All arithmetic operations are performed on values on a stack

Instructions
Instructions fall into three classes.
1. Integer arithmetic
2. Stack manipulation
3. Control flow

Snapshot of the machine:

Instructions        Stack         Data (contents, address)
1  push 5           16            0      1
2  rvalue 2         7  (top)      11     2
3  +                              7      3
4  rvalue 3                       . . .  4
5  *
6  . . .

(top marks the top of the stack; pc marks the current instruction.)

L-value and R-value

What is the difference between an identifier on the left side and on the right side of an assignment?

L-value vs. R-value of an identifier
I := 5;        L - Location
I := I + 1;    R - Contents
The right side specifies an integer value, while the left side specifies where the value is to be stored.
Usually,
• r-values are what we think of as values
• l-values are locations.

Stack manipulation

push v      push v onto the stack
rvalue l    push contents of data location l
lvalue l    push address of data location l
pop         throw away value on top of the stack
:=          the r-value on top is placed in the l-value below it and both are popped
copy        push a copy of the top value on the stack

Translation of Expressions

day := (1461*y) mod 4 + (153*m + 2) mod 5 + d

Code:
    lvalue day
    push 1461
    rvalue y
    *
    push 4
    mod
    push 153
    rvalue m
    *
    push 2
    +
    push 5
    mod
    +
    rvalue d
    +
    :=

Data memory:
address   contents   name
1         0
2                    day
3         1          y
4         2          m
5         -3         d

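Translation of Expressions (2)

Stack contents after each instruction (bottom of the stack on the left, top on the right):

lvalue day     2
push 1461      2 1461
rvalue y       2 1461 1
*              2 1461
push 4         2 1461 4
mod            2 1
push 153       2 1 153

Translation of Expressions (3)

rvalue m       2 1 153 2
*              2 1 306
push 2         2 1 306 2
+              2 1 308
push 5         2 1 308 5
mod            2 1 3
+              2 4

Translation of Expressions (4)

rvalue d       2 4 -3
+              2 1

Data memory after the final := instruction:
address   contents   name
1         0
2         1          day
3         1          y
4         2          m
5         -3         d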
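
The trace can be reproduced with a small interpreter for the stack machine. The sketch below is an assumption-laden illustration: the `enum Op` names, the fixed stack size, the absence of bounds checks and the helper `compute_day` are all choices made for this example. The data addresses (day = 2, y = 3, m = 4, d = 5) are the ones shown on the slides.

```c
enum Op { PUSH, RVALUE, LVALUE, POP, ASSIGN, COPY, ADD, MUL, MOD };

struct Instr { enum Op op; int arg; };   /* arg is unused by most operations */

/* Runs n instructions against data memory; returns the final stack height.
   No bounds or underflow checks: this is a sketch, not a robust machine. */
int run(const struct Instr *code, int n, int *data) {
    int stack[64];
    int top = 0;                         /* number of values on the stack */
    for (int pc = 0; pc < n; pc++) {
        int a = code[pc].arg;
        switch (code[pc].op) {
        case PUSH:   stack[top++] = a; break;
        case RVALUE: stack[top++] = data[a]; break;        /* contents of l */
        case LVALUE: stack[top++] = a; break;              /* address of l */
        case POP:    top--; break;
        case COPY:   stack[top] = stack[top - 1]; top++; break;
        case ASSIGN: data[stack[top - 2]] = stack[top - 1]; top -= 2; break;
        case ADD:    stack[top - 2] += stack[top - 1]; top--; break;
        case MUL:    stack[top - 2] *= stack[top - 1]; top--; break;
        case MOD:    stack[top - 2] %= stack[top - 1]; top--; break;
        }
    }
    return top;
}

/* day := (1461*y) mod 4 + (153*m + 2) mod 5 + d */
int compute_day(int y, int m, int d) {
    int data[6] = {0, 0, 0, y, m, d};    /* index = data address */
    const struct Instr prog[] = {
        {LVALUE, 2}, {PUSH, 1461}, {RVALUE, 3}, {MUL, 0}, {PUSH, 4}, {MOD, 0},
        {PUSH, 153}, {RVALUE, 4}, {MUL, 0}, {PUSH, 2}, {ADD, 0},
        {PUSH, 5}, {MOD, 0}, {ADD, 0},
        {RVALUE, 5}, {ADD, 0}, {ASSIGN, 0}
    };
    run(prog, (int)(sizeof prog / sizeof prog[0]), data);
    return data[2];                      /* contents of day */
}
```

With the slide's values y = 1, m = 2, d = -3, `compute_day` stores 1 in day: (1461 mod 4) + (308 mod 5) + (-3) = 1 + 3 - 3.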

Control Flow

The control flow instructions for the stack machine are

label l      target of jumps to l; has no other effect
goto l       next instruction is taken from statement with label l
gofalse l    pop the top value; jump if it is zero
gotrue l     pop the top value; jump if it is nonzero
halt         stop execution

Translation of statements

stmt → if expr then stmt1
{ out := newlabel;
  stmt.t := expr.t ||
            'gofalse' out ||
            stmt1.t ||
            'label' out }

Code layout for If:

    code for expr
    gofalse out
    code for stmt1
    label out

Translation of statements (2)

stmt → while expr do stmt1
{ test := newlabel;
  out := newlabel;
  stmt.t := 'label' test ||
            expr.t ||
            'gofalse' out ||
            stmt1.t ||
            'goto' test ||
            'label' out }

Code layout for while:

    label test
    code for expr
    gofalse out
    code for stmt1
    goto test
    label out

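
The two semantic rules can be sketched in C as string builders. Everything beyond the rules themselves is an assumption of this example: labels are printed as L1, L2, ..., each instruction ends with a newline, the code for expr and stmt1 is passed in as already-translated strings, and the names `gen_if` and `gen_while` are hypothetical.

```c
#include <stdio.h>

static int label_count = 0;

/* Returns a fresh label number; stands in for newlabel in the rules. */
int newlabel(void) {
    return ++label_count;
}

/* stmt -> if expr then stmt1
   stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out */
int gen_if(char *buf, size_t size, const char *expr_t, const char *stmt1_t) {
    int out = newlabel();
    return snprintf(buf, size, "%sgofalse L%d\n%slabel L%d\n",
                    expr_t, out, stmt1_t, out);
}

/* stmt -> while expr do stmt1
   stmt.t := 'label' test || expr.t || 'gofalse' out ||
             stmt1.t || 'goto' test || 'label' out */
int gen_while(char *buf, size_t size, const char *expr_t, const char *stmt1_t) {
    int test = newlabel();
    int out = newlabel();
    return snprintf(buf, size,
                    "label L%d\n%sgofalse L%d\n%sgoto L%d\nlabel L%d\n",
                    test, expr_t, out, stmt1_t, test, out);
}
```

Each call draws its own labels from `newlabel`, so nested or consecutive statements never share a jump target. That is why the rules call newlabel instead of using a fixed name.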
Concluding Remarks / Looking Ahead
• We've Reviewed / Highlighted Entire Compilation Process
• Introduced Context-free Grammars (CFG) and Indicated / Illustrated Relationship to Compiler Theory
• Reviewed Many Different Versions of Parse Trees That Assist in Both Recognition and Translation
• We'll Return to Beginning - Lexical Analysis
• We'll Explore Close Relationship of Lexical Analysis to Regular Expressions, Grammars, and Finite Automata


The End
