Compiler Design Lab Manual
COURSE OBJECTIVE:
LIST OF EXPERIMENTS:
1. Construction of NFA.
2. Construction of minimized DFA from a given regular expression.
3. Use LEX tool to implement a lexical analyzer.
4. Use YACC and LEX to implement a parser for the grammar.
5. Implement a recursive descent parser for an expression grammar.
6. Construction of operator precedence parse table.
7. Implementation of a symbol table.
8. Implementation of shift-reduce parsing algorithms.
9. Construction of LR parsing table.
10. Generation of code for a given intermediate code.
11. Implementation of code optimization techniques.
Total: 45 hours
1. Construction of NFA
1.1 AIM
1.2 DESCRIPTION
The lexical analyzer is the first stage of the compiler: it reads the input program and
groups the characters into lexemes (the basic elements of the language). In this exercise we are going
to implement the lexical analyzer with an NFA. The steps are:
[Figures: Thompson's construction of an NFA from a regular expression]
To recognize a symbol a in the alphabet: an initial state i with an edge labeled a to a final state f.
NFA for r1 | r2: a new initial state i with epsilon-edges into N(r1) and N(r2), and epsilon-edges from the final states of both into a new final state f.
For the regular expression r1 r2: N(r1) and N(r2) connected in series.
NFA for r*: N(r) wrapped with epsilon-edges that allow it to be skipped or repeated.
Example: the NFAs for b, for (a|b), for (a|b)*, and finally for (a|b)*aba, built by composing these pieces.
1.2.3 Recognizer
A recognizer for a language is a program that takes a string x, and answers “yes” if x
is a sentence of that language, and “no” otherwise.
Pattern        Return value
You can accept the input from the keyboard and you should print the return value
in the following format
2. L = {w | w ends with 00}, with three states. Notice that w only has to end with 00, and
before the two zeros there can be anything. Construct an NFA to recognize L.
2. Construction of minimized DFA from a given regular expression
2.1 AIM
2.2 DESCRIPTION
We may convert a regular expression into a DFA (without creating an NFA first).
First we augment the given regular expression by concatenating it with a special
symbol #.
r → (r)#        (augmented regular expression)
Then, we create a syntax tree for this augmented regular expression.
In this syntax tree, all alphabet symbols (plus # and the empty string) in the
augmented regular expression will be on the leaves, and all inner nodes will be the
operators in that augmented regular expression.
Then each alphabet symbol (plus #) will be numbered (position numbers).
Example -- (a|b)*a#   with positions: a = 1, b = 2, a = 3, # = 4
followpos(1)={1,2,3}
followpos(2)={1,2,3}
followpos(3)={4}
followpos(4)={}
S1=firstpos(root)={1,2,3}
mark S1
a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2    move(S1,a) = S2
b: followpos(2)={1,2,3}=S1 move(S1,b)=S1
mark S2
a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2    move(S2,a) = S2
b: followpos(2)={1,2,3}=S1 move(S2,b)=S1
start state: S1
accepting states: {S2}
[Figure: the resulting DFA. S1 loops on b, an a-edge leads from S1 to S2, S2 loops on a, and a b-edge leads from S2 back to S1.]
Example -- (a|ε)bc*#   with positions: a = 1, b = 2, c = 3, # = 4
followpos(1)={2}
followpos(2)={3,4}
followpos(3)={3,4}
followpos(4)={}
S1=firstpos(root)={1,2}
mark S1
a: followpos(1)={2}=S2 move(S1,a)=S2
b: followpos(2)={3,4}=S3 move(S1,b)=S3
mark S2
b: followpos(2)={3,4}=S3 move(S2,b)=S3
mark S3
c: followpos(3)={3,4}=S3 move(S3,c)=S3
start state: S1
accepting states: {S3}
[Figure: the resulting DFA. An a-edge leads from S1 to S2, b-edges lead from S1 and S2 to S3, and S3 loops on c.]
2.2.2 Recognizer
Algorithm
1. Start in the initial state.
2. Do the following steps until the end of the input:
   a. Check whether there is a transition from the current state on the given input symbol.
   b. If there is a transition, move to that state.
   c. If not, check whether the current state is a final state:
      i. if it is a final state, return the token value;
      ii. otherwise indicate an error.
1. The language corresponding to the regular expression (11)* + (111)* consists of all strings
of 1's whose length is a multiple of 2 or 3. Construct a minimized DFA for the
above regular expression.
2. A small application named Mod4Filter reads lines of text from the standard
input, filters out those that are not binary representations of numbers divisible
by four, and echoes the others to the standard output. Write a C program which
models the behavior of Mod4Filter using an appropriate type of finite automaton.
3. Implementing a Lexical analyzer using LEX tool
3.1 AIM
LEX is a tool widely used to specify lexical analyzers. Using the LEX tool, implement a
lexical analyzer.
3.2 Description
lex.l (LEX source program)  --->  LEX compiler  --->  lex.yy.c
lex.yy.c                    --->  C compiler    --->  a.out
input stream                --->  a.out         --->  sequence of tokens
3.2.1.1 Declaration
Variables
Manifest constants: an identifier that is declared to represent a constant
Regular definition:
The translation rules of the LEX program are statements of the form
p1 {action1}
p2 {action2}
.. …
pn {actionn}
3.2.2 Example
%{
/* Definition of manifest constants LT, LE, EQ,NE,GT,GE,IF,THEN, ELSE, ID,
NUMBER, RELOP*/
%}
/*regular definition*/
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws} {/* no action and no return */}
if {return (IF);}
then {return (THEN);}
else {return (ELSE);}
{id} {yylval = install_id(); return (ID);}
{number} {yylval = install_num(); return (NUMBER);}
"<" {yylval = LT; return (RELOP);}
"<=" {yylval = LE; return (RELOP);}
"=" {yylval = EQ; return (RELOP);}
"<>" {yylval = NE; return (RELOP);}
">" {yylval = GT; return (RELOP);}
">=" {yylval = GE; return (RELOP);}
%%
install_id()
{
/* Procedure to install the lexeme, whose first character is pointed to by yytext and
whose length is yyleng, into the symbol table and return a pointer thereto */
}
install_num()
{
/* Similar procedure to install a lexeme that is a number */
}
4. Implementing a parser using YACC and LEX
4.1 AIM
4.2 DESCRIPTION
spec.y (parser specification)  --->  YACC  --->  parser
Yacc is a command-line tool used to construct a parser. We give it a parser
specification as input, and it produces .h and .c files which contain the actual
code of the parser. The generated code contains one function, yyparse(), which is the parser proper.
Basic Specifications
declarations
%%
rules
%%
programs
Write a C program to implement semantic rules that calculate an expression: the program takes an expression
with digits, + and *, and computes its value.
5. Recursive descent parser
6. Construction of operator precedence parse table
6.1 AIM
6.2 DESCRIPTION:
The operator precedence parser is a bottom-up parsing technique. It is widely used for a
small but important class of grammars. These grammars have the property that no production
right side is ε or has two adjacent nonterminals. Such a grammar is called an operator grammar.
Eg: E -> E+E | E-E | E*E | E/E | (E) | id
In an operator precedence parser we define three disjoint precedence relations, <·, =· and ·>. These
precedence relations guide the selection of handles and have the following meanings.
Rules are designed to select the proper handles to reflect a given set of associativity
and precedence rules for binary operators:
1. If operator θ1 has higher precedence than θ2, make θ1 ·> θ2 and θ2 <· θ1.
2. If θ1 and θ2 are operators of equal precedence, make θ1 ·> θ2 and θ2 ·> θ1 if the operators
are left associative, or θ1 <· θ2 and θ2 <· θ1 if they are right associative.
3. Make θ <· id, id ·> θ, θ <· (, ( <· θ, ) ·> θ, θ ·> ), θ ·> $ and $ <· θ for all operators θ.
4. Also make ( =· ), $ <· (, $ <· id, ( <· (, id ·> $, ) ·> $, ( <· id, id ·> ), ) ·> ).
These rules ensure that both id and (E) will be reduced to E. Also, $ serves as both the
left and right end marker, causing handles to be found between $'s wherever possible.
Example input: i + i * i
6.4 Sample problem:
1. Construct the operator precedence parser for the following grammar:
E -> TE'
E' -> +TE'/@   ("@" represents the empty string)
T -> FT'
T' -> *FT'/@
F -> (E)/ID
7. Symbol Table
7.1 AIM
7.2 DESCRIPTION
After syntax tree have been constructed, the compiler must check whether the input
program is type-correct (called type checking, part of the semantic analysis). During
type checking, a compiler checks whether the use of names (such as variables, functions,
type names) is consistent with their definition in the program. Consequently, it is necessary
to remember declarations so that we can detect inconsistencies and misuses during type
checking. This is the task of a symbol table. Note that a symbol table is a compile-time data
structure. It’s not used during run time by statically typed languages. Formally, a symbol
table maps names into declarations (called attributes), such as mapping the variable name x
to its type int. More specifically, a symbol table supports the following operations:
(1) insert(s, t): return the index of a new entry for string 's' and token 't'.
(2) lookup(s): return the index of the entry for string 's', or 0 if 's'
is not found.
(3) begin_scope () and end_scope () : When we have a new block (ie, when we encounter
the token {), we begin a new scope. When we exit a block (i.e. when we encounter the
token }) we remove the scope (this is the end_scope). When we remove a scope, we
remove all declarations inside this scope. So basically, scopes behave like stacks. One
way to implement these functions is to use a stack. When we begin a new scope we push
a special marker to the stack (e.g., -1). When we insert a new declaration in the hash table
using insert, we also push the bucket number to the stack. When we end a scope, we pop
the stack until and including the first –1 marker.
(4) Handling reserved keywords: the symbol table also handles reserved keywords like
'PLUS', 'MINUS', 'MUL', etc. This can be done in the following manner:
insert ("PLUS", PLUS);
In this case the first 'PLUS' indicates the lexeme and the second one indicates the token.
[Figure: the symbol-table array for the input a + b AND c - d, with columns
Lexeme_pointer, Token, Attribute and Position; entries 1 through 7 hold
a (id), plus, b (id), AND, c (id), minus and d (id), in input order.]
When the lexical analyzer reads a letter, it starts saving letters and digits in a buffer lex_buffer. The
string collected in lex_buffer is then looked up in the symbol table, using the lookup operation. Since the
symbol table is initialized with entries for the keywords plus, minus and the AND operator, and for some
identifiers, as shown in the figure above, the lookup operation will find these entries if lex_buffer
contains one of these keywords. If there is no entry for the string in lex_buffer, i.e., lookup returns 0,
then lex_buffer contains a lexeme for a new identifier. An entry for the identifier is created using
insert(). After the insertion is made, n is the index of the symbol-table entry for the string in
lex_buffer. This index is communicated to the parser by setting tokenval to n, and the token in the token
field of the entry is returned.
a + b AND c-d
BEGIN
PRINT “HELLO”
INTEGER A, B, C
REAL D, E
STRING X, Y
A := 2
B := 4
C := 6
D := -3.56E-8
E := 4.567
X := “text1”
Y := “hello there”
PRINT “Values of integers are [A], [B], [C]”
FOR I:= 1 TO 5 STEP +2
PRINT “[I]”
PRINT “Strings are [X] and [Y]”
END
HELLO
Values of integers are 2, 4, 6
1
3
5
Strings are text1 and hello there.
Generate a symbol table to handle the variables and their types etc. An output file called symtab.sym will be created
which will contain the relevant data.
8. Shift Reduce Parsing
8.1 AIM
8.2 Description
A shift-reduce parser uses a parse stack which (conceptually) contains grammar symbols.
During the operation of the parser, symbols from the input are shifted onto the stack. If a prefix of
the symbols on top of the stack matches the RHS of a grammar rule which is the correct rule to use
within the current context, then the parser reduces the RHS of the rule to its LHS, replacing the
RHS symbols on top of the stack with the nonterminal occurring on the LHS of the rule. This shift-
reduce process continues until the parser terminates, reporting either success or failure. It terminates
with success when the input is legal and is accepted by the parser. It terminates with failure if an
error is detected in the input.
The parser is nothing but a stack automaton which may be in one of several discrete states. A
state is usually represented simply as an integer. In reality, the parse stack contains states, rather than
grammar symbols. However, since each state corresponds to a unique grammar symbol, the state
stack can be mapped onto the grammar symbol stack mentioned earlier.
The action table is a table with rows indexed by states and columns indexed by terminal
symbols. When the parser is in some state s and the current lookahead terminal is t, the action
taken by the parser depends on the contents of action[s][t], which can contain four different
kinds of entries:
Shift s'
Shift state s' onto the parse stack.
Reduce r
Reduce by rule r. This is explained in more detail below.
Accept
Terminate the parse with success, accepting the input.
Error
Signal a parse error.
The goto table is a table with rows indexed by states and columns indexed by nonterminal
symbols. When the parser is in state s immediately after reducing by rule N, then the next
state to enter is given by goto[s][N].
The current state of a shift-reduce parser is the state on top of the state stack. The detailed
operation of such a parser is as follows:
1. Initialize the parse stack to contain a single state s0, where s0 is the distinguished initial state
of the parser.
2. Use the state s on top of the parse stack and the current lookahead t to consult the action table
entry action[s][t]:
If the action table entry is shift s' then push state s' onto the stack and advance the
input so that the lookahead is set to the next token.
If the action table entry is reduce r and rule r has m symbols in its RHS, then pop m
symbols off the parse stack. Let s' be the state now revealed on top of the parse stack
and N be the LHS nonterminal for rule r. Then consult the goto table and push the
state given by goto[s'][N] onto the stack. The lookahead token is not changed by this
step.
If the action table entry is accept, then terminate the parse with success.
If the action table entry is error, then signal an error.
3. Repeat step (2) until the parser terminates.
One possible set of shift-reduce parsing tables is shown below (sn denotes shift n, rn denotes
reduce n, acc denotes accept and blank entries denote error entries):
Parser Tables

           Action Table                     Goto Table
 state   ID    ':='   '+'   '-'   <EOF>    stmt   expr
   0     s1                                 g2
   1           s3
   2                               s4
   3     s5                                        g6
   4     acc   acc    acc   acc    acc
   5     r4    r4     r4    r4     r4
   6     r1    r1     s7    s8     r1
   7     s9
   8     s10
   9     r2    r2     r2    r2     r2
  10     r3    r3     r3    r3     r3
The four parser actions:
1. Shift. Shift the next input symbol onto the top of the stack.
2. Reduce. The right end of the handle (the string to be reduced) must be at the top of the
stack. Locate the left end of the handle within the stack and decide with what
nonterminal to replace it.
3. Accept. Announce successful completion of parsing.
4. Error. Discover a syntax error and call an error recovery routine.
8.6 Sample Problem:
1. Design a parser which accepts a mathematical expression (containing integers only). If the expression is valid,
evaluate the expression; otherwise report that the expression is invalid.
[Note: Design first the Grammar and then implement using Shift-Reduce parsing technique. Your program should
generate an output file clearly showing each step of parsing/evaluation of the intermediate sub-expressions. ]
9. LR PARSER
9.1 AIM
9.2 Description
The LR(0) automaton is a DFA which accepts viable prefixes of right sentential
forms, ending in a handle.
States of the NFA correspond to LR(0) Items.
Use subset construction algorithm to convert NFA to DFA.
LR(0) Item: A grammar rule with a dot (·) added between symbols on the RHS.
Example: the production rule A -> XYZ yields four items:
A -> ·XYZ, A -> X·YZ, A -> XY·Z, A -> XYZ·.
Items indicate how much of a production has been seen at a given point in the parsing process.
A -> α·β indicates that a string derivable from α has appeared on the input and a string derivable
from β is expected on the input.
A -> αβ· indicates that input derivable from the whole RHS has been seen and a reduction
may occur.
Kernel Items: All items whose dots are not at the beginning of the RHS, plus the augmented
initial item S' -> ·S.
NFA of LR(0) Items: formed using each item as a state of NFA with transitions
corresponding to movement of dots by one symbol.
Items with a dot preceding a nonterminal have epsilon transitions to all items beginning with
that nonterminal.
LR(0) Automaton is the DFA formed by subset construction of the LR(0) NFA.
Input: Input string w and an LR parsing table with functions action and goto for a grammar G.
Method: Initially the parser has s0, the initial state, on its stack, and w$ in the input buffer.
It then repeats the shift/reduce steps described in Experiment 8 until it either accepts the
input or reports an error.
9.4 Sample Input:
(1) E -> E * B
(2) E -> E + B
(3) E -> B
(4) B -> 0
(5) B -> 1
           action                     goto
 state   *    +    0    1    $       E    B
   0               s1   s2           3    4
   1     r4   r4   r4   r4   r4
   2     r5   r5   r5   r5   r5
   3     s5   s6             acc
   4     r3   r3   r3   r3   r3
   5               s1   s2                7
   6               s1   s2                8
   7     r1   r1   r1   r1   r1
   8     r2   r2   r2   r2   r2
9.6 Sample problem:
10. Intermediate code Generator
10.1 AIM
Write a C Program to generate intermediate code which converts source program to target code.
10.2 Description
Intermediate codes are machine independent codes, but they are close to machine
instructions.
The given program in a source language is converted to an equivalent program in an
intermediate language by the intermediate code generator.
Intermediate language can be many different languages, and the designer of the compiler
decides this intermediate language.
Syntax trees can be used as an intermediate language.
Postfix notation can be used as an intermediate language.
Three-address code (quadruples) can be used as an intermediate language;
we will use quadruples to discuss intermediate code generation.
Quadruples are close to machine instructions, but they are not actual machine
instructions.
Some programming languages have well-defined intermediate languages:
Java - the Java virtual machine
Prolog - the Warren abstract machine
In fact, there are byte-code emulators to execute instructions in these
intermediate languages.
Observe that given the syntax-tree or the dag of the graphical representation we can easily derive a
three address code for assignments as above.
10.3 Sample Input:
t0 = k + 8
t1 = c – s
t2 = t0 * t1
a = t2
11. Code Optimization Technique
11.1 AIM
11.2 DESCRIPTION
Techniques used in optimization can be broken up among various scopes which can affect anything
from a single statement to the entire program. Generally speaking, locally scoped techniques are
easier to implement than global ones but result in smaller gains. Some examples of scopes include:
Peephole optimizations
Local optimizations
Global optimizations
Loop optimizations
Machine code optimization