LP IV Compiler Manual
COLLEGE OF ENGINEERING
Nashik
LABORATORY MANUAL
2018-2019
LABORATORY PRACTICE-IV
[Compiler]
BE-COMPUTER ENGINEERING
SEMESTER-II
Subject Code: 410255
TEACHING SCHEME
Practical: 4 Hrs/Week
EXAMINATION SCHEME
Oral: 50 Marks
Term Work: 50 Marks
-: Name of Faculty:-
Prof. A.R.Jain
Assignment No. 1
Title: Implement a Lexical Analyzer using LEX for a subset of C. Cross check your output with
Stanford LEX.
Roll No.
Class: B.E. (C.E.)
Date
Subject: Laboratory Practice-IV
Signature
Assignment No: 1
Title: Implement a Lexical Analyzer using LEX for a subset of C. Cross check your output with
Stanford LEX.
Aim:
Assignment to understand the syntax of LEX specifications, built-in functions and
variables.
Objectives:
Theory:
Introduction:
LEX is short for lexical analyzer generator. It is a UNIX utility for generating scanners, that is,
programs that recognize lexical patterns in text. These lexical patterns are written as regular
expressions in a particular syntax. Each regular expression may have an associated action, which may
include returning a token. When Lex receives input in the form of a file or text, it reads the input one
character at a time and attempts to match it against the regular expressions. When several patterns
match, Lex chooses the longest match and, among matches of equal length, the rule listed first. Lex
then performs the associated action (which may include returning a token). If, on the other hand, no
regular expression matches, the default action copies the unmatched character to the output. Lex and C
are tightly coupled. A Lex file (files in Lex have the .l extension, e.g. first.l) is passed through the lex
utility, which produces an output file in C (lex.yy.c). The program lex.yy.c basically consists of a
transition table constructed from the regular expressions of first.l. This file is then compiled into the
object program a.out, and the resulting lexical analyzer transforms an input stream into a sequence of
tokens, as shown in fig 1.1.
To generate a lexical analyzer two important things are needed. Firstly it will need a precise
specification of the tokens of the language. Secondly it will need a specification of the action to be
performed on identifying each token.
1. LEX Specifications:
Definition Section :
The Definition Section includes declarations of variables, start conditions, regular definitions, and
manifest constants (a manifest constant is an identifier that is declared to represent a constant, e.g.
#define PIE 3.14).
C code: Any code between %{ and %} is copied verbatim to the C file. This is typically used
for defining file variables, and for prototypes of routines that are defined in the code segment.
Definitions: A definition is very much like a #define cpp directive. For example
letter [a-zA-Z]+
digit [0-9]+
These definitions can be used in the rules section by writing the name in braces: one could start a rule
with {letter}.
State definitions: If a rule depends on context, it's possible to introduce states and
incorporate those in the rules. A state definition looks like %s STATE, and by default a
state INITIAL is already given.
Rule Section:
The second section holds the translation rules, each consisting of a regular expression and the
action associated with it. The translation rules of a Lex program are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
... ...
... ...
pn {action n}
Where, each p is a regular expression and each action is a program fragment describing what action the
lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are written in C.
2. Built - in Functions:
3 yyless(int n) This function can be used to push back all but the first n characters
of the token just read.
4 yymore() This function tells the lexer to append the next token to the current
token.
5 yyerror() This function is used for displaying any error message.
3. Built - in Variables:
1. Regular Expression:
No. RE Meaning
1 a Matches a
2 abc Matches abc
3 [abc] Matches a or b or c
4 [a-f] Matches a, b, c, d, e or f
5 [0-9] Matches any digit
6 x+ Matches one or more of x
7 x* Matches zero or more of x
8 [0-9]+ Matches any integer
9 (…) Groups an expression into a single unit
10 | Alternation (or)
11 (a|b|c)* Is equivalent to [a-c]*
12 x? x is optional (0 or 1 occurrence)
13 if(def)? Matches if or ifdef
14 [A-Za-z] Matches any alphabetical character
15 . Matches any character except newline
16 \. Matches the . character
17 \n Matches the newline character
18 \t Matches the tab character
19 \\ Matches the \ character
20 [ \t] Matches either a space or tab character
21 [^a-d] Matches any character other than a, b, c and d
22 $ End of the line
Algorithm:
a. Declarations
%%
b. Translation rules
%%
c. Auxiliary procedures
a. P1 {action}
b. P2 {action}
c. …
d. …
e. Pn {action}
6. Compile the Lex program with the lex compiler to produce the output file lex.yy.c, then compile
lex.yy.c with the C compiler, e.g.
$ lex filename.l
$ cc lex.yy.c -ll
Conclusion:
LEX is a tool that accepts regular expressions as input and generates C code to recognize
the corresponding tokens. When a token is identified, LEX lets us run a user-defined routine.
When we give an input specification file to LEX, it generates the file lex.yy.c as output. This file
contains the function yylex(), generated by the LEX tool, which holds the C code to recognize
each token and the action to be carried out when the token is found.
Assignment No. 2
Title: Implement a parser for an expression grammar using YACC and LEX for the subset of C.
Cross check your output with Stanford LEX and YACC.
Roll No.
Class: B.E. (C.E.)
Date
Subject: Laboratory Practice-IV
Signature
Assignment No: 2
Title: Implement a parser for an expression grammar using YACC and LEX for the subset of C. Cross
check your output with Stanford LEX and YACC.
Aim: Assignment to understand the basic syntax of YACC specifications, built-in functions and variables.
Objective:
Theory:
A parser generator facilitates the construction of the front end of a compiler. YACC is an LALR parser
generator. It has been used to implement hundreds of compilers. YACC is a command (utility) of the
UNIX system. YACC stands for "Yet Another Compiler-Compiler".
The parser specification is written in a file with the .y extension, e.g. parser.y, which contains the
YACC specification of the translator. The UNIX command yacc transforms
parser.y into a C program called y.tab.c using the LALR method. The program y.tab.c is automatically
generated. We can use the command with the -d option as
yacc -d parser.y
With the -d option two files are generated, namely y.tab.c and y.tab.h. The header file y.tab.h
stores all the token information, so you need not create y.tab.h explicitly.
The program y.tab.c is a representation of an LALR parser written in C, along with other C routines
that the user may have prepared. By compiling y.tab.c with the ly library that contains the LR parsing
program, using the command
cc y.tab.c -ly
we obtain the desired object program a.out that performs the translation specified by the original program.
If other procedures are needed, they can be compiled or loaded with y.tab.c, just as with any C program.
LEX recognizes regular expressions, whereas YACC recognizes entire grammar. LEX divides the
input stream into tokens, while YACC uses these tokens and groups them together logically. LEX
and YACC work together to analyze the program syntactically. The YACC can report conflicts or
ambiguities (if at all) in the form of error messages.
1. YACC Specifications:
Definition Section:
The definitions and programs sections are optional. The definition section handles control
information for the YACC-generated parser and generally sets up the execution environment in
which the parser will operate.
Declaration part:
In the declaration section, the %{ and %} symbols enclose C declarations. This section is also used
for the definition of tokens, union, types, start symbol, and the associativity and precedence of
operators. Tokens declared in this section can then be used in the second and third parts of the Yacc
specification.
In the part of the Yacc specification after the first %% pair, we put the translation rules. Each
rule consists of a grammar production and the associated semantic action. A set of productions
that we have been writing:
… …………
In a Yacc production, unquoted strings of letters and digits not declared to be tokens are taken
to be nonterminals. A quoted single character, e.g. 'c', is taken to be the terminal symbol c, as
well as the integer code for the token represented by that character (i.e., Lex would return the
character code for 'c' to the parser, as an integer). Alternative bodies can be separated by a
vertical bar, and a semicolon follows each head with its alternatives and their semantic actions.
The first head is taken to be the start symbol.
A Yacc semantic action is a sequence of C statements. In a semantic action, the symbol $$ refers
to the attribute value associated with the nonterminal of the head, while $i refers to the value
associated with the ith grammar symbol (terminal or nonterminal) of the body. The semantic
action is performed whenever the parser reduces by the associated production. For example:
exp : exp '+' term { $$ = $1 + $3; }
| term
;
In the above production exp is $1, '+' is $2 and term is $3. The semantic action associated with
the first production adds the values of exp and term and copies the result into $$ (the exp on the
left-hand side). For the second production, we have omitted the semantic action since it just
copies the value. In general { $$ = $1; } is the default semantic action.
The third part of a Yacc specification consists of supporting C routines. YACC generates a
single function called yyparse(). This function requires no parameters and returns 0 on
success or 1 on failure, that is, when a syntax error occurs. The special function yyerror() is called
when YACC encounters invalid syntax. yyerror() is passed a single string (char *)
argument. This function just prints a user-defined message like:
When LEX and YACC work together, the lexical analyzer yylex() produces pairs consisting of
a token and its associated attribute value. If a token such as DIGIT is returned, the value
associated with the token is communicated to the parser through the YACC-defined variable yylval.
Tokens are returned from LEX to YACC, but they are declared in YACC. To link the two, the
LEX program includes the y.tab.h file, which YACC generates when it is run with the
-d option.
2. Built-in Functions:
Function Meaning
yyparse() This is the standard parse routine used for calling the syntax analyzer for the given
translation rules. When yyparse() is called, the parser attempts to parse an input stream.
yyerror() This function is used for displaying an error message when yacc detects a syntax
error.
3. Built-in Types:
Type Meaning
%token Used to declare the tokens used in the grammar.
E.g.: %token NUMBER
%union Token data types are declared in YACC using the YACC declaration %union, like this:
%union
{ char *str;
int num; }
4. Special Characters:
Characters Meanings
% A line with two percent signs separates the parts of a yacc grammar. All
declarations in the definition section start with %, including %{ %}, %start,
%token, %type, %left, %right, %nonassoc and %union.
'' Literal tokens are enclosed in single quotes. E.g.: '+' or '-' or '*' or '\'
etc.
<> In value references in an action, you can override the value's default type
by enclosing the type name in angle brackets.
; Each rule in the rules section should end with a semicolon, except those that are
immediately followed by a rule that starts with a vertical bar.
| When two consecutive rules have the same left-hand side, the second rule is
separated by a vertical bar.
: In the rule section, a colon is used to separate the left-hand side and the right-hand side.
$ ./a.out
Algorithm:
LEX program:
1. Include the header file y.tab.h, which contains the token information, and declare the
variable yylval within %{ and %}.
3. Write the regular expressions for: FOR, OB, CB, SM, CON, EQ, ID, NUM, INC, DEC.
4. If a match is found for a regular expression, write an action that stores the token in yylval.p,
where p is the pointer declared in YACC, and returns the value of the token.
2. Declare the tokens FOR, OB, CB, SM, CON, EQ, ID, NUM, INC, DEC.
4. State the Context Free Grammar for the FOR loop in the rule section and write the appropriate
action for it.
| ID EQ NUM
E2 : ID RELOP ID
| ID RELOP NUM
E3 : ID INC
| ID DEC
7. Define the yyerror() function to display an error message when yacc detects a syntax
error. void yyerror(const char *msg) { if (flag == 0) printf("\n\n Syntax is Wrong"); }
Conclusion:
The yacc command accepts a language that is used to define a grammar for a target language to
be parsed by the tables and code generated by yacc. The language accepted by yacc as a grammar for
the target language is described below using the yacc input language itself.
The input grammar includes rules describing the input structure of the target language and code
to be invoked when these rules are recognized to provide the associated semantic action. The code to be
executed will appear as bodies of text that are intended to be C-language code. The C-language
inclusions are presumed to form a correct function when processed by yacc into its output files.
FAQ’s
Assignment No. 5
Title: Implement the front end of a compiler that generates the three
address code for a simple language.
Roll No.
Class: B.E. (C.E.)
Date
Subject: Laboratory Practice-IV
Signature
Assignment No: 5
Title: Implement the front end of a compiler that generates the three address code for a simple language.
Aim: Write an attributed translation grammar to recognize declarations of simple variables, "for",
assignment, if, if-else statements as per syntax of C or Pascal and generate equivalent three address
code for the given input made up of constructs mentioned above using LEX and YACC. Write a code
to store the identifiers from the input in a symbol table and also to record other relevant information
about the identifiers. Display all records stored in the symbol table.
Theory:
Introduction:
In the analysis-synthesis model of a compiler, the front end analyzes a source program and creates an
intermediate representation, from which the back end generates target code. Ideally, details of the
source language are confined to the front end, and details of the target machine to the back end.
With a suitably defined intermediate representation, a compiler for language i
and machine j can then be built by combining the front end for language i with the back end for
machine j. This approach to creating a suite of compilers can save a considerable amount of effort: m x n
compilers can be built by writing just m front ends and n back ends.
Intermediate Languages:
Three ways of intermediate representation:
Syntax tree
Postfix notation
Three address code
The semantic rules for generating three-address code from common programming language constructs
are similar to those for constructing syntax trees or for generating postfix notation.
Graphical Representations:
1. Syntax tree:
A syntax tree depicts the natural hierarchical structure of a source program. A dag (Directed
Acyclic Graph) gives the same information but in a more compact way because common
subexpressions are identified. A syntax tree and dag for the assignment statement
a := b * -c + b * -c are as follows:
2. Postfix notation:
Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree
in which a node appears immediately after its children. The postfix notation for the syntax tree
given above is
a b c uminus * b c uminus * + assign
3. Three-Address Code:
Three-address code is a sequence of statements of the general form
x := y op z
where x, y and z are names, constants, or compiler-generated temporaries; op stands for any operator,
such as a fixed- or floating-point arithmetic operator, or a logical operator on Boolean-valued data.
Thus a source language expression like x + y * z might be translated into a sequence
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names.
The reason for the term “three-address code” is that each statement usually contains three addresses,
two for the operands and one for the result.
Three-address code is a linearized representation of a syntax tree or a dag in which explicit names
correspond to the interior nodes of the graph. The syntax tree and dag are represented by the three-
address code sequences. Variable names can appear directly in three address statements.
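Conditional jumps such as if x relop y goto L. This instruction applies a relational operator
(<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation relop to
y. If not, the three-address statement following if x relop y goto L is executed next, as in the
usual sequence.
param x and call p, n for procedure calls and return y, where y, representing a returned value, is
optional. For example,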
param x1
param x2
.......
param xn
call p,n
A three-address statement is an abstract form of intermediate code. In a compiler, these statements can
be implemented as records with fields for the operator and the operands. Three such representations
are: Quadruples, Triples, Indirect triples.
A. Quadruples:
A quadruple is a record structure with four fields, which are, op, arg1, arg2 and result.
The op field contains an internal code for the operator. The three-address statement x := y op
z is represented by placing y in arg1, z in arg2 and x in result.
The contents of fields arg1, arg2 and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be entered
into the symbol table as they are created.
Fig a) shows quadruples for the assignment a := b * -c + b * -c
B. Triples:
To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it.
If we do so, three-address statements can be represented by records with only three
fields: op, arg1 and arg2.
The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table
or pointers into the triple structure ( for temporary values ).
Since three fields are used, this intermediate code format is known as triples.
Fig b) shows the triples for the assignment statement a := b * -c + b * -c.
C. Indirect triples:
Indirect triple representation lists pointers to triples rather than listing the triples
themselves.
Let us use an array statement to list pointers to triples in the desired order.
Fig c) shows the indirect triple representation.
$ ./a.out
Algorithm:
Write a LEX and YACC program to generate intermediate code for an arithmetic expression.
LEX program:
1. Include the header files, in particular y.tab.h, which contains the declarations for Letter, Digit, expr.
4. If a match is found, convert it into char and store it in yylval.p, where p is the pointer declared in
YACC.
5. Return the token.
10. End
2. Declare a structure for the three address code representation having the fields argument1, argument2,
operator, result.
8. If the final expression evaluates, add it to the table of three address code.
12. Declare the main function and call the yyparse function until yyin ends.
The Addtotable function adds the argument1, argument2, operator and temporary variable to the
structure array of three address code.
The Three address code function prints the values from the structure in the order temporary
variable, argument1, operator, argument2.
The Quadruple function prints the values from the structure in the order operator, argument1,
argument2, result field.
The Triple function prints the values from the structure in the order argument1, argument2,
operator. The temporary variables in this form are integers (indexes) instead of variables.
Conclusion:
FAQ’s
4. Which representation of three-address code is better than the others, and why? Justify.