Laboratory Manual
Year: 2023-2024
EXPERIMENT NO : 1
OBJECTIVE: On completion of this exercise student will be able to use the Lex and Yacc tools to evaluate arithmetic expressions.
THEORY:
Some of the most time-consuming and tedious parts of writing a compiler are the lexical
scanning and syntax analysis. Fortunately, there is freely available software to assist with these
functions. While these tools will not do everything for you, they enable faster implementation of
the basic functions. Lex and Yacc are the most commonly used packages, with Lex managing
the token recognition and Yacc handling the syntax. They work well together, but each
can also be used individually. Both operate in a similar manner: instructions for
token recognition or grammar rules are written in a special file format. The text files are then read
by lex and/or yacc to produce C code. This resulting source code is compiled to make the final
application. By convention the lexical instruction file has a ".l" suffix and the grammar file has a
".y" suffix.
The file format for a lex file consists of the following basic sections:
• The first is an area for C code that will be placed verbatim at the beginning of the generated
source code. Typically it is used for things like #include directives, #defines, and variable
declarations.
• The next section is for definitions of token patterns to be recognized. These are not mandatory,
but in general they make the next section shorter and easier to read.
• The third section sets the pattern for each token that is to be recognized, and can also include C
code to be called when that token is identified.
• The last section is for more C code (generally subroutines) that will be appended to the end of
the generated C code. This would typically include a main function if lex is to be used by itself.
• The format is applied as follows (the use and placement of the % symbols is necessary); a
complete example follows the skeleton:
%{
//header c code
%}
//definitions
%%
//rules
%%
//subroutines
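For instance, a minimal but complete lex file following this layout might look as below. This is only a sketch: the NUMBER-printing rules and the standalone main are illustrative assumptions, not part of any prescribed exercise.

%{
/* header C code: declarations used by the rule actions */
#include <stdio.h>
%}
DIGIT   [0-9]
%%
{DIGIT}+    { printf("NUMBER: %s\n", yytext); }
[ \t\n]     { /* ignore whitespace */ }
.           { printf("OTHER: %s\n", yytext); }
%%
int main(void)
{
    yylex();    /* scan stdin until end of file */
    return 0;
}
int yywrap(void) { return 1; }   /* tell lex there is no more input */

Running lex (or flex) on this file produces lex.yy.c, which compiles directly with cc lex.yy.c because main and yywrap are supplied in the last section.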
The format for a yacc file is similar, but includes a few extras.
• One area (introduced by %token) is a list of terminal symbols. You do not need to list
single-character ASCII symbols, but anything else, including multi-character symbols (e.g.
"=="), needs to be in this list.
• The next is an area for C code that will be placed verbatim at the beginning of the generated
source code. Typically it is used for things like #include directives, #defines, and variable
declarations.
• The next section is for definitions - none of the following examples utilize this area.
• The fourth section gives the grammar rules, and each rule can also
include C code to be called when that rule is matched.
• The last section is for more C code (generally subroutines) that will be appended to the end of
the generated C code. This would typically include a main function if yacc is to be used by itself.
• The format is applied as follows (the use and placement of the % symbols is necessary); a
complete example follows the skeleton:
%token RESERVED WORDS GO HERE
%{
//header c code
%}
//definitions
%%
//rules
%%
//subroutines
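As a concrete illustration, a minimal yacc grammar in this format might look as below. This is a sketch only: it assumes a companion lex file that returns the NUMBER token (with its value in yylval) and the operator characters as single-character tokens.

%token NUMBER
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s);
%}
%left '+' '-'
%left '*' '/'
%%
line  : expr '\n'        { printf("= %d\n", $1); }
      ;
expr  : expr '+' expr    { $$ = $1 + $3; }
      | expr '-' expr    { $$ = $1 - $3; }
      | expr '*' expr    { $$ = $1 * $3; }
      | expr '/' expr    { $$ = $1 / $3; }
      | '(' expr ')'     { $$ = $2; }
      | NUMBER           { $$ = $1; }
      ;
%%
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }

The %left lines give the usual precedence and associativity, which resolves the ambiguity of this expression grammar.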
These formats and their general usage will be covered in greater detail in the following sections.
In general it is best not to modify the resulting C code, as it is overwritten each time lex or yacc
is run. Most desired functionality can be handled within the lexical and grammar files, but
a few things that are difficult to achieve there may require editing the C file.
EXERCISE:
1. Study the LEX and YACC tools and evaluate an arithmetic expression with parentheses and
unary and binary operators using Flex and Yacc (a calculator).
EVALUATION:
Date: Signature:
EXPERIMENT NO : 2
OBJECTIVE: On completion of this exercise student will be able to construct finite automata
for a given grammar.
THEORY:
EXERCISE:
Construct a finite automaton for the regular expression over {a, b} describing strings that end with abb.
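The required automaton is the standard DFA for (a|b)*abb, with states 0-3 and accepting state 3. As a minimal sketch (the transition-table encoding and the 100-character input limit are assumptions), it can be simulated in C as follows:

#include <stdio.h>

/* DFA for (a|b)*abb; state 3 is the only accepting state. */
int main(void)
{
    /* next[state][0] = transition on 'a', next[state][1] on 'b' */
    static const int next[4][2] = { {1, 0}, {1, 2}, {1, 3}, {1, 0} };
    char s[101];
    if (scanf("%100s", s) != 1)
        return 1;
    int state = 0;
    for (int i = 0; s[i] != '\0'; i++) {
        if (s[i] == 'a')
            state = next[state][0];
        else if (s[i] == 'b')
            state = next[state][1];
        else {
            printf("invalid input symbol\n");
            return 1;
        }
    }
    printf("%s\n", state == 3 ? "accepted" : "rejected");
    return 0;
}

For example, aabb is accepted (states 0, 1, 1, 2, 3) while abab is rejected (it ends in state 2).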
EVALUATION:
Date: Signature:
EXPERIMENT NO : 3
OBJECTIVE: On completion of this exercise student will be able to understand the working of a
lexical analyzer.
THEORY:
Lexical analysis is the first phase of a compiler. It takes the source code, possibly already
modified by language preprocessors, as a stream of characters and breaks it
into a series of tokens, removing any whitespace and comments in the
source code.
If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works
closely with the syntax analyzer: it reads character streams from the source code, checks for
legal tokens, and passes the data to the syntax analyzer on demand.
Lexemes
A lexeme is a sequence of characters in the source program that matches the pattern for a
token. There are predefined rules for every lexeme to be identified as a valid token. These rules
are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these
patterns are defined by means of regular expressions. For example, for the token id the pattern
might be a letter followed by letters or digits, and sum and count are matching lexemes.
Language
A language is a set of strings over some finite alphabet.
Computer languages are sets of strings, and mathematical set operations can be
performed on them. Regular languages can be described by means of regular expressions.
The lexical analyzer needs to scan and identify only the valid strings/tokens/lexemes
that belong to the language in hand. It searches for the patterns defined by the language
rules.
Regular expressions have the capability to express finite languages by defining a pattern for
finite strings of symbols. The grammar defined by regular expressions is known as regular
grammar. The language defined by regular grammar is known as regular language.
Regular expressions are an important notation for specifying patterns. Each pattern matches a
set of strings, so regular expressions serve as names for sets of strings. Programming
language tokens can be described by regular languages. The specification of regular
expressions is an example of a recursive definition. Regular languages are easy to
understand and have efficient implementations.
There are a number of algebraic laws that are obeyed by regular expressions, which can be
used to manipulate regular expressions into equivalent forms.
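For example, writing r, s and t for arbitrary regular expressions, some of these laws are:

r|s = s|r                (| is commutative)
(r|s)|t = r|(s|t)        (| is associative)
r(s|t) = rs|rt           (concatenation distributes over |)
εr = rε = r              (ε is the identity for concatenation)
(r*)* = r*               (* is idempotent)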
EXERCISE:
Design a lexical analyzer for the given language. The lexical analyzer should ignore
redundant spaces, tabs and new lines. It should also ignore comments. Although the syntax
specification states that identifiers can be arbitrarily long, you may restrict the length to
some reasonable value. Simulate the same in the C language.
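A minimal sketch of such an analyzer is given below. It assumes C-style comments and a 32-character identifier limit, and the printed token names are illustrative, not prescribed by the exercise.

#include <ctype.h>
#include <stdio.h>

#define MAXIDLEN 32   /* assumed "reasonable" identifier limit */

int main(void)
{
    int c = getchar();
    while (c != EOF) {
        if (isspace(c)) {                 /* redundant spaces, tabs, newlines */
            c = getchar();
        } else if (c == '/') {            /* possibly a comment */
            int d = getchar();
            if (d == '/') {               /* line comment: skip to newline */
                while ((d = getchar()) != EOF && d != '\n')
                    ;
                c = getchar();
            } else if (d == '*') {        /* block comment: skip past closing */
                int prev = 0;
                while ((d = getchar()) != EOF && !(prev == '*' && d == '/'))
                    prev = d;
                c = getchar();
            } else {
                printf("OPERATOR: /\n");
                c = d;
            }
        } else if (isalpha(c) || c == '_') {   /* identifier */
            char buf[MAXIDLEN + 1];
            int n = 0;
            while (c != EOF && (isalnum(c) || c == '_')) {
                if (n < MAXIDLEN)
                    buf[n++] = (char)c;    /* truncate overly long names */
                c = getchar();
            }
            buf[n] = '\0';
            printf("IDENTIFIER: %s\n", buf);
        } else if (isdigit(c)) {               /* integer constant */
            printf("CONSTANT: ");
            while (c != EOF && isdigit(c)) {
                putchar(c);
                c = getchar();
            }
            putchar('\n');
        } else {                               /* any other single symbol */
            printf("SYMBOL: %c\n", c);
            c = getchar();
        }
    }
    return 0;
}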
REVIEW QUESTIONS:
1. Define lexemes.
2. Define language.
3. Explain the working of the lexical analysis phase.
EVALUATION:
Date: Signature:
EXPERIMENT NO : 4
OBJECTIVE: On completion of this exercise student will be able to understand how a lexical
analyzer converts lexemes into tokens.
THEORY:
The lexical analyzer is the first phase of a compiler. Its main task is to read the input
characters and produce as output a sequence of tokens, which the parser uses for syntax
analysis.
The sentences of a language consist of strings of tokens. A sequence of input characters that
comprises a single token is called a lexeme.
The lexical analyzer performs the following functions to convert lexemes into tokens:
1. Removal of white space and comments:
White space consists of blanks, tabs and new lines. White space and comments are eliminated
by the lexical analyzer so that the parser never has to consider them.
2. The lexical analyzer collects characters from the source file and groups these characters
into one of the following token classes:
I. Identifiers
II. Operators
III. Keywords
IV. Constants
In the C language, identifiers are the names given to variables, constants, functions and user-
defined data types. An identifier must conform to the following set of rules.
Rules for an Identifier:
1. An identifier can only contain alphanumeric characters (a-z, A-Z, 0-9) and underscore (_).
2. The first character of an identifier must be a letter (a-z, A-Z) or underscore (_).
3. Identifiers are case sensitive in C. For example, name and Name are two different
identifiers in C.
4. Keywords are not allowed to be used as identifiers.
5. No special characters, such as semicolon, period, whitespace, slash or comma, are
permitted in or as an identifier. (A small C checker for these rules is sketched below.)
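As an illustration of rules 1, 2 and 4, here is a minimal sketch of a checking function. The five-entry keyword table is a placeholder assumption; C actually has 32 keywords.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* A few C keywords for the check; a full table lists all 32. */
static const char *keywords[] = { "int", "if", "else", "while", "return" };

/* Returns 1 if s is a valid C identifier under the rules above. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)s[0]) && s[0] != '_')      /* rule 2 */
        return 0;
    for (size_t i = 1; s[i] != '\0'; i++)                  /* rule 1 */
        if (!isalnum((unsigned char)s[i]) && s[i] != '_')
            return 0;
    for (size_t i = 0; i < sizeof keywords / sizeof *keywords; i++)
        if (strcmp(s, keywords[i]) == 0)                   /* rule 4 */
            return 0;
    return 1;
}

int main(void)
{
    printf("%d %d %d\n", is_identifier("name"),    /* 1: valid          */
                         is_identifier("2abc"),    /* 0: starts a digit */
                         is_identifier("while"));  /* 0: keyword        */
    return 0;
}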
A lexical analyzer generally does nothing with combinations of tokens, a task left for a
parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does
nothing to ensure that each "(" is matched with a ")".
Consider this expression in the C programming language: sum = 3 + 2;
It is tokenized in the following table:

Lexeme   Token type
sum      Identifier
=        Assignment operator
3        Integer literal
+        Addition operator
2        Integer literal
;        End of statement
Tokens are frequently defined by regular expressions, which are understood by a lexical
analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool
like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream,
and categorizes them into tokens. This is called "tokenizing." If the lexer finds an invalid
token, it will report an error.
Operators:
After tokenizing comes parsing. From there, the interpreted data may be loaded into data
structures for general use, interpretation, or compiling.
The addition operator '+' operates on two operands.
The syntax analyzer only checks whether the plus operator has two operands or not. It does not
check the types of the operands.
Suppose one of the operands is a string and the other an integer: the syntax analyzer does not
report an error, because it only checks whether there are two operands associated with '+' or not.
Procedure:
1. Read the source file from left to right, one character at a time.
2. Using the valid token types, group the characters into tokens belonging to the same class.
3. Save the tokens to the destination file.
EXERCISE:
REVIEW QUESTIONS:
1. Define tokens.
2. Define operators.
3. Define identifiers.
EVALUATION:
Date: Signature:
EXPERIMENT NO: 5
OBJECTIVE: On completion of this exercise student will be able to understand what FIRST and
FOLLOW sets are and how to find the FIRST and FOLLOW sets of a given grammar.
THEORY:
FIRST set:
Given a non-terminal symbol, the next symbol on input should uniquely determine which
alternative of the production to choose. These input symbols are called director symbols.
A production alternative can generate a number of terminal strings. The first symbols of those
strings are director symbols for that alternative. To this end, we wish to calculate the set of
terminal symbols which form the set of first symbols for each non-terminal in the language.
This set of symbols is called the first set.
FIRST (α)
If α is any string of grammar symbols, let FIRST(α) be the set of terminals that begin the strings
derived from α. If α =>* ε, then ε is also in FIRST(α).
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set:
1. If X is a terminal, then FIRST(X) is {X}.
2. If X is a nonterminal and X => Y1 Y2 ... Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1); that is, Y1 ... Yi-1 =>* ε.
If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). For
example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add
nothing more to FIRST(X), but if Y1 =>* ε, then we add FIRST(Y2), and so on.
3. If X => ε is a production, then add ε to FIRST(X).
If a production alternative can generate the empty string, then the symbols that can FOLLOW
the production also qualify as director symbols. Hence, we are also interested in what terminal
symbols may follow a nonterminal symbol. This set of terminal symbols is called the follow
set.
FOLLOW (A)
Define FOLLOW(A), for nonterminal A, to be the set of terminals a that can appear
immediately to the right of A in some sentential form; that is, the set of terminals a such that
there exists a derivation of the form S =>* αAaβ for some α and β. Note that there may, at
some time during the derivation, have been symbols between A and a, but if so, they derived ε
and disappeared. If A can be the rightmost symbol in some sentential form, then $,
representing the input right endmarker, is in FOLLOW(A).
To compute FOLLOW (A) for all nonterminals A, apply the following rules until nothing can
be added to any FOLLOW set:
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A => αBβ, then everything in FIRST(β), except for ε, is placed in
FOLLOW(B).
3. If there is a production A => αB, or a production A => αBβ where FIRST(β) contains ε
(i.e., β =>* ε), then everything in FOLLOW(A) is in FOLLOW(B).
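As a worked example (the standard expression grammar, used here only for illustration), consider:

E  => T E'
E' => + T E' | ε
T  => F T'
T' => * F T' | ε
F  => ( E ) | id

Applying the rules above gives:

FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }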
EXERCISE:
Write a program that will find the FIRST and FOLLOW sets of a given grammar. A sketch of the FIRST computation follows.
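Below is a minimal sketch in C. The grammar encoding is an assumption made for illustration: each production is a string "A=body", uppercase letters are nonterminals, any other character is a terminal, and '#' stands for ε. FOLLOW sets can be computed by a very similar fixed-point loop using rules 1-3 above.

#include <stdio.h>
#include <string.h>

/* The grammar below is the expression grammar from the worked example,
   with X for E', Y for T' and i for id. */
static const char *prods[] = {
    "E=TX", "X=+TX", "X=#", "T=FY", "Y=*FY", "Y=#", "F=(E)", "F=i"
};
#define NPROD (sizeof prods / sizeof prods[0])

static int first[128][128];   /* first[X][c] = 1 iff c is in FIRST(X) */

int main(void)
{
    /* Rule 1: FIRST(terminal) = { terminal } */
    for (int c = 1; c < 128; c++)
        if (c != '#' && !(c >= 'A' && c <= 'Z'))
            first[c][c] = 1;

    /* Rules 2 and 3: iterate to a fixed point */
    int changed = 1;
    while (changed) {
        changed = 0;
        for (size_t p = 0; p < NPROD; p++) {
            int A = prods[p][0];
            const char *rhs = prods[p] + 2;
            int all_nullable = 1;
            for (size_t i = 0; all_nullable && rhs[i]; i++) {
                int Y = rhs[i];
                if (Y == '#')
                    continue;                  /* epsilon body */
                all_nullable = 0;
                for (int c = 1; c < 128; c++)  /* add FIRST(Y) minus eps */
                    if (c != '#' && first[Y][c] && !first[A][c]) {
                        first[A][c] = 1;
                        changed = 1;
                    }
                if (first[Y]['#'])
                    all_nullable = 1;          /* Y can vanish: look at next */
            }
            if (all_nullable && !first[A]['#']) {
                first[A]['#'] = 1;             /* whole body derives epsilon */
                changed = 1;
            }
        }
    }

    /* Print FIRST of every nonterminal that occurs ('#' prints for eps) */
    for (int A = 'A'; A <= 'Z'; A++) {
        int empty = 1;
        for (int c = 1; c < 128; c++)
            if (first[A][c]) empty = 0;
        if (empty) continue;
        printf("FIRST(%c) = {", A);
        for (int c = 1; c < 128; c++)
            if (first[A][c]) printf(" %c", c);
        printf(" }\n");
    }
    return 0;
}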
EVALUATION:
Date: Signature:
EXPERIMENT NO : 6
OBJECTIVE: On completion of this exercise student will be able to understand the predictive parser.
THEORY:
In computer science, a recursive descent parser is a kind of top-down parser built from
a set of mutually recursive procedures (or a non-recursive equivalent) where each such
procedure usually implements one of the production rules of the grammar. Thus the structure
of the resulting program closely mirrors that of the grammar it recognizes.
A predictive parser is a recursive descent parser that does not require backtracking. Predictive
parsing is possible only for the class of LL(k) grammars, which are the context-free
grammars for which there exists some positive integer k that allows a recursive descent parser
to decide which production to use by examining only the next k tokens of input. (The LL(k)
grammars therefore exclude all ambiguous grammars, as well as all grammars that contain
left recursion. Any context-free grammar can be transformed into an equivalent grammar that
has no left recursion, but removal of left recursion does not always yield an LL(k) grammar.)
A predictive parser runs in linear time. Recursive descent with backtracking is a technique
that determines which production to use by trying each production in turn. Recursive
descent with backtracking is not limited to LL(k) grammars, but it is not guaranteed to terminate
unless the grammar is LL(k). Even when they terminate, parsers that use recursive descent
with backtracking may require exponential time.
LL grammars, particularly LL(1) grammars, are of great practical interest, as parsers for these
grammars are easy to construct, and many computer languages are designed to be LL(1) for
this reason. LL parsers are table-based parsers, similar to LR parsers. LL grammars can also
be parsed by recursive descent parsers.
EXERCISE:
Write a program to implement a predictive parser for the grammar below, where @ denotes the
empty string ε. A sketch follows the grammar.
S-> A
A-> Bb | Cd
B-> aB | @
C-> Cc | @
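Note that the production C -> Cc is left-recursive, so it must first be rewritten (C -> Cc | @ generates the same strings as C -> cC | @). A minimal sketch of the resulting predictive parser, assuming single-character terminals read from stdin, is shown below. The lookahead sets follow from FIRST(Bb) = {a, b} and FIRST(Cd) = {c, d}.

#include <stdio.h>
#include <stdlib.h>

/* Predictive parser for
     S -> A,  A -> Bb | Cd,  B -> aB | @,  C -> Cc | @
   with the left-recursive C -> Cc rewritten as C -> cC. */
static const char *ip;                 /* input pointer */

static void reject(void) { printf("rejected\n"); exit(0); }
static void match(char t) { if (*ip == t) ip++; else reject(); }

static void B(void) { if (*ip == 'a') { match('a'); B(); } /* else B -> @ */ }
static void C(void) { if (*ip == 'c') { match('c'); C(); } /* else C -> @ */ }

static void A(void)
{
    if (*ip == 'a' || *ip == 'b') { B(); match('b'); }      /* A -> Bb */
    else if (*ip == 'c' || *ip == 'd') { C(); match('d'); } /* A -> Cd */
    else reject();
}

static void S(void) { A(); }

int main(void)
{
    char buf[101];
    if (scanf("%100s", buf) != 1) return 1;
    ip = buf;
    S();
    printf("%s\n", *ip == '\0' ? "accepted" : "rejected");
    return 0;
}

For example, aab and ccd are accepted, while ad is rejected.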
REVIEW QUESTIONS:
EVALUATION:
Problem Analysis & Solution (3)   Understanding level (3)   Timely completion (2)   Mock (2)   Total (10)
Date: Signature:
EXPERIMENT NO : 7
OBJECTIVE: On completion of this exercise student will be able to understand LR parsers.
THEORY:
LR parsers are used to parse a large class of context-free grammars. This technique is
called LR(k) parsing, where:
L stands for left-to-right scanning of the input,
R stands for constructing a rightmost derivation in reverse, and
k is the number of input symbols of lookahead that are used in making parsing
decisions.
There are three widely used algorithms available for constructing an LR parser:
SLR(1) - Simple LR
o Works on the smallest class of grammars.
o Few states, hence a very small table.
o Simple and fast construction.
LR(1) - LR parser
o Also called the canonical LR parser.
o Works on the complete set of LR(1) grammars.
o Generates a large table and a large number of states.
o Slow construction.
LALR(1) - Lookahead LR parser
o Works on an intermediate size of grammars.
o The number of states is the same as in SLR(1).
Drawbacks of LR parsers
It is too much work to construct an LR parser by hand; an automated parser
generator is needed.
If the grammar contains ambiguities or other hard-to-parse constructs, it is difficult to parse in a
left-to-right scan of the input.
Model of LR Parser
An LR parser consists of an input, an output, a stack, a driver program, and a parsing table that
has two functions:
Action
Goto
The driver program is the same for all LR parsers; only the parsing table changes from one parser
to another.
The parsing program reads characters from an input buffer one at a time. Where a shift-reduce
parser would shift a symbol, an LR parser shifts a state. Each state summarizes the
information contained in the stack below it.
The stack holds a sequence of states s0, s1, ..., sm, where sm is on top.
Fig. LR Parser
Action
This function takes as arguments a state i and a terminal a (or $, the input endmarker).
The value of ACTION[i, a] can have one of four forms:
i) Shift j, where j is a state.
ii) Reduce by a grammar production A ---> β.
iii) Accept.
iv) Error.
Goto
This function takes a state and a grammar symbol as arguments and produces a state.
If GOTO[Ii, A] = Ij, then GOTO also maps state i and nonterminal A to state j.
LR(0) Items
An LR(0) item of a grammar G is a production of G with a dot at some position of the body.
(e.g.)
A ---> •XYZ    A ---> X•YZ    A ---> XY•Z    A ---> XYZ•
The collection of sets of LR(0) items, called the canonical LR(0) collection, provides a finite
automaton that is used to make parsing decisions. Such an automaton is called an LR(0)
automaton.
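To make the model concrete, below is a minimal sketch of the table-driven LR loop, with hand-built SLR(1) tables for the toy grammar 1: S ---> ( S ) and 2: S ---> a. The action encoding (100+j for shift to state j, 200+p for reduce by production p, 999 for accept, 0 for error) is purely an illustrative assumption; the exercise's LR(1) and LALR(1) parsers differ only in how their tables are constructed.

#include <stdio.h>
#include <string.h>

enum { NSTATE = 6, NTERM = 4 };   /* terminals: 0='(', 1=')', 2='a', 3='$' */

static const int ACTION[NSTATE][NTERM] = {
    /*         (     )     a     $   */
    /* 0 */ { 102,    0,  103,    0 },
    /* 1 */ {   0,    0,    0,  999 },
    /* 2 */ { 102,    0,  103,    0 },
    /* 3 */ {   0,  202,    0,  202 },
    /* 4 */ {   0,  105,    0,    0 },
    /* 5 */ {   0,  201,    0,  201 },
};
static const int GOTO_S[NSTATE] = { 1, 0, 4, 0, 0, 0 }; /* GOTO[i, S] */
static const int rhslen[] = { 0, 3, 1 };  /* body lengths of productions 1, 2 */

static int tindex(char c)
{
    switch (c) {
    case '(': return 0;
    case ')': return 1;
    case 'a': return 2;
    case '$': return 3;
    }
    return -1;
}

int main(void)
{
    char in[103];
    if (scanf("%100s", in) != 1) return 1;
    strcat(in, "$");                 /* append the right endmarker */
    int stack[128], top = 0, i = 0;
    stack[0] = 0;                    /* initial state s0 */
    for (;;) {
        int t = tindex(in[i]);
        int act = (t < 0) ? 0 : ACTION[stack[top]][t];
        if (act == 999) { printf("accepted\n"); return 0; }
        if (act >= 200) {            /* reduce by production p = act - 200 */
            int p = act - 200;
            top -= rhslen[p];                     /* pop the body's states */
            stack[top + 1] = GOTO_S[stack[top]];  /* push GOTO on S */
            top++;
        } else if (act >= 100) {     /* shift state j = act - 100 */
            stack[++top] = act - 100;
            i++;
        } else {
            printf("rejected\n"); return 0;
        }
    }
}

For example, the inputs a, (a) and ((a)) are accepted, while (a and aa are rejected.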
EXERCISE:
1) Write a C program to implement LR (1) parser.
2) Write a C program to implement LALR (1) parser.
REVIEW QUESTIONS:
Date: Signature:
EXPERIMENT NO : 8
OBJECTIVE: On completion of this exercise student will be able to understand the operator
precedence parser.
THEORY:
Bottom-up parsers for a large class of context-free grammars can be easily developed
using operator grammars.
Operator grammars have the property that no production right side is empty or has two
adjacent nonterminals. This property enables the implementation of efficient operator-
precedence parsers. Such a parser relies on the following three precedence relations:
Relation   Meaning
a <· b     a yields precedence to b
a =· b     a has the same precedence as b
a ·> b     a takes precedence over b
These operator precedence relations delimit the handles in the right sentential forms:
<· marks the left end, =· appears in the interior of the handle, and ·> marks the right end.
Let us assume that between any two symbols ai and ai+1 there is exactly one precedence relation.
Suppose that $ marks each end of the string. Then for all terminals b we can write $ <· b and
b ·> $. If we remove all nonterminals and place the correct precedence relation
(<·, =·, ·>) between the remaining terminals, there remain strings that can be analyzed by an
easily developed parser.
For example, the following operator precedence relations (tabulated below) can be introduced for
simple expressions. For the input string id1+id2*id3, inserting the precedence relations gives

$ <· id1 ·> + <· id2 ·> * <· id3 ·> $

Having the precedence relations allows identifying handles as follows:
• scan the string from left to right until the first ·> is seen;
• then scan backwards (right to left) until a <· is seen;
• everything between the two relations <· and ·> forms the handle.
      id    +     *     $
id          ·>    ·>    ·>
+     <·    ·>    <·    ·>
*     <·    ·>    ·>    ·>
$     <·    <·    <·
Note that the entire sentential form need not be scanned to find the handle.
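A minimal sketch of this handle-finding loop in C is given below, using the table above with 'i' abbreviating id. The hard-coded input string is an illustrative assumption. Nonterminals never reach the stack here: precedence decisions depend only on terminals, so a reduction simply pops the handle.

#include <stdio.h>
#include <string.h>

static const char syms[] = "i+*$";
/* rel[a][b] is '<', '=', '>' or 0 (blank/error) per the table above */
static const char rel[4][4] = {
    /*         i    +    *    $  */
    /* i */ {  0 , '>', '>', '>' },
    /* + */ { '<', '>', '<', '>' },
    /* * */ { '<', '>', '>', '>' },
    /* $ */ { '<', '<', '<',  0  },
};

static int idx(char c) { return (int)(strchr(syms, c) - syms); }

int main(void)
{
    const char *in = "i+i*i$";       /* id1 + id2 * id3, then endmarker */
    char stack[100] = "$";
    int top = 0, i = 0;
    while (!(stack[top] == '$' && in[i] == '$')) {
        char r = rel[idx(stack[top])][idx(in[i])];
        if (r == '<' || r == '=') {
            stack[++top] = in[i++];  /* shift */
        } else if (r == '>') {       /* reduce: pop back to the last <· */
            do {
                top--;
            } while (rel[idx(stack[top])][idx(stack[top + 1])] != '<');
            printf("reduced a handle beginning with %c\n", stack[top + 1]);
        } else {
            printf("error\n");
            return 1;
        }
    }
    printf("accepted\n");
    return 0;
}

On this input the parser reduces id1, id2, id3, then the * handle, then the + handle, mirroring the scan-right/scan-back procedure described above.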
EXERCISE:
Date: Signature:
EXPERIMENT NO : 9
OBJECTIVE: On completion of this exercise student will be able to understand how a recursive
descent parser works.
THEORY:
Recursive descent parsing is a top-down method of syntax analysis in which a
set of recursive procedures is executed to process the input. A procedure is associated with each
nonterminal of the grammar. Here we consider a special form of recursive descent parsing called
predictive parsing, in which the lookahead symbol unambiguously determines the
procedure/function selected for each nonterminal.
ALGORITHM:
Procedure match(t : token)
Begin
    If lookahead = t then
        lookahead := next token
    Else
        error
End
EXERCISE:
Using the recursive descent parsing method, design a syntax analyzer for any expression in the C
language.
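As a minimal sketch, restricted to single-digit operands and the four arithmetic operators (a full C expression grammar would add many more precedence levels), a recursive descent recognizer built around the match procedure above might look like this:

#include <stdio.h>
#include <stdlib.h>

/* Recursive descent recognizer for
     expr   -> term   { (+|-) term }
     term   -> factor { (*|/) factor }
     factor -> ( expr ) | digit        */
static const char *ip;               /* input pointer (lookahead) */

static void error(void) { printf("invalid expression\n"); exit(0); }
static void match(char t) { if (*ip == t) ip++; else error(); }

static void expr(void);

static void factor(void)
{
    if (*ip == '(') { match('('); expr(); match(')'); }
    else if (*ip >= '0' && *ip <= '9') ip++;        /* single-digit operand */
    else error();
}

static void term(void)
{
    factor();
    while (*ip == '*' || *ip == '/') { ip++; factor(); }
}

static void expr(void)
{
    term();
    while (*ip == '+' || *ip == '-') { ip++; term(); }
}

int main(void)
{
    char buf[101];
    if (scanf("%100s", buf) != 1) return 1;
    ip = buf;
    expr();
    printf("%s\n", *ip == '\0' ? "valid expression" : "invalid expression");
    return 0;
}

For example, (1+2)*3 is accepted, while 1+*2 is rejected.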
EVALUATION:
Date: Signature:
EXPERIMENT NO : 10
OBJECTIVE: On completion of this exercise student will be able to generate intermediate code
in three-address code format.
THEORY:
The intermediate code generation phase of the compiler is responsible for generating code in
postfix notation, as a syntax tree, or in three-address code format. Three-address code is a
sequence of statements of the general form

x := y op z

where x, y and z are names, constants or compiler-generated temporaries, and op stands for any
operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on Boolean-
valued data.
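For example (an illustration only, not part of the prescribed exercise), the assignment a = b + c * d translates into the three-address code

t1 := c * d
t2 := b + t1
a := t2

where t1 and t2 are compiler-generated temporaries, and each statement has at most one operator on its right side.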
EXERCISE:
Using syntax-directed translation and the predictive parsing technique, generate the intermediate
code in three-address code format for an expression in the C language.
EVALUATION:
Date: Signature: