21CS601
Compiler Design
Department: CSE
Batch/Year: 2021-25 / III
Created by:
Dr. P. EZHUMALAI, Prof & Head/RMDEC
Dr. A. K. JAITHUNBI, Associate Professor/RMDEC
V.SHARMILA, Assistant Professor/RMDEC
Date: 05.01.2024
1. CONTENTS

S.No   Contents
1      Course Objectives
2      Pre Requisites
3      Syllabus
4      Course Outcomes
6      Lecture Plan
8      Lecture Notes
9      Assignments
11     Part B Questions
15     Assessment Schedule
2. COURSE OBJECTIVES
3. PRE REQUISITES
• Pre-requisite Chart
21MA302 - Discrete Mathematics
21CS201 - Data Structures
21CS02  - Python Programming (Lab Integrated)
21GE101 - Problem Solving and C Programming
4. SYLLABUS
21CS601   COMPILER DESIGN (Lab Integrated)   L T P C
                                             3 0 2 4
OBJECTIVES
LIST OF EXPERIMENTS:
1. Develop a lexical analyzer to recognize a few patterns in C. (Ex.
identifiers, constants, comments, operators etc.). Create a symbol
table, while recognizing identifiers.
2. Design a lexical analyzer for the given language. The lexical analyzer
should ignore redundant spaces, tabs and new lines, comments etc.
3. Implement a Lexical Analyzer using Lex Tool.
4. Design Predictive Parser for the given language.
5. Implement an Arithmetic Calculator using LEX and YACC.
6. Generate three address code for a simple program using LEX and YACC.
7. Implement simple code optimization techniques (Constant folding,
Strength reduction and Algebraic transformation).
8. Implement the back-end of the compiler which takes the three address code
as input and produces the 8086 assembly language code as output.
5. COURSE OUTCOME
CO2 (K3): Construct the parse tree and check the syntax of the given source
program using a parser and the YACC tool.
6. CO - PO / PSO MAPPING
CO1 K2 3 2 1 - - - - 1 1 1 - 1 2 - -
CO2 K3 3 2 1 - - - - 1 1 1 - 1 2 - -
CO3 K4 3 2 1 - - - - 1 1 1 - 1 2 - -
CO4 K4 3 2 1 - - - - 1 1 1 - 1 2 - -
CO5 K3 3 2 1 - - - - 1 1 1 - 1 2 - -
7. LECTURE PLAN : UNIT – I
INTRODUCTION TO COMPILERS
• To understand the basic concepts of compilers, students can take a quiz as an
activity:
https://sites.google.com/site/wonsunahn/teaching/cs-0449-systems-
software/kahoot-quiz
Video links:
https://youtu.be/l-JTDDCRBss
https://youtu.be/AdXideQrkPE
https://youtu.be/XPH_hoP9z40
https://youtu.be/NmOEbti9gzY
https://youtu.be/IFBb84Ec_Cg
• Hands-On Assignment:
9. LECTURE NOTES
UNIT – I
INTRODUCTION TO COMPILERS
Preprocessor
A preprocessor processes the source program before the compiler proper runs.
Typical functions include:
1. Macro processing: a preprocessor may allow a user to define macros that are
shorthands for longer constructs.
2. File inclusion: a preprocessor may include header files into the program
text.
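For illustration, here is a small C fragment exercising both functions; the
preprocessor expands it textually before the compiler proper runs (the macro
names PI and AREA are our own examples, not from the text):

#include <stdio.h>        /* file inclusion: the header text is pasted in */

#define PI 3.14159        /* macro processing: PI is replaced textually */
#define AREA(r) (PI * (r) * (r))

int main(void) {
    /* after preprocessing this becomes (3.14159 * (2.0) * (2.0)) */
    printf("%f\n", AREA(2.0));
    return 0;
}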
COMPILER
[Figure: Source Program → Compiler → Target Program, with Error Messages
reported to the user]
INTERPRETER: An interpreter is a program that appears to execute a source
program as if it were machine language.
[Figure: Source Program and Data → Interpreter → Program Output]
An interpreter typically performs:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Advantages:
Disadvantages:
Loader and Link-editor:
Once the assembler produces an object program, that program must be placed
into memory and executed. The assembler could place the object program directly
in memory and transfer control to it, thereby causing the machine language
program to be executed. However, this would waste core by leaving the assembler
in memory while the user's program was being executed. Also, the programmer
would have to retranslate the program with each execution, thus wasting
translation time. To overcome this problem of wasted translation time and
memory, system programmers developed another component called the loader.
"A loader is a program that places programs into memory and prepares them for
execution." It would be more efficient if subroutines could be translated into
an object form that the loader could "relocate" directly behind the user's
program. The task of adjusting programs so that they may be placed in arbitrary
core locations is called relocation. Relocating loaders perform four functions.
TRANSLATOR
A translator is a program that takes as input a program written in one language
and produces as output a program in another language. Besides program
translation, the translator performs another very important role: error
detection. Any violation of the HLL (High Level Language) specification is
detected and reported to the programmer. The important roles of a translator
are:
• Translating the HLL program input into an equivalent machine language program.
• Providing diagnostic messages wherever the programmer violates the
specification of the HLL.
TYPES OF TRANSLATORS:
1.2 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler: A compiler operates in phases. A phase is a logically
interrelated operation that takes the source program in one representation and
produces output in another representation. The phases of a compiler are shown
below.
There are two parts of compilation:
a. Analysis (Machine Independent / Language Dependent)
b. Synthesis (Machine Dependent / Language Independent)
PHASES OF A COMPILER
The compilation process is partitioned into a number of sub-processes called
'phases'.
[Figure: the phases of a compiler]
Lexical Analysis:-
The lexical analyzer (LA), or scanner, reads the source program one character
at a time, carving the source program into a sequence of atomic units called
tokens.
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this
phase expressions, statements, declarations etc. are identified by using the
results of lexical analysis. Syntax analysis is aided by using techniques based
on the formal grammar of the programming language.
Intermediate Code Generation:-
An intermediate representation of the final machine language code is produced.
This phase bridges the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to
reduce the length of the machine language program are carried out during this
phase. The output of the code generator is the machine language program for the
specified computer.
The parser has two functions. It checks whether the tokens from the lexical
analyzer occur in patterns that are permitted by the specification of the
source language. It also imposes on the tokens a tree-like structure that is
used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, then after lexical
analysis this expression might appear to the syntax analyzer as the token
sequence id+/id. On seeing the /, the syntax analyzer should detect an error
situation, because the presence of these two adjacent binary operators violates
the formation rules of an expression.
Syntax analysis makes explicit the hierarchical structure of the incoming token
stream by identifying which parts of the token stream should be grouped.
Intermediate code generation uses the structure produced by the syntax analyzer
to create a stream of simple instructions. Many styles of intermediate code are
possible. One common style uses instructions with one operator and a small
number of operands.
The output of the syntax analyzer is some representation of a parse tree. The
intermediate code generation phase transforms this parse tree into an
intermediate language representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the
output runs faster and takes less space. Its output is another intermediate
code program that does the same job as the original, but in a way that saves
time and/or space.
1. Local Optimization:-
There are local transformations that can be applied to a program to make an
improvement. For example, the code
If A > B goto L2
Goto L3
L2:
can be replaced by the single statement
If A <= B goto L3
(note that the negation of A > B is A <= B). Another important local
optimization is the elimination of common sub-expressions. For example,
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
taking advantage of the common sub-expression B + C.
2. Loop Optimization:-
Another important source of optimization concerns increasing the speed of
loops. A typical loop improvement is to move a computation that produces the
same result each time around the loop to a point in the program just before the
loop is entered, as the sketch below illustrates.
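A small C illustration of this transformation (often called loop-invariant code
motion); the function names are our own and the "after" version is a sketch of
what an optimizer would produce, not the output of any particular compiler:

/* Before: the product x * y is recomputed on every iteration,
   although it never changes inside the loop. */
void scale_before(int *a, int n, int x, int y) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] * (x * y);
}

/* After: the invariant computation is hoisted to just before the loop. */
void scale_after(int *a, int n, int x, int y) {
    int t = x * y;               /* computed once */
    for (int i = 0; i < n; i++)
        a[i] = a[i] * t;
}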
Code generator :-
The code generator produces the object code by deciding on the memory locations
for data, selecting code to access each datum, and selecting the registers in
which each computation is to be done. Many computers have only a few high-speed
registers in which computations can be performed quickly. A good code generator
would attempt to utilize registers as efficiently as possible.
Error Handling :-
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the
programmer to determine exactly where the errors have occurred. Errors may
occur in any of the phases of a compiler.
Whenever a phase of the compiler discovers an error, it must report the error
to the error handler, which issues an appropriate diagnostic message. Both the
table-management and error-handling routines interact with all phases of the
compiler.
1.3 LEXICAL ANALYSIS
Upon receiving a ‘get next token’ command from the parser, the lexical analyzer
reads input characters until it can identify the next token.
TOKENS
A token can look like anything that is useful for processing an input text
stream or text file. Consider this expression in the C programming language:
sum = 3 + 2;

Lexeme   Token
sum      Identifier
=        Assignment operator
3        Number
+        Addition operator
2        Number
;        End of statement
LEXEME:
A collection or group of characters forming a token is called a lexeme.
PATTERN:
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.
Attributes for Tokens
Some tokens have attributes that can be passed back to the parser. The lexical
analyzer collects information about tokens into their associated attributes. The
attributes influence the translation of tokens.
a. Constant : value of the constant
b. Identifiers: pointer to the corresponding symbol table
entry.
ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:
The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transposing two adjacent characters.
5) Panic mode recovery: Deletion of successive characters from the
token until error is resolved.
INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we
can be sure we have the right lexeme. As characters are read from left to right,
each character is stored in the buffer to form a meaningful token as shown below:
[Figure: input buffer holding A = B + C, with the forward pointer scanning
ahead of the beginning of the lexeme]
Each buffer is of the same size N, and N is usually the number of characters on one
disk block. E.g., 1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character,
represented by eof, marks the end of the source file.
Two pointers to the input are maintained:
1. Pointer lexeme_beginning, marks the beginning of the current lexeme,
whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end.
The string of characters between the two pointers is the current lexeme. After the
lexeme is recorded as an attribute value of a token returned to the parser,
lexeme_beginning is set to the character immediately after the lexeme just found.
Code to advance the forward pointer in the two-buffer scheme (without
sentinels):
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
SENTINELS
For each character read, we make two tests: one for the end of the buffer, and one
to determine what character is read. We can combine the buffer-end test with the
test for the current character if we extend each buffer to hold a sentinel character
at the end.
The sentinel is a special character that cannot be part of the source program, and a
natural choice is the character eof.
The sentinel arrangement is as shown below:
Note that eof retains its use as a marker for the end of the entire input. Any eof
that appears other than at the end of a buffer means that the input is at an end.
Code to advance forward pointer:
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
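The same sentinel scheme can be sketched in C as follows. This is a minimal
illustration only, assuming two buffer halves of size N, the character '\0'
used as the sentinel, and a helper fill_half() of our own; a real scanner would
also maintain the lexeme_beginning pointer:

#include <stdio.h>

#define N 4096                      /* size of each buffer half */
#define SENTINEL '\0'               /* assumed not to occur in the source */

static char buf[2 * N + 2];         /* two halves, each ending in a sentinel */
static char *forward = buf;         /* scanning pointer */

/* Read up to N characters into one half and plant the sentinel
   immediately after the last character read. */
static void fill_half(char *half, FILE *src) {
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;
}

/* Advance the forward pointer; only one test per character is needed
   in the common case, because a sentinel ends each half. */
static int next_char(FILE *src) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1) {            /* end of first half */
            fill_half(buf + N + 1, src);         /* reload second half */
        } else if (forward == buf + 2 * N + 2) { /* end of second half */
            fill_half(buf, src);                 /* reload first half */
            forward = buf;
        } else {
            return EOF;             /* sentinel inside a half: end of input */
        }
        c = *forward++;
        if (c == SENTINEL) return EOF;           /* refill was empty */
    }
    return (unsigned char) c;
}

/* Usage: call fill_half(buf, src) once, then next_char(src) repeatedly. */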
1.4 SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
1) Strings
2) Language
3) Regular expression
Strings and Languages
An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet. A
language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for
"string." The length of a string s, usually written |s|, is the number of occurrences of
symbols in s. For example, banana is a string of length six. The empty string, denoted ε, is
the string of length zero.
Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols
from the end of string s.
For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana is a suffix of banana.
Operations on languages:
The following are the operations that can be applied to languages:
1. Union   2. Concatenation   3. Kleene closure   4. Positive closure
The following example shows these operations on languages. Let L = {0,1} and
S = {a,b,c}:
1. Union: L ∪ S = {0, 1, a, b, c}
2. Concatenation: L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure: L* = {ε, 0, 1, 00, ...}
4. Positive closure: L+ = {0, 1, 00, ...}
Regular Expressions
Each regular expression r denotes a language L(r).
Here are the rules that define the regular expressions over some alphabet Σ and
the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole
member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that
is, the language with one string, of length one, with a in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and
L(s). Then, a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
Under these conventions, for example, (a)|((b)*(c)) and a|b*c denote the same
language.
Regular set
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are
equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used
to manipulate regular expressions into equivalent forms.
For instance, r|s = s|r (| is commutative); r|(s|t) = (r|s)|t (| is associative).
Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If
Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.
Example: Identifiers form the set of strings of letters and digits beginning
with a letter. The regular definition for this set:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit )*
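The id definition can be checked quickly with the POSIX regex library in C.
This is just an illustrative test harness of our own; the pattern string
mirrors letter ( letter | digit )*:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* letter (letter | digit)*, anchored so the whole string must match */
    regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB);

    const char *samples[] = { "sum", "x1", "1x", "rate!" };
    for (int i = 0; i < 4; i++)
        printf("%-6s -> %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "id" : "not an id");

    regfree(&re);
    return 0;
}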
Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to
introduce notational shorthands for them.
1. One or more instances (+):
· The unary postfix operator + means "one or more instances of".
· If r is a regular expression that denotes the language L(r), then (r)+ is a
regular expression that denotes the language (L(r))+.
· Thus the regular expression a+ denotes the set of all strings of one or more
a's.
· The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?):
· The unary postfix operator ? means "zero or one instance of".
· The notation r? is a shorthand for r | ε.
· If r is a regular expression, then (r)? is a regular expression that denotes
the language L(r) ∪ {ε}.
3. Character Classes:
1.5 RECOGNITION OF TOKENS
Consider the following grammar fragment:
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
where the terminals if , then, else, relop, id and num generate sets of strings given
by the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ (. digit+)? (E (+|-)? digit+)?
For this language fragment the lexical analyzer will recognize the keywords if, then,
else,
as well as the lexemes denoted by relop, id, and num. To simplify matters, we
assume keywords are reserved; that is, they cannot be used as identifiers.
Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a
lexical analyzer is called by the parser to get the next token. It is used to keep
track of information about the characters that are seen as the forward pointer
scans the input.
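As an illustration, the transition diagram for relop can be simulated in C with
a state variable collapsed into nested tests. The helpers nextch() and
retract() and the token codes are assumed scaffolding of our own, not something
defined in these notes:

enum { RELOP = 256, LT, LE, EQ, NE, GT, GE };

typedef struct { int name; int attribute; } Token;

extern int  nextch(void);   /* assumed: returns the next input character */
extern void retract(void);  /* assumed: pushes the last character back */

/* Recognize < <= <> = > >= following the relop transition diagram. */
Token get_relop(void) {
    Token t = { RELOP, 0 };
    int c = nextch();
    if (c == '<') {
        c = nextch();
        if (c == '=')      t.attribute = LE;   /* "<=" */
        else if (c == '>') t.attribute = NE;   /* "<>" */
        else { retract(); t.attribute = LT; }  /* "<": retract lookahead */
    } else if (c == '=') {
        t.attribute = EQ;                      /* "=" */
    } else if (c == '>') {
        c = nextch();
        if (c == '=') t.attribute = GE;        /* ">=" */
        else { retract(); t.attribute = GT; }  /* ">" */
    }
    return t;
}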
1.6 FINITE AUTOMATA
Finite Automata is one of the mathematical models that consist of a number of
states and edges. It is a transition diagram that recognizes a regular expression or
grammar.
Types of Finite Automata
There are two types of Finite Automata:
Non-deterministic Finite Automata (NFA)
Deterministic Finite Automata (DFA)
Non-deterministic Finite Automata
NFA is a mathematical model that consists of five tuples denoted by
M = (Qn, Σ, δ, q0, Fn), where
Qn – finite set of states
Σ – finite set of input symbols
δ – transition function that maps state-symbol pairs to sets of states
q0 – starting state
Fn – final state(s)
Deterministic Finite Automata
DFA is a special case of an NFA in which
i) no state has an ε-transition.
ii) there is at most one transition from each state on any input.
DFA has five tuples denoted by M = (Qd, Σ, δ, q0, Fd), where
Qd – finite set of states
Σ – finite set of input symbols
δ – transition function that maps a state-symbol pair to a single state
q0 – starting state
Fd – final state(s)
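A DFA can be simulated directly from its transition table. The sketch below
hard-codes, as an assumed example, the DFA for (a|b)*abb with states 0-3 and
accepting state 3:

#include <stdio.h>

/* delta[state][symbol]: column 0 is input 'a', column 1 is input 'b' */
static const int delta[4][2] = {
    {1, 0},   /* state 0 */
    {1, 2},   /* state 1 */
    {1, 3},   /* state 2 */
    {1, 0},   /* state 3 (accepting) */
};

static int accepts(const char *w) {
    int s = 0;                                  /* start state */
    for (; *w; ++w) {
        if (*w != 'a' && *w != 'b') return 0;   /* symbol outside alphabet */
        s = delta[s][*w - 'a'];
    }
    return s == 3;                              /* accept iff in final state */
}

int main(void) {
    printf("%d\n", accepts("aababb"));   /* prints 1: string ends in abb */
    printf("%d\n", accepts("abab"));     /* prints 0 */
    return 0;
}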
1.7 Converting a Regular Expression into a Non-Deterministic Finite
Automaton (Thompson’s Algorithm)
There are only 5 rules, one for each type of RE:
The algorithm constructs NFAs with only one final state. For example, the third rule
indicates that, to construct the NFA for the RE AB, we construct the NFAs for A and
B which are represented as two boxes with one start and one final state for each
box. Then the NFA for AB is constructed by connecting the final state of A to the
start state of B using an empty transition.
For example, the RE (a|b)c is mapped to the following NFA:
[Figure: Thompson NFA for (a|b)c]
The start state of the constructed DFA is labeled by the ε-closure of the NFA
start state. For every DFA state labeled by some set {s1, s2, ..., sn} and for
every character c in the language alphabet, we find all the NFA states
reachable from s1, s2, ..., or sn using c arrows and union together the
ε-closures of these nodes. If this set is not the label of any other node in
the DFA constructed so far, we create a new DFA node with this label. For
example, node {1,2} in the DFA above has an arrow to {3,4,5} for the character
a, since NFA node 3 can be reached from 1 on a, and nodes 4 and 5 can be
reached from 2. The b arrow for node {1,2} goes to the error node, which is
associated with the empty set of NFA nodes.
Likewise, we can define the ε-closure of a set of states to be the states
reachable by ε-transitions from its members; in other words, this is the union
of the ε-closures of its elements. To convert our NFA to its DFA counterpart,
we begin by taking the ε-closure of the start state q0 of our NFA and
constructing a new start state S0 in our DFA corresponding to that ε-closure.
Next, for each symbol in our alphabet, we record the set of NFA states that we
can reach from S0 on that symbol. For each such set, we make a DFA state
corresponding to its ε-closure, taking care to do this only once for each set.
Building the Syntax Tree
Example
Step 1: augment the regular expression: (a|b)*abb → (a|b)*abb #

Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -
Step 4:
A = firstpos(n0) = {1, 2, 3}
Move[A, a] = {1, 3}
   followpos(1) ∪ followpos(3) = {1, 2, 3, 4} = B
Move[A, b] = {2}
   followpos(2) = {1, 2, 3} = A
Move[B, a] = {1, 3}
   followpos(1) ∪ followpos(3) = {1, 2, 3, 4} = B
Move[B, b] = {2, 4}
   followpos(2) ∪ followpos(4) = {1, 2, 3, 5} = C
Move[C, a] = {1, 3} → B
Move[C, b] = {2, 5}
   followpos(2) ∪ followpos(5) = {1, 2, 3, 6} = D
Move[D, a] = {1, 3} → B
Move[D, b] = {2, 4} → C
Required DFA for (a|b)*abb#:
[Figure: DFA with states A, B, C, D and the transitions computed above;
D, which contains position 6 of #, is the accepting state]
1.9 A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS
There is a wide range of tools for constructing lexical analyzers:
· Lex
· YACC
LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly
used with the yacc parser generator.
Creating a lexical analyzer
First, a specification of a lexical analyzer is prepared by creating a program
lex.l in the Lex language. Then, lex.l is run through the Lex compiler to
produce a C program lex.yy.c.
lex.l → [Lex Compiler] → lex.yy.c
lex.yy.c → [C Compiler] → a.out
input stream → [a.out] → sequence of tokens
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Each rule in the rules section has the form pi { actioni }, where pi is a
regular expression and actioni describes what action the lexical analyzer
should take when pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be
compiled separately and loaded with the lexical analyzer.
YACC- YET ANOTHER COMPILER-COMPILER
Yacc provides a general tool for describing the input to a computer program.
The Yacc user specifies the structures of his input, together with code to be
invoked as each such structure is recognized. Yacc turns such a specification
into a subroutine that handles the input process; frequently, it is convenient
and appropriate to have most of the flow of control in the user's application
handled by this subroutine.
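As a sketch of this style of use, here is a minimal Yacc specification for
single-digit arithmetic (the classic desk-calculator skeleton). The hand-written
yylex() below stands in for a Lex-generated scanner; a full calculator would
handle multi-digit numbers and more operators:

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT
%%
line   : expr '\n'         { printf("= %d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       | DIGIT
       ;
%%
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;                 /* single-character tokens, including '\n' */
}
int main(void) { return yyparse(); }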
If two such sets are equal, we simply reuse the existing DFA state that we
already constructed. This process is then repeated for each of the new DFA
states (that is, sets of NFA states) until we run out of DFA states to process.
Finally, every DFA state whose corresponding set of NFA states contains an
accepting state is itself marked as an accepting state. A compact sketch of
this construction appears below.
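The following C sketch implements this subset construction using bit masks for
sets of NFA states. It assumes at most 32 NFA states; the hard-coded ε-NFA is a
small hypothetical one for a*b, not an example from the text:

#include <stdio.h>

#define NSTATES 3   /* NFA states 0,1,2:  0 -ε-> 1,  0 -a-> 0,  1 -b-> 2 */
#define NSYMS   2   /* alphabet: 'a' = 0, 'b' = 1 */

/* delta[s][c] = bit mask of states reachable from s on symbol c */
static const unsigned delta[NSTATES][NSYMS] = {
    {1u << 0, 0},        /* from state 0: a -> {0} */
    {0, 1u << 2},        /* from state 1: b -> {2} */
    {0, 0},
};
static const unsigned eps[NSTATES] = { 1u << 1, 0, 0 };  /* ε-successors */

/* ε-closure of a set of states, computed to a fixed point */
static unsigned eclose(unsigned set) {
    unsigned prev;
    do {
        prev = set;
        for (int s = 0; s < NSTATES; s++)
            if (set & (1u << s)) set |= eps[s];
    } while (set != prev);
    return set;
}

/* states reachable from 'set' on symbol c, before taking the ε-closure */
static unsigned move(unsigned set, int c) {
    unsigned out = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s)) out |= delta[s][c];
    return out;
}

int main(void) {
    unsigned dstates[1u << NSTATES];
    int n = 0;
    dstates[n++] = eclose(1u << 0);              /* start = ECLOSE(q0) */
    for (int i = 0; i < n; i++) {
        for (int c = 0; c < NSYMS; c++) {
            unsigned t = eclose(move(dstates[i], c));
            int j;
            for (j = 0; j < n && dstates[j] != t; j++)
                ;                                /* reuse an equal set */
            if (j == n) dstates[n++] = t;        /* otherwise a new DFA state */
            printf("D%d --%c--> D%d\n", i, 'a' + c, j);
        }
    }
    for (int i = 0; i < n; i++)                  /* mark accepting states */
        if (dstates[i] & (1u << 2)) printf("D%d is accepting\n", i);
    return 0;
}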
The Lexical-Analyzer Generator Lex
The lexical-analyzer tool is called Lex or, in a more recent implementation,
Flex. It allows one to specify a lexical analyzer by writing regular
expressions to describe patterns for tokens.
The input notation for the Lex tool is referred to as the Lex language and the
tool itself is the Lex compiler.
Behind the scenes, the Lex compiler transforms the input patterns into a
transition diagram and generates code, in a file called lex.yy.c, that
simulates this transition diagram.
Use of Lex
Figure 3.22 suggests how Lex is used. An input file, which we call lex.l, is
written in the Lex language and describes the lexical analyzer to be generated.
The Lex compiler transforms lex.l to a C program, in a file that is always
named lex.yy.c. The latter file is compiled by the C compiler into a file
called a.out. The C-compiler output is a working lexical analyzer that can take
a stream of input characters and produce a stream of tokens.
The normal use of the compiled C program, referred to as a.out in Fig. 3.22, is
as a subroutine of the parser. It is a C function that returns an integer,
which is a code for one of the possible token names. The attribute value,
whether it be another numeric code, a pointer to the symbol table, or nothing,
is placed in a global variable yylval, which is shared between the lexical
analyzer and parser, thereby making it simple to return both the name and an
attribute value of a token.
Structure of Lex Programs
A Lex program has the following form:
declarations
%%
translation rules
%%
auxiliary functions
· The declarations section includes declarations of variables, manifest constants
(identifiers declared to stand for a constant, e.g., the name of a token), and regular
definitions.
· The translation rules each have the form
Pattern { Action }
· Each pattern is a regular expression, which may use the regular definitions of
the declaration section. The actions are fragments of code, typically written in C,
although many variants of Lex using other languages have been created.
· The third section holds whatever additional functions are used in the actions.
· Alternatively, these functions can be compiled separately and loaded with the
lexical analyzer.
· When called by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it finds the longest prefix of the input that
matches one of the patterns Pi. It then executes the associated action Ai. Typically,
Ai will return to the parser, but if it does not (e.g., because Pi describes whitespace
or comments), then the lexical analyzer proceeds to find additional lexemes, until
one of the corresponding actions causes a return to the parser.
The lexical analyzer returns a single value, the token name, to the parser, but uses the
shared, integer variable yylval to pass additional information about the lexeme found, if
needed.
Example: Figure 3.23 is a Lex program that recognizes the tokens of Fig. 3.12 and returns
the token found.
Declarations section:
In the declarations section we see a pair of special brackets, %{ and %}.
Anything within these brackets is copied directly to the file lex.yy.c, and is
not treated as a regular definition. It is common to place there the
definitions of the manifest constants, using C #define statements to associate
unique integer codes with each of the manifest constants.
Notice that in the definitions of id and number, parentheses are used as
grouping metasymbols and do not stand for themselves. In contrast, E in the
definition of number stands for itself. If we wish to use one of the Lex
metasymbols, such as any of the parentheses, +, *, or ?, to stand for itself,
we may precede it with a backslash. For instance, we see \. in the definition
of number, to represent the dot, since that character is a metasymbol
representing "any character," as usual in UNIX regular expressions.
Auxiliary-function section:
In the auxiliary-function section, we see two such functions, installID() and
installNum(). Like the portion of the declarations section that appears between
%{ ... %}, everything in the auxiliary section is copied directly to file
lex.yy.c, but may be used in the actions.
Translation rules:
Finally, let us examine some of the patterns and rules in the middle section of
Fig. 3.23. First, ws, an identifier declared in the first section, has an associated
empty action. If we find whitespace, we do not return to the parser, but look for
another lexeme.
The second token has the simple regular expression pattern if. Should we see
the two letters if on the input, and they are not followed by another letter or
digit (which would cause the lexical analyzer to find a longer prefix of the
input matching the pattern for id), then the lexical analyzer consumes these
two letters from the input and returns the token name IF, that is, the integer
for which the manifest constant IF stands. Keywords then and else are treated
similarly.
%{
/* definitions of manifest constants */
#define LT 260
#define LE 261
/* ... and similarly for EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}
%%
int installID() {/* function to install the lexeme, whose first
character is pointed to by yytext, and whose length is yyleng, into
the symbol table, and return a pointer thereto */
}
int installNum() {/* similar to installID, but puts numerical
constants into a separate table */
}
Figure 3.23: Lex program for the tokens of Fig. 3.12
The fifth token has the pattern defined by id. Note that, although keywords like if
match this pattern as well as an earlier pattern, Lex chooses whichever pattern is
listed first in situations where the longest matching prefix matches two or more
patterns. The action taken when id is matched is threefold:
1. Function installID() is called to place the lexeme found in the symbol
table.
2. This function returns a pointer to the symbol table, which is placed in
global variable yylval, where it can be used by the parser or a later component
of the compiler. Note that installID() has available to it two variables that
are set automatically by the lexical analyzer that Lex generates:
(a) yytext is a pointer to the beginning of the lexeme, analogous to
lexemeBegin in Fig. 3.3.
(b) yyleng is the length of the lexeme found.
3. The token name ID is returned to the parser. The action taken when a lexeme
matching the pattern number is similar, using the auxiliary function
installNum().
Conflict Resolution in Lex
Two rules that Lex uses to decide on the proper lexeme to select, when several
prefixes of the input match one or more patterns:
1. Always prefer a longer prefix to a shorter prefix.
2. If the longest possible prefix matches two or more patterns, prefer the
pattern listed first in the Lex program.
E x a m p l e:
• The first rule tells us to continue reading letters and digits to find the longest
prefix of these characters to group as an identifier. It also tells us to treat <=
as a single lexeme, rather than selecting < as one lexeme and = as the next
lexeme.
• The second rule makes keywords reserved, if we list the keywords before id
in the program.
For instance, if then is determined to be the longest prefix of the input that
matches any pattern, and the pattern then precedes {id}, as it does in Fig.
3.23, then the token THEN is returned, rather than ID.
The Lookahead Operator
· Lex automatically reads one character ahead of the last character that
forms the selected lexeme, and then retracts the input so only the lexeme
itself is consumed from the input.
· Sometimes we want a certain pattern to be matched to the input only when it
is followed by certain other characters. If so, we may use the slash in a
pattern to indicate the end of the part of the pattern that matches the lexeme.
What follows / is additional pattern that must be matched before we can decide
that the token in question was seen, but what matches this second pattern is
not part of the lexeme.
Example : In Fortran and some other languages, keywords are not reserved. That
situation creates problems, such as a statement
IF(I,J) = 3
where IF is the name of an array, not a keyword. This statement contrasts with
statements of the form
IF( condition ) THEN ...
where IF is a keyword.
To recognize the keyword IF, note that it is always followed by a left
parenthesis, some text (the condition, which may contain parentheses), a right
parenthesis, and a letter. Then, we could write a Lex rule for the keyword IF
like:
IF / \( .* \) {letter}
This rule says that the pattern the lexeme matches is just the two letters IF.
The slash says that additional pattern follows but does not match the lexeme.
In this pattern, the first character is the left parenthesis. Since that
character is a Lex metasymbol, it must be preceded by a backslash to indicate
that it has its literal meaning. The dot and star match "any string without a
newline." Note that the dot is a Lex metasymbol meaning "any character except
newline." It is followed by a right parenthesis, again with a backslash to give
that character its literal meaning. The additional pattern is followed by the
symbol letter, which is a regular definition representing the character class
of all letters.
For instance, suppose this pattern is asked to match a prefix of the input:
IF(A<(B+C)*D)THEN...
The first two characters match IF, the next character matches \(, the next
nine characters match .*, and the next two match \) and letter. Note that the
fact that the first right parenthesis (after C) is not followed by a letter is
irrelevant; we only need to find some way of matching the input to the pattern.
We conclude that the letters IF constitute the lexeme, and they are an instance
of token if.
Conversion of an NFA with ε-transitions to a DFA
Step 1:
Consider M = (Q, Σ, δ, q0, F), an NFA with ε-transitions. We have to convert
this NFA with ε into an equivalent DFA, denoted by MD = (QD, Σ, δD, qD, FD).
Theorem:
A language L is accepted by some ε-NFA if and only if (iff) L is accepted by
some DFA.
Proof:
Suppose L = L(D) for some DFA D. Turn D into an ε-NFA E by adding the
transitions δ(q, ε) = φ for all states q of D. We must also convert the
transitions of D on input symbols; for example, the DFA transition
δD(p, a) = p becomes the NFA transition to the set containing only p, i.e.,
δE(p, a) = {p}. Thus the transitions of E and D are the same, but E explicitly
states that there are no transitions out of any state on ε.
Let E = (QE, Σ, δE, q0, FE) be an ε-NFA. Apply the modified subset construction
described above to produce the DFA D = (QD, Σ, δD, qD, FD).
Formally, we show that δ̂E(q0, w) = δ̂D(qD, w), by induction on the length of w.
Basis:
If |w| = 0, then w = ε. We know δ̂E(q0, ε) = ECLOSE(q0). We also know that
qD = ECLOSE(q0), because this is how the start state of D is defined. For a
DFA, δ̂(p, ε) = p for any state p, so in particular δ̂D(qD, ε) = ECLOSE(q0). We
have thus proved that δ̂E(q0, ε) = δ̂D(qD, ε).
Induction:
Suppose w = xa, where a is the final symbol of w, and assume that the statement
holds for x, i.e., δ̂E(q0, x) = δ̂D(qD, x). Let both these sets of states be
{p1, p2, ..., pk}. By the definition of δ̂ for ε-NFAs, we compute δ̂E(q0, w) as
follows:
let {r1, r2, ..., rm} be the union of δE(pi, a) for i = 1, ..., k;
then δ̂E(q0, w) = ECLOSE({r1, r2, ..., rm}).
Theorem:
If L is accepted by an NFA with ε-transitions, then L is accepted by an NFA
without ε-transitions; that is, L(M) = L(M').
Proof (sketch): by induction, noting that for an NFA without ε-transitions
δ and δ̂ agree on single input symbols, while in an ε-NFA they may differ
because of the ε-closures.
Extended transition function
DFA:
Basis: |w| = 0:
δ̂(q, ε) = q
If we are in a state q and read no input, then we are still in state q.
Induction: |w| ≥ 1:
Suppose w is a string of the form xa. Then:
δ̂(q, w) = δ̂(q, xa) = δ(δ̂(q, x), a) = δ(p, a), where p = δ̂(q, x).
NFA:
Basis: δ̂(q, ε) = {q}
If we are in a state q and read no input, then we are still in state q.
Induction:
Suppose w is a string of the form xa. Then:
δ̂(q, w) = δ(δ̂(q, x), a) = δ({p1, p2, ..., pk}, a), where
δ̂(q, x) = {p1, p2, ..., pk}.
Solution:
The initial state of the given NFA is {q0}. Since the NFA is equivalent to a
DFA, let the initial state of the DFA be [q0] = A.

State    0    1
→A       A    B
B        C    B
*C       C    C

a)
Note:
• No change in the initial state.
• No change in the total number of states.
• There may be changes in the final states.
• There are changes in the transitions.
Solution:
ε-Closure(q0) = {q0, q1}
ε-Closure(q1) = {q1}
Transition table:
         0    1
*q1      Φ    {q1}
[Figure: transition diagram]
Conversion of NFA-ε into its equivalent NFA.
b)
Solution:
ε-Closure(A) = {A, B}
ε-Closure(B) = {B, D}
ε-Closure(C) = {C}
ε-Closure(D) = {D}
Transition table:
State    Input symbol
         0        1
→A       {A,B,C}  Φ
*B       {C,D}    {D}
C        Φ        Φ
*D       {D}      {D}
Note:
B and D are final states because the final state D of the given NFA-ε lies in
both ε-Closure(B) = {B, D} and ε-Closure(D) = {D}.
Conversion of NFA-ε to DFA
Solution:
States    Inputs
          0    1    2
→*A       A    B    C
*B        ∅    B    C
*C        ∅    ∅    C
Note:
Since q2 lies in the DFA states A, B and C, all three states become final
states of the resultant DFA.
Convert the following NFA-ε into its equivalent DFA.
b)
States    Inputs
          0      1      ε
→A        {B}    {A}    {B}
B         ∅      {B}    {C}
*C        {C,A}  {C}    ∅
Solution (partial working):
= ε-closure{δ((B,C), 0)}
= {C} = [C] → Z
= {B,C} = [B,C] → Y
= [C] → Z
= [C] → Z
If, on the other hand, there exists some string w ∈ Σ* such that
δ*(p, w) ∈ F and δ*(q, w) ∉ F, or vice versa,
then the states p and q are said to be distinguishable by the string w.
Algorithm
Step 1 − All the states Q are divided into two partitions − final states and
non-final states − denoted by P0. All the states in a partition are
0th-equivalent. Take a counter k and initialize it with 0.
Step 2 − Increment k by 1. For each partition in Pk, divide the states in Pk
into two partitions if they are k-distinguishable. Two states X and Y within a
partition are k-distinguishable if there is an input S such that δ(X, S) and
δ(Y, S) are (k-1)-distinguishable.
Step 3 − If Pk ≠ Pk-1, repeat Step 2; otherwise, go to Step 4.
Step 4 − Combine the kth-equivalent sets and make them the new states of the
reduced DFA.
Problem 1: Find the minimized DFA for the given DFA:

State    0     1
→q0      q1    q5
q1       q6    q2
*q2      q0    q2
q3       q2    q6
q4       q7    q5
q5       q2    q6
q6       q6    q4
q7       q6    q2
Solution:
Partition the states into final and non-final:
Q1 = {q2}    Q2 = {q0, q1, q3, q4, q5, q6, q7}
Now check for equivalent states by comparing the transitions of each pair.
Within Q2, q0 and q4 move to the same classes on both inputs (and so do q1 and
q7, and q3 and q5), so they can be merged. Removing duplicates gives:

State        0         1
→[q0,q4]    [q1,q7]   [q3,q5]
[q1,q7]     [q6]      [q2]
*[q2]       [q0,q4]   [q2]
[q3,q5]     [q2]      [q6]
[q6]        [q6]      [q0,q4]

Renaming the classes Q1 = [q2], Q2 = [q0,q4], Q3 = [q1,q7], Q4 = [q3,q5] and
Q5 = [q6], the minimized DFA is:

State    0     1
→Q2      Q3    Q4
Q3       Q5    Q1
*Q1      Q2    Q1
Q4       Q1    Q5
Q5       Q5    Q2
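The refinement in Problem 1 can be cross-checked with a brute-force C sketch of
the k-distinguishability algorithm (states q0-q7 renumbered as indices 0-7;
this is an illustration of the idea, not production code):

#include <stdio.h>

#define NS 8
/* Transition table from Problem 1: delta[state][input] for inputs 0 and 1 */
static const int delta[NS][2] = {
    {1,5},{6,2},{0,2},{2,6},{7,5},{2,6},{6,4},{6,2}
};
static const int accepting[NS] = {0,0,1,0,0,0,0,0};   /* only q2 is final */

int main(void) {
    int cls[NS], next[NS];
    for (int s = 0; s < NS; s++) cls[s] = accepting[s];  /* P0 */
    for (int changed = 1; changed; ) {
        changed = 0;
        int n = 0;
        /* two states stay together iff they are in the same class and
           their successors agree on both inputs */
        for (int s = 0; s < NS; s++) {
            int t;
            for (t = 0; t < s; t++)
                if (cls[t] == cls[s] &&
                    cls[delta[t][0]] == cls[delta[s][0]] &&
                    cls[delta[t][1]] == cls[delta[s][1]])
                    break;
            next[s] = (t < s) ? next[t] : n++;
        }
        for (int s = 0; s < NS; s++) {
            if (next[s] != cls[s]) changed = 1;
            cls[s] = next[s];
        }
    }
    /* prints five classes: {q2}, {q0,q4}, {q1,q7}, {q3,q5}, {q6} */
    for (int s = 0; s < NS; s++) printf("q%d -> class %d\n", s, cls[s]);
    return 0;
}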
Problem 2: Find the minimized DFA.
Solution:
Partition the states into final and non-final: one set will contain q1, q2 and
q4, which are the final states of the DFA; another set will contain the
remaining states.
10. PART A : Q & A : UNIT – I
1. What are the two parts of a compilation? Explain briefly. (CO1, K1)
Analysis and Synthesis are the two parts of compilation (the front end and the
back end).
● The analysis part breaks up the source program into constituent pieces and
creates an intermediate representation of the source program.
● The synthesis part constructs the desired target program from the
intermediate representation.
2. List the various compiler construction tools. (CO1, K2)
● Parser generators
● Scanner generators
● Syntax-directed translation engines
● Automatic code generators
● Data-flow engines
● Compiler construction toolkits
3. Differentiate compiler and interpreter. (CO1, K1)
● The machine-language target program produced by a compiler is usually much
faster than an interpreter at mapping inputs to outputs.
● An interpreter, however, can usually give better error diagnostics than a
compiler, because it executes the source program statement by statement.
4. Define tokens, patterns, and lexemes. (CO1, K2)
● Token: a sequence of characters that has a collective meaning. A token of a
language is a category of its lexemes.
● Pattern: there is a set of strings in the input for which the same token is
produced as output; this set of strings is described by a rule called a pattern
associated with the token.
● Lexeme: a sequence of characters in the source program that is matched by
the pattern for a token.
5. Describe the possible error recovery actions in a lexical analyzer. (CO1, K1)
· Panic mode recovery
· Deleting an extraneous character
· Inserting a missing character
· Replacing an incorrect character by a correct character
· Transposing two adjacent characters
6. Write the regular expression for an identifier. (CO1, K2)
letter → A | B | …… | Z | a | b | ….. | z
digit → 0 | 1 | ……. | 9
id → letter ( letter | digit )*
7. Write the regular expression for a number. (CO1, K2)
digit → 0 | 1 | ……. | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
8. List the phases that constitute the front end of a compiler. (CO1, K2)
The front end consists of those phases or parts of phases that depend primarily
on the source language and are largely independent of the target machine.
These include:
● Lexical and syntactic analysis
● The creation of the symbol table
● Semantic analysis
● Generation of intermediate code
A certain amount of code optimization can be done by the front end as well, and
it includes the error handling that goes along with each of these phases.
9. Mention the back-end phases of a compiler. (CO1, K2)
The back end of the compiler includes those portions that depend on the target
machine and generally do not depend on the source language, just the
intermediate language. These include:
● Code optimization
● Code generation, along with error handling and symbol-table operations.
10. What is the role of the lexical analysis phase? (CO1, K2)
The lexical analyzer reads the source program one character at a time and
groups the characters into a sequence of atomic units called tokens.
Identifiers, keywords, constants, operators and punctuation symbols such as
commas and parentheses are typical tokens. The lexical analyzer also:
● Enters identifiers into the symbol table.
● Removes comments and whitespace (blanks, newlines, tabs).
● Keeps track of line numbers.
● Identifies lexical errors and issues appropriate error messages.
11. PART B QUESTIONS : UNIT – I
1. Describe the various phases of a compiler and trace the program segment
(t = b * -c + b * -c) through them. (CO1,K4)
2.Explain in detail the process of compilation. Illustrate the output of each phase of
the compilation for the input “a = (b+c) * (b+c) *2”. (CO1,K3)
3. Discuss various buffering techniques in detail. (CO1,K5)
4. Convert the regular expression (a|b)*a to an NFA. (CO1,K3)
5. Construct an NFA using the regular expression (a|b)*abb. (CO1,K3)
6. Construct the NFA from the (a|b)*a(a|b) using Thompson’s construction
algorithm. (CO1,K6)
7. Construct the DFA for the augmented regular expression (a | b )* # directly
using syntax tree. (CO1,K3)
8. Write an algorithm for minimizing the number of states of a DFA. (CO1,K3)
PART C QUESTIONS
12. Supportive online Certification courses
NPTEL : https://nptel.ac.in/courses/106/105/106105190/
Swayam : https://www.classcentral.com/course/swayam-compiler-design-12926
coursera : https://www.coursera.org/learn/nand2tetris2
Udemy : https://www.udemy.com/course/introduction-to-compiler-construction-and-design/
Mooc : https://www.mooc-list.com/course/compilers-coursera
Edx : https://www.edx.org/course/compilers
13. Real time Applications in day to day life and to Industry
14. CONTENTS BEYOND SYLLABUS : UNIT – I
Text Parsing:
Text parsing is a technique used to derive a text string using the production
rules of a grammar, in order to check the acceptability of the string.
[~delim1,delim2,...,delimN] :: Token
15. ASSESSMENT SCHEDULE

S.No   Name of the Assessment   Start Date   End Date    Portion
5      Revision 1               13.5.2024    16.5.2024   UNIT 5, 1 & 2
16. PRESCRIBED TEXT BOOKS & REFERENCE BOOKS
• TEXT BOOKS:
• REFERENCE BOOKS:
17. MINI PROJECT SUGGESTION
• Objective:
Design of a lexical analyzer generator
Design of an automaton for pattern matching
• Planning:
• This method is mostly used to improve the ability of students in the
application domain and also to reinforce knowledge imparted during the
lectures.
• Students are asked to prepare mini projects involving application of the
concepts, principles or laws learnt.
• The faculty guides the students at various stages of developing the project
and gives timely inputs for the development of the model.
• Students convert their ideas into real-time applications.
Projects:
1. C Mini Project: Creating a Lexical Analyzer. (CO1,K6)
2. Use Flex to create a lexical analyzer for C. (CO1,K2)
3. Regular Expression matching: check whether two or more regular expressions
are similar to each other or not. (CO1,K4)
4. Construct a vending machine as an automated machine that uses finite state
automata to control the functioning process. (CO1,K3)
5. Design a simple tool for Symbol Table Management. (CO1,K6)
Thank you
Disclaimer:
This document is confidential and intended solely for the educational purpose of RMK Group of Educational
Institutions. If you have received this document through email in error, please notify the system manager. This
document contains proprietary information and is intended only to the respective group / learning community as
intended. If you are not the addressee you should not disseminate, distribute or copy through e-mail. Please notify
the sender immediately by e-mail if you have received this document by mistake and delete this document from
your system. If you are not the intended recipient you are notified that disclosing, copying, distributing or taking any
action in reliance on the contents of this information is strictly prohibited.