CD Unit 1
Analysis
PART A
SHORT QUESTIONS WITH ANSWERS
2. Compiler
3. Assembler
4. Linker/loader.
Q2. Define compiler and interpreter.
Ans:
Interpreter: An interpreter is also a program that reads the source program and executes it line by line.
Figure: Interpreter (inputs: source program and data; output: the output of the program)
Q3. Define linker and loader and explain briefly.
Ans:
Linkers: It is a program that links the two object files containing the compiled or assembled code to form a single file which can be directly executed. It is also responsible for performing the following functions,
1. Linking the object program with the code for standard library functions and ...

Q6. Differentiate between pass one and pass two of a compiler.
Ans:
Single Pass Compiler                           Multipass Compiler
1. A single pass compiler scans the            1. A multipass compiler scans the
   entire source code once.                       source code several times.
2. Execution time of the code for a            2. Execution time of the code for a
   single pass translator is less.                multipass translator is more.
1. Analysis phase
2. Synthesis phase.
Diagrammatic representation of this model is shown in figure below.
1.2.1 The Role of Lexical Analyzer
Q15. What are the functions of lexical analyzer?
(or)
Explain the role of lexical analyzer with its implementations.
Ans:
Lexical Analyzer: Lexical analysis is the process of reading the input string, identifying tokens, deleting white spaces, and locating, repairing and reporting errors, if there are any.
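The process described above (read the input, delete white space, identify tokens, report what cannot be a token) can be sketched in C. This is only an illustrative sketch: the names next_token, T_IDEN, T_NUMBER, T_OP, T_END and T_ERROR are hypothetical and do not come from the text, and the set of operators is an assumption.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds used only in this sketch. */
enum kind { T_IDEN, T_NUMBER, T_OP, T_END, T_ERROR };

/* Reads one token starting at *src, advances *src past it and copies
   the lexeme into buf (at most cap - 1 characters; cap must be >= 1). */
enum kind next_token(const char **src, char *buf, size_t cap)
{
    const char *p = *src;
    size_t n = 0;
    enum kind k;

    /* White space is deleted before a token is formed. */
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;

    if (*p == '\0') {
        k = T_END;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        /* Identifier: a letter followed by letters or digits. */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < cap) buf[n++] = *p;
            p++;
        }
        k = T_IDEN;
    } else if (isdigit((unsigned char)*p)) {
        /* Number: a run of digits. */
        while (isdigit((unsigned char)*p)) {
            if (n + 1 < cap) buf[n++] = *p;
            p++;
        }
        k = T_NUMBER;
    } else if (strchr("=+-*/", *p) != NULL) {
        /* Single-character operator (assumed operator set). */
        buf[n++ < cap - 1 ? n - 1 : 0] = *p;
        n = 1;
        p++;
        k = T_OP;
    } else {
        /* Any other character is reported as a lexical error. */
        if (n + 1 < cap) buf[n++] = *p;
        p++;
        k = T_ERROR;
    }
    if (cap == 1)
        n = 0;
    buf[n] = '\0';
    *src = p;
    return k;
}
```

Calling next_token repeatedly on the string "a = a + 1" yields an identifier, an operator, an identifier, an operator and a number, and then T_END.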
The lexical analysis is the first phase of a compiler.

Figure: Input string -> Scanner -> Token -> Parser. The parser requests tokens through next token( ); lexical analysis is processed in the scanner.

The scanner of a lexical analyzer is responsible for,
1. Removal of white spaces and comments
2. Identifying
   (i) Constants
   (ii) Identifiers
   (iii) Keywords.
3. Generating tokens
4. Reporting about errors.

1. Removal of White Spaces and Comments: Whenever the scanner encounters a blank, tab or a newline character, it blindly deletes it. The reason is that the parser need not worry about white spaces, as it is itself a complex phase. The parser cannot have regular expressions for white spaces. Even comments can be ignored by the parser, as they will be deleted by the scanner while generating tokens.

2. Identifying
(i) Constants: Whenever the scanner encounters a constant, it makes an entry into the symbol table and returns the pointer.
(ii) Identifiers: An identifier can be a name of a function, variable, etc. However, the grammar of any language will consider an identifier as a token.
Example: The statement in C,
    a = a + 1
The lexical analyzer would consider the above statement as,
    iden = iden + number
Wherein, iden is one token; =, + and number are also treated as single tokens.
(iii) Keywords: A keyword in any language will follow all the rules that are imposed on an identifier. The point to be noted is that keywords should be made reserved identifiers so that the scanner does not get confused between an identifier and a keyword.

3. Generating Tokens: Whenever the scanner identifies a valid token, it generates a tuple consisting of a valid lexeme (identifier, number, keyword etc.) and the pointer to the symbol table entry for that lexeme. The token generation of the lexical analysis phase is initiated by the parser by using a function namely next token( ). After receiving a signal from the parser, the scanner starts reading the input string till it finds a valid token.

4. Reporting about Errors: The lexical analyzer in some compilers is responsible for making a copy of the source program and marking errors in it. The scanner should be careful about duplication of a single error. However, if the scanner encounters an error, it invokes a procedure in the error handler (part of a compiler).

The lexical analysis is completely dependent on the source language and independent of the machine on which the program or source language is being compiled.

Q16. Explain why the analysis portion of a compiler is separated into lexical analysis and parsing phases.
Ans: The lexical analyzer, being the first phase of a compiler, reads or scans the input to the compiler and divides the input into a number of small parts of a string (tokens).

Figure: General Representation of Lexical Analyzer (input to compiler -> lexical analyzer -> output as a stream of tokens -> parser; the parser requests the next token; the lexical analyzer uses the symbol table)

The above figure clearly shows that the lexical analyzer acts only when it is invoked by the parser.
There are a number of reasons for which the lexical analyzer is separated from the syntax analyzer. Some of them are listed below,
Reasons
1. The division of the lexical analyzer from the syntax analyzer improves the efficiency of the compiler. Each phase is given its own responsibilities. "The less the number of tasks, the more is the efficiency" is the idea behind the separation. If one phase reads from a string, forms tokens and checks the syntax for each token, then the time consumed will be more and the task may not be performed efficiently.

2. By the division of the lexical analyzer and syntax analyzer into two phases, the designing of a compiler is made very easy. The reason is that if these phases are put together, then all the functions that are to be performed by the lexical analyzer have to be performed by the syntax analyzer. Of all the functions performed by the lexical analyzer, the most difficult is removing the comments, white spaces and new line characters from the input string. A syntax analyzer is in itself a complicated phase (in terms of designing), which performs syntax checks according to the grammar of any given language. It would be absurd for such a complicated phase to also read the input string and remove spaces. Hence, the division of these two phases makes the design of each phase, and in turn the design of the compiler, easy.

3. Token generation and then storing the input token with its attributes in the symbol table is not an easy task. The lexical analyzer not only does token generation, but also makes a copy of the source program and then links with it the errors, if any.
For example, consider the following program in C language,
    #include<stdio.h>
    void main()
    {
    printf("Hello");
    }
The lexical analyzer reads the input one by one, forms tokens with attributes and stores them in the symbol table. When it encounters an erroneous character, it shows an error message, since in some compilers the lexical analyzer is responsible for this check.

Let us take an example of how a lexical analyzer divides a string into tokens, assigns attributes and stores them in the symbol table.

Figure: The input string A = B+C is given to the lexical analyzer, which is invoked by the syntax analyzer and returns the tokens <id, value = A>, assign_op, <id, value = B>, add_op, <id, value = C>, using the symbol table.

The lexical analyzer forms tokens, that is, any name as in A, B, student is an identifier, and operators like add_op, mult_op, div_op are used whenever such an operator is encountered.
The format of the token given to the parser is,
    <id, pointer to symbol table for value>    For identifiers.
    <type_op>                                  For operators, 'type' can be any mathematical operator.
Let us generate tokens for the following string,
    A = B+C
    <id, pointer to symbol table for A>
    <assign_op>
    <id, pointer to symbol table for B>
    <add_op>
    <id, pointer to symbol table for C>
Hence, the division,
    Makes compiler design easy
    Increases efficiency
    Reduces compilation time.

4. Moreover, the division also improves portability of the compiler. For example, if the Pascal language contains special characters for which a representation is not available, such characters can be isolated. By making the lexical analyzer delete non-standard characters, the compiler becomes free of any language or device specific restrictions. The parser is not even aware of any language or device specific restrictions; it works normally irrespective of any device specific constraints.
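The token generation for A = B+C in this answer can be sketched in C as follows. This is a minimal sketch under stated assumptions: identifiers are single upper-case letters, the "pointer to symbol table" is modelled as an array index, and the names tokenize, install, TOK_ID, TOK_ASSIGN_OP and TOK_ADD_OP are hypothetical.

```c
#define MAX_SYMS 26   /* one slot per upper-case letter, so it cannot overflow */

enum tok_type { TOK_ID, TOK_ASSIGN_OP, TOK_ADD_OP };

struct token {
    enum tok_type type;
    int sym;   /* symbol table index for TOK_ID, -1 for operators */
};

static char symtab[MAX_SYMS];   /* symbol table of one-letter names */
static int nsyms = 0;

/* Returns the index of name in the symbol table, inserting it if new. */
static int install(char name)
{
    for (int i = 0; i < nsyms; i++)
        if (symtab[i] == name)
            return i;
    symtab[nsyms] = name;
    return nsyms++;
}

/* Splits s into tokens. Returns the number of tokens written to out,
   or -1 on an unknown character or when out (capacity cap) is full. */
int tokenize(const char *s, struct token *out, int cap)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        if (*s == ' ')
            continue;                      /* white space is deleted */
        if (n == cap)
            return -1;
        if (*s >= 'A' && *s <= 'Z') {
            out[n].type = TOK_ID;          /* <id, pointer to symbol table> */
            out[n].sym = install(*s);
        } else if (*s == '=') {
            out[n].type = TOK_ASSIGN_OP;   /* <assign_op> */
            out[n].sym = -1;
        } else if (*s == '+') {
            out[n].type = TOK_ADD_OP;      /* <add_op> */
            out[n].sym = -1;
        } else {
            return -1;
        }
        n++;
    }
    return n;
}
```

For the input "A = B+C" the function writes five tokens: an identifier pointing at entry 0, an assignment operator, an identifier pointing at entry 1, an addition operator and an identifier pointing at entry 2.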
Q17. Define lexeme, token and pattern. Identify the lexemes that make up the tokens in the following program segment. Indicate the corresponding token and pattern:
    void swap(int i, int j)
    {
    int t;
    t = i;
    i = j;
    j = t;
    }
Ans:
Tokens: A token is a sequence of characters that can be treated as a lexical unit in the programming language. It consists of two parts, namely a token name and an attribute value.
Examples of tokens in a programming language include identifiers, keywords, punctuation and operators.

Patterns: A pattern is a rule describing the form of a token. In a programming language, a pattern is used to determine whether a token is valid or not. Regular expressions are important notations for specifying patterns.
For example, the pattern for the token keyword is the sequence of characters that form the keyword, and for the token identifier (id) it is,
    id -> letter (letter | digit)*

Lexemes: A lexeme is a sequence of characters in the source program that matches the pattern for a token.

Consider the following 'C' statement to better understand the difference between tokens, patterns and lexemes.
    printf("sum = %d\n", average);
In the above statement, printf and average are lexemes that match the pattern for the token identifier, and "sum = %d\n", which is enclosed between double quotes, is a lexeme matching the pattern for the token literal.
The table below shows some tokens, their corresponding patterns and lexemes.

Token        Pattern                                Lexemes
identifier   letter followed by letters or digits   sum, average
number       any number                             0, 2.14, 3.26
while        characters w, h, i, l, e               while
RELOP        <, >, <=, >=, ==, !=                   <=, !=
else         characters e, l, s, e                  else

Program: Given program segment is,
    void swap(int i, int j)
    {
    int t;
    t = i;
    i = j;
    j = t;
    }
The table below presents the identified lexemes as well as the corresponding tokens and patterns,

Lexemes   Tokens       Patterns
void      keyword      characters v, o, i, d
swap      identifier   letter followed by letters or digits
(         operator     (
int       keyword      characters i, n, t
i         identifier   letter followed by letters or digits
,         operator     ,
j         identifier   letter followed by letters or digits
)         operator     )
{         operator     {
t         identifier   letter followed by letters or digits
;         operator     ;
=         operator     =
}         operator     }
Ans: Errors can be encountered at each phase of compilation. Some of them are listed below.

Lexical Errors: A lexical analyzer reads the input string and generates tokens. While doing so, it may encounter the following types of errors.
1. A misspelled identifier
2. An identifier with more than the specified or defined length.
These errors can be caused due to the following reasons.
(i) An extra character beyond the specified or defined length
(ii) An invalid character
(iii) A missing character
(iv) Swapped characters or misplaced characters.

The majority of errors that can be identified during the lexical analysis phase of a compiler are misspelled variables or identifiers. The lexical analyzer or scanner, without any grammar of the language, cannot do more than this. However, while making a copy of the source program, it not only marks errors but also provides the line number in which the error has occurred.
Example: Let us consider a statement in C language,
    printfg("Hello"):
In the above statement there are a number of errors.

A compiler not only detects and reports errors but also tries to handle errors to some extent. But this is a very expensive process in terms of efficiency and speed, as the compiler cannot repair the error based on what the programmer actually needs. Rather, it can try to repair to the extent possible and give a message to the programmer about the repair as well. For example, if an identifier $name is declared, which is an invalid identifier in C language, the lexical analyzer can delete the first character and retain the rest. In the above scenario, the repair was very quick and easy; however, in real scenarios this is not the situation. Errors can be very ambiguous and hard to repair. In such situations the scanner can repair to the possible extent and leave the rest to the programmer.

Repairing an Error
Numeric related errors can be handled in an easy manner. The scanner should read a numeric constant, and if it exceeds the defined length then the extra digits can be truncated and the rest retained. The truncation is done from the place where the constant exceeds the defined length. For example, in a language where an integer can be of length 2 bytes, if the constant exceeds this size, the scanner will truncate the exceeded portion and inform the programmer of the same.
In case of comments, if an extra character appears outside the comment section or boundary, the character will be deleted.
Consider encoding of tokens as follows,

Token    Code
if       1
then     2
else     3
while    4
for      5

The stream of tokens in encoded form generated by the lexical analyzer is as follows,
    1, (9,1), (6,100), (8,3), (9,2), 2, (6,101), (10,1), (7,104), 3, (6,101), (10,1), (7,105)
In the above encoding, the value '1' at the beginning of the token stream represents the code for the "if" keyword. Then comes the pair (9,1), where '9' represents the code for parenthesis and '1' represents the opening parenthesis "(". Then comes the pair (6,100), where '6' represents the code for identifier and '100' represents the position of identifier 'x' in the symbol table. Similarly, in the pair (8,3), '8' represents operator and '3' represents the greater than (>) symbol. This process is continued till the end of the conditional statement. As a result, the above encoded form is obtained.
1.2.2 Input Buffering
Algorithm: Advance_Forward_Pointer
(i) It enables use of macros in the program. Macros are sets of instructions which can be used as many times as required by the program.
    extern int printf(const char *, ...);

    int main( )
    {
    printf("Hello World\n");


    return (0);
    }
Figure: Preprocessor Input
In the above program it can be seen that comment lines have been removed.

Input Buffering
The only phase of a compiler that reads or scans the source program is the scanner or lexical analyzer. The scanning is carried out on a character by character basis, thus buffering is a crucial step in making the lexical analyzer cope with the speed of other phases.

Q22. Define regular expression. Explain about the properties of regular expressions.
Ans:
Regular Expression: A regular expression is a concise notation for denoting regular sets.

Consider A, B, R1 as regular expressions. The identity rules of regular expressions are given below. These rules are useful in simplifying regular expressions. Here ∅ denotes the empty set and ε the empty string.
    ∅ + R1 = R1
    R1∅ = ∅R1 = ∅
    εR1 = R1ε = R1
    ∅* = ε and ε* = ε
    R1 + R1 = R1
    R1*R1* = R1*
    ε + R1R1* = R1* and ε + R1*R1 = R1*
    (R1*)* = R1*
    R1R1* = R1*R1
    ε + R1* = R1*
    (R1 + ε)* = R1*
    R1*(ε + R1)* = (ε + R1)*R1* = R1*
    R1*R1 + R1 = R1*R1
    (A + B)R1 = AR1 + BR1 and R1(A + B) = R1A + R1B
    (A + B)* = (A*B*)* = (A* + B*)*
1.2.3 Specification of Tokens, Recognition of Tokens
Q23. Write short notes on token specification.
Ans: A token represents a categorized block of text. Tokens can be specified using the following,
1. Strings
2. Languages
3. Regular set
4. Regular expression.

1. Strings: A string is a finite collection of symbols chosen from some alphabet.
The set of strings over an alphabet Σ = {a, b} is denoted by Σ*, which is a set containing the empty string (ε) and all combinations of a and b.
    Σ* = {ε, a, b, ab, ba, ...}

2. Languages: The language denoted by L is a set of strings over the alphabet Σ. For example, English is a language in which the words are strings over the alphabet of letters. Similarly, any programming language like C is a language which contains programs as a subset of the strings formed over the alphabet of the language. However, these languages are difficult to specify. Moreover, the empty set ∅ and the set {ε} are also languages. Some examples of languages are as follows.
2. Let Σ = {0, 1}, then the set of strings {01, 10} is a regular set.
3. Let Σ = {a, b}, then the set of strings starting with a and ending with b is a regular set.

Regular Expression: A regular expression is a concise notation for denoting regular sets. Regular expressions describe the language accepted by a finite automaton.

Languages Associated with RE: Let Σ be a finite alphabet. Then the regular expressions over Σ denoting the regular sets are defined recursively as follows,
1. ∅ is a regular expression denoting the empty set.
2. ε is a regular expression denoting the regular set {ε}.
3. For each a in Σ, a is a regular expression denoting the regular set {a}.
4. If P and Q are regular sets of languages L1 and L2, then
   (P + Q) is a regular expression denoting the set P ∪ Q
   PQ is a regular expression denoting the set PQ
   P* is a regular expression denoting the set P*.

Example: 0(0 + 1)*1 is a regular expression denoting the set of all strings over {0, 1} starting with 0 and ending with 1.
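As a small illustration of the example 0(0 + 1)*1, the membership test for its regular set can be written directly in C. This is only a sketch; the function name matches_0_any_1 is hypothetical, and the check is hand-written rather than derived from a finite automaton construction.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if s belongs to the set denoted by 0(0 + 1)*1, that is,
   s is a string over {0, 1} starting with 0 and ending with 1;
   returns 0 otherwise. */
int matches_0_any_1(const char *s)
{
    size_t n = strlen(s);

    /* The shortest member is "01": a leading 0 and a trailing 1. */
    if (n < 2 || s[0] != '0' || s[n - 1] != '1')
        return 0;

    /* Everything in between must come from the alphabet {0, 1}. */
    for (size_t i = 1; i + 1 < n; i++)
        if (s[i] != '0' && s[i] != '1')
            return 0;

    return 1;
}
```

For instance, "01" and "0101" are accepted, while "10", "0" and "0a1" are rejected.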
Q24. Write a transition diagram to recognize the token relop (corresponding to relational operators in C++ language).
Ans:
Transition Diagrams: Transition diagrams are the diagrams obtained in the process of transforming a regular expression pattern into a flow chart. These diagrams are a collection of nodes and edges, wherein nodes (circles) represent states, and edges labeled with a symbol or set of symbols show the direction from one state to another. Each state in the transition diagram corresponds to a condition in the input scanning process of a lexeme to be found that matches a regular expression pattern.

Figure: Transition diagram for the regular expression abc*

In a transition diagram, accepting states or final states are represented by a double circle. The final states represent that the required lexeme has been found. Final states include an action, typically returning a token and the corresponding attribute value to the parser.
In a transition diagram, as there are many transitions, there is a possibility of failure. If this happens, the forward pointer is retracted back to the start state and then the next transition diagram is activated.
In a transition diagram, an edge marked with start leads to the "start state" or "initial state". A transition diagram always begins from the start state, after which all the subsequent input symbols are read.

Examples: For instance, consider the transition diagram for "relop" identifying lexemes.

Figure: Transition diagram for relop, with accepting-state actions return(relop, EQ), return(relop, GT), return(relop, LE), return(relop, NE), return(relop, LT)

If the first character read in state q0 is =, then this one character must be the lexeme; state q5 is entered and its corresponding token and attribute are returned.
Finally, if the first character in q0 is <, then state q1 is entered. If in q1 the next character is =, then the lexeme is recognized as <= and an accepting state is entered, returning the token and attribute value. If <> is recognized, then an accepting state is entered and the token and attribute value are returned. In state q1, if any other character is read, it does not match any relop, so the lexeme is <. The accepting state is therefore marked with * and the forward pointer is retracted.
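The relop transition diagram can be coded as nested tests on the current and next character. The sketch below is an assumption-laden illustration: the name scan_relop and the enum values are hypothetical, and the branches for > and >= follow the usual textbook diagram because the prose above only spells out the =, <, <= and <> cases.

```c
/* Attribute values returned with the token relop (hypothetical names). */
enum relop { RELOP_NONE, LT, LE, EQ, NE, GT, GE };

/* Scans a relational operator at the start of the NUL-terminated
   string s. Stores the number of characters consumed in *len, which
   models how far the forward pointer ends up after any retraction. */
enum relop scan_relop(const char *s, int *len)
{
    switch (s[0]) {                 /* start state */
    case '<':
        if (s[1] == '=') { *len = 2; return LE; }
        if (s[1] == '>') { *len = 2; return NE; }
        *len = 1;                   /* "other": retract, lexeme is "<" */
        return LT;
    case '=':
        *len = 1;
        return EQ;
    case '>':
        if (s[1] == '=') { *len = 2; return GE; }
        *len = 1;                   /* "other": retract, lexeme is ">" */
        return GT;
    default:
        *len = 0;                   /* failure: no relop starts here */
        return RELOP_NONE;
    }
}
```

On input "<a" the function returns LT with a length of 1, which corresponds to the starred accepting state where the forward pointer is retracted by one character.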
For example, if, then and else are keywords, but the scanner treats them as identifiers.
install_id returns the attribute value, i.e., a pointer to the table entry. If the lexeme is not found in the symbol table, then it is placed in the symbol table and a pointer to the table entry is returned. The function of gettoken is to return the token corresponding to the lexeme found, i.e., either identifier or keyword.

Lexical Analyzer Generator Lex: The LEX program is a tool that is used to generate a lexical analyzer or scanner. The tool is often referred to as the lex compiler, and its input contains manifest constants and regular definitions.
A manifest constant is an identifier that is used to represent a constant.
Example: In C, manifest constants are declared as,
    #define MAX 10
Each rule in the Lex program has the form,
    RE    method
Here, RE is the regular expression and method is the program that defines the steps to be taken whenever the lexical analyzer finds a lexeme that matches the pattern.

Figure (3): Lex Format to Create a Scanner (Lex program Lexgen.l -> Lex compiler -> Lexgen.yy.C -> C compiler -> O.out; input -> O.out -> series of tokens)

The lex compiler takes a Lex program Lexgen.l as input and generates Lexgen.yy.C, which is a C program. Then this C program (Lexgen.yy.C) is compiled through the C compiler to generate an object code file O.out. To O.out we can pass the strings or input to be scanned, that is, the string for which we want tokens to be generated. Hence, O.out becomes the scanner or lexical analyzer.

Q27. Explain how a LEX program performs lexical analysis, with one example for the following patterns in 'C': identifier, comments, numerical constants, arithmetic operators.
Ans:
Lex input,
    digit          [0-9]
    num            {digit}+ (E (+ | -)? {digit}+)?
    comments       /* (word | digit | white space)* */
    %%
    {comments}     { /* No action is taken, no value is returned */ }
    {white space}  { /* No action performed and no return value */ }
    {identifier}   { yylvalue = install_iden( );
                     return(IDEN); }
    {num}          { yylvalue = install_number( );
                     return(NUM); }
    {aop}          { return(AOP); }
    %%
install_iden()
install_number()
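The text names these auxiliary procedures without defining them. Following the earlier description of install_id and gettoken, a minimal C sketch might look as below. Everything here is an assumption made for illustration: the table layout, the sizes MAX_ENTRIES and MAX_NAME, the token codes KEYWORD and IDEN, and the choice to pre-load the keywords if, then and else into the same table.

```c
#include <string.h>

#define MAX_ENTRIES 64   /* assumed table capacity */
#define MAX_NAME    32   /* assumed maximum lexeme length, including NUL */

enum { KEYWORD, IDEN };  /* hypothetical token codes */

struct entry {
    char name[MAX_NAME];
    int token;
};

/* Keywords are entered first, so a later lookup of "if" finds a
   KEYWORD entry instead of creating a new identifier. */
static struct entry table[MAX_ENTRIES] = {
    { "if", KEYWORD }, { "then", KEYWORD }, { "else", KEYWORD }
};
static int nentries = 3;

/* Returns the index of lexeme in the table. If the lexeme is not
   found, it is placed in the table as an identifier first.
   Returns -1 if the table is full or the lexeme is too long. */
int install_id(const char *lexeme)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].name, lexeme) == 0)
            return i;
    if (nentries == MAX_ENTRIES || strlen(lexeme) >= MAX_NAME)
        return -1;
    strcpy(table[nentries].name, lexeme);
    table[nentries].token = IDEN;
    return nentries++;
}

/* Returns the token (KEYWORD or IDEN) stored for a table entry. */
int gettoken(int index)
{
    return table[index].token;
}
```

With this layout, install_id("count") creates a new identifier entry on the first call and returns the same index on later calls, while gettoken(install_id("if")) reports a keyword.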