
UNIT 1: Introduction, Lexical Analysis

PART A
SHORT QUESTIONS WITH ANSWERS

Q1. Define Language Processor and Language Processing System.

Ans:

Language Processor: A language processor is a program that translates programs written in programming languages such as FORTRAN and COBOL into a machine-understandable form.

Language Processing System: A language processing system is a system that translates the source language, which is taken as input, into machine language. This translation is done by passing the source file through a series of modules. These modules are as follows:
1. Preprocessor
2. Compiler
3. Assembler
4. Linker/loader.
Q2. Define compiler and interpreter.

Ans:

Compiler: A compiler is a program that converts a source program into a target program.

    Source program --> [Compiler] --> Target program
    Figure: Compiler

Interpreter: An interpreter is also a program; it reads the source program and executes it line by line.

    Source program, input data --> [Interpreter] --> Output
    Figure: Interpreter
Q3. Define linker and loader and explain briefly.

Ans:

Linker: A linker is a program that links two or more object files containing compiled or assembled code to form a single file which can be directly executed. It is also responsible for performing the following functions:
1. Linking the object program with the code for standard library functions.
2. Linking the object program with the resources provided by the operating system, such as memory allocators and input and output devices.

Loader: A loader is a program which loads or resolves all the code (relocatable code) whose principal memory references have undetermined initial locations anywhere in memory. It resolves these references with respect to a given base or initial address.

Q4. Define bootstrapping.

Ans: If a language provides certain facilities to compile the language itself, then those facilities are referred to as bootstrapping.
(i) Bootstrapping helps in the creation of compilers.
(ii) It also helps in making compilers portable.
The structure of a compiler is very complicated. A compiler is not a piece of hardware; rather, it is itself a program, and to write a program we need a programming language. Hence, compilers are written in simpler languages. For example, C compilers are programmed in the C language.

Q5. List the phases of compiler.

Ans: The six phases of a compiler are:
1. Lexical analyzer
2. Syntax analyzer
3. Semantic analyzer
4. Intermediate code generator
5. Code optimizer
6. Code generator.

Q6. Differentiate between a single pass and a multipass compiler.

Ans:
1. A single pass compiler scans the entire source code only once; a multipass compiler scans the source code several times.
2. Execution time of the code for a single pass translator is less; execution time of the code for a multipass translator is more.
3. Debugging of the code translated by a single pass compiler is difficult; debugging of the code translated by a multipass compiler is easy.
4. The memory requirement to design a single pass compiler is more; the memory requirement of a multipass compiler is relatively less.
5. The code generated by a single pass compiler is less efficient; the code generated by a multipass compiler is more efficient.
6. In a single pass compiler, going back to previously read source code is not allowed; in a multipass compiler, backtracking to previously scanned source code is allowed.
7. A single pass compiler is also called a narrow compiler; a multipass compiler is also called a wide compiler.
8. Pascal and C use a single pass compiler; Java requires a multipass compiler.

Q7. What is input buffering? How is input buffering implemented?

Ans:
Input Buffering: The only phase of a compiler that reads or scans the source program is the scanner or lexical analyzer. The scanning is carried out on a character-by-character basis, thus buffering is a crucial step in making the lexical analyzer cope with the speed of the other phases.
Input Buffer Scheme: Buffering is used in deciding whether a pattern forms a valid token or not. The lexical analyzer reads the input characters from the buffer. The input buffer scheme is one of the buffering techniques. In this technique, the entire buffer is partitioned into two halves, each capable of holding C characters.

Q8. What are the features of a lexical analyzer?

Ans: The features of a lexical analyzer are as follows:
(i) It removes white space and comments.
(ii) It generates a series of tokens.
(iii) It maintains line numbers.
(iv) It creates a symbol table, which is responsible for storing information regarding identifiers and constants identified in the input.
(v) It reports errors identified during the generation of tokens.
(vi) Being the first phase of a compiler, it reads or scans the input to the compiler and generates the stream of tokens.

Q9. What are the two parts of compilation? Explain briefly.

Ans: The following are the two phases of compilation:
1. Analysis phase
2. Synthesis phase.
A diagrammatic representation of this model is shown below.

    Source program --> [Analysis phase: lexical analysis, syntax analysis, semantic analysis] --> intermediate code --> [Synthesis phase: processing] --> Target program
    Figure: Analysis-Synthesis Model

1. Analysis Phase: This phase performs lexical analysis, syntax analysis and semantic analysis on the source program and produces an intermediate representation of it.
2. Synthesis Phase: In this phase, the intermediate code generated by the previous phase is processed to generate the target program. This processing of the source program may involve optimization for better execution.
1.2.1 The Role of Lexical Analyzer
Q15. What are the functions of lexical analyzer?

(or)
Explain the role of lexical analyzer with its implementations.

Ans:

Lexical Analyzer: Lexical analysis is the process of reading the input string, identifying tokens, deleting white spaces, and locating, repairing and reporting errors, if there are any.

Lexical analysis is the first phase of a compiler and is implemented by the scanner:

    Input string --> [Scanner] --token--> [Parser]
                     [Scanner] <--next_token()-- [Parser]
    (Lexical analysis is processed in the scanner.)

The scanner of a lexical analyzer is responsible for:
1. Removal of white spaces and comments
2. Identifying
    (i) Constants
    (ii) Identifiers
    (iii) Keywords
3. Generating tokens
4. Reporting about errors.

1. Removal of White Spaces and Comments: Whenever the scanner encounters a blank, tab or newline character, it blindly deletes it. The reason is that the parser need not worry about white spaces, as it is itself a complex phase; the parser cannot have regular expressions for white spaces. Even comments can be ignored by the parser, as they will be deleted by the scanner while generating tokens.

2. Identifying
    (i) Constants: Whenever the scanner encounters a constant, it makes an entry into the symbol table and returns the pointer.
    (ii) Identifiers: An identifier can be the name of a function, a variable, etc. The grammar of any language will consider an identifier as a token.
        Example: The statement in C,
            a = a + 1
        would be considered by the lexical analyzer as,
            iden = iden + number
        wherein iden is one token, and =, + and number are also treated as single tokens.
    (iii) Keywords: A keyword in any language follows all the rules that are imposed on an identifier. The point to be noted is that keywords should be made reserved identifiers so that the scanner does not get confused between an identifier and a keyword.

3. Generating Tokens: Whenever the scanner identifies a valid token, it generates a tuple of the form,
        <a1, a2>
    where a1 is a valid lexeme (identifier, number, keyword etc.) and a2 is the pointer to the symbol table entry for a1.
    The token generation of the lexical analysis phase is initiated by the parser by using a function, namely next_token(). After receiving a signal from the parser, the scanner starts reading the input string till it finds a valid token.

4. Reporting about Errors: The lexical analyzer in some compilers is responsible for making a copy of the source program and marking errors in it. The scanner should be careful about duplication of a single error. However, if the scanner encounters an error, it invokes a procedure in the error handler (part of the compiler).

Lexical analysis is completely dependent on the source language and independent of the machine on which the program or source language is being compiled.
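To make the next_token() handshake and the <token, attribute> tuple concrete, here is a minimal compilable sketch in C. The token codes, the fixed-size symbol table and the install_id() helper are illustrative assumptions made only for this sketch; a real scanner would be driven by the full grammar of the source language.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical token codes; a real compiler derives these from its grammar. */
    enum { TOK_EOF = 0, TOK_IDEN, TOK_NUMBER, TOK_ASSIGN, TOK_PLUS };

    typedef struct {
        int type;        /* token name, e.g. TOK_IDEN                */
        int attr;        /* attribute value, e.g. symbol-table index */
    } Token;

    static char symtab[64][32];   /* toy symbol table: fixed-size name slots */
    static int  symcount = 0;

    /* install_id(): look the lexeme up, inserting it if absent; return its index. */
    static int install_id(const char *lexeme) {
        for (int i = 0; i < symcount; i++)
            if (strcmp(symtab[i], lexeme) == 0) return i;
        strcpy(symtab[symcount], lexeme);
        return symcount++;
    }

    /* next_token(): called by the parser; skips blanks and returns one <type, attr> tuple. */
    Token next_token(const char **src) {
        const char *p = *src;
        while (*p == ' ' || *p == '\t' || *p == '\n') p++;   /* delete white space */

        Token t = { TOK_EOF, 0 };
        if (isalpha((unsigned char)*p)) {                    /* identifier lexeme */
            char buf[32]; int n = 0;
            while (isalnum((unsigned char)*p) && n < 31) buf[n++] = *p++;
            buf[n] = '\0';
            t.type = TOK_IDEN;
            t.attr = install_id(buf);                        /* index into the symbol table */
        } else if (isdigit((unsigned char)*p)) {             /* numeric constant */
            t.type = TOK_NUMBER;
            while (isdigit((unsigned char)*p)) { t.attr = t.attr * 10 + (*p - '0'); p++; }
        } else if (*p == '=') { t.type = TOK_ASSIGN; p++; }
        else if (*p == '+')   { t.type = TOK_PLUS;   p++; }

        *src = p;                                            /* advance the shared input pointer */
        return t;
    }

    int main(void) {
        const char *input = "a = a + 1";
        for (Token t = next_token(&input); t.type != TOK_EOF; t = next_token(&input))
            printf("<%d, %d>\n", t.type, t.attr);            /* stream of <token, attribute> tuples */
        return 0;
    }

Running it on the statement a = a + 1 from the example above prints one <token, attribute> pair per lexeme.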

Q16. Explain why the analysis portion of a compiler is separated into lexical analysis and parsing phases.

Ans: The lexical analyzer, being the first phase of a compiler, reads or scans the input to the compiler and divides the input into a number of small parts of a string (tokens).

    Input to compiler --> [Lexical analyzer] <--> [Symbol table]
    [Lexical analyzer] --next token--> [Parser] --> Output (stream of tokens)
    Figure: General Representation of Lexical Analyzer

The above figure clearly shows that the lexical analyzer acts only when invoked by the parser. There are a number of reasons for which the lexical analyzer is separated from the syntax analyzer. Some of them are listed below.

Reasons
1. The division of the lexical analyzer from the syntax analyzer improves the efficiency of the compiler, because each phase is given its own responsibilities. "The fewer the tasks, the higher the efficiency" is the idea behind the separation. If one phase had to both read the string and check the tokens, the time consumed would be more and the task might not be performed efficiently.
2. By dividing the lexical analyzer and the syntax analyzer into two phases, the design of a compiler is made much easier. If these phases were put together, then all the functions that are to be performed by the lexical analyzer would have to be performed by the syntax analyzer. Of all the functions performed by the lexical analyzer, the most tedious is removing the comments, white spaces and newline characters from the input string. The syntax analyzer is in itself a complicated phase (in terms of design) which performs syntax checks according to the grammar of the given language; having such a complicated phase also read the input string and remove spaces would be absurd. Hence, the division of these two phases makes the design of each phase, and in turn the design of the compiler, easy.
3. Token generation, and then storing the input token with its attribute in the symbol table, is not an easy task. The lexical analyzer not only does token generation but also makes a copy of the source program and links with it the number of errors, if any. For example, consider the following program in C language,

    #include <stdio.h>
    void main()
    {
        printf("Hello");
    }

The lexical analyzer reads the input one character at a time, forms tokens with their attributes and stores them in the symbol table. When it encounters an invalid character it shows an error message, since in some compilers the lexical analyzer is responsible for this checking.

Let us take an example of how a lexical analyzer divides a string into tokens, assigns attributes and stores them in the symbol table. Consider the input string,

    A = B + C

The lexical analyzer forms tokens: any name, such as A, B or student, is an identifier, and operator tokens like add_op, mult_op and div_op are used whenever such an operator is encountered. The lexical analyzer makes entries in the symbol table, and the tokens are checked by the syntax analyzer:

    id, value = A
    assign_op
    id, value = B
    add_op
    id, value = C

The format of the token given to the parser is,

    <id, pointer to the symbol table entry for the value>   for identifiers
    <type_op>   for operators, where 'type' can be any mathematical operator.

Generating tokens for the string A = B + C gives,

    <id, pointer to symbol table entry for A>
    <assign_op>
    <id, pointer to symbol table entry for B>
    <add_op>
    <id, pointer to symbol table entry for C>

Hence, the division makes compiler design easy, increases efficiency and reduces compilation time.

Moreover, the division also improves the portability of the compiler. For example, if a special character used in Pascal has no representation on a particular device, that character can be isolated; by making the lexical analyzer delete non-standard characters, the compiler becomes free of language- or device-specific restrictions. The parser is not even aware of any such restrictions and works normally irrespective of any device-specific constraints.
Q17. Define lexeme, token and pattern. Identify the lexemes that make up the tokens in the following program segment and indicate the corresponding token and pattern:

    void swap(int i, int j)
    {
        int t;
        t = i;
        i = j;
        j = t;
    }

Ans:

Tokens: A token is a sequence of characters that can be treated as a lexical unit in the programming language. It consists of two parts, namely a token name and an attribute value. Examples of tokens in a programming language include identifiers, keywords, punctuation and operators.

Patterns: A pattern is a rule describing the form of a token. In a programming language, a pattern is used to determine whether a token is valid or not. Regular expressions are important notations for specifying patterns. For example, the pattern for a keyword token is the sequence of characters that form the keyword, and for the token identifier (id) the pattern is id -> letter (letter | digit)*.

Lexemes: A lexeme is the smallest logical part of a source program that is matched by the pattern for a token. The lexical analyzer recognizes a lexeme as an instance of a token. For instance, the token RELOP contains lexemes like <, <=, >, >=, ==, !=.

Consider the following C statement to better understand the difference between tokens, patterns and lexemes:

    printf("sum = %d\n", average);

In the above statement, sum and average are lexemes that match the pattern for the token identifier, and "sum = %d\n", which is enclosed between quotes, is a lexeme matching the pattern for the token literal.

The table below shows some tokens, their corresponding patterns and lexemes.

    Token        Pattern                                 Lexemes
    identifier   letter followed by letters or digits   sum, average
    number       any numeric constant                   0, 2.14, 3.26
    while        characters w, h, i, l, e               while
    RELOP        < or <= or > or >= or == or !=         <, !=
    else         characters e, l, s, e                  else

Program: The given program segment is,

    void swap(int i, int j)
    {
        int t;
        t = i;
        i = j;
        j = t;
    }

The table below presents the identified lexemes as well as the corresponding tokens and patterns.

    Lexeme   Token        Pattern
    void     keyword      characters v, o, i, d
    swap     identifier   letter followed by letters or digits
    (        operator     (
    int      keyword      characters i, n, t
    i        identifier   letter followed by letters or digits
    ,        operator     ,
    int      keyword      characters i, n, t
    j        identifier   letter followed by letters or digits
    )        operator     )
    {        operator     {
    int      keyword      characters i, n, t
    t        identifier   letter followed by letters or digits
    ;        operator     ;
    t        identifier   letter followed by letters or digits
    =        operator     =
    i        identifier   letter followed by letters or digits
    ;        operator     ;
    i        identifier   letter followed by letters or digits
    =        operator     =
    j        identifier   letter followed by letters or digits
    ;        operator     ;
    j        identifier   letter followed by letters or digits
    =        operator     =
    t        identifier   letter followed by letters or digits
    ;        operator     ;
    }        operator     }
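As a small illustration of patterns versus lexemes, the sketch below uses the POSIX regex API (regcomp/regexec) to test a few candidate lexemes against the identifier pattern letter (letter | digit)*. This is only a demonstration of the pattern idea; production lexical analyzers are hand-written or generated rather than built on a general regex library.

    #include <regex.h>
    #include <stdio.h>

    /* Check a few candidate lexemes against the identifier pattern
     * letter (letter | digit)*, written here as a POSIX extended regex. */
    int main(void) {
        const char *candidates[] = { "swap", "t", "2day", "int" };
        regex_t id_pattern;

        if (regcomp(&id_pattern, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED) != 0)
            return 1;

        for (int i = 0; i < 4; i++) {
            int matches = (regexec(&id_pattern, candidates[i], 0, NULL, 0) == 0);
            /* Note that "int" matches the identifier pattern too: keywords look
             * like identifiers and must be filtered against a reserved-word list. */
            printf("%-5s %s\n", candidates[i],
                   matches ? "matches identifier pattern" : "no match");
        }
        regfree(&id_pattern);
        return 0;
    }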
Q18. Describe lexical errors and various error recovery strategies, with examples.

Ans: Errors can be encountered in each phase of compilation. Some of them are listed below.

Lexical Errors: A lexical analyzer reads the input string and generates tokens. While doing so, it may encounter the following types of errors:
1. A misspelled identifier.
2. An identifier longer than the specified or defined length.
These errors can be caused by the following:
(i) An extra character beyond the specified or defined length
(ii) An invalid character
(iii) A missing character
(iv) Swapped or misplaced characters.
The majority of errors that can be identified during the lexical analysis phase of a compiler are misspelled variables or identifiers. The lexical analyzer or scanner, without the grammar of a language, cannot do more than this. However, while making a copy of the source program, it not only marks errors but also provides the line number in which each error has occurred.
Example: Let us consider a statement in C language,

    printfg("Hello"):

In the above statement there are a number of errors. printfg is not a valid function in C language; however, it is a valid identifier, so the scanner cannot detect this error. It considers printfg to be a valid identifier, and the error will be detected only during the syntax analysis phase. In addition, a quotation mark may be missing after the "(", and the statement ends with a colon instead of a semicolon.

Role of the Lexical Analyzer: For the answer, refer to Unit-1, Q15.

Error Handler of the Lexical Analyzer: The error handler of the lexical analyzer is shown below.

    Input --> [Scanner] --token--> [Parser]
              [Scanner] <--next_token()-- [Parser]
    The scanner also interacts with the error handler and the symbol table.
    Figure: Error Handler

A compiler not only detects and reports errors but also tries to handle errors to some extent. This, however, is a very expensive process in terms of efficiency and speed, as the compiler cannot repair an error based on what the programmer actually needs. Rather, it can try to repair to the extent possible and give a message to the programmer about the repair as well. For example, if an identifier $name is declared, which is an invalid identifier in C language, the lexical analyzer can delete the first character and retain the rest. In this scenario the repair is very quick and easy; in real scenarios, however, this is not the situation. Errors can be very ambiguous and hard to repair, and in such situations the scanner can repair to the possible extent and leave the rest to the programmer.

Repairing an Error: Numeric errors can be handled in an easy manner. The scanner reads a numeric constant, and if it exceeds the defined length, then the extra digits can be truncated and the rest retained. The truncation is done from the place where the constant exceeds the defined length. For example, an integer can be of length 2 bytes; if the constant exceeds this size, the scanner will truncate the exceeded portion and inform the programmer.

In the case of comments, if an extra character appears outside the comment section or boundary, the character will be deleted. If an illegal or invalid character appears, it can likewise be deleted.

If the scanner is not able to repair an error and is unable to proceed further, it should not get stuck at the error; rather, it can delete characters until it gets a valid token that can be passed to the parser. This type of error recovery is called panic mode recovery.
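A minimal sketch of the panic mode recovery just described: unrecognized characters are deleted until a character that could start a valid token is reached. The set of synchronizing characters used here is an illustrative assumption, not a fixed rule.

    #include <ctype.h>
    #include <stdio.h>

    /* Skip characters until one that can legally start a token is found.
     * The set of "synchronizing" start characters below (letters, digits,
     * and a few operators) is a made-up subset chosen only for illustration. */
    const char *panic_mode_recover(const char *p, int *deleted) {
        *deleted = 0;
        while (*p != '\0' &&
               !isalnum((unsigned char)*p) &&
               *p != '=' && *p != '+' && *p != ';' &&
               *p != '(' && *p != ')') {
            p++;               /* delete the offending character */
            (*deleted)++;
        }
        return p;              /* points at the next possible token start */
    }

    int main(void) {
        int n;
        const char *rest = panic_mode_recover("@#$ count = 1;", &n);
        printf("deleted %d characters, resuming at: \"%s\"\n", n, rest);
        return 0;
    }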
Q19. Consider the following conditional statement:

    if (x > 3) then y = 5 else y = 10;

How does the lexical analyzer help the above statement in the process of compilation?

Ans: Given that,

    if (x > 3) then y = 5 else y = 10;

The lexical analyzer, being the first phase of a compiler, reads or scans the input to the compiler and divides the input into a series of tokens such as identifiers, keywords, operators and constants, as follows:

    Lexeme   Token
    if       keyword
    (        operator
    x        identifier
    >        operator
    3        constant
    )        operator
    then     keyword
    y        identifier
    =        operator
    5        constant
    else     keyword
    y        identifier
    =        operator
    10       constant
    ;        operator

Consider an encoding of tokens as follows:

    Token   Code        Token                 Code
    if      1           identifier            6
    then    2           constant              7
    else    3           relational operator   8
    while   4           parenthesis           9
    for     5           assignment operator   10

The symbol table for the identifiers and constants is,

    Location Counter   Type         Value
    100                identifier   x
    101                identifier   y
    103                constant     3
    104                constant     5
    105                constant     10

For the given conditional statement, the stream of tokens in encoded form generated by the lexical analyzer is,

    1, (9, 1), (6, 100), (8, 3), (7, 103), (9, 2), 2, (6, 101), (10, 1), (7, 104), 3, (6, 101), (10, 1), (7, 105)

In the above encoding, the value 1 at the beginning of the token stream represents the code for the "if" keyword. Then comes the pair (9, 1), where 9 represents the code for a parenthesis and 1 represents the opening parenthesis "(". Then comes the pair (6, 100), where 6 represents the code for an identifier and 100 represents the location of the identifier x in the symbol table. Similarly, in the pair (8, 3), 8 represents a relational operator and 3 represents the greater-than (>) symbol. This process is continued till the end of the conditional statement; as a result, the above encoded form is obtained.
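The encoded stream can be reproduced mechanically from the tables above. The short sketch below hard-codes the token codes and symbol-table locations assumed in this answer and simply prints the encoded form; it is a restatement of the example, not a general tokenizer.

    #include <stdio.h>

    /* Token codes as assumed in the encoding table above. */
    enum { IF = 1, THEN, ELSE, WHILE, FOR, ID, CONST, RELOP, PAREN, ASSIGN };

    typedef struct { int code; int value; } Enc;   /* value: symbol-table location or operator kind */

    int main(void) {
        /* Encoded stream for: if (x > 3) then y = 5 else y = 10; */
        Enc stream[] = {
            { IF, 0 },                 /* keywords carry no attribute    */
            { PAREN, 1 },              /* 1 = opening parenthesis        */
            { ID, 100 },               /* x at symbol-table location 100 */
            { RELOP, 3 },              /* 3 = greater-than               */
            { CONST, 103 },            /* constant 3 at location 103     */
            { PAREN, 2 },              /* 2 = closing parenthesis        */
            { THEN, 0 },
            { ID, 101 }, { ASSIGN, 1 }, { CONST, 104 },
            { ELSE, 0 },
            { ID, 101 }, { ASSIGN, 1 }, { CONST, 105 },
        };
        int n = (int)(sizeof stream / sizeof stream[0]);
        for (int i = 0; i < n; i++) {
            if (stream[i].code <= FOR)                 /* bare keyword codes */
                printf("%d ", stream[i].code);
            else
                printf("(%d,%d) ", stream[i].code, stream[i].value);
        }
        printf("\n");
        return 0;
    }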
1.2.2 Input Buffering

Q20. What is meant by input buffering? Explain the use of sentinels in recognizing tokens.
(or)
Explain the concept of input buffering.

Ans:

Input Buffering: The only phase of a compiler that reads or scans the source program is the scanner or lexical analyzer. The scanning is carried out on a character-by-character basis, thus buffering is a crucial step in making the lexical analyzer cope with the speed of the other phases.

Input Buffer Scheme: Buffering is used in deciding whether a pattern forms a valid token or not. The lexical analyzer reads the input characters from the buffer. The input buffer scheme is one of the buffering techniques. In this technique, the entire buffer is partitioned into two halves, each capable of holding C characters, as shown in the figure below.

    [ 1st half | 2nd half ]
      ^lex_beg   ^f
    Figure: Two-halves Input Buffer

A single read command is used to read C characters into one of the halves. A special character 'eof' is read into the buffer if the number of characters in the input is less than C; that is, 'eof' indicates the end of the source file. The technique uses two pointers, lex_beg and f (the forward pointer). The sequence of characters between these pointers gives the current lexeme. In the beginning, lex_beg and f point to the same first character. Then f is incremented till a lexeme is found. Once a lexeme is found, f is set to point to the next character. When the current lexeme is processed, both pointers are set to point to the first character of the next lexeme.

If the pointer f has reached the end of the first half of the input buffer, i.e., it is about to point to the first character of the second half, then the second half is immediately filled with C new characters. Similarly, when f is about to move out of the second half, the left or first half is filled with C new characters and f is made to point to the first character of the first half.

The input buffer scheme involves moving the forward pointer f. The algorithm for advancing the forward pointer f is as follows.

Algorithm: Advance_Forward_Pointer
Step 1
    Check whether f is at the end of the first half. If yes, then go to step (2), else go to step (3).
Step 2
    Refill the second half and increment f by 1, i.e., make f = f + 1.
Step 3
    Check whether f is at the end of the second half. If yes, go to step (4), else go to step (5).
Step 4
    Refill the first half and make f point to the first character of the first half.
Step 5
    Increment f by 1, i.e., make f = f + 1.

The two-halves buffering scheme works well for limited lookahead, i.e., it can recognize a token only if the characters making up the token fit into the same buffer. Programs written in PL/1 are more challenging for this technique, as the lookahead can be too large. For example, in the statement,

    DECLARE(A1, A2, ..., Ai)

DECLARE can either mean an array name or a keyword. To decide which one, f has to be moved till the character next to the right parenthesis is reached. Moreover, for every increment of the pointer f we are required to perform two tests.

Buffering with Sentinels: The inefficiency of performing two tests for each increment of the pointer f in the two-halves input buffer scheme can be eliminated by using sentinels. A sentinel is a special character, usually eof. Sentinels are neither part of the source program nor counted in the number of characters in the buffer. At the end of each half, a sentinel (eof) is used to indicate the end of that half. Moreover, the end of the source program is also indicated by another eof; thus, three eofs are used when the first half is full and the second half is not completely full.

    [ 1st half ... eof | 2nd half ... eof ... eof ]
      ^lex_beg  ^f
    (end of first buffer half, end of second buffer half, end of file)
    Figure: Input Buffer with Sentinels
With this scheme, most of the time only one test is performed: whether f is pointing to an eof. Only when f reaches one of the eofs (at the end of the first half of the buffer, at the end of the second half, or at the end of the source program) are additional tests required. The improved algorithm for incrementing f using sentinels is as follows.

Algorithm: Improved_Advance_Forward_Pointer
Step 1
    Increment the forward pointer f by 1, i.e., f = f + 1.
Step 2
    Check whether f is pointing to a sentinel, i.e., eof. If yes, then go to step (3), else exit.
Step 3
    Check whether f is at the eof of the first half. If yes, then go to step (4), else go to step (5).
Step 4
    Refill the second half and increment f by 1.
Step 5
    Check whether f is pointing to the end of the second half. If yes, then go to step (6), else go to step (7).
Step 6
    Refill the first half and make f point to the first character of the first half.
Step 7
    The end of the source program has been reached; thus, terminate lexical analysis.
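A compilable sketch of the sentinel scheme is shown below. The buffer-half size C, the use of '\0' as the eof sentinel and the refill() helper are illustrative choices made for this sketch; the point it demonstrates is that the common case costs a single test (step 2), and the extra tests run only when a sentinel is actually reached.

    #include <stdio.h>

    #define C 16                     /* characters per buffer half (illustrative size) */
    #define EOF_CH '\0'              /* sentinel character standing in for "eof"       */

    static char buf[2 * C + 2];      /* two halves, each followed by a sentinel slot   */
    static const char *src;          /* the remaining source text to be buffered       */
    static char *f;                  /* forward pointer                                */

    /* Refill one half (0 = first, 1 = second) with up to C source characters
     * and terminate it with the sentinel. */
    static void refill(int half) {
        char *dst = buf + half * (C + 1);
        int n = 0;
        while (n < C && *src != '\0') dst[n++] = *src++;
        dst[n] = EOF_CH;             /* sentinel: end of half, or end of input */
    }

    /* Advance f by one character, following the sentinel-based algorithm:
     * only when f lands on a sentinel are the extra tests performed. */
    static int advance_forward(void) {
        f++;
        if (*f != EOF_CH) return 1;                  /* common case: one test only   */
        if (f == buf + C) {                          /* sentinel ending the 1st half */
            refill(1);
            f = buf + C + 1;
            return *f != EOF_CH;
        }
        if (f == buf + 2 * C + 1) {                  /* sentinel ending the 2nd half */
            refill(0);
            f = buf;
            return *f != EOF_CH;
        }
        return 0;                                    /* sentinel inside a half: real end of input */
    }

    int main(void) {
        src = "count = count + 12; /* longer than one buffer half */";
        refill(0);
        f = buf;
        do {
            putchar(*f);                             /* the scanner would examine *f here */
        } while (advance_forward());
        putchar('\n');
        return 0;
    }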
Q21. What is the functionality of preprocessing and input buffering?

Ans:

Preprocessing: Preprocessing is the functionality of a preprocessor (a program which is developed to process particular data into a desired form and whose output is given to another program, say a compiler). The output of the preprocessor is called the preprocessed form of the input data.

Functions of Preprocessing:
(i) It enables the use of macros in the program. Macros are sets of instructions which can be used as many times as required by the program.
(ii) It enables the use of header files in the program through the #include statement.
(iii) It enables the removal of comments in the input file prior to parsing.
The '.i' extension is given to the output of the preprocessor.

Example: Consider the following C program before preprocessing.

    # A sample input C file
    $ cat -n ex1.c
    1
    2   #define SUCCESS 0
    3
    4   /* The function prototype for printf, found in stdio.h */
    5   extern int printf(const char *, ...);
    6
    7   int main()
    8   {
    9       printf("Hello World\n");
    10
    11      /* returning 0 to the operating system */
    12      return (SUCCESS);
    13  }

    # Creating an executable from the sample input file using the GNU C compiler
    $ gcc -Wall ex1.c -o ex1
    # Invoking the executable
    $ ./ex1
    Hello World

    Figure: An Input C File

After preprocessing, the output of the preprocessor is,

    # The preprocessed output file ex1.i
    $ cat -n ex1.i
    1   # 1 "ex1.c"
    6   extern int printf(const char *, ...);
    7
    8   int main()
    9   {
    10      printf("Hello World\n");
    11
    12
    13      return (0);
    14  }

    Figure: Preprocessor Output

In the above output it can be seen that the comment lines have been removed.

Input Buffering: The only phase of a compiler that reads or scans the source program is the scanner or lexical analyzer. The scanning is carried out on a character-by-character basis, thus buffering is a crucial step in making the lexical analyzer cope with the speed of the other phases.

Q22. Define regular expression. Explain the properties of regular expressions.

Ans:

Regular Expression: A regular expression is a concise notation for denoting regular sets. A regular expression describes the language accepted by a finite automaton.

Properties of Regular Expressions: Let P and Q be regular expressions. The properties of P and Q are as follows,
1. L(P + Q) = L(P) ∪ L(Q)
2. L(P.Q) = L(P).L(Q)
3. L((P)) = L(P)
4. L(P*) = (L(P))*
5. L(P).ε = ε.L(P) = L(P).

Algebraic Identities: Two regular expressions A and B are said to be equivalent if both of them represent the same set of strings. Consider A, B and R1 as regular expressions. The identity rules of regular expressions are given below; these rules are useful in simplifying regular expressions.

    φ + R1 = R1
    φ R1 = R1 φ = φ
    ε R1 = R1 ε = R1
    ε* = ε and φ* = ε
    R1 + R1 = R1
    R1* R1* = R1*
    R1 R1* = R1* R1
    (R1*)* = R1*
    ε + R1 R1* = R1* and ε + R1* R1 = R1*
    ε + R1* = R1*
    (R1 + ε)* = R1*
    R1 (ε + R1)* = (ε + R1)* R1 = R1*
    R1* R1 + R1 = R1* R1
    (A + B) R1 = A R1 + B R1 and R1 (A + B) = R1 A + R1 B
    (A + B)* = (A* B*)* = (A* + B*)*
    (A B)* A = A (B A)*

Operations of Regular Expressions: Let P and Q be two regular expressions. The operations that can be performed on P and Q are as follows.

(a) Union (+): The union (+) operator is used to combine the elements of two sets. It has a lower precedence than the other two operators (i.e., * and .). The union of P and Q (i.e., P ∪ Q) represents the set {a ∪ b | a ∈ P and b ∈ Q}.
    Example: If P = {1, 2, 3} and Q = {1, 2, 3, 4, 5, 6}, then
    P ∪ Q = {1, 2, 3} ∪ {1, 2, 3, 4, 5, 6} = {1, 2, 3, 4, 5, 6}

(b) Concatenation (.): The concatenation or dot (.) operator is used to concatenate the elements of two sets. It has a precedence lower than the star (*) operator but higher than the union (+) operator. The concatenation of P and Q (i.e., P.Q) represents the set {ab | a ∈ P and b ∈ Q}.
    Example: If P = {1, 2, 3} and Q = {1, 2}, then
    P.Q = {1, 2, 3}.{1, 2} = {(1,1), (1,2), (2,1), (2,2), (3,1), (3,2)}

(c) Kleene Closure or Star Closure (*): The Kleene closure or star (*) operator is used to obtain the set of strings, including ε, over the given alphabet. It has the highest precedence of the three operators. The Kleene closure of P represents the smallest superset of P which holds all the strings (including ε) formed by the concatenation of zero or more strings in P. This superset is closed under string concatenation.
    Example: {a, bc}* = {ε, a, bc, abc, aa, bcbc, ...}

1.2.3 Specification of Tokens, Recognition of Tokens

Q23. Write short notes on token specification.

Ans: Tokens represent a categorized block of text. Tokens can be specified using the following:
1. Strings
2. Languages
3. Regular sets
4. Regular expressions.

1. Strings: A string is a finite sequence of symbols chosen from some alphabet. The set of strings over an alphabet Σ = {a, b} is denoted by Σ*, which is a set containing the empty string (ε) and all combinations of a and b:
    Σ* = {ε, a, b, ab, ba, ...}

2. Languages: The language denoted by L is a set of strings over an alphabet Σ. For example, English is a language in which words are strings over its alphabet. Similarly, any programming language like C is a language which contains programs as a subset of the strings formed over the alphabet of the language. However, such languages are difficult to specify. Moreover, the symbols φ and ε, and the empty set, are also languages. Some examples of languages are as follows:
(i) The set of strings consisting of an equal number of a's and b's: {ε, ab, aabb, abab, ...}
(ii) The empty language φ over any alphabet.
(iii) For any Σ, Σ itself is a language.

3. Regular Sets: Let Σ be a finite alphabet. Then the class containing sets of strings over Σ, called regular sets, is recursively defined as follows:
(a) The empty set φ is a regular set over Σ.
(b) {ε}, containing only the string of length 0, is a regular set over Σ.
(c) Each input symbol, say a in Σ, gives a regular set {a} over Σ.
(d) If P and Q are two regular sets over Σ, then
    P ∪ Q is a regular set over Σ,
    PQ is a regular set over Σ,
    P* is a regular set over Σ.
Examples:
1. Let Σ = {0}; then the set of strings {0, 00, 000, ...} is a regular set.
2. Let Σ = {0, 1}; then the set of strings {01, 10} is a regular set.
3. Let Σ = {a, b}; then the set of strings starting with a and ending with b is a regular set.

4. Regular Expressions: A regular expression is a concise notation for denoting regular sets. Regular expressions describe the language accepted by a finite automaton.
Languages Associated with Regular Expressions: Let Σ be a finite alphabet. Then the regular expressions over Σ, denoting regular sets, are defined recursively as follows:
1. φ is a regular expression denoting the empty regular set.
2. ε is a regular expression denoting the regular set {ε}.
3. a is a regular expression denoting the regular set {a}.
4. If P and Q are regular expressions denoting the regular sets (languages) Lp and Lq, then
    (P + Q) is a regular expression denoting the set Lp ∪ Lq,
    PQ is a regular expression denoting the set Lp Lq,
    P* is a regular expression denoting the set Lp*.
Example: 0(0 + 1)*1 is a regular expression denoting the set of all strings over {0, 1} starting with 0 and ending with 1.
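As a short worked illustration of the identity rules from Q22 (added here for clarity), consider simplifying the regular expression (ε + 0 0*) 1 over {0, 1}:

    ε + 0 0* = 0*                 (by the identity ε + R1 R1* = R1*)
    therefore (ε + 0 0*) 1 = 0* 1,

i.e., the set of all strings consisting of zero or more 0's followed by a single 1.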
Q24. Write a transition diagram to recognize the token relop (corresponding to the relational operators in C++ language).

Ans:

Transition Diagrams: Transition diagrams are the diagrams obtained in the process of transforming a regular expression pattern into a flowchart. These diagrams are a collection of nodes and edges, wherein the nodes (circles) represent states and the edges, labeled with a symbol or a set of symbols, show the direction from one state to another. Each state in the transition diagram corresponds to a condition in the input scanning process of a lexeme that matches a regular expression pattern.

For instance, consider the transition diagram for the regular expression a b c*.

    Figure: Transition Diagram

The conventions about transition diagrams are as follows:
(i) In a transition diagram, accepting states or final states are represented by a double circle. A final state represents that the required lexeme has been found; final states include an action, typically returning a token and the corresponding attribute value to the parser.
(ii) In a transition diagram, as there are many transitions, there is a possibility of failure. If this happens, then the forward pointer is retracted back to the start state and the next transition diagram is activated.
(iii) In a transition diagram, the edge marked with "start" leads to the "start state" or "initial state". A transition diagram always begins from the start state, after which all the subsequent input symbols are read.

Example: Consider the transition diagram for "relop", identifying the lexemes <, <=, <>, =, >, >=.

The transition diagram begins at the start state q0.
1. If the first input symbol read is >, then only the lexemes >= and > can be recognized. The diagram reads > and enters the next state. If the following character is =, the lexeme >= is recognized and a final state is entered, where the token and the corresponding attribute value are returned. If the following character is anything else (for example > or <, giving patterns like >> or >< which do not match any relop), a * is attached to the final state and the forward pointer is retracted one position, i.e., the lexeme does not include the symbol that got us to the final state.
2. If the first character read in q0 is =, then this one character must be the lexeme; a final state is entered and its corresponding token and attribute are returned.
3. Finally, if the first character read in q0 is <, another state is entered. If the next character is =, the lexeme is recognized as <= and a final state is entered, returning the token and attribute value. If <> is recognized, the corresponding final state is taken and the attribute value is returned. If any other character, which does not match any relop, is read, a * is again attached and the forward pointer is retracted.

    Figure: Transition Diagram for Relop
    (final states return (relop, LT), (relop, LE), (relop, NE), (relop, EQ), (relop, GT) and (relop, GE))
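The relop transition diagram above can be transcribed almost directly into code. The following C sketch is one such transcription; the states are implicit in the control flow, the attribute names follow the figure, and the retract flag marks the starred final states where the forward pointer must be moved back one position.

    #include <stdio.h>

    /* Attribute values for the relop token, mirroring the figure. */
    typedef enum { LT, LE, NE, EQ, GT, GE, NOT_RELOP } RelopAttr;

    /* p points at the current input character (the forward pointer).  On success
     * the attribute is stored in *attr; *retract is set when a starred final
     * state is reached, meaning the last character read is not part of the lexeme. */
    int recognize_relop(const char *p, RelopAttr *attr, int *retract) {
        *retract = 0;
        switch (*p) {
        case '<':                          /* state reached on reading '<' */
            if (p[1] == '=') { *attr = LE; return 1; }
            if (p[1] == '>') { *attr = NE; return 1; }
            *attr = LT; *retract = 1;      /* other character: retract one position */
            return 1;
        case '=':                          /* '=' alone is the lexeme */
            *attr = EQ; return 1;
        case '>':                          /* state reached on reading '>' */
            if (p[1] == '=') { *attr = GE; return 1; }
            *attr = GT; *retract = 1;      /* other character: retract one position */
            return 1;
        default:
            *attr = NOT_RELOP; return 0;
        }
    }

    int main(void) {
        const char *samples[] = { "<=", "<>", "<a", "=", ">=", ">b" };
        const char *names[] = { "LT", "LE", "NE", "EQ", "GT", "GE" };
        for (int i = 0; i < 6; i++) {
            RelopAttr a; int r;
            if (recognize_relop(samples[i], &a, &r))
                printf("%-3s -> (relop, %s)%s\n", samples[i], names[a], r ? " [retract]" : "");
        }
        return 0;
    }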
Q25. What are reserved words and identifiers? Explain how they are recognized, with an example.

Ans: An identifier can be the name of a function, a variable, a constant, etc. A keyword in any language follows all the rules imposed on an identifier, so in lexical analysis keywords which are not made reserved are recognized as identifiers. Keywords should therefore be made reserved identifiers so that the scanner does not treat them as ordinary identifiers. For example, if, then and else are keywords, but the scanner would otherwise treat them as identifiers.

    start --letter--> (letter | digit)* --other--> return (gettoken(), install_id())
    Figure: Transition Diagram for Identifiers and Keywords

The above figure is a transition diagram for searching an identifier lexeme. The diagram also recognizes keywords such as if, then and else, as they look like identifiers. Keywords which look like identifiers can be handled using two methods.

1. Initially, a symbol table is created in which information about the keywords is stored. Two return values, obtained by calling install_id() and gettoken(), are used to obtain the attribute value and the token respectively. The functionality of install_id() is to search for a lexeme in the symbol table and make its entry. If the lexeme present in the symbol table is a keyword, then install_id() returns 0; if the lexeme found is a variable, install_id() returns the attribute value, i.e., a pointer to the table entry. If the lexeme is not found in the symbol table, it is placed in the symbol table and a pointer to the new table entry is returned. The function of gettoken() is to return the corresponding token, i.e., either identifier or keyword, for the lexeme found.

2. In the second technique, a transition diagram is created for each keyword. In this transition diagram, a test for "non letter or digit" is applied. For example, consider the transition diagram for the keyword "ELSE":

    start --E--> --L--> --S--> --E--> --non letter/digit--> accept
    Figure: Transition Diagram for Keyword Else

As seen in the above transition diagram, it ends with a non-letter-or-digit transition. This means that after the characters of the keyword, the next character must be neither a letter nor a digit. It is compulsory to check whether the identifier has ended; otherwise the token ELSE would be wrongly returned for a longer lexeme that merely begins with "else".
witha state non lett
be
either a digit ora non letter
can not be
character. It is compulsory to check
a
whether identifier has en
or not otherwise a token ELSE is returned for
" "

the lexeme like


1.2.4 The elsewanslation Rules Section: The translation rules is a set ot
Lexical-Analyzer Generator Lex statements,
s t h e action to be pertornwd for each regular expression given. Below
026. Define compiler. Explain in brief about the LEX e lorm of statements in traslation rules section.
compiler.
RE
(or) methnd)
Write about lexical RE
analyzer generator lex. nwethnd.
Ans

Compiler: A compiler is a program that converts


into

target program.
a source progrlln RE method
34 35
Here, REi is the regular expression and methodi is the program that defines the steps to be taken whenever the lexical analyzer finds a lexeme matching the regular expression REi. The methods are usually implemented in the C language; however, they can be implemented in any language.

3. Auxiliary Procedures: These procedures are used by the methods of the translation rules section. The auxiliary procedures can be compiled separately, but they are loaded with the scanner.

    Input --> [Scanner] --tokens--> [Parser]
              [Scanner] <--next_token()-- [Parser]
    Figure (2): Coordination between Scanner and Parser

Behavior of a Lexical Analyzer (Generated by LEX) with the Parser: The process is initiated by the parser when it needs tokens. This invocation makes the scanner read the longest input lexeme that exactly matches one of the regular expressions REi; if the lexeme matches more than one regular expression, the scanner selects the first one. After checking for the match, the respective method or function methodi is called. After executing the method, control should be handed over to the parser; if it is unable to do so, the scanner keeps scanning until it gets a lexeme whose method (action) results in a control return to the parser. As a result of a successful return to the parser, the scanner returns a token for a lexeme in the input.

    Lex program lexgen.l --> [Lex compiler] --> lexgen.yy.c --> [C compiler] --> a.out
    Input to be scanned --> [a.out] --> series of tokens
    Figure (3): Lex Format to Create a Scanner

The lex compiler takes a Lex program, lexgen.l, as input and generates lexgen.yy.c, which is a C program. This C program (lexgen.yy.c) is then compiled through the C compiler to generate an executable file a.out. To a.out we can pass the string or input to be scanned, i.e., the string for which we want tokens to be generated. Hence, a.out becomes the scanner or lexical analyzer.

Q27. Explain how a LEX program performs lexical analysis, with one example, for the following patterns in 'C': identifier, comments, numerical constants, arithmetic operators.
(or)
Write a LEX program for identifying the keywords and identifiers from a file.

Ans:

The LEX program for the patterns is given below.

    %{
    /* Manifest constants for the token codes NUM, IDE and AOP are declared here. */
    %}
    letter      [a-zA-Z]
    digit       [0-9]
    word        ({letter}|{digit}|-)*
    aop         [+\-/*^]
    delimiter   [ \t\n]
    white_space {delimiter}+
    identifier  {letter}({letter}|{digit})*
    num         {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
    comments    "/*"({word}|{digit}|{white_space})*"*/"
    %%
    {comments}     {/* No action is taken, no value is returned */}
    {white_space}  {/* No action performed and no return value */}
    {identifier}   {yylval = install_iden(); return (IDE);}
    {num}          {yylval = install_number(); return (NUM);}
    {aop}          {return (AOP);}
    %%
    install_iden()
    {
        /* This procedure installs the lexeme, which is pointed to by the pointer
           yytext and whose length is yyleng, into the symbol table. The procedure
           returns a pointer to that lexeme's entry. */
    }

    install_number()
    {
        /* install_number() stores the number in the symbol table and returns a
           pointer to its entry. */
    }