COMPILER DESIGN
NNCE III / 05 CD - QB
UNIT – I
REGULATION 2021
THEORY
S.NO.  COURSE CODE  COURSE TITLE                     CATEGORY  L  T  P  TOTAL CONTACT PERIODS  CREDITS
1.     CS3591       Computer Networks                PCC       3  0  2  5                      4
2.     CS3501       Compiler Design                  PCC       3  0  2  5                      4
3.     CB3491       Cryptography and Cyber Security  PCC       3  0  0  3                      3
4.     CS3551       Distributed Computing            PCC       3  0  0  3                      3
5.     -            Professional Elective I          PEC       -  -  -  -                      3
6.     -            Professional Elective II         PEC       -  -  -  -                      3
7.     -            Mandatory Course-I               MC        3  0  0  3                      0
       TOTAL                                                                                   20
SEMESTER V
SYLLABUS
Role of Parser - Grammars - Context-free Grammars - Writing a Grammar - Top-Down Parsing - General Strategies - Recursive Descent Parser - Predictive Parser - LL(1) Parser - Shift-Reduce Parser - LR Parser - LR(0) Items - Construction of SLR Parsing Table - Introduction to LALR Parser - Error Handling and Recovery in Syntax Analyzer - YACC Tool - Design of a Syntax Analyzer for a Sample Language
COURSE OUTCOME
CO2: Design a lexical analyser for a sample language and learn to use the LEX tool.
CO3: Apply different parsing algorithms to develop a parser and learn to use the YACC tool.
TEXT BOOK:
1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, “Compilers: Principles,
Techniques and Tools”, Second Edition, Pearson Education, 2009.
REFERENCES
1. Randy Allen, Ken Kennedy, "Optimizing Compilers for Modern Architectures: A Dependence-based Approach", Morgan Kaufmann Publishers, 2002.
2. Keith D. Cooper and Linda Torczon, "Engineering a Compiler", Morgan Kaufmann Publishers Elsevier Science, 2004.
3. V. Raghavan, "Principles of Compiler Design", Tata McGraw Hill Education Publishers, 2010.
LECTURE PLAN
Course : BE-CSE
S.NO  NO. OF HOURS  TOPICS PLANNED  HOURS (L/T)  TEACHING AID  REMARKS
1 1 Structure of a compiler L BB
2 1 Lexical Analysis L BB
4 1 Input Buffering L BB
5 1 Specification of Tokens L BB
6 1 Recognition of Tokens L BB
8 1 Minimizing DFA L BB
9 1 LEX TOOL L BB
10 1 Role of Parsers L BB
11 1 Grammars L BB
12 1 Context-free grammar L BB
17 1 CANONICAL LR PARSING L BB
20 1 YACC L BB
21 1 Design of a Syntax Analyzer for a Sample Language L BB
25 1 Syntax Tree L BB
28 1 Type Checking. L BB
29 1 Back patching L BB
38 2 DAG L BB
UNIT I
IMPORTANT QUESTIONS
PART – A
7. Give the significance of symbol table. Draw a sample table. NOV/DEC 2023
13. What are the cousins of compiler? APRIL/MAY 2004, APRIL/MAY 2005
14. What are the two main parts of compilation? What are they performing? MAY 16,17,18, DEC18
17. What is the need for separating the analysis phase into lexical analysis and
parsing? (Or) What are the issues of lexical analyzer? MAY13,14,
25. Apply rules used to define a regular expression. Give example. MAY18, DEC18
26. Construct a regular expression for the language L = { w ∈ {a,b}* | w ends in abb }.
DEC18
PART – B
1. Divide the statement return (x<=-10.0||x>=10.0)?100:x*x; into appropriate lexemes.
Which lexemes should get associated lexical values? What should those values be?
APRIL/MAY 2024
27. Explain the structure of a compiler. Illustrate the output of each phase of
compilation. APRIL/MAY 2024
28. Construct minimized DFA for the regular expression (a|b)*abb APRIL/MAY 2024
29. What are the phases of compiler? Explain each phase in detail. NOV/DEC 2023
31. Every statement of software written in any programming language must be translated
to a machine-understandable language before execution. Elaborate the translation
process. Explain the process using the statement.
NOV/DEC 2023,
APRIL/MAY 2022
3. Elaborate the different phases of a compiler with a neat sketch. Show the output of
each phase of a compiler when the following statement is parsed:
SI= (p*n*r)/100
4. List out the functions of a lexical analyzer. State the reasons for the separation of
the analysis of programs into lexical, syntax and semantic analysis. (APRIL/MAY 2023)
5. Discuss the phases of a compiler, indicating the input and output of each phase, by
translating the statement “amount = principle + rate * 360”. (APRIL/MAY 2022,
NOV /DEC 2022, NOV / DEC 2020)
10. Draw a transition diagram for identifiers and keywords (NOV/ DEC 2020)
11. Prove that the following two regular expressions are equivalent by showing that their
minimum-state DFAs are the same. i) (a/b)* (APRIL / MAY 2022)
12. Construct a DFA without constructing NFA for following regular expression. Find
minimized DFA r = (a|b)*abb#
PART – A
1. What is a Compiler?
A compiler is a program that reads a program written in one language (the source language)
and translates it into an equivalent program in another language (the target language). As an
important part of this translation process, the compiler reports to its user the presence of errors in
the source program.
Structure editors
Pretty printers
Static checkers
Interpreters
Preprocessors
Assemblers
Loaders
Link editors
4. What are the two main parts of compilation? What are they performing?
MAY 16,17,18, DEC18
The two main parts are
Analysis part breaks up the source program into constituent pieces and creates
an intermediate representation of the source program.
Synthesis part constructs the desired target program from the intermediate representation.
Parser generators
Scanner generators
Loading
Link editing
Loading: The process of loading consists of taking relocatable machine code, altering the
relocatable addresses and placing the altered instructions and data in memory at the proper
locations.
Link editing: This allows us to make a single program from several files of relocatable
machine code. These files may have been the result of several compilations, and one or more
may be library files of routines provided by the system and available to any program that needs
them.
8. What is a preprocessor?
A preprocessor produces input to compilers. The preprocessor may also expand macros into source language statements.
Skeletal source program → Preprocessor → Source program
Macro processing
File inclusion
Rational preprocessors
Language extensions
A Symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier. The data structure allows us to find the record for each identifier
quickly and to store or retrieve data from that record quickly.
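As a minimal sketch of such a data structure (a chained hash table in C; the names symtab_insert and symtab_lookup are our own illustrative choices, not part of the syllabus):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 211                /* a prime spreads the hash values */

    struct entry {
        char *name;                       /* the identifier's lexeme         */
        int   token;                      /* an attribute: its token code    */
        struct entry *next;               /* chaining resolves collisions    */
    };

    static struct entry *table[TABLE_SIZE];

    static unsigned hash(const char *s) {
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % TABLE_SIZE;
    }

    /* Find the record for name quickly, or return NULL if it is absent. */
    struct entry *symtab_lookup(const char *name) {
        for (struct entry *e = table[hash(name)]; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;
        return NULL;
    }

    /* Insert name if it is not already there; return its record either way. */
    struct entry *symtab_insert(const char *name, int token) {
        struct entry *e = symtab_lookup(name);
        if (e != NULL) return e;
        e = malloc(sizeof *e);
        e->name = malloc(strlen(name) + 1);
        strcpy(e->name, name);
        e->token = token;
        unsigned h = hash(name);
        e->next = table[h];
        table[h] = e;
        return e;
    }

    int main(void) {
        symtab_insert("amount", 256);     /* 256 = an assumed ID token code */
        printf("%s\n", symtab_lookup("amount") ? "found" : "missing");
        return 0;
    }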
Lexical analysis
Syntax analysis
Semantic analysis
Code optimization
Code generation
13. What is the need for separating the analysis phase into lexical analysis and parsing?
(Or) What are the issues of lexical analyzer? MAY13,14,
Simpler design is perhaps the most important consideration. The separation of lexical
analysis from syntax analysis often allows us to simplify one or the other of these phases.
The first phase of compiler is Lexical Analysis. This is also known as linear analysis, in
which the stream of characters making up the source program is read from left to right and
grouped into tokens that are sequences of characters having a collective meaning.
A sentinel is a special character that cannot be part of the source program. Normally we use
‘eof’ as the sentinel. It is used for speeding up the lexical analyzer.
17. What is a regular expression? State the rules, which define regular expression?
ε is a regular expression that denotes {ε}, the set containing only the empty string.
Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then (r)|(s) is a
regular expression denoting L(r) ∪ L(s), (r)(s) is a regular expression denoting L(r)L(s), and
(r)* is a regular expression denoting (L(r))*.
Recognizers are machines. These are the machines which accept the strings belonging to
certain language. If the valid strings of such language are accepted by the machine then it is said
that the corresponding language is accepted by that machine, otherwise it is rejected.
The lexical analyzer scans the source program and separates out the tokens from it.
Identifier: R.E = letter(letter+digit)*
Number: R.E = digit.digit*
LEXEME: the sequence of characters in the source program that is matched with the
pattern of a token.
24. Apply rules used to define a regular expression. Give example. MAY18, DEC18
If R1 and R2 are regular expressions then R=R1+R2 (same as R=R1|R2) is also a regular
expression which represents union expression.
If R1 and R2 are regular expressions then R=R1.R2 is also a regular expression which represents
concatenation operation.
If R1 is a regular expression then R=R1* is also a regular expression which represents Kleene
closure.
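As a short worked illustration of these three rules (our own example): the expression (a|b)*abb used in the next question is built up as R1 = a|b by the union rule, R2 = (R1)* by the Kleene closure rule, and finally R = R2.a.b.b by three applications of the concatenation rule.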
25. Construct regular expressions for the language L = { w ∈ {a,b}* | w ends in abb }. DEC18
R.E. = (a|b)*abb
digit-> [0-9]
digits-> digit*
In the lexical analysis phase, during token recognition, one or more characters beyond the
lexeme must be examined before the lexeme can be identified correctly. Hence there is a need to
manage lookahead, and input buffering is used for this purpose.
A two-buffer scheme is used to manage the lookahead. The buffer ends are marked using
sentinels, and the current lexeme is identified between the two pointers.
PART – B
Important questions
1. List out the functions of a lexical analyzer and state the reasons for the separation of
the analysis of programs into lexical, syntax and semantic analysis.
(APRIL/MAY 2023)
It is the first phase of the compiler. It gets input from the source program and produces tokens as output. It reads
the characters one by one, starting from left to right and forms the tokens.
Example: a + b = 20
The lexical analyser not only generates a token but also enters the lexeme into the symbol table if it is not
already there.
Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for
syntax analysis.
Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters until
it can identify the next token.
Modularity - each analysis phase has distinct responsibilities and can be implemented
independently, making the compiler design modular and easier to understand.
Error isolation - separating analysis helps in localizing errors to specific phases, making it
easier to fix issues.
Language specification - the lexical, syntax and semantic aspects of a language are distinct
and can be defined separately.
Reusability - individual analysis phases can be reused for different programming languages,
reducing redundancy and improving development efficiency. (A small scanner sketch follows.)
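A minimal hand-written scanner sketch in C for input like a + b = 20 (the token names ID, NUM, OP and the function get_next_token are our own illustrative choices, not a fixed textbook API):

    #include <ctype.h>
    #include <stdio.h>

    enum token_type { ID, NUM, OP, END };

    struct token { enum token_type type; char lexeme[32]; };

    static const char *src = "a + b = 20";          /* input being scanned */

    /* Returns the next token each time it is called by the parser. */
    struct token get_next_token(void) {
        struct token t = { END, "" };
        int i = 0;
        while (*src == ' ') src++;                  /* skip blanks */
        if (*src == '\0') return t;
        if (isalpha((unsigned char)*src)) {         /* letter(letter|digit)* */
            t.type = ID;
            while (isalnum((unsigned char)*src) && i < 31) t.lexeme[i++] = *src++;
        } else if (isdigit((unsigned char)*src)) {  /* digit+ */
            t.type = NUM;
            while (isdigit((unsigned char)*src) && i < 31) t.lexeme[i++] = *src++;
        } else {                                    /* one-character operator */
            t.type = OP;
            t.lexeme[i++] = *src++;
        }
        t.lexeme[i] = '\0';
        return t;
    }

    int main(void) {
        for (struct token t = get_next_token(); t.type != END; t = get_next_token())
            printf("<%d, \"%s\">\n", t.type, t.lexeme);   /* token, lexeme pairs */
        return 0;
    }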
2. Discuss the phases of a compiler, indicating the input and output of each phase, by
translating the statement “amount = principle + rate * 360”. (APRIL/MAY 2023)
PHASES OF COMPILER
A Compiler operates in phases, each of which transforms the source program from one representation into another.
The following are the phases of the compiler:
Main phases:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generation
Code optimization
Code generation
Sub-Phases:
Error handling
LEXICAL ANALYSIS:
It is the first phase of the compiler. It gets input from the source program and produces tokens as output. It reads the
characters one by one, starting from left to right and forms the tokens.
Example: a + b = 20
The lexical analyser not only generates a token but also enters the lexeme into the symbol table if it is not
already there.
SYNTAX ANALYSIS:
It gets the token stream as input from the lexical analyser of the compiler and generates a syntax tree as output.
Syntax tree:
It is a tree in which interior nodes are operators and exterior nodes are operands.
For a statement such as a = b + c * 2:
        =
       / \
      a   +
         / \
        b   *
           / \
          c   2
SEMANTIC ANALYSIS:
It gets input from the syntax analysis as parse tree and checks whether the given syntax is correct or not. It performs type
conversion of all the data types into real data types.
INTERMEDIATE CODE GENERATION:
It gets input from the semantic analysis and converts the input into output as intermediate code such as three-address code.
The three-address code consists of a sequence of instructions, each of which has at most three operands.
Example: t1=t2+t3
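As a sketch, assuming the identifiers amount, principle and rate have symbol-table entries id1, id2 and id3, the statement amount = principle + rate * 360 used in question 2 could translate into:

    t1 := inttoreal(360)
    t2 := id3 * t1
    t3 := id2 + t2
    id1 := t3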
CODE OPTIMIZATION:
It gets the intermediate code as input and produces optimized intermediate code as output.
This phase reduces the redundant code and attempts to improve the intermediate code so that faster-running machine
code will result.
During code optimization, the result of the program is not affected. To improve the code generation, the optimization
involves techniques such as loop unrolling.
CODE GENERATION:
It gets input from the code optimization phase and produces the target code or object code as result. Intermediate
instructions are translated into a sequence of machine instructions that perform the same task.
SYMBOL TABLE MANAGEMENT:
Symbol table is used to store all the information about identifiers used in the program.
It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
It allows to find the record for each identifier quickly and to store or retrieve data from that record.
Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
id1 := temp3
Code optimization:
temp1 := id3 * 60.0
id1 := id2 + temp1
Code generation:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
ERROR HANDLING:
Each phase can encounter errors. After detecting an error, a phase must handle the error so that compilation can
proceed.
In semantic analysis, errors occur when the compiler detects constructs with right syntactic structure but no meaning
and during type conversion.
In code optimization, errors occur when the result is affected by the optimization. In code generation, errors show up
when, for example, required code is missing.
INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme.
As characters are read from left to right, each character is stored in the buffer to form a meaningful token as shown below:
(Figure: the forward pointer scanning a buffer holding the input A = B + C.)
We introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving
"sentinels" that saves time checking for the ends of buffers.
BUFFER PAIRS
Buffer contents: E | = | M | * | C | * | * | 2 | eof
(lexeme_beginning points at the start of the current lexeme; forward scans ahead.)
Each buffer is of the same size N, and N is usually the number of characters on one disk block. E.g., 1024 or 4096
bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the
source file.
Once the next lexeme is determined, forward is set to the character at its right end.
The string of characters between the two pointers is the current lexeme.
After the lexeme is recorded as an attribute value of a token returned to the parser,
lexeme_beginning is set to the character immediately after the lexeme just found.
Advancing forward pointer requires that we first test whether we have reached the end of one of the buffers, and if so, we
must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer. If the end of
second buffer is reached, we must again reload the first buffer with input and the pointer wraps to the beginning of the
buffer.
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
SENTINELS
For each character read, we make two tests: one for the end of the buffer, and one to determine what character is
read. We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a
sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.
(Figure: buffer pair with a sentinel eof at the end of each half; lexeme_beginning and forward as before.)
Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a
buffer means that the input is at an end.
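A sketch of the sentinel scheme in C, under the assumptions that the sentinel is the NUL character, that the source contains no NUL, and that reload() refills one half from standard input (all names here are our own, not from the textbook):

    #include <stdio.h>

    #define N 4096                 /* each half holds one disk block      */
    #define EOF_CH '\0'            /* assumed sentinel; "eof" in the text */

    static char buf[2*N + 2];      /* two halves, one sentinel slot each  */
    static char *forward;

    /* Read up to N characters into the half starting at p and place the
       sentinel immediately after the last character read.               */
    static void reload(char *p) {
        size_t got = fread(p, 1, N, stdin);
        p[got] = EOF_CH;
    }

    /* Advance forward with a single sentinel test per character. */
    static int advance(void) {
        forward++;
        while (*forward == EOF_CH) {
            if (forward == buf + N) {              /* end of first half  */
                reload(buf + N + 1);
                forward = buf + N + 1;
            } else if (forward == buf + 2*N + 1) { /* end of second half */
                reload(buf);
                forward = buf;
            } else {
                return EOF;     /* eof inside a half: true end of input  */
            }
        }
        return (unsigned char)*forward;
    }

    int main(void) {            /* demo: echo the input through the buffers */
        reload(buf);
        forward = buf;
        if (*forward == EOF_CH) return 0;
        int c = (unsigned char)*forward;
        do putchar(c); while ((c = advance()) != EOF);
        return 0;
    }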
LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used together with the Yacc parser generator.
First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language.
Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical analyzer that
transforms an input stream into a sequence of tokens.
Input stream → a.out → stream of tokens
Lex Specification
{ definitions }
%%
{ rules }
%%
{ user subroutines }
The rules section contains pattern and action pairs of the form:
p1 {action1}
p2 {action2}
...
pn {actionn}
where each pi is a regular expression and each actioni describes what action the lexical analyzer should take when
pattern pi matches a lexeme. Actions are written in C code.
User subroutines are auxiliary procedures needed by the actions. These can be compiled separately and loaded
with the lexical analyzer.
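A minimal lex.l sketch following the layout above (the token codes ID and NUM are our own illustrative defines, not fixed constants):

    %{
    #include <stdio.h>
    #define ID  256
    #define NUM 257
    %}

    letter  [a-zA-Z]
    digit   [0-9]

    %%
    {letter}({letter}|{digit})*   { return ID;  /* identifier */ }
    {digit}+                      { return NUM; /* number     */ }
    [ \t\n]+                      { /* skip white space */ }
    %%

    int yywrap(void) { return 1; }   /* no more input after EOF */

    int main(void) {
        int tok;
        while ((tok = yylex()) != 0)
            printf("<%d, %s>\n", tok, yytext);
        return 0;
    }

Running this lex.l through Lex (e.g. flex) produces lex.yy.c, which the C compiler turns into the scanner a.out, exactly as described above.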
COMPILER CONSTRUCTION TOOLS:
These are specialized tools that have been developed for helping implement various phases of a compiler.
Parser Generators:
-These produce syntax analyzers, normally from input that is based on a context-free grammar.
Scanner Generator:
-These generate lexical analyzers, normally from a specification based on regular expressions.
Syntax-Directed Translation:
-These produce routines that walk the parse tree and as a result generate intermediate code.
-Each translation is defined in terms of translations at its neighbor nodes in the tree.
Automatic Code Generators:
-It takes a collection of rules to translate intermediate language into machine language. The rules must
include sufficient details to handle different possible access methods for data.
Data-Flow Engines:
-It does code optimization using data-flow analysis, that is, the gathering of information about how values
are transmitted from one part of a program to each other part.
SPECIFICATION OF TOKENS
Strings
Language
Regular expression
A string over an alphabet is a finite sequence of symbols drawn from that alphabet. A language is any countable set
of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The length of a string s,
usually written |s|, is the number of occurrences of symbols in s. For example, banana is a string of length six. The
empty string, denoted ε, is the string of length zero.
Operations on strings
A prefix of string s is any string obtained by removing zero or more symbols from the end of string s.
A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For
example, nana is a suffix of banana.
A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring of
banana.
The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings,
respectively of s that are not ε or not equal to s itself.
A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For
example, baan is a subsequence of banana.
Operations on languages:
The following example shows the operations on languages: Let L={0,1} and S={a,b,c}
Union : L U S={0,1,a,b,c}
Concatenation : L.S={0a,1a,0b,1b,0c,1c}
Consider the following grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
where the terminals if , then, else, relop, id and num generate sets of strings given by the following regular definitions:
if → if
then → then
else → else
relop → <|<=|=|<>|>|>=
id → letter(letter|digit)*
num → digit+ (. digit+)? (E (+|-)? digit+)?
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well as the lexemes
denoted by relop, id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as
identifiers.
The task of a scanner generator, such as flex, is to generate the transition tables or to
synthesize the scanner program given a scanner specification (in the form of a set of REs). So it
needs to convert an RE into a DFA. This is accomplished in two steps: first it converts the RE to
an NFA, then it converts the NFA to a DFA.
An NFA is similar to a DFA but it also permits multiple transitions over the same
character and transitions over ε. The first type indicates that, when reading the common character
associated with these transitions, we have more than one choice; the NFA succeeds if at least one
of these choices succeeds. An ε-transition doesn't consume any input characters, so you may
jump to another state for free.
Clearly DFAs are a subset of NFAs. But it turns out that DFAs and NFAs have the same
expressive power. The problem is that when converting a NFA to a DFA we may get an
exponential blowup in the number of states.
We will first learn how to convert an RE into an NFA. This is the easy part. There are only
5 rules, one for each type of RE: one for ε, one for a single alphabet symbol a, and one each for
the concatenation AB, the alternation A|B, and the closure A*.
The algorithm constructs NFAs with only one final state. For example, the third rule
indicates that, to construct the NFA for the RE AB, we construct the NFAs for A and B which are
represented as two boxes with one start and one final state for each box. Then the NFA for AB
is constructed by connecting the final state of A to the start state of B using an empty transition.
The next step is to convert a NFA to a DFA (called subset construction). Suppose that you assign
a number to each NFA state. The DFA states generated by subset construction have sets of
numbers, instead of just one number.
First we need to handle transitions that lead to other states for free (without consuming
any input). These are the ε-transitions. We define the ε-closure of an NFA node as the set of all
the nodes reachable from this node using zero, one, or more ε-transitions. For example, the
ε-closure of node 1 in the left figure below is the set {1,2}. The start state of the constructed DFA
is labeled by the ε-closure of the NFA start state. For every DFA state labeled by some set
{s1, s2, ..., sn} and for every character c in the language alphabet, you find all the states reachable
from s1, s2, ..., or sn using c arrows and you union together the ε-closures of these nodes.
If this set is not the label of any other node in the DFA constructed so far, you create a new
DFA node with this label. For example, node {1,2} in the DFA above has an arrow to {3,4,5}
for the character a, since the NFA node 3 can be reached from 1 on a and nodes 4 and 5 can be
reached from 2. The b arrow for node {1,2} goes to the error node, which is associated with an
empty set of NFA nodes. An NFA can be handled this way even if it wasn't constructed with the
5 RE-to-NFA rules; subset construction still yields an equivalent DFA.
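A compact sketch of subset construction in C, using bit masks for state sets; the example NFA (for strings over {a,b} ending in ab) and all names are our own illustration, limited to NFAs with at most 32 states:

    #include <stdio.h>

    #define NSTATES 3   /* illustrative NFA, states 0..2, accepting state 2 */

    /* NFA: state 0: a -> {0,1}, b -> {0}; state 1: b -> {2}.
       eps[i] is the bit mask of epsilon-successors (none in this example). */
    static unsigned eps[NSTATES]     = { 0, 0, 0 };
    static unsigned step[NSTATES][2] = { {0x3, 0x1}, {0x0, 0x4}, {0x0, 0x0} };

    /* epsilon-closure: add epsilon-successors until a fixed point is reached */
    static unsigned closure(unsigned set) {
        unsigned prev;
        do {
            prev = set;
            for (int i = 0; i < NSTATES; i++)
                if (set & (1u << i))
                    set |= eps[i];
        } while (set != prev);
        return set;
    }

    /* move(T, c): union of transitions on c from every NFA state in T */
    static unsigned move(unsigned set, int c) {
        unsigned out = 0;
        for (int i = 0; i < NSTATES; i++)
            if (set & (1u << i))
                out |= step[i][c];
        return out;
    }

    int main(void) {
        unsigned dstates[1 << NSTATES];            /* discovered DFA states */
        int n = 0, marked = 0;
        dstates[n++] = closure(1u << 0);           /* start = closure({0})  */
        while (marked < n) {                       /* process unmarked sets */
            unsigned s = dstates[marked++];
            for (int c = 0; c < 2; c++) {
                unsigned t = closure(move(s, c));
                int seen = 0;
                for (int i = 0; i < n; i++)
                    if (dstates[i] == t) seen = 1;
                if (!seen) dstates[n++] = t;       /* new DFA state         */
                printf("Dtran[0x%x, %c] = 0x%x\n", s, "ab"[c], t);
            }
        }
        return 0;
    }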
11.Prove that the following two regular expressions are equivalent by showing that the
minimum state DFA’s are same. i) (a/b)*
Transition table:
State set       Name  a  b
{0,1,2,3,7}     A     B  C
{1,2,3,4,6,7}   B     B  C
{1,2,3,5,6,7}   C     B  C
ɛ-CLOSURE{q0} = {q0,q1,q2,q4,q7}
ɛ-CLOSURE{q1} = {q1,q2,q4}
ɛ-CLOSURE{q2} ={q2}
ɛ-CLOSURE{q3} ={q3,q6,q1,q2,q4,q7}
ɛ-CLOSURE{q4} ={q4}
ɛ-CLOSURE{q5} ={q1,q2,q4,q5,q6,q7}
ɛ-CLOSURE{q6} ={ q1,q2,q4,q6,q7}
ɛ-CLOSURE{q7} ={q7}
DFA
13. Construct a DFA without constructing NFA for following regular expression. Find
minimized DFA r = (a|b)*abb#
1. Firstly, we construct the augmented regular expression for the given expression. By
concatenating a unique right-end marker ‘#’ to a regular expression r, we give the
accepting state for r a transition on ‘#’ making it an important state of the NFA for r#.
So, r' = (a|b)*abb#
2. Then we construct the syntax tree for r#.
Next we need to evaluate four functions nullable, firstpos, lastpos, and followpos.
1. nullable(n) is true for a syntax tree node n if and only if the regular expression represented
by n has € in its language.
2. firstpos(n) gives the set of positions that can match the first symbol of a string generated by
the subexpression rooted at n.
3. lastpos(n) gives the set of positions that can match the last symbol of a string generated by
the subexpression rooted at n.
4. followpos(i) gives the set of positions that can follow position i in some string generated by the regular expression.
We refer to an interior node as a cat-node, or-node, or star-node if it is labeled by a
concatenation, | or * operator, respectively.
For each node n the three functions are computed as follows:

n is a leaf labeled ε:
    nullable(n) = true; firstpos(n) = ∅; lastpos(n) = ∅
n is a leaf labeled with position i:
    nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}
n is an or-node with left child c1 and right child c2:
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)
n is a cat-node with left child c1 and right child c2:
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if nullable(c1) then firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if nullable(c2) then lastpos(c1) ∪ lastpos(c2) else lastpos(c2)
n is a star-node with child c1:
    nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)
1. If n is a cat-node with left child c1 and right child c2 and i is a position in lastpos(c1),
then all positions in firstpos(c2) are in followpos(i).
2. If n is a star-node and i is a position in lastpos(n), then all positions in firstpos(n) are in
followpos(i).
3. Now that we have seen the rules for computing firstpos and lastpos, we now proceed to
calculate the values of the same for the syntax tree of the given regular expression (a|b)*abb#
Let us now compute the followpos bottom up for each node in the syntax tree.
Position  followpos
1 {1, 2, 3}
2 {1, 2, 3}
3 {4}
4 {5}
5 {6}
6 ∅
Now we construct Dstates, the set of states of DFA D and Dtran, the transition table
for D. The start state of DFA D is firstpos(root) and the accepting states are all those
containing the position associated with the endmarker symbol #.
According to our example, the firstpos of the root is {1, 2, 3}. Let this state be A and
consider the input symbol a. Positions 1 and 3 are for a, so let B = followpos(1) ∪
followpos(3) = {1, 2, 3, 4}. Since this set has not yet been seen, we set Dtran[A, a] := B.
When we consider input b, we find that out of the positions in A, only 2 is associated
with b, thus we consider the set followpos(2) = {1, 2, 3}. Since this set has already been seen
before, we do not add it to Dstates but we add the transition Dtran[A, b]:= A.
Continuing like this with the rest of the states, we arrive at the below transition table.
Input
State a b
⇢A B A
B B C
C B D
D B A
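A table-driven simulation of this DFA in C (states A, B, C, D encoded as 0-3; D is accepting because it contains position 6, the position of the endmarker #; this sketch is our own illustration):

    #include <stdio.h>

    static const int dtran[4][2] = {   /* [state][0 for a, 1 for b] */
        {1, 0},   /* A: a -> B, b -> A */
        {1, 2},   /* B: a -> B, b -> C */
        {1, 3},   /* C: a -> B, b -> D */
        {1, 0},   /* D: a -> B, b -> A */
    };

    /* Returns 1 if s (a string of a's and b's) ends in abb. */
    int accepts(const char *s) {
        int state = 0;                            /* start in A  */
        for (; *s; s++) state = dtran[state][*s == 'b'];
        return state == 3;                        /* accept in D */
    }

    int main(void) {
        printf("%d %d\n", accepts("aabb"), accepts("abab"));   /* prints 1 0 */
        return 0;
    }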
Divide the statement return (x<=-10.0||x>=10.0)?100:x*x; into appropriate lexemes.
Which lexemes should get associated lexical values? What should those values be?
APRIL/MAY 2024
Scanning: The scanning phase-only eliminates the non-token elements from the source program. Such
as eliminating comments, compacting the consecutive white spaces etc.
Lexical Analysis: Lexical analysis phase performs the tokenization on the output provided by the
scanner and thereby produces tokens.
The statement is divided into the lexemes:
<return> <(> <x> <<=> <-10.0> <||> <x> <>=> <10.0> <)> <?> <100> <:> <x> <*> <x> <;>
(A comment such as /* returns x-squared, but never more than 100 */ in the surrounding
program is eliminated by the scanner and yields no lexeme.)
Operator -> <<=>
Constant -> <-10.0>
Operator -> <||>
Operator -> <>=>
Constant -> <10.0>
Operator -> <?>
Constant -> <100>
Operator -> <:>
Operator -> <*>
The lexemes that should get associated lexical values are the constants -10.0, 10.0 and 100,
whose values are the numbers themselves, and each occurrence of the identifier x, whose value
is a pointer to its symbol-table entry.
Being the first phase in the analysis of the source program the lexical analyzer plays an important role
in the transformation of the source program to the target program.
This entire scenario can be realized with the help of the figure given below:
The lexical analyzer phase has the scanner or lexer program implemented in it which produces tokens
only when they are commanded by the parser to do so.
The parser generates the getNextToken command and sends it to the lexical analyzer; in
response, the lexical analyzer starts reading the input stream character by character until it
identifies a lexeme that can be recognized as a token.
As soon as a token is produced the lexical analyzer sends it to the syntax analyzer for parsing.
Along with the syntax analyzer, the lexical analyzer also communicates with the symbol table. When
a lexical analyzer identifies a lexeme as an identifier it enters that lexeme into the symbol table.
Sometimes the information of identifier in symbol table helps lexical analyzer in determining the
token that has to be sent to the parser.
Apart from identifying the tokens in the input stream, the lexical analyzer also eliminates blank
space/white space and the comments of the program. This includes characters that separate
tokens, such as tabs, blank spaces, and newlines.
The lexical analyzer helps in relating the error messages produced by the compiler to the source
program. For example, the lexical analyzer keeps a record of each newline character it comes
across while scanning the source program, so it can easily relate an error message with the line
number of the source program.
If the source program uses macros, the lexical analyzer expands the macros in the source program.
Lexical Error
The lexical analyzer on its own cannot always determine the source of an error in the source
program. For example, consider a statement in which printf has been misspelled as prtf:
Now, in the above statement, when the string prtf is encountered, the lexical analyzer is unable to
guess whether prtf is an incorrect spelling of ‘printf’ or an undeclared function
identifier.
But according to the predefined rules, prtf is a valid lexeme whose pattern concludes it to be an
identifier token. The lexical analyzer will therefore send the prtf token to the next phase, i.e. the
parser, which will handle the error that occurred due to the transposition of letters.
Error Recovery
Well, sometimes it is even impossible for a lexical analyzer to identify a lexeme as a token, as the
pattern of the lexeme does not match any of the predefined patterns for tokens. In this case, we have
to apply some error recovery strategies.
In panic-mode recovery, successive characters are deleted from the remaining input until the
lexical analyzer can identify a valid token.
Identify the possible missing character and insert it into the remaining input appropriately.
While performing the above error-recovery actions, check whether a prefix of the remaining
input matches any pattern of tokens. Generally, a lexical error occurs due to a single character,
so the error can often be corrected with a single transformation. As far as possible, a minimal
number of transformations should convert the source program into a sequence of valid tokens
that can be handed over to the parser. A small sketch of panic mode follows.
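A minimal panic-mode sketch in C (the predicate is_token_start and the toy operator set are our own illustrative choices):

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Can c begin a token in this toy language? */
    static int is_token_start(char c) {
        return isalnum((unsigned char)c) || strchr("+-*/=(){};", c) != NULL;
    }

    /* Panic mode: delete characters until scanning can resume at a
       character that may start a valid token.                       */
    static const char *panic_recover(const char *input) {
        while (*input && !is_token_start(*input))
            input++;                 /* discard the offending characters */
        return input;
    }

    int main(void) {
        printf("%s\n", panic_recover("@#^ count = 1;"));   /* prints: count = 1; */
        return 0;
    }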