Unit 4
Prepared By:
Mr.M.Shanmugam, AP/CSE
2. What are the classifications of a compiler?
Compilers are sometimes classified as:
➢ Single-pass
➢ Multi-pass
➢ Load-and-go
➢ Debugging or
➢ Optimizing
11. Draw the parse tree for the source statement position := initial + rate * 60.
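A text sketch of the tree, with the assignment operator at the root and the operands at the leaves:
:=
├─ position
└─ +
   ├─ initial
   └─ *
      ├─ rate
      └─ 60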
20. State the function of front end and back end of a compiler phase. (MAY 2013)
The front end consists of those phases that depend primarily on the source language and are largely
independent of the target machine.
These include
➢ Lexical analysis
➢ Syntactic analysis
➢ Semantic analysis
➢ Creation of symbol table
➢ Generation of intermediate code
➢ Code optimization
➢ Error handling
21. State the function of the back end of a compiler phase. (MAY 2013)
The back end of compiler includes those portions that depend on the target machine and generally those
portions do not depend on the source language, just the intermediate language.
These include
➢ Code optimization
➢ Code generation
➢ Error handling and
➢ Symbol-table operations
24. What is the role of lexical analyzer? (NOV 2013, 2015) (MAY 2018)
▪ The lexical analyzer is the first phase of a compiler.
▪ Its main task is to read the input characters and produce as output a sequence of tokens that the parser
uses for syntax analysis.
▪ Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters
until it can identify the next token.
27. What is the major advantage of a lexical analyzer generator? (NOV 2011)
The major advantages of a lexical analyzer generator are
➢ One task is stripping out comments and white space, in the form of blank, tab, and new line characters,
from the source program.
➢ Another is correlating error messages from the compiler with the source program.
36. How the regular expression can be defined in specification of the language.
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s); then
➢ (r) | (s) is a regular expression denoting L(r) U L(s).
➢ (r)(s) is a regular expression denoting L(r)L(s).
➢ (r)* is a regular expression denoting (L(r))*.
➢ (r) is a regular expression denoting L(r).
37. Mention the various notational short hands for representing regular expressions.
➢ One or more instances (+)
➢ Zero or one instance (?)
➢ Character classes ([ ])
➢ Non-regular sets
41. List out the parts on Lex specifications. (OR) Give the structure of a Lex program with example.
(MAY 2012, 2015)
A Lex program consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
▪ The declarations section includes declarations of variables, manifest constants, and regular definitions.
▪ The translation rules of a Lex program are statements of the form
p1 { action1}
p2 { action2}
…… …..
pn { actionn}
▪ The auxiliary procedures are needed by the actions. These procedures can be compiled separately and
loaded with the lexical analyzer.
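As an illustration, here is a complete toy Lex program showing all three parts; it prints each alphabetic word in its input (the word pattern and the printing action are assumptions made for the example):
%{
/* declarations: C code here is copied verbatim into lex.yy.c */
%}
letter [A-Za-z]
%%
{letter}+ { printf("word: %s\n", yytext); }
.|\n { /* ignore all other characters */ }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; } /* signal end of input */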
42. Compare and contrast Compilers and Interpreters. (MAY 2015, 2017)
Compilers vs. Interpreters:
➢ A compiler takes the entire program as input; an interpreter takes a single instruction at a time.
➢ A compiler generates intermediate object code; an interpreter generates none.
➢ A compiler's memory requirement is higher (since object code is generated); an interpreter's is lower.
➢ A compiler displays errors after the entire program is checked; an interpreter displays errors for every
instruction as it is interpreted.
➢ Example: C compiler vs. BASIC interpreter.
44. Write a regular expression for the following language over the alphabet ∑ = {a, b}:
All strings that contain an even number of b’s. (NOV 2017)
▪ The strings over {a, b} that contain an even number of b’s are described by the regular expression
a*(ba*ba*)*
45. What is the role of parser? (MAY 2018)
▪ In compiler model, parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar for the source program.
▪ The parser should report any syntax errors in an intelligible fashion.
46. Define Parsing? (NOV 2015)
A parser for grammar G is a program that takes as input a string ‘w’ and produces as output either a parse tree
for ’w’, if ‘w’ is a sentence of G, or an error message indicating that w is not a sentence of G. It obtains a string
of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language.
FIRST
1. If X is a terminal, then FIRST(X) is {X}.
2. If X →ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X) if for some i, a is in
FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); if ε is in FIRST(Yj) for all j = 1, …, k, then add ε to FIRST(X).
FOLLOW
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST (β) except for ε is placed in FOLLOW (B).
3. If there is a production A→αB, or a production A→αBβ where FIRST (β) contains ε, then everything in
FOLLOW (A) is in FOLLOW (B).
65. What is bottom up parser? (NOV 2017)
▪ Bottom-up parsing is also known as shift-reduce parsing.
▪ Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves (the
bottom) and working up towards the root (the top).
S→0|1|AS0|BS0 A→ ε B→ ε
FIRST(S) = {0, 1}. (Since A → ε and B → ε, FIRST(AS0) = FIRST(BS0) = FIRST(S) = {0, 1}; ε is not in FIRST(S) because S never derives ε.)
77. Eliminate left recursion from the following grammar: (MAY 2015)
Solution:
A compiler is a program that reads a program written in one language – the source language and translates
it into an equivalent program in another language – the target language. As an important part of this
translation process, the compiler reports to its user the presence of errors in the source program.
▪ At first, the variety of compilers may appear overwhelming. There are thousands of source languages,
ranging from traditional programming languages such as FORTRAN and Pascal to specialized languages in
every area of computer application.
▪ Target languages are equally varied; a target language may be another programming language, or the
machine language of any computer, ranging from a microprocessor to a supercomputer.
▪ Throughout the 1950’s, compilers were considered notoriously difficult programs to write. The first
FORTRAN compiler, for example, took 18 staff-years to implement.
▪ The analysis part breaks up the source program into constituent pieces and creates an intermediate
representation of the source program.
▪ The synthesis part constructs the desired target program from the intermediate representation.
▪ During analysis, the operations implied by the source program are determined and recorded in a hierarchical
structure called a tree.
▪ Often, a special kind of tree called a syntax tree is used, in which each node represents an operation and
the children of the node represent the arguments of the operation.
For example, a syntax tree for the assignment statement position := initial + rate * 60 has := at the root, with
position as its left child and the subtree for initial + rate * 60 as its right child.
Many software tools that manipulate source programs first perform some kind of analysis. Some examples of
such tools include:
1. Structure editors
2. Pretty printers
3. Static checkers
4. Interpreters
Structure editors
A structure editor takes as input a sequence of commands to build a source program. The
structure editor not only performs the text-creation and modification functions of an ordinary text editor, but it
also analyzes the program text, imposing an appropriate hierarchical structure on the source program. It is
useful in the preparation of source programs.
Pretty printers
A pretty printer analyzes a program and prints it in a way that the structure of the program becomes
clearly visible. For example, comments may appear in a special font.
Static checkers
A static checker reads a program, analyzes it, and attempts to discover potential bugs without
running the source program.
Interpreters
Interpreters are frequently used to execute command languages, since each operator executed in a
command language is usually an invocation of a complex routine such as an editor or compiler.
The analysis portion in each of the following examples is similar to that of a conventional compiler.
➢ Text formatters
➢ Silicon compilers
➢ Query interpreters
Text formatters
A text formatter takes input that is a stream of characters, most of which is text to be typeset, and it
includes commands to indicate paragraphs, figures, or mathematical structures like subscripts and
superscripts.
Silicon compilers
A silicon compiler has a source language that is similar or identical to a conventional programming
language. However, the variables of the language represent, not locations in memory, but logical signals (0
or 1) or groups of signals in a switching circuit. The output is a circuit design in an appropriate language.
Query interpreters
A Query interpreter translates a predicate containing relational and Boolean operators into commands to
search a database for records satisfying that predicate.
2. Explain the Analysis of the Source Program? (10 MARKS) (NOV 2015)
In compiling, analysis consists of three phases:
1. Linear Analysis
2. Hierarchical Analysis
3. Semantic Analysis
Linear Analysis
▪ Linear analysis, in which the stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters having a collective meaning.
▪ In a compiler, linear analysis is called lexical analysis or scanning.
For example, in lexical analysis the characters in the assignment statement position: = initial + rate * 60 would
be grouped into the following tokens:
1. The identifier position.
2. The assignment symbol :=
3. The identifier initial.
4. The plus sign.
5. The identifier rate.
6. The multiplication sign.
7. The number 60.
▪ The blanks separating the characters of these tokens would normally be eliminated during the lexical
analysis.
Hierarchical Analysis
▪ Hierarchical analysis is called parsing or syntax analysis.
▪ Hierarchical analysis involves grouping the tokens of the source program into grammatical phrases that are
used by the compiler to synthesize output.
▪ The grammatical phrases of the source program are represented by a parse tree.
▪ In the expression initial + rate * 60, the phrase rate * 60 is a logical unit because the usual conventions of
arithmetic expressions tell us that the multiplication is performed before addition.
▪ Because the expression initial + rate is followed by a *, it is not grouped into a single phrase by itself.
▪ The hierarchical structure of a program is usually expressed by recursive rules. For example, we might have
the following rules as part of the definition of expressions:
1. Any identifier is an expression.
2. Any number is an expression.
3. If expression1 and expression2 are expressions, then so are
expression1 + expression2
expression1 * expression2
( expression1 )
▪ Rules (1) and (2) are non-recursive basis rules, while rule (3) defines expressions in terms of operators applied to
other expressions.
Similarly, many languages define statements recursively by rules such as:
1. If identifier1 is an identifier and expression2 is an expression, then
identifier1 := expression2
is a statement.
2. If expression1 is an expression and statement2 is a statement, then
while (expression1) do statement2
if (expression1) then statement2
are statements.
A syntax tree is a compressed representation of the parse tree in which the operators appear as the interior
nodes and the operands of an operator are the children of the node for that operator.
Semantic Analysis
▪ The semantic analysis phase checks the source program for semantic errors and gathers type information for
the subsequent code-generation phase.
▪ It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and
operand of expressions and statements.
▪ An important component of semantic analysis is type checking.
▪ The compiler checks that each operator has operands that are permitted by the source language
specification.
▪ For example, when a binary arithmetic operator is applied to an integer and a real, the compiler may need to
convert the integer to a real.
3. Explain the various Phases of a Compiler with an example? (10 MARKS) (NOV 2011, 2012, 2013,
2016, 2017) (MAY 2012, 2013, 2014, 2015, 2016, 2018, 2022)
A compiler operates in phases, each of which transforms the source program from one representation to another.
Lexical analyzer
▪ The lexical analysis phase reads the characters in the source program and groups them into a stream of
tokens in which each token represents a logically cohesive sequence of characters, such as an identifier, a
keyword (if, while, etc.), a punctuation character or a multi-character operator like :=.
▪ In a compiler, linear analysis is called lexical analysis or scanning.
▪ Linear analysis in which the stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters.
▪ The character sequence forming a token is called the lexeme for the token.
▪ Certain tokens will be augmented by a ‘lexical value’. For example, when an identifier like rate is found, the
lexical analyzer not only generates a token id, but also enters the lexeme rate into the symbol table, if it is not
already there.
Consider id1, id2 and id3 for position, initial, and rate respectively; note that the internal representation of an
identifier is different from the character sequence forming the identifier.
The representation of the statement given above after the lexical analysis would be:
id1 := id2 + id3 * 60
Syntax analyzer
▪ Hierarchical analysis involves grouping the tokens of the source program into grammatical phrases that are
used by the compiler to synthesize output.
▪ Hierarchical analysis is called parsing or syntax analysis.
▪ Syntax analysis imposes a hierarchical structure on the token stream, which is shown by syntax trees.
Semantic analyzer
▪ Semantic analysis checks the source program for semantic errors and gathers type information for the
subsequent code generation phase.
▪ It uses the hierarchical structure determined by the syntax analysis phase to identify the operators and
operands of expressions and statements.
▪ The compiler reports an error if a real number is used to index an array.
▪ The bit pattern representing an integer is generally different from the bit pattern for a real, even if they have
the same value.
▪ For example, the identifiers position, initial, and rate are declared to be real, while 60 by itself is assumed to
be an integer.
▪ The general approach is to convert the integer to a real.
▪ After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the
source program, which can be thought of as a program for an abstract machine.
▪ This intermediate representation should have two important properties;
➢ it should be easy to produce, and
➢ it should be easy to translate into the target program
▪ An intermediate form called “three-address code”, which is like the assembly language for a machine in which
every memory location can act like a register.
▪ Three-address code consists of a sequence of instructions, each of which has at most three operands.
▪ The source program might appear in three-address code as
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
Code Optimization
▪ The code optimization phase attempts to improve the intermediate code, so that faster-running machine code
will result.
▪ For example, a straightforward algorithm generates the intermediate code using an instruction for each operator in
the tree representation after semantic analysis, even though there is a better way to perform the same
calculation, using just the following two instructions.
temp1:= id3 * 60.0
id1 := id2 + temp1
▪ There is nothing wrong with this simple algorithm, since the problem can be fixed during the code-
optimization phase.
▪ That is, the compiler can deduce that the conversion of 60 from integer to real representation can be done
once and for all at compile time, so the inttoreal operation can be eliminated.
▪ There is great variation in the amount of code optimization different compilers perform.
▪ In ‘optimizing compilers’, a significant fraction of the compiler’s time is spent on this phase.
Code Generation
▪ The final phase of the compiler is the generation of target code, consisting normally of relocatable machine
code or assembly code.
▪ Memory locations are selected for each of the variables used by the program. Then, intermediate instructions
are each translated into a sequence of machine instructions that perform the same task.
▪ A crucial aspect of code generation is the assignment of variables to registers.
Symbol-Table Management
▪ An essential function of a compiler is to record the identifiers used in the source program and collect
information about various attributes of each identifier.
▪ These attributes may provide information about the storage allocated for an identifier, its type, its scope and
in case of procedure names, such things at the number and types of its arguments and methods of passing
each argument and type returned.
▪ The symbol table is a data structure containing a record for each identifier with fields for the attributes of
the identifier.
▪ The data structure allows us to find the record for each identifier quickly and to store or retrieve data from
that record quickly.
▪ Whenever an identifier is detected by the lexical analyzer, it is entered into the symbol table. However, the
attributes of an identifier cannot normally be determined during lexical analysis.
The lexical analyzer is the part of the compiler that reads the source text, and it may also perform certain
secondary tasks at the user interface.
➢ One task is stripping out from the source program comments and white space in the form of blank, tab and
new line characters.
➢ Another is correlating error messages from the compiler with the source program.
Lexical analyzers are sometimes divided into a cascade of two phases:
1. Scanning → responsible for the simple tasks.
2. Lexical analysis → handles the more complex operations.
For example, a FORTRAN compiler might use a scanner to eliminate blanks from the input.
The lexical analyzer is the only phase of the compiler that reads the source program character-by-character, so it
can spend a considerable amount of time in the lexical analysis phase.
Buffer Pairs
▪ Two pointers to the input buffer are maintained.
▪ The string of characters between the pointers is the current lexeme.
▪ Initially, both pointers point to the first character of the next lexeme to be found.
▪ The forward pointer scans ahead until a match for a pattern is found.
▪ Once the next lexeme is determined, the forward pointer is set to the character at its right end.
▪ After the lexeme is processed, both pointers are set to the character immediately past the lexeme.
▪ The comments and white space can be treated as patterns that yield no token.
▪ The buffer is divided into two N-character halves, where N is the number of characters on one disk block, e.g.,
1024 or 4096.
▪ If the forward pointer is about to move past the halfway mark, the right half is filled with N new input characters.
▪ If the forward pointer is about to move past the right end of the buffer, the left half is filled with N new characters
and the forward pointer wraps around to the beginning of the buffer (see the C sketch after the Sentinels list below).
Sentinels
▪ The sentinel is a special character that cannot be part of the source program.
▪ Each buffer half holds a sentinel character at its end (eof).
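A minimal C sketch of this buffer-pair-with-sentinels scheme (the layout, the fill() and next_char() helpers, and the use of the EOF byte as sentinel are assumptions for illustration; the scheme breaks if the sentinel byte ever occurs in the source text):
#include <stdio.h>

#define N 4096                       /* characters per buffer half (one disk block) */
static char buf[2 * N + 2];          /* two halves, each followed by a sentinel slot */
static char *forward;                /* scanning pointer */
static FILE *src;                    /* source file */

/* Load one half with up to N characters; the sentinel marks its end. */
static void fill(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = (char)EOF;
}

static void init(FILE *f) { src = f; fill(buf); forward = buf; }

/* Return the next source character, reloading a half only when its
   sentinel is reached - one test per character in the common case. */
static int next_char(void) {
    for (;;) {
        char c = *forward++;
        if (c != (char)EOF) return (unsigned char)c;
        if (forward == buf + N + 1)            /* sentinel ending the first half */
            fill(buf + N + 1);                 /* reload the second half */
        else if (forward == buf + 2 * N + 2) { /* sentinel ending the second half */
            fill(buf);                         /* reload the first half */
            forward = buf;                     /* wrap around */
        } else
            return EOF;                        /* sentinel inside a half: real end of input */
    }
}

int main(void) {
    init(stdin);
    long count = 0;
    while (next_char() != EOF) count++;
    printf("%ld characters\n", count);
    return 0;
}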
Specification of Tokens
▪ The term alphabet or character class denotes any finite set of symbols. Typical examples of symbols are
letters and characters.
▪ The set {0, 1} is the binary alphabet.
▪ ASCII and EBCDIC are two examples of computer alphabet.
▪ A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
▪ The length of string s, usually written |s|, is the number of occurrences of symbols in s.
▪ The empty string, denoted ε, is a special string of length zero.
▪ The term language denotes any set of strings over some fixed alphabet. Abstract languages like ∅, the
empty set, or {ε}, the set containing only the empty string, are languages.
▪ If x and y are strings, then the concatenation of x and y is also string, denoted xy, is the string formed by
appending y to x.
For example, if x = ban and y = ana, then xy = banana.
▪ The empty string ε is the identity element under concatenation; that is, for any string s, sε = εs = s.
Regular Definitions
A regular definition is a sequence of definitions of the form
d1→r1
d2→r2
...
dn→rn
where each di is a distinct name, and each ri is a regular expression.
Example
1. The set of Pascal identifier is the set of strings of letters and digits beginning with a letter.
The regular definition is
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
2. Unsigned numbers in Pascal are strings such as 5280, 39.37, 6.336E4 or 1.894E-4.
The regular definition is
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
where the terminals if, then, else, relop, id, and num generate set of strings given by the following regular
definitions:
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
The lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and
num.
We assume lexemes are separated by white space consisting of blanks, tabs, and newlines. The lexical analyzer
will strip out white space, recognized by the definitions
delim → blank | tab | newline
ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser.
▪ To construct a lexical analyzer that will isolate the lexeme for the next token in the input buffer and produce
as output a pair consisting of the appropriate token and attribute value, using the translation table.
▪ The attribute values for the relational operators are given by the symbolic constants LT, LE, EQ, NE, GT,GE.
Regular expression pattern for tokens
Transition diagram
▪ As an intermediate step in the construction of a lexical analyzer, we produce a stylized flowchart called a
transition diagram.
▪ Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the
next token.
▪ A transition diagram keeps track of information about characters that are seen as the forward pointer scans
the input, moving from position to position in the diagram as characters are read.
▪ Positions in a transition diagram are drawn as circles and are called states.
▪ The states are connected by arrows, called edges.
▪ Edges leaving state s have labels indicating the input characters that can next appear after the transition
diagram has reached state s.
▪ The label other refers to any character that is not indicated by any of the other edges leaving s.
▪ Transition diagrams are deterministic, i.e., no symbol can match the labels of two edges leaving one state.
▪ One state is labeled as the start state; it is the initial state of the transition diagram where control resides
when we begin to recognize a token.
▪ Certain states may have actions that are executed when the flow of control reaches that state.
▪ On entering a state we read the next input character.
▪ If there is an edge from the current state whose label matches this input character, we then go to the state
pointed to by the edge.
▪ Otherwise we indicate failure.
Transition Diagrams for >=
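In text form, such a diagram for > and >= looks like this (the state numbers are illustrative):

start --'>'--> (6) --'='-->   (7)  return(relop, GE)
               (6) --other--> (8)* return(relop, GT)

The * marks a state on whose entry the forward pointer must be retracted, since the 'other' character is not part of the operator.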
▪ gettoken( ): looks up the symbol table and returns the token found there (id, if, then, …)
▪ install_id( ): returns 0 if the lexeme is a keyword, or a pointer to the symbol-table entry if it is an identifier
We use a case statement to find the start state of the next transition diagram. In the C implementation, two
variables state and start keep track of the present state and the starting state of the current transition diagram.
Edges in transition diagrams are traced by repeatedly selecting the code fragment for a state and executing the
code fragment to determine the next state.
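A small self-contained C sketch in this style, simulating the relational-operator diagram directly (the relop() function, its token codes, and the *len convention are assumptions for illustration):
#include <stdio.h>

enum { LT, LE, NE, EQ, GT, GE, ERR };

/* Simulate the relop transition diagram on the string s; *len receives
   the number of characters consumed (the caller initializes it to 0). */
static int relop(const char *s, int *len) {
    int state = 0;
    for (;;) {
        char c = s[*len];
        switch (state) {
        case 0:
            if (c == '<')      { state = 1; (*len)++; }
            else if (c == '=') { (*len)++; return EQ; }
            else if (c == '>') { state = 6; (*len)++; }
            else return ERR;
            break;
        case 1:                         /* seen '<' */
            if (c == '=') { (*len)++; return LE; }
            if (c == '>') { (*len)++; return NE; }
            return LT;                  /* 'other': retract, i.e., do not consume c */
        case 6:                         /* seen '>' */
            if (c == '=') { (*len)++; return GE; }
            return GT;                  /* 'other': retract */
        }
    }
}

int main(void) {
    int len = 0;
    printf("token %d, length %d\n", relop(">=", &len), len);  /* GE, 2 */
    return 0;
}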
8. Elaborate on the language for specifying lexical analyzer. (5 MARKS) (NOV 2013)
▪ Several tools have been built for constructing lexical analyzers from special purpose notations based on
regular expressions.
▪ Regular expressions are used for specifying token patterns.
▪ A particular tool called Lex, which is used to specify lexical analyzer for a variety of languages.
▪ We refer to the tool as the Lex compiler and to its input specification as the Lex language.
▪ The declarations section includes declarations of variables, manifest constants, and regular definitions.
▪ The translation rules of a Lex program are statements of the form
p1 { action1}
p2 { action2}
…… …..
pn { actionn}
where each pi is a regular expression and each actioni is a program fragment describing what action the
lexical analyzer should take when pattern pi matches a lexeme.
▪ The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these
procedures can be compiled separately and loaded with the lexical analyzer.
A lexical analyzer created by Lex behaves with a parser in the following manner.
▪ When activated by the parser, the lexical analyzer begins reading its remaining input, one character at a time,
until it has found the longest prefix of the input that is matched by one of the regular expressions pi.
▪ Then it executes actioni.
▪ The lexical analyzer returns a single quantity, the token, to the parser.
▪ To pass an attribute value with the information about the lexeme, we can set a global variable called yylval.
Lex Program for the tokens
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = install_id(); return(ID); }
{number} {yylval = install_num(); return(NUMBER);}
"<"  {yylval = LT; return(RELOP); }
"<=" {yylval = LE; return(RELOP); }
"="  {yylval = EQ; return(RELOP); }
"<>" {yylval = NE; return(RELOP); }
">"  {yylval = GT; return(RELOP); }
">=" {yylval = GE; return(RELOP); }
%%
install_id()
{
/* procedure to install the lexeme, whose first character is pointed to by yytext,
and whose length is yyleng, into the symbol table and return a pointer */
}
install_num()
{
/* similar procedure to install a lexeme that is a number */
}
In the declaration section, the declaration of certain manifest constants used by the translation rules. These
declarations are surrounded by the special brackets %{ and %}. Anything appearing between these brackets is
copied directly into the lexical analyzer lex.yy.c and is not treated as part of the regular definitions or the
translation rules.
The auxiliary procedures appear in the third section. There are two procedures, install_id and install_num, that
are used by the translation rules; these procedures will be copied into lex.yy.c verbatim.
In the definitions section are some regular definitions. Each such definition consists of a name and a
regular expression denoted by that name. For example, the first name defined is delim; it stands for the
character class [ \t\n], that is any of the three symbols blank, tab (\t) or newline (\n). The second definition
is of white space, denoted by the name ws. White space is any sequence of one or more delimiter character.
In the definition of letter, we see the use of a character class. The shorthand [A-Za-z] means any of the
capital letters A through Z or the lower-case letters a through z. The fifth definition, of id, uses parentheses,
which are metasymbols in Lex. Similarly, the vertical bar is Lex metasymbol representing union.
The translation rules appear in the section following the first %%. The first rule says that if we see ws, that is,
any maximal sequence of blanks, tabs, and newlines, we take no action. In particular, we do not return to the
parser.
The second rule says that if the letters if are seen, return the token IF, which is a manifest constant
representing some integer understood by the parser to be the token if.
In the rule for id, we see two statements in the associated action. First, the variable yylval is set to the
value returned by procedure install_id; the definition of that procedure is in the third section. yylval is a
variable whose definition appears in the Lex output lex.yy.c, and which is also available to the parser. The
purpose of yylval is to hold the lexical value returned, since the second statement of the action, return (ID), can
only return a code for the token class.
We may suppose that it looks in the symbol table for the lexeme matched by the pattern id. Lex makes the
lexeme available to routines appearing in the third section through two variables yytext and yyleng. The
variable yytext corresponds to the variable that we have been calling lexeme_beginning, that is, a pointer to the
first character of the lexeme; yyleng is an integer telling how long the lexeme is. For example, if install_id fails to
find the identifier in the symbol table, it might create a new entry for it. The yyleng characters of the input,
starting at yytext, might be copied into a character array and delimited by an end of string marker. The new
symbol table entry would point to the beginning of this copy.
Numbers are treated similarly by the next rule, and for the last six rules yylval is used to return a code for the
particular relational operator found, while the actual return value is the code for the token relop.
A NFA can be diagrammatically represented by a labeled directed graph, called a transition graph, in which the
nodes are the states and the labeled edges represent the transition function. This graph looks like a transition
diagram, but the same character can label two or more transitions out of one state, and edges can be labeled by
the special symbol ε as well as by input symbols.
The transition graph for an NFA that recognizes the language (a|b)*abb is shown below.
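In text form, the standard automaton can be written as a transition listing (a sketch):

state 0 (start): on a go to {0, 1}; on b go to {0}
state 1: on b go to {2}
state 2: on b go to {3}
state 3: accepting state

Note the nondeterminism: on input a, state 0 can move either to 0 or to 1.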
Deterministic Finite Automata (DFA)
A deterministic finite automaton has at most one transition from each state on any input. A DFA is a special case
of an NFA in which:
1. it has no transitions on input ε, and
2. each input symbol has at most one transition from any state.
The regular expression is converted into minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA
The Finite Automata is called DFA if there is only one path for a specific input from current state to next state.
From state S0, for input ‘a’ there is only one path, going to S2. Similarly, from S0 there is only one path for input
‘b’, going to S1.
Transition Table
▪ When describing an NFA, we use the transition graph representation. In a computer, the transition function of
an NFA can be implemented in several different ways. The easiest implementation is a transition table, in
which there is a row for each state and a column for each input symbol and ε, if necessary.
▪ The transition table representation has the advantage that it provides fast access to the transitions of a given
state on a given character; its disadvantage is that it can take up a lot of space when the input alphabet is large
and most transitions are to the empty set.
10. Write a LEX program to count the vowels and consonants in a given string. (5 MARKS) (NOV 2017)
LEX Program to count the number of vowels and consonants in a given string:
Vow.l
%{
int vow_count=0;
int const_count =0;
%}
%%
[aeiouAEIOU] {vow_count++;}
[a-zA-Z] {const_count++;}
%%
int main()
{
printf("Enter the string of vowels and consonants:");
yylex();
printf("The number of vowels are: %d\n", vow);
printf("The number of consonants are: %d\n", cons);
return 0;
}
Output:
$ lex Vow.l
$ gcc lex.yy.c -ll
$ ./a.out
Enter the string of vowels and consonants:
My name is SMVEC
The number of vowels are: 4
The number of consonants are: 9
11. Explain the role of the parser? (10 MARKS) (MAY 2012)
▪ In compiler model, parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar for the source program.
▪ The parser should report any syntax errors in an intelligible fashion.
▪ It should also recover from commonly occurring errors so that it can continue processing the remainder of its
input.
▪ Parsing methods are commonly classified as top-down or bottom-up; in both cases, the input to the parser is
scanned from left to right, one symbol at a time.
▪ The most efficient top-down and bottom-up parsers can be implemented only for sub-classes of context-free
grammars.
➢ LL for top-down parsing
➢ LR for bottom-up parsing
Syntax error handling
If a compiler had to process only correct programs, its design and implementation would be greatly simplified.
However, a program can contain errors at many different levels:
➢ Lexical, such as misspelling an identifier, keyword, or operator
➢ Syntactic, such as an arithmetic expression with unbalanced parentheses
➢ Semantic, such as an operator applied to an incompatible operand
➢ Logical, such as an infinitely recursive call
The error handler in a parser has goals that are simple to state:
➢ It should report the presence of errors clearly and accurately.
➢ It should recover from each error quickly enough to be able to detect subsequent errors.
➢ It should not significantly slow down the processing of correct programs.
Several parsing methods such as LL and LR methods, detect an error as soon as possible.
Error productions
▪ Augment the grammar with productions that generate the erroneous constructs.
▪ The grammar, augmented by these error productions, is used to construct a parser.
▪ If an error production is used by the parser, it can generate appropriate error diagnostics to indicate the
erroneous construct recognized in the input.
Global correction
▪ Algorithms are used for choosing a minimal sequence of changes to obtain a globally least cost correction.
▪ Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string
y; such that the number of insertions, deletions and changes of tokens required to transform x into y is as
small as possible.
▪ This technique is most costly in terms of time and space.
12. Explain the Context Free Grammar (CFG)? (5 MARKS) (NOV 2017)
A Context Free Grammar (CFG) consists of terminals, non-terminals, a start symbol and productions.
Example
The grammar with the following productions defines simple arithmetic expressions.
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
▪ In this grammar, the terminals symbols are
id + - * / ↑ ( )
▪ The non-terminal symbols are expr and op
▪ expr is the start symbol
Notational Conventions
Derivations
▪ Derivational view gives a precise description of the top-down construction of a parse tree.
▪ The central idea is that a production is treated as a rewriting rule in which the non-terminal on the left is
replaced by the string on the right side of the production.
▪ For example, consider the following grammar for arithmetic expressions, with the non-terminals E
representing an expression.
E → E + E | E – E | E * E | (E) | - E | id
▪ The production E → - E signifies that an expression preceded by a minus sign is also an expression. This
production can be used to generate more complex expressions from simpler expressions by allowing us to
replace any instance of an E by - E.
E ➔ - E
▪ Given a grammar G with start symbol S, we use the ➔+ relation to define L (G), the language generated by G.
▪ Strings in L (G) may contain only terminal symbols of G.
▪ A string of terminals w is in L (G) if and only if S ➔+ w. The string w is called a sentence of G.
▪ A language that can be generated by a grammar is said to be a context-free language.
▪ If two grammars generate the same language, the grammars are said to be equivalent.
The string – (id+id) is a sentence of the grammar, and the derivation is
E ➔ -E ➔ -(E) ➔ -(E+E) ➔ -(id+E) ➔ -(id+id)
➢ At each derivation step, we can choose any of the non-terminals in the sentential form of G for the
replacement.
➢ If we always choose the left-most non-terminal in each derivation step, this derivation is called as left-most
derivation.
➢ If we always choose the right-most non-terminal in each derivation step, this derivation is called as right-
most derivation (Canonical derivation).
▪ A parse tree may be viewed as a graphical representation for a derivation that filters out the choice
regarding replacement order.
▪ Each interior node of a parse tree is labeled by some non-terminals A, and that the children of the node are
labeled, from left to right, by the symbols in the right side of the production by which this A was replaced in
the derivation.
▪ The leaves of the parse tree are labeled by non-terminals or terminals and read from left to right; they
constitute a sentential form, called the yield or frontier of the tree.
Ambiguity
▪ A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
▪ An ambiguous grammar is one that produces more than one left most or more than one right most
derivation for the same sentence.
▪ For most parsers, the grammar must be unambiguous.
▪ An unambiguous grammar gives a unique selection of the parse tree for a sentence.
The sentence id+id*id has the two distinct leftmost derivations:
E ➔ E + E             E ➔ E * E
  ➔ id + E              ➔ E + E * E
  ➔ id + E * E          ➔ id + E * E
  ➔ id + id * E         ➔ id + id * E
  ➔ id + id * id        ➔ id + id * id
13. Write the steps in writing a grammar for a programming language. (5 MARKS) (NOV 2013, 2017)
(Or) Draw an algorithm for left factoring grammar. (MAY 2014)
There are several reasons the regular expressions differ from CFG.
1. The lexical rules of a language are frequently quite simple. No need of any notation as powerful as grammars.
2. Regular expressions generally provide a more concise and easier to understand notation for tokens than
grammars.
3. More efficient lexical analyzers can be constructed automatically from regular expressions than from
arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and non-lexical parts provides a convenient way
of modularizing the front end of a compiler into two manageable-sized components.
➢ Regular expressions are most useful for describing the structure of lexical constructs such as identifiers,
constants, keywords etc…
➢ Grammars are most useful in describing nested structures such as balanced parenthesis, matching begin -
end’s, corresponding if-then-else’s and so on.
Eliminating Ambiguity
▪ An ambiguous grammar can be rewritten to eliminate the ambiguity.
▪ Example: eliminating the ambiguity from the following “dangling-else” grammar:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
Here “other” stands for any other statement. According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has the parse tree as
▪ A grammar is left recursive if it has a non-terminal A such that there is a derivation A ➔+ Aα for some string α.
▪ Top-down parsing techniques cannot handle left-recursive grammars, so a transformation that eliminates left
recursion is needed.
▪ The left-recursive pair of productions A → Aα | β could be replaced by the non-left-recursive productions
A → βA’
A’ → αA’ | ε
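For example, applying this transformation to the left-recursive expression grammar
E → E + T | T
T → T * F | F
F → (E) | id
gives the non-left-recursive grammar
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id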
Left Factoring
▪ Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive
parsing.
▪ The basic idea is that when it is not clear which of two alternative productions to use to expand a non-
terminal A, we rewrite the A-productions to defer the decision until enough of the input has been seen to make
the right choice.
In general, if the productions are of the form A → αβ1 | αβ2, then they are left factored as:
A → αA’
A’ → β1 | β2
Input: Grammar G.
Output: An equivalent left-factored grammar.
Method: For each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, i.e.,
there is a nontrivial common prefix, replace all the A productions A → αβ1 | αβ2 | … | αβn | γ where γ represents
all alternatives that do not begin with α by,
A → αA’| γ
A’ → β1 | β2 | … | βn
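For example, the abstracted dangling-else grammar
S → iEtS | iEtSeS | a
E → b
(where i, t, e stand for if, then, else) left factors to
S → iEtSS’ | a
S’ → eS | ε
E → b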
14. Briefly write on Parsing techniques. Explain with illustration the designing of a Predictive Parser.
(10 MARKS) (NOV 2013, 2016) (MAY 2016)
➢ The top-down parsing and show how to construct an efficient non-backtracking form of top-down parser
called predictive parser.
➢ Define the class LL (1) grammars from which predictive parsers can be constructed automatically.
Recursive-Descent Parsing
▪ Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string.
▪ It is to construct a parse tree for the input starting from the root and creating the nodes of the parse tree in
preorder.
▪ The special case of recursive-descent parsing called predictive parsing, where no backtracking is required.
▪ The general form of top-down parsing, called recursive-descent parsing, may involve backtracking, i.e.,
making repeated scans of the input.
▪ However, backtracking parsers are not seen frequently.
▪ One reason is that backtracking is rarely needed to parse programming language constructs.
▪ Even in natural language parsing, backtracking is not very efficient, and tabular methods such as the dynamic
programming algorithm are preferred.
A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite
loop.
Predictive Parsers
▪ By writing a grammar carefully, eliminating left recursion from it, and left factoring the resulting grammar, we
can obtain a grammar that can be parsed by a recursive-descent parser that needs no backtracking, i.e., a
predictive parser.
▪ Flow-of-control constructs in most programming languages, with their distinguishing keywords, are usually
detectable in this way.
▪ For example, if we have the productions
stmt → if expr then stmt else stmt
| while expr do stmt
| begin stmt_list end
then the keywords if, while, and begin tell us which alternative is the only one that could possibly succeed if we
are to find a statement.
Several differences between the transition diagrams for a lexical analyzer and a predictive parser.
➢ In case of parser, there is one diagram for each non-terminal.
➢ The labels of edges are tokens and non-terminals.
➢ A transition on a token means that the transition is taken if that token is the next input symbol.
➢ A transition on a non-terminal A is a call of the procedure for A.
To construct the transition diagram of a predictive parser from a grammar, first eliminate left recursion from the
grammar, and then left factor the grammar.
Then for each non-terminal A do the following:
1. Create an initial and final (return) state.
2. For each production A → X1X2…Xn, create a path from the initial to the final state, with edges
labeled X1, X2, …, Xn.
▪ A predictive parsing program based on a transition diagrams attempts to match terminal symbols against the
input and makes a potentially recursive procedure call whenever it has to follow an edge labeled by a non-
terminal.
▪ A non-recursive implementation can be obtained by stacking the states s when there is a transition on non-
terminal out of s and popping the stack when the final state for a non-terminal is reached.
▪ Transition diagrams can be simplified by substituting diagrams in one another; these substitutions are
similar to the transformations on grammars.
A→βA’
A’→αA’ | ε
The program considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols
determine the action of the parser. There are four possibilities (a C sketch of the driver follows the list).
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a non-terminal, the program consults entry M [X, a] of the parsing table M. This entry will be either an
X-production of the grammar or an error entry. For example, if M [X, a] = {X → UVW}, the parser replaces X on
top of the stack by WVU (U on top).
4. If M[X, a] = error, the parser calls an error recovery routine.
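A compact, self-contained C sketch of this driver, hard-coded for the grammar E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id (the single-character encoding, with e and t for E’ and T’ and i for id, and the M() function are assumptions made for the example):
#include <stdio.h>
#include <string.h>

/* Parsing table M[X, a], returned as the production right side to push;
   "" encodes an epsilon production, NULL an error (blank) entry. */
const char *M(char X, char a) {
    switch (X) {
    case 'E': if (a == 'i' || a == '(') return "Te"; break;
    case 'e': if (a == '+') return "+Te";
              if (a == ')' || a == '$') return ""; break;
    case 'T': if (a == 'i' || a == '(') return "Ft"; break;
    case 't': if (a == '*') return "*Ft";
              if (a == '+' || a == ')' || a == '$') return ""; break;
    case 'F': if (a == 'i') return "i";
              if (a == '(') return "(E)"; break;
    }
    return NULL;
}

int is_terminal(char X) { return strchr("EeTtF", X) == NULL; }

int parse(const char *w) {              /* w must end with '$' */
    char stack[100];                    /* fixed depth: enough for a sketch */
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'E';                 /* start symbol on top of $ */
    const char *ip = w;
    while (top > 0) {
        char X = stack[--top], a = *ip;
        if (is_terminal(X)) {
            if (X != a) return 0;       /* possibility 2 fails: error */
            ip++;                       /* match: advance the input */
        } else {
            const char *rhs = M(X, a);  /* possibility 3: consult M[X, a] */
            if (rhs == NULL) return 0;  /* error entry (possibility 4) */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[top++] = rhs[i];  /* push the right side reversed */
        }
    }
    return *ip == '\0';                 /* accept iff all input consumed */
}

int main(void) {
    printf("%d\n", parse("i+i*i$"));    /* 1: accepted */
    printf("%d\n", parse("i+*i$"));     /* 0: syntax error */
    return 0;
}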
Solution:
The grammar under consideration is the non-left-recursive expression grammar E → TE’, E’ → +TE’ | ε,
T → FT’, T’ → *FT’ | ε, F → (E) | id.
3. Computation of FOLLOW
FOLLOW (E) = {$} U { ) } = { ), $ } (the ) comes from the production F → (E))
FOLLOW (E’) = FOLLOW (E) = { ), $ }
FOLLOW (T) = FOLLOW (E’) U FIRST (E’) = {), $} U {+} = { +,), $ }
FOLLOW (T’) = FOLLOW (T) = { +,), $ }
FOLLOW (F) = FOLLOW (T’) U FIRST (T’) = {+,), $} U {*} = { +,*,), $ }
Solution:
The grammar under consideration is the left-factored dangling-else grammar S → iEtSS’ | a, S’ → eS | ε, E → b.
2. Computation of FIRST
FIRST(S) = { i , a }
FIRST(S’) = { e, ε }
FIRST (E) = { b }
3. Computation of FOLLOW
FOLLOW(S) = {$} U FIRST(S’) = {$} U {e} = { $, e }
FOLLOW (S’) = FOLLOW(S) = { $, e }
FOLLOW (E) = { t }
18. Explain the LR parsing algorithm in detail. (10 MARKS) (NOV 2011, 2012, 2014, 2015)
(MAY 2012, 2016, 2017, 2018)
Bottom-up syntax analysis technique can be used to parse a large class of context-free grammars. The technique
is called LR (k) parsing.
➢ the "L" is for left-to-right scanning of the input,
➢ the "R" for constructing a rightmost derivation in reverse, and
➢ the k for the number of input symbols of look ahead that are used in making parsing decisions.
➢ When (k) is omitted, k is assumed to be 1.
1. LR parsers can be constructed to recognize virtually all programming language constructs for which
context-free grammars can be written.
2. The LR parsing method is the most general non backtracking shift-reduce parsing method known, yet it
can be implemented as efficiently as other shift-reduce methods.
3. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars
that can be parsed with predictive parsers.
4. An LR parser can detect a syntactic error as soon as it is possible to do a left-to-right scan of the input.
The principal drawback of the method is that it is too much work to construct an LR parser by hand for a typical
programming-language grammar. A special tool – an LR parser generator.
Three techniques are used for constructing an LR parsing table for a grammar.
1. The first method, called simple LR (SLR), is the easiest to implement, but the least powerful of the three.
It may fail to produce a parsing table for certain grammars on which the other methods succeed.
2. The second method, called canonical LR (CLR), is the most powerful, and the most expensive.
3. The third method, called look ahead LR (LALR), is intermediate in power and cost between the other
two. The LALR method will work on most programming language grammars and, with some effort, can be
implemented efficiently.
Model of an LR Parser
▪ The driver program is the same for all LR parsers; only the parsing table changes from one parser to another.
▪ The parsing program reads characters from an input buffer one at a time.
▪ The program uses a stack to store a string of the form s0X1s1X2s2 ... Xmsm, where sm is on top.
▪ Each Xi is a grammar symbol and each si is a symbol called a state.
▪ Each state symbol summarizes the information contained in the stack below it, and the combination of the
state symbol on top of the stack and the current input symbol are used to index the parsing table and
determine the shift-reduce parsing decision.
LR Parsing Algorithm
Input: An input string w and an LR parsing table with functions action and goto for a grammar G.
Output: If w is in L (G), a bottom-up parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer.
The parser then executes the program until an accept or error action is encountered.
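In outline, that program is the standard LR driver loop:

set ip to point to the first symbol of w$;
repeat forever begin
    let s be the state on top of the stack and a the symbol pointed to by ip;
    if action[s, a] = shift s’ then begin
        push a then s’ on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s’ be the state now on top of the stack;
        push A then goto[s’, A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then return
    else error()
end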
E→E+T | T
T→T*F | F
F →(E) | id
Construct an LR parsing table for the above grammar. Give the moves of the LR parser on id * id + id.
Solution:
E -> E + T | T
T -> T * F | F
F -> (E) | id
The augmented grammar, with numbered productions:
(0) E’ -> E
(1) E -> E + T
(2) E -> T
(3) T -> T * F
(4) T -> F
(5) F -> (E)
(6) F -> id
I0 :
E’ -> .E
E -> .E + T
E -> .T
T -> . T * F
T -> .F
F -> .(E)
F -> . id
3. Computation of Goto Function
goto (I, X) { where I -> set of states and X -> E, T, F, +, *, (, ), id }
goto(I0, E)
I1: E’ -> E.
E -> E. + T
goto(I0, T)
I2: E -> T.
T -> T. * F
goto(I0, F)
I3: T -> F.
goto(I0, ( )
I4 : F -> (.E)
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto(I0, id)
I5: F -> id.
goto(I1, +)
I6: E -> E + .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto(I2, *)
I7: T -> T * .F
F -> .(E)
F -> .id
goto(I4, E)
I8: F -> (E . )
E -> E . + T
goto(I4, T)
I2: E -> T.
T -> T. * F
goto(I4, F)
I3: T -> F.
goto(I4, ( )
goto(I4, id)
goto(I6, T)
I9: E -> E + T.
T -> T. * F
goto(I6, F)
I3: T -> F.
goto(I6, ( )
goto(I6, id)
goto(I7, F)
I10: T -> T * F.
goto(I7, ( )
I4: F -> (.E)
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto(I7, id)
goto(I8, ) )
I11: F -> (E).
goto(I8, +)
I6: E -> E + .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto(I9, *)
I7: T -> T *.F
F -> .(E)
F -> .id
1. Shifting Process
I0 : goto(I0, E) = I1
goto(I0, T) = I2
goto(I0, F) = I3
goto(I0, ( ) = I4
goto(I0, id) = I5
I1: goto(I1, +) = I6
I2: goto(I2, * ) = I7
I4: goto(I4, E) = I8
goto(I4, T) = I2
goto(I4, F) = I3
goto(I4, ( ) = I4
goto(I4, id) = I5
I6: goto(I6, T) = I9
goto(I6, F) = I3
goto(I6, ( ) = I4
goto(I6, id) = I5
I7: goto(I7, F) = I10
goto(I7, ( ) = I4
goto(I7, id) = I5
I8: goto(I8, ) ) = I11
goto(I8, +) = I6
I9: goto(I9, * ) = I7
4. Reducing Process
F -> id
1) POP 2 symbols from the stack
2) goto [0, F] = 3
3) Push ‘F3’ into the stack
T->F
1) POP 2 symbols from the stack
2) goto [0, T] = 2
3) Push ‘T2’ into the stack
E->T
1) POP 2 symbols from the stack
2) goto [0, E] = 1
3) Push ‘E1’ into the stack
F->id
1) POP 2 symbols from the stack
2) goto [6, F] = 3
3) Push ‘F3’ into the stack
T->F
1) POP 2 symbols from the stack
2) goto [6, T] = 9
3) Push ‘T9’ into the stack
E->E+T
1) POP 6 symbols from the stack
2) goto [0, E] = 1
3) Push ‘E1’ into the stack
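Putting the shifting and reducing steps together, the moves of the LR parser on the input id * id + id are:

Stack              Input          Action
0                  id*id+id$      shift 5
0 id5              *id+id$        reduce F -> id, goto[0, F] = 3
0 F3               *id+id$        reduce T -> F, goto[0, T] = 2
0 T2               *id+id$        shift 7
0 T2 *7            id+id$         shift 5
0 T2 *7 id5        +id$           reduce F -> id, goto[7, F] = 10
0 T2 *7 F10        +id$           reduce T -> T * F, goto[0, T] = 2
0 T2               +id$           reduce E -> T, goto[0, E] = 1
0 E1               +id$           shift 6
0 E1 +6            id$            shift 5
0 E1 +6 id5        $              reduce F -> id, goto[6, F] = 3
0 E1 +6 F3         $              reduce T -> F, goto[6, T] = 9
0 E1 +6 T9         $              reduce E -> E + T, goto[0, E] = 1
0 E1               $              accept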
20. Consider the following context free grammar (5 MARKS) (MAY 2018)
S→ SS+ | SS* | a
(i) Show how the string aa + a * can be generated by using the above grammar.
(ii) Construct a Parse tree for the generated string.
Solution
S → SS+ | SS* | a
(i) Show how the string aa + a * can be generated by using the above grammar.
Derivation:
S → SS*
→ SS+S * [ S → SS+ ]
→ aS+S * [ S → a]
→ aa+S* [S→a]
→ aa+a* [S → a]
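(ii) Parse tree for the string aa+a* (a text sketch in outline form):

S
├─ S
│   ├─ S ─ a
│   ├─ S ─ a
│   └─ +
├─ S ─ a
└─ *

Reading the leaves from left to right gives a a + a *.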
2 MARKS
All strings that contain an even number of b’s. (NOV 2017) (Ref.Qn.No.48, Pg.no.11)
17. What will be the output for the lexical analyzer? (MAY 2018) (Ref.Qn.No.28, Pg.no.7)
18. List the types of parser for grammars? (MAY 2014) (Ref.Qn.No.3, Pg.no.2)
19. What is Parsing Tree?(MAY 2012, 2016) (Ref.Qn.No.10, Pg.no.3)
20. Differentiate phase and pass.(NOV 2012) (Ref.Qn.No.32, Pg.no.7)
21. Derive the FIRST and FOLLOW for the following grammar.
10 MARKS
(OR)
2. How to recognize the tokens? (Ref.Qn.No.10, Pg.no.35)
(OR)
2. Discuss the role of the lexical analyzer. (Ref.Qn.No.7, Pg.no.28)
(OR)
2. Explain the buffered I/O with sentinels. Elaborate on the language for specifying lexical analyzer.
(OR)
2. Explain in detail how to handle lexical errors due to transposition of the letters. (Ref.Qn.No.7, Pg.no.28)
(OR)
2. What is the role of the lexical analyzer? Explain (Ref.Qn.No.7, Pg.no.28)
(OR)
2. Write short notes on
(OR)
2. Give a detailed account on lexical analysis. (Ref.Qn.No.7, Pg.no.28)
(OR)
2. (a). Discuss the different phases of a compiler. (5) (Ref.Qn.No.3, Pg.no.17)
(OR)
2. List out the role of lexical analyzer. How tokens are recognized in the source program? Explain it with an
example. (Ref.Qn.No.7,10, Pg.no.28,35)
NOV 2011 (REGULAR)
1. Explain the LR parsing algorithm in detail. (Ref.Qn.No.8, Pg.no.32)
MAY 2012 (ARREAR)
1. a) Write an algorithm for constructing LR parser table. (Ref.Qn.No.8, Pg.no.32)
2. Discuss the Role of the parser. (Ref.Qn.No.1, Pg.no.14)
E→E+T | T
T→T*F | F
F →(E) | id
MAY 2013 (ARREAR)
1. Give the following CFG grammar G = ({S, A, B}, {a, b, x}, P, S) with P:
S→A
S→xb
A→aAb
A→B
B→x
Compute the set of LR (1) items for this grammar. Augment the grammar with the default initial
production S’→S$ as production (0) and construct the corresponding LR parsing table.
(OR)
2. Find the predictive parser for the given grammar and parse the sentence (a+b)*c (Ref.Qn.No.6, Pg.no.29)
E→E+T|T
T→T*F|F
F → (E) | id
(OR)
2. Describe about LR parser. (Ref.Qn.No.8, Pg.no.32)
(OR)
2. (a). Consider the following context free grammar (5) (Ref.Qn.No.26, Pg.no.81)
S→ SS+ | SS* | a
(i) Show how the string aa + a * can be generated by using the above grammar.