
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SUBJECT NAME: AUTOMATA AND COMPILER DESIGN SUBJECT CODE: U20CST305

Prepared By:
Mr.M.Shanmugam, AP/CSE

Mr. A. Prakash, AP/CSE

Mrs. S.Deeba, AP/CSE

Verified by: Approved by:


UNIT- IV
Lexical Analysis And Syntax Analysis
Compilers: The Phases of compiler – Lexical analysis - The role of the lexical analyser - Input buffering
- Specification of tokens - Recognition of tokens - A language for specifying lexical analyzers - design of
a lexical analyzer. Parser: Top Down Parser- Predictive Parser, Bottom up Parser- SLR Parser - Syntax
directed definitions - construction of syntax trees.
2 MARKS
1. What is a compiler? (MAY, NOV 2012)
A compiler is a program that reads a program written in one language – the source language – and translates it
into an equivalent program in another language – the target language. As an important part of this translation
process, the compiler reports to its user the presence of errors in the source program.

source program → Compiler → target program
(error messages are reported to the user)
2. What are the classifications of a compiler?
Compilers are sometimes classified as:
➢ Single-pass
➢ Multi-pass
➢ Load-and-go
➢ Debugging or
➢ Optimizing

3. What are the two parts of a compilation? Explain briefly.


There are two parts to compilation as
1. Analysis part
2. Synthesis part
▪ The analysis part breaks up the source program into constituent pieces and creates an intermediate
representation of the source program.
▪ The synthesis part constructs the desired target program from the intermediate representation.

4. What are the tools available in analysis phase?


Many software tools that manipulate source programs first perform some kind of analysis. Examples are:
➢ Structure editors
➢ Pretty printers
➢ Static checkers
➢ Interpreters

5. What is meant by Static Checking? (MAY 2014)


A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the
source program.
6. What is a Query Interpreter?
A Query interpreter translates a predicate containing relational and Boolean operators into commands to
search a database for records satisfying that predicate.
7. List the analysis of the source program?
Analysis consists of three phases:
1. Linear Analysis
2. Hierarchical Analysis
3. Semantic Analysis

8. What is linear analysis or lexical analysis? (MAY 2016)


▪ In a compiler, linear analysis is called lexical analysis or scanning.
▪ Linear analysis in which the stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters having a collective meaning.

9. What is hierarchical analysis? (NOV 2011)


▪ Hierarchical analysis is called parsing or syntax analysis.
▪ Hierarchical analysis involves grouping the tokens of the source program into grammatical phrases that are
used by the compiler to synthesize output.

10. What is semantic analysis?


▪ The semantic analysis phase checks the source program for semantic errors and gathers type information for
the subsequent code-generation phase.
▪ It uses the hierarchical structure determined by the syntax analysis phase to identify the operators and
operands of expressions and statements.

11. Draw the parse tree for a source program as position := initial + rate * 60.
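(The parse-tree figure is not reproduced in this copy; rendered in text, following the description in question 2 of the 10-mark section, it is:)

assignment statement
├── id (position)
├── :=
└── expression
    ├── expression ── id (initial)
    ├── +
    └── expression
        ├── expression ── id (rate)
        ├── *
        └── expression ── num (60)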

12. List the various phases of a compiler?


The following are the various phases of a compiler:
➢ Lexical Analyzer
➢ Syntax Analyzer
➢ Semantic Analyzer
➢ Intermediate code generator
➢ Code optimizer
➢ Code generator

13. What is a symbol table?


▪ A symbol table is a data structure containing a record for each identifier, with fields for the attributes of
the identifier. The data structure allows us to find the record for each identifier quickly and to store or
retrieve data from that record quickly.
▪ Whenever an identifier is detected by a lexical analyzer, it is entered into the symbol table. However, the
attributes of an identifier cannot normally be determined by the lexical analyzer alone.

14. What is Intermediate code generator?


The intermediate representation should have two important properties:
▪ it should be easy to produce, and
▪ it should be easy to translate into the target program

15. Define three address codes? (NOV 2016)


▪ An intermediate form called “three-address code”, which is like the assembly language for a machine in
which every memory location can act like a register.
▪ Three-address code consists of a sequence of instructions, each of which has at most three operands.

16. What is code generator? (MAY 2018)


▪ The final phase of the compiler is the generation of target code, consisting normally of relocatable
machine code or assembly code.
▪ Memory locations are selected for each of the variables used by the program. Then, intermediate
instructions are each translated into a sequence of machine instructions that perform the same task.

17. Define Assembly code?


▪ Assembly code is a mnemonic version of machine code, in which names are used instead of binary codes
for operations, and names are also given to memory addresses.
▪ For a typical statement b := a + 2, the assembly instructions might be
MOV a, R1
ADD #2, R1
MOV R1, b

18. What is the use Two-Pass assembly?


The assembler makes two passes over the input, where a pass consists of reading an input file once.
▪ In the first pass, all the identifiers that denote storage locations are found and stored in a symbol table.
▪ Identifiers are assigned storage locations as they are encountered for the first time.
▪ In the second pass, the assembler scans the input again. This time, it translates each operation code into
the sequence of bits representing that operation in machine language, and it translates each identifier
representing a location into the address given for that identifier in the symbol table.
▪ The output of the second pass is usually relocatable machine code.
19. What is loader and link-editor?
▪ Usually, a program called a loader performs the two functions of loading and link-editing.
▪ The process of loading consists of taking relocatable machine code, altering the relocatable addresses,
and placing the altered instructions and data in memory at the proper location.
▪ The link-editor allows us to make a single program from several files of relocatable machine code. These
files may be the result of several different compilations, and one or more may be library files of routines
provided by the system.

20. State the function of front end and back end of a compiler phase. (MAY 2013)
The front end consists of those phases that depend primarily on the source language and are largely
independent of the target machine.
These include
➢ Lexical analysis
➢ Syntactic analysis
➢ Semantic analysis
➢ Creation of symbol table
➢ Generation of intermediate code
➢ Code optimization
➢ Error handling

21. State the function back end of a compiler phase. (MAY 2013)
The back end of compiler includes those portions that depend on the target machine and generally those
portions do not depend on the source language, just the intermediate language.
These include
➢ Code optimization
➢ Code generation
➢ Error handling and
➢ Symbol-table operations

22. What is single pass?


Several phases of compilation are usually implemented in a single pass, consisting of reading an input file and
writing an output file.
23. Define compiler-compiler.
Systems to help with the compiler-writing process have often been referred to as compiler-compilers,
compiler-generators or translator-writing systems. They are largely oriented around a particular model of
languages, and they are most suitable for generating compilers of languages similar to that model.

24. What is the role of lexical analyzer? (NOV 2013, 2015) (MAY 2018)
▪ The lexical analyzer is the first phase of a compiler.
▪ Its main task is to read the input characters and produce as output a sequence of tokens that the parser
uses for syntax analysis.
▪ Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters
until it can identify the next token.

25. What are the issues in lexical analysis?


The issues in lexical analysis are
➢ Simpler design is the most important consideration.
➢ Compiler efficiency is improved.
➢ Compiler portability is enhanced.

26. Why separate lexical analysis phase is required? (MAY 2013)


i. Simpler design is the most important consideration.
▪ Comments and white space have already been removed by lexical analyzer.
ii. Compiler Efficiency is improved.
▪ Specialized buffering techniques for reading input characters and processing tokens can significantly
speed up the performance of a compiler.
iii. Compiler Portability is enhanced.
▪ Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical
analyzer.

27. What is the major advantage of a lexical analyzer generator? (NOV 2011)
The major advantages of a lexical analyzer generator are
➢ One task is stripping out from the source program comments and white space in the form of blank, tab
and new line characters.
➢ Another is correlating error messages from the compiler with the source program.

28. What are tokens, patterns, and lexeme?


▪ Tokens- Sequence of characters that have a collective meaning.
▪ Patterns- There is a set of strings in the input for which the same token is produced as output. This set of
strings is described by a rule called a pattern associated with the token.
▪ Lexeme- A sequence of characters in the source program that is matched by the pattern for a token.

29. Differentiate between tokens, patterns, and lexeme?

Token lexeme patterns


const const const
if if if
relation <, <=, =, < >, >, >= < or <= or = or < > or >= or >
id pi, count, D2 letter followed by letters and digits
num 3.1416, 0, 6.02E23 any numeric constant
literal “core dumped” any characters between “ and “ except “

30. List the various Panic mode recovery in lexical analyzer?


Possible error recovery actions are:
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters

31. What are the approaches to implement lexical analyzer?


The three general approaches to the implementation of a lexical analyzer
1. Use a lexical analyzer generator such as the Lex compiler to produce the lexical analyzer from a regular-
expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems programming languages using the I/O facilities of
that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

32. What is input buffering? (MAY 2017)


➢ The input buffer is useful when look-ahead on the input is needed to identify tokens.
➢ To ensure that the right lexeme is found, one or more characters have to be looked ahead beyond the next lexeme.
➢ Hence a two-buffer scheme is introduced to handle large look-ahead safely.
➢ Techniques for speeding up the process of lexical analyzer such as the use of sentinels to mark the buffer
end have been adopted.

33. Define sentinels. (MAY 2014)


▪ Sentinel is a special character that cannot be part of the source program.
▪ The techniques for speeding up the lexical analyzer use the “sentinels” to mark the buffer end.

34. List the operations on languages.


The operations that can be applied to languages are
➢ Union - L U M ={s | s is in L or s is in M}
➢ Concatenation – LM ={st | s is in L and t is in M}
➢ Kleene Closure – L* (zero or more concatenations of L)
➢ Positive Closure – L+ (one or more concatenations of L)

35 Write a regular expression for an identifier.


▪ An identifier is defined as a letter followed by zero or more letters or digits. The regular expression for an
identifier is given as
letter (letter | digit)*

36. How can regular expressions be defined in the specification of the language?
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s); then
➢ (r) | (s) is a regular expression denoting L(r) ∪ L(s).
➢ (r)(s) is a regular expression denoting L(r)L(s).
➢ (r)* is a regular expression denoting (L(r))*.
➢ (r) is a regular expression denoting L(r).

37. Mention the various notational short hands for representing regular expressions.
➢ One or more instances
➢ Zero or one instance
➢ Character classes
➢ Non regular sets

38. What is transition diagram? (NOV 2012, 2016)


▪ As an intermediate step in the construction of a lexical analyzer, we first produce a stylized flowchart called
a transition diagram.
▪ Transition diagram depicts the actions that take place when a lexical analyzer is called by the parser to
get the next token.

39. Define Lex compiler?


Lex is a particular tool used to specify lexical analyzers for a variety of languages. We refer to the tool as
the Lex compiler and to its input specification as the Lex language.

40. How to create a lexical analyzer with Lex?
A specification of the lexical analyzer is prepared as a program lex.l in the Lex language; the Lex compiler
translates lex.l into a C program lex.yy.c, which the C compiler then turns into an object program a.out, the
lexical analyzer that transforms an input stream into a sequence of tokens.
41. List out the parts on Lex specifications. (OR) Give the structure of a Lex program with example.
(MAY 2012, 2015)
A Lex program consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
▪ The declarations section includes declarations of variables, manifest constants, and regular definitions.
▪ The translation rules of a Lex program are statements of the form
p1 { action1}
p2 { action2}
…… …..
pn { actionn}
▪ The auxiliary procedures are needed by the actions. These procedures can be compiled separately and
loaded with the lexical analyzer.
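As a minimal illustrative sketch (a hypothetical example, distinct from the full Lex program given in question 8 of the 10-mark section), a Lex program that counts the identifiers in its input could look like:

%{
/* count the identifiers appearing in the input */
#include <stdio.h>
int id_count = 0;
%}
letter [A-Za-z]
digit  [0-9]
%%
{letter}({letter}|{digit})*   { id_count++; }
.|\n                          { /* ignore everything else */ }
%%
int main() { yylex(); printf("%d identifiers\n", id_count); return 0; }

(Linking with the Lex library supplies the default yywrap routine.)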
42. Compare and contrast Compilers and Interpreters. (MAY 2015, 2017)

Compilers:
▪ A compiler takes the entire program as input.
▪ Intermediate object code is generated.
▪ Memory requirement is more, since object code is generated.
▪ Errors are displayed after the entire program is checked.
▪ Example: C compiler.
Interpreters:
▪ An interpreter takes a single instruction as input.
▪ No intermediate object code is generated.
▪ Memory requirement is less.
▪ Errors are displayed for every instruction interpreted.
▪ Example: BASIC.

43. What is meant by tokens? (NOV 2016, 2017)


➢ Tokens - Sequence of characters that have a collective meaning.
➢ The tokens are keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as
parentheses, commas, and semicolons.

44. Write a regular expression for the following language over the alphabet ∑ = {a, b}:
all strings that contain an even number of b’s. (NOV 2017)
▪ The regular expression for all strings over ∑ = {a, b} that contain an even number of b’s is
a*(ba*ba*)*
▪ For example, it matches ε, aa, bb and abba, but not b or aba.
45. What is the role of parser? (MAY 2018)
▪ In compiler model, parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar for the source program.
▪ The parser should report any syntax errors in an intelligible fashion.
46. Define Parsing? (NOV 2015)
A parser for grammar G is a program that takes as input a string ‘w’ and produces as output either a parse tree
for ‘w’, if ‘w’ is a sentence of G, or an error message indicating that w is not a sentence of G. It obtains a string
of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language.

47. What are the types of Parser? (MAY 2014)


There are three general types of parsers for grammars.
1. Universal parsing methods
▪ Cocke-Younger-Kasami algorithm and
▪ Earley’s algorithm
2. Top down parser
3. Bottom up parser

48. What are the different levels of syntax error handler?


▪ Lexical, such as misspelling an identifier, keyword, or operator
▪ Syntactic, such as an arithmetic expression with unbalanced parentheses
▪ Semantic, such as an operator applied to an incompatible operand
▪ Logical, such as an infinitely recursive call

49. What are the goals of error handler in a parser?


▪ It should report the presence of errors clearly and accurately.
▪ It should recover from each error quickly enough to be able to detect subsequent errors.
▪ It should not significantly slow down the processing of correct programs.

50. What are error recovery strategies in parser?


➢ Panic mode
➢ Phrase level
➢ Error productions
➢ Global correction

51. Define derivation.


Derivation is the top-down construction of a parse tree. Each production is treated as a rewriting rule in which the
non-terminal on the left is replaced by the string on the right side of the production.

52. What is left-most and right-most derivation?


▪ If the left-most non-terminal is replaced in each derivation step, the derivation is called a left-most derivation.
▪ If the right-most non-terminal is replaced in each derivation step, the derivation is called a right-most derivation
(canonical derivation).

53. What is Parsing Tree? (MAY 2012, 2016)


▪ A parse tree can be viewed as a graphical representation for a derivation.
▪ The leaves of a parse tree are terminal symbols.
▪ Inner nodes of a parse tree are non-terminal symbols.

54. Define yield of the string?


The leaves of the parse tree, labeled by non-terminals or terminals and read from left to right,
constitute a sentential form called the yield or frontier of the tree.

55. Define ambiguous. (MAY 2012)


A grammar that produces more than one parse tree for a sentence is said to be ambiguous.

56. What is meant by ambiguous grammar? (NOV 2016)


An ambiguous grammar is one that produces more than one left most or more than one right most derivation
for the same sentence.

57. What is left recursion?


▪ A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
▪ Top-down parsing techniques cannot handle left-recursive grammars, so a transformation that eliminates
left recursion is needed.
Example: The left recursive pair of productions A → Aα | β could be replaced by the non-left-recursive productions
A → βA’
A’ → αA’ | ε
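For instance, applying this transformation to the standard expression grammar E → E + T | T gives the
non-left-recursive grammar
E → T E’
E’ → + T E’ | ε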

58. Define left factoring?


Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing.

59. What are the problems with top down parsing?


The following are the problems associated with top down parsing:
➢ Backtracking
➢ Left recursion
➢ Left factoring
➢ Ambiguity
60. Define top down parsing?
▪ Top-down parsing can be viewed as an attempt to find the left-most derivation for an input string. It can be viewed
as attempting to construct a parse tree for the input starting from the root and creating the nodes of the parse
tree in preorder.
▪ The general form of top-down parsing, called recursive descent, may involve backtracking, i.e., making repeated
scans of the input.

61. What is meant by recursive-descent or predictive parser? (MAY 2016)


▪ A parser that uses a set of recursive procedures to recognize its input with no backtracking is called a
recursive-descent parser.
▪ This recursive-descent parser called predictive parsing.

62. Briefly describe the LL (k) items. (NOV 2013)


In LL(k), the first “L” stands for scanning the input from left to right,
the second “L” for producing a leftmost derivation, and
the “k” for using k input symbols of lookahead at each step.

63. What are the possibilities of non-recursive predictive parsing?


a) If X = a = $, the parser halts and announces successful completion of parsing.
b) If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next symbol.
c) If X is a nonterminal, the program consults entry M[X, a] of the parsing table M. This entry will be either
an X-production of the grammar or an error entry.
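As a sketch of how these rules drive the parser, the following toy C program runs the table-driven loop on the
input aabb; the grammar S → aSb | ε and its LL(1) table are illustrative assumptions, not taken from the text.

#include <stdio.h>

int main(void) {
    const char *input = "aabb$";      /* token stream ending in $    */
    char stack[64] = "$S";            /* bottom marker $, then start symbol */
    int top = 1;                      /* index of stack top          */
    int ip = 0;                       /* input pointer               */

    while (1) {
        char X = stack[top], a = input[ip];
        if (X == '$' && a == '$') {               /* rule a): success */
            printf("accepted\n");
            return 0;
        }
        if (X == 'a' || X == 'b' || X == '$') {   /* X is a terminal  */
            if (X == a) { top--; ip++; }          /* rule b): match, pop and advance */
            else { printf("error\n"); return 1; }
        } else {                                  /* rule c): X = S, consult M[S, a] */
            if (a == 'a') {                       /* M[S, a] = S -> aSb */
                top--;                            /* pop S              */
                stack[++top] = 'b';               /* push RHS reversed  */
                stack[++top] = 'S';
                stack[++top] = 'a';
            } else {                              /* M[S, b] = M[S, $] = S -> ε */
                top--;                            /* pop S, push nothing */
            }
        }
    }
}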

64. Write the algorithm for FIRST and FOLLOW?

FIRST
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if for some i, a is in
FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); if ε is in FIRST(Yj) for all j, then add ε to FIRST(X).

FOLLOW
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST (β) except for ε is placed in FOLLOW (B).
3. If there is a production A→αB, or a production A→αBβ where FIRST (β) contains ε, then everything in
FOLLOW (A) is in FOLLOW (B).
65. What is bottom up parser? (NOV 2017)
▪ Bottom-up parsing is also known as shift-reduce parsing.
▪ Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves (the
bottom) and working up towards the root (the top).

66. Define handle? (OCT 2022)


A handle of a string is a substring that matches the right side of a production, and whose reduction to the
non-terminal on the left side of the production represents one step along the reverse of a rightmost
derivation.

67. What is meant by handle pruning?


▪ A rightmost derivation in reverse can be obtained by handle pruning.
▪ If w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-sentential form of some as
yet unknown rightmost derivation
S = γ0 ⇒ γ1 ⇒ … ⇒ γn-1 ⇒ γn = w
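For instance, with the grammar E → E + E | E * E | ( E ) | id and the input id1 + id2 * id3 (a standard textbook
example, not taken from the text above), handle pruning proceeds as:

Right-sentential form      Handle      Reducing production
id1 + id2 * id3            id1         E → id
E + id2 * id3              id2         E → id
E + E * id3                id3         E → id
E + E * E                  E * E       E → E * E
E + E                      E + E       E → E + E
E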

68. What is meant by viable prefixes?


▪ The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called
viable prefixes.
▪ An equivalent definition of a viable prefix is that it is a prefix of a right-sentential form that does not
continue past the right end of the rightmost handle of that sentential form.

69. Define LR (k) parser?


LR parsers can be used to parse a large class of context free grammars. The technique is called LR (K)
parsing.
➢ “L” is for left-to-right scanning of the input
➢ “R” for constructing a right most derivation in reverse
➢ “k” for the number of input symbols of lookahead that are used in making parsing decisions.

70. Mention the types of LR parser?


The three methods in LR parser
1. Simple LR (SLR) parser
2. Canonical LR (CLR) parser
3. Lookahead LR (LALR) parser

71. What are the techniques for producing LR parsing Table?


1. Shift s, where s is a state
2. Reduce by a grammar production A → β
3. Accept and
4. Error

72. What are the two functions of LR parsing algorithm?


The two functions in LR parsing algorithm are
1. Action function
2. GOTO function

73. Define SLR parser?


The parsing table consisting of the parsing action and goto function determined by constructing an SLR
parsing table algorithm is called SLR(1) table. An LR parser using the SLR (1) table is called SLR (1) parser. A
grammar having an SLR (1) parsing table is called SLR (1) grammar.

74. Differentiate phase and pass. (NOV 2012)


Phase:
▪ Phase is often used to call a single independent part of a compiler.
▪ The phases include the lexical, syntax and semantic analyzers, the intermediate code generator, the code
optimizer, and the code generator.
Pass:
▪ The number of passes of a compiler is the number of times it goes over the source.
▪ Compilers are identified as one-pass or multi-pass compilers.
▪ It is easier to write a one-pass compiler, and one-pass compilers also perform faster than multi-pass
compilers.

75. What are the syntax-directed methods?


The syntax-directed methods can be used to translate programming-language constructs such as the following
into intermediate form:
➢ Declaration
➢ Assignment statements
➢ Boolean Expression
➢ Flow of control statements
76. Derive the FIRST and FOLLOW sets for the following grammar. (MAY 2013)

S→0|1|AS0|BS0 A→ ε B→ ε

Computation for FIRST:

FIRST (A) = {ε}

FIRST (B) = {ε}

FIRST (S): the productions S → 0 and S → 1 contribute 0 and 1. Since A and B derive only ε, the productions
S → AS0 and S → BS0 contribute FIRST (S0) = FIRST (S), so FIRST (S) = {0, 1}.
(ε is not in FIRST (S), since every production for S derives at least the terminal 0 or 1.)

Computation for FOLLOW:

FOLLOW (S) = {$} ∪ {0} = {$, 0} ($ because S is the start symbol; 0 because S is followed by 0 in AS0 and BS0)

FOLLOW (A) = FIRST (S0) = FIRST (S) = {0, 1}

FOLLOW (B) = FIRST (S0) = FIRST (S) = {0, 1}

77. Eliminate left recursion from the following grammar: (MAY 2015)

bexpr → bexpr or bterm | bterm

bterm → bterm and bfact | bfact

bfact → (bexpr) | true | false

Solution:

bexpr → bterm bexpr’

bexpr’ → or bterm bexpr’ | ε

bterm → bfact bterm’

bterm’ → and bfact bterm’ | ε

bfact → (bexpr) | true | false


10 MARKS

1. Write a short note on Compiler Design? (5 MARKS)

A compiler is a program that reads a program written in one language – the source language – and translates
it into an equivalent program in another language – the target language. As an important part of this
translation process, the compiler reports to its user the presence of errors in the source program.

source program → Compiler → target program
(error messages are reported to the user)

▪ At first, the variety of compilers may appear overwhelming. There are thousands of source languages,
ranging from traditional programming languages such as FORTRAN and Pascal to specialized languages in
every area of computer application.
▪ Target languages are equally varied; a target language may be another programming language, or the
machine language of any computer, ranging from a microprocessor to a supercomputer.

Compilers are sometimes classified as:


▪ Single-pass
▪ Multi-pass
▪ Load-and-go
▪ Debugging or
▪ Optimizing

▪ Throughout the 1950’s, compilers were considered notoriously difficult programs to write. The first
FORTRAN compiler, for example, took 18 staff-years to implement.

Analysis-Synthesis Model of Compilation

There are two parts to compilation as


1. Analysis part
2. Synthesis part

▪ The analysis part breaks up the source program into constituent pieces and creates an intermediate
representation of the source program.
▪ The synthesis part constructs the desired target program from the intermediate representation.
▪ During analysis, the operations implied by the source program are determined and recorded in a hierarchical
structure called a tree.
▪ Often, a special kind of tree called a syntax tree is used, in which each node represents an operation and
the children of the node represent the arguments of the operation.
For example, the syntax tree of the assignment statement position := initial + rate * 60 is shown below.
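(The figure is omitted in this copy; in text form the syntax tree is:)

        :=
      /    \
position     +
           /   \
     initial     *
               /   \
            rate    60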

Many software tools that manipulate source programs first perform some kind of analysis. Some examples of
such tools include:
1. Structure editors
2. Pretty printers
3. Static checkers
4. Interpreters

Structure editors
A structure editor takes as input a sequence of commands to build a source program. The
structure editor not only performs the text-creation and modification functions of an ordinary text editor, but it
also analyzes the program text, imposing an appropriate hierarchical structure on the source program. It is useful
in the preparation of source programs.

Pretty printers
A pretty printer analyzes a program and prints it in a way that the structure of the program becomes
clearly visible. For example, comments may appear in a special font.

Static checkers
A static checker reads a program, analyzes it, and attempts to discover potential bugs without
running the source program.

Interpreters
Interpreters are frequently used to execute command languages, since each operator executed in
command languages is usually a complex routine such as an editor or compiler.
The analysis portion in each of the following examples is similar to that of a conventional compiler.
➢ Text formatters
➢ Silicon compilers
➢ Query interpreters

Text formatters
A text formatter takes input that is a stream of characters, most of which is text to be typeset, and it
includes commands to indicate paragraphs, figures, or mathematical structures like subscripts and
superscripts.

Silicon compilers
A silicon compiler has a source language that is similar or identical to a conventional programming
language. However, the variables of the language represent, not locations in memory, but logical signals (0
or 1) or groups of signals in a switching circuit. The output is a circuit design in an appropriate language.

Query interpreters
A Query interpreter translates a predicate containing relational and Boolean operators into commands to
search a database for records satisfying that predicate.

2. Explain the Analysis of the Source Program? (10 MARKS) (NOV 2015)
In compiling, analysis consists of three phases:
1. Linear Analysis
2. Hierarchical Analysis
3. Semantic Analysis

Linear Analysis
▪ Linear analysis, in which the stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters having a collective meaning
▪ In a compiler, linear analysis is called lexical analysis or scanning.
For example, in lexical analysis the characters in the assignment statement position := initial + rate * 60 would
be grouped into the following tokens:
1. The identifier position.
2. The assignment symbol :=
3. The identifier initial.
4. The plus sign.
5. The identifier rate.
6. The multiplication sign.
7. The number 60.
▪ The blanks separating the characters of these tokens would normally be eliminated during the lexical
analysis.

Hierarchical Analysis
▪ Hierarchical analysis is called parsing or syntax analysis.
▪ Hierarchical analysis involves grouping the tokens of the source program into grammatical phrases that are
used by the compiler to synthesize output.
▪ The grammatical phrases of the source program are represented by a parse tree.

Parse tree for position := initial + rate * 60

▪ In the expression initial + rate * 60, the phrase rate * 60 is a logical unit because the usual conventions of
arithmetic expressions tell us that the multiplication is performed before addition.
▪ Because the expression initial + rate is followed by a *, it is not grouped into a single phrase by itself.

▪ The hierarchical structure of a program is usually expressed by recursive rules. For example, the following
rules, as part of the definition of expression:

1. Any identifier is an expression.


2. Any number is an expression
3. If expression1 and expression2 are expressions, then so are
➢ expression1 + expression2
➢ expression1 * expression2
➢ (expression1)

▪ Rules (1) and (2) are non-recursive basic rules, while (3) defines expressions in terms of operators applied to
other expressions.
Similarly, many languages define statements recursively by rules such as:
1. If identifier1 is an identifier and expression2 is an expression, then
identifier1 := expression2
is a statement.
2. If expression1 is an expression and statement2 is a statement, then
while (expression1) do statement2
if (expression1) then statement2
are statements.

A syntax tree is a compressed representation of the parse tree in which the operators appear as the interior
nodes and the operands of an operator are the children of the node for that operator.

Semantic Analysis
▪ The semantic analysis phase checks the source program for semantic errors and gathers type information for
the subsequent code-generation phase.
▪ It uses the hierarchical structure determined by the syntax-analysis phase to identify the operators and
operand of expressions and statements.
▪ An important component of semantic analysis is type checking.
▪ The compiler checks that each operator has operands that are permitted by the source language
specification.
▪ For example, when a binary arithmetic operator is applied to an integer and a real,
the compiler may need to convert the integer to a real.
3. Explain the various Phases of a Compiler with an example? (10 MARKS) (NOV 2011, 2012, 2013,

2016, 2017) (MAY 2012, 2013, 2014, 2015, 2016, 2018, 2022)

A compiler operates in phases, each of which transforms the source program from one representation to another.

The six phases of a complier are


1. Lexical Analyzer
2. Syntax Analyzer
3. Semantic Analyzer
4. Intermediate code generator
5. Code optimizer
6. Code generator

Two other activities are


➢ Symbol table Manager
➢ Error handler

A typical decomposition of a compiler is shown in fig given below


Phases of a compiler

The Analysis Phases


As translation progresses, the compiler’s internal representation of the source program changes. Consider the
translation of the statement,
position := initial + rate * 60

Lexical analyzer

▪ The lexical analysis phase reads the characters in the source program and groups them into a stream of
tokens in which each token represents a logically cohesive sequence of characters, such as an identifier, a
keyword (if, while, etc.), a punctuation character or a multi-character operator like :=.
▪ In a compiler, linear analysis is called lexical analysis or scanning.

▪ Linear analysis in which the stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters.
▪ The character sequence forming a token is called the lexeme for the token.
▪ Certain tokens will be augmented by a ‘lexical value’. For example, when an identifier like rate is found, the
lexical analyzer not only generates a token id, but also enters the lexeme rate into the symbol table, if it is not
already there.

Consider id1, id2 and id3 for position, initial, and rate respectively; note that the internal representation of an
identifier is different from the character sequence forming the identifier.

The representation of the statement given above after the lexical analysis would be:
id1 := id2 + id3 * 60

Syntax analyzer

▪ Hierarchical analysis involves grouping the tokens of the source program into grammatical phrases that are
used by the compiler to synthesize output.
▪ Hierarchical analysis is called parsing or syntax analysis.
▪ Syntax analysis imposes a hierarchical structure on the token stream, which is shown by syntax trees.
Semantic analyzer

▪ The semantic analysis phase checks the source program for semantic errors and gathers type information for the
subsequent code generation phase.
▪ It uses the hierarchical structure determined by the syntax analysis phase to identify the operators and
operands of expressions and statements.
▪ The compiler reports an error if a real number is used to index an array.
▪ The bit pattern representing an integer is generally different from the bit pattern for a real, even if they have the
same value.
▪ For example, the identifiers position, initial and rate are declared to be real, and 60 by itself is assumed to
be an integer.
▪ The general approach is to convert the integer to a real.

Intermediate code generator

▪ After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the
source program. We can think of this intermediate representation as a program for an abstract machine.
▪ This intermediate representation should have two important properties;
➢ it should be easy to produce, and
➢ it should be easy to translate into the target program
▪ An intermediate form called “three-address code”, which is like the assembly language for a machine in which
every memory location can act like a register.
▪ Three-address code consists of a sequence of instructions, each of which has at most three operands.
▪ The source program might appear in three-address code as

temp1 := inttoreal(60)


temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
This intermediate form has several properties:
1. First, each three address instruction has at most one operator in addition to the assignment. Thus, when
generating these instructions, the compiler has to decide on the order in which operations are to be done;
the multiplication precedes the addition in the source program.
2. Second, the compiler must generate a temporary name to hold the value computed by each instruction.
3. Third, some “three-address” instructions have fewer than three operands.

Code Optimization

▪ The code optimization phase attempts to improve the intermediate code, so that faster-running machine code
will result.
▪ For example, a natural algorithm generates the intermediate code, using an instruction for each operator in
the tree representation after semantic analysis, even though there is a better way to perform the same
calculation, using the two instructions.
temp1 := id3 * 60.0
id1 := id2 + temp1

▪ There is nothing wrong with this simple algorithm, since the problem can be fixed during the code-
optimization phase.
▪ That is, the compiler can deduce that the conversion of 60 from integer to real representation can be done
once and for all at compile time, so the inttoreal operation can be eliminated.
▪ There is a great variation in the amount of code optimization different compilers perform.
▪ In ‘optimizing compilers’, a significant fraction of the time of the compiler is spent on this phase.

Code Generation
▪ The final phase of the compiler is the generation of target code, consisting normally of relocatable machine
code or assembly code.
▪ Memory locations are selected for each of the variables used by the program. Then, intermediate instructions
are each translated into a sequence of machine instructions that perform the same task.
▪ A crucial aspect is the assignment of variables to registers.

For example, using registers 1 and 2, the translation of the code as


MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
➢ The first and second operands of each instruction specify a source and destination, respectively.
➢ The F in each instruction tells us that instructions deal with floating-point numbers.
➢ This code moves the contents of the address id3 into register 2, and then multiplies it with the real-constant
60.0.
➢ The # signifies that 60.0 is to be treated as a constant.
➢ The third instruction moves id2 into register 1 and adds to it the value previously computed in register 2.
➢ Finally, the value in register 1 is moved into the address of id1.

Symbol Table Management

▪ An essential function of a compiler is to record the identifiers used in the source program and collect
information about various attributes of each identifier.
▪ These attributes may provide information about the storage allocated for an identifier, its type, its scope and
in case of procedure names, such things at the number and types of its arguments and methods of passing
each argument and type returned.
▪ The symbol table is a data structure containing a record for each identifier with fields for the attributes of
the identifier.
▪ The data structure allows us to find the record for each identifier quickly and to store or retrieve data from
that record quickly.
▪ Whenever an identifier is detected by a lexical analyzer, it is entered into the symbol table.
▪ However, the attributes of an identifier cannot normally be determined during lexical analysis.

For example, in a Pascal declaration like


var position, initial, rate : real;
▪ The type real is not known when position, initial and rate are seen by the lexical analyzer.
▪ The remaining phases get information about identifiers into the symbol table and then use this information
in various ways.

Error Detection and Reporting


▪ Each phase can encounter errors. However, after detecting an error, a phase must somehow deal with that
error, so that compilation can proceed, allowing further errors in the source program to be detected.
▪ A compiler that stops when it finds the first error is not as helpful as it could be.
▪ The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the
compiler.
▪ The lexical phase can detect errors where the characters remaining in the input do not form any token of the
language.
▪ Errors where the token stream violates the structure rules (syntax) of the language are determined by the
syntax analysis phase.
▪ During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no
meaning to the operation involved.
▪ Example: we try to add two identifiers, one of which is the name of an array and the other the name of a
procedure.

Example

Consider the translation of the statement, position := initial + rate * 60
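(The accompanying figure is not reproduced; in text form, collecting the outputs derived above, the statement
passes through the phases as follows:)

position := initial + rate * 60
→ lexical analyzer: id1 := id2 + id3 * 60
→ syntax analyzer: syntax tree with := at the root
→ semantic analyzer: the integer 60 is converted with inttoreal
→ intermediate code generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
→ code optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1
→ code generator:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1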


4. Discuss the Role of the Lexical Analyzer? (10 MARKS) (NOV 2012, 2014) (MAY 2014, 2016, 2017,
2018)

▪ The lexical analyzer is the first phase of a compiler.


▪ Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses
for syntax analysis.
▪ This is implemented by making the lexical analyzer be a sub-routine or a co-routine of the parser.
▪ Upon receiving a “get next token” command from the parser, the lexical analyzer reads input characters until
it can identify the next token.

Interaction of lexical analyzer with parser

The lexical analyzer is the part of the compiler that reads the source text, so it may also perform certain secondary
tasks at the user interface.
➢ One task is stripping out from the source program comments and white space in the form of blank, tab and
new line characters.
➢ Another is correlating error messages from the compiler with the source program.

The lexical analyzer is sometimes divided into a cascade of two phases:
1. Scanning → responsible for doing the simple tasks.
2. Lexical analysis → the more complex operations.

For example, a FORTRAN compiler might use a scanner to eliminate blanks from the input.

Issues in Lexical Analysis


There are several reasons for separating the analysis phase of compiling into lexical analysis and
parsing.
i. Simpler design is the most important consideration.
▪ Comments and white space have already been removed by lexical analyzer.
ii. Compiler Efficiency is improved.
▪ A large amount of time is spent reading the source program and partitioning it into tokens.
▪ Specialized buffering techniques for reading input characters and processing tokens can significantly
speed up the performance of a compiler.
iii. Compiler Portability is enhanced.
▪ Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical
analyzer.
▪ The representation of a special or non-standard symbol such as ↑ in Pascal can be isolated in the
lexical analyzer.

Tokens, Patterns, and Lexemes


▪ Tokens- Sequence of characters that have a collective meaning.
▪ Patterns- There is a set of strings in the input for which the same token is produced as output. This set of
strings is described by a rule called a pattern associated with the token.
▪ Lexeme- A sequence of characters in the source program that is matched by the pattern for a token.

Token lexeme patterns


const const const
if if if
relation <, <=, =, < >, >, >= < or <= or = or < > or >= or >
id pi, count, D2 letter followed by letters and digits
num 3.1416, 0, 6.02E23 any numeric constant
literal “core dumped” any characters between “ and “ except “

Attributes for Tokens


▪ When more than one pattern matches a lexeme, the lexical analyzer must provide information about the
particular lexeme that matched to the phases of a compiler.
▪ For example, the pattern num matches both the strings 0 and 1.
▪ The lexical analyzer collects information about tokens into their associated attributes.
▪ The tokens influence parsing decisions; the attributes influence the translation of tokens.
▪ A token has usually only a single attribute – a pointer to the symbol-table entry in which the information
about the token is kept; the pointer becomes the attribute for the token.

The tokens and associated attribute-values for the FORTRAN statement


E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
Lexical Errors
Possible error recovery actions or Panic mode recovery are
1. Deleting an extraneous character
2. Inserting a missing character
3. Replacing an incorrect character by a correct character
4. Transposing two adjacent characters

5. Explain the Input Buffering with sentinels. (5 MARKS) (NOV 2013)


▪ The two- buffer input scheme is useful when look-ahead on the input is necessary to identify tokens.
▪ The techniques for speeding up the lexical analyzer use the “sentinels” to mark the buffer end.

The three general approaches to the implementation of a lexical analyzer


1. Use a lexical analyzer generator such as the Lex compiler to produce the lexical analyzer from a regular-
expression-based specification. In this case, the generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems programming languages using the I/O facilities of
that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

The lexical analyzer is the only phase of the compiler that reads the source program character-by-character; it is
possible to spend a considerable amount of time in the lexical analysis phase.

Buffer Pairs
▪ Two pointers to the input buffer are maintained.
▪ The string of characters between the pointers is the current lexeme.
▪ Initially, both pointers point to the first character of the next lexeme to be found.
▪ Forward pointer, scans ahead until a match for a pattern is found.
▪ Once the next lexeme is determined, the forward pointer is set to the character at its right end.
▪ After the lexeme is processed, both pointers are set to the character immediately past the lexeme.
▪ The comments and white space can be treated as patterns that yield no token.
▪ A buffer is divided into two N-character halves, where N is the number of characters on one disk block, e.g., 1024
or 4096.

▪ If the forward pointer is to move past the halfway mark, the right half is filled with N new input characters.
▪ If the forward pointer is to move past the right end of the buffer, the left half is filled with N new characters
and the forward pointer wraps around to the beginning of the buffer.

Code to advance forward pointer


if forward at the end of first half then begin
reload second half ;
forward : = forward + 1;
end
else if forward at end of second half then begin
reload first half ;
move forward to beginning of first half
end
else forward : = forward + 1;

Sentinels
▪ The sentinel is a special character that cannot be part of the source program.
▪ Each buffer half to hold a sentinel character at the end (eof).

Lookahead code with sentinels


forward : = forward + 1 ;
if forward ↑ = eof then begin
if forward at end of first half then begin
reload second half ;
forward : = forward + 1
end
else if forward at end of second half then begin
reload first half ;
move forward to beginning of first half
end
else /* eof within buffer signifying end of input */
terminate lexical analysis
end
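A rough C rendering of the sentinel scheme above might look as follows; the buffer size, the sentinel value, and
the reload helpers are assumptions made for illustration.

#define N 4096                /* size of each buffer half            */
#define SENTINEL '\0'         /* assumed eof sentinel character      */

char buf[2 * N + 2];          /* two halves, each followed by a sentinel */
char *forward = buf;          /* the lookahead pointer               */

int advance(void)             /* returns the next character, or -1 at real end of input */
{
    forward++;
    if (*forward == SENTINEL) {
        if (forward == buf + N) {                 /* sentinel at end of first half */
            /* reload_second_half();  -- assumed helper */
            forward++;                            /* step past the sentinel        */
        } else if (forward == buf + 2 * N + 1) {  /* sentinel at end of second half */
            /* reload_first_half();   -- assumed helper */
            forward = buf;                        /* wrap to the beginning         */
        } else {
            return -1;                            /* eof within a buffer: end of input */
        }
    }
    return *forward;
}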
6. Explain in detail about the Specification of Tokens? (10 MARKS)
Regular expressions are an important notation for specifying lexeme patterns. Each pattern matches a set of
strings, so regular expressions will serve as names for set of strings.

(i) Strings and Languages

▪ The term alphabet or character class denotes any finite set of symbols. Typical examples of symbols are
letters and characters.
▪ The set {0, 1} is the binary alphabet.
▪ ASCII and EBCDIC are two examples of computer alphabet.
▪ A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
▪ The length of string s, usually written |s|, is the number of occurrences of symbols in s.
▪ The empty string, denoted ε, is a special string of length zero.
▪ The term language denotes any set of strings over some fixed alphabet. Abstract languages like ∅, the
empty set, or {ε}, the set containing only the empty string, are languages.
▪ If x and y are strings, then the concatenation of x and y, denoted xy, is also a string, formed by
appending y to x.
For example, if x = ban and y = ana, then xy = banana.
▪ The empty string ε is the identity element under concatenation; that is, for any string s, sε = εs = s.

(ii) Operations on Languages


There are several important operations that can be applied to languages.
The most important in lexical analysis are:
➢ Union
➢ Concatenation
➢ Closure
Example
▪ Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and D be the set of digits {0, 1, . . . , 9}.
▪ L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits.

1. L U D is the set of letters and digits


2. LD is the set of strings consisting of a letter followed by a digit.
3. L4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L (L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

(iii) Regular Expressions


▪ A regular expression is a notation for describing strings. In Pascal, an identifier is a letter followed by zero or
more letters or digits.
▪ The Pascal identifier as
letter (letter | digit) *

The rules that define the regular expressions over an alphabet Σ, and the languages they denote, are:


1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s); then
a) (r) | (s) is a regular expression denoting L(r) ∪ L(s).
b) (r)(s) is a regular expression denoting L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).

A language denoted by a regular expression is said to be a regular set.

Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions:


1. The unary operator * has the highest precedence and is left associative.
2. Concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.

(iv) Regular Definitions


▪ For notational convenience, we may give names to regular expressions and define regular expressions using these
names as if they were symbols.
▪ If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

d1→r1
d2→r2
...

dn→rn
where each di is a distinct name, and each ri is a regular expression.

Example
1. The set of Pascal identifiers is the set of strings of letters and digits beginning with a letter.
The regular definition is
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
2. Unsigned numbers in Pascal are strings such as 5280, 39.37, 6.336E4 or 1.894E-4.
The regular definition is
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent

(v) Notational Shorthands


1. One or more instances
Unary postfix operator + means “one or more instances of”.
2. Zero or one instance
Unary postfix operator ? means “zero or one instance of”. The regular definition for num becomes
digit → 0 | 1 | 2 | … | 9
digits → digit+
optional_fraction → (. digits ) ?
optional_exponent → ( E ( + | -) ? digits) ?
num → digits optional_fraction optional_exponent
3. Character classes
▪ The notation [abc] where a, b and c are alphabet symbols denotes the regular expression a | b | c.
▪ The character class such as [a-z] denotes the regular expression a | b | … | z.
▪ Using character classes, we describe identifiers as being strings generated by the regular expression,
[A – Z a – z] [A – Z a – z 0 – 9] *
7. Illustrate the steps involved in the Recognition of Tokens? (10 MARKS) (NOV 2011, 2015) (MAY
2013, 2018)

We considered the problem of how to specify tokens and recognize them.

Consider the following grammar


stmt → if expr then stmt
| if expr then stmt else stmt
| 
expr → term relop term
| term
term → id
| num

where the terminals if, then, else, relop, id, and num generate set of strings given by the following regular
definitions:
if → if
then → then
else → else
relop → < | <= | > | >= | = | < >
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

The lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop, id, and
num.

We assume lexemes are separated by white space consisting of blanks, tabs, and newlines. The lexical analyzer
will strip out white space.
delim → blank | tab | newline
ws → delim +

If a match for ws is found, the lexical analyzer does not return a token to the parser.
▪ To construct a lexical analyzer that will isolate the lexeme for the next token in the input buffer and produce
as output a pair consisting of the appropriate token and attribute value, using the translation table.
▪ The attribute values for the relational operators are given by the symbolic constants LT, LE, EQ, NE, GT, GE.
Regular expression pattern for tokens

Transition diagram
▪ As an intermediate step in the construction of a lexical analyzer, we produce a stylized flowchart called a
transition diagram.
▪ Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the
next token.
▪ Transition diagram to keep track of information about characters that are seen as the forward pointer scans
the input.
▪ Moving from position to position in the diagram as characters are read.
▪ Positions in a transition diagram are drawn as circles and are called states.
▪ The states are connected by arrow, called edges.
▪ Edges leaving state s have labels indicating the input characters that can next appear after the transition
diagram has reached state s.
▪ The label other refers to any character that is not indicated by any of the other edges leaving s.
▪ Transition diagrams are deterministic, i.e., no symbol can match the labels of two edges leaving one state.
▪ One state is labeled as the start state; it is the initial state of the transition diagram where control resides
when we begin to recognize a token.
▪ Certain states may have actions that are executed when the flow of control reaches that state.
▪ On entering a state we read the next input character.
▪ If there is an edge from the current state whose label matches this input character, we then go to the state
pointed to by the edge.
▪ Otherwise we indicate failure.
Transition Diagrams for >=

➢ start state : state 0 in the above example


➢ If input character is >, go to state 6.
➢ other refers to any character that is not indicated by any of the other edges leaving s.

Transition diagram for relational operators

Transition diagram for identifiers and keywords

▪ gettoken( ): returns the token (id, if, then, …) after looking it up in the symbol table
▪ install_id( ): returns 0 if the lexeme is a keyword, or a pointer to the symbol-table entry if it is an id
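As a sketch, the identifier diagram could be coded in C as follows; nextchar(), retract() and install_id() are the
routines described in this section and are assumed to be defined elsewhere.

#include <ctype.h>

extern int  nextchar(void);      /* read next character, advancing forward */
extern void retract(int n);      /* move the forward pointer back n characters */
extern int  install_id(void);    /* enter the lexeme into the symbol table */

int recognize_id(void)
{
    int c = nextchar();
    if (!isalpha(c)) {           /* start state: first character must be a letter */
        retract(1);
        return 0;                /* fail: let the next diagram try */
    }
    do {                         /* looping state: consume letters and digits */
        c = nextchar();
    } while (isalnum(c));
    retract(1);                  /* accepting state is starred: push back the 'other' character */
    install_id();
    return 1;
}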

Transition diagram for digits (Unsigned Numbers)


Transition diagram for delim

Implementing a Transition diagram


▪ A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams.
▪ The systematic approach that works for all transition diagram and constructs programs whose size is
proportional to the number of states and edges in the diagrams.
▪ Each state gets a segment of code. If there are edges leaving a state, then its code reads a character and
selects an edge to follow, if possible.
▪ A function nextchar() is used to read next character from the input buffer, advance the forward pointer at
each call, and return the character read. If there is an edge labeled by the character read, or labeled by a
character class containing the character read, then control is transferred to the code for the state pointed to
by that edge.
▪ If there is no such edge, and the current state is not one that indicates a token has been found, then a routine
fail() is invoked to retract the forward pointer to the position of the beginning pointer and to indicate a
search for a token specified by the next transition diagram.
▪ If there are no other transition diagrams to try, fail() calls an error recovery routine.
▪ To return tokens we use a global variable lexical_value, which is assigned the pointers returned by functions
install_id() and install_num() when an identifier or number, respectively, is found. The token class is
returned by the main procedure of the lexical analyzer, called nexttoken().

We use a case statement to find the start state of the next transition diagram. In the C implementation, two
variables state and start keep track of the present state and the starting state of the current transition diagram.

C code to find next start state
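The figure containing this C code is not reproduced in this copy; a rough reconstruction is given below. The
specific state numbers and the diagram order relop → id → num → ws are assumptions based on the
surrounding description.

extern void retract_to_beginning(void);  /* assumed helper: forward back to the lexeme start */
extern void recover(void);               /* assumed error-recovery routine */

int state = 0, start = 0;
int lexical_value;                       /* set by install_id() / install_num() */

int fail(void)
{
    /* the current diagram failed: retract the forward pointer and
       select the start state of the next transition diagram */
    retract_to_beginning();
    switch (start) {
        case 0:  start = 9;  break;      /* relop failed: try identifiers  */
        case 9:  start = 12; break;      /* id failed:    try numbers      */
        case 12: start = 20; break;      /* num failed:   try white space  */
        case 20: recover();  break;      /* nothing matched: error routine */
        default: /* compiler error */    break;
    }
    return start;
}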

Edges in transition diagrams are traced by repeatedly selecting the code fragment for a state and executing the
code fragment to determine the next state.
8. Elaborate on the language for specifying lexical analyzer. (5 MARKS) (NOV 2013)
▪ Several tools have been built for constructing lexical analyzers from special purpose notations based on
regular expressions.
▪ The use of regular expressions for specifying tokens patterns.
▪ A particular tool called Lex, which is used to specify lexical analyzer for a variety of languages.
▪ We refer to the tool as the Lex compiler and to its input specification as the Lex language.

Creating a lexical analyzer with Lex


1. First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex language.
2. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
3. The program lex.yy.c consists of a tabular representation of a transition diagram constructed from the regular
expressions of lex.l, together with a standard routine that uses the table to recognize lexemes.
4. The actions associated with regular expressions in lex.l are pieces of C code and are carried over directly
to lex.yy.c.
5. Finally lex.yy.c is run through the C compiler to produce an object program a.out, which is the lexical
analyzer that transforms an input stream into a sequence of tokens.
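For example, on a typical Unix system these steps correspond to the following command sequence (the file name lex.l is the conventional one):

$$ lex lex.l          # step 2: produces lex.yy.c
$$ cc lex.yy.c -ll    # step 5: compile and link the Lex library, producing a.out
$$ ./a.out < source   # run the lexical analyzer on an input stream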
Lex Specifications
A Lex program consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures

▪ The declarations section includes declarations of variables, manifest constants, and regular definitions.
▪ The translation rules of a Lex program are statement of the form
p1 { action1}
p2 { action2}
…… …..
pn { actionn}
where each pi is a regular expression and each actioni is a program fragment describing what action the
lexical analyzer should take when pattern pi matches a lexeme.
▪ The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these
procedures can be compiled separately and loaded with the lexical analyzer.

A lexical analyzer created by Lex behaves with a parser in the following manner.
▪ When activated by the parser, the lexical analyzer begins reading its remaining input, one character at a time,
until it has found the longest prefix of the input that is matched by one of the regular expressions pi.
▪ Then it executes actioni.
▪ The lexical analyzer returns a single quantity, the token, to the parser.
▪ To pass an attribute value with the information about the lexeme, we can set a global variable called yylval.
Lex Program for the tokens

%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = install_id(); return(ID); }
{number} {yylval = install_num(); return(NUMBER);}
"<" {yylval = LT; return(RELOP); }
"<=" {yylval = LE; return(RELOP); }
"=" {yylval = EQ; return(RELOP); }
"<>" {yylval = NE; return(RELOP); }
">" {yylval = GT; return(RELOP); }
">=" {yylval = GE; return(RELOP); }
%%

install_id()
{
/* procedure to install the lexeme, whose first character is pointed to by yytext,
and whose length is yyleng, into the symbol table and return a pointer */
}

install_num()
{
/* similar procedure to install a lexeme that is a number */
}
In the declarations section, we see the declaration of certain manifest constants used by the translation rules. These
declarations are surrounded by the special brackets %{ and %}. Anything appearing between these brackets is
copied directly into the lexical analyzer lex.yy.c and is not treated as part of the regular definitions or the
translation rules.

The auxiliary procedures appear in the third section. There are two procedures, install_id and install_num, that
are used by the translation rules; these procedures will be copied into lex.yy.c verbatim.

In the definitions section are some regular definitions. Each such definition consists of a name and a
regular expression denoted by that name. For example, the first name defined is delim; it stands for the
character class [ \t\n], that is any of the three symbols blank, tab (\t) or newline (\n). The second definition
is of white space, denoted by the name ws. White space is any sequence of one or more delimiter character.

In the definition of letter, we see the use of a character class. The shorthand [A-Za-z] means any of the
capital letters A through Z or the lower-case letters a through z. The fifth definition, of id, uses parentheses,
which are metasymbols in Lex. Similarly, the vertical bar is a Lex metasymbol representing union.

The translation rules appear in the section following the first %%. The first rule says that if we see ws, that is,
any maximal sequence of blanks, tabs, and newlines, we take no action. In particular, we do not return to the
parser.

The second rule says that if the letters if are seen, return the token IF, which is a manifest constant
representing some integer understood by the parser to be the token if.

In the rule for id, we see two statements in the associated action. First, the variable yylval is set to the
value returned by procedure install_id; the definition of that procedure is in the third section. yylval is a
variable whose definition appears in the Lex output lex.yy.c, and which is also available to the parser. The
purpose of yylval is to hold the lexical value returned, since the second statement of the action, return (ID), can
only return a code for the token class.

We may suppose that it looks in the symbol table for the lexeme matched by the pattern id. Lex makes the
lexeme available to routines appearing in the third section through two variables yytext and yyleng. The
variable yytext corresponds to the variable that we have been calling lexeme_beginning, that is, a pointer to the
first character of the lexeme; yyleng is an integer telling how long the lexeme is. For example, if install_id fails to
find the identifier in the symbol table, it might create a new entry for it. The yyleng characters of the input,
starting at yytext, might be copied into a character array and delimited by an end of string marker. The new
symbol table entry would point to the beginning of this copy.
Numbers are treated similarly by the next rule, and for the last six rules yylval is used to return a code
for the particular relational operator found, while the actual return value is the code for the token
relop.

9. Explain about Finite Automata? (10 MARKS) (NOV 2016)


A recognizer for a language is a program that takes as input a string x and answers “yes” if x is a sentence of
the language and “no” otherwise. We compile a regular expression into a recognizer by constructing a
generalized transition diagram called a finite automaton.
➢ An automaton has a mechanism to read input from an input tape.
➢ A language is recognized by some automaton; hence these automata are basically language
‘acceptors’ or ‘language recognizers’.

Types of Finite Automata


1. Deterministic Finite Automata (DFA)
2. Non-Deterministic Finite Automata (NFA)
A finite automaton can be deterministic or non-deterministic, where “non-deterministic” means that
more than one transition out of a state may be possible on the same input symbol.
Both deterministic and non-deterministic finite automata are capable of recognizing precisely the regular
sets. Thus they both can recognize exactly what regular expressions can denote. A DFA can lead to a faster recognizer
than an NFA, but a DFA can be much bigger than an equivalent NFA.

Non-Deterministic Finite Automata (NFA)


A NFA is a mathematical model that consists of
1. a set of states S
2. a set of input symbols Σ (the input symbol alphabet)
3. a transition function that gives the moves from one state to other states.
4. A state s0 that is distinguished as the start (or initial) state.
5. A set of states F distinguished as accepting (or final) state.

A NFA can be diagrammatically represented by a labeled directed graph, called a transition graph, in which the
nodes are the states and the labeled edges represent the transition function. This graph looks like a transition
diagram, but the same character can label two or more transitions out of one state and edges can be labeled by
the special symbol ε as well as by input symbols.
The transition graph for an NFA that recognizes the language (a|b)*abb is shown
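With the usual numbering of the states as 0 to 3 (an assumption, since the figure itself is conventional), the components of this NFA are: S = {0, 1, 2, 3}, Σ = {a, b}, start state 0, F = {3}, and the transition function given by
move(0, a) = {0, 1}, move(0, b) = {0}, move(1, b) = {2}, move(2, b) = {3}.
The non-determinism shows up in state 0, where input a can lead to either state 0 or state 1.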
Deterministic Finite Automata (DFA)
A deterministic finite automata has at most one transition from each state on any input. A DFA is a special case of
a NFA in which:-
1. it has no transitions on input ε,
2. Each input symbol has at most one transition from any state.

A DFA is formally defined by the 5-tuple notation M = (Q, Σ, δ, q0, F),


where
1. Q is a finite ‘set of states’, which is non-empty.
2. Σ is the ‘input alphabet’, indicating the input set.
3. q0 is the ‘initial state’ and q0 ∈ Q.
4. F is a set of ‘final states’, F ⊆ Q.
5. δ is a ‘transition function’ or mapping function; using this function the next state can be determined.

The regular expression is converted into minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA

A finite automaton is called a DFA if there is only one path for a specific input from the current state to the next state.

From state S0, for input ‘a’ there is only one path, going to S2. Similarly, from S0 there is only one path for input
‘b’, going to S1.

Transition Table
▪ When describing an NFA, we use the transition graph representation. In a computer, the transition function of
an NFA can be implemented in several different ways. The easiest implementation is a transition table in
which there is a row for each state and a column for each input symbol and ε, if necessary.
▪ The transition table representation has the advantage that it provides fast access to the transitions of a given
state on a given character; its disadvantage is that it can take up a lot of space when the input alphabet is large
and most transitions are to the empty set.
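As a small sketch of such a table-driven recognizer, the C fragment below simulates the DFA for (a|b)*abb; the state numbering (0 to 3, with 3 accepting) is an assumption made for the sketch.

#include <stdio.h>

static const int move[4][2] = {   /* rows: states; columns: inputs 'a','b' */
    {1, 0},                       /* state 0                            */
    {1, 2},                       /* state 1: have seen a               */
    {1, 3},                       /* state 2: have seen ab              */
    {1, 0}                        /* state 3: have seen abb (accepting) */
};

static int accepts(const char *w)
{
    int s = 0;                               /* start state */
    for (; *w; w++) {
        if (*w != 'a' && *w != 'b')
            return 0;                        /* not in the alphabet */
        s = move[s][*w - 'a'];               /* one table lookup per character */
    }
    return s == 3;                           /* accept iff we end in state 3 */
}

int main(void)
{
    printf("%d\n", accepts("aabb"));         /* 1: ends in abb */
    printf("%d\n", accepts("abab"));         /* 0 */
    return 0;
}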

10. Write a LEX program to count the vowels and consonants in a given string. (5 MARKS) (NOV 2017)
LEX Program to count the number of vowels and consonants in a given string:
Vow.l
%{
int vow_count=0;
int const_count =0;
%}

%%
[aeiouAEIOU] {vow_count++;}
[a-zA-Z] {const_count++;}
%%

int main()
{
printf("Enter the string of vowels and consonants:");
yylex();
printf("The number of vowels are: %d\n", vow);
printf("The number of consonants are: %d\n", cons);
return 0;
}
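
Note that a vowel matches both patterns above; when two Lex rules match lexemes of the same length, the rule listed first wins, so vowels are counted only by the [aeiouAEIOU] rule.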

Output:
$$ lex Vow.l
$$ gcc lex.yy.c -ll
$$ ./a.out
Enter the string of vowels and consonants:
My name is SMVEC
The number of vowels are: 4
The number of consonants are: 9
11. Explain the role of the parser? (10 MARKS) (MAY 2012)

▪ In compiler model, parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar for the source program.
▪ The parser should report any syntax errors in an intelligible fashion.
▪ It should also recover from commonly occurring errors so it can continue processing the remainder of its
input.

Position of parser in compiler model

There are three general types of parsers for grammars.

1. Universal parsing methods → too inefficient to use in production compilers.


➢ Cocke-Younger –Kasami algorithm and
➢ Earley’s algorithm
2. Top down parser → it builds parse trees from the top (root) to the bottom (leaves).
3. Bottom up parser → it start from the leaves and work up to the root.

▪ In both cases, the input to the parser is scanned from left to right, one symbol at a time.
▪ The most efficient top-down and bottom-up parsers can be implemented only for sub-classes of context-free
grammars.
➢ LL for top-down parsing
➢ LR for bottom-up parsing
Syntax error handling
If a compiler had to process only correct programs, its design and implementation would be greatly simplified.

A program can contain errors at many different levels, which the syntax error handler must deal with:
➢ Lexical, such as misspelling an identifier, keyword, or operator
➢ Syntactic, such as an arithmetic expression with unbalanced parentheses
➢ Semantic, such as an operator applied to an incompatible operand
➢ Logical, such as an infinitely recursive call
The error handler in a parser has simple to state goals:
➢ It should report the presence of errors clearly and accurately.
➢ It should recover from each error quickly enough to be able to detect subsequent errors.
➢ It should not significantly slow down the processing of correct programs.

Several parsing methods such as LL and LR methods, detect an error as soon as possible.

Error - Recovery Strategies


There are many different strategies that a parser can recover from a syntactic error.
➢ Panic mode
➢ Phrase level
➢ Error productions
➢ Global correction

Panic mode recovery


▪ This is the simplest method to implement and can be used by most parsing methods.
▪ On discovering an error, parser discards input symbols one at a time until one of the designated set of
synchronizing tokens is found.
▪ The synchronizing tokens are usually delimiters such as semicolon or end.
▪ It skips many inputs without checking additional errors, so it has an advantage of simplicity.
▪ It guaranteed not to go in to an infinite loop.

Phrase - level recovery


▪ On discovering an error, parser perform local correction on the remaining input;
▪ It may replace a prefix of the remaining input by some string that allows the parser to continue.
▪ Local correction would be to replace a comma by a semicolon, delete an extra semicolon, or insert a missing
semicolon.

Error productions
▪ Augment the grammar with productions that generate the erroneous constructs.
▪ The grammar augmented by these error productions to construct a parser.
▪ If an error production is used by the parser, generate error diagnostics to indicate the erroneous construct
recognized the input.

Global correction
▪ Algorithms are used for choosing a minimal sequence of changes to obtain a globally least cost correction.
▪ Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string
y; such that the number of insertions, deletions and changes of tokens required to transform x into y is as
small as possible.
▪ This technique is most costly in terms of time and space.

12. Explain the Context Free Grammar (CFG)? (5 MARKS) (NOV 2017)

A Context Free Grammar (CFG) consists of terminals, non-terminals, a start symbol and productions.

The (CFG) Grammar G can be represented as G = (V, T, S, P)


➢ A finite set of terminals (The set of tokens)
➢ A finite set of non-terminals (syntactic-variables)
➢ A start symbol (one of the non-terminal symbol)
➢ A finite set of productions rules in the following form
▪ A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty
string).
▪ Each production consists of a non-terminal, followed by an arrow, followed by a string of non-terminals
and terminals.

Example
The grammar with the following productions defines simple arithmetic expressions.
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
▪ In this grammar, the terminals symbols are
id + - * / ↑ ( )
▪ The non-terminal symbols are expr and op
▪ expr is the start symbol

Notational Conventions

1. These symbols are terminals:


i) Lower case letters in the alphabet such as a, b, c.
ii) Operator symbols such as +,-, etc.
iii) Punctuation symbols such as parenthesis, comma, etc.
iv) The digits 0, 1,…….., 9.
v) Boldface strings such as id or if.
2. These symbols are non-terminals
i) Upper case letters in the alphabet such as A, B, C.
ii) The letter S, when it appears, is usually the start symbol.
iii) Lower-case italic names such as expr or stmt.
3. Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either non-
terminals or terminals.
4. Lower-case letters late in the alphabet u, v, ..., z, represent strings of terminals.
5. Lower-case Greek letters α, β, γ represent strings of grammar symbols.
6. If A → α1, A → α2, A → α3, ……, A → αk are all productions with A on the left (A-productions), write as
A → α1 | α2 | α3 | … | αk the alternatives for A.
7. The left side of the first production is the start symbol.

Derivations

▪ The derivational view gives a precise description of the top-down construction of a parse tree.
▪ The central idea is that a production is treated as a rewriting rule in which the non-terminals on the left is
replaced by the string on the right side of the production.
▪ For example, consider the following grammar for arithmetic expressions, with the non-terminals E
representing an expression.
E → E + E | E – E | E * E | (E) | - E | id

▪ The production E → - E signifies that an expression preceded by a minus sign is also an expression. This
production can be used to generate more complex expressions from simpler expressions by allowing us to
replace any instance of an E by - E.
E ⇒ -E
▪ Given a grammar G with start symbol S, use the ⇒+ relation to define L (G), the language generated by G.
▪ Strings in L (G) may contain only terminal symbols of G.
▪ A string of terminals w is in L (G) if and only if S ⇒+ w. The string w is called a sentence of G.
▪ A language that can be, generated by a grammar is said to be a context-free language.
▪ If two grammars generate, the same language, the grammars are said to be equivalent.

The string – (id+id) is a sentence of the grammar, and then the derivation is
E ⇒ -E ⇒ - (E) ⇒ - (E+E) ⇒ - (id+E) ⇒ - (id+id)
➢ At each derivation step, we can choose any of the non-terminals in the sentential form of G for the
replacement.
➢ If we always choose the left-most non-terminal in each derivation step, this derivation is called as left-most
derivation.
➢ If we always choose the right-most non-terminal in each derivation step, this derivation is called as right-
most derivation (Canonical derivation).
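For example, the rightmost derivation of the sentence - (id+id) replaces the rightmost non-terminal at each step:
E ⇒ -E ⇒ - (E) ⇒ - (E+E) ⇒ - (E+id) ⇒ - (id+id)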

Parse Trees and Derivations

▪ A parse tree may be viewed as a graphical representation for a derivation that filters out the choice
regarding replacement order.
▪ Each interior node of a parse tree is labeled by some non-terminals A, and that the children of the node are
labeled, from left to right, by the symbols in the right side of the production by which this A was replaced in
the derivation.
▪ The leaves of the parse tree are labeled by non-terminals or terminals and read from left to right; they
constitute a sentential form, called the yield or frontier of the tree.

Ambiguity
▪ A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
▪ An ambiguous grammar is one that produces more than one leftmost or more than one rightmost
derivation for the same sentence.
▪ For most parsers, the grammar must be unambiguous.
▪ An unambiguous grammar allows a unique selection of the parse tree for a sentence.
The sentence id+id*id has the two distinct leftmost derivations:

E ⇒ E + E                    E ⇒ E * E
  ⇒ id + E                     ⇒ E + E * E
  ⇒ id + E * E                 ⇒ id + E * E
  ⇒ id + id * E                ⇒ id + id * E
  ⇒ id + id * id               ⇒ id + id * id

with the two corresponding parse trees are

13. Write the steps in writing a grammar for a programming language. (5 MARKS) (NOV 2013, 2017)
(Or) Draw an algorithm for left factoring grammar. (MAY 2014)

Grammars are capable of describing the syntax of the programming languages.

Regular Expressions vs. Context-Free Grammars


▪ Every constructs that can be described by a regular expression can also be described by a grammar.
▪ For example the regular expression (a | b)* abb, the grammar is:
A0 → aA0 | bA0 | aA1
A1 → bA2
A2 → bA3
A3 → ε
which describe the same language, the set of strings of a’s and b’s ending in abb.
▪ Mathematically, the NFA is converted into a grammar that generates the same language as recognized by the
NFA.

There are several reasons the regular expressions differ from CFG.

1. The lexical rules of a language are frequently quite simple. No need of any notation as powerful as grammars.
2. Regular expressions generally provide a more concise and easier to understand notation for tokens than
grammars.
3. More efficient lexical analyzers can be constructed automatically from regular expressions than from
arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and non-lexical parts provides a convenient way
of modularizing the front end of a compiler into two manageable-sized components.
➢ Regular expressions are most useful for describing the structure of lexical constructs such as identifiers,
constants, keywords etc…
➢ Grammars are most useful in describing nested structures such as balanced parenthesis, matching begin -
end’s, corresponding if-then-else’s and so on.

Eliminating Ambiguity
▪ An ambiguous grammar can be rewritten to eliminate the ambiguity.
▪ Example for eliminate the ambiguity from the following “dangling-else” grammar:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
Here “other” stands for any other statement. According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has the parse tree as

Grammar is ambiguous since the string


if E1 then if E2 then S1 else S2
has the two parse trees
▪ In all programming languages with conditional statements of this form, the first parse tree is preferred.
▪ The general rule is, “Match each else with the closest previous unmatched then” this disambiguating rule can
be incorporated directly into the grammar.

The unambiguous grammar will be:


stmt → matched_stmt
| unmatched_stmt
matched_stmt → if expr then matched_stmt else matchedstmt
| other
unmatched_stmt → if expr then stmt
| if expr then matched_stmt else unmatched_stmt

Elimination of Left Recursion

▪ A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
▪ Top-down parsing techniques cannot handle left-recursive grammars, so a transformation that eliminates left
recursion is needed.
▪ The left recursive pair of productions A → Aα | β could be replaced by the non-left-recursive productions
A → βA’
A’ → αA’ | ε

Algorithm to eliminating left recursion from a grammar

Input: Grammar G with no cycles or ε-productions.


Output: An equivalent grammar with no left recursion.
Method: Note that the resulting non-left-recursive grammar may have ε-productions.
1. Arrange the non-terminals in some order A1, A2, …. An
2. for i := 1 to n do begin
for j := 1 to i-1 do begin
replace each production of the form Ai → Ajγ
by the productions Ai → δ1γ | δ2γ | … | δkγ
where Aj → δ1 | δ2 | … | δk are all the current Aj-productions;
end
eliminate the immediate left recursion among the Ai-productions
end
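
For example, applying the transformation to E → E + T | T (here A = E, α = + T and β = T) gives the
non-left-recursive productions
E → TE’
E’ → +TE’ | ε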

Left Factoring
▪ Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive
parsing.
▪ The basic idea is that when it is not clear which of two alternative productions to use to expand a non-
terminal A, then rewrite the A-productions to defer the decision until the input to make the right choice.

In general, productions are of the form A → αβ1 | αβ2 then it is left factored as:
A → αA’
A’ → β1 | β2

Algorithm for Left factoring a grammar

Input: Grammar G.
Output: An equivalent left-factored grammar.
Method: For each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, i.e.,
there is a nontrivial common prefix, replace all the A productions A → αβ1 | αβ2 | … | αβn | γ where γ represents
all alternatives that do not begin with α by,
A → αA’| γ
A’ → β1 | β2 | … | βn
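
For example, for the grammar S → iEtS | iEtSeS | a (with i, t, e abbreviating if, then, else), the longest
common prefix of the first two alternatives is α = iEtS, and left factoring gives
S → iEtSS’ | a
S’ → eS | ε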
14. Briefly write on Parsing techniques. Explain with illustration the designing of a Predictive Parser.
(10 MARKS) (NOV 2013, 2016) (MAY 2016)
➢ The top-down parsing and show how to construct an efficient non-backtracking form of top-down parser
called predictive parser.
➢ Define the class LL (1) grammars from which predictive parsers can be constructed automatically.

Recursive-Descent Parsing
▪ Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string.
▪ It is to construct a parse tree for the input starting from the root and creating the nodes of the parse tree in
preorder.
▪ The special case of recursive-descent parsing called predictive parsing, where no backtracking is required.
▪ The general form of top-down parsing, called recursive-descent, that may involve backtracking, ie, making
repeated scans of input.
▪ However, backtracking parsers are not seen frequently.
▪ One reason is that backtracking is rarely needed to parse programming language constructs.
▪ In natural language parsing, backtracking is not very efficient, and tabular methods such as the dynamic
programming algorithm are preferred.

Consider the grammar


S → cAd
A →ab | a
An input string w=cad, steps in top-down parse are as:

A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite
loop.
Predictive Parsers
▪ By carefully writing a grammar, eliminating left recursion from it, and left factoring the resulting grammar, we
can obtain a grammar that can be parsed by a recursive-descent parser that needs no backtracking, i.e., a
predictive parser.
▪ Flow-of-control constructs in most programming languages, with their distinguishing keywords, are usually
detectable in this way.
▪ For example, if we have the productions
stmt → if expr then stmt else stmt
| while expr do stmt
| begin stmt_list end
then the keywords if, while, and begin tell us which alternative could possibly succeed if we are to find a statement.

Transition Diagrams for Predictive Parsers

Several differences between the transition diagrams for a lexical analyzer and a predictive parser.
➢ In case of parser, there is one diagram for each non-terminal.
➢ The labels of edges are tokens and non-terminals.
➢ A transition on a token means that transition if that token is the next input symbol.
➢ A transition on a non-terminal A is a call of the procedure for A.

To construct the transition diagram of a predictive parser from a grammar, first eliminate left recursion from the
grammar, and then left factor the grammar.
Then for each non-terminal A do the following:
1. Create an initial and final (return) state.
2. For each production A → X1 X2 ... Xn, create a path from the initial to the final state, with edges
labeled X1, X2,….,Xn.

Predictive Parser working


▪ It begins in the start state for the start symbol.
▪ If after some actions it is in state s with an edge labeled by terminal a to state t, and if the next input symbol
is a, then the parser moves the input cursor one position right and goes to state t.
▪ If, on the other hand, the edge is labeled by a non-terminal A, the parser instead goes to the start state for A,
without moving the input cursor.
▪ If it ever reaches the final state for A, it immediately goes to state t, in effect having read A from the input
during the time it moved from state s to t.
▪ Finally, if there is an edge from s to t labeled ε, then from state s the parser immediately goes to state t,
without advancing the input.

▪ A predictive parsing program based on a transition diagrams attempts to match terminal symbols against the
input and makes a potentially recursive procedure call whenever it has to follow an edge labeled by a non-
terminal.
▪ A non-recursive implementation can be obtained by stacking the states s when there is a transition on non-
terminal out of s and popping the stack when the final state for a non-terminal is reached.
▪ Transition diagrams can be simplified by substituting diagrams in one another; these substitutions are
similar to the transformations on grammars.
A→βA’
A’→αA’ | ε

Consider the following grammar for arithmetic expressions,


E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

Transition diagrams for grammar

Simplified transition diagram


Simplified transition diagrams for arithmetic expressions

15. Explain the Non-recursive predictive parsing? (10 MARKS)


A non-recursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly through
recursive calls. The key problem during predictive parsing is that of determining the production to be applied
for a non-terminal.

Model of a non-recursive predictive parser

A table-driven predictive parser has


▪ an input buffer,
▪ a stack,
▪ a parsing table; and
▪ an output stream.
▪ The input buffer contains the string to be parsed, followed by $, a symbol used as a right end marker to
indicate the end of the input string.
▪ The stack contains a sequence of grammar symbols with $ on the bottom, indicating the bottom of the
stack. Initially, the stack contains the start symbol of the grammar on top of $.
▪ The parsing table is a two dimensional array M [A, a], where A is a non-terminal, and a is a terminal or the
symbol $.

The program considers X, the symbol on top of the stack, and a, the current input symbol. These two symbols
determine the action of the parser. There are four possibilities.
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next input symbol.
3. If X is a non-terminal, the program consults entry M [X, a] of the parsing table M. This entry will be either an
X-production of the grammar or an error entry. For example M [X, a] = {X → UVW}, the parser replaces X on
top of the stack by WVU (U on top).
4. If M[X, a] = error, the parser calls an error recovery routine.

Algorithm Non-Recursive Predictive Parsing

Input: A string w and a parsing table M for grammar G.


Output: If w is in L (G), a leftmost derivation of w; otherwise an error indication.
Method: Initially, the parser is in a configuration in which it has $S on the stack with S, the start
symbol of G on top; and w$ in the input buffer. The program that utilizes the predictive parsing table M to
produce a parse for the input.

set ip to point to the first symbol of w$;


repeat
let X be the top of the stack and a the symbol pointed to by ip
if X is a terminal or $ then
if X=a then
pop X from the stack and advance ip
else error( )
else /* X is a non-terminal */
if M [X ,a] = X→Y1 Y2 …YK then begin
pop X from the stack;
push YK … …Y2 Y1 on to the stack ,with Y1 on top;
output the production X →Y1 Y2 …YK
end
else error( )
until X= $ /* stack is empty */
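
The C fragment below is a minimal sketch of this driver loop for the left-factored expression grammar of the next question. The encoding is an assumption made for the sketch: E’ and T’ are written e and t, id is written i, and M(X, a) returns the right side to be pushed, with "" standing for ε and NULL for an error entry.

#include <stdio.h>
#include <string.h>

static const char *M(char X, char a)           /* the parsing table M[X, a] */
{
    switch (X) {
    case 'E': return (a=='i' || a=='(') ? "Te" : NULL;
    case 'e': return a=='+' ? "+Te" : (a==')' || a=='$') ? "" : NULL;
    case 'T': return (a=='i' || a=='(') ? "Ft" : NULL;
    case 't': return a=='*' ? "*Ft" : (a=='+' || a==')' || a=='$') ? "" : NULL;
    case 'F': return a=='i' ? "i" : (a=='(') ? "(E)" : NULL;
    }
    return NULL;
}

int parse(const char *w)                        /* w must end in '$' */
{
    char stack[100] = "$E";                     /* $ on the bottom, start symbol on top */
    int top = 1;
    while (1) {
        char X = stack[top], a = *w;
        if (X == '$' && a == '$') return 1;     /* stack empty: accept */
        if (X == a) { top--; w++; continue; }   /* pop X and advance ip */
        const char *rhs = M(X, a);
        if (rhs == NULL) return 0;              /* error entry */
        top--;                                  /* pop X ...                          */
        for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
            stack[++top] = rhs[i];              /* ... push RHS reversed, Y1 on top   */
    }
}

int main(void)
{
    printf("%d\n", parse("i+i*i$"));            /* 1: id+id*id is accepted */
    printf("%d\n", parse("i+*i$"));             /* 0: syntax error */
    return 0;
}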

Algorithm for FIRST


1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if for some i, a is in
FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1);

Algorithm for FOLLOW


1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
2. If there is a production A → αBβ, then everything in FIRST (β) except for ε is placed in FOLLOW (B).
3. If there is a production A→αB, or a production A→αBβ where FIRST (β) contains ε, then everything in
FOLLOW (A) is in FOLLOW (B).

16. Construct the predictive parser for the following grammar


E→E+T|T
T→T*F|F
F → (E) | id
Compute FIRST and FOLLOW and also find the parsing table. The input string is id+id * id (or) (a+b)*c
(NOV 2014).

Solution:

The given grammar is


E→ E+T|T
T→T*F|F
F → (E) | id

1. Eliminating Left Recursion from the Grammar


E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
2. Computation of FIRST
FIRST (E) = FIRST (T) = FIRST (F) = { (, id }
FIRST (E’) = { +, ε }
FIRST (T’) = { *, ε }

3. Computation of FOLLOW
FOLLOW (E) = {$} U { ) } = { ), $ }
FOLLOW (E’) = FOLLOW (E) = { ), $ }
FOLLOW (T) = (FIRST (E’) – {ε}) U FOLLOW (E’) = {+} U {), $} = { +, ), $ }
FOLLOW (T’) = FOLLOW (T) = { +, ), $ }
FOLLOW (F) = (FIRST (T’) – {ε}) U FOLLOW (T’) = {*} U {+, ), $} = { +, *, ), $ }

4. Construction of Parsing Table

5. Moves of the predictive parser on the input string id + id * id


17. Consider the following LL (1) grammar
S→iEtS|iEtSeS|a
E→b
Find the parsing table for the above grammar.

Solution:

The given LL (1) grammar is


S→iEtS|iEtSeS|a
E→b

1. Left Factoring the Grammar


S → i E t S S’ | a
S’ → e S | ε
E→b

2. Computation of FIRST
FIRST(S) = { i , a }
FIRST(S’) = { e , ε }
FIRST (E) = { b }
3. Computation of FOLLOW
FOLLOW(S) = {$} U (FIRST(S’) – {ε}) = {$} U {e} = { $, e }
FOLLOW (S’) = FOLLOW(S) = { $, e }
FOLLOW (E) = { t }

4. Construction of Parsing Table

18. Explain the LR parsing algorithm in detail. (10 MARKS) (NOV 2011, 2012, 2014, 2015)
(MAY 2012, 2016, 2017, 2018)

Bottom-up syntax analysis technique can be used to parse a large class of context-free grammars. The technique
is called LR (k) parsing.
➢ the "L" is for left-to-right scanning of the input,
➢ the "R" for constructing a rightmost derivation in reverse, and
➢ the k for the number of input symbols of look ahead that are used in making parsing decisions.
➢ When (k) is omitted, k is assumed to be 1.

LR parsing is attractive for a variety of reasons.

1. LR parsers can be constructed to recognize virtually all programming language constructs for which
context-free grammars can be written.
2. The LR parsing method is the most general non backtracking shift-reduce parsing method known, yet it
can be implemented as efficiently as other shift-reduce methods.
3. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars
that can be parsed with predictive parsers.
4. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
The principal drawback of the method is that it is too much work to construct an LR parser by hand for a typical
programming-language grammar. A special tool – an LR parser generator.

Three techniques are used for constructing an LR parsing table for a grammar.

1. The first method, called simple LR (SLR), is the easiest to implement, but the least powerful of the three.
It may fail to produce a parsing table for certain grammars on which the other methods succeed.
2. The second method, called canonical LR (CLR), is the most powerful, and the most expensive.
3. The third method, called look ahead LR (LALR), is intermediate in power and cost between the other
two. The LALR method will work on most programming language grammars and, with some effort, can be
implemented efficiently.

The LR Parsing Algorithm


LR parsing consists of
▪ an input,
▪ an output,
▪ a stack,
▪ a driver program, and
▪ a parsing table that has two parts (action and goto).

Model of an LR Parser

▪ The driver program is the same for all LR parsers; only the parsing table changes from one parser to another.
▪ The parsing program reads characters from an input buffer one at a time.
▪ The program uses a stack to store a string of the form s0X1s1X2s2 ... Xmsm, where sm is on top.
▪ Each Xi is a grammar symbol and each si is a symbol called a state.
▪ Each state symbol summarizes the information contained in the stack below it, and the combination of the
state symbol on top of the stack and the current input symbol is used to index the parsing table and
determine the shift-reduce parsing decision.

The parsing table consists of two parts,


1. a parsing action function action and
2. a goto function goto.
▪ The program driving the LR parser behaves as follows.
▪ It determines sm, the state currently on top of the stack, and ai, the current input symbol.
▪ It then consults action[sm, ai], the parsing action table entry for state sm and input ai, which can have one of
four values:
1. shift s, where s is a state,
2. reduce by a grammar production A →β
3. accept, and
4. error

LR Parsing Algorithm

Input: An input string w and an LR parsing table with functions action and goto for a grammar G.
Output: If w is in L (G), a bottom-up parse for w; otherwise, an error indication.
Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer.
The parser then executes the program until an accept or error action is encountered.

set ip to point to the first symbol of w$;


repeat forever begin
let s be the state on top of the stack and
a the symbol pointed to by ip;
if action[s,a]=shift s’ then begin
push a then s’ on top of the stack;
advance ip to the next input symbol
end
else if action[s,a]=reduce A →β then begin
pop 2*| β| symbols off the stack;
let s’ be the state now on top of the stack;
push A the goto[s’,A] on top of the stack;
output the production A →β
end
else if action[s,a]=accept then
return
else error()
end
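
As a minimal sketch of this driver loop, the C fragment below parses the toy grammar (1) S -> (S), (2) S -> a. The table was built by hand for the sketch, and the encoding is an assumption: a positive entry means shift to that state, a negative entry means reduce by that production, ACC means accept and ERR means error. For brevity only the state stack is kept; the grammar symbols are implied by the states.

#include <stdio.h>

enum { ERR = 0, ACC = 100 };        /* 0 is safe as ERR: nothing shifts to state 0 */

static int sym(char c)              /* terminal index: ( ) a $ */
{ return c=='(' ? 0 : c==')' ? 1 : c=='a' ? 2 : 3; }

static const int action[6][4] = {
/*            (     )     a     $   */
/* s0 */ {    2,  ERR,    3,  ERR },
/* s1 */ {  ERR,  ERR,  ERR,  ACC },
/* s2 */ {    2,  ERR,    3,  ERR },
/* s3 */ {  ERR,   -2,  ERR,   -2 },     /* reduce by (2) S -> a   */
/* s4 */ {  ERR,    5,  ERR,  ERR },
/* s5 */ {  ERR,   -1,  ERR,   -1 },     /* reduce by (1) S -> (S) */
};
static const int goto_S[6]  = { 1, ERR, 4, ERR, ERR, ERR };  /* goto[s, S] */
static const int rhs_len[3] = { 0, 3, 1 };                   /* |(S)| = 3, |a| = 1 */

int parse(const char *w)            /* w must end in '$' */
{
    int stack[100] = {0}, top = 0;  /* state stack, s0 on the bottom */
    while (1) {
        int act = action[stack[top]][sym(*w)];
        if (act == ACC) return 1;                     /* accept */
        if (act == ERR) return 0;                     /* error  */
        if (act > 0) { stack[++top] = act; w++; }     /* shift s', advance ip */
        else {                                        /* reduce by A -> beta  */
            top -= rhs_len[-act];                     /* pop |beta| states    */
            stack[top + 1] = goto_S[stack[top]];      /* push goto[s', A]     */
            top++;
        }
    }
}

int main(void)
{
    printf("%d\n", parse("((a))$"));   /* 1 */
    printf("%d\n", parse("(a$"));      /* 0 */
    return 0;
}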
19. Consider the following grammar to construct the SLR parsing table? (10 MARKS) (NOV 2012) (MAY
2022)

E→E+T | T

T→T*F | F

F →(E) | id
Construct an LR parsing table for the above grammar. Give the moves of the LR parser on id * id + id.
Solution:

The given SLR grammar is

E -> E + T / T

T -> T * F / F

F -> (E) / id

Let the grammar G be


E -> E + T
E -> T
T -> T * F
T -> F
F -> (E)
F -> id

1. The augmented Grammar G’

E’ -> E
E -> E + T
E -> T
T -> T * F
T -> F
F -> (E)
F -> id

2. Computation of Closure Function

I0 :

E’ -> .E
E -> .E + T
E -> .T
T -> . T * F
T -> .F
F -> .(E)
F -> . id
3. Computation of Goto Function
goto (I, X) { where I -> set of states and X -> E, T, F, +, *, (, ), id }

goto(I0, E)

I1: E’ -> E.

E -> E. + T

goto(I0, T)

I2: E -> T.

T -> T. * F

goto(I0, F)

I3: T -> F.

goto(I0, +) -> NULL

goto(I0, *) -> NULL

goto(I0, ( )

I4 : F -> (.E)
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id

goto(I0, )) -> NULL

goto(I0, id)
I5: F -> id.

Repeat the goto computation on each new set of items:

goto(I1, +)
I6: E -> E + .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id
goto(I2, *)
I7: T -> T * .F
F -> .(E)
F -> .id

goto(I3, X) -> NULL

goto(I4, E)

I8: F -> (E . )

E -> E . + T

goto(I4, T)

I2: E -> T.

T -> T. * F

goto(I4, F)

I3: T -> F.

goto(I4, ( )

I4: F -> (.E)


E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id

goto(I4, id)

I5: F -> id.

goto(I6, T)

I9: E -> E + T.

T -> T. * F

goto(I6, F)

I3: T -> F.

goto(I6, ( )

I4: F -> (.E)


E -> .E + T
E -> .T
T -> . T * F
T -> .F
F -> .(E)
F -> . id

goto(I6, id)

I5: F -> id.

goto(I7, F)

I10: T -> T * F.

goto(I7, ( )
I4: F -> (.E)
E -> .E + T
E -> .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id

goto(I7, id)

I5: F -> id.

goto(I8, ) )

I11: F -> (E).

goto(I8, +)
I6: E -> E + .T
T -> .T * F
T -> .F
F -> .(E)
F -> .id

goto(I9, *)
I7: T -> T *.F
F -> .(E)
F -> .id

4. Construction of Parsing Table

1. Shifting Process

I0 : goto(I0, E) = I1

goto(I0, T) = I2

goto(I0, F) = I3

goto(I0, ( ) = I4
goto(I0, id) = I5

I1: goto(I1, +) = I6

I2: goto(I2, * ) = I7

I4: goto(I4, E) = I8

goto(I4, T) = I2

goto(I4, F) = I3

goto(I4, ( ) = I4

goto(I4, id) = I5

I6: goto(I6, T) = I9

goto(I6, F) = I3

goto(I6, ( ) = I4

goto(I6, id) = I5

I7: goto(I7, F) = I10

goto(I7, ( ) = I4

goto(I7, id) = I5

I8: goto(I8, )) = I11

goto(I8, +) = I6

I9: goto(I9, * ) = I7

2. Computation of FOLLOW

For the SLR reduce entries we need the FOLLOW sets of the original grammar G (an LR parser works on the
left-recursive grammar directly; eliminating left recursion is not required):

FOLLOW (E) = { + , ) , $ }      ($ since E is the start symbol; + from E -> E + T; ) from F -> (E))
FOLLOW (T) = { + , * , ) , $ }  (everything in FOLLOW (E), plus * from T -> T * F)
FOLLOW (F) = { + , * , ) , $ }  (everything in FOLLOW (T))

3. Numbering the Grammar


(1) E -> E + T
(2) E -> T
(3) T -> T * F
(4) T -> F
(5) F -> (E)
(6) F -> id

4. Reducing Process

I2: E -> T. reduce on FOLLOW (E) = { + , ) , $ }

I3: T -> F. reduce on FOLLOW (T) = { + , * , ) , $ }

I5: F -> id. reduce on FOLLOW (F) = { + , * , ) , $ }

I9: E -> E + T. reduce on FOLLOW (E) = { + , ) , $ }

I10: T -> T * F. reduce on FOLLOW (T) = { + , * , ) , $ }

I11: F -> (E). reduce on FOLLOW (F) = { + , * , ) , $ }

(A completed item A -> β. in state Ii yields the action "reduce by A -> β" on every terminal in FOLLOW (A),
the FOLLOW set of the left-hand side.)

5. SLR Parsing Table

6. Moves of the SLR parser on the input string id + id (the same table gives the moves for id * id + id)

S.No STACK I/P STRING PARSING ROUTINE

(1) 0 id + id $ action[0, id] = s5 then shift ‘s5’

push id and 5 into the stack

(2) 0 id 5 + id $ action [5, +] = r6 then reduce r6:

F -> id
1) POP 2 symbols from the stack
2) goto [0, F] = 3
3) Push ‘F3’ into the stack

(3) 0F3 +id$ action[3,+] = r4 then reduce r4:

T->F
1) POP 2 symbols from the stack
2) goto [0, T] = 2
3) Push ‘T2’ into the stack

(4) 0T2 +id$ action[2,+] = r2 then reduce r2:

E->T
1) POP 2 symbols from the stack
2) goto [0, E] = 1
3) Push ‘E1’ into the stack

(5) 0E1 +id$ action[1, +] = s6 then shift ‘s6’

push + and 6 into the stack

(6) 0E1+6 id$ action[6, id] = s5 then shift ‘s5’

push id and 5 into the stack

(7) 0E1+6id5 $ action[5,$] = r6 then reduce r6:

F->id
1) POP 2 symbols from the stack
2) goto [6, F] = 3
3) Push ‘F3’ into the stack

(8) 0E1+6F3 $ action[3,$] = r4 then reduce r4:

T->F
1) POP 2 symbols from the stack
2) goto [6, T] = 4
3) Push ‘T4’ into the stack

(9) 0E1+6T4 $ action[4,$] = r1 then reduce r1:

E->E+T
1) POP 6 symbols from the stack
2) goto [0, E] = 1
3) Push ‘E1’ into the stack

(10) 0E1 $ action[1,$] = acc

The given input string is accepted.

20. Consider the following context free grammar (5 MARKS) (MAY 2018)

S→ SS+ | SS* | a

(i) Show how the string aa + a * can be generated by using the above grammar.
(ii) Construct a Parse tree for the generated string.

Solution

The given context free grammar

S → SS+ | SS* | a

(i) Show how the string aa + a * can be generated by using the above grammar.

Derivation:
S → SS*
→ SS+S * [ S → SS+ ]
→ aS+S * [ S → a]
→ aa+S* [S→a]
→ aa+a* [S → a]

(ii) Construct a Parse tree for the generated string.


Parse Tree:
Parse tree for the given input string aa+a*
UNIVERSITY QUESTIONS

2 MARKS

1. What is hierarchical analysis? (NOV 2011) (Ref.Qn.No.9, Pg.no.3)


2. What is the major advantage of a lexical analyzer generator? (NOV 2011) (Ref.Qn.No.31, Pg.no.8)
3. List out the parts on Lex specifications. (MAY 2012) (Ref.Qn.No.45, Pg.no.10)
4. What is Compiler? (MAY 2012, 2016) (NOV 2012) (Ref.Qn.No.1, Pg.no.2)
5. What is meant by Static Checking? (MAY 2014) (Ref.Qn.No.5, Pg.no.2)
6. Define Sentinel. (MAY 2014) (Ref.Qn.No.37, Pg.no.9)
7. What is transition diagram? (NOV 2012, 2016) (Ref.Qn.No.42, Pg.no.10)
8. Why separate lexical analysis phase is required? (MAY 2013) (Ref.Qn.No.30, Pg.no.7)
9. State the function of front end and back end of a compiler phase. (MAY 2013) (Ref.Qn.No.23,24 Pg.no.6)
10. List the role of lexical analyzer? (NOV 2013, 2015) (Ref.Qn.No.28, Pg.no.7)
11. Compare and contrast Compilers and Interpreters. (MAY 2015, 2017) (Ref.Qn.No.46, Pg.no.11)
12. Give the structure of a Lex program with example. (MAY 2015) (Ref.Qn.No.45, Pg.no.10)
13. What is lexical analysis? (MAY 2016) (Ref.Qn.No.8, Pg.no.3)
14. What is meant by tokens? (NOV 2016, 2017) (Ref.Qn.No.47, Pg.no.11)
15. Define Input Buffering. (MAY 2017) (Ref.Qn.No.36, Pg.no.9)
16. Write regular expression for the following language over the alphabet ∑ = {a, b}.

All strings that contain an even number of b’s. (NOV 2017) (Ref.Qn.No.48, Pg.no.11)
17. What will be the output for the lexical analyzer? (MAY 2018) (Ref.Qn.No.28, Pg.no.7)
18. List the types of parser for grammars? (MAY 2014) (Ref.Qn.No.3, Pg.no.2)
19. What is Parsing Tree?(MAY 2012, 2016) (Ref.Qn.No.10, Pg.no.3)
20. Differentiate phase and pass.(NOV 2012) (Ref.Qn.No.32, Pg.no.7)
21. Derive the FIRST and FOLLOW for the following grammar.

S→0|1|AS0|BS0 A→ɛ B→ɛ (MAY 2013) (Ref.Qn.No.58, Pg.no.12)


22. State the function of an intermediate code generator.(MAY 2013) (Ref.Qn.No.33, Pg.no.7)
23. Briefly describe the LL (k) items.(NOV 2013) (Ref.Qn.No.19, Pg.no.5)
24. Eliminate left recursion from the following grammar: (MAY 2015) (Ref.Qn.No.59, Pg.no.13)

bexpr → bexpr or bterm | bterm

bterm → bterm and bfast | bfact

bfact → (bexpr) | true | false


25. Define Parsing. (NOV 2015) (Ref.Qn.No.1,2, Pg.no.2)
26. What is a Predictive Parser? (MAY 2016) (Ref.Qn.No.18, Pg.no.4)
27. What is meant by ambiguous grammar? (NOV 2016) (Ref.Qn.No.13, Pg.no.4)
28. Why is bottom-up parsing method called as shift reduce parsing? (NOV 2017) (Ref.Qn.No.22, Pg.no.5)
29. What is a parser? (MAY 2018) (Ref.Qn.No.1, Pg.no.2)

10 MARKS

NOV 2011 (REGULAR)


1. Draw the different phases of a compiler and explain. (Ref.Qn.No.3, Pg.no.17)

(OR)
2. How to recognize the tokens? (Ref.Qn.No.10, Pg.no.35)

MAY 2012 (ARREAR)


1. Discuss the Phases of a compiler. (Ref.Qn.No.3, Pg.no.17)

NOV 2012 (REGULAR)


1. Explain the phases of a compiler. (Ref.Qn.No.3, Pg.no.17)

(OR)
2. Discuss the role of the lexical analyzer. (Ref.Qn.No.7, Pg.no.28)

MAY 2013 (ARREAR)


1. With a neat sketch discuss the functionalities of various phases of a compiler. (Ref.Qn.No.3, Pg.no.17)

NOV 2013 (REGULAR)


1. Describe the different stage of a compiler with an example. Consider an example for a simple arithmetic
expression statement. (Ref.Qn.No.3, Pg.no.17)

(OR)
2. Explain the buffered I/O with sentinels. Elaborate on the language for specifying lexical analyzer.

(Ref.Qn.No.8, Pg.no.30) (Ref.Qn.No.11, Pg.no.40)

MAY 2014 (ARREAR)


1. Discuss about Phases of compiler in compiler structure? (Ref.Qn.No.3, Pg.no.17)

(OR)
2. Explain in detail how to handle lexical errors due to transposition of the letters. (Ref.Qn.No.7, Pg.no.28)

NOV 2014 (REGULAR)


1. Explain Phases of a compiler with neat diagram. (Ref.Qn.No.3, Pg.no.17)

(OR)
2. What is the role of the lexical analyzer? Explain (Ref.Qn.No.7, Pg.no.28)

MAY 2015 (ARREAR)


1. Discuss in detail various phases of compilation with a neat diagram. (Ref.Qn.No.3, Pg.no.17)
NOV 2015 (REGULAR)
1. Describe the Architecture of Transition Diagram based lexical Analyzer. (Ref.Qn.No.10, Pg.no.35)

(OR)
2. Write short notes on

(a). Semantic Analysis. (5)

MAY 2016 (ARREAR)


1. Write the steps required to translate the source program to object program. (Ref.Qn.No.3, Pg.no.17)

(OR)
2. Give a detailed account on lexical analysis. (Ref.Qn.No.7, Pg.no.28)

NOV 2016 (REGULAR)


1. Describe in detail about the phases of compiler. (Ref.Qn.No.3, Pg.no.17)

MAY 2017 (ARREAR)


1. Elaborate in detail about Design of Lexical Analyser Generator. (Ref.Qn.No.7, Pg.no.28)

NOV 2017 (REGULAR)


1. (a). Write a LEX program to count the vowels and consonants in a given string. (5) (Ref.Qn.No.13, Pg.no.46)

(OR)
2. (a). Discuss the different phases of a compiler. (5) (Ref.Qn.No.3, Pg.no.17)

MAY 2018 (ARREAR)


1. Write a neat diagram, explain the phases of a compiler. (Ref.Qn.No.3, Pg.no.17)

(OR)
2. List out the role of lexical analyzer. How tokens are recognized in the source program? Explain it with an
example. (Ref.Qn.No.7,10, Pg.no.28,35)
NOV 2011 (REGULAR)
1. Explain the LR parsing algorithm in detail. (Ref.Qn.No.8, Pg.no.32)
MAY 2012 (ARREAR)
1. a) Write an algorithm for constructing LR parser table. (Ref.Qn.No.8, Pg.no.32)
2. Discuss the Role of the parser. (Ref.Qn.No.1, Pg.no.14)

NOV 2012 (REGULAR)


1. a) Write an algorithm for constructing LR parser table. (Ref.Qn.No.8, Pg.no.32)
b) Consider the following grammar to construct the LR parsing table (Ref.Qn.No.9, Pg.no.35)

E→E+T | T

T→T*F | F

F →(E) | id
MAY 2013 (ARREAR)
1. Give the following CFG grammar G=({S,A,B},S,(a,b,x),P) with P:

S→A

S→xb

A→aAb

A→B

B→x

For this grammar answer the following questions:

Compute the set of LR (1) items for this grammar. Augment the grammar with the default initial
production S’→S$ as the production (0) and construct the corresponding LR parsing table.

NOV 2013 (REGULAR)


1. Briefly write on Parsing techniques. Explain with illustration the designing of a Predictive Parser. (5)
(Ref.Qn.No.4, Pg.no.23)

MAY 2014 (ARREAR)


1. Draw an algorithm for left factoring grammar? (Ref.Qn.No.3, Pg.no.19)

NOV 2014 (REGULAR)


1. Describe LR parsing algorithm. (Ref.Qn.No.8, Pg.no.32)

(OR)
2. Find the predictive parser for the given grammar and parse the sentence (a+b)*c (Ref.Qn.No.6, Pg.no.29)
E→E+T|T
T→T*F|F
F → (E) | id

MAY 2015 (ARREAR)


1. Consider the grammar: (Ref.Qn.No.21, Pg.no.74)
S → L=R | R
L → * R | id
R→L

Show that the grammar is not a LR (0) grammar.


NOV 2015 (REGULAR)
1. Draw and explain the model of LR Parser. (Ref.Qn.No.8, Pg.no.32)

MAY 2016 (ARREAR)


1. Illustrate with suitable example the working function of predictive parser and LR parser?
(Ref.Qn.No.4,8, Pg.no.23,32)

NOV 2016 (REGULAR)


1. Write about predictive parser. (Ref.Qn.No.4, Pg.no.23)

(OR)
2. Describe about LR parser. (Ref.Qn.No.8, Pg.no.32)

MAY 2017 (ARREAR)


1. Explain in detail about LR Parser and its applications with an example? (Ref.Qn.No.8, Pg.no.32)

MAY 2018 (ARREAR)


1. (a). Explain LR parser with an example? (5) (Ref.Qn.No.8, Pg.no.32)
(b). Parse the string id + id * id according to the following grammar. (5) (Ref.Qn.No.25, Pg.no.80)
E=T+E |T
T=V*T|V
V = id

(OR)

2. (a). Consider the following context free grammar (5) (Ref.Qn.No.26, Pg.no.81)

S→ SS+ | SS* | a

(i) Show how the string aa + a * can be generated by using the above grammar.

(ii) Construct a Parse tree for the generated string.
