Compiler Design
Module- I:
Compiler Structure: Model of compilation:
Programming languages are notations for describing computations to people and to machines.
The world as we know it depends on programming languages, because all the software running
on all the computers was written in some programming language. But, before a program can be
run, it first must be translated into a form in which it can be executed by a computer.
The software systems that do this translation are called compilers.
Language Processors
A compiler is a program that can read a program in one language — the source language —
and translate it into an equivalent program in another language — the target language; see Fig.
1.1. An important role of the compiler is to report any errors in the source program that it detects
during the translation process.
Fig. 1.1: A compiler
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs.
The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.
A source program may first be collected and expanded by a preprocessor; the modified source
program is then fed to a compiler. The compiler may produce an assembly-language program as its
output, because assembly language is easier to produce as output and is easier to debug. The
assembly language is then processed by a program called an assembler that produces relocatable
machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine. The linker resolves external memory addresses, where the code in one file may
refer to a location in another file. The loader then puts together all of the executable object files
into memory for execution.
Some compilers have a machine-independent optimization phase between the front end and the
back end. The purpose of this optimization phase is to perform transformations on the
intermediate representation, so that the back end can produce a better target program than it
would have otherwise produced from an unoptimized intermediate representation.
Lexical analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form <token-name, attribute-value> that it passes on to the subsequent phase, syntax analysis. In
the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for this
token. Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract
symbol standing for identifier and 1 points to the symbol table entry for position . The
symbol-table entry for an identifier holds information about the identifier, such as its
name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token
needs no attribute-value, we have omitted the second component. We could have used any
abstract symbol such as assign for the token-name, but for notational convenience we have
chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial .
4. + is a lexeme that is mapped into the token (+) .
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry
for rate .
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer.
After lexical analysis, the assignment statement (1.1) is represented as the sequence of tokens
<id,1> <=> <id,2> <+> <id,3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for
the assignment, addition, and multiplication operators, respectively.
Interface with input parser and symbol table:
Symbol Table is an important data structure created and maintained by the compiler in order
to keep track of semantics of variables i.e. it stores information about the scope and binding
information about names, information about instances of various entities such as variable and
function names, classes, objects, etc.
It is built during the lexical and syntax analysis phases.
The information is collected by the analysis phases of the compiler and is used by the
synthesis phases of the compiler to generate code.
It is used by the compiler to achieve compile-time efficiency.
It is used by various phases of the compiler as follows:
Lexical Analysis: Creates new table entries in the table, for example like entries about
tokens.
Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of
reference, use, etc in the table.
Semantic Analysis: Uses available information in the table to check for semantics i.e.
to verify that expressions and assignments are semantically correct(type checking) and
update it accordingly.
Intermediate Code generation: Refers to the symbol table to know how much and what
type of run-time storage is allocated; the table also helps in adding temporary variable information.
Code Optimization: Uses information present in the symbol table for machine-
dependent optimization.
Target Code generation: Generates code by using address information of identifier
present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that
support the compiler in different phases.
Items stored in Symbol table:
Variable names and constants
Procedure and function names
Literal constants and strings
Compiler generated temporaries
Labels in source languages
Information used by the compiler from Symbol table:
Data type and name
Declaring procedures
Offset in storage
If a structure or record, then a pointer to the structure table.
For parameters, whether parameter passing by value or by reference
Number and type of arguments passed to function
Base Address
Operations of Symbol table – The basic operations defined on a symbol table include insert() (add a new name together with its attributes), lookup() (search for a name and return its attributes), and, for block-structured languages, entering and exiting a scope.
Token:
It is basically a sequence of characters that are treated as a unit as it cannot be further broken
down. In programming languages like C language- keywords (int, char, float, const, goto,
continue, etc.) identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuators
like comma (,), semicolon(;), braces ({ }), etc. , strings can be considered as tokens. This phase
recognizes three types of tokens: Terminal Symbols (TRM)- Keywords and Operators,
Literals (LIT), and Identifiers (IDN).
Lexeme:
It is a sequence of characters in the source code that is matched against the predefined language
rules to form a valid token.
Pattern:
A pattern is the rule describing the set of lexemes that can represent a particular token.
For a keyword to be identified as a valid token, the pattern is the sequence of characters that
make up the keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must
start with an alphabet, followed by alphabets or digits.
Criteria | Token | Lexeme | Pattern
Identifier | interpretation of type identifier: the name of a variable, function, etc. | main, a | must start with an alphabet, followed by alphabets or digits
Operator | interpretation of type operator: all the operators are considered tokens | +, = | +, =
Literal | interpretation of type literal: a grammar rule or boolean literal | "Welcome to GeeksforGeeks!" | any string of characters (except ' ') between " and "
Example:
z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>; //<id>- identifier (token)
The Lexical Analyzer not only provides a series of tokens but also creates a Symbol Table that
consists of all the tokens present in the source code except Whitespaces and comments.
Lexical analysis is the process of producing tokens from the source program. It has the
following issues:
• Lookahead
• Ambiguities
1. Lookahead:
Lookahead is required to decide where one token ends and the next token begins. Simple
examples with lookahead issues are i vs. if, and = vs. ==. Therefore a way to describe the
lexemes of each token is required.
2. Ambiguities:
The lexical analysis programs written with lex accept ambiguous specifications and choose the
longest match possible at each input point. Lex can handle ambiguous specifications. When
more than one expression can match the current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is preferred.
Error recovery strategies for a lexical analyzer
A character sequence that cannot be scanned into any valid token is a lexical error. Misspellings
of identifiers, keywords, or operators are considered lexical errors. Usually, a lexical error is
caused by the appearance of some illegal character, mostly at the beginning of a token.
The following are the error-recovery strategies in lexical analysis:
1) Panic Mode Recovery: Once an error is found, the successive characters are ignored
until we reach a well-formed token like end or semicolon.
2) Deleting an extraneous character.
3) Inserting a missing character.
4) Replacing an incorrect character by a correct character.
5) Transposing two adjacent characters.
Lexical Error:
When the token pattern does not match the prefix of the remaining input, the lexical analyzer
gets stuck and has to recover from this state to analyze the remaining input. In simple words,
a lexical error occurs when a sequence of characters does not match the pattern of any token.
It is detected during the lexical analysis phase of compilation.
Types of Lexical Error:
Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding the length of identifiers or numeric constants.
2. Appearance of illegal characters.
Example:
#include <iostream>
using namespace std;
int main() {
cout<<"Geeksforgeeks";$
return 0;
}
This is a lexical error since an illegal character $ appears at the end of the statement.
3. Unmatched string
Example:
#include <iostream>
using namespace std;
int main() {
/* comment
cout<<"GFG!";
return 0;
}
This is a lexical error since the ending of comment “*/” is not present but the beginning is
present.
4. Spelling Error.
#include <iostream>
using namespace std;
int main() {
int 3num= 1234; /* spelling error as identifier
cannot start with a number*/
return 0;
}
5. Transposition of two characters.
#include <iostream>
using namespace std;
int mian()
{
/* the spelling of main here would be treated as a lexical
error and won't be considered as an identifier,
transposition of characters 'i' and 'a' */
cout << "GFG!";
return 0;
}
Error Recovery Technique
A situation may arise in which the lexical analyzer is unable to proceed because none of the
patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is
"panic mode" recovery: we delete successive characters from the remaining input until the
lexical analyzer can identify a well-formed token at the beginning of what input is left.
Input Buffering:
The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string.
The forward ptr moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme. For example, as soon as fp encounters a blank
space after the characters i, n, t, the lexeme "int" is identified. When fp encounters white space,
it skips it, and then both bp and fp are set to the start of the next token. The input is thus read
from secondary storage, but reading one character at a time from secondary storage is costly.
Hence a buffering technique is used: a block of data is first read into a buffer and then scanned
by the lexical analyzer. There are two methods used in this context: the One Buffer Scheme and
the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to
scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of
the lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this method two
buffers are used to store the input string. The first and second buffers are scanned
alternately; when the end of the current buffer is reached, the other buffer is filled. The
only problem with this method is that if the length of a lexeme is longer than the length of
a buffer, the input cannot be scanned completely. Initially both bp and fp point to the first
character of the first buffer. Then fp moves towards the right in search of the end of a
lexeme; as soon as a blank character is recognized, the string between bp and fp is identified
as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer
character (eof) is placed at its end; similarly, the end of the second buffer is recognized by
the end-of-buffer mark present at its end. When fp encounters the first eof, the end of the
first buffer is recognized and filling of the second buffer is started; in the same way, the
second eof indicates the end of the second buffer. Alternately, both buffers can be refilled
until the end of the input program is reached and the stream of tokens is identified. The eof
character introduced at the end is called a sentinel, and it is used to identify the end of a
buffer.
Specification of tokens:
In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings include concatenation, where xy denotes x followed by y, and exponentiation, where s^0 = ε and s^i = s^(i-1)s.
Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows operations on languages: let L = {0, 1} and S = {a, b, c}. Then L ∪ S = {0, 1, a, b, c}, the concatenation LS = {0a, 0b, 0c, 1a, 1b, 1c}, and the Kleene closure L* is the set of all strings over {0, 1}, including ε.
Regular Expressions
· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ
and the languages that those expressions denote
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is
the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with a in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has the highest precedence and is left associative.
5. Concatenation has the second highest precedence and is left associative.
6. | has the lowest precedence and is left associative.
Regular set
A language that can be defined by a regular expression is called a regular set. If two
regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s.
There are a number of algebraic laws for regular expressions that can be used to
manipulate into equivalent forms.
For instance, r|s = s|r is commutative; r|(s|t)=(r|s)|t is associative.
Regular Languages are the most restricted types of languages and are accepted by finite
automata.
Regular Expressions.
Regular Expressions are used to denote regular languages. An expression is regular if:
ɸ is a regular expression for regular language ɸ.
ɛ is a regular expression for regular language {ɛ}.
If a ∈ Σ (Σ represents the input alphabet), a is regular expression with language {a}.
If a and b are regular expression, a + b is also a regular expression with language {a,b}.
If a and b are regular expression, ab (concatenation of a and b) is also regular.
If a is regular expression, a* (0 or more times a) is also regular.
For example, a.b* (a followed by 0 or more b's) denotes the language {a, ab, abb, abbb, …}.
Regular Grammar : A grammar is regular if it has rules of form A -> a or A -> aB or A -> ɛ
where ɛ is a special symbol called NULL.
Union: If L1 and L2 are two regular languages, their union L1 ∪ L2 will also be regular.
For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0};
L3 = L1 ∪ L2 = {a^n | n ≥ 0} ∪ {b^n | n ≥ 0} is also regular.
Intersection: If L1 and L2 are two regular languages, their intersection L1 ∩ L2 will also
be regular. For example,
L1 = {a^m b^n | n ≥ 0 and m ≥ 0} and L2 = {a^m b^n | n ≥ 0 and m ≥ 0} ∪ {b^n a^m | n ≥ 0 and m ≥ 0};
L3 = L1 ∩ L2 = {a^m b^n | n ≥ 0 and m ≥ 0} is also regular.
Concatenation: If L1 and L2 are two regular languages, their concatenation L1.L2 will also
be regular. For example,
L1 = {a^m | m ≥ 0} and L2 = {b^n | n ≥ 0};
L3 = L1.L2 = {a^m b^n | m ≥ 0 and n ≥ 0} is also regular.
Kleene Closure : If L1 is a regular language, its Kleene closure L1* will also be regular. For
example,
L1 = (a ∪ b)
L1* = (a ∪ b)*.
Complement : If L(G) is regular language, its complement L’(G) will also be regular.
Complement of a language can be found by subtracting strings which are in L(G) from all
possible strings. For example,
L(G) = {a^n | n > 3}
L'(G) = {a^n | n ≤ 3}
Note : Two regular expressions are equivalent if languages generated by them are same. For
example, (a+b*)* and (a+b)* generate same language. Every string which is generated by
(a+b*)* is also generated by (a+b)* and vice versa.
Question 1 : Which one of the following languages over the alphabet {0,1} is described by the
regular expression?
(0+1)*0(0+1)*0(0+1)*
(A) The set of all strings containing the substring 00.
(B) The set of all strings containing at most two 0’s.
(C) The set of all strings containing at least two 0’s.
(D) The set of all strings that begin and end with either 0 or 1.
Solution: Option A says that the string must have the substring 00. But 10101 is also a part of
the language and it does not contain 00 as a substring, so it is not the correct option.
Option B says that a string can have at most two 0's, but 00000 is also a part of the language,
so it is not the correct option.
Option C says that a string must contain at least two 0's. In the regular expression, two 0's are
present, so (C) is the correct option.
Option D says the language contains all strings that begin and end with either 0 or 1. But the
expression can generate strings which start with 0 and end with 1 or vice versa as well, so it is
not correct.
Solution (for a question asking which expressions are equivalent to 0*(10*)*): Two regular
expressions are equivalent if the languages generated by them are the same.
Option (A) can generate all strings generated by 0*(10*)*, so they are equivalent.
Option (B) cannot generate the null string, but 0*(10*)* can, so they are not equivalent.
Option (C) will have 10 as a substring, but 0*(10*)* may or may not, so they are not equivalent.
Question 4 : The regular expression for the language having input alphabets a and b, in which
two a’s do not come together:
(A) (b + ab)* + (b +ab)*a
(B) a(b + ba)* + (b + ba)*
(C) both options (A) and (B)
(D) none of the above
Solution:
Option (C), stating both options (A) and (B), is the correct regular expression for the stated
question.
The language in the question can be expressed as
L = {ε, a, b, bb, ab, aba, ba, bab, baba, abab, …}.
In option (A), 'ab' is considered the building block for finding out the required regular
expression. (b + ab)* covers all cases of strings ending with 'b'; (b + ab)*a covers all
cases of strings ending with 'a'.
Applying similar logic for option (B) we can see that the regular expression is derived
considering ‘ba’ as the building block and it covers all cases of strings starting with a and
starting with b.
Syntax Analysis:
The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the
tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream. A typical representation is a syntax tree in which each interior
node represents an operation and the children of the node represent the arguments of the operation. A syntax
tree for the token stream (1.2) is shown as the output of the syntactic analyzer in Fig. 1.7.
This tree shows the order in which the operations in the assignment
position = initial + rate * 60
are to be performed. The tree has an interior node labeled * with (id, 3) as its left child and the integer 60 as
its right child. The node (id, 3) represents the identifier rate. The node labeled * makes it explicit that we
must first multiply the value of rate by 60. The node labeled + indicates that we must add the result of this
multiplication to the value of initial. The root of the tree, labeled =, indicates that we must store the result of
this addition into the location for the identifier position. This ordering of operations is consistent with
the usual conventions of arithmetic which tell us that multiplication has higher precedence than addition, and
hence that the multiplication is to be performed before the addition.
The subsequent phases of the compiler use the grammatical structure to help analyze the source program and
generate the target program. In Chapter 4 we shall use context-free grammars to specify the grammatical
structure of programming languages and discuss algorithms for constructing efficient syntax analyzers
automatically from certain classes of grammars.
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source
program for semantic consistency with the language definition. It also gathers type information and saves it
in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index to be an
integer; the compiler must report an error if a floating-point number is used to index an array.
Grammar:
The productions of a grammar specify the manner in which the terminals and non-terminals can be
combined to form strings. Each production consists of a non-terminal called the left side of the production,
an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.
Context free grammar is a formal grammar which is used to generate all possible strings in a given formal
language.
Context free grammar G can be defined by four tuples as:
G = (V, T, P, S)
where
V describes a finite set of non-terminal symbols,
T describes a finite set of terminal symbols,
P describes a set of production rules, and
S is the start symbol.
In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly
replacing a non-terminal by the right-hand side of a production, until all non-terminals have been
replaced by terminal symbols.
Example:
Production rules:
1. S → aSa
2. S → bSb
3. S → c
Now check whether the string abbcbba can be derived from the given CFG:
1. S ⇒ aSa
2. S ⇒ abSba
3. S ⇒ abbSbba
4. S ⇒ abbcbba
5. By applying the production S → aSa, S → bSb recursively and finally applying the production
S → c, we get the string abbcbba.
Derivation:
Derivation is a sequence of production rules; it is used to obtain the input string through these production
rules. During parsing we have to take two decisions: deciding the non-terminal which is to be replaced, and
deciding the production rule by which that non-terminal will be replaced. To decide which non-terminal to
replace first, we have two options:
Left-most Derivation
In the left-most derivation, the input is scanned and replaced with the production rules from left to right, i.e.
at every step we replace the left-most non-terminal.
Example:
Production rules:
1. S = S + S
2. S = S - S
3. S = a | b |c
Input:
a-b+c
1. S=S+S
2. S=S-S+S
3. S=a-S+S
4. S=a-b+S
5. S=a-b+c
Right-most Derivation
In the right-most derivation, the input is scanned and replaced with the production rules from right to left, i.e.
at every step we replace the right-most non-terminal.
Example:
1. S = S + S
2. S = S - S
3. S = a | b |c
Input:
a-b+c
1. S=S-S
2. S=S-S+S
3. S=S-S+c
4. S=S-b+c
5. S=a-b+c
Parsing:
Parsing is the process of determining whether a string of tokens can be generated by a grammar, and if so,
how. This process is accomplished by the parser. The parser is the component of the translator that helps to
organise the linear token stream into a hierarchical structure following the set of defined rules known as
the grammar.
Ambiguity:
A grammar is said to be ambiguous if there exists more than one leftmost derivation, more than one
rightmost derivation, or more than one parse tree for some input string. If the grammar is not ambiguous,
then it is called unambiguous.
Example:
1. S → aSb | SS
2. S → ε
For the string aabb, the above grammar generates two parse trees:
If the grammar has ambiguity, then it is not good for compiler construction. No method can automatically
detect and remove the ambiguity, but you can remove ambiguity by re-writing the whole grammar without
ambiguity.
Types of Parsers:
The parser is that phase of the compiler which takes a token string as input and with the help of existing
grammar, converts it into the corresponding Intermediate Representation(IR). The parser is also known
as Syntax Analyzer.
The parser is mainly classified into two categories, i.e. Top-down Parser, and Bottom-up Parser. These are
explained below:
Top-Down Parser:
The top-down parser is the parser that generates the parse tree for the given input string with the help of
grammar productions by expanding the non-terminals, i.e. it starts from the start symbol and ends on the
terminals. It uses the left-most derivation.
Further Top-down parser is classified into 2 types: A recursive descent parser, and Non-recursive descent
parser.
1. Recursive descent parser is also known as the Brute force parser or the backtracking parser. It
basically generates the parse tree by using brute force and backtracking.
2. Non-recursive descent parser is also known as LL(1) parser or predictive parser or without
backtracking parser or dynamic parser. It uses a parsing table to generate the parse tree instead of
backtracking.
Bottom-up Parser:
Bottom-up Parser is the parser that generates the parse tree for the given input string with the help of
grammar productions by reducing the input string to non-terminals, i.e. it starts from the input terminals
and ends on the start symbol. It uses the reverse of the rightmost derivation.
Further Bottom-up parser is classified into two types: LR parser, and Operator precedence parser.
LR parser is the bottom-up parser that generates the parse tree for the given string by using
unambiguous grammar. It follows the reverse of the rightmost derivation.
LR parser is of four types:
(a)LR(0).
(b)SLR(1).
(c)LALR(1).
(d)CLR(1) .
Operator precedence parser generates the parse tree from the given grammar and string, but the only
condition is that two consecutive non-terminals and epsilon never appear on the right-hand side of any
production.
The operator precedence parsing techniques can be applied to operator grammars.
Operator grammar: A grammar is said to be an operator grammar if no production rule has, on its
right-hand side:
1. ε (epsilon), or
2. two non-terminals appearing consecutively, that is, without any terminal between them.
Operator precedence parsing is not a simple technique to apply to most language constructs, but it
is an easy technique to implement where a suitable grammar can be produced.
The SLR parser, CLR parser, and LALR parser are all parts of the bottom-up parser family.
SLR Parser: The SLR parser is similar to the LR(0) parser except for the reduce entries: the reduce
actions are written only in the FOLLOW of the variable whose production is reduced.
Construction of SLR parsing table –
1. Construct C = {I0, I1, ……, In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
If [A → α.aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a
terminal.
If [A → α.] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may
not be S'.
If [S' → S.] is in Ii, then set ACTION[i, $] to "accept". If any conflicting actions are generated by the
above rules, we say that the grammar is not SLR.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: if GOTO( Ii , A ) =
Ij then GOTO [i, A] = j.
4. All entries not defined by rules 2 and 3 are made error.
Note: if the parsing table has multiple entries in a single cell, it is said to be a conflict. Eg:
Consider the grammar E -> T+E | T
T ->id
Augmented grammar - E’ -> E
E -> T+E | T
T -> id
Every SLR grammar is unambiguous but there are many unambiguous grammars that are not SLR.
CLR PARSER: In the SLR method we were working with LR(0) items. In CLR parsing we will be using
LR(1) items. An LR(k) item is defined to be an item using lookaheads of length k. So, the LR(1) item is
comprised of two parts: the LR(0) item and the lookahead associated with the item. LR(1) parsers are
more powerful parsers. For LR(1) items we modify the Closure and GOTO functions.
Closure Operation
Closure(I)
repeat
for (each item [A → α.Bβ, a] in I)
for (each production B → γ in G')
for (each terminal b in FIRST(βa))
add [B → .γ, b] to set I;
until no more items are added to I;
return I;
Let's understand it with an example –
Goto Operation
Goto(I, X)
Initialise J to be the empty set;
for (each item [A → α.Xβ, a] in I)
Add item [A → αX.β, a] to set J; /* move the dot one step */
return Closure(J); /* apply closure to the set */
Eg-
LR(1) items
void items(G')
Initialise C to {Closure({[S' → .S, $]})};
repeat
for (each set of items I in C)
for (each grammar symbol X)
if (GOTO(I, X) is not empty and not in C)
add GOTO(I, X) to C;
until no new sets of items are added to C;
Construction of GOTO graph
State I0 – closure of augmented LR(1) item.
Using I0 find all collection of sets of LR(1) items with the help of DFA
Convert DFA to LR(1) parsing table
Construction of CLR parsing table- Input – augmented grammar G’
1. Construct C = {I0, I1, ……, In}, the collection of sets of LR(1) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows: i) If [A → α.aβ,
b] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a terminal. ii) If [A
→ α., a] is in Ii, A ≠ S', then set ACTION[i, a] to "reduce A → α". iii) If [S' → S., $] is in Ii, then set
ACTION[i, $] to "accept". If any conflicting actions are generated by the above rules, we say that the
grammar is not CLR.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: if GOTO( Ii, A ) =
Ij then GOTO [i, A] = j.
4. All entries not defined by rules 2 and 3 are made error.
Eg:
Consider the following grammar
S -> AaAb | BbBa
A -> ε
B -> ε
Augmented grammar –
S' -> S
S -> AaAb | BbBa
A -> ε
B -> ε
GOTO graph for this grammar will be -
Note – if a state has two reductions and both have the same lookahead, then this will result in multiple entries in the parsing table, thus a conflict. Likewise, if a state has one reduction and there is a shift from that state on a terminal that is the same as the lookahead of the reduction, then this will lead to multiple entries in the parsing table, thus a conflict.
LALR PARSER
LALR parsers are the same as CLR parsers with one difference: if two states in the CLR parser differ only in their lookaheads, then we combine those states in the LALR parser. After this minimisation, if the parsing table has no conflict then the grammar is LALR as well. Eg:
consider the grammar S ->AA
A -> aA | b
Augmented grammar - S’ -> S
S ->AA
A -> aA | b
Important Notes
1. Even though the CLR parser has no RR conflicts, the LALR parser may still contain RR conflicts.
2. If the number of states is n1 for LR(0), n2 for SLR, n3 for LALR and n4 for CLR, then n1 = n2 = n3 <= n4.
We now turn to non-recursive descent parsing, also known as the LL(1) parser.
LL(1) Parsing: the first L indicates that the input is scanned from left to right, the second L indicates that the technique uses a leftmost derivation, and the 1 is the number of lookahead symbols examined when making a parsing decision.
Step 1: First check all the essential conditions (the grammar must be unambiguous, non-left-recursive, and left-factored) and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
1. First(): if there is a variable, and from that variable we try to derive all the strings, then the set of beginning terminal symbols is called the First set.
2. Follow(): the set of terminal symbols that can follow a variable in the process of derivation.
Step 3: For each production A –> α. (A tends to alpha)
1. Find First(α) and for each terminal in First(α), make entry A –> α in the table.
2. If First(α) contains ε (epsilon) as terminal, then find the Follow(A) and for each terminal in Follow(A),
make entry A –> ε in the table.
3. If the First(α) contains ε and Follow(A) contains $ as terminal, then make entry A –> ε in the table for
the $.
To construct the parsing table we use the two functions First() and Follow(). In the table, rows contain the non-terminals and columns contain the terminal symbols. All the null productions of the grammar go under the Follow elements and the remaining productions lie under the elements of the First set.
For example, for the ε-productions of the usual expression grammar:

Production          First          Follow
E' –> +TE'/ε        { +, ε }       { $, ) }
T' –> *FT'/ε        { *, ε }       { +, $, ) }

(The full parsing table, with columns id, +, *, (, ), and $, is omitted here.)
As you can see, all the null productions are put under the Follow set of that symbol and all the remaining productions lie under the First of that symbol.
Note: Not every grammar is feasible for an LL(1) parsing table. It may be possible that one cell contains more than one production.
Example-2:
Consider the Grammar
S --> A | a
A --> a
Step 1 – The grammar does not satisfy all the properties in step 1, as the grammar is ambiguous. Still, let's try to make the parser table and see what happens.
Step 2 – Calculating First() and Follow():

      First     Follow
S     { a }     { $ }
A     { a }     { $ }

Step 3 – Making the parser table:

      a                     $
S     S –> A , S –> a
A     A –> a
Here, we can see that there are two productions in the same cell. Hence, this grammar is not feasible for
LL(1) Parser.
Trick – The above grammar is ambiguous, so it does not satisfy the essential conditions. We can therefore say that this grammar is not feasible for an LL(1) parser even without building the parse table.
Example 3: Consider the Grammar
S -> (L) | a
L -> SL'
L' -> )SL' | ε
Step 1 – The grammar satisfies all the properties in step 1.
Step 2 – Calculating First() and Follow():

      First       Follow
S     { (, a }    { $, ) }
L     { (, a }    { ) }
L'    { ), ε }    { ) }
Step 3 – Making the parser table:

Parsing Table:

      (            )                      a           $
S     S -> (L)                            S -> a
L     L -> SL'                            L -> SL'
L'                 L' -> )SL' , L' -> ε
Here, we can see that there are two productions in the same cell, so this grammar is not feasible for an LL(1) parser even though it satisfies all the essential conditions in step 1. Example 2 showed that the essential conditions are necessary; example 3 shows that they are not sufficient for a grammar to be LL(1).
Transformation on the grammars:
These notes describe methods of transforming grammars. We are motivated by the desire to transform a
grammar into an LL(1) grammar. If a grammar can be so transformed it allows access to the predictive top-
down parsing techniques (i.e. either a table-driven pushdown automata or a recursive descent parser). The
bad news is that given an arbitrary context free grammar it is undecidable whether there is an LL(1) (or
LL(k)) grammar which will recognize the same language: thus, ultimately our objective is impossible!
In fact, this is not quite the problem we wish to solve! We are not simply looking for a completely arbitrary
and unrelated grammar which by some remarkable fluke happens to recognize the same language. We
require a transformation between the set of parse trees ... this because we need to transfer the semantics (or
meaning) intended by the old grammar onto the new grammar. Thus, we really need a step by step
transformation system which allows us to also keep track of how we have transformed the parse tree.
We may see the problem as follows: we are given a grammar G and we wish to transform it to a grammar G'
while providing a translation T: G' -> G such that
INPUT
/ \
parse / \ parse
/ \
/ \
\/ \/
G' -----------> G
T
commutes, in the sense that parsing an input for G' and then translating the parse tree into one for G is the
same as parsing straight into G. In fact, furthermore, it is desirable for the translation T to be given by a linear
time "fold" (an attribute grammar). As we shall see we can do a number of useful transformation steps in the
direction of obtaining an LL(1) grammar. These steps almost amount to a procedure for generating an
equivalent LL(1) grammar (if one exists) -- the one aspect we shall not discuss is how to eliminate the
follow set clashes.
The basic methods for transforming grammars which we shall look at are substitution, left factoring, and the elimination of left recursion.
These techniques are often sufficient to bring a grammar to heel! However, it should be noted that if the
original grammar is ambiguous then, as these techniques "preserve meaning", the result will still be
ambiguous and so necessarily will fail to be LL(1) or to belong to any other unambiguous class (such as
LL(k) or LR(k)). Furthermore, if the language simply is not LL(1) then these techniques will also fail.
Thus, you should not expect them to deliver magically an LL(1) grammar!
Set against the fact that the transformations may not help are:
1. By trying to transform the grammar you may expose aspects of the grammar which were not apparent
in its original form;
2. Many of the grammars you will meet in practice can be transformed to LL(1)! If they can be
transformed then one has access to particularly simple parsing techniques with good error reporting.
Eliminating left recursion is the most difficult transformation step. I shall present an approach that has the
following advantages:
1. it allows one to carry out the transformation process by hand relatively easily;
2. it is directly applicable to most grammars as it allows nullable productions;
3. it tries to keep as many of the original features of the grammar intact as possible and tries to keep the size of the grammar under control.
Transformations highlight the difference between the presentation (that is the grammar) of a language and
the language itself (the set of strings recognized). Recall many grammars can recognize the same language
but only one language is recognized by a grammar: transformations certainly must preserve the language ...
we also require that they allow one to recover the parse tree as well.
The algorithm I will describe works on any grammar which has no cycles and is null unambiguous. A
grammar has a cycle if there is a nonterminal x with a nontrivial derivation x =+=> x. Grammars with
cycles are quite pathological (and ambiguous): they are of little interest in parsing applications so this does
not unduly limit the applicability of this algorithm. Importantly, there is a straightforward test of whether a
grammar has cycles. Null ambiguous grammars are, similarly, quite pathological (and ambiguous) and null
ambiguity can be easily determined.
A grammar that is used to define mathematical operators is called an operator grammar or operator
precedence grammar. Such grammars have the restriction that no production has either an empty
right-hand side (null productions) or two adjacent non-terminals in its right-hand side.
Examples –
This is an example of operator grammar:
E->E+E/E*E/id
However, the grammar given below is not an operator grammar because two non-terminals are adjacent to
each other:
S->SAS/a
A->bSb/b
We can convert it into an operator grammar, though:
S->SbSbS/SbS/a
A->bSb/b
Operator precedence parser –
An operator precedence parser is a bottom-up parser that interprets an operator grammar. This parser is used only for operator grammars. Ambiguous grammars are not allowed in any parser except the operator precedence parser.
There are two methods for determining what precedence relations should hold between a pair of terminals:
1. Use the conventional associativity and precedence of operator.
2. The second method of selecting operator-precedence relations is first to construct an unambiguous
grammar for the language, a grammar that reflects the correct associativity and precedence in its parse
trees.
This parser relies on the following three precedence relations: ⋖, ≐, ⋗
a ⋖ b This means a “yields precedence to” b.
a ⋗ b This means a “takes precedence over” b.
a ≐ b This means a “has same precedence as” b.
When there is no cycle in the precedence graph, the relations can also be encoded as precedence functions. Separately, the following C program checks whether a grammar, read as a list of productions, is an operator grammar:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

// report failure and stop: the grammar is not an operator grammar
void fail()
{
    printf("Not operator grammar\n");
    exit(0);
}

int main()
{
    char grm[20][20];
    int i, j, n;

    // taking number of productions from user,
    // then each production in the form X=..., with '$' denoting a null RHS
    scanf("%d", &n);
    for (i = 0; i < n; i++)
        scanf("%s", grm[i]);

    for (i = 0; i < n; i++) {
        // the right-hand side starts after "X=" at index 2
        if (grm[i][2] == '$')                    // null production
            fail();
        for (j = 2; grm[i][j + 1] != '\0'; j++)
            if (isupper(grm[i][j]) && isupper(grm[i][j + 1]))
                fail();                          // two adjacent non-terminals
    }
    printf("Operator grammar\n");
    return 0;
}
Input:
3
A=A*A
B=AA
A=$
Output:
Not operator grammar

Input:
2
A=A/A
B=A+A
Output:
Operator grammar