
Compiler Design

Module-I:
Compiler Structure: Model of compilation:

Programming languages are notations for describing computations to people and to machines.
The world as we know it depends on programming languages, because all the software running
on all the computers was written in some programming language. But, before a program can be
run, it first must be translated into a form in which it can be executed by a computer.
The software systems that do this translation are called compilers.

Language Processors
A compiler is a program that can read a program in one language — the source language — and
translate it into an equivalent program in another language — the target language; see Fig. 1.1.
An important role of the compiler is to report any errors in the source program that it detects
during the translation process.

Fig. 1.1: A compiler
If the target program is an executable machine-language program, it can then be called by the
user to process inputs and produce outputs;

An interpreter is another common kind of language processor. Instead of producing a target


program as a translation, an interpreter appears to directly execute the operations specified in the
source program on inputs supplied by the user
The machine-language target program produced by a compiler is usually much faster than an
interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error
diagnostics than a compiler, because it executes the source program statement by statement.


Example 1.1: Java language processors combine compilation and interpretation, as shown


in Fig. 1.4. A Java source program may first be compiled into an intermediate form
called bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this
arrangement is that bytecodes compiled on one machine can be interpreted on another machine,
perhaps across a network.
In order to achieve faster processing of inputs to outputs, some Java compilers, called
just-in-time compilers, translate the bytecodes into machine language immediately before they run the
intermediate program to process the input.
In addition to a compiler, several other programs may be required to create an executable target
program, as shown in Fig. 1.5. A source program may be divided into modules stored in separate
files. The task of collecting the source program is sometimes entrusted to a separate program,
called a preprocessor. The preprocessor may also expand shorthands, called macros, into source
language statements.

The modified source program is then fed to a compiler. The compiler may produce an
assembly-language program as its output, because assembly language is easier to produce as output and is
easier to debug. The assembly language is then processed by a program called an assembler that
produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine. The linker resolves external memory addresses, where the code in one file may
refer to a location in another file. The loader then puts together all of the executable object files
into memory for execution.

The Structure of a Compiler


We have treated a compiler as a single box that maps a source program into a semantically
equivalent target program. If we open up this box a little, we see that there are two parts to this
mapping: analysis and synthesis.
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them. It then uses this structure to create an intermediate
representation of the source program. If the analysis part detects that the source program is either
syntactically ill formed or semantically unsound, then it must provide informative messages, so
the user can take corrective action. The analysis part also collects information about the source
program and stores it in a data structure called a symbol table, which is passed along with the
intermediate representation to the synthesis part.
The synthesis part constructs the desired target program from the intermediate representation
and the information in the symbol table. The analysis part is often called the front end of the
compiler; the synthesis part is the back end.
If we examine the compilation process in more detail, we see that it operates as a sequence of
phases, each of which transforms one representation of the source program to another. A typical
decomposition of a compiler into phases is shown in Fig. 1.6. In practice, several phases may be
grouped together, and the intermediate representations between the grouped phases need not be
constructed explicitly. The symbol table, which stores information about the entire source
program, is used by all phases of the compiler. Fig. 1.6 shows the various phases of a compiler.

Some compilers have a machine-independent optimization phase between the front end and the
back end. The purpose of this optimization phase is to perform transformations on the
intermediate representation, so that the back end can produce a better target program than it
would have otherwise produced from an unoptimized intermediate representation.

Lexical analysis:

The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the
form <token-name, attribute-value> that it passes on to the subsequent phase, syntax analysis.
In the token, the first component token-name is an abstract symbol that is used during syntax
analysis, and the second component attribute-value points to an entry in the symbol table for
this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract
symbol standing for identifier and 1 points to the symbol table entry for position . The
symbol-table entry for an identifier holds information about the identifier, such as its
name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=) . Since this token
needs no attribute-value, we have omitted the second component. We could have used any
abstract symbol such as assign for the token-name, but for notational convenience we have
chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table
entry for initial .
4. + is a lexeme that is mapped into the token (+) .
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry
for rate .
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer.
After lexical analysis, the assignment (1.1) is represented as the sequence of tokens
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for
the assignment, addition, and multiplication operators, respectively.
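As a minimal sketch of this phase (illustrative only, not taken from any particular compiler; the fixed-size table and the helper name install are assumptions), the following C program groups the characters of statement (1.1) into lexemes and prints the token sequence (1.2), installing identifiers in a small symbol table:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Sketch of a scanner for: position = initial + rate * 60
   The attribute value of an id token is its symbol-table index.
   No overflow checks in this sketch. */
static char symtab[16][32];
static int nsyms = 0;

static int install(const char *name) {      /* return 1-based index, inserting if new */
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], name) == 0) return i + 1;
    strcpy(symtab[nsyms++], name);
    return nsyms;
}

int main(void) {
    const char *p = "position = initial + rate * 60";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* blanks are discarded */
        if (isalpha((unsigned char)*p)) {                    /* lexeme: identifier */
            char buf[32]; int n = 0;
            while (isalnum((unsigned char)*p)) buf[n++] = *p++;
            buf[n] = '\0';
            printf("<id, %d> ", install(buf));
        } else if (isdigit((unsigned char)*p)) {             /* lexeme: number */
            int v = 0;
            while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
            printf("<%d> ", v);
        } else {                                             /* single-character operator */
            printf("<%c> ", *p++);
        }
    }
    printf("\n");    /* prints: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60> */
    return 0;
}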
Interface with input, parser and symbol table:

Symbol Table is an important data structure created and maintained by the compiler in order
to keep track of semantics of variables i.e. it stores information about the scope and binding
information about names, information about instances of various entities such as variable and
function names, classes, objects, etc.
 It is built in the lexical and syntax analysis phases.
 The information is collected by the analysis phases of the compiler and is used by the
synthesis phases of the compiler to generate code.
 It is used by the compiler to achieve compile-time efficiency.
 It is used by various phases of the compiler as follows:
 Lexical Analysis: Creates new entries in the table, for example entries for identifiers and
other tokens.
 Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of
reference, use, etc in the table.
 Semantic Analysis: Uses available information in the table to check for semantics i.e.
to verify that expressions and assignments are semantically correct(type checking) and
update it accordingly.
 Intermediate Code generation: Refers to the symbol table to know how much and what
type of run-time storage is allocated; the table also helps in adding temporary variable information.
 Code Optimization: Uses information present in the symbol table for machine-
dependent optimization.
 Target Code generation: Generates code by using address information of identifier
present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that
support the compiler in different phases.
Items stored in Symbol table:
 Variable names and constants
 Procedure and function names
 Literal constants and strings
 Compiler generated temporaries
 Labels in source languages
Information used by the compiler from Symbol table:
 Data type and name
 Declaring procedures
 Offset in storage
 If structure or record then, a pointer to structure table.
 For parameters, whether parameter passing by value or by reference
 Number and type of arguments passed to function
 Base Address
 Operations of Symbol table – The basic operations defined on a symbol table are insert,
which adds a new name together with its attributes, and lookup, which searches for a name and
returns its information.

Implementation of Symbol table –


Following are commonly used data structures for implementing symbol table:-
1. List –
 In this method, an array is used to store names and associated information.
 A pointer “available” is maintained at the end of all stored records, and new names are
added in the order in which they arrive
 To search for a name we start from the beginning of the list till available pointer and if
not found we get an error “use of the undeclared name”
 While inserting a new name we must ensure that it is not already present otherwise an
error occurs i.e. “Multiple defined names”
 Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
 The advantage is that it takes a minimum amount of space.
2. Linked List –
 This implementation is using a linked list. A link field is added to each record.
 Searching of names is done in order pointed by the link of the link field.
 A pointer “First” is maintained to point to the first record of the symbol table.
 Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
Hash Table –
 In the hashing scheme, two tables are maintained – a hash table and a symbol table; this is the
most commonly used method to implement symbol tables.
 A hash table is an array with an index range: 0 to table size – 1. These entries are pointers
pointing to the names of the symbol table.
 To search for a name we use a hash function that will result in an integer between 0 to table
size – 1.
 Insertion and lookup can be made very fast – O(1).
 The advantage is that quick search is possible; the disadvantage is that hashing is
complicated to implement.
Binary Search Tree –
 Another approach to implementing a symbol table is to use a binary search tree i.e. we
add two link fields i.e. left and right child.
 All names are inserted as nodes of a tree that always maintains the property of the
binary search tree.
 Insertion and lookup are O(log2 n) on average.
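As a concrete sketch of the hashing scheme described above, here is a hedged C implementation of insert and lookup using chained buckets (the table size, field widths, and names such as bucket are illustrative assumptions, not a prescribed layout):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                 /* assumed prime table size */

struct entry {                         /* one symbol-table record */
    char name[32];
    char type[16];
    struct entry *next;                /* chain for colliding names */
};

static struct entry *bucket[TABLE_SIZE];

static unsigned hash(const char *s) {  /* simple string hash, 0 .. TABLE_SIZE-1 */
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

struct entry *lookup(const char *name) {      /* O(1) expected */
    for (struct entry *e = bucket[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;
}

struct entry *insert(const char *name, const char *type) {
    if (lookup(name)) return NULL;            /* "multiple defined names" */
    struct entry *e = malloc(sizeof *e);
    strcpy(e->name, name);
    strcpy(e->type, type);
    unsigned h = hash(name);
    e->next = bucket[h];                      /* push onto the chain */
    bucket[h] = e;
    return e;
}

int main(void) {
    insert("position", "float");
    insert("rate", "float");
    struct entry *e = lookup("rate");
    if (e) printf("%s : %s\n", e->name, e->type);   /* rate : float */
    return 0;
}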
Advantages of Symbol Table
1. Improved efficiency: the efficiency of a program can be increased by using symbol tables,
which give quick and simple access to crucial data such as variable and function names,
data types, and memory locations.
2. Better code structure: symbol tables can be used to organize and simplify code,
making it simpler to comprehend, locate, and correct problems.
3. Faster code execution: by offering quick access to information like memory addresses,
symbol tables can be utilized to optimize code execution by lowering the number of
memory accesses required during execution.
4. Improved portability: symbol tables can be used to increase the portability of code by
offering a standardized method of storing and retrieving data, which can make it simpler
to migrate code between different systems or programming languages.
5. Improved code reuse: by offering a standardized method of storing and accessing
information, symbol tables can be utilized to increase the reuse of code across multiple
projects.
6. Easier debugging: symbol tables can be used to facilitate easy access to and examination
of a program’s state during execution, enhancing debugging by making it simpler to
identify and correct mistakes.
Applications of Symbol Table
1. Resolution of variable and function names: Symbol tables are used to identify the data
types and memory locations of variables and functions as well as to resolve their names.
2. Resolution of scope issues: To resolve naming conflicts and ascertain the range of
variables and functions, symbol tables are utilized.
3. Optimization of code execution: symbol tables, which offer quick access to information
such as memory locations, are used to optimize code execution.
4. Code generation: By giving details like memory locations and data kinds, symbol tables
are utilized to create machine code from source code.
5. Error checking and code debugging: By supplying details about the status of a program
during execution, symbol tables are used to check for faults and debug code.
6. Code organization and documentation: By supplying details about a program’s structure,
symbol tables can be used to organize code and make it simpler to understand.

Token, lexeme and patterns.

Token:

A token is basically a sequence of characters that is treated as a unit, as it cannot be further broken
down. In programming languages like C, keywords (int, char, float, const, goto,
continue, etc.), identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuators
like the comma (,), semicolon (;), and braces ({ }), and strings can all be considered tokens. This phase
recognizes three types of tokens: Terminal Symbols (TRM) – keywords and operators,
Literals (LIT), and Identifiers (IDN).
Let’s understand now how to calculate tokens in a source code (C language):

Example 1:

int a = 10; //Input Source code


Tokens:

int (keyword), a(identifier), =(operator), 10(constant) and ;(punctuation-semicolon)


Answer – Total number of tokens = 5

Example 2:
int main() {
    // printf() sends the string inside the quotation marks to
    // the standard output (the display)
    printf("Welcome to GeeksforGeeks!");
    return 0;
}
Tokens
'int', 'main', '(', ')', '{', 'printf', '(', ' "Welcome to GeeksforGeeks!" ',
')', ';', 'return', '0', ';', '}'
Answer – Total number of tokens = 14
Lexeme

A lexeme is a sequence of characters in the source code that is matched against the predefined
rules of the language and thereby identified as an instance of a valid token.
Example:

main is a lexeme of type identifier (token)

(, ), {, } are lexemes of type punctuation (token)

Pattern

It specifies a set of rules that a scanner follows to create a token.


Example of Programming Language (C, C++):

For a keyword to be identified as a valid token, the pattern is the sequence of characters that
makes up the keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must
start with a letter, followed by letters or digits.

Difference between Token, Lexeme, and Pattern

Criteria: Definition
Token – a sequence of characters that is treated as a unit, as it cannot be further broken down.
Lexeme – a sequence of characters in the source code that is matched by the predefined language rules and thereby qualifies as a valid token.
Pattern – the set of rules that the scanner follows to create a token.

Criteria: Interpretation of type Keyword
Token – all the reserved keywords of the language (main, printf, etc.)
Lexeme – int, goto
Pattern – the sequence of characters that makes up the keyword.

Criteria: Interpretation of type Identifier
Token – the name of a variable, function, etc.
Lexeme – main, a
Pattern – must start with a letter, followed by letters or digits.

Criteria: Interpretation of type Operator
Token – all the operators are considered tokens.
Lexeme – +, =
Pattern – +, =

Criteria: Interpretation of type Punctuation
Token – each kind of punctuation is considered a token (e.g. semicolon, bracket, comma, etc.)
Lexeme – (, ), {, }
Pattern – (, ), {, }

Criteria: Interpretation of type Literal
Token – a string or boolean literal.
Lexeme – "Welcome to GeeksforGeeks!"
Pattern – any string of characters between " and "

The output of Lexical Analysis Phase:


The output of the Lexical Analyzer serves as input to the Syntax Analyzer as a sequence of tokens,
not as a series of lexemes, because during the syntax analysis phase the individual lexeme is not
vital; what matters is the category or class to which the lexeme belongs.

Example:
z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>; //<id>- identifier (token)
The Lexical Analyzer not only provides a series of tokens but also creates a Symbol Table that
consists of all the tokens present in the source code except Whitespaces and comments.

Difficulties in lexical analysis.

Lexical analysis is the process of producing tokens from the source program. It has the
following issues:
• Lookahead
• Ambiguities
1. Lookahead:

Lookahead is required to decide when one token ends and the next token begins. Simple
examples with lookahead issues are i vs. if and = vs. ==. Therefore a way to describe the
lexemes of each token is required.

A way is needed to resolve ambiguities:


• Is if the two variables i and f, or the keyword if?
• Is == two separate = signs, or one == operator?
• arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).
Hence, the number of lookahead characters to consider, and a way to describe the lexemes of each
token, are also needed.
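The following C fragment is a hedged sketch (a hand-written scanner fragment, not lex output) of how a single character of lookahead resolves = versus ==:

#include <stdio.h>

static const char *p;   /* current input position */

/* One character of lookahead decides where '=' ends:
   if the next character is also '=', the longest match '==' wins;
   otherwise the lookahead character is left for the next call. */
const char *next_token(void) {
    while (*p == ' ') p++;                 /* discard blanks */
    if (*p == '\0') return "EOF";
    if (*p == '=') {
        p++;
        if (*p == '=') { p++; return "EQ =="; }
        return "ASSIGN =";
    }
    p++;
    return "OTHER";
}

int main(void) {
    p = "== =";
    puts(next_token());   /* EQ ==    */
    puts(next_token());   /* ASSIGN = */
    puts(next_token());   /* EOF      */
    return 0;
}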
2. Ambiguities:

The lexical analysis programs written with lex accept ambiguous specifications and choose the
longest match possible at each input point. Lex can handle ambiguous specifications. When
more than one expression can match the current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is preferred.
Error recovery strategies for a lexical analyzer
A character sequence that cannot be scanned into any valid token is a lexical error. Misspellings
of identifiers, keywords, or operators are considered lexical errors. Usually, a lexical error is
caused by the appearance of some illegal character, mostly at the beginning of a token.
The following are the error-recovery strategies in lexical analysis:
1) Panic Mode Recovery: Once an error is found, the successive characters are ignored
until we reach a well-formed token like end or a semicolon.
2) Deleting an extraneous character.
3) Inserting a missing character.
4) Replacing an incorrect character by a correct character.
5) Transposing two adjacent characters.
Lexical Error:

When the token pattern does not match the prefix of the remaining input, the lexical analyzer
gets stuck and has to recover from this state to analyze the remaining input. In simple words,
a lexical error occurs when a sequence of characters does not match the pattern of any token.
It is detected during the lexical analysis phase of compilation.
Types of Lexical Error:

Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding length of identifier or numeric constants.

#include <iostream>
using namespace std;

int main() {
    int a = 2147483647 + 1;
    return 0;
}

This is a lexical error since a signed integer lies between −2,147,483,648 and 2,147,483,647.

2. Appearance of illegal characters.

Example:
#include <iostream>
using namespace std;

int main() {
    printf("Geeksforgeeks");$
    return 0;
}

This is a lexical error since an illegal character $ appears at the end of the statement.
3. Unmatched string

Example:
#include <iostream>
using namespace std;

int main() {
    /* comment
    cout << "GFG!";
    return 0;
}
This is a lexical error since the ending of comment “*/” is not present but the beginning is
present.
4. Spelling Error.

#include <iostream>
using namespace std;

int main() {
    int 3num = 1234;  /* spelling error, as an identifier
                         cannot start with a number */
    return 0;
}

5.Replacing a character with an incorrect character.

#include <iostream>
using namespace std;

int main() {
    int x = 12$34;  /* lexical error, as '$' does not
                       belong within the 0-9 range */
    return 0;
}

Other lexical errors include


6. Removal of a character that should be present.
#include <iostream>
using namespace std;

int main() {
    cut << "GFG!";  /* missing 'o' character in 'cout',
                       hence a lexical error */
    return 0;
}

7. Transposition of two characters.

#include <iostream>
using namespace std;

int mian()
{
    /* the spelling of main here would be treated as a lexical
       error and won't be considered as the identifier main:
       transposition of the characters 'i' and 'a' */
    cout << "GFG!";
    return 0;
}
Error Recovery Technique

A situation may arise in which the lexical analyzer is unable to proceed because none of the
patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is
“panic mode” recovery: we delete successive characters from the remaining input until the
lexical analyzer can identify a well-formed token at the beginning of what input is left.

Error-recovery actions are:

1. Transpose of two adjacent characters.


2. Insert a missing character into the remaining input.
3. Replace a character with another character.
4. Delete one character from the remaining input.

Input buffering:

The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string.
The forward ptr moves ahead in search of the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme; for example, when fp encounters a blank space
after scanning "int", the lexeme "int" is identified. When fp encounters white space, it ignores it
and moves ahead; then both the begin ptr (bp) and forward ptr (fp) are set to the next token.
The input characters are thus read from secondary storage, but reading one character at a time
from secondary storage is costly, so a buffering technique is used: a block of data is first read
into a buffer and then scanned from the buffer by the lexical analyzer. There are two methods
used in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained
below.

1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to
scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the
lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this method two
buffers are used to store the input string. The first and second buffers are scanned alternately;
when the end of the current buffer is reached, the other buffer is filled. The only remaining
problem is that if the lexeme is longer than the combined buffers, the input cannot be scanned
completely. Initially both bp and fp point to the first character of the first buffer. Then fp moves
towards the right in search of the end of the lexeme; as soon as a blank character is recognized,
the string between bp and fp is identified as the corresponding token. To identify the boundary
of the first buffer, an end-of-buffer character is placed at the end of the first buffer; similarly,
the end of the second buffer is recognized by the end-of-buffer mark at its end. When fp
encounters the first eof, it recognizes the end of the first buffer, and filling of the second buffer
begins. In the same way, when the second eof is reached, it indicates the end of the second
buffer. Alternately, both buffers can be refilled until the end of the input program is reached and
the stream of tokens is identified. The eof character introduced at the end is called a sentinel,
which is used to identify the end of a buffer.
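A hedged C sketch of the two-buffer scheme's sentinel test follows; the buffer size N, the use of '\0' as the eof sentinel, and the file name are illustrative assumptions:

#include <stdio.h>

#define N 4096                         /* assumed size of each buffer half */

static char buf[2 * N + 2];            /* two halves, each followed by a sentinel slot */
static char *fp;                       /* forward pointer */
static FILE *src;

static void fill(char *half) {         /* read one block, plant the eof sentinel */
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';                    /* full read: sentinel at the half boundary;
                                          short read: sentinel marks real EOF */
}

/* Return the next input character; on hitting a sentinel, decide whether it
   marks the end of a buffer half (reload the other half) or real EOF (-1). */
int next_char(void) {
    for (;;) {
        char c = *fp++;
        if (c != '\0') return (unsigned char)c;
        if (fp == buf + N + 1) {              /* just passed end of first half */
            fill(buf + N + 1);
            fp = buf + N + 1;
        } else if (fp == buf + 2 * N + 2) {   /* just passed end of second half */
            fill(buf);
            fp = buf;
        } else {
            return -1;                        /* sentinel inside a half: real EOF */
        }
    }
}

int main(void) {
    src = fopen("input.c", "r");       /* hypothetical source file */
    if (!src) return 1;
    fill(buf);                         /* load the first half */
    fp = buf;
    int c;
    while ((c = next_char()) != -1)
        putchar(c);                    /* a real scanner would tokenize here */
    fclose(src);
    return 0;
}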

Specification of tokens:

There are three specifications of tokens:

1) Strings
2) Language
3) Regular expression

Strings and Languages


An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
A language is any countable set of strings over some fixed alphabet.

In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings.

The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more symbols


from the end of string s. For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For
example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively of s that are not ε or not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s. For example, baan is a subsequence of banana.

Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure

The following example shows these operations. Let L = {0, 1} and S = {a, b, c}. Then:
Union: L ∪ S = {0, 1, a, b, c}
Concatenation: LS = {0a, 0b, 0c, 1a, 1b, 1c}
Kleene closure: L* = the set of all strings over {0, 1}, including ε
Positive closure: L+ = the set of all non-empty strings over {0, 1}

Regular Expressions
Each regular expression r denotes a language L(r). Here are the rules that define the regular
expressions over some alphabet Σ and the languages that those expressions denote:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is
the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
   a) (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
   b) (r)(s) is a regular expression denoting the language L(r)L(s).
   c) (r)* is a regular expression denoting (L(r))*.
   d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
Regular set

A language that can be defined by a regular expression is called a regular set. If two
regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s.

There are a number of algebraic laws for regular expressions that can be used to
manipulate them into equivalent forms.
For instance, r|s = s|r (| is commutative); r|(s|t) = (r|s)|t (| is associative).

Regular Definitions

Giving names to regular expressions is referred to as a regular definition. If Σ is an
alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
………
dn → rn
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Example: Identifiers are the set of strings of letters and digits beginning with a letter. A
regular definition for this set:
letter → A | B | …. | Z | a | b | …. | z
digit → 0 | 1 | …. | 9
id → letter ( letter | digit )*
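This regular definition translates directly into a scanning loop. A small C sketch (the function name is illustrative) that checks a string against id → letter ( letter | digit )*:

#include <ctype.h>
#include <stdio.h>

/* Does s match id -> letter ( letter | digit )* ?  A direct
   transcription of the regular definition above. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;   /* must begin with a letter */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;
    return 1;
}

int main(void) {
    printf("%d\n", is_identifier("rate2"));   /* 1: matches the definition */
    printf("%d\n", is_identifier("2rate"));   /* 0: begins with a digit */
    return 0;
}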

Shorthands

Certain constructs occur so frequently in regular expressions that it is convenient to
introduce notational shorthands for them.
1. One or more instances (+):
- The unary postfix operator + means “one or more instances of”.
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular
expression that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a’s.
- The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance (?):
- The unary postfix operator ? means “zero or one instance of”.
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then (r)? is a regular expression that denotes the language
L(r) ∪ {ε}.
3. Character Classes:
- The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression
a | b | c.
- A character class such as [a–z] denotes the regular expression a | b | c | d | …. | z.
- We can describe identifiers as being strings generated by the regular expression
[A–Za–z][A–Za–z0–9]*.

Non-regular Set

A language which cannot be described by any regular expression is a non-regular set.
Example: the set of all strings of balanced parentheses and the set of repeating strings cannot
be described by a regular expression. These sets can be specified by a context-free grammar.

Regular grammar & language definition.

Regular Languages are the most restricted types of languages and are accepted by finite
automata.

Regular Expressions.

Regular Expressions are used to denote regular languages. An expression is regular if:
 ɸ is a regular expression for the regular language ɸ.
 ɛ is a regular expression for the regular language {ɛ}.
 If a ∈ Σ (Σ represents the input alphabet), a is a regular expression with language {a}.
 If a and b are regular expressions, a + b is also a regular expression, with language {a, b}.
 If a and b are regular expressions, ab (the concatenation of a and b) is also regular.
 If a is a regular expression, a* (0 or more repetitions of a) is also regular.

Regular Expression → Regular Language

(a∪e∪i∪o∪u) : set of vowels → {a, e, i, o, u}

(a.b*) : a followed by 0 or more b's → {a, ab, abb, abbb, abbbb, ….}

v*.c* : any number of vowels followed by any number of consonants (where v represents
vowels and c represents consonants) → {ε, a, aou, aiou, b, abcd, …..}, where ε represents
the empty string (in case of 0 vowels and 0 consonants)

Regular Grammar : A grammar is regular if it has rules of form A -> a or A -> aB or A -> ɛ
where ɛ is a special symbol called NULL.

Regular Languages : A language is regular if it can be expressed in terms of regular


expression.
Closure Properties of Regular Languages.

Union: If L1 and L2 are two regular languages, their union L1 ∪ L2 will also be regular.
For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1 ∪ L2 = {a^n | n ≥ 0} ∪ {b^n | n ≥ 0} is also regular.

Intersection: If L1 and L2 are two regular languages, their intersection L1 ∩ L2 will also
be regular. For example,
L1 = {a^m b^n | n ≥ 0 and m ≥ 0} and L2 = {a^m b^n | n ≥ 0 and m ≥ 0} ∪ {b^n a^m | n ≥ 0 and m ≥ 0}
L3 = L1 ∩ L2 = {a^m b^n | n ≥ 0 and m ≥ 0} is also regular.

Concatenation: If L1 and L2 are two regular languages, their concatenation L1.L2 will also
be regular. For example,
L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1.L2 = {a^m b^n | m ≥ 0 and n ≥ 0} is also regular.

Kleene Closure: If L1 is a regular language, its Kleene closure L1* will also be regular. For
example,
L1 = (a ∪ b)
L1* = (a ∪ b)*.

Complement: If L(G) is a regular language, its complement L’(G) will also be regular. The
complement of a language can be found by subtracting the strings which are in L(G) from all
possible strings. For example,
L(G) = {a^n | n > 3}
L’(G) = {a^n | n ≤ 3}

Note : Two regular expressions are equivalent if languages generated by them are same. For
example, (a+b*)* and (a+b)* generate same language. Every string which is generated by
(a+b*)* is also generated by (a+b)* and vice versa.

How to solve problems on regular expression and regular languages?

Question 1 : Which one of the following languages over the alphabet {0,1} is described by the
regular expression?
(0+1)*0(0+1)*0(0+1)*
(A) The set of all strings containing the substring 00.
(B) The set of all strings containing at most two 0’s.
(C) The set of all strings containing at least two 0’s.
(D) The set of all strings that begin and end with either 0 or 1.

Solution: Option A says that the string must have the substring 00. But 10101 is also part of the
language and does not contain 00 as a substring, so it is not the correct option.
Option B says that the string can have at most two 0’s, but 00000 is also part of the language, so
it is not the correct option.
Option C says that the string must contain at least two 0’s. In the regular expression, two 0’s are
present, so this is the correct option.
Option D says the language contains all strings that begin and end with either 0 or 1. But the
expression can also generate strings which start with 0 and end with 1 or vice versa, so it is not correct.

Question 2 : Which of the following languages is generated by given grammar?


S -> aS | bS | ∊
(A) {an bm | n,m ≥ 0}
(B) {w ∈ {a,b}* | w has equal number of a’s and b’s}
(C) {an | n ≥ 0} ∪ {bn | n ≥ 0} ∪ {an bn | n ≥ 0}
(D) {a,b}*
Solution: Option (A) says the string will have 0 or more a’s followed by 0 or more b’s. But S -> bS
=> baS => ba is also part of the language, so (A) is not correct.
Option (B) says the string will have an equal number of a’s and b’s. But S -> bS => b is also part of
the language, so (B) is not correct.
Option (C) says either the string will have 0 or more a’s, or 0 or more b’s, or a’s followed by b’s. But as
shown for option (A), ba is also part of the language, so (C) is not correct.
Option (D) says the string can have any number of a’s and any number of b’s in any order, so (D) is
correct.

Question 3 : The regular expression 0*(10*)* denotes the same set as


(A) (1*0)*1*
(B) 0 + (0 + 10)*
(C) (0 + 1)* 10(0 + 1)*
(D) none of these

Solution: Two regular expressions are equivalent if the languages generated by them are the same.
Option (A) can generate all strings generated by 0*(10*)*, so they are equivalent.
Option (B) cannot generate the null string, but 0*(10*)* can, so they are not equivalent.
Option (C) always has 10 as a substring, but 0*(10*)* may or may not, so they are not equivalent.

Question 4 : The regular expression for the language having input alphabets a and b, in which
two a’s do not come together:
(A) (b + ab)* + (b +ab)*a
(B) a(b + ba)* + (b + ba)*
(C) both options (A) and (B)
(D) none of the above
Solution:
Option (C), stating both options (A) and (B), is the correct regular expression for the stated
question.
The language in the question can be expressed as
L = {ε, a, b, bb, ab, aba, ba, bab, baba, abab, …}.
In option (A), ‘ab’ is considered the building block for finding out the required regular
expression. (b + ab)* covers all cases of strings generated ending with ‘b’; (b + ab)*a covers all
cases of strings generated ending with ‘a’.
Applying similar logic for option (B), we can see that the regular expression is derived
considering ‘ba’ as the building block, and it covers all cases of strings starting with ‘a’ and
starting with ‘b’.
Compiler Design

Module-I

Syntax Analysis:

The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the
tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream. A typical representation is a syntax tree in which each interior
node represents an operation and the children of the node represent the arguments of the operation. A syntax
tree for the token stream (1.2) is shown as the output of the syntactic analyzer in Fig. 1.7.
This tree shows the order in which the operations in the assignment
position = initial + rate * 60
are to be performed. The tree has an interior node labeled * with (id, 3) as its left child and the integer 60 as
its right child. The node (id, 3) represents the identifier rate. The node labeled * makes it explicit that we
must first multiply the value of rate by 60. The node labeled + indicates that we must add the result of this
multiplication to the value of initial. The root of the tree, labeled =, indicates that we must store the result of
this addition into the location for the identifier position. This ordering of operations is consistent with
the usual conventions of arithmetic, which tell us that multiplication has higher precedence than addition, and
hence that the multiplication is to be performed before the addition.
The subsequent phases of the compiler use the grammatical structure to help analyze the source program and
generate the target program. In Chapter 4 we shall use context-free grammars to specify the grammatical
structure of programming languages and discuss algorithms for constructing efficient syntax analyzers
automatically from certain classes of grammars.
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source
program for semantic consistency with the language definition. It also gathers type information and saves it
in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands. For example, many programming language definitions require an array index to be an
integer; the compiler must report an error if a floating-point number is used to index an array.
Grammar:
The productions of a grammar specify the manner in which the terminals and non-terminals can be
combined to form strings. Each production consists of a non-terminal called the left side of the production,
an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.

Context free grammar:

Context free grammar is a formal grammar which is used to generate all possible strings in a given formal
language.
Context free grammar G can be defined by four tuples as:
G = (V, T, P, S)
where,
G describes the grammar,
T describes a finite set of terminal symbols,
V describes a finite set of non-terminal symbols,
P describes a set of production rules, and
S is the start symbol.
In CFG, the start symbol is used to derive the string. You can derive the string by repeatedly
replacing a non-terminal by the right-hand side of a production, until all non-terminals have been
replaced by terminal symbols.

Example:

L = {wcw^R | w ∈ {a, b}*}

Production rules:

1. S → aSa
2. S → bSb
3. S → c

Now check that the string abbcbba can be derived from the given CFG:

1. S ⇒ aSa
2. S ⇒ abSba
3. S ⇒ abbSbba
4. S ⇒ abbcbba

By applying the productions S → aSa and S → bSb recursively and finally applying the production
S → c, we get the string abbcbba.
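Although this language is context-free and not regular, membership is easy to check directly. A small C sketch that mirrors the derivation, matching symbols from both ends toward the middle c (the function name in_language is an illustrative assumption):

#include <stdio.h>
#include <string.h>

/* Check whether s is in { w c w^R | w in {a,b}* }: the i-th symbol
   from the front must equal the i-th symbol from the back, with 'c'
   in the middle -- exactly how S -> aSa | bSb | c builds the string. */
int in_language(const char *s) {
    size_t len = strlen(s);
    if (len == 0 || len % 2 == 0) return 0;   /* length must be odd */
    size_t i = 0, j = len - 1;
    while (i < j) {
        if (s[i] != s[j] || (s[i] != 'a' && s[i] != 'b')) return 0;
        i++, j--;
    }
    return s[i] == 'c';                       /* middle symbol is the c */
}

int main(void) {
    printf("%d\n", in_language("abbcbba"));   /* 1 */
    printf("%d\n", in_language("abccba"));    /* 0 */
    return 0;
}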
Capabilities of CFG:

There are the various capabilities of CFG:

o Context free grammar is useful to describe most of the programming languages.


o If the grammar is properly designed then an efficient parser can be constructed automatically.
o Using associativity & precedence information, suitable grammars for expressions can
be constructed.
o Context free grammar is capable of describing nested structures like: balanced parentheses, matching
begin-end, corresponding if-then-else's & so on.
Derivation

Derivation is a sequence of production rules. It is used to get the input string through these production rules.
During parsing we have to take two decisions. These are as follows:

o We have to decide the non-terminal which is to be replaced.


o We have to decide the production rule by which the non-terminal will be replaced.

We have two options to decide which non-terminal to be replaced with production rule.

Left-most Derivation

In the left-most derivation, the input is scanned and replaced with the production rule from left to right. So in
a left-most derivation we read the input string from left to right.

Example:

Production rules:

1. S = S + S
2. S = S - S
3. S = a | b |c

Input:

a-b+c

The left-most derivation is:

1. S=S+S
2. S=S-S+S
3. S=a-S+S
4. S=a-b+S
5. S=a-b+c

Right-most Derivation

In the right-most derivation, the input is scanned and replaced with the production rule from right to left. So
in a right-most derivation we read the input string from right to left.

Example:
1. S = S + S
2. S = S - S
3. S = a | b |c
Input:

a-b+c

The right-most derivation is:

1. S=S-S
2. S=S-S+S
3. S=S-S+c
4. S=S-b+c
5. S=a-b+c

Parsing:

The process of transforming data from one format to another is called parsing. This process is
carried out by the parser, a component of the translator that helps to organise the linear text
structure according to a set of defined rules, known as a grammar.

Ambiguity:

A grammar is said to be ambiguous if there exists more than one leftmost derivation, more than one
rightmost derivation, or more than one parse tree for a given input string. If the grammar has no such
string, then it is called unambiguous.

Example:
1. S = aSb | SS
2. S = ε

For the string aabb, the above grammar generates two parse trees.
If a grammar has ambiguity then it is not good for compiler construction. No method can automatically
detect and remove the ambiguity, but you can remove ambiguity by re-writing the whole grammar without
ambiguity.

Types of Parsers:

The parser is that phase of the compiler which takes a token string as input and with the help of existing
grammar, converts it into the corresponding Intermediate Representation(IR). The parser is also known
as Syntax Analyzer.

The parser is mainly classified into two categories, i.e. Top-down Parser, and Bottom-up Parser. These are
explained below:
Top-Down Parser:
The top-down parser is the parser that generates the parse tree for the given input string with the help of
grammar productions by expanding the non-terminals, i.e. it starts from the start symbol and ends on the
terminals. It uses leftmost derivation.
Further, the top-down parser is classified into 2 types: the recursive descent parser and the non-recursive
descent parser.
parser.
1. Recursive descent parser is also known as the Brute force parser or the backtracking parser. It
basically generates the parse tree by using brute force and backtracking.
2. Non-recursive descent parser is also known as LL(1) parser or predictive parser or without
backtracking parser or dynamic parser. It uses a parsing table to generate the parse tree instead of
backtracking.
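As a hedged sketch of recursive descent, here is a predictive recursive descent parser in C for the expression grammar used in the LL(1) examples later in this module (E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> i | (E), with i standing for an id token). Because that grammar is left-factored and free of left recursion, no backtracking is needed; function and variable names are illustrative:

#include <stdio.h>
#include <stdlib.h>

/* One C function per non-terminal; each function consumes the part
   of the input its non-terminal derives. */
static const char *p;

static void error(void) { printf("reject\n"); exit(0); }
static void match(char c) { if (*p == c) p++; else error(); }

static void E(void); static void Ep(void);
static void T(void); static void Tp(void);
static void F(void);

static void E(void)  { T(); Ep(); }
static void Ep(void) { if (*p == '+') { match('+'); T(); Ep(); } }  /* else: epsilon */
static void T(void)  { F(); Tp(); }
static void Tp(void) { if (*p == '*') { match('*'); F(); Tp(); } }  /* else: epsilon */
static void F(void) {
    if (*p == 'i') match('i');          /* F -> i */
    else { match('('); E(); match(')'); }   /* F -> (E) */
}

int main(void) {
    p = "i+i*i";        /* e.g. the token string for a + b * c */
    E();
    if (*p == '\0') printf("accept\n"); else error();
    return 0;
}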
Bottom-up Parser:
Bottom-up Parser is the parser that generates the parse tree for the given input string with the help of
grammar productions by reducing (compressing) towards the start symbol, i.e. it starts from the
terminals of the input string and ends at the start symbol. It uses the reverse of the rightmost derivation.
Further Bottom-up parser is classified into two types: LR parser, and Operator precedence parser.
 LR parser is the bottom-up parser that generates the parse tree for the given string by using
unambiguous grammar. It follows the reverse of the rightmost derivation.
LR parser is of four types:
(a)LR(0).
(b)SLR(1).
(c)LALR(1).
(d)CLR(1) .

 Operator precedence parser generates the parse tree from the given grammar and string, with the
restriction that two consecutive non-terminals or an epsilon never appear on the right-hand side of any
production.
 The operator precedence parsing techniques can be applied to Operator grammars.
 Operator grammar: A grammar is said to be an operator grammar if no production rule has, on its
right-hand side,
1. ε (epsilon), or
2. two non-terminals appearing consecutively, that is, without any terminal between them.
Operator precedence parsing is not a simple technique to apply to most language constructs, but it
is an easy technique to implement where a suitable grammar can be produced.

SLR, CLR and LALR Parsers.

SLR parser, CLR parser and LALR parser which are the parts of Bottom Up parser.
SLR Parser The SLR parser is similar to the LR(0) parser except for the reduce entries: the reduce
actions are written only in the FOLLOW of the variable whose production is reduced.
Construction of SLR parsing table –
1. Construct C = { I0, I1, ……. In}, the collection of sets of LR(0) items for G’.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
 If [A -> α.aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to “shift j”. Here a must be
a terminal.
 If [A -> α.] is in Ii, then set ACTION[i, a] to “reduce A -> α” for all a in FOLLOW(A); here A may
not be S’.
 If [S’ -> S.] is in Ii, then set ACTION[i, $] to “accept”. If any conflicting actions are generated by the
above rules, we say that the grammar is not SLR.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: if GOTO( Ii , A ) =
Ij then GOTO [i, A] = j.
4. All entries not defined by rules 2 and 3 are made error.
Eg: If in the parsing table we have multiple entries then it is said to be a conflict.
Consider the grammar E -> T+E | T
T ->id
Augmented grammar - E’ -> E
E -> T+E | T
T -> id
Every SLR grammar is unambiguous but there are many unambiguous grammars that are not SLR.

CLR PARSER In the SLR method we were working with LR(0) items. In CLR parsing we will be using
LR(1) items. An LR(k) item is defined to be an item using lookaheads of length k. So, the LR(1) item is
comprised of two parts: the LR(0) item and the lookahead associated with the item. LR(1) parsers are
more powerful parsers. For LR(1) items we modify the Closure and GOTO functions.
Closure Operation
Closure(I)
repeat
    for (each item [A -> α.Bβ, a] in I)
        for (each production B -> γ in G’)
            for (each terminal b in FIRST(βa))
                add [B -> .γ, b] to set I;
until no more items are added to I;
return I;
Let’s understand it with an example –

Goto Operation
Goto(I, X)
initialise J to be the empty set;
for (each item [A -> α.Xβ, a] in I)
    add item [A -> αX.β, a] to set J;   /* move the dot one step */
return Closure(J);   /* apply closure to the set */
Eg-
LR(1) items
void items(G’)
initialise C to {Closure({[S’ -> .S, $]})};
repeat
    for (each set of items I in C)
        for (each grammar symbol X)
            if (GOTO(I, X) is not empty and not in C)
                add GOTO(I, X) to C;
until no new sets of items are added to C;
Construction of GOTO graph
 State I0 – closure of augmented LR(1) item.
 Using I0 find all collection of sets of LR(1) items with the help of DFA
 Convert DFA to LR(1) parsing table
Construction of CLR parsing table- Input – augmented grammar G’
1. Construct C = { I0, I1, ……. In }, the collection of sets of LR(1) items for G’.
2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
   i) If [A -> α.aβ, b] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to “shift j”. Here a must
be a terminal.
   ii) If [A -> α., a] is in Ii, A ≠ S’, then set ACTION[i, a] to “reduce A -> α”.
   iii) If [S’ -> S., $] is in Ii, then set ACTION[i, $] to “accept”.
   If any conflicting actions are generated by the above rules, we say that the grammar is not CLR.
3. The goto transitions for state i are constructed for all nonterminals A using the rule: if GOTO( Ii, A ) =
Ij then GOTO [i, A] = j.
4. All entries not defined by rules 2 and 3 are made error.
Eg:
Consider the following grammar
S -> AaAb | BbBa
A -> ε
B -> ε
Augmented grammar - S’ -> S
S -> AaAb | BbBa
A -> ε
B -> ε
GOTO graph for this grammar will be -

Note – if a state has two reductions with the same lookahead, then there will be multiple entries in the
parsing table, thus a conflict. If a state has one reduction and there is a shift from that state on a terminal
that is the same as the lookahead of the reduction, then this will also lead to multiple entries in the
parsing table, thus a conflict.
LALR PARSER LALR parsers are the same as CLR parsers, with one difference: in the CLR parser, if two
states differ only in their lookaheads, then we combine those states in the LALR parser. After this
minimisation, if the parsing table has no conflict, then the grammar is LALR as well. Eg:
consider the grammar S ->AA
A -> aA | b
Augmented grammar - S’ -> S
S ->AA
A -> aA | b
Important Notes 1. Even though the CLR parser does not have RR conflicts, the LALR parser may contain
RR conflicts. 2. If the number of states of LR(0) = n1, the number of states of SLR = n2, the number of
states of LALR = n3, and the number of states of CLR = n4, then n1 = n2 = n3 <= n4.

Predictive parsing LL(1) parsing:

Construction of LL(1) Parsing Table:


A top-down parser builds the parse tree from the top down, starting with the start non-terminal. There are
two types of top-down parsers:
1. Top-down parsers with backtracking
2. Top-down parsers without backtracking
(Prerequisites: classification of top-down parsers, FIRST set, FOLLOW set.)
Top-down parsers without backtracking can further be divided into two parts: recursive descent
parsers and non-recursive descent parsers. Here we are going to discuss Non-Recursive Descent,
which is also known as the LL(1) Parser.
LL(1) Parsing: Here the first L represents that the scanning of the input will be done from Left to Right,
and the second L shows that in this parsing technique we are going to use the Leftmost Derivation
Tree. Finally, the 1 represents the number of lookaheads, which means how many input symbols the
parser examines when it makes a decision.

Essential conditions to check first are as follows:


1. The grammar is free from left recursion.
2. The grammar should not be ambiguous.
3. The grammar has to be left factored so that it is deterministic.
These conditions are necessary but not sufficient for a grammar to be LL(1).
Algorithm to construct LL(1) Parsing Table:

Step 1: First check all the essential conditions mentioned above and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
1. First(): if from a variable we try to derive all possible strings, the set of beginning
terminal symbols is called the First of that variable.
2. Follow(): the set of terminal symbols which can follow a variable in the process of derivation.
Step 3: For each production A –> α. (A tends to alpha)
1. Find First(α) and for each terminal in First(α), make entry A –> α in the table.
2. If First(α) contains ε (epsilon) as terminal, then find the Follow(A) and for each terminal in Follow(A),
make entry A –> ε in the table.
3. If the First(α) contains ε and Follow(A) contains $ as terminal, then make entry A –> ε in the table for
the $.
To construct the parsing table, we use these two functions, First() and Follow().
In the table, the rows will contain the non-terminals and the columns will contain the terminal symbols. All
the null productions of the grammar will go under the Follow elements, and the remaining productions
will lie under the elements of their First set.

Now, let’s understand with an example.


Example-1: Consider the Grammar:
E --> TE'
E' --> +TE' | ε
T --> FT'
T' --> *FT' | ε
F --> id | (E)
*ε denotes epsilon
Step1 – The grammar satisfies all properties in step 1
Step 2 – calculating first() and follow()
Find their First and Follow sets:

First Follow

E –> TE’ { id, ( } { $, ) }

E’ –> +TE’/ε { +, ε } { $, ) }

T –> FT’ { id, ( } { +, $, ) }

T’ –> *FT’/ε { *, ε } { +, $, ) }

F –> id/(E) { id, ( } { *, +, $, ) }

Step 3 – making parser table


Now, the LL(1) Parsing Table is:

        id          +            *            (           )          $

E       E –> TE’                              E –> TE’

E’                  E’ –> +TE’                            E’ –> ε    E’ –> ε

T       T –> FT’                              T –> FT’

T’                  T’ –> ε      T’ –> *FT’               T’ –> ε    T’ –> ε

F       F –> id                               F –> (E)

As you can see, all the null productions are put under the Follow set of that symbol, and all the
remaining productions lie under the First set of that symbol.
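A hedged C sketch of a table-driven LL(1) driver for exactly this table (E' and T' are written e and t, id is written i; the names table and parse are illustrative assumptions):

#include <stdio.h>
#include <string.h>

/* Parsing table from Example 1; a NULL result is a blank (error) cell. */
static const char *table(char nt, char a) {
    switch (nt) {
    case 'E': if (a == 'i' || a == '(') return "Te";  break;
    case 'e': if (a == '+') return "+Te";
              if (a == ')' || a == '$') return "";    break;  /* E' -> epsilon */
    case 'T': if (a == 'i' || a == '(') return "Ft";  break;
    case 't': if (a == '*') return "*Ft";
              if (a == '+' || a == ')' || a == '$') return ""; break;  /* T' -> epsilon */
    case 'F': if (a == 'i') return "i";
              if (a == '(') return "(E)";             break;
    }
    return NULL;
}

int parse(const char *w) {              /* w must end in '$' */
    char stack[100];
    int top = 0;
    stack[top++] = '$';
    stack[top++] = 'E';                 /* start symbol */
    while (top > 0) {
        char X = stack[--top];
        if (X == *w) { w++; continue; } /* terminal (or $) matched */
        const char *rhs = table(X, *w);
        if (!rhs) return 0;             /* blank entry or terminal mismatch */
        for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
            stack[top++] = rhs[i];      /* push the chosen RHS reversed */
    }
    return 1;                           /* stack emptied: input accepted */
}

int main(void) {
    printf("%d\n", parse("i+i*i$"));    /* 1 */
    printf("%d\n", parse("i+*i$"));     /* 0 */
    return 0;
}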
Note: Not every grammar is feasible for an LL(1) parsing table. It may be possible that one cell
contains more than one production.
Example-2:
Consider the Grammar
S --> A | a
A --> a
Step1 – The grammar does not satisfy all properties in step 1, as the grammar is ambiguous. Still, let’s try
to make the parser table and see what happens
Step 2 – calculating first() and follow()
Find their First and Follow sets:

First Follow

S –> A/a {a} {$}

A –>a {a} {$}

Step 3 – making parser table


Parsing Table:

a $

S S –> A, S –> a

A A –> a

Here, we can see that there are two productions in the same cell. Hence, this grammar is not feasible for
LL(1) Parser.
Trick – Above grammar is ambiguous grammar. So the grammar does not satisfy the essential conditions.
So we can say that this grammar is not feasible for LL(1) Parser even without making the parse table.
Example 3: Consider the Grammar
S -> (L) | a
L -> SL'
L' -> )SL' | ε
Step1 – The grammar satisfies all properties in step 1
Step 2 – calculating first() and follow()

First Follow

S {(,a} { $, ) }

L {(,a} {)}

L’ {),ε} {)}
Step 3 – making parser table
Parsing Table:

        (            )                      a          $

S       S -> (L)                            S -> a

L       L -> SL’                            L -> SL’

L’                   L’ -> )SL’ , L’ -> ε
Here, we can see that there are two productions in the same cell. Hence, this grammar is not feasible for
the LL(1) parser, even though it satisfies all the essential conditions of step 1. In example 2 we saw that
the essential conditions are necessary, and in example 3 we see that those conditions are not sufficient
for a grammar to be LL(1).
Transformation on the grammars:

These notes describe methods of transforming grammars. We are motivated by the desire to transform a
grammar into an LL(1) grammar. If a grammar can be so transformed, it allows access to the predictive
top-down parsing techniques (i.e. either a table-driven pushdown automaton or a recursive descent parser). The
bad news is that, given an arbitrary context free grammar, it is undecidable whether there is an LL(1) (or
LL(k)) grammar which will recognize the same language: thus, ultimately our objective is impossible!

In fact, this is not quite the problem we wish to solve! We are not simply looking for a completely arbitrary
and unrelated grammar which by some remarkable fluke happens to recognize the same language. We
require a transformation between the set of parse trees ... this because we need to transfer the semantics (or
meaning) intended by the old grammar onto the new grammar. Thus, we really need a step by step
transformation system which allows us to also keep track of how we have transformed the parse tree.

We may see the problem as follows: we are given a grammar G and we wish to transform it to a grammar G'
while providing a translation T: G' -> G such that

INPUT
/ \
parse / \ parse
/ \
/ \
\/ \/

G' -----------> G
T
commutes, in the sense that parsing an input for G' and then translating the parse tree into one for G is the
same as parsing straight into G. Furthermore, it is desirable for the translation T to be given by a linear
time "fold" (an attribute grammar). As we shall see we can do a number of useful transformation steps in the
direction of obtaining an LL(1) grammar. These steps almost amount to a procedure for generating an
equivalent LL(1) grammar (if one exists) -- the one aspect we shall not discuss is how to eliminate the
follow set clashes.

The basic methods for transforming grammars which we shall look at are:

 Elimination of left recursion (dually elimination of right recursion)


 Left factoring (dually right factoring)
 Nonterminal expansion.

These techniques are often sufficient to bring a grammar to heel! However, it should be noted that if the
original grammar is ambiguous then, as these techniques "preserve meaning", the result will still be
ambiguous and so necessarily will fail to be LL(1) or to belong to any other unambiguous class (such as
LL(k) or LR(k)). Furthermore, if the language is simply is not LL(1) then these techniques will also fail.
Thus, you should not expect them to deliver magically an LL(1) grammar!
Set against the fact that the transformations may not help are:

1. By trying to transform the grammar you may expose aspects of the grammar which were not apparent
in its original form;
2. Many of the grammars you will meet in practice can be transformed to LL(1)! If they can be
transformed then one has access to particularly simple parsing techniques with good error reporting.

I lay more stress on grammar transformation techniques than most texts.

Eliminating left recursion:

Eliminating left recursion is the most difficult transformation step. I shall present an approach that has the
following advantages:

 it allows one to carry out the transformation process by hand relatively easily
 it is directly applicable to most grammars as it allows nullable productions
 it tries to keep as many of the original features of the grammar intact as is possible and tries to keep
the size of the grammar under control.

Transformations highlight the difference between the presentation (that is the grammar) of a language and
the language itself (the set of strings recognized). Recall many grammars can recognize the same language
but only one language is recognized by a grammar: transformations certainly must preserve the language ...
we also require that they allow one to recover the parse tree as well.

The algorithm I will describe works on any grammar which has no cycles and is null unambiguous. A
grammar has a cycle if there is a nonterminal x with a nontrivial derivation x =+=> x. Grammars with
cycles are quite pathological (and ambiguous): they are of little interest in parsing applications so this does
not unduly limit the applicability of this algorithm. Importantly, there is a straightforward test of whether a
grammar has cycles. Null ambiguous grammars are, similarly, quite pathological (and ambiguous) and null
ambiguity can be easily determined.
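As a worked illustration of the standard step for immediate left recursion: a pair of productions of the form A -> A α | β is replaced by A -> β A' and A' -> α A' | ε, where A' is a fresh nonterminal. Applying this to the left-recursive expression productions E -> E + T | T gives E -> T E' and E' -> + T E' | ε, which is exactly the shape of the LL(1) grammar used in Example 1 above. Both grammars recognize the same strings; a parse tree for the original grammar can be recovered from the new one by a linear-time fold over the right spine of A' nodes, in the sense discussed above.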

operator precedence grammars:

Operator grammar and precedence parser :

A grammar that is used to define mathematical operators is called an operator grammar or operator
precedence grammar. Such grammars have the restriction that no production has either an empty
right-hand side (null productions) or two adjacent non-terminals in its right-hand side.
Examples –
This is an example of operator grammar:
E->E+E/E*E/id
However, the grammar given below is not an operator grammar because two non-terminals are adjacent to
each other:
S->SAS/a
A->bSb/b
We can convert it into an operator grammar, though:
S->SbSbS/SbS/a
A->bSb/b
Operator precedence parser –
An operator precedence parser is a bottom-up parser that interprets an operator grammar. This parser is
only used for operator grammars. Ambiguous grammars are not allowed in any parser except operator
precedence parser.
There are two methods for determining what precedence relations should hold between a pair of terminals:
1. Use the conventional associativity and precedence of operator.
2. The second method of selecting operator-precedence relations is first to construct an unambiguous
grammar for the language, a grammar that reflects the correct associativity and precedence in its parse
trees.
This parser relies on the following three precedence relations: ⋖, ≐, ⋗
a ⋖ b This means a “yields precedence to” b.
a ⋗ b This means a “takes precedence over” b.
a ≐ b This means a “has same precedence as” b.

Figure – Operator precedence relation table for grammar E->E+E/E*E/id


There is no relation given between id and id, as id will never be compared with id and two variables
cannot come side by side. There is also a disadvantage of this table: if we have n operators, then the size
of the table will be n × n and the space complexity will be O(n^2). In order to decrease the size of the
table, we use the operator function table.
Operator precedence parsers usually do not store the precedence table with the relations; rather they are
implemented in a special way. Operator precedence parsers use precedence functions that map terminal
symbols to integers, and the precedence relations between the symbols are implemented by numerical
comparison. The parsing table can be encoded by two precedence functions f and g that map terminal
symbols to integers. We select f and g such that:
1. f(a) < g(b) whenever a yields precedence to b
2. f(a) = g(b) whenever a and b have the same precedence
3. f(a) > g(b) whenever a takes precedence over b
Example – Consider the following grammar:
E -> E + E/E * E/( E )/id
This is the directed graph representing the precedence function:

Since there is no cycle in the graph, we can make this function table:

fid -> g* -> f+ -> g+ -> f$

gid -> f* -> g* -> f+ -> g+ -> f$
Size of the table is 2n.
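A hedged C sketch of how the two functions drive parsing decisions by numerical comparison; the integer values below are read off the longest-path lengths in the chains above and are one valid choice, not the only one:

#include <stdio.h>

/* Illustrative precedence functions for the terminals id (i), +, *, $.
   Any f, g with f(a) < g(b) exactly when a yields precedence to b would do. */
static int f(char a) {
    switch (a) { case 'i': return 4; case '*': return 4;
                 case '+': return 2; default:  return 0; }   /* '$' */
}
static int g(char b) {
    switch (b) { case 'i': return 5; case '*': return 3;
                 case '+': return 1; default:  return 0; }   /* '$' */
}

/* Decide between shift and reduce by comparing the two integers. */
const char *action(char top, char next) {
    if (f(top) < g(next)) return "shift";    /* top yields precedence */
    if (f(top) > g(next)) return "reduce";   /* top takes precedence */
    return "same precedence";
}

int main(void) {
    printf("+ then *: %s\n", action('+', '*'));  /* shift: * binds tighter */
    printf("* then +: %s\n", action('*', '+'));  /* reduce */
    return 0;
}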
One disadvantage of function tables is that even though we have blank entries in the relation table, we have
non-blank entries in the function table. Blank entries are treated as errors; hence the error-detection
capability of the relation table is greater than that of the function table.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

// Report failure and exit: the operator-grammar
// condition is violated.
void f()
{
    printf("Not operator grammar");
    exit(0);
}

int main()
{
    char grm[20][20], c;

    // flag = 1 means the productions seen so far satisfy
    // the expected operator-grammar form A=A<op>A
    int i, n, j, flag = 0;

    // take the number of productions from the user
    scanf("%d", &n);
    for (i = 0; i < n; i++)
        scanf("%s", grm[i]);

    for (i = 0; i < n; i++) {
        j = 2;               // restart at the right-hand side of each production
        c = grm[i][2];
        while (c != '\0') {
            if (grm[i][3] == '+' || grm[i][3] == '-'
                || grm[i][3] == '*' || grm[i][3] == '/')
                flag = 1;    // a terminal operator separates the non-terminals
            else {
                flag = 0;
                f();         // two adjacent non-terminals: reject
            }
            if (c == '$') {  // $ marks a null production: reject
                flag = 0;
                f();
            }
            c = grm[i][++j];
        }
    }
    if (flag == 1)
        printf("Operator grammar");
    return 0;
}

Input: 3
A=A*A
B=AA
A=$

Output: Not operator grammar

Input: 2
A=A/A
B=A+A

Output: Operator grammar


Here $ marks a null production, which is also not allowed in operator grammars.
Advantages –

1. It can easily be constructed by hand.


2. It is simple to implement this type of parsing.
Disadvantages –
1. It is hard to handle tokens like the minus sign (-), which has two different precedences
(depending on whether it is unary or binary).
2. It is applicable only to a small class of grammars.
