Compiler Notes

UNIT I- LEXICAL ANALYSIS

INTRODUCTION TO COMPILING:

Translator:
It is a program that translates one language to another.

Types of Translator:
1. Interpreter
2. Assembler
3. Compiler

1. Interpreter:
It translates a high-level language program to a low-level form and executes it line by line.

During execution, it checks each line for errors and reports them immediately.


Example: BASIC, early versions of Pascal.

2. Assembler:
It translates assembly language to machine code.

Example: assemblers for the 8085 and 8086 microprocessors.

3. Compiler:
It is a program that translates one language (the source code) to another language (the
target code).

It scans the entire program first and then reports all the errors together.


Example: C, C++, COBOL, later versions of Pascal.

Difference between compiler and interpreter:


Compiler                                          Interpreter
Translates the entire high-level program          Translates and executes the high-level
to low-level code in one pass.                    program line by line.
Displays all the errors after the whole           Checks line by line for errors and
program is scanned.                               reports them immediately.
Examples: C, C++, COBOL, later versions           Examples: BASIC, early versions of
of Pascal.                                        Pascal.

PARTS OF COMPILATION :
There are 2 parts to compilation:
1. Analysis
2. Synthesis
Analysis part breaks down the source program into constituent pieces and creates an intermediate
representation of the source program.
Synthesis part constructs the desired target program from the intermediate representation.

Software tools used in Analysis part:

1) Structure editor:
✓ Takes as input a sequence of commands to build a source program.
✓ The structure editor not only performs the text-creation and modification functions of
an ordinary text editor, but it also analyzes the program text, putting an appropriate
hierarchical structure on the source program.
✓ For example, it can supply keywords automatically: while … do and begin … end.
2) Pretty printers :
✓ A pretty printer analyzes a program and prints it in such a way that the structure of
the program becomes clearly visible.
✓ For example, comments may appear in a special font.
3) Static checkers :
✓ A static checker reads a program, analyzes it, and attempts to discover potential bugs
without running the program.
✓ For example, a static checker may detect that parts of the source program can never
be executed.
4) Interpreters :
✓ Executes programs written in a high-level language (BASIC, FORTRAN, etc.) by
performing the operations implied by the source program.
✓ An interpreter might build a syntax tree and then carry out the operations at the
nodes as it walks the tree.
✓ Interpreters are frequently used to execute command languages, since each operator
executed in a command language is usually an invocation of a complex routine such
as an editor or compiler.

ANALYSIS OF THE SOURCE PROGRAM


Analysis consists of 3 phases:

1.Linear/Lexical Analysis :
• It is also called scanning. It is the process of reading the characters from left to right
and grouping into tokens having a collective meaning.
• For example, in the assignment statement a=b+c*2, the characters would be grouped
into the following tokens:
i) The identifier ‘a’
ii) The assignment symbol (=)
iii) The identifier ‘b’
iv) The plus sign (+)
v) The identifier ‘c’
vi) The multiplication sign (*)
vii) The number 2

2.Syntax Analysis :
• It is called parsing or hierarchical analysis. It involves grouping the tokens of the
source program into grammatical phrases that are used by the compiler to synthesize
output.
• They are represented using a syntax tree as shown below:

• A syntax tree is the tree generated as a result of syntax analysis in which the interior
nodes are the operators and the exterior nodes are the operands.
• This analysis shows an error when the syntax is incorrect.

3.Semantic Analysis :

• It checks the source programs for semantic errors and gathers type information for
the subsequent code generation phase. It uses the syntax tree to identify the operators
and operands of statements.
• An important component of semantic analysis is type checking. Here the compiler
checks that each operator has operands that are permitted by the source language
specification.

PHASES OF COMPILER:
A Compiler operates in phases, each of which transforms the source program from one
representation into another. The following are the phases of the compiler:

Main phases:
1) Lexical analysis
2) Syntax analysis
3) Semantic analysis
4) Intermediate code generation
5) Code optimization
6) Code generation

Sub-Phases:
1) Symbol table management
2) Error handling
LEXICAL ANALYSIS:
• It is the first phase of the compiler. It gets input from the source program and produces
tokens as output.
• It reads the characters one by one, starting from left to right and forms the tokens.
• Token : It represents a logically cohesive sequence of characters such as
keywords,operators, identifiers, special symbols etc.
Example: a + b = 20
Here, a,b,+,=,20 are all separate tokens.
Group of characters forming a token is called the Lexeme.
• The lexical analyser not only generates a token but also enters the lexeme into the symbol
table if it is not already there.

SYNTAX ANALYSIS:
• It is the second phase of the compiler. It is also known as parser.
• It gets the token stream as input from the lexical analyser of the compiler and generates
syntax tree as the output.
• Syntax tree: It is a tree in which interior nodes are operators and exterior nodes are
operands.
Example: For a=b+c*2, syntax tree is
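A text sketch of that tree (interior nodes are operators, leaf nodes are operands):

        =
       / \
      a   +
         / \
        b   *
           / \
          c   2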

SEMANTIC ANALYSIS:
• It is the third phase of the compiler.
• It gets input from the syntax analysis as parse tree and checks whether the given syntax is
correct or not.
• It performs type conversion of all the data types into real data types.

INTERMEDIATE CODE GENERATION:


• It is the fourth phase of the compiler.
• It gets input from the semantic analysis and converts the input into output as intermediate
code such as three-address code.
• The three-address code consists of a sequence of instructions, each of which has at most
three operands.
Example: t1=t2+t3
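For instance, the full statement a=b+c*2 might be translated into a sequence like the
following (the temporary names t1, t2 are compiler-generated):

t1 := c * 2
t2 := b + t1
a := t2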

CODE OPTIMIZATION:
• It is the fifth phase of the compiler.
• It gets the intermediate code as input and produces optimized intermediate code as output.
• This phase reduces the redundant code and attempts to improve the intermediate code so that
faster-running machine code will result.
• During the code optimization, the result of the program is not affected.
• To improve the generated code, optimization involves, for example:
✓ detection and removal of dead code (unreachable code).
✓ calculation of constant expressions and terms at compile time.
✓ collapsing of repeated expressions into a single temporary variable.
✓ loop unrolling.
✓ moving loop-invariant code outside the loop.
✓ removal of unwanted temporary variables.
A small before/after illustration follows.
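In this hypothetical fragment (not from any particular compiler), constant folding and
removal of an unneeded temporary transform the left-hand code into the right-hand code:

Before optimization:          After optimization:
t1 := 4 * 2                   a := b + 8
t2 := b + t1
a := t2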

CODE GENERATION:
• It is the final phase of the compiler.
• It gets input from code optimization phase and produces the target code or object
code as result.
• Intermediate instructions are translated into a sequence of machine instructions that
perform the same task.
• The code generation involves
✓ allocation of register and memory
✓ generation of correct references
✓ generation of correct data types
✓ generation of missing code

SYMBOL TABLE MANAGEMENT:


• Symbol table is used to store all the information about identifiers used in the
program.
• It is a data structure containing a record for each identifier, with fields for the
attributes of the identifier.
• It allows to find the record for each identifier quickly and to store or retrieve data
from that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol
table.
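A minimal sketch of such a record in C; the field names and sizes here are illustrative
assumptions, not a fixed layout:

/* one symbol-table record per identifier */
struct symbol {
    char name[32];         /* the lexeme of the identifier            */
    int  type;             /* code for its data type (int, real, ...) */
    int  scope;            /* nesting depth at which it is declared   */
    int  offset;           /* relative address assigned to it         */
    struct symbol *next;   /* link for chaining within a hash bucket  */
};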

ERROR HANDLING:
• Each phase can encounter errors. After detecting an error, a phase must handle the
error so that compilation can proceed.
• In lexical analysis, errors occur in separation of tokens.
• In syntax analysis, errors occur during construction of syntax tree.
• In semantic analysis, errors occur when the compiler detects constructs with right
syntactic structure but no meaning and during type conversion.
• In code optimization, errors occur when the result is affected by the optimization.
• In code generation, it shows error when code is missing etc.

To illustrate the translation of source code through each phase, consider the statement
a=b+c*2; it is transformed in the same way as the following example.

Example 2: position := initial + rate * 60

After the lexical analyzer:
id1 := id2 + id3 * 60

After the syntax analyzer: a syntax tree with := at the root, id1 as its left child, and
the subtree for id2 + id3 * 60 as its right child.

After the semantic analyzer: the same tree, with an inttoreal node inserted to convert
the integer 60 to real.

After the intermediate code generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

After the code optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1

After the code generator:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

COUSINS OF COMPILER
1. Preprocessor
2. Assembler
3. Loader and Link-editor

PREPROCESSOR
A preprocessor is a program that processes its input data to produce output that is used as
input to another program. The output is said to be a preprocessed form of the input data, which is
often used by some subsequent programs like compilers.
They may perform the following functions :
1. Macro processing
2. File Inclusion
3. Rational Preprocessors
4. Language extension

1. Macro processing:
A macro is a rule or pattern that specifies how a certain input sequence should be mapped to
an output sequence according to a defined procedure. The mapping process that instantiates a macro
into a specific output sequence is known as macro expansion.

2. File Inclusion:
Preprocessor includes header files into the program text. When the preprocessor finds an
#include directive it replaces it by the entire content of the specified file.
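For example (the file name here is illustrative), when the preprocessor sees the line

#include "defs.h"

it replaces that line with the complete text of the file defs.h before the compiler proper
runs.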

3. Rational Preprocessors:
These preprocessors augment older languages with more modern flow-of-control and data-
structuring facilities.

4. Language extension :
These processors attempt to add capabilities to the language by what amounts to built-in
macros. For example, the language Equel is a database query language embedded in C.
ASSEMBLER
Assembler creates object code by translating assembly instruction mnemonics into
machine code. There are two types of assemblers:
• One-pass assemblers go through the source code once and assume that all symbols
will be defined before any instruction that references them.
• Two-pass assemblers create a table with all symbols and their values in the first
pass, and then use the table in a second pass to generate code.

LINKER AND LOADER


A linker or link editor is a program that takes one or more objects generated by a
compiler and combines them into a single executable program.
Three tasks of the linker are :
1. Searches the program to find library routines used by the program, e.g. printf(), math
routines.
2. Determines the memory locations that code from each module will occupy and
relocates its instructions by adjusting absolute references
3. Resolves references among files.

A loader is the part of an operating system that is responsible for loading programs in
memory, one of the essential stages in the process of starting a program.

GROUPING OF THE PHASES


Compiler can be grouped into front and back ends:
• Front end: analysis (machine independent)
These normally include lexical and syntactic analysis, the creation of the symbol
table, semantic analysis and the generation of intermediate code. It also includes error
handling that goes along with each of these phases.
• Back end: synthesis (machine dependent)
It includes code optimization phase and code generation along with the necessary
error handling and symbol table operations.

Compiler passes

The phases may be grouped so that the source program is traversed once (single pass) or several times (multi-pass).

✓ Single pass: usually requires everything to be defined before being used in source
program.
✓ Multi pass: compiler may have to keep entire program representation in memory.

Several phases can be grouped into one single pass and the activities of these phases are
interleaved during the pass. For example, lexical analysis, syntax analysis, semantic analysis and
intermediate code generation might be grouped into one pass.

COMPILER CONSTRUCTION TOOLS


These are specialized tools that have been developed for helping implement various
phases of a compiler. The following are the compiler construction tools:

1) Parser Generators:
✓ These produce syntax analyzers, normally from input that is based on a context-free
grammar.
✓ The syntax analysis phase consumes a large fraction of the running time of a compiler.
✓ Example-YACC (Yet Another Compiler-Compiler).

2) Scanner Generator:
✓ These generate lexical analyzers, normally from a specification based on regular
expressions.
✓ The basic organization of the resulting lexical analyzer is a finite automaton.
3) Syntax-Directed Translation:
✓ These produce routines that walk the parse tree and as a result generate intermediate
code.
✓ Each translation is defined in terms of translations at its neighbor nodes in the tree.

4) Automatic Code Generators:


✓ It takes a collection of rules to translate intermediate language into machine language. The
rules must include sufficient details to handle different possible access methods for data.

5) Data-Flow Engines:
✓ It does code optimization using data-flow analysis, that is, the gathering of information about
how values are transmitted from one part of a program to each other part.

LEXICAL ANALYSIS
Lexical analysis is the process of converting a sequence of characters into a sequence of
tokens. A program or function which performs lexical analysis is called a lexical analyzer or
scanner. A lexer often exists as a single function which is called by a parser or another function.

THE ROLE OF THE LEXICAL ANALYZER


✓ The lexical analyzer is the first phase of a compiler.
✓ Its main task is to read the input characters and produce as output a sequence of tokens
that the parser uses for syntax analysis.

✓ Upon receiving a “get next token” command from the parser, the lexical analyzer reads
input characters until it can identify the next token.

ISSUES OF LEXICAL ANALYZER


There are three reasons for making lexical analysis a separate phase:
✓ It makes the design of the compiler simpler.
✓ It improves the efficiency of the compiler.
✓ It enhances the portability of the compiler.

TOKENS:
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of
characters is called tokenization.

A token can look like anything that is useful for processing an input text stream or text
file. Consider this expression in the C programming language: sum=3+2; it is tokenized as
the identifier sum, the operator =, the constant 3, the operator +, the constant 2 and the
symbol ;.
PATTERN:
A pattern is a description of the form that the lexemes of a token may take. A set of strings in
the input for which the same token is produced as output. This set of strings is described by a rule
called a pattern associated with the token.

TOKEN:
Token is a sequence of characters that can be treated as a single logical entity. Typical tokens
are, 1) Identifiers 2) keywords 3) operators 4) special symbols 5)constants.

LEXEME:
Collection or group of characters forming tokens is called Lexeme.
In the case of a keyword as a token, the pattern is just the sequence of characters that
form the keyword. For identifiers and some other tokens, the pattern is a more complex structure
that is matched by many strings.
Attributes for Tokens
Some tokens have attributes that can be passed back to the parser. The lexical
analyzer collects information about tokens into their associated attributes. The
attributes influence the translation of tokens.
i) Constant : value of the constant
ii) Identifiers: pointer to the corresponding symbol table entry.

ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:


The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transforming two adjacent characters.
5) Panic mode recovery: Deletion of successive characters from the token until error is
resolved.

INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we can be sure
we have the right lexeme. As characters are read from left to right, each character is stored in the
buffer to form a meaningful token as shown below:

We introduce a two-buffer scheme that handles large lookaheads safely. We then consider
an improvement involving "sentinels" that saves time checking for the ends of buffers.
BUFFER PAIRS
A buffer is divided into two N-character halves, as shown below.

✓ Each buffer is of the same size N, and N is usually the number of characters on one disk
block. E.g., 1024 or 4096 bytes.
✓ Using one system read command we can read N characters into a buffer.
✓ If fewer than N characters remain in the input file, then a special character, represented by
eof, marks the end of the source file.
✓ Two pointers to the input are maintained:
1. Pointer lexeme_beginning, marks the beginning of the current lexeme, whose extent we
are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found. Once the next lexeme is
determined, forward is set to the character at its right end.
✓ The string of characters between the two pointers is the current lexeme. After the
lexeme is recorded as an attribute value of a token returned to the parser,
lexeme_beginning is set to the character immediately after the lexeme just found.
Advancing forward pointer:
Advancing forward pointer requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. If the end of second buffer is reached, we must again
reload the first buffer with input and the pointer wraps to the beginning of the buffer.
Code to advance forward pointer:
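A sketch of that logic in C, assuming buf is the 2N-character array, forward points into
it, and fill_buffer() is a hypothetical routine that reloads one half from the input:

if (forward == &buf[N - 1]) {          /* at end of first half        */
    fill_buffer(&buf[N]);              /* reload second half          */
    forward++;                         /* move into second half       */
} else if (forward == &buf[2*N - 1]) { /* at end of second half       */
    fill_buffer(&buf[0]);              /* reload first half           */
    forward = &buf[0];                 /* wrap to beginning of buffer */
} else {
    forward++;                         /* common case: just advance   */
}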

SENTINELS
• For each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read. We can combine the buffer-end test with the test
for the current character if we extend each buffer to hold a sentinel character at the
end.
• The sentinel is a special character that cannot be part of the source program, and a
natural choice is the character eof.
• The sentinel arrangement is as shown below:

Note that eof retains its use as a marker for the end of the entire input. Any eof that
appears other than at the end of a buffer means that the input is at an end.

Code to advance forward pointer:
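With sentinels, each half is extended by one slot holding eof, so the common case needs
only one test per character. A sketch in C, assuming the sentinel slots are buf[N] and
buf[2*N + 1]; EOF_CH, fill_buffer() and terminate_lexical_analysis() are assumed helper
names for illustration:

forward++;
if (*forward == EOF_CH) {                  /* hit a sentinel or real eof   */
    if (forward == &buf[N]) {              /* sentinel ending first half   */
        fill_buffer(&buf[N + 1]);          /* reload second half           */
        forward++;                         /* skip over the sentinel       */
    } else if (forward == &buf[2*N + 1]) { /* sentinel ending second half  */
        fill_buffer(&buf[0]);              /* reload first half            */
        forward = &buf[0];                 /* wrap to beginning            */
    } else {
        /* eof within a half: the input really is at an end */
        terminate_lexical_analysis();
    }
}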

SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
1) Strings
2) Language
3) Regular expression
Strings and Languages
✓ An alphabet or character class is a finite set of symbols.
✓ A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
✓ A language is any countable set of strings over some fixed alphabet.
✓ In language theory, the terms "sentence" and "word" are often used as synonyms for
"string." The length of a string s, usually written |s|, is the number of occurrences of symbols in
s. For example, banana is a string of length six. The empty string, denoted ε, is the string of length
zero.

Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from
the end of string s.
For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from
the beginning of s.
For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s.
For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes,
suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.
5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s.
For example, baan is a subsequence of banana.
Operations on languages:
The following are the operations that can be applied to languages:
1.Union
2.Concatenation
3.Kleene closure
4.Positive closure
The following example shows these operations on languages:
Let L = {0,1} and S = {a,b,c}
1. Union : L U S = {0, 1, a, b, c}
2. Concatenation : L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure : L* = { ε, 0, 1, 00, 01, 10, 11, … } (zero or more concatenations of L)
4. Positive closure : L+ = { 0, 1, 00, 01, 10, 11, … } (one or more concatenations of L)
Regular Expressions
• Each regular expression r denotes a language L(r).
• Here are the rules that define the regular expressions over some alphabet Σ and the
languages that those expressions denote:
1. ɛ is a regular expression, and L(ɛ) is { ɛ }, that is, the language whose sole
member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is,
the language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
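For example, under these conventions the expression a|bc* is equivalent to the fully
parenthesized (a)|((b)((c)*)): it denotes either the string a, or a b followed by zero or
more c’s.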

Regular set
A language that can be defined by a regular expression is called a regular set.
If two regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s.
There are a number of algebraic laws for regular expressions that can be used to
manipulate into equivalent forms.
For instance, r|s = s|r is commutative; r|(s|t)=(r|s)|t is associative.

Regular Definitions
Giving names to regular expressions is referred to as a regular definition. If Σ is an
alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a distinct name.
2. Each ri is a regular expression over the alphabet Σ U {d1, d2, … , di-1}.
Example: Identifiers is the set of strings of letters and digits beginning with a letter.

Regular definition for this set:


letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to
introduce notational shorthands for them.

1. One or more instances (+):


- The unary postfix operator + means “one or more instances of”.
- If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression
that denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a’s.
- The operator + has the same precedence and associativity as the operator *.

2. Zero or one instance (?):


- The unary postfix operator ? means “zero or one instance of”.
- The notation r? is a shorthand for r | ε.
- If ‘r’ is a regular expression, then (r)? is a regular expression that denotes the language
L(r) U { ε }.

3. Character Classes:
-The notation [abc] where a, b and c are alphabet symbols denotes the regular expression a |
b | c.
-Character class such as [a – z] denotes the regular expression a | b | c | d | ….|z.
-We can describe identifiers as being strings generated by the regular expression,
[A–Za–z][A–Za–z0–9]*

Non-regular Set
A language which cannot be described by any regular expression is a non-regular set.
Example: The set of all strings of balanced parentheses and repeating strings cannot be described by
a regular expression. This set can be specified by a context-free grammar.

RECOGNITION OF TOKENS
Consider the following grammar fragment:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other
expr → term relop term
     | term
term → id
     | num

where the terminals if, then, else, relop, id and num generate sets of strings given by the
following regular definitions:

if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
For this language fragment the lexical analyzer will recognize the keywords if, then, else, as
well as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are
reserved; that is, they cannot be used as identifiers.

Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a lexical
analyzer is called by the parser to get the next token. It is used to keep track of information about the
characters that are seen as the forward pointer scans the input.
A LANGUAGE FOR SPECIFYING LEXICAL ANALYZER
There is a wide range of tools for constructing lexical analyzers.
•Lex
• YACC
LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used with the
yacc parser generator.

Creating a lexical analyzer


• First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex
language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
• Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is
the lexical analyzer that transforms an input stream into a sequence of tokens.
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }

• Definitions include declarations of variables, constants, and regular definitions.


• Rules are statements of the form
p1 {action1}
p2 {action2}
…
pn {action n}

where each pi is a regular expression and each actioni describes what action the lexical
analyzer should take when pattern pi matches a lexeme. Actions are written in C code.

• User subroutines are auxiliary procedures needed by the actions. These can be compiled
separately and loaded with the lexical analyzer.
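A minimal sketch of a complete Lex specification in this format; the token actions shown
(printing what was matched) are illustrative assumptions:

%{
#include <stdio.h>
%}
letter    [A-Za-z]
digit     [0-9]
%%
"if"                          { printf("keyword: if\n"); }
{letter}({letter}|{digit})*   { printf("identifier: %s\n", yytext); }
{digit}+                      { printf("number: %s\n", yytext); }
[ \t\n]                       { /* skip white space */ }
.                             { printf("symbol: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { yylex(); return 0; }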

YACC-YET ANOTHER COMPILER-COMPILER


Yacc provides a general tool for describing the input to a computer program. The Yacc user
specifies the structures of his input, together with code to be invoked as each such structure is
recognized. Yacc turns such a specification into a subroutine that handles the input process;
frequently, it is convenient and appropriate to have most of the flow of control in the user's
application handled by this subroutine.

FINITE AUTOMATA
Finite Automata is one of the mathematical models that consist of a number of states and
edges. It is a transition diagram that recognizes a regular expression or grammar.
Types of Finite Automata
There are two types of Finite Automata:
• Non-deterministic Finite Automata (NFA)
• Deterministic Finite Automata (DFA)
Non-deterministic Finite Automata
An NFA may have ε-transitions and may have more than one transition out of a state on the
same input symbol, so several states can be active at once.

Deterministic Finite Automata
A DFA has no ε-transitions and exactly one transition out of each state on each input
symbol, so it can be executed directly.


Construction of DFA from regular expression

The following steps are involved in the construction of DFA from regular expression:
i) Convert the RE to an NFA using Thompson’s construction
ii) Convert NFA to DFA
iii) Construct minimized DFA

UNIT-II

SYNTAX ANALYSIS AND RUNTIME ENVIRONMENT

SYNTAX ANALYSIS
Syntax analysis is the second phase of the compiler. It gets the input from the tokens and
generates a syntax tree or parse tree.

Advantages of grammar for syntactic specification :


1. A grammar gives a precise and easy-to-understand syntactic specification of a
programming language.
2. An efficient parser can be constructed automatically from a properly designed grammar.
3. A grammar imparts a structure to a source program that is useful for its translation into
object code and for the detection of errors.
4. New constructs can be added to a language more easily when there is a grammatical
description of the language.

THE ROLE OF PARSER


The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language. It reports any
syntax errors in the program. It also recovers from commonly occurring errors so that it can
continue processing its input.
Functions of the parser :
1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.

Issues :
Parser cannot detect errors such as:
1. Variable re-declaration
2. Variable initialization before use.
3. Data type mismatch for an operation.
The above issues are handled by Semantic Analysis phase.

Syntax error handling :


Programs can contain errors at many different levels. For example :
1. Lexical, such as misspelling a keyword.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.
Functions of error handler :
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.
Error recovery strategies :
The different strategies that a parser uses to recover from a syntactic error are:
1. Panic mode
2. Phrase level
3. Error productions
4. Global correction
Panic mode recovery:
On discovering an error, the parser discards input symbols one at a time until a
synchronizing token is found. The synchronizing tokens are usually delimiters, such as semicolon or
end. It has the advantage of simplicity and does not go into an infinite loop. When multiple errors in
the same statement are rare, this method is quite useful.

Phrase level recovery:


On discovering an error, the parser performs local correction on the remaining input that
allows it to continue. Example: Insert a missing semicolon or delete an extraneous semicolon etc.

Error productions:
The parser is constructed using augmented grammar with error productions. If an error
production is used by the parser, appropriate error diagnostics can be generated to indicate the
erroneous constructs recognized by the input.
Global correction:
Given an incorrect input string x and grammar G, certain algorithms can be used to find a
parse tree for a string y, such that the number of insertions, deletions and changes of tokens is as
small as possible. However, these methods are in general too costly in terms of time and space.

CONTEXT-FREE GRAMMARS
A Context-Free Grammar is a quadruple that consists of terminals, non-terminals, start
symbol and productions.
Terminals : These are the basic symbols from which strings are formed.
Non-Terminals : These are the syntactic variables that denote a set of strings. These help to
define the language generated by the grammar.
Start Symbol : One non-terminal in the grammar is denoted as the “Start-symbol” and the
set of strings it denotes is the language defined by the grammar.
Productions : It specifies the manner in which terminals and non-terminals can be combined
to form strings. Each production consists of a non-terminal, followed by an arrow, followed by a
string of non-terminals and terminals.
Example of context-free grammar: The following grammar defines simple arithmetic expressions:

expr → expr op expr | ( expr ) | - expr | id
op → + | - | * | /

In this grammar,
• id + - * / ( ) are the terminals.
• expr, op are the non-terminals.
• expr is the start symbol.
• Each line is a production.

Derivations:
Two basic requirements for a grammar are :
1. To generate a valid string.
2. To recognize a valid string.
Derivation is a process that generates a valid string with the help of grammar by replacing
the non-terminals on the left with the string on the right side of the production.
Example : Consider the following grammar for arithmetic expressions :
E → E+E | E*E | ( E ) | -E | id
To generate the valid string -( id+id ) from the grammar, the derivation steps are:
1. E ⇒ -E
2. ⇒ -(E)
3. ⇒ -(E+E)
4. ⇒ -(id+E)
5. ⇒ -(id+id)
In the above derivation,
•E is the start symbol.
• -(id+id) is the required sentence (only terminals).
• Strings such as E, -E, -(E), . . . are called sentential forms.
Types of derivations:
The two types of derivation are:
1. Left most derivation
2. Right most derivation.
• In a leftmost derivation, the leftmost non-terminal in each sentential form is always chosen
first for replacement.
• In a rightmost derivation, the rightmost non-terminal in each sentential form is always
chosen first for replacement.
Example:
Given grammar G : E → E+E | E*E | ( E ) | -E | id
Sentence to be derived : – (id+id)

LEFTMOST DERIVATION RIGHTMOST DERIVATION


E→-E E→-E
E→-(E) E→-(E)
E→-(E+E ) E→-(E+E )
E→-(id+E ) E→-(E+id )
E→-(id+id ) E→-(id+id )

• Strings that appear in a leftmost derivation are called left sentential forms.
• Strings that appear in a rightmost derivation are called right sentential forms.
Sentential forms:
Given a grammar G with start symbol S, if S ⇒* α, where α may contain non-terminals or
terminals, then α is called a sentential form of G.
Yield or frontier of tree:
Each interior node of a parse tree is a non-terminal; the children of a node can be
terminals or non-terminals. Reading the leaves of the parse tree from left to right gives a
sentential form, called the yield or frontier of the tree.

Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous grammar.
Example : Given grammar G : E → E+E | E*E |( E ) | -E | id
The sentence id+id*id has the following two distinct leftmost derivations:
E→ E+ E E→ E*E
E→ id + E E→ E + E * E
E→ id + E * E E → id + E * E
E→ id + id * E E→ id + id * E
E→ id + id * id E→ id + id * id

The two corresponding parse trees are :

WRITING A GRAMMAR
There are four categories in writing a grammar :
1. Regular Expression Vs Context Free Grammar
2. Eliminating ambiguous grammar.
3. Eliminating left-recursion
4. Left-factoring.
Each parsing method can handle grammars only of a certain form hence, the initial grammar may
have to be rewritten to make it parsable.

Regular Expressions vs. Context-Free Grammars:

•The lexical rules of a language are simple and RE is used to describe them.
• Regular expressions provide a more concise and easier to understand notation for tokens
than grammars.
• Efficient lexical analyzers can be constructed automatically from RE than from grammars.
• Separating the syntactic structure of a language into lexical and nonlexical parts provides a
convenient way of modularizing the front end into two manageable-sized components.

Eliminating ambiguity:
Ambiguity of the grammar that produces more than one parse tree for leftmost or rightmost
derivation can be eliminated by re-writing the grammar.
Consider this example, G: stmt → if expr then stmt | if expr then stmt else stmt | other

This grammar is ambiguous since the string if E1 then if E2 then S1 else S2 has the
following two parse trees for leftmost derivation :
To eliminate ambiguity, the following grammar may be used:
stmt→ matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt

Eliminating Left Recursion:


A grammar is said to be left recursive if it has a non-terminal A such that there is a
derivation A=>Aα for some string α. Top-down parsing methods cannot handle left-recursive
grammars. Hence, left recursion can be eliminated as follows:
If there is a production A→ Aα | β it can be replaced with a sequence of two productions
A→ βA’
A’ → αA’ | ε
without changing the set of strings derivable from A.

Example : Consider the following grammar for arithmetic expressions:


E→ E+T |T
T→ T*F |F
F→ (E) |id
First eliminate the left recursion for E as
E→ TE’
E’ → +TE’ | ε
Then eliminate for T as
T → FT’
T’→ *FT’ | ε

Thus the obtained grammar after eliminating left recursion is


E→ TE’
E’ → +TE’ | ε
T → FT’
T’→ *FT’ | ε
F → (E) |id

Algorithm to eliminate left recursion:


1. Arrange the non-terminals in some order A1, A2 . . . An.
2. for i := 1 to n do begin
for j := 1 to i-1 do begin
replace each production of the form Ai → Aj γ
by the productions Ai → β1 γ | β2 γ | . . . | βk γ,
where Aj → β1 | β2 | . . . | βk are all the current Aj-productions;
end
eliminate the immediate left recursion among the Ai-productions
end

Left factoring:
Left factoring is a grammar transformation that is useful for producing a grammar suitable
for predictive parsing. When it is not clear which of two alternative productions to use to expand a
non-terminal A, we can rewrite the A-productions to defer the decision until we have seen enough
of the input to make the right choice.

If there is any production A → αβ1 | αβ2 , it can be rewritten as


A → αA’
A’ → β1 | β2
Consider the grammar , G : S → iEtS | iEtSeS | a
E→ b
Left factored, this grammar becomes
S → iEtSS’ | a
S’→ eS | є
E→b
PARSING:
It is the process of analyzing a continuous stream of input in order to determine its
grammatical structure with respect to a given formal grammar.

Parse tree:
Graphical representation of a derivation or deduction is called a parse tree. Each interior
node of the parse tree is a non-terminal; the children of the node can be terminals or non-terminals.

Types of parsing:
1.Top down parsing
2. Bottom up parsing
• Top–down parsing : A parser can start with the start symbol and try to transform it to the
input string. Example : LL Parsers.
• Bottom–up parsing : A parser can start with input and attempt to rewrite it into the start
symbol. Example : LR Parsers.

TOP-DOWN PARSING
It can be viewed as an attempt to find a left-most derivation for an input string or an
attempt to construct a parse tree for the input starting from the root to the leaves.

Types of top-down parsing :


1. Recursive descent parsing
2. Predictive parsing
1. RECURSIVE DESCENT PARSING
• Recursive descent parsing is one of the top-down parsing techniques that uses a set of
recursive procedures to scan its input.
• This parsing method may involve backtracking, that is, making repeated scans of the input.

Example for backtracking :


Consider the grammar G : S → cAd
A → ab |a and the input string w=cad.

The parse tree can be constructed using the following top-down approach :

Step 1:
Initially create a tree with single node labeled S. An input pointer points to ‘c’, the first
symbol of w. Expand the tree with the production of S.
Step 2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the
second symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.

Step 3:
The second symbol ‘a’ of w also matches with second leaf of tree. So advance the input
pointer to third symbol of w ‘d’. But the third leaf of tree is b which does not match with the input
symbol d.
Hence discard the chosen production and reset the pointer to second position. This is called
backtracking.

Step 4:
Now try the second alternative for A.

Now we can halt and announce the successful completion of parsing.

Example for recursive descent parsing:


A left-recursive grammar can cause a recursive-descent parser to go into an infinite loop.
Hence, elimination of left-recursion must be done before parsing.

Consider the grammar for arithmetic expressions


E → E+T |T
T → T*F |F
F → (E) |id

After eliminating the left-recursion the grammar becomes,


E → TE’
E’→ +TE’ | є
T → FT’
T’ → *FT’ | є
F → (E)| id
Now we can write the procedure for grammar as follows:

Recursive procedure:

Procedure E( )
begin
    T( );
    EPRIME( );
end

Procedure EPRIME( )
begin
    if input_symbol = ’+’ then begin
        ADVANCE( );
        T( );
        EPRIME( );
    end
end

Procedure T( )
begin
    F( );
    TPRIME( );
end

Procedure TPRIME( )
begin
    if input_symbol = ’*’ then begin
        ADVANCE( );
        F( );
        TPRIME( );
    end
end

Procedure F( )
begin
    if input_symbol = ’id’ then
        ADVANCE( )
    else if input_symbol = ’(’ then begin
        ADVANCE( );
        E( );
        if input_symbol = ’)’ then
            ADVANCE( )
        else ERROR( );
    end
    else ERROR( );
end

Stack implementation:

To recognize input id+id*id :


2. PREDICTIVE PARSING
• Predictive parsing is a special case of recursive descent parsing where no backtracking is
required.
• The key problem of predictive parsing is to determine the production to be applied for a
non-terminal in case of alternatives.

Non-recursive predictive parser

The table-driven predictive parser has an input buffer, a stack, a parsing table and an
output stream.
Input buffer: It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack: It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.

Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.

Predictive parsing program:


The parser is controlled by a program that considers X, the symbol on top of stack, and a, the
current input symbol. These two symbols determine the parser action. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next
input symbol.
3. If X is a non-terminal , the program consults entry M[X, a] of the parsing table M. This
entry will either be an X-production of the grammar or an error entry.
If M[X, a] = {X → UVW},the parser replaces X on top of the stack by WVU.
If M[X, a] = error, the parser calls an error recovery routine.

Algorithm for nonrecursive predictive parsing:

Input : A string w and a parsing table M for grammar G.


Output : If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method : Initially, the parser has $S on the stack with S, the start symbol of G on top, and w$ in the
input buffer. The program that utilizes the predictive parsing table M to produce a parse for the
input is as follows:

set ip to point to the first symbol of w$;


repeat
let X be the top stack symbol and a the symbol pointed to by ip;
if X is a terminal or $ then
if X = a then
pop X from the stack and advance ip
else error()
else /* X is a non-terminal */
if M[X, a] = X→ Y1Y2 … Yk then begin
pop X from the stack;
push Yk, Yk-1,…,Y1 onto the stack, with Y1 on top;
output the production X → Y1 Y2 . . . Yk
end
else error()
until X = $ /* stack is empty */

Predictive parsing table construction:


The construction of a predictive parser is aided by two functions associated with a grammar
G:
1. FIRST
2. FOLLOW

Rules for first( ):


1. If X is terminal, then FIRST(X) is {X}.
2. If X → є is a production, then add є to FIRST(X).
3. If X is non-terminal and X → aα is a production (a a terminal), then add a to FIRST(X).
4. If X is non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X) if for
some i, a is in FIRST(Yi), and є is in all of FIRST(Y1),…,FIRST(Yi-1); that is, Y1….Yi-1 ⇒*
є. If є is in FIRST(Yj) for all j = 1,2,…,k, then add є to FIRST(X).
Rules for follow( ):
1. If S is a start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except є is placed in
follow(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains є,
then everything in FOLLOW(A) is in FOLLOW(B).

Algorithm for construction of predictive parsing table:


Input : Grammar G
Output : Parsing table M
Method :
1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If є is in FIRST(α), add A→ α to M[A, b] for each terminal b in FOLLOW(A). If є is in
FIRST(α) and $ is in FOLLOW(A) , add A→ α to M[A, $].
4. Make each undefined entry of M be error.

Example:
Consider the following grammar :
E → E+T |T
T → T*F |F
F → (E) |id

After eliminating left-recursion the grammar is


E → TE’
E’ → +TE’ | є
T → FT’
T’ → *FT’ | є
F → (E) |id

First( ) :
FIRST(E) = { ( , id}
FIRST(E’) ={+ , є }
FIRST(T) = { ( , id}
FIRST(T’) = {*, є }
FIRST(F) = { ( , id }

Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }

Predictive parsing table :

              id         +          *          (          )          $
E             E→TE’                            E→TE’
E’                       E’→+TE’                          E’→є       E’→є
T             T→FT’                            T→FT’
T’                       T’→є       T’→*FT’               T’→є       T’→є
F             F→id                             F→(E)

(Blank entries denote errors.)

Stack implementation: the input id+id*id can be parsed by repeatedly consulting this table,
matching terminals and expanding non-terminals as in the algorithm above.
LL(1) grammar:
If every entry of the parsing table is single (each location has at most one production),
the grammar is called an LL(1) grammar.
Consider this following grammar:
S→ iEtS | iEtSeS | a
E→ b
After left factoring, we have
S→ iEtSS’ | a
S’→ eS | є
E→ b
To construct a parsing table, we need FIRST() and FOLLOW() for all the non-terminals.
FIRST(S) = { i, a }
FIRST(S’) = {e, є }
FIRST(E) = { b}
FOLLOW(S) = { $ ,e }
FOLLOW(S’) = { $ ,e }
FOLLOW(E) = {t}

Parsing table:
The entry M[S’, e] contains two productions, S’ → eS and S’ → є (a conflict caused by the
dangling else).

Since there is more than one production in that entry, the grammar is not LL(1).

Actions performed in predictive parsing:


1. Shift
2. Reduce
3. Accept
4. Error

Implementation of predictive parser:


1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.

BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going towards the
root is called bottom-up parsing.
A general type of bottom-up parser is a shift-reduce parser.

SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a parse tree
for an input string beginning at the leaves (the bottom) and working up towards the root (the
top).

Example:
Consider the grammar:
S → aABe
A→ Abc | b
B→ d
The sentence to be recognized is abbcde.

REDUCTION (LEFTMOST) RIGHTMOST DERIVATION


abbcde (A → b) S→ aABe
aAbcde (A → Abc) → aAde
aAde (B→ d) → aAbcde
aABe (S→ aABe) → abbcde
S
The reductions trace out the right-most derivation in reverse.

Handles:
A handle of a string is a substring that matches the right side of a production, and whose
reduction to the non-terminal on the left side of the production represents one step along the
reverse of a rightmost derivation.

Example:
Consider the grammar:
E → E+E
E → E*E
E→ (E)
E → id
And the input string id1+id2*id3
The rightmost derivation is :
E → E+E
→ E+E*E
→ E+E*id3
→ E+id2*id3
→ id1+id2*id3
In the above derivation the underlined substrings are called handles.

Handle pruning:
A rightmost derivation in reverse can be obtained by “handle pruning”, (i.e.) if w is a
sentence or string of the grammar at hand, then w = γn, where γn is the nth right-sentential
form of some rightmost derivation.

Stack implementation of shift-reduce parsing:
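As an illustration of how the stack evolves, a shift-reduce parse of id1+id2*id3 with the
grammar E → E+E | E*E | (E) | id might proceed as follows (this trace assumes conflicts are
resolved so as to follow the rightmost derivation shown earlier):

STACK              INPUT               ACTION
$                  id1+id2*id3$        shift
$id1               +id2*id3$           reduce by E → id
$E                 +id2*id3$           shift
$E+                id2*id3$            shift
$E+id2             *id3$               reduce by E → id
$E+E               *id3$               shift
$E+E*              id3$                shift
$E+E*id3           $                   reduce by E → id
$E+E*E             $                   reduce by E → E*E
$E+E               $                   reduce by E → E+E
$E                 $                   accept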

Actions in shift-reduce parser:


• shift – The next input symbol is shifted onto the top of the stack.
• reduce – The parser replaces the handle within the stack with a non-terminal.
• accept – The parser announces successful completion of parsing.
• error – The parser discovers that a syntax error has occurred and calls an error recovery
routine.

Conflicts in shift-reduce parsing:


There are two conflicts that occur in shift-reduce parsing:
1. Shift-reduce conflict: The parser cannot decide whether to shift or to reduce.
2. Reduce-reduce conflict: The parser cannot decide which of several reductions to make.

1. Shift-reduce conflict:
Example: Consider the grammar:
E→E+E | E*E | id and input id+id*id
2. Reduce-reduce conflict:
Consider the grammar:
M → R+R |R+c |R
R → c and input c+c

Viable prefixes:
• α is a viable prefix of the grammar if there is a w such that αw is a right sentential form.
• The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce
parser are called viable prefixes.
• The set of viable prefixes is a regular language.

OPERATOR-PRECEDENCE PARSING
An efficient way of constructing shift-reduce parser is called operator-precedence parsing.
Operator precedence parser can be constructed from a grammar called Operator-grammar.
These grammars have the property that no production on right side is ɛ or has two adjacent
non-terminals.

Example:
Consider the grammar:
E → EAE |(E) | -E |id
A → +|-|*|/|↑

Since the right side EAE has three consecutive non-terminals, the grammar can be written as
follows:
E → E+E |E-E |E*E | E/E |E↑E | -E |id

Operator precedence relations:


There are three disjoint precedence relations namely

<. - less than


= - equal to
.> - greater than
The relations give the following meaning:

a <. b – a yields precedence to b


a = b – a has the same precedence as b
a .> b – a takes precedence over b

Rules for binary operations:


1. If operator θ1 has higher precedence than operator θ2, then make
θ1 .> θ2 and θ2 <. θ1

2. If operators θ1 and θ2 are of equal precedence, then make


θ1 .> θ2 and θ2 .> θ1 if operators are left associative
θ1 <. θ2 and θ2 <. θ1 if right associative

3. Make the following for all operators θ:


θ <. id , id .> θ
θ <. (, ( <. θ.
) .> θ, θ .> )
θ .> $ , $ <. θ

Also make
( = ) , ( <. id , id .> ) , $ <. id , id .> $ ,
( <. ( , $ <. ( , ) .> $ , ) .> )

Example:
Operator-precedence relations for the grammar
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id is given in the following table assuming
1. ↑ is of highest precedence and right-associative
2. * and / are of next higher precedence and left-associative, and
3. + and -are of lowest precedence and left-associative
Note that the blanks in the table denote error entries.

TABLE : Operator-precedence relations


Operator precedence parsing algorithm:

Input : An input string w and a table of precedence relations.

Output : If w is well formed, a skeletal parse tree, with a placeholder non-terminal E


labeling all interior nodes; otherwise, an error indication.

Method : Initially the stack contains $ and the input buffer the string w $. To parse, we
execute the following program :

(1) Set ip to point to the first symbol of w$;


(2) repeat forever
(3) if $ is on top of the stack and ip points to $ then
(4) return
else begin
(5) let a be the topmost terminal symbol on the stack
and let b be the symbol pointed to by ip;
(6) if a <. b or a = b then begin
(7) push b onto the stack;
(8) advance ip to the next input symbol;
end;
(9) else if a .> b then /*reduce*/
(10) repeat
(11) pop the stack
(12) until the top stack terminal is related by <. to the terminal most recently popped
(13) else error( )
end

Stack implementation of operator precedence parsing:


Operator precedence parsing uses a stack and precedence relation table for its
implementation of above algorithm. It is a shift-reduce parsing containing all four actions
shift, reduce, accept and error.
The initial configuration of an operator precedence parsing is
STACK INPUT
$ w$
where w is the input string to be parsed.

Example:
Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id. The input string is
id+id*id. The implementation is as follows:
Advantages of operator precedence parsing:
1.It is easy to implement.
2. Once an operator precedence relation is made between all pairs of terminals of a grammar,
the grammar can be ignored. The grammar is not referred anymore during implementation.

Disadvantages of operator precedence parsing:


1. It is hard to handle tokens like the minus sign (-), which has two different precedences
(unary and binary).
2. Only a small class of grammar can be parsed using operator-precedence parser.

LR PARSERS
An efficient bottom-up syntax analysis technique that can be used to parse a large class of
CFG is called LR(k) parsing. The ‘L’ is for left-to-right scanning of the input, the ‘R’ for
constructing a rightmost derivation in reverse, and the ‘k’ for the number of input symbols
of lookahead. When ‘k’ is omitted, it is assumed to be 1.

Advantages of LR parsing:
• It recognizes virtually all programming language constructs for which CFG can be written.
• It is an efficient non-backtracking shift-reduce parsing method.
• A grammar that can be parsed using LR method is a proper superset of a grammar that can
be parsed with predictive parser.
• It detects a syntactic error as soon as possible.

Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a programming language
grammar. A specialized tool, called an LR parser generator, is needed. Example: YACC.

Types of LR parsing method:


1. SLR-Simple LR - Easiest to implement, least powerful.
2. CLR-Canonical LR - Most powerful, most expensive.
3. LALR-Look-Ahead LR - Intermediate in size and cost between the other two methods.

The LR parsing algorithm:


The schematic form of an LR parser is as follows:
It consists of : an input, an output, a stack, a driver program, and a parsing table that has two
parts (action and goto).

• The driver program is the same for all LR parser.


• The parsing program reads characters from an input buffer one at a time.
• The program uses a stack to store a string of the form s0X1s1X2s2…Xmsm, where sm is
on top. Each Xi is a grammar symbol and each si is a state.
• The parsing table consists of two parts : action and goto functions.

Action : The parsing program determines sm, the state currently on top of stack, and ai, the
current input symbol. It then consults action[sm,ai] in the action table which can have one of
four values :

1. shift s, where s is a state,


2. reduce by a grammar production A → β,
3. accept, and
4. error.

Goto : The function goto takes a state and grammar symbol as arguments and produces a
state.

LR Parsing algorithm:

Input: An input string w and an LR parsing table with functions action and goto
for grammar G.

Output: If w is in L(G), a bottom-up-parse for w; otherwise, an error indication.

Method: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the
input buffer. The parser then executes the following program :

set ip to point to the first input symbol of w$;


repeat forever begin
let s be the state on top of the stack and
a the symbol pointed to by ip;
if action[s, a] = shift s’ then begin
push a then s’ on top of the stack;
advance ip to the next input symbol
end
else if action[s, a] = reduce A→ β then begin
pop 2* | β | symbols off the stack;
let s’ be the state now on top of the stack;
push A then goto[s’, A] on top of the stack;
output the production A→ β
end
else if action[s, a] = accept then
return
else error()
end

CONSTRUCTING SLR(1) PARSING TABLE:


To perform SLR parsing, take the grammar as input and do the following:
1. Find the LR(0) items.
2. Complete the closure.
3. Compute goto(I, X), where I is a set of items and X is a grammar symbol.

LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of
the right side. For example, production A → XYZ yields the four items :

A → . XYZ
A → X . YZ
A → XY . Z
A → XYZ .

Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed from I
by the two rules:

1. Initially, every item in I is added to closure(I).


2. If A → α . Bβ is in closure(I) and B → γ is a production, then add the item B → . γ to I ,
if it is not already there. We apply this rule until no more new items can be added to
closure(I).

Goto operation:
Goto(I, X) is defined to be the closure of the set of all items [A → αX . β] such that
[A → α . X β] is in I.

Steps to construct SLR parsing table for grammar G are:


1. Augment G and produce G’
2. Construct the canonical collection of set of items C for G’
3. Construct the parsing action function action and goto using the following algorithm that
requires FOLLOW(A) for each non-terminal of grammar.

Algorithm for construction of SLR parsing table:

Input : An augmented grammar G’


Output : The SLR parsing table functions action and goto for G’
Method :
1.Construct C = {I0, I1, …. In}, the collection of sets of LR(0) items for G’.
2. State i is constructed from Ii. The parsing functions for state i are determined as follows:
(a) If [A → α . aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to “shift j”. Here a
must be a terminal.
(b) If [A → α .] is in Ii, then set action[i, a] to “reduce A → α” for all a in
FOLLOW(A).
(c) If [S’ → S .] is in Ii, then set action[i, $] to “accept”.
If any conflicting actions are generated by the above rules, we say grammar is not SLR(1).

3. The goto transitions for state i are constructed for all non-terminals A using the rule:
If goto(Ii,A) = Ij, then goto[i,A] = j.
4. All entries not defined by rules (2) and (3) are made “error”
5. The initial state of the parser is the one constructed from the set of items containing
[S’→ . S].

Example for SLR parsing:


Construct SLR parsing for the following grammar :
G : E → E + T | T
T → T * F | F
F → (E) | id
The given grammar is :
G : E → E + T ------(1)
E→T ------(2)
T → T * F ------(3)
T→F ------(4)
F → (E) ------(5)
F → id ------(6)
Step 1 : Convert given grammar into augmented grammar.
Augmented grammar :
E’ → E
E→ E+T
E→T
T→T*F
T→ F
F → (E)
F → id

Step 2 : Find LR (0) items.

I0 : E’→ . E
E→.E+T
E→.T
T→ . T * F
T→ . F
F→ . (E)
F→ . id

GOTO ( I0 , E) GOTO ( I4 , id )
I1 : E’ → E . I5 : F → id .
E →E.+T

GOTO ( I0 , T) GOTO ( I6 , T )
I2: E→ T. I9 : E → E + T .
T→ T.*F T→T.*F

GOTO ( I0 , F) GOTO ( I6 , F )
I3: T→F. I3 : T → F .

GOTO ( I0 , ( ) GOTO ( I6 , ( )
I4: F → ( .E) I4: F → ( .E)
E→.E+T E→.E+T
E→ . T E→ . T
T → .T * F T → .T * F
T→.F T→.F
F → . (E) F → . (E)
F → . id F → . id

GOTO ( I0 , id ) GOTO ( I6 , id )
I5 : F → id . I5 : F → id .

GOTO ( I1 , + ) GOTO ( I7 , F)
I6 : E→ E+.T I10: T → T * F.
T→ .T*F
T→.F
F → . (E)
F→ .id

GOTO ( I2 , * ) GOTO ( I7 , ( )
I7: T→T*.F I4: F → ( .E)
F→ .(E) E→.E+T
F→ . id E→ . T
T → .T * F
T→.F
F → . (E)
F → . id

GOTO ( I4 , E ) GOTO ( I7 , id )
I8 : F → ( E .) I5 : F → id .
E→E.+T

GOTO ( I4 , T) GOTO ( I8 , ) )
I2 : E →T . I11: F → (E).
T→T.*F

GOTO ( I4 , F) GOTO ( I8 , + )
I3 : T→F. I6 : E → E + . T
T→ .T*F
T→.F
F → . (E)
F→ .id
GOTO ( I4 , ( ) GOTO ( I9 , *)
I4 : F → ( .E ) I7: T → T * . F
E → .E + T F→ .(E)
E → .T F→ . id
T → .T * F
T → .F
F → .(E)
F → .id

FOLLOW (E) = { $ , ) , + }
FOLLOW (T) = { $ , + , ) , * }
FOLLOW (F) = { * , + , ) , $ }

SLR parsing table:

State          action                          goto
        id     +      *      (      )      $      E     T     F
  0     s5                   s4                    1     2     3
  1            s6                          acc
  2            r2     s7            r2     r2
  3            r4     r4            r4     r4
  4     s5                   s4                    8     2     3
  5            r6     r6            r6     r6
  6     s5                   s4                          9     3
  7     s5                   s4                               10
  8            s6                   s11
  9            r1     s7            r1     r1
 10            r3     r3            r3     r3
 11            r5     r5            r5     r5

Blank entries are error entries. Here si means “shift and stack state i”, and rj means “reduce
by the production numbered j”.

Stack implementation:
Check whether the input id + id * id is valid or not.
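Using the table above, the moves of the parser on id + id * id are as follows (stack entries alternate states and grammar symbols; this trace is a sketch that assumes the table just given):

STACK                      INPUT            ACTION
0                          id + id * id $   shift 5
0 id 5                     + id * id $      reduce by F → id
0 F 3                      + id * id $      reduce by T → F
0 T 2                      + id * id $      reduce by E → T
0 E 1                      + id * id $      shift 6
0 E 1 + 6                  id * id $        shift 5
0 E 1 + 6 id 5             * id $           reduce by F → id
0 E 1 + 6 F 3              * id $           reduce by T → F
0 E 1 + 6 T 9              * id $           shift 7
0 E 1 + 6 T 9 * 7          id $             shift 5
0 E 1 + 6 T 9 * 7 id 5     $                reduce by F → id
0 E 1 + 6 T 9 * 7 F 10     $                reduce by T → T * F
0 E 1 + 6 T 9              $                reduce by E → E + T
0 E 1                      $                accept

The parser reaches accept, so id + id * id is a valid sentence of the grammar.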
TYPE CHECKING
A compiler must check that the source program follows both syntactic and semantic
conventions of the source language. This checking, called static checking, detects and reports
programming errors.
Some examples of static checks:
1. Type checks – A compiler should report an error if an operator is applied to an
incompatible operand. Example: If an array variable and function variable are added together.
2. Flow-of-control checks – Statements that cause flow of control to leave a construct must
have some place to which to transfer the flow of control. Example: an error occurs if a break
statement has no enclosing construct, such as a while or switch statement, from which to
transfer control.

Position of type checker

·A type checker verifies that the type of a construct matches that expected by its context.
For example : arithmetic operator mod in Pascal requires integer operands, so a type
checker verifies that the operands of mod have type integer.
·Type information gathered by a type checker may be needed when code is generated.

TYPE SYSTEMS
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of types, and the rules for assigning types to language
constructs. For example : “ if both operands of the arithmetic operators of +,-and * are of
type integer, then the result is of type integer ”

Type Expressions
· The type of a language construct will be denoted by a “type expression.”
·A type expression is either a basic type or is formed by applying an operator called a type
constructor to other type expressions.
·The sets of basic types and constructors depend on the language to be checked.

The following are the definitions of type expressions:


1.Basic types such as boolean, char, integer, real are type expressions. A special basic type,
type_error , will signal an error during type checking; void denoting “the absence of a value” allows
statements to be checked.
2.Since type expressions may be named, a type name is a type expression.
3. A type constructor applied to type expressions is a type expression.
Constructors include:
Arrays : If T is a type expression then array (I,T) is a type expression denoting the type
of an array with elements of type T and index set I.
Products : If T1 and T2 are type expressions, then their Cartesian product T1 X T2 is a
type expression.
Records : The difference between a record and a product is that the fields of a record have
names. The record type constructor will be applied to a tuple formed from field names and field types.
For example:

type row = record


address: integer;
lexeme: array[1..15] of char
end;
var table: array[1..101] of row;
declares the type name row representing the type expression record((address X
integer) X (lexeme X array(1..15,char))) and the variable table to be an array of records of
this type.

Pointers : If T is a type expression, then pointer(T) is a type expression denoting the type
“pointer to an object of type T”. For example, var p : ↑row declares variable p to have type
pointer(row).

Functions : A function in programming languages maps a domain type D to a range type R.


The type of such a function is denoted by the type expression D → R.

4. Type expressions may contain variables whose values are type expressions.
Tree representation for char X char → pointer(integer)

Type systems
• A type system is a collection of rules for assigning type expressions to the various parts of
a program.
• A type checker implements a type system. It is specified in a syntax-directed manner.
• Different type systems may be used by different compilers or processors of the same
language.
Static and Dynamic Checking of Types
• Checking done by a compiler is said to be static, while checking done when the target
program runs is termed dynamic.
• Any check can be done dynamically, if the target code carries the type of an element
along with the value of that element.

Sound type system


A sound type system eliminates the need for dynamic checking for type errors because it
allows us to determine statically that these errors cannot occur when the target program runs.
That is, if a sound type system assigns a type other than type_error to a program part, then
type errors cannot occur when the target code for the program part is run.

Strongly typed language


A language is strongly typed if its compiler can guarantee that the programs it accepts
will execute without type errors.

Error Recovery
•Since type checking has the potential for catching errors in programs, it is desirable for a
type checker to recover from errors, so it can check the rest of the input.
•Error handling has to be designed into the type system right from the start; the type
checking rules must be prepared to cope with errors.

SPECIFICATION OF A SIMPLE TYPE CHECKER


Here, we specify a type checker for a simple language in which the type of each identifier
must be declared before the identifier is used. The type checker is a translation scheme that
synthesizes the type of each expression from the types of its subexpressions. The type
checker can handle arrays, pointers, statements and functions.

A Simple Language

Consider the following grammar:

P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑

Translation scheme:

P → D ; E
D → D ; D
D → id : T              { addtype(id.entry, T.type) }
T → char                { T.type := char }
T → integer             { T.type := integer }
T → ↑ T1                { T.type := pointer(T1.type) }
T → array [ num ] of T1 { T.type := array(1..num.val, T1.type) }

In the above language,


· There are two basic types : char and integer ;
· type_error is used to signal errors;
· the prefix operator ↑ builds a pointer type. Example : ↑integer leads to the type expression
pointer(integer).
Type checking of Expressions

In the following rules, the attribute type for E gives the type expression assigned to the
expression generated by E.

1. E → literal { E.type := char }
   E → num    { E.type := integer }
Here, constants represented by the tokens literal and num have type char and integer.

2. E → id { E.type := lookup(id.entry) }
lookup(e) is used to fetch the type saved in the symbol table entry pointed to by e.

3. E → E1 mod E2 { E.type := if E1.type = integer and
                             E2.type = integer then integer
                             else type_error }
The expression formed by applying the mod operator to two subexpressions of type integer
has type integer; otherwise, its type is type_error.

4. E → E1 [ E2 ] { E.type := if E2.type = integer and
                             E1.type = array(s,t) then t
                             else type_error }
In an array reference E1 [ E2 ], the index expression E2 must have type integer. The result is
the element type t obtained from the type array(s,t) of E1.

5. E → E1 ↑ { E.type := if E1.type = pointer(t) then t
                        else type_error }
The postfix operator ↑ yields the object pointed to by its operand. The type of E1↑ is the type t
of the object pointed to by the pointer E1.

Type checking of Statements

Statements do not have values; hence the basic type void can be assigned to them. If an error
is detected within a statement, then type_error is assigned.

Translation scheme for checking the type of statements:

1. Assignment statement:
S → id := E { S.type := if id.type = E.type then void
                        else type_error }

2. Conditional statement:
S → if E then S1 { S.type := if E.type = boolean then S1.type
                             else type_error }

3. While statement:
S → while E do S1 { S.type := if E.type = boolean then S1.type
                              else type_error }

4. Sequence of statements:
S → S1 ; S2 { S.type := if S1.type = void and
                        S2.type = void then void
                        else type_error }

Type checking of functions

The rule for checking the type of a function application is :


E → E1 ( E2 ) { E.type := if E2.type = s and
                          E1.type = s → t then t
                          else type_error }
This rule says that in an expression formed by applying E1 to E2, the type of E1 must be a
function s → t from the type s of E2 to some range type t; the type of E1 ( E2 ) is t.
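These rules translate directly into a recursive checker. A small Python sketch, where the tuple encoding of expressions and the dict standing in for the symbol table are assumptions made for illustration:

def check_expr(e, symtab):
    """Returns a type expression or 'type_error', following rules 1-5 above."""
    kind = e[0]
    if kind == 'literal': return 'char'                  # rule 1
    if kind == 'num':     return 'integer'
    if kind == 'id':      return symtab[e[1]]            # rule 2: lookup(id.entry)
    if kind == 'mod':                                    # rule 3
        t1, t2 = check_expr(e[1], symtab), check_expr(e[2], symtab)
        return 'integer' if t1 == t2 == 'integer' else 'type_error'
    if kind == 'index':                                  # rule 4: E1 [ E2 ]
        t1, t2 = check_expr(e[1], symtab), check_expr(e[2], symtab)
        if t2 == 'integer' and isinstance(t1, tuple) and t1[0] == 'array':
            return t1[2]                                 # element type t
        return 'type_error'
    if kind == 'deref':                                  # rule 5: E1 ^
        t1 = check_expr(e[1], symtab)
        if isinstance(t1, tuple) and t1[0] == 'pointer':
            return t1[1]                                 # pointed-to type t
        return 'type_error'
    return 'type_error'

For example, with symtab = {'p': ('pointer', 'integer')}, the call
check_expr(('deref', ('id', 'p')), symtab) evaluates to 'integer'.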

SOURCE LANGUAGE ISSUES

Procedures:

A procedure definition is a declaration that associates an identifier with a statement. The


identifier is the procedure name, and the statement is the procedure body.
For example, the following is the definition of procedure named readarray :

procedure readarray;
var i : integer;
begin

for i : = 1 to 9 do read(a[i])
end;

When a procedure name appears within an executable statement, the procedure is said to be
called at that point.

Activation trees:

An activation tree is used to depict the way control enters and leaves activations. In an
activation tree,

1. Each node represents an activation of a procedure.
2. The root represents the activation of the main program.
3. The node for a is the parent of the node for b if and only if control flows from activation a to b.
4. The node for a is to the left of the node for b if and only if the lifetime of a occurs before the
lifetime of b.
Control stack:
A control stack is used to keep track of live procedure activations. The idea is to push the
node for an activation onto the control stack as the activation begins and to pop the node
when the activation ends.

The contents of the control stack are related to paths to the root of the activation tree.
When node n is at the top of control stack, the stack contains the nodes along the path
from n to the root.

The Scope of a Declaration:


A declaration is a syntactic construct that associates information with a name.
Declarations may be explicit, such as:

var i : integer ;
or they may be implicit. Example, any variable name starting with I is assumed to denote an
integer.

The portion of the program to which a declaration applies is called the scope
of that declaration.

Binding of names:

Even if each name is declared once in a program, the same name may denote different
data objects at run time. “Data object” corresponds to a storage location that holds values.

The term environment refers to a function that maps a name to a storage location.
The term state refers to a function that maps a storage location to the value held there.

        environment            state
name  ------------->  storage  ----->  value

When an environment associates storage location s with a name x, we say that x is bound to s.
This association is referred to as a binding of x.

STORAGE ORGANISATION

· The executing target program runs in its own logical address space in which each
program value has a location.
· The management and organization of this logical address space is shared between the
compiler, operating system and target machine. The operating system maps the logical
addresses into physical addresses, which are usually spread throughout memory.

Typical subdivision of run-time memory:

Code
Static Data
Stack
free memory
Heap

· Run-time storage comes in blocks, where a byte is the smallest unit of addressable
memory. Four bytes form a machine word. Multibyte objects are stored in consecutive
bytes and given the address of the first byte.
· The storage layout for data objects is strongly influenced by the addressing constraints of
the target machine.
· A character array of length 10 needs only enough bytes to hold 10 characters, but a
compiler may allocate 12 bytes to get alignment, leaving 2 bytes unused.


· This unused space due to alignment considerations is referred to as padding.
· The size of some program objects may be known at compile time; these may be placed in an
area called static.
· The dynamic areas used to maximize the utilization of space at run time are the stack and
the heap.

Activation records:
· Procedure calls and returns are usually managed by a run time stack called the control
stack
· Each live activation has an activation record on the control stack, with the record for the root
of the activation tree at the bottom; the record for the most recent activation sits at the top of the stack.
· The contents of the activation record vary with the language being implemented. The
diagram below shows the contents of activation record.
· Temporary values such as those arising from the evaluation of expressions.
· Local data belonging to the procedure whose activation record this is.
· A saved machine status, with information about the state of the machine just before the
call to procedures.
· An access link may be needed to locate data needed by the called procedure but found
elsewhere.
· A control link pointing to the activation record of the caller.
· Space for the return value of the called functions, if any. Again, not all called procedures
return a value, and if one does, we may prefer to place that value in a register for
efficiency.
· The actual parameters used by the calling procedure. These are not placed in activation
record but rather in registers, when possible, for greater efficiency.

STORAGE ALLOCATION STRATEGIES


The different storage allocation strategies are :

1. Static allocation – lays out storage for all data objects at compile time
2. Stack allocation – manages the run-time storage as a stack.
3. Heap allocation – allocates and deallocates storage as needed at run time from a data area
known as heap.
STATIC ALLOCATION
· In static allocation, names are bound to storage as the program is compiled, so there is no
need for a run-time support package.
· Since the bindings do not change at run-time, everytime a procedure is activated, its
names are bound to the same storage locations.
· Therefore values of local names are retained
across activations of a procedure. That is,
when control returns to a procedure the values of the locals are the same as they were
when control left the last time.
· From the type of a name, the compiler decides the amount of storage for the name and
decides where the activation records go. At compile time, we can fill in the addresses at
which the target code can find the data it operates on.
STACK ALLOCATION OF SPACE

· All compilers for languages that use procedures, functions or methods as units of user-
defined actions manage at least part of their run-time memory as a stack.
· Each time a procedure is called , space for its local variables is pushed onto a stack, and
when the procedure terminates, that space is popped off the stack.

Calling sequences:
· Procedure calls are implemented by what is known as a calling sequence, which consists
of code that allocates an activation record on the stack and enters information into its
fields.
· A return sequence is similar code that restores the state of the machine so the calling
procedure can continue its execution after the call.
· The code in calling sequence is often divided between the calling procedure (caller) and
the procedure it calls (callee).
· When designing calling sequences and the layout of activation records, the following
principles are helpful:

• Values communicated between caller and callee are generally placed at the
beginning of the callee’s activation record, so they are as close as possible to the
caller’s activation record.
• Fixed length items are generally placed in the middle. Such items typically include
the control link, the access link, and the machine status fields.
• Items whose size may not be known early enough are placed at the end of the
activation record. The most common example is dynamically sized array, where the
value of one of the callee’s parameters determines the length of the array.
• We must locate the top-of-stack pointer judiciously. A common approach is to have
it point to the end of fixed-length fields in the activation record. Fixed-length data
can then be accessed by fixed offsets, known to the intermediate-code generator,
relative to the top-of-stack pointer.
Division of tasks between caller and callee:
The caller's and callee's activation records sit adjacent on the stack. Within the callee's
record, the parameters and returned values and the control link are the caller's
responsibility; the links and saved status fields and the temporaries and local data that
follow are the callee's responsibility. The top_sp pointer marks the end of the fixed-length
(links and saved status) fields of the callee's record.

· The calling sequence and its division between caller and callee are as follows:
  • The caller evaluates the actual parameters.
  • The caller stores a return address and the old value of top_sp into the callee's
    activation record. The caller then increments top_sp to the respective position.
  • The callee saves the register values and other status information.
  • The callee initializes its local data and begins execution.
· A suitable, corresponding return sequence is:
  • The callee places the return value next to the parameters.
  • Using the information in the machine-status field, the callee restores top_sp and
    other registers, and then branches to the return address that the caller placed in
    the status field.
  • Although top_sp has been decremented, the caller knows where the return value
    is, relative to the current value of top_sp; the caller therefore may use that value.

Variable-length data on the stack:

· The run-time memory management system must deal frequently with the allocation of
space for objects whose sizes are not known at compile time, but which are local to a
procedure and thus may be allocated on the stack.
· The reason to prefer placing objects on the stack is that we avoid the expense of garbage
collecting their space.
· The same scheme works for objects of any type if they are local to the procedure called
and have a size that depends on the parameters of the call.

Access to dynamically allocated arrays:
The activation record for procedure p holds a control link and pointers to arrays A, B and C;
the arrays themselves are stored on the stack just after p's record. top_sp locates the
fixed-length fields of the top activation record, while top marks the actual top of stack, the
position at which the activation record of a procedure q called by p (and q's arrays) would begin.

· Procedure p has three local arrays, whose sizes cannot be determined at compile time.
The storage for these arrays is not part of the activation record for p.
· Access to the data is through two pointers, top and top_sp. Here top marks the actual
top of stack; it points to the position at which the next activation record will begin.
· The second, top_sp, is used to find local, fixed-length fields of the top activation record.
· The code to reposition top and top_sp can be generated at compile time, in terms of sizes
that will become known at run time.
HEAP ALLOCATION

Stack allocation strategy cannot be used if either of the following is possible :

1. The values of local names must be retained when an activation ends.


2. A called activation outlives the caller.
· Heap allocation parcels out pieces of contiguous storage, as needed for activation records
or other objects.
· Pieces may be deallocated in any order, so over the time the heap will consist of alternate
areas that are free and in use.

Activation records in the heap:
· The record for an activation of procedure r is retained when the activation ends.
· Therefore, the record for the new activation q(1, 9) cannot follow that for s physically.
· If the retained activation record for r is deallocated, there will be free space in the heap
between the activation records for s and q.
UNIT IV -CODE GENERATION
The final phase in compiler model is the code generator. It takes as input an intermediate
representation of the source program and produces as output an equivalent target program. The
code generation techniques presented below can be used whether or not an optimizing phase
occurs before code generation.
Position of code generator

ISSUES IN THE DESIGN OF A CODE GENERATOR


The following issues arise during the code generation phase :
1. Input to code generator
2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order
1. Input to code generator:
· The input to the code generator consists of the intermediate representation of the source program
produced by the front end, together with information in the symbol table to determine the run-time
addresses of the data objects denoted by the names in the intermediate representation.
Intermediate representation can be :
a. Linear representation such as postfix notation
b. Three address representation such as quadruples
c. Virtual machine representation such as stack machine code
d. Graphical representations such as syntax trees and dags.
· Prior to code generation, the source program has been scanned, parsed and translated into
intermediate representation, along with the necessary type checking. Therefore, the input to code
generation is assumed to be error-free.
2. Target program:
· The output of the code generator is the target program. The output may be :
a. Absolute machine language
-It can be placed in a fixed memory location and can be executed immediately.
b. Relocatable machine language
-It allows subprograms to be compiled separately.
c. Assembly language
-Code generation is made easier.
3. Memory management:
· Names in the source program are mapped to addresses of data objects in run-time
memory by the front end and code generator.
· It makes use of symbol table, that is, a name in a three-address statement refers to a
symbol-table entry for the name.
· Labels in three-address statements have to be converted to addresses of instructions.
For example,
j: goto i generates jump instruction as follows :
• if i < j, a backward jump instruction with target address equal to location of code for quadruple i
is generated.
• if i > j, the jump is forward. We must store on a list for quadruple i the location of the first machine
instruction generated for quadruple j.
When i is processed, the machine locations for all instructions that forward jumps to i are filled.
4. Instruction selection:
· The instruction set of the target machine should be complete and uniform.
· Instruction speeds and machine idioms are important factors when efficiency of the target program
is considered.
· The quality of the generated code is determined by its speed and size.
· For example, every three-address statement of the form x := y + z can be translated into the
sequence MOV y, R0 ; ADD z, R0 ; MOV R0, x, although such statement-by-statement translation
often produces redundant loads and stores.

5. Register allocation
· Instructions involving register operands are shorter and faster than those involving operands in memory.
· The use of registers is subdivided into two subproblems :
•Register allocation – the set of variables that will reside in registers at a point in the program is
selected.
•Register assignment – the specific register that a variable will reside in is picked.
· Certain machine requires even-odd register pairs for some operands and results.
For example, consider the division instruction of the form:
D x, y
where x, the dividend, occupies the even register of an even/odd register pair, and y is the
divisor. After division, the even register holds the remainder and the odd register holds the
quotient.
6.Evaluation order
· The order in which the computations are performed can affect the efficiency of the target code. Some
computation orders require fewer registers to hold intermediate results than others.

TARGET MACHINE
· Familiarity with the target machine and its instruction set is a prerequisite for designing a good code
generator.
· The target computer is a byte-addressable machine with 4 bytes to a word.
· It has n general-purpose registers, R0, R1, . . . , Rn-1.
· It has two-address instructions of the form:
op source, destination
where, op is an op-code, and source and destination are data fields.
· It has the following op-codes :
MOV (move source to destination)
ADD (add source to destination)
SUB (subtract source from destination)
· The source and destination of an instruction are specified by combining registers and
memory locations with address modes.

Address modes with their assembly-language forms


·For example :
MOV R0, M stores contents of Register R0 into memory location M ;
MOV 4(R0), M stores the value contents(4+contents(R0)) into M.

Instruction costs :
·Instruction cost = 1+cost for source and destination address modes. This cost corresponds to the length
of the instruction.
· Address modes involving registers have cost zero.
· Address modes involving memory location or literal have cost one.
· Instruction length should be minimized if space is important. Doing so also minimizes the time taken to
fetch and perform the instruction.
For example : MOV R0, R1 copies the contents of register R0 into R1. It has cost one,
since it occupies only one word of memory.
· The three-address statement a : = b + c can be implemented by many different instruction sequences :
i) MOV b, R0
ADD c, R0 cost = 6
MOV R0, a

ii) MOV b, a
ADD c, a cost = 6

iii) Assuming R0, R1 and R2 contain the addresses of a, b, and c :


MOV *R1, *R0
ADD *R2, *R0 cost = 2

·In order to generate good code for target machine, we must utilize its addressing capabilities efficiently.
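The cost rule is mechanical enough to compute automatically. A short sketch; classifying an operand by its spelling is an assumption about how operands are written, not part of the cost model:

def operand_cost(op):
    """Registers (including indirect registers) cost 0; memory or literals cost 1."""
    op = op.strip()
    return 0 if op.lstrip('*').startswith('R') else 1

def instruction_cost(instr):
    """1 for the instruction word, plus the cost of each address mode."""
    _, operands = instr.split(None, 1)
    return 1 + sum(operand_cost(o) for o in operands.split(','))

# instruction_cost("MOV b, R0")    == 2, so sequence (i) above costs 6
# instruction_cost("MOV *R1, *R0") == 1, so sequence (iii) above costs 2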

RUN-TIME STORAGE MANAGEMENT


·Information needed during an execution of a procedure is kept in a block of storage called an activation
record, which includes storage for names local to the procedure.
· The two standard storage allocation strategies are:
1.Static allocation
2. Stack allocation
· In static allocation, the position of an activation record in memory is fixed at compile time.
· In stack allocation, a new activation record is pushed onto the stack for each execution of a procedure.
The record is popped when the activation ends.
· The following three-address statements are associated with the run-time allocation and
deallocation of activation records:
1. Call,
2. Return,
3. Halt, and
4. Action, a placeholder for other statements.
· We assume that the run-time memory is divided into areas for:
1. Code
2. Static data
3. Stack
Static allocation
Implementation of call statement:
The codes needed to implement static allocation are as follows:
MOV #here + 20, callee.static_area /*It saves return address*/
GOTO callee.code_area /*It transfers control to the target code for the called procedure */
where, callee.static_area – Address of the activation record
callee.code_area – Address of the first instruction for called procedure
#here + 20 – Literal return address which is the address of the instruction following GOTO.

Implementation of return statement:


A return from procedure callee is implemented by :
GOTO *callee.static_area
This transfers control to the address saved at the beginning of the activation record.

Implementation of action statement:


The instruction ACTION is used to implement action statement.

Implementation of halt statement:


The statement HALT is the final instruction that returns control to the operating system.

Stack allocation
Static allocation can become stack allocation by using relative addresses for storage in activation records.
In stack allocation, the position of activation record is stored in register so words in activation records
can be accessed as offsets from the value in this register.
The codes needed to implement stack allocation are as follows:
Initialization of stack:
MOV #stackstart , SP /* initializes stack */
Code for the first procedure
HALT /* terminate execution */

Implementation of Call statement:


ADD #caller.recordsize, SP /* increment stack pointer */
MOV #here + 16, *SP /*Save return address */
GOTO callee.code_area
where, caller.recordsize – size of the activation record
#here + 16 – address of the instruction following the GOTO

Implementation of Return statement:


GOTO *0 ( SP ) /*return to the caller */
SUB #caller.recordsize, SP /* decrement SP and restore to previous value */

BASIC BLOCKS AND FLOW GRAPHS


Basic Blocks
·A basic block is a sequence of consecutive statements in which flow of control enters at the beginning
and leaves at the end without any halt or possibility of branching except at the end.
·The following sequence of three-address statements forms a basic block:
t1 : = a * a
t2 : = a * b
t3 : = 2 * t2
t4 : = t1 + t3
t5 : = b * b
t6 : = t4 + t5

Basic Block Construction:


Algorithm: Partition into basic blocks
Input: A sequence of three-address statements
Output: A list of basic blocks with each three-address statement in exactly one block
Method:
1. We first determine the set of leaders, the first statements of basic blocks. The rules we use are
the following:
a. The first statement is a leader.
b. Any statement that is the target of a conditional or unconditional goto is a leader.
c. Any statement that immediately follows a goto or conditional goto statement is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not including the next
leader or the end of the program.
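A direct transcription of these rules in Python (assuming statements are strings and the set of jump-target indices has already been extracted):

def partition_blocks(stmts, jump_targets):
    """stmts: list of three-address statements (strings);
    jump_targets: indices targeted by some goto or conditional goto."""
    leaders = {0} | set(jump_targets)                 # rules (a) and (b)
    for i, s in enumerate(stmts):
        if 'goto' in s and i + 1 < len(stmts):
            leaders.add(i + 1)                        # rule (c)
    starts = sorted(leaders)
    return [stmts[a:b] for a, b in zip(starts, starts[1:] + [len(stmts)])]

On the dot-product code below, statement (3) is a jump target, so the leaders are
statements (1) and (3), giving the two blocks shown after the code.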
· Consider the following source code for the dot product of two vectors a and b of length 20:

begin
    prod := 0;
    i := 1;
    do begin
        prod := prod + a[i] * b[i];
        i := i + 1;
    end
    while i <= 20
end

· The three-address code for the above source program is given as:
(1) prod := 0
(2) i := 1
(3) t1 := 4* i
(4) t2 := a[t1] /*compute a[i] */
(5) t3 := 4* i
(6) t4 := b[t3] /*compute b[i] */
(7) t5 := t2*t4
(8) t6 := prod+t5
(9) prod := t6
(10) t7 := i+1
(11) i := t7
(12) if i<=20 goto (3)

Basic block 1: Statement (1) to (2)


Basic block 2: Statement (3) to (12)
Transformations on Basic Blocks:
A number of transformations can be applied to a basic block without changing the set of expressions
computed by the block. Two important classes of transformation are :
· Structure-preserving transformations
·Algebraic transformations
1. Structure preserving transformations:
a) Common subexpression elimination:

Before              After
a := b + c          a := b + c
b := a - d          b := a - d
c := b + c          c := b + c
d := a - d          d := b
Since the second and fourth statements compute the same expression, the basic block can be
transformed as above.
b) Dead-code elimination:
Suppose x is dead, that is, never subsequently used, at the point where the statement x : = y + z appears
in a basic block. Then this statement may be safely removed without changing the value of the basic
block.
c) Renaming temporary variables:
A statement t : = b + c ( t is a temporary ) can be changed to u : = b + c (u is a new temporary) and all
uses of this instance of t can be changed to u without changing the value of the basic block. Such a block
is called a normal-form block.
d) Interchange of statements:
Suppose a block has the following two adjacent statements:
t1 : = b + c
t2 : = x + y
We can interchange the two statements without affecting the value of the block if and only if neither x
nor y is t1 and neither b nor c is t2.

2. Algebraic transformations:
Algebraic transformations can be used to change the set of expressions computed by a basic block into
an algebraically equivalent set.
Examples:
i) x : = x + 0 or x : = x * 1 can be eliminated from a basic block without changing the set of expressions
it computes.
ii) The exponential statement x : = y * * 2 can be replaced by x : = y * y.
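A tiny pass implementing just these two transformations, over statements encoded as (x, op, y, z) tuples (the encoding is an assumption made for illustration):

def algebraic_simplify(stmts):
    """Drop x := x + 0 and x := x * 1; rewrite x := y ** 2 as x := y * y."""
    out = []
    for (x, op, y, z) in stmts:
        if x == y and (op, z) in {('+', '0'), ('*', '1')}:
            continue                          # identity statement: eliminate it
        if op == '**' and z == '2':
            out.append((x, '*', y, y))        # replace exponentiation by multiply
        else:
            out.append((x, op, y, z))
    return out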

Flow Graphs:
• Flow graph is a directed graph containing the flow-of-control information for the set of basic
blocks making up a program.
• The nodes of the flow graph are basic blocks. It has a distinguished initial node.
E.g.: Flow graph for the vector dot product is given as follows:

-B1 is the initial node. B2 immediately follows B1, so there is an edge from B1 to B2. The target of jump
from last statement of B1 is the first statement B2, so there is an edge from B1 (last statement) to B2 (first
statement).
-B1 is the predecessor of B2, and B2 is a successor of B1.
Loops:
A loop is a collection of nodes in a flow graph such that
1.All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
A loop that contains no other loops is called an inner loop.

NEXT-USE INFORMATION
· If the name in a register is no longer needed, then we remove the name from the register
and the register can be used to store some other names.

Input: Basic block B of three-address statements
Output: At each statement i : x := y op z, we attach to i the liveness and next-use information
of x, y and z.
Method: We start at the last statement of B and scan backwards.
1. Attach to statement i the information currently found in the symbol table regarding the
next use and liveness of x, y and z.
2. In the symbol table, set x to “not live” and “no next use”.
3. In the symbol table, set y and z to “live”, and the next uses of y and z to i.
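The backward scan fits in a few lines. A Python sketch, where statements as (x, y, z) name triples and a dict standing in for the symbol table are assumptions:

def next_use(block, live_on_exit):
    """For each statement i: x = y op z, record liveness/next use of x, y, z."""
    table = {n: (True, None) for n in live_on_exit}   # name -> (live, next use)
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):           # scan backwards
        x, y, z = block[i]
        info[i] = {n: table.get(n, (False, None)) for n in (x, y, z)}  # step 1
        table[x] = (False, None)                      # step 2: x not live, no next use
        table[y] = (True, i)                          # step 3: y and z live,
        table[z] = (True, i)                          #         next use at statement i
    return info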

Symbol Table:

A SIMPLE CODE GENERATOR


·A code generator generates target code for a sequence of three-address statements and effectively uses
registers to store operands of the statements.
· For example: consider the three-address statement a := b+c
It can have the following sequence of codes:
ADD Rj, Ri Cost = 1 // if Ri contains b and Rj contains c
(or)
ADD c, Ri Cost = 2 // if c is in a memory location
(or)
MOV c, Rj Cost = 3 // move c from memory to Rj and add
ADD Rj, Ri
Register and Address Descriptors:
·A register descriptor is used to keep track of what is currently in each registers. The register descriptors
show that initially all the registers are empty.
· An address descriptor stores the location where the current value of the name can be found at run time.
A code-generation algorithm:
The algorithm takes as input a sequence of three-address statements constituting a basic block.
For each three-address statement of the form x : = y op z, perform the following actions:
1. Invoke a function getreg to determine the location L where the result of the computation y op
z should be stored.
2. Consult the address descriptor for y to determine y’, the current location of y. Prefer the
register for y’ if the value of y is currently both in memory and a register. If the value of y is not already
in L, generate the instruction MOV y’, L to place a copy of y in L.
3. Generate the instruction OP z’ , L where z’ is a current location of z. Prefer a register to a
memory location if z is in both. Update the address descriptor of x to indicate that x is in location L. If x
is in L, update its descriptor and remove x from all other descriptors.
4. If the current values of y or z have no next uses, are not live on exit from the block, and are in
registers, alter the register descriptor to indicate that, after execution of x : = y op z , those registers will
no longer contain y or z.
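A compressed sketch of the four steps in Python. The register and address descriptors are dicts; the naive getreg policy of grabbing the first free register, and the omission of step 4's freeing of dead registers, are simplifications:

def gen_block(block, registers):
    """block: list of (x, op, y, z) three-address statements; returns target code."""
    reg = {r: None for r in registers}    # register descriptor: register -> name held
    addr = {}                             # address descriptor: name -> current location
    code = []

    def loc(n):                           # current location of n (memory if unknown)
        return addr.get(n, n)

    def getreg(x):                        # step 1 (naive policy)
        for r in registers:
            if reg[r] in (None, x):
                return r
        return x                          # no free register: use x's memory cell

    for (x, op, y, z) in block:
        L = getreg(x)
        if loc(y) != L:
            code.append(f"MOV {loc(y)}, {L}")     # step 2: bring y into L
        code.append(f"{op} {loc(z)}, {L}")        # step 3: apply op with z
        if L in reg:
            reg[L] = x
        addr[x] = L                               # x now lives in L
    return code

For instance, gen_block([('t','SUB','a','b'), ('u','SUB','a','c')], ['R0','R1'])
yields ['MOV a, R0', 'SUB b, R0', 'MOV a, R1', 'SUB c, R1'].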

Generating Code for Assignment Statements:


·The assignment d : = (a-b) + (a-c) + (a-c) might be translated into the following three-
address code sequence:

t : = a –b
u := a –c
v:= t + u
d := v + u
with d live at the end.

Code sequence for the example is:

Statement       Code generated
t := a - b      MOV a, R0
                SUB b, R0
u := a - c      MOV a, R1
                SUB c, R1
v := t + u      ADD R1, R0
d := v + u      ADD R1, R0
                MOV R0, d

Generating Code for Indexed Assignments


The table shows the code sequences generated for the indexed assignment statements
a : = b [ i ] and a [ i ] : =b

Statements      Code Generated      Cost
a := b[i]       MOV b(Ri), R        2
a[i] := b       MOV b, a(Ri)        3

Generating Code for Pointer Assignments


The table shows the code sequences generated for the pointer assignments a : = *p and *p : = a
Statements      Code Generated      Cost
a := *p         MOV *Rp, a          2
*p := a         MOV a, *Rp          2

Generating Code for Conditional Statements


Statement               Code
if x < y goto z         CMP x, y
                        CJ< z   /* jump to z if condition code is negative */

x := y + z              MOV y, R0
if x < 0 goto z         ADD z, R0
                        MOV R0, x
                        CJ< z

THE DAG REPRESENTATION FOR BASIC BLOCKS


·A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels to store the computed values.
· DAGs are useful data structures for implementing transformations on basic blocks.
· It gives a picture of how the value computed by a statement is used in subsequent
statements.
· It provides a good way of determining common sub-expressions.

Algorithm for construction of DAG


Input: A basic block
Output: A DAG for the basic block containing the following information:
1.A label for each node. For leaves, the label is an identifier. For interior nodes, an operator
symbol.
2. For each node a list of attached identifiers to hold the computed values.
Case (i)   x := y OP z
Case (ii)  x := OP y
Case (iii) x := y

Method:
Step 1: If y is undefined then create node(y).
If z is undefined, create node(z) for case(i).
Step 2: For the case(i), create a node(OP) whose left child is node(y) and right child is node(z). (
Checking for common sub expression). Let n be this node.
For case(ii), determine whether there is node(OP) with one child node(y). If not create such a node.
For case(iii), node n will be node(y).
Step 3: Delete x from the list of identifiers for node(x). Append x to the list of attached identifiers
for the node n found in step 2 and set node(x) to n.
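A sketch of case (i) only, using dicts for the node() function and the attached-identifier lists (both encodings are assumptions made for illustration):

def dag_statement(x, op, y, z, node, idents):
    """Process x := y OP z. node maps names/keys to DAG nodes;
    idents maps each node to its list of attached identifiers."""
    for leaf in (y, z):                       # step 1: create missing leaves
        if leaf not in node:
            node[leaf] = ('leaf', leaf)
            idents.setdefault(node[leaf], [])
    n = (op, node[y], node[z])                # step 2: common-subexpression check
    if n not in idents:
        idents[n] = []                        # create node(OP) only if absent
    old = node.get(x)                         # step 3: detach x from its old node
    if old is not None and x in idents.get(old, []):
        idents[old].remove(x)
    idents[n].append(x)                       # attach x to node n
    node[x] = n

Feeding it t1 := 4 * i and then t3 := 4 * i attaches both t1 and t3 to the same
('*', ...) node, exposing the common subexpression.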

Example: Consider the block of three-address statements:

1. t1:=4*i
2. t2 := a[t1]
3. t3:=4*i
4. t4 := b[t3]
5. t5 := t2*t4
6. t6 := prod+t5
7. prod := t6
8. t7:=i+1
9. i :=t7
10. if i<=20 goto (1)

Stages in DAG Construction


Application of DAGs:
1. We can automatically detect common sub expressions.
2. We can determine which identifiers have their values used in the block.
3. We can determine which statements compute values that could be used outside the block.

GENERATING CODE FROM DAGs


The advantage of generating code for a basic block from its dag representation is that from a dag we
can more easily see how to rearrange the order of the final computation sequence than we can starting
from a linear sequence of three-address statements or quadruples.
Rearranging the order
The order in which computations are done can affect the cost of resulting object code.
For example, consider the following basic block:

t1 : = a + b
t2 : = c + d
t3 : = e – t2
t4 : = t1 – t3
Generated code sequence for basic block:
MOV a , R0
ADD b , R0
MOV c , R1
ADD d , R1
MOV R0 , t1
MOV e , R0
SUB R1 , R0
MOV t1 , R1
SUB R0 , R1
MOV R1 , t4

Rearranged basic block:

t2 : = c + d
t3 : = e – t2
t1 : = a + b
t4 : = t1 – t3

Now t1 occurs immediately before t4.

Revised code sequence:


MOV c , R0
ADD d , R0
MOV e , R1
SUB R0 , R1
MOV a , R0
ADD b , R0
SUB R1 , R0
MOV R0 , t4
In this order, the two instructions MOV R0 , t1 and MOV t1 , R1 have been saved.

A Heuristic ordering for Dags


The heuristic ordering algorithm attempts to make the evaluation of a node immediately follow the
evaluation of its leftmost argument.

The algorithm shown below produces the ordering in reverse.


Algorithm:
1) while unlisted interior nodes remain do begin
2) select an unlisted node n, all of whose parents have been listed;
3) list n;
4) while the leftmost child m of n has no unlisted parents and is not a leaf do
begin
5) list m;
6) n:=m
end
end
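In code, assuming each interior node carries its list of parents and an ordered child list (leaves never enter the listing):

def heuristic_order(interior, parents, children):
    """Returns the node listing produced in reverse by the algorithm above."""
    listed, unlisted = [], set(interior)
    while unlisted:                                        # line (1)
        n = next(x for x in unlisted                       # line (2)
                 if all(p in listed for p in parents[x]))
        listed.append(n)                                   # line (3)
        unlisted.remove(n)
        while True:                                        # line (4)
            m = children[n][0]                             # leftmost child of n
            if m not in unlisted or any(p not in listed for p in parents[m]):
                break          # m is a leaf/already listed, or has an unlisted parent
            listed.append(m)   # line (5)
            unlisted.remove(m)
            n = m              # line (6)
    return listed              # evaluate the nodes in the reverse of this list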
Consider the DAG shown below,
Initially, the only node with no unlisted parents is 1 so set n=1 at line (2) and list 1 at line (3).
Now, the left argument of 1, which is 2, has its parents listed, so we list 2 and set n=2 at line (6). Now, at
line (4) we find the leftmost child of 2, which is 6, has an unlisted parent 5. Thus we select a new n at
line (2), and node 3 is the only candidate. We list 3 and proceed down its left chain, listing 4, 5 and 6.
This leaves only 8 among the interior nodes so we list that. The resulting list is 1234568 and the order of
evaluation is 8654321.

Code sequence:

t8 : = d + e
t6 : = a + b
t5 : = t6 – c
t4 : = t5 * t8
t3 : = t4 – e
t2 : = t6 + t4
t1 : = t2 * t3

This will yield optimal code for the DAG on a machine, whatever the number of registers.
