System Software Notes
SYSTEM SOFTWARE
STUDY MATERIAL
INDEX
UNIT – I
1.1 Analysis of the source program
1.2 Phases of a compiler
1.3 Cousins of the Compiler
1.4 Grouping of Phases
1.5 Compiler Construction tools
1.6 Lexical Analysis
1.6.1 The role of the lexical analyzer
1.6.2 Input Buffering
1.6.3 Specification of tokens
1.6.4 Recognition of tokens.
UNIT – II
2.1 Role of the parser
2.2 Context free grammars
2.3 Top down parsing:
2.3.1 Recursive Descent Parsing
2.3.2 Predictive parsing
2.4 Bottom up parsing:
2.4.1 Handles
2.4.2 Handle pruning
2.4.3 Stack implementation of shift-reduce parsing.
UNIT – III
3.1 Intermediate Code Generation
3.2 Intermediate Languages
3.3 Graphical Representations
3.4 Three-Address Code
3.5 Implementations of three address statements
3.6 Code Generation
3.7 Issues In The Design Of A Code Generator
3.8 Run-Time Storage Management
UNIT – IV
4.1 Elements of assembly language programming
4.2 Assembly Language Statements:
4.3 Advantage of assembly Language:
4.4 Interpreters
4.4.1 Uses Of Interpreters
4.4.2 Overview Of Interpretation
4.4.3 Pure And Impure Interpreters
4.5 Macros And Macro Processors
4.5.1 Macro definition and call
4.6 Macro expansion
4.6.1 Lexical substitution
4.6.2 Positional Parameters
4.6.3 Keyword Parameters
4.7 Nested Macro Calls
4.8 Advanced Macro Facilities
UNIT – V
5.1 Linkers
5.2 Relocation And Linking Concepts
5.2.1 Program Relocation
5.2.2 Linking
5.2.3 Binary Programs
5.2.4 Object Module
5.3 Self-Relocating Programs
5.4 Linking For Overlays
5.5 Loaders
5.6 Software Tools
5.6.1 Software Tools For Program Development
5.6.1.1 Program Design And Coding
5.6.1.2 Program Entry And Editing
5.6.1.3 Program Testing And Debugging
5.6.1.4 Enhancement Of Program Performance
5.6.1.5 Program Documentation
5.6.1.6 Design Of Software Tools
5.7 Editors
5.7.1 Screen Editors
5.7.2 Word Processors
5.7.3 Structure Editors
5.7.4 Design Of An Editor
5.8 Debug Monitors
5.8.1 Testing Assertions
UNIT - I
SYLLABUS
Compilers – Analysis of the source program – Phases of a compiler –Cousins of the
Compiler – Grouping of Phases-Compiler Construction tools.
Lexical analysis: The role of the lexical analyzer – Input Buffering – Specification of tokens – Recognition of tokens.
COMPILERS
* A compiler is a program that reads a program written in one language – the source language – and translates it into an equivalent program in another language – the target language. As an important part of this translation process, the compiler reports to its user the presence of errors in the source program (error messages).
There are two parts to compilation:
i) Analysis, which breaks up the source program into constituent pieces and creates an intermediate representation of it.
ii) Synthesis, which constructs the desired target program from the intermediate representation.
Each node represents an operation and the children of a node represent the arguments of the operation. For example, an assignment statement in Pascal is given as position := initial + rate * 60. The syntax tree is given below:

            :=
           /  \
    position   +
              / \
        initial  *
                / \
            rate   60
ii) Pretty printers
A pretty printer analyzes a program and prints it in such a way that the structure of the program becomes clearly visible.
Example: comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.
iv) Interpreters
Instead of producing a target program as a translation, an interpreter performs the operations implied by the source program on a syntax tree such as:

            :=
           /  \
    position   +
              / \
        initial  *
                / \
            rate   60

At the root, it would discover it had an assignment to perform, so it would call a routine to evaluate the expression on the right and store the result in the identifier position. At the right child of the root, the routine would discover it had to compute the sum of two expressions. It would call itself recursively to compute the value of rate * 60, and would then add that value to the value of the variable initial.
EXAMPLES OF ANALYSIS
i) Text formatters
ii) Assemblers

1. Lexical Analysis
The stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters having a collective
meaning. For example, position := initial + rate * 60
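As a sketch, this grouping of characters into tokens can be carried out by a small scanner. The token names (id, num, assign, op) and the use of Python regular expressions here are illustrative assumptions, not part of the text:

```python
import re

# Minimal sketch of lexical analysis: group characters into (token, lexeme) pairs.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),   # identifiers
    ("num",    r"\d+"),                    # integer literals
    ("assign", r":="),                     # assignment operator
    ("op",     r"[+\-*/]"),                # arithmetic operators
    ("ws",     r"\s+"),                    # white space (discarded)
]
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokenize(source):
    """Scan the character stream left to right and collect tokens."""
    tokens = []
    for m in PATTERN.finditer(source):
        if m.lastgroup != "ws":            # eliminate blanks
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("position := initial + rate * 60"))
```

Running the scanner on the example statement yields the token stream id, :=, id, +, id, *, num, each paired with its lexeme.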
2. Syntax Analysis
The characters or tokens are grouped hierarchically into nested collections with collective meaning. It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output.
The grammatical phrases of the source program are represented by a parse tree. Lexical constructs do not require recursion, while syntactic constructs do. Context-free grammars are a formalization of recursive rules that can be used to guide syntactic analysis. For example, consider the parse tree for position := initial + rate * 60.
[Parse tree: the root, assignment statement, has children identifier (position), :=, and expression; the expression subtree groups initial + rate * 60, with rate * 60 grouped first.]

Once the characters forming a token have been grouped, they are removed from the input so that processing of the next token can begin, and the lexeme is recorded in a table.
3. Semantic Analysis
The semantic analysis phase checks the source program for semantic errors. It uses the hierarchical structure determined by the syntax analysis phase to identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. For example, when a binary arithmetic operator is applied to an integer and a real, the compiler may need to convert the integer to a real.
ANALYSIS IN TEXT FORMATTERS
In the TEX typesetting system, consecutive characters not separated by "white space" are grouped into words consisting of a sequence of horizontally arranged boxes. Boxes in TEX may be built from smaller boxes arranged horizontally and vertically. For example, \hbox{ \vbox{ ! 1 } \vbox{ @ 2 } } places the two vertical boxes

! @
1 2

side by side. The EQN preprocessor for mathematics builds expressions from operators like sub and sup for subscripts and superscripts, applying the rule BOX sub box recursively. For example, a sub { i sup 2 } produces a with the subscript i², i.e. a_{i²}.
1.2 THE PHASES OF A COMPILER
Each of the phases transforms the source program from one representation to
another.
SYMBOL-TABLE MANAGEMENT
The symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to store and retrieve data from a record quickly.
Example: int position, initial, rate;
Each identifier in this declaration is entered into the symbol table when the lexical analyzer sees it in the source program.

ERROR DETECTION AND REPORTING
Each phase can encounter errors. A compiler that simply stops when it finds the first error is not as helpful as it could be. Errors where the token stream violates the structure rules of the language are determined by the syntax analysis phase.

The lexical analyzer returns tokens such as id whose lexical value is a pointer into the symbol table. The character sequence forming a token is called the lexeme for the token. Syntax analysis imposes a hierarchical structure on the token stream, represented by syntax trees.
CODE OPTIMIZATION
The code optimization phase attempts to improve the intermediate code, so that the running time of the target program improves, without slowing down compilation.

CODE GENERATION
The final phase of the compiler generates the target program.

[Fig. Phases of a compiler: the source program passes through the lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, code optimizer and code generator, producing the target program; symbol-table management and the error handler interact with all phases.]
1.3 COUSINS OF THE COMPILER

1. Preprocessors
* Preprocessors produce input to compilers.
* They perform the following functions.
a) Macro processing
A preprocessor may allow a user to define macros that are shorthands for longer
constructs.
Eg: # define M 6
main()
{ …
total = M*value;
printf(“m=%d”,M);
}
In the above example M is macro.
b) File inclusion
A preprocessor may include header files into the program text. For example, the preprocessor causes the contents of the file global.h to replace the statement #include <global.h> when it processes a file containing this statement.
c) Rational preprocessors
These processors augment older languages with modern flow-of-control and data-structuring facilities.
d) Language extensions
These processors attempt to add capabilities to the language by built-in macros. For example, statements beginning with ## are taken by the preprocessor to be database-access statements, and are translated into procedure calls on routines that perform the database access.
MACRO PROCESSORS
The Macro definitions are given by the keywords like define or macro.
The general syntax is
#define identifier string
Eg: #define M 5
The use of a macro consists of naming the macro and supplying actual
parameters, i.e., value for its formal parameters. The macro substitutes the actual
parameters for the formal parameters in the body of the macros, the transformed body
then replaces the macro use itself.
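As a sketch of lexical macro substitution, the hypothetical helper below replaces each macro name with its body. Unlike a real C preprocessor, it makes no attempt to skip string literals or handle parameterized macros:

```python
import re

def expand_macros(source, macros):
    """Perform lexical substitution: replace each macro name with its body."""
    for name, body in macros.items():
        # Substitute only whole identifiers, not parts of longer names.
        source = re.sub(rf"\b{re.escape(name)}\b", body, source)
    return source

print(expand_macros("total = M*value;", {"M": "6"}))
```

The word-boundary anchors keep the macro name M from being rewritten inside longer identifiers such as Max.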
2. Assemblers
Assembly code is mnemonic version of machine code, in which names are used
instead of binary codes for operations, and names are also given to memory addresses.
Example: To compute b=a+2, the code is
mov a, R1
add #2, R1
mov R1, b
The above code moves the contents of the address a into register R1, then adds
the constant 2 to it and stores the result in the location named by b.
Two-pass assembly
* In the first pass, the assembler reads the input and stores all identifiers that denote storage locations in a symbol table, assigning a storage address to each.
* In the second pass, the assembler scans the input again. This time, it translates each operation code into the sequence of bits representing that operation in machine language, and it translates each identifier representing a location into the address given for that identifier in the symbol table.
* The result of the second pass is relocatable machine code.
* For example, the following is the relocatable machine code into which the above assembly instructions might be translated:

0001 01 00 00000000 *
0011 01 10 00000010
0010 01 00 00000100 *

* In the above machine code, the first 4 bits are the operation code: Load (0001), Add (0011) and Store (0010).
* The next 2 bits are a tag: tag 00 refers to the ordinary address mode, where the last eight bits refer to a memory address.
* The tag 10 refers to the immediate mode, where the last 8 bits are taken as the operand.
* In the first and third instructions, the * is a relocation bit; it means that the load constant L must be added to the address in this instruction.
* In this example, the address of a = 0 and the address of b = 4.
* If L = 15 (00001111), then after adding L to the original addresses, a = 15 and b = 19, and the instructions become

0001 01 00 00001111
0011 01 10 00000010
0010 01 00 00010011

which is absolute machine code.
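The second pass can be sketched as follows. The opcode values (load 0001, add 0011, store 0010) are chosen to be consistent with the absolute code shown above; the register field 01 and the word layout are assumptions for illustration:

```python
# Sketch of pass 2 of a two-pass assembler for the mov/add example.
OPCODES = {"load": "0001", "store": "0010", "add": "0011"}

def assemble(lines, symbols):
    """Translate (mnemonic, operand) pairs using the pass-1 symbol table."""
    code = []
    for op, operand in lines:
        if operand.startswith("#"):                 # immediate mode, tag 10
            word = OPCODES[op] + " 01 10 " + format(int(operand[1:]), "08b")
        else:                                       # ordinary address, tag 00
            # Relocation bit *: the loader must add L to this address.
            word = OPCODES[op] + " 01 00 " + format(symbols[operand], "08b") + " *"
        code.append(word)
    return code

symbols = {"a": 0, "b": 4}                          # built by pass 1
program = [("load", "a"), ("add", "#2"), ("store", "b")]
for word in assemble(program, symbols):
    print(word)
```

Pass 1 would be a similar loop that only assigns addresses to labels; here the symbol table is given directly.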
3. Loaders
Loader is a program that performs the two functions of loading and link-editing. The process of loading consists of taking relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations.
4. Link-Editors
The link-editor allows us to make a single program from several files of relocatable machine code. These files may have been the result of several different compilations, and one or more may be library files of routines provided by the system.
1.4 GROUPING OF PHASES

The phases are often collected into a front end and a back end. The front end consists of those phases that depend primarily on the source language and are largely independent of the target machine. It includes lexical and syntactic analysis, the creation of the symbol table, semantic analysis and the generation of intermediate code. A certain amount of code optimization can be done by the front end as well, and the front end also includes the error handling that goes along with these phases.
The back end includes those portions of the compiler that depend on the target machine and generally do not depend on the source language: aspects of the code optimization phase and code generation, along with the necessary error handling and symbol-table operations.
Passes
Several phases are usually implemented in a single pass consisting of reading an input file and writing an output file. For example, lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped into one pass: the token stream after lexical analysis is translated directly into intermediate code.
It is desirable to have relatively few passes, since it takes time to read and write intermediate files. On the other hand, if we group several phases into one pass, we may be forced to keep the entire program in memory, because one phase may need information in a different order than a previous phase produces it.
For some phases, grouping into one pass presents few problems. For example, the interface between the lexical and syntactic analyzers can often be limited to a single token. The intermediate and target code generation can be merged into one pass using a technique called "backpatching".
1.5 COMPILER CONSTRUCTION TOOLS

* The compiler writer, like any programmer, can profitably use software tools such as debuggers, version managers, profilers and so on.
* In addition to these software-development tools, other more specialized tools have been developed for helping implement various phases of a compiler.
1.6 Lexical analysis

1.6.1 The role of the lexical analyzer

[Fig. Interaction of lexical analyzer with parser: the lexical analyzer reads the source program and, on each "get next token" request from the parser, returns the next token; both consult the symbol table.]

The lexical analyzer returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a left parenthesis, comma or colon. The lexical analyzer is also responsible for simple tasks such as eliminating blanks and comments from the input.
Issues in lexical analysis
The alternative of modifying the grammar to incorporate white space into the syntax is not nearly as easy to implement as removing it in the lexical analyzer.
i) Simpler design
The separation of lexical analysis from syntax analysis allows us to simplify one or the other of the phases. For example, a parser is simpler if it can assume comments and white space have already been removed by the lexical analyzer.
Tokens
A token is a set of input strings with a collective meaning, such as an identifier, a keyword, or an operator.
Lexemes
A lexeme is the sequence of characters in the source program that is matched by the pattern for a token.
Patterns
A pattern is a rule describing the set of strings that can represent a particular token in the source program.
Regular-expression notation is used to describe the patterns for more complex tokens. For example, if the statement
position := initial + rate * 60
is given as input to the lexical analyzer, the output is
id1 := id2 + id3 * 60
When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. The lexical analyzer collects information about tokens into their associated attributes.
The tokens influence parsing decisions; the attributes influence the translation of tokens. In practice, a token usually has only a single attribute: a pointer to the symbol-table entry in which the information about the token is kept. For certain pairs there is no need for an attribute value; the first component is sufficient to identify the lexeme. For example, the tokens and associated attribute values for the Fortran statement E = M * C ** 2 are written as the sequence of pairs:

<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>

The token num has been given an integer-valued attribute. The compiler may instead store the character string that forms a number in a symbol table and let the attribute of token num be a pointer to the table entry.
Lexical errors
Few errors are discernible at the lexical level alone. A misspelled keyword, for example, looks like an identifier; the lexical analyzer must return the token for an identifier and let a later phase of the compiler handle the error.
Error-recovery actions:
i. Deleting an extraneous character.
ii. Inserting a missing character.
iii. Replacing an incorrect character by a correct character
iv. Transposing two adjacent characters.
There are three general approaches to implementing a lexical analyzer:
1. Use a lexical-analyzer generator, such as Lex, to produce the lexical analyzer from a regular-expression-based specification.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

1.6.2 Input Buffering

Since the lexical analyzer is the only phase of the compiler that reads the source program character by character, it is possible to spend a considerable amount of time in the lexical analysis phase. The input may be processed with an extra buffer into which the source file is read and then copied, with modification, into a second buffer.
Preprocessing the character stream being subjected to lexical analysis saves the
trouble of moving the lookahead pointer back and forth over comments or strings of
blanks. The lexical analyzer needs to look ahead characters beyond the lexeme for a
pattern.
Many lexical analyzers use a function to push lookahead characters back into the input stream. Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
 : : E : = : : M : * : C : * : * : 2 : eof : : : : :
     ^                               ^
     lexeme-beginning                forward

Fig. An input buffer in two halves
A buffer is divided into two N-character halves, where N is the number of characters on one disk block (e.g., 1024 or 4096). We read N input characters into each half of the buffer with one system read command, rather than invoking a read command for each input character. The special character eof marks the end of the source file and is different from any input character.
Sentinels
Each buffer half holds a sentinel character at its end. The sentinel is a special character that cannot be part of the source program; the character eof serves naturally as the sentinel. The buffer arrangement is the same as before, with eof added at the end of each half.

[Fig. Sentinels at the end of each buffer half, with the lexeme-beginning and forward pointers as before]
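The saving from a sentinel can be sketched as follows: the inner loop tests each character once, and the end-of-buffer case is folded into the ordinary character test. The buffer contents and the identifier-scanning logic are illustrative:

```python
# Sketch of sentinel-based buffer scanning.
EOF = "\0"   # sentinel: a character that cannot appear in the source

def scan_lexeme(buffer, forward):
    """Advance `forward` over identifier characters, relying on the sentinel
    instead of a separate end-of-buffer test on every character."""
    start = forward
    while True:
        c = buffer[forward]
        if c == EOF:                 # sentinel reached: end of this buffer half
            break
        if not c.isalnum():          # ordinary character test ends the lexeme
            break
        forward += 1
    return buffer[start:forward], forward

buffer = "rate60 " + EOF             # one buffer half with its sentinel
lexeme, forward = scan_lexeme(buffer, 0)
print(lexeme)
```

On reaching the sentinel, a real scanner would refill the other buffer half and continue; here the scan simply stops.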
1.6.3 Specification of tokens

The term alphabet or character class denotes any finite set of symbols. Typical examples of symbols are letters and characters. The set { 0, 1 } is the binary alphabet. Examples of computer alphabets are ASCII and EBCDIC.
A language denotes any set of strings over some fixed alphabet. Abstract languages like ∅, the empty set, or { ε }, the set containing only the empty string, are languages under this definition.
The concatenation of strings x and y is the string xy formed by appending y to x. The empty string is the identity element under concatenation: εs = sε = s.
Operations on languages:
union of L and M:           L ∪ M = { s | s is in L or s is in M }
concatenation of L and M:   LM = { st | s is in L and t is in M }
Kleene closure of L:        L* = the union of Lⁱ for i = 0, 1, 2, …
positive closure of L:      L⁺ = the union of Lⁱ for i = 1, 2, 3, …
Regular expressions
a. ε is a regular expression that denotes { ε }, that is, the set containing the empty string.
b. If a is a symbol in Σ, then a is a regular expression that denotes { a }, i.e., the set containing the string a.
c. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then
i. (r)|(s) is a regular expression denoting L(r) ∪ L(s).
ii. (r)(s) is a regular expression denoting L(r)L(s).
iii. (r)* is a regular expression denoting (L(r))*.
iv. (r) is a regular expression denoting L(r).
A number of algebraic laws are obeyed by regular expressions and can be used to manipulate regular expressions into equivalent forms. For regular expressions r, s and t:
r|s = s|r                          (| is commutative)
r|(s|t) = (r|s)|t                  (| is associative)
(rs)t = r(st)                      (concatenation is associative)
r(s|t) = rs|rt, (s|t)r = sr|tr     (concatenation distributes over |)
εr = r, rε = r                     (ε is the identity for concatenation)
r* = (r|ε)*                        (relation between * and ε)
r** = r*                           (* is idempotent)
Regular definitions

digit   → 0 | 1 | … | 9
digits  → digit digit*
opt_fra → . digits | ε
opt_exp → ( E ( + | - | ε ) digits ) | ε
num     → digits opt_fra opt_exp
Notational Shorthands
If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. The unary postfix operator + means "one or more instances of". The regular expression a+ denotes the set of all strings of one or more a's. The operator + has the same precedence and associativity as the operator *. The two algebraic identities
r* = r+ | ε and r+ = rr* relate the Kleene and positive closure operators.
The unary postfix operator ? means "zero or one instance of". The notation r? is a shorthand for r | ε. If r is a regular expression, then (r)? is the regular expression that denotes the language L(r) ∪ { ε }. The regular definition above can be rewritten as
digit   → 0 | 1 | … | 9
digits  → digit+
opt_fra → (. digits)?
opt_exp → (E (+ | -)? digits)?
num     → digits opt_fra opt_exp
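The regular definition can be transcribed almost directly into a Python regular expression; the variable names mirror the definition and are otherwise an assumption:

```python
import re

# Direct transcription of the regular definition for num.
digit   = r"[0-9]"
digits  = digit + r"+"                    # digit+
opt_fra = r"(\." + digits + r")?"         # (. digits)?
opt_exp = r"(E[+-]?" + digits + r")?"     # (E (+|-)? digits)?
num     = re.compile(digits + opt_fra + opt_exp + r"$")

for s in ["5280", "39.37", "6.336E4", "1.894E-4", "abc"]:
    print(s, bool(num.match(s)))
```

Strings such as 5280, 39.37 and 6.336E4 match the pattern; a string of letters does not.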
Character classes
The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c. The character class [a-z] denotes the regular expression a | b | … | z. Using character classes, we can describe identifiers as the strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
1.6.4 Recognition of tokens

The lexical analyzer isolates the lexeme for the next token in the input buffer and produces as output a pair consisting of the appropriate token and attribute value.

Transition Diagram
As an intermediate step in the construction of a lexical analyzer we draw transition diagrams. For example, the diagram for the token >= is

         >        =
start → (0) ——→ (1) ——→ ((2))
UNIT – II
SYLLABUS
Syntax Analysis: Role of the parser – Context-free grammars – Top-down parsing: Recursive Descent Parsing, Predictive parsing – Bottom-up parsing: Handles, Handle pruning, Stack implementation of shift-reduce parsing.
SYNTAX ANALYSIS

2.1 ROLE OF THE PARSER

The Cocke-Younger-Kasami (CYK) algorithm and Earley's algorithm can parse any context-free grammar, but these methods are too inefficient to use in production compilers.
[Fig. Position of parser in compiler model: the lexical analyzer supplies tokens to the parser (top-down or bottom-up), which builds a parse tree; both consult the symbol table.]

In both the top-down and bottom-up cases, the input to the parser is scanned from left to right, one symbol at a time, and the most efficient methods work only on subclasses of grammars, such as the LL and LR grammars. Automated tools can construct parsers for the larger class of LR grammars.
The output of the parser is some representation of the parse tree for the stream of tokens produced by the lexical analyzer.
If a compiler had to process only correct programs, its design and implementation would be
simplified.
When programmers write incorrect programs, a good compiler should assist the programmers in
identifying and locating errors.
The compiler is required to check the program for syntactic accuracy.
Planning the error handling right from the start can both simplify the structure of a compiler and improve its response to errors.
Many errors are syntactic in nature or are exposed when the stream of tokens coming from the
lexical analyzer disobeys the grammatical rules defining the programming languages.
Parsing methods can detect the syntactic errors in programs very efficiently.
Accurately detecting semantic and logical errors at compile time is a very difficult task.
Error-recovery strategies
Error recovery strategies are used to recover from a syntactic error of the different general strategies
of a parser.
1. Panic-mode recovery
The parser discards input symbols one at a time until one of a designated set of synchronizing
tokens found. The compiler designer must select the synchronizing tokens appropriate for the source
language. While panic-mode correction skips an amount of input without checking it for additional
errors. The synchronized tokens are delimiters, such as semicolon or end, whose role in the source
program is clear. It can be used by most parsing methods. If the multiple errors occur in the same
statement, this method may quite.
Advantage : Simplest to implement and guaranteed not to go into an infinite loop.
2. Phrase-level recovery
A parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction would be to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. The choice of the local correction is left to the compiler designer. We must be careful to choose replacements that do not lead to infinite loops. This type of replacement can correct any input string and has been used in several error-repairing compilers. The method was first used with top-down parsing.
Drawback: it has difficulty coping with situations in which the actual error has occurred before the point of detection.
3. Error-productions
If we have a good idea of the common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. We then use the grammar augmented by these error productions to construct a parser. If the parser uses an error production, it can generate appropriate error diagnostics to indicate the erroneous construct that has been recognized in the input.
4. Global correction
Ideally, we would like a compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. These methods are too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest. Given an incorrect input string x and a grammar G, these algorithms will find a parse tree for a related string y, such that the number of insertions, deletions and changes of tokens required to transform x into y is as small as possible. The notion of least-cost correction does provide a yardstick for evaluating error-recovery techniques, and it has been used for finding optimal replacement strings for phrase-level recovery.
2.2 CONTEXT-FREE GRAMMARS

Many programming-language constructs have a recursive structure that can be defined by context-free grammars. For example, a conditional statement may be defined by the rule: if S1 and S2 are statements and E is an expression, then "if E then S1 else S2" is a statement.
Regular expressions can specify the lexical structure of tokens, but not such recursive constructs. Using the syntactic variable stmt to denote the class of statements and expr the class of expressions, the grammar production is
stmt → if expr then stmt else stmt
CFG consists of
1. terminals
2. non-terminals
3. a start symbol and
4. productions.
Terminals are the basic symbols from which strings are formed. The word "token" is a synonym for "terminal" when we talk about grammars for programming languages. For example, in stmt → if expr then stmt else stmt, each of the keywords if, then and else is a terminal.
Non-terminals are syntactic variables that denote sets of strings. The non-terminals define sets of strings that help define the language generated by the grammar, and they impose a hierarchical structure on the language that is useful for both syntax analysis and translation. In the example above, stmt and expr are non-terminals.
In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings it denotes
is the language defined by the grammar.
The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal, followed by an arrow, followed by a string of non-terminals and terminals.
Example 1:
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
In this grammar, the terminal symbols are id + - * / ↑ ( ), the non-terminal symbols are expr and op, and expr is the start symbol.
CFG: Example 2
G = ({S}, {a, b}, P, S), where S → ab and S → aSb are the only productions in P.
Derivations look like this:
S ⇒ ab
S ⇒ aSb ⇒ aabb
S ⇒ aSb ⇒ aaSbb ⇒ aaabbb
L(G), the language generated by G, is { aⁿbⁿ | n > 0 }.
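The derivations above can be sketched by rewriting S repeatedly (the helper name is illustrative):

```python
# Derive a^n b^n from G by applying S -> aSb (n-1) times, then S -> ab.
def derive(n):
    """Return the sentence a^n b^n produced by n derivation steps."""
    sentential = "S"
    for _ in range(n - 1):
        sentential = sentential.replace("S", "aSb")   # apply S -> aSb
    return sentential.replace("S", "ab")              # apply S -> ab

print([derive(n) for n in (1, 2, 3)])   # ['ab', 'aabb', 'aaabbb']
```

Each intermediate string containing S is a sentential form; the final string of terminals is a sentence of G.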
Notational Conventions
Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either non-terminals or terminals.
Lower-case letters late in the alphabet, u, v, …, z, represent strings of terminals.
Lower-case Greek letters α, β, γ represent strings of grammar symbols. A production can thus be written as A → α, indicating a single non-terminal A on the left side of the production and a string of grammar symbols α on the right side of the production.
If A → α1, A → α2, …, A → αk are all productions with A on the left, we may write A → α1 | α2 | … | αk, where α1, α2, …, αk are called the alternatives for A.
Unless otherwise stated, the left side of the first production is the start symbol.
For example, in E → E A E | ( E ) | - E | id and A → + | - | * | / | ↑, E and A are non-terminals, with E the start symbol. The remaining symbols are terminals.
Derivations
A derivation step is an application of a production rule in which the non-terminal on the left is replaced by the string on the right side of the production. For example, consider E → E + E | E * E | ( E ) | - E | id.
The production E → -E says that an expression preceded by a minus sign is also an expression. This production can be used to generate more complex expressions from simpler expressions by allowing us to replace any instance of an E by -E. We write E ⇒ -E, read "E derives -E".
In abstract terms, αAβ ⇒ αγβ if A → γ is a production and α and β are arbitrary strings of grammar symbols. If α1 ⇒ α2 ⇒ … ⇒ αn, we say α1 derives αn. The symbol ⇒ means "derives in one step", ⇒* means "derives in zero or more steps", and ⇒+ means "derives in one or more steps". Then:
1. α ⇒* α for any string α.
2. If α ⇒* β and β ⇒ γ, then α ⇒* γ.
A language that can be generated by a grammar is said to be a context-free language. If two grammars generate the same language, the grammars are said to be equivalent.
Strings in L(G) may contain only terminal symbols of G. A string of terminals w is in L(G) if and only if S ⇒+ w. The string w is called a sentence of G.
If S ⇒* α, where α may contain non-terminals, then α is a sentential form of G. A sentence is a sentential form with no non-terminals.
For example, the string -(id + id) is a sentence of the grammar E → E + E | E * E | ( E ) | - E | id, because there is the derivation E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id).
The strings E, -E, -(E), …, -(id + id) appearing in this derivation are all sentential forms of this grammar. We write E ⇒* -(id + id) to indicate that -(id + id) can be derived from E.
Leftmost derivations are those in which only the leftmost non-terminal in any sentential form is replaced at each step. If α ⇒ β by a step in which the leftmost non-terminal in α is replaced, we write α ⇒lm β.
The leftmost derivation of the example is E ⇒lm -E ⇒lm -(E) ⇒lm -(E + E) ⇒lm -(id + E) ⇒lm -(id + id).
Using the notational conventions, every leftmost step can be written wAγ ⇒lm wδγ, where w consists of terminals only, A → δ is the production applied, and γ is a string of grammar symbols. If α derives β by a leftmost derivation, we write α ⇒*lm β. If S ⇒*lm α, then α is a left-sentential form of the grammar.
Rightmost derivations, in which the rightmost non-terminal is replaced at each step, are also called canonical derivations.

Parse tree for -(id + id):

          E
         / \
        -   E
          / | \
         (  E  )
          / | \
         E  +  E
         |     |
         id    id

Note: the * operator has higher precedence than +.
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous.
An ambiguous grammar is one that produces more than one left most or right most derivation for the
same sentence.
2.3 TOP-DOWN PARSING

Top-down parsing can be viewed as an attempt to find a leftmost derivation for an input string. Equivalently, it can be viewed as an attempt to construct a parse tree for the input, starting from the root and creating the nodes of the parse tree in preorder.
The general form of top-down parsing, called recursive descent, may involve backtracking, that is, making repeated scans of the input. A special case of recursive-descent parsing that requires no backtracking is called predictive parsing.
Backtracking parsers are not very efficient, and backtracking is rarely needed to parse programming languages.
For example, consider the grammar S → cAd, A → ab | a and the input string w = cad.

      S            S            S
    / | \        / | \        / | \
   c  A  d      c  A  d      c  A  d
               / \              |
              a   b             a

Fig. Steps in top-down parse
To construct a parse tree for this string top-down, to create a tree consisting of a single node labeled
S.
The leftmost leaf, labeled c, matches the first symbol of w.
We then advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A, seeking a match for the second input symbol.
We advance the input pointer to d, the third input symbol, and compare it against the next leaf, labeled b. Since b does not match d, we go back to A and reset the input pointer to position 2, which means that the procedure for A must have stored the input pointer in a local variable.
Trying the second alternative for A, we obtain the third tree. The leaf a matches the second symbol of w and the leaf d matches the third symbol.
It produced a parse tree for w, and successful completion of parsing.
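The backtracking behaviour in this example can be sketched as follows; the parser tries A's alternatives in order and resets the input pointer on a mismatch (a simplified sketch for this one grammar, not a general parser):

```python
# Backtracking recognizer for S -> cAd, A -> ab | a.
def parse_S(w):
    """Match c, then try each alternative of A, backtracking if the
    trailing d fails to match."""
    if not w.startswith("c"):
        return False
    for alt in ("ab", "a"):              # alternatives of A, tried in order
        if w[1:1 + len(alt)] == alt and w[1 + len(alt):] == "d":
            return True                  # this alternative succeeded
        # mismatch: reset the input pointer and try the next alternative
    return False

print(parse_S("cad"), parse_S("cabd"), parse_S("ccd"))
```

For the input cad the first alternative ab fails at position 2, the pointer is reset, and the second alternative a succeeds.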
A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite loop.
We therefore eliminate left recursion from the grammar, and then left-factor the resulting grammar, to obtain a grammar that can be parsed by a recursive-descent parser that needs no backtracking, i.e., a predictive parser.
To construct a predictive parser, we must know, given the current input symbol a and the non-terminal A to be expanded, which one of the alternatives of production A → α1 | α2 | … | αn is the unique alternative that derives a string beginning with a.
For example, consider the productions
stmt → if expr then stmt else stmt | while expr do stmt | begin stmt_list end
The keywords if, while and begin tell us which alternative is the only one that could possibly succeed if we are to find a statement.
A transition diagram for a predictive parser is a useful plan or flowchart, like the transition diagram for a lexical analyzer. The labels of edges are tokens and non-terminals. A transition on a token means that we take that transition if that token is the next input symbol.
To construct the transition diagrams of a predictive parser from a grammar, first eliminate left recursion from the grammar and then left-factor it. Then, for each non-terminal A:
a) Create an initial and a final state.
b) For each production A → X1 X2 … Xn, create a path from the initial to the final state, with edges labeled X1, X2, …, Xn.
If there is more than one transition from a state on the same input, the diagram is ambiguous, and we can only build a recursive-descent parser that uses backtracking to try the alternatives systematically.
For example, the grammar E → E + T | T, T → T * F | F, F → ( E ) | id, after eliminating left recursion, gives a collection of transition diagrams for the grammar
E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → ( E ) | id.
Substituting diagrams in accordance with the transformations on the grammar can simplify transition diagrams.

[Fig. Transition diagrams for the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → ( E ) | id]

[Fig. Simplified transition diagrams: the diagram for E' is substituted into the diagram for E, turning the + transition into a loop; T' is treated likewise]

[Fig. Simplified transition diagrams for arithmetic expressions]
Non-recursive predictive parsing
It is possible to build a non-recursive predictive parser by maintaining a stack explicitly. The key problem during predictive parsing is determining the production to be applied for a non-terminal. The non-recursive parser looks up the production to be applied in a parsing table, which can be constructed directly from the grammar.
A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output stream. The input buffer contains the string to be parsed, followed by $, a symbol used as a right endmarker to indicate the end of the input string. The stack contains a sequence of grammar symbols with the start symbol of the grammar on top of $. The parsing table is a two-dimensional array M[A, a], where A is a non-terminal and a is a terminal or the symbol $.
X the symbol on the top of the stack, and a the current input symbol.
Input: a + b $ (input buffer; $ marks the right end)
Stack: X Y Z $ (X on top)
The predictive parsing program consults parsing table M and writes the productions used to the output.
Fig. Non-recursive Predictive Parser
For example, for the grammar
E → E + T | T, T → T * F | F, F → ( E ) | id
transformed to
E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id,
on input id + id * id $ the predictive parser makes a sequence of moves.
The input pointer points to the leftmost symbol of the string in the INPUT column.
The productions output are those of a leftmost derivation of the input.
At each step, the input symbols already scanned, followed by the grammar symbols on the stack (from top to bottom), make up the left-sentential forms of the derivation.
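The parser just described can be sketched as a table-driven loop; the parsing table M below is hand-built for the transformed grammar, and the encoding (tuples as keys, an empty list for an ε right side) is an illustrative assumption:

```python
# Non-recursive predictive parser: a stack plus a parsing table M[A, a].
# Table entries are right-hand sides (lists of symbols); [] means eps.
NONTERMS = {'E', "E'", 'T', "T'", 'F'}
M = {
    ('E', 'id'): ['T', "E'"], ('E', '('): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
    ('T', 'id'): ['F', "T'"], ('T', '('): ['F', "T'"],
    ("T'", '+'): [], ("T'", '*'): ['*', 'F', "T'"],
    ("T'", ')'): [], ("T'", '$'): [],
    ('F', 'id'): ['id'], ('F', '('): ['(', 'E', ')'],
}

def predictive_parse(tokens):
    input_ = tokens + ['$']
    stack = ['$', 'E']            # start symbol on top of $
    i = 0
    output = []                   # productions used (a leftmost derivation)
    while stack[-1] != '$':
        X, a = stack[-1], input_[i]
        if X not in NONTERMS:     # terminal on top: it must match the input
            if X != a:
                return None
            stack.pop(); i += 1
        elif (X, a) in M:         # expand X by the table entry
            rhs = M[(X, a)]
            output.append((X, rhs))
            stack.pop()
            stack.extend(reversed(rhs))
        else:
            return None           # blank table entry: error
    return output if input_[i] == '$' else None

moves = predictive_parse(['id', '+', 'id', '*', 'id'])
print(moves is not None)          # True: input accepted
```

Each loop iteration performs exactly one of the parser's three moves: match a terminal, expand a nonterminal via M, or report an error.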
The construction of a predictive parser is aided by two functions associated with a grammar G:
i) FIRST
ii) FOLLOW
Sets of tokens yielded by the FOLLOW function can also be used as synchronizing tokens during panic-mode error recovery.
FIRST(α), for any string of grammar symbols α, is the set of terminals that begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).
FOLLOW(A), for a nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ for some α and β.
Note: during the derivation there may have been symbols between A and a, but they derived ε and disappeared. If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A).
Rules:
To compute FIRST(X) for all grammar symbols X, apply the rules until no more terminals or ε can be added to any FIRST set.
To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing can be added to any FOLLOW set:
i) Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
ii) If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
iii) If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in FOLLOW(B).
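Both rule sets are fixed-point computations: keep applying the rules until no set grows. A sketch, with the grammar encoding ('eps' standing for ε) as an illustrative assumption:

```python
# Iterative (fixed-point) computation of FIRST and FOLLOW sets.
GRAMMAR = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], ['eps']],
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], ['eps']],
    'F':  [['(', 'E', ')'], ['id']],
}
START = 'E'

def first_of_string(symbols, FIRST):
    """FIRST of a string of grammar symbols."""
    result = set()
    for sym in symbols:
        f = FIRST[sym] if sym in GRAMMAR else {sym}
        result |= f - {'eps'}
        if 'eps' not in f:
            return result
    result.add('eps')              # every symbol can derive eps
    return result

def compute_first():
    FIRST = {A: set() for A in GRAMMAR}
    changed = True
    while changed:                 # repeat until no FIRST set grows
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                add = first_of_string(alt, FIRST)
                if not add <= FIRST[A]:
                    FIRST[A] |= add
                    changed = True
    return FIRST

def compute_follow(FIRST):
    FOLLOW = {A: set() for A in GRAMMAR}
    FOLLOW[START].add('$')         # rule (i)
    changed = True
    while changed:
        changed = False
        for A, alts in GRAMMAR.items():
            for alt in alts:
                for i, B in enumerate(alt):
                    if B not in GRAMMAR:
                        continue
                    fb = first_of_string(alt[i + 1:], FIRST)
                    add = fb - {'eps'}        # rule (ii)
                    if 'eps' in fb:
                        add |= FOLLOW[A]      # rule (iii)
                    if not add <= FOLLOW[B]:
                        FOLLOW[B] |= add
                        changed = True
    return FOLLOW

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(sorted(FIRST['E']))     # ['(', 'id']
print(sorted(FOLLOW['F']))    # ['$', ')', '*', '+']
```

For this grammar the fixed point gives the familiar sets, e.g. FIRST(E') = { +, ε } and FOLLOW(E) = { ), $ }.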
An error is detected during predictive parsing when the terminal on top of the stack does not match
the next input symbol or when non-terminal A is on top of the stack, a is the next input symbol, and
the parsing table entry M [A , a ] is empty.
Panic-mode error recovery skips symbols on the input until a token in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of the synchronizing set; the set should be chosen so that the parser recovers quickly from errors likely to occur in practice.
As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for nonterminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue.
It is not enough to use FOLLOW(A) alone as the synchronizing set for A.
If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
If a nonterminal can generate the empty string, then the production deriving ε can be used as a default. This reduces the number of nonterminals that have to be considered during error recovery.
If a terminal on top of the stack cannot be matched, a simple idea is to pop the terminal, issue a message saying that the terminal was inserted, and continue parsing.
Phrase-level recovery
Phrase-level recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines.
Example:
Using FIRST and FOLLOW symbols as synchronizing tokens works well when expressions are parsed according to the grammar E → E + T | T, T → T * F | F, F → ( E ) | id.
The parsing table for this grammar can be constructed with "synch" entries indicating synchronizing tokens obtained from the FOLLOW set of the nonterminal.
Example:
A sequence of four reductions traces out the following rightmost derivation in reverse:
S ⇒rm aABe ⇒rm aAde ⇒rm aAbcde ⇒rm abbcde
2.4.1 HANDLES
A "handle" of a string is a substring that matches the right side of a production, and whose reduction to the nonterminal on the left side of the production represents one step along the reverse of a rightmost derivation.
A handle of a right-sentential form γ is a production A → β together with a position of γ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ.
If a grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle.
In the parse tree, the handle represents the leftmost complete subtree consisting of a node and all its children.
Consider the grammar E → E + E | E * E | ( E ) | id. Two rightmost derivations of id1 + id2 * id3 are:
E ⇒rm E + E ⇒rm E + E * E ⇒rm E + E * id3 ⇒rm E + id2 * id3 ⇒rm id1 + id2 * id3
(or)
E ⇒rm E * E ⇒rm E * id3 ⇒rm E + E * id3 ⇒rm E + id2 * id3 ⇒rm id1 + id2 * id3
Note: The string appearing to the right of a handle contains only terminal symbols.
2.4.2 HANDLE PRUNING
RIGHT-SENTENTIAL FORM    HANDLE    REDUCING PRODUCTION
id1 + id2 * id3          id1       E → id
E + id2 * id3            id2       E → id
E + E * id3              id3       E → id
E + E * E                E * E     E → E * E
E + E                    E + E     E → E + E
E
Fig. Reductions made by a shift-reduce parser
Stack: $ S    Input: $
After entering this configuration, the parser halts, signalling successful completion of parsing.
Example:
A shift-reduce parser parses the input string id1 + id2 * id3 according to the grammar
E → E + E | E * E | ( E ) | id.
STACK            INPUT               ACTION
$                id1 + id2 * id3$    shift
$ id1            + id2 * id3$        reduce by E → id
$ E              + id2 * id3$        shift
$ E +            id2 * id3$          shift
$ E + id2        * id3$              reduce by E → id
$ E + E          * id3$              shift
$ E + E *        id3$                shift
$ E + E * id3    $                   reduce by E → id
$ E + E * E      $                   reduce by E → E * E
$ E + E          $                   reduce by E → E + E
$ E              $                   accept
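The trace above can be reproduced by a small loop that shifts a token and then reduces while a handle lies on top of the stack. Real shift-reduce parsers decide shift vs. reduce from parse tables, so the lookahead rule used here (defer E → E + E when * follows, giving * higher precedence) is a simplifying assumption:

```python
# Toy shift-reduce parser for E -> E+E | E*E | (E) | id.
def shift_reduce(tokens):
    input_ = tokens + ['$']
    stack = ['$']
    trace = []                               # reductions, in order
    i = 0

    def reduce_all():
        la = input_[i]                       # lookahead token
        while True:
            if stack[-1] == 'id':
                stack[-1] = 'E'
                trace.append('E -> id')
            elif stack[-3:] == ['(', 'E', ')']:
                stack[-3:] = ['E']
                trace.append('E -> ( E )')
            elif stack[-3:] == ['E', '*', 'E']:
                stack[-3:] = ['E']
                trace.append('E -> E * E')
            elif stack[-3:] == ['E', '+', 'E'] and la != '*':
                stack[-3:] = ['E']           # defer '+' if '*' follows
                trace.append('E -> E + E')
            else:
                break

    while input_[i] != '$':
        stack.append(input_[i])              # shift
        i += 1
        reduce_all()                         # reduce handles on stack top
    return trace if stack == ['$', 'E'] else None

print(shift_reduce(['id', '+', 'id', '*', 'id']))
```

On id + id * id this produces exactly the five reductions shown in the trace, with E * E reduced before E + E.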
The handle will always eventually appear on top of the stack, never inside it. Consider the two possible cases for a rightmost derivation:
1) S ⇒*rm αAz ⇒rm αβByz ⇒rm αβγyz, where A → βBy and B → γ
2) S ⇒*rm αBxAz ⇒rm αBxyz ⇒rm αγxyz, where A → y and B → γ
The shift-reduce parser traces these derivations out in reverse, and in both cases the handle is on top of the stack when it is reduced.
Viable prefixes
The set of prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes. (or)
A viable prefix is a prefix of a right-sentential form that does not continue past the right end of the rightmost handle of that sentential form.
The front end translates a source program into an intermediate representation from which the back end generates target code.
Although a source program can be translated into the target language directly, using a machine-independent intermediate form has benefits:
Retargeting is facilitated; a compiler for a different machine can be created by attaching a back end for the new machine to an existing front end.
A machine-independent code optimizer can be applied to the intermediate representation.
The syntax-directed method can be used to translate programming-language constructs such as declarations, assignments and flow-of-control statements into an intermediate form.
We assume the source program has been parsed and statically checked.
Intermediate code generation can be folded into parsing, if desired.
The syntax-directed translations for producing intermediate code from programming-language constructs are similar to those for constructing syntax trees. For example, the assignment a := b * -c + b * -c can be represented by (a) a syntax tree, or (b) a DAG in which the common subexpression b * -c is identified.
Each node is represented as a record with a field for its operator and additional fields for pointers to its children.
a) Nodes are allocated from an array of records, and the index or position of a node in the array serves as the pointer to the node. For the syntax tree of the assignment above:
0    id    b
1    id    c
2    -     (1)
3    *     (0) (2)
4    id    b
5    id    c
6    -     (5)
7    *     (4) (6)
8    +     (3) (7)
9    id    a
10   :=    (9) (8)
11   …
Fig. A syntax tree for a := b * -c + b * -c and its representation in an array of records
a. Statements can have symbolic labels, and there are statements for flow of control.
b. A symbolic label represents the index of a three-address statement in the array holding the intermediate code.
c. Indices can be substituted for the labels, either by making a separate pass or by using "backpatching".
1. Assignment statements of the form x := y op z.
2. Copy statements of the form x := y.
3. Unconditional jumps goto L.
4. Conditional jumps such as if x relop y goto L, which applies a relational operator to x and y and executes the statement labeled L next if x stands in relation relop to y.
5. Procedure calls, using
param x
call p, n for a procedure call p(x1, …, xn) with n parameters.
6. Indexed assignments of the form x := y[i] and x[i] := y, and address and pointer assignments x := &y, x := *y and *x := y.
When three-address code is generated, temporary names are made up for the interior nodes of a syntax tree.
A synthesized attribute represents the three-address code for the assignment.
The nonterminal for an expression has two attributes:
a) the name that will hold its value, and
b) the sequence of three-address statements evaluating it.
Three-address statements may be sent to an output file as they are generated, rather than built up into the code attributes.
Flow-of-control statements can be handled by adding productions and semantic rules for them, just as for assignments.
The code for an operator is formed by concatenating a statement applying the operator after the code for the operands.
The intermediate form produced by the syntax-directed translations can be changed by making modifications to the semantic rules.
a. Quadruples
A quadruple is a record structure with four fields, which we call op, arg1, arg2 and result. The op field contains an internal code for the operator. The contents of fields arg1, arg2 and result are pointers to the symbol-table entries for the names represented by the fields. Temporary names must be entered into the symbol table as they are created.
b. Triples
To avoid entering temporary names into the symbol table, we can refer to a temporary value by the position of the statement that computes it. Three-address statements can then be represented by records with only three fields: op, arg1, and arg2. The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table or pointers into the triple structure. Since three fields are used, this intermediate code format is known as triples, although the term "two-address code" is sometimes used. Parenthesized numbers represent pointers into the triple structure.
c. Indirect triples
Another implementation is to list pointers to triples, rather than listing the triples themselves. This implementation is called indirect triples.
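The three representations can be contrasted concretely; the sketch below encodes a := b * -c + b * -c, with the tuple layouts and the name 'uminus' for unary minus as illustrative assumptions:

```python
# Quadruples: (op, arg1, arg2, result) with explicit temporary names.
quads = [
    ('uminus', 'c',  None, 't1'),
    ('*',      'b',  't1', 't2'),
    ('uminus', 'c',  None, 't3'),
    ('*',      'b',  't3', 't4'),
    ('+',      't2', 't4', 't5'),
    (':=',     't5', None, 'a'),
]

# Triples: (op, arg1, arg2); an integer argument is a pointer to the
# triple at that position, so no temporary names are needed.
triples = [
    ('uminus', 'c', None),   # (0)
    ('*',      'b', 0),      # (1)  b * (0)
    ('uminus', 'c', None),   # (2)
    ('*',      'b', 2),      # (3)
    ('+',      1, 3),        # (4)  (1) + (3)
    ('assign', 'a', 4),      # (5)
]

# Indirect triples: a statement list of pointers into `triples`, so
# statements can be reordered without renumbering triple arguments.
statements = [0, 1, 2, 3, 4, 5]

# Quadruples name every intermediate value in the symbol table, so an
# optimizer can move a quadruple without rewriting references to it;
# moving a triple would change the positions other triples point to.
print(len(quads), len(triples))   # 6 6
```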
Quadruple Notation
A three-address statement defining or using a temporary can immediately access the location for that temporary via the symbol table. The symbol table interposes an extra degree of indirection between the computation of a value and its use. The benefit shows up in an optimizing compiler: a statement can be moved without changing any references to the values it defines or uses.
Triples
Allocation of storage to those temporaries needing it must be deferred to the code generation phase. Moving a statement that defines a temporary value requires changing all references to that statement, which complicates optimization.
Indirect Triples
Indirect triples can save space compared with quadruples if the same temporary value is used more than once, because two or more entries in the statement array can point to the same triple.
The intermediate code is transformed into a form from which more efficient target code can be produced.
Symbol table
The input to the code generator consists of the intermediate representation of the source program produced by the front end, together with information in the symbol table that is used to determine the run-time addresses of the data objects denoted by the names in the intermediate representation.
The intermediate representation may be a linear representation such as postfix notation, or three-address code such as quadruples.
Prior to code generation, the front end has scanned, parsed and translated the source program into the intermediate representation.
We assume that the input to the code generator is free of errors.
Target Programs
The output of the code generator is the target program. The output may take a variety of forms, such as absolute machine language.
Memory Management
Instruction Selection
The instruction set of the target machine determines the difficulty of instruction selection.
The uniformity and completeness of the instruction set are important factors.
If the target machine does not support each data type in a uniform manner, then each exception to the general rule requires special handling.
For example, the sequence of statements
a := b + c
d := a + e
translates into
MOV b, R0
ADD c, R0
MOV R0, a
ADD e, R0
MOV R0, d
Register Allocation
Instructions involving register operands are shorter and faster than those involving
operands in memory.
The use of registers is subdivided into two subproblems:
During register allocation, we select the set of variables that will reside in registers at a point in the program.
During a subsequent register assignment phase, we pick the specific register that a variable will reside in.
Finding an optimal assignment is an NP-complete problem; we avoid it by generating code for the three-address statements in the order in which they have been produced by the intermediate code generator.
MODE        FORM    ADDED COST
Absolute    M       1
Register    R       0
Literal     #c      1
Example:
MOV R0, M stores the contents of register R0 into memory location M.
MOV *4(R0), M stores the value contents(contents(4 + contents(R0))) into memory location M.
Instruction Set
The cost of an instruction is taken to be one plus the costs associated with the source and destination address modes.
This cost corresponds to the length of the instruction.
Address modes involving registers have cost zero, while those with a memory location or literal in them have cost one, because such operands have to be stored with the instruction.
By minimizing the instruction length we also minimize the time taken to perform the instruction.
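The cost rule can be expressed directly; the operand syntax handled below is a simplifying assumption covering only the modes discussed:

```python
# Cost of an instruction: one plus the costs associated with the
# source and destination address modes.
def mode_cost(operand):
    core = operand.lstrip('*')            # indirection itself adds no cost
    if core.startswith('R') and core[1:].isdigit():
        return 0                          # register / indirect register
    return 1                              # absolute, indexed, or literal

def instruction_cost(op, *operands):
    return 1 + sum(mode_cost(o) for o in operands)

print(instruction_cost('MOV', 'b', 'R0'))           # 2: one memory operand
print(instruction_cost('MOV', 'R0', 'R1'))          # 1: registers only
print(instruction_cost('SUB', '4(R0)', '*12(R1)'))  # 3: both modes cost one
```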
Example :
( contents ( contents ( (12 + contents (R1))) - ( contents (4 + contents (R0))) into the
destination *12(R1)
STORAGE-ALLOCATION STRATEGIES
Static Allocation
Static allocation lays out storage for all data objects at compile time.
The position of an activation record in a memory is fixed at compile time.
The addresses at which information is to be saved when a procedure call occurs are
known at compile time.
Limitations of static allocation:
The size of a data object and constraints on its position in memory must be known at compile time.
Recursive procedures are restricted, because all activations of a procedure use the same bindings for local names.
Data structures cannot be created dynamically, since there is no mechanism for storage allocation at run time.
Stack Allocation
With stack allocation, the calling and return sequences manipulate a stack pointer SP:
1) The return sequence
goto *0(SP)
transfers control to the address saved at the beginning of the activation record.
2) The call sequence
add #caller.recordsize, SP
goto callee.code-area
increments SP past the caller's record and transfers control to the code area of the callee.
Advantage
It makes the compiler more portable. The front end need not be changed even if the compiler is moved to a different machine where a different run-time organization is needed.
Generating the specific sequence of access steps while generating intermediate code can be a significant advantage in an optimizing compiler.
Stack allocation is based on the idea of a control stack; storage is organized as a stack, and activation records are pushed and popped as activations begin and end, respectively.
Locals are bound to fresh storage in each activation, because a new activation record is pushed onto the stack when a call is made.
The values of locals are deleted when the activation ends; that is, the values are lost because the storage for locals disappears when the activation record is popped.
At run-time, an activation record can be allocated and de-allocated by incrementing
and decrementing the top of the stack, respectively, by the size of the record.
Calling Sequences
Advantage:
Placing the fields for parameters and a potential returned value next to the activation record of the caller has an advantage:
The caller can access these fields using offsets from the end of its own activation record, without knowing the complete layout of the record for the callee.
The register top-sp points to the end of the machine-status field in an activation record.
This position is known to the caller, so it can be made responsible for setting top-sp before control flows to the called procedure.
The code for the callee can access its temporaries and local data using offsets from top-sp.
The call sequence divides the responsibility for parameters and the returned value between caller and callee:
The caller evaluates the actuals. The caller stores a return address and the old value of top-sp into the callee's activation record.
The caller then increments top-sp; that is, top-sp is moved past the caller's local data and temporaries and the callee's parameter and status fields.
The callee saves register values and other status information. The callee initializes its local data and begins execution.
The return sequence:
The callee places a return value next to the activation record of the caller.
Using the information in the status field, the callee restores top-sp and other registers and branches to a return address in the caller's code.
Although top-sp has been decremented, the caller can copy the returned value into its own activation record and use it to evaluate an expression.
These calling sequences allow the number of arguments of the called procedure to depend on the call.
At compile time, the target code of the caller knows the number of arguments it is supplying to the callee; hence the caller knows the size of the parameter field.
Since the target code of the callee must be prepared to handle other calls as well, it waits until it is called and then examines the parameter field.
Variable-Length Data
Variable-length arrays are accessed through pointers held in the activation record. The relative addresses of these pointers are known at compile time, so the target code can access array elements through the pointers.
The activation record for q begins after the arrays of p, and the variable-length arrays of q begin beyond that.
Layout: the activation record of p holds a pointer to A, a pointer to B and a pointer to C, followed by arrays A, B and C of p; top-sp then marks the activation record of q, followed by the arrays of q.
Fig. Access to dynamically allocated arrays
Dangling References
A dangling reference occurs when there is a reference to storage that has been de-allocated.
Heap Allocation
Heap allocation allocates and de-allocates storage as needed at run-time from a data
area known as a heap.
Heap allocation is needed when:
The values of local names must be retained when an activation ends.
A called activation outlives the caller. This possibility cannot occur for those languages where activation trees correctly depict the flow of control between procedures.
Small activation records, or records of a predictable size, can be handled as a special case:
For each size of interest, keep a linked list of free blocks of that size.
If possible, fill a request for size s with a block of size s', where s' is the smallest size greater than or equal to s.
When the block is eventually de-allocated, it is returned to the linked list it came from.
For large blocks of storage use the heap manager.
4 Assemblers
4.1 Elements of assembly language programming
* An assembly language is a machine dependent low level programming language which is
specific to a certain computer system.
* It provides three basic features which simplify programming.
Statement Format:
* An assembly language statement has the following format:
[label] <opcode> <operand spec> [,<operand spec>..]
* The MOVE instruction moves the value between a memory word and a register.
* In the MOVER instruction the second operand is the source operand and the first operand is
the target operand.
* Converse is true for the MOVEM instruction.
* All arithmetic is performed in a register and sets a condition code.
* A comparison instruction sets a condition code analogous to a subtract instruction, without affecting the values of its operands.
* The condition code can be tested by a branch-on-condition (BC) instruction.
* The assembly statement corresponding to it has the format
BC <condition code spec>, <memory address>
* In a machine language program we show all addresses and constants in decimal, rather than octal or hexadecimal.
* The following figure shows the machine instruction format.
* The opcode, register operand and memory operand occupy 2, 1 and 3 digits, respectively.
* The sign is not a part of the instruction
* The condition code specified in a BC statement is encoded into the first operand position using
the codes 1-6 for the specification LT,LE,EQ,GT,GE and ANY, respectively
Fig: Instructions format.
1. Imperative statements:
* An imperative statement indicates an action to be performed during the execution of the
assembled program.
* Each imperative statement typically translates into one machine instruction.
2. Declaration statements:
* The syntax of declaration statements is as follows:
[label] DS <constant>
[label] DC '<value>'
* The DS (short for declare storage) statement reserves areas of memory and associates names with them. Consider the following DS statements:
A DS 1
G DS 200
* The first statement reserves a memory area of one word and associates the name A with it.
* The second statement reserves block of 200 memory words.
* The name G is associated with the first word of the block.
* Other words in the block can be accessed through offsets from G; e.g., G+5 is the sixth word of the memory block, etc.
* The DC(short for declare constant) statement constructs memory words containing constants.
Consider the following DC statement:
ONE DC ‘1’
* The statement associates the name ONE with a memory word containing the value ‘1’.
* The programmer can declare constants in different forms –
decimal,binary,hexadecimal,etc.
* The assembler converts them to the appropriate internal form.
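How DS and DC drive the location counter and symbol table can be sketched as a fragment of an assembler's first pass; the line format and the restriction to START/DS/DC/END statements are simplifying assumptions:

```python
# First-pass handling of START/DS/DC: build a symbol table mapping
# labels to addresses by advancing a location counter.
def pass_one(lines):
    symtab = {}
    lc = 0                                    # location counter
    for line in lines:
        parts = line.split()
        label = None
        if parts[0] not in ('START', 'DS', 'DC', 'END'):
            label, parts = parts[0], parts[1:]
        opcode = parts[0]
        if opcode == 'START':
            lc = int(parts[1])                # set the program origin
        elif opcode == 'DS':                  # reserve <constant> words
            symtab[label] = lc
            lc += int(parts[1])
        elif opcode == 'DC':                  # one word holding a value
            symtab[label] = lc
            lc += 1
    return symtab

symtab = pass_one([
    'START 101',
    'A DS 1',
    'G DS 200',
    'ONE DC 1',
    'END',
])
print(symtab)   # {'A': 101, 'G': 102, 'ONE': 302}
```

Note that G+5 resolves to address 107, the sixth word of the reserved block, matching the text above.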
Use of Constants:
* Contrary to the name "declare constant", the DC statement does not really implement constants; it merely initializes memory words to given values.
* These values are not protected by assembler ; they may be changed by moving a new value
into the memory word.
* An assembly program can use constants in the sense implemented in an HLL in two ways: as immediate operands and as literals.
* Immediate operands can be used in an assembly statement only if the architecture of the target machine includes the necessary features.
* In such a machine, the statement
ADD AREG, 5
is translated into an instruction with two operands: AREG and the value '5' as an immediate operand.
* Our simple assembly language does not support this feature, but the assembly language of the Intel 8086 supports it.
* A literal is an operand with the syntax =’<value>’.
* It differs from a constant because its location cannot be specified in the assembly program.
* This helps to ensure that its value is not changed during execution of a program.
* It differs from an immediate operand because no architectural provision is needed to support its use.
* The value of the literal is protected by the fact that the name and address of this word is not
known to the assembly language programmer.
3. Assembler Directives:
* Assembler directives instruct the assembler to perform certain actions during the assembly of a
program.
* Some assembler directives are described in the following.
START <constant>
* This directive indicates that the first word of the target program generated by the assembler should be placed in the memory word with address <constant>.
END [<operand spec>]
* This directive indicates the end of the source program.
* The optional <operand spec> indicates the address of the instruction where the execution of the program should begin.
START 101
READ N 101) +09 0 114
MOVER BREG, ONE 102) +04 2 116
MOVEM BREG, TERM 103) +05 2 117
AGAIN MULT BREG,TERM 104) +03 2 117
MOVER CREG, TERM 105) +04 3 117
ADD CREG, ONE 106) + 01 3 116
MOVEM CREG, TERM 107) + 05 3 117
COMP CREG , N 108) + 06 3 114
BC LE, AGAIN 109) + 07 2 104
DIV BREG, TWO 110) + 08 2 118
MOVEM BREG,RESULT 111) + 05 2 115
PRINT RESULT 112) + 10 0 115
STOP 113) + 00 0 000
N DS 1 114)
RESULT DS 1 115)
ONE DC ‘1’ 116) + 00 0 001
TERM DS 1 117)
TWO DC ‘2’ 118) + 00 0 002
END
* Assembly language programming holds an edge over HLL programming in situations where it
is necessary or desirable to use specific architectural features of a computer – for example,
special instructions supported by the CPU.
4.4 INTERPRETERS
* An interpreter is a program that executes instructions written in a high-level language.
* There are two methods of running such a program: the most common is to compile the program; the other method is to pass the program through an interpreter.
* Compiler VS Interpreter
The interpreter, on the other hand, can immediately execute high-level programs.
For this reason, interpreters are sometimes used during the development of a
program, when a programmer wants to add small sections at a time and test them
quickly.
In addition, interpreters are often used in education because they allow students to
program interactively.
Both interpreters and compilers are available for most high-level languages
The structure of the interpreter is similar to that of a compiler, but the amount of
time it takes to produce the executable representation will vary as will the amount
of optimization.
Compiler characteristics:
Interpreter characteristics:
Figure (b): Source program → Preprocessor → IR → Interpreter (reads Data) → Results
* An impure interpreter performs some preliminary processing of the source program to reduce
the analysis overheads during interpretation. Figure (b) contains a schematic of impure
interpretation. The preprocessor converts the program to an intermediate representation (IR)
which is used during interpretation. This speeds up interpretation as the code component of the
IR, i.e., the IC, can be analyzed more efficiently than the source form of the program.
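The impure-interpretation scheme (preprocess once to an IR, then interpret the IR repeatedly) can be sketched minimally; the postfix IR and the tiny two-operator expression language are illustrative assumptions:

```python
# Impure interpretation: a preprocessor converts the source to an IR
# (here, a postfix token list), and the interpreter executes the IR.
def preprocess(expr):
    """Analyze an 'a op b' source expression once, producing postfix IR."""
    a, op, b = expr.split()
    return [a, b, op]

def interpret(ir, env):
    stack = []
    for tok in ir:
        if tok in ('+', '*'):
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if tok == '+' else a * b)
        else:
            stack.append(env[tok])    # analysis already done: just look up
    return stack[0]

ir = preprocess('x + y')              # source analyzed only once...
print(interpret(ir, {'x': 2, 'y': 3}))   # 5
print(interpret(ir, {'x': 7, 'y': 1}))   # 8  ...but executed many times
```

The speed-up of impure interpretation comes from exactly this split: the analysis overhead is paid once in `preprocess`, not on every execution.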
Definition:
* A Macro is a unit of specification for program generation through expansion.
* A macro consists of a name, a set of formal parameters and body of code.
* Macro expansion is the replacement of a macro name (with its actual parameters) by some code.
* Lexical expansion is replacement of a character string by another character string during
program generation.
* Semantic expansion is generation of instructions tailored to the requirements of specific
usage.
4.5.1 Macro definition and call
* A macro definition is enclosed between macro header statement and a macro end
statement.
* Macro definitions are typically located at the start of a program.
* A macro definition consists of
1.A macro prototype
2.One or more model statements
3.Macro preprocessor statements
* The macro prototype statement declares the name of a macro and the names and kinds of its
parameters.
* A model statement is a statement from which an assembly language statement may be
generated during macro expansion.
* A preprocessor statement is used to perform auxiliary functions during macro expansion.
* The macro prototype statement has the following syntax:
<macro name> [<formal parameter spec>[,..]]
* A macro is called by writing the macro name in the mnemonic field of an assembly statement.
* The macro call has the syntax:
<macro name> [<actual parameter spec>[,..]]
where the actual parameter typically resembles an operand specification in an assembly language
statement.
* The following example shows the definition of macro INCR,MACRO and MEND are the
macro header and macro end statements, respectively.
* The prototype statement indicates that three parameters called MEM_VAL, INCR_VAL and REG exist for the macro.
* Since a parameter kind is not specified for any of the parameters, they are all of the default kind 'positional parameter'.
* Statements with the operation codes MOVER,ADD and MOVEM are model statements.
* No preprocessor statements are used in this macro.
MACRO
INCR &MEM_VAL, &INCR_VAL, &REG
MOVER &REG, &MEM_VAL
ADD &REG, &INCR_VAL
MOVEM &REG, &MEM_VAL
MEND
1. Find the ordinal position of XYZ in the list of formal parameters in the macro prototype statement.
2. Find the actual parameter specification occupying the same ordinal position in the list of actual parameters in the macro call statement.
MACRO
COMPUTE &FIRST, &SECOND
MOVEM BREG, TMP
INCR_D &FIRST, &SECOND, REG = BREG
MOVER BREG, TMP
MEND
Fig 2: Expanded code for the nested macro call COMPUTE X, Y
COMPUTE X, Y
expands in the first step to:
+ MOVEM BREG, TMP
+ INCR_D X, Y
+ MOVER BREG, TMP
and finally to:
+ MOVEM BREG, TMP    1
+ MOVER BREG, X      2
+ ADD BREG, Y        3
+ MOVEM BREG, X      4
+ MOVER BREG, TMP    5
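Lexical substitution with positional and keyword parameters can be sketched as follows; the macro table and call encoding are illustrative assumptions (a real macro preprocessor would also rescan the output to expand nested calls such as INCR_D):

```python
# Macro expansion by lexical substitution. Actual parameters are bound
# to formals positionally, or by keyword when written NAME=value; each
# '&formal' in a model statement is then replaced by its actual.
MACROS = {
    'INCR': (['MEM_VAL', 'INCR_VAL', 'REG'],
             ['MOVER &REG, &MEM_VAL',
              'ADD &REG, &INCR_VAL',
              'MOVEM &REG, &MEM_VAL']),
}

def expand(name, actuals):
    formals, body = MACROS[name]
    binding = {}
    for formal, actual in zip(formals, actuals):
        if '=' in actual:                    # keyword parameter
            key, value = actual.split('=')
            binding[key.strip()] = value.strip()
        else:                                # positional parameter
            binding[formal] = actual
    expanded = []
    for stmt in body:                        # lexical substitution
        for formal, actual in binding.items():
            stmt = stmt.replace('&' + formal, actual)
        expanded.append(stmt)
    return expanded

for line in expand('INCR', ['X', 'Y', 'REG=BREG']):
    print('+', line)
# + MOVER BREG, X
# + ADD BREG, Y
# + MOVEM BREG, X
```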
4.8 Advanced Macro Facilities:
* Advanced Macro Facilities are aimed at supporting semantic expansion.
* These facilities are grouped into
1.Facilities for alteration of flow of control during expansion.
2.Expansion time variables
3.Attributes of parameter
* This section describes some advanced facilities and illustrates their use in performing
conditional expansion of model statements and in writing expansion time loops.
Expansion time variables:
* Expansion time variables (EVs) are variables that can be used only during the expansion of macro calls.
* A local EV is created for use only during a particular macro call.
* A global EV exists across all macro calls situated in a program and can be used in any macro
that has declaration for it.
* Local and global EVs are created through declaration statements with the following syntax:
Conditional Expansion
* While writing a general-purpose macro it is important to ensure the execution efficiency of its generated code.
* Conditional expansion helps in generating assembly code specially suited to the parameters in a macro call.
* This is achieved by ensuring that a model statement is visited only under specific conditions during the expansion of the macro.
Semantic Expansion
* Semantic expansion is the generation of instructions tailored to the requirements of a specific
usage.
* It can be achieved by a combination of advanced macro facilities like AIF,AGO statements and
expansion time variables.
* The CLEAR macro is an instance of semantic expansion.
* Here the number of MOVEM AREG statements generated by a call on CLEAR is determined by the value of its second parameter.
* Example:
MACRO
CREATE_CONST &X,&Y
AIF (T’ &X EQ B) .BYTE
&Y DW 25
AGO .OVER
.BYTE ANOP
&Y DB 25
.OVER MEND
This macro creates a constant '25' with the name given by its second parameter. The type of the constant (byte or word) matches the type of the first parameter.
UNIT – V
SYLLABUS:
Linkers: Relocation and linking concept (Program relocation-linking object module)-self
relocating programs-linking for overlays-loaders.
Software tools-software tools for program development-editors-debug monitors.
5.1 LINKERS
Execution of a program written in a language L involves the following steps:
1. Translation of the program.
2. Linking of the program with other programs needed for its execution.
3. Relocation of the program to execute from the specific memory area allocated to it.
4. Loading of the program in the memory for the purpose of execution.
These steps are performed by different language processors. Step 1 is performed by the translator for language L. Steps 2 and 3 are performed by a linker, while Step 4 is performed by a loader.
Source program → Translator → Object modules → Linker → Binary programs → Loader → Binary program in memory (Data is supplied at execution time)
* The following terminology is used to refer to the address of a program entity at different times:
1. Translation time address: Address assigned by the translator.
2. Linked address: Address assigned by the linker.
3. Load time address: Address assigned by the loader.
* The same prefixes translation time, linked and load time are used with the origin and execution
start address of a program. Thus,
1. Translated origin: Address of the origin assumed by the translator. This is the address
specified by the programmer in an ORIGIN statement.
2. Linked origin: Address of the origin assigned by the linker while producing a binary
program.
3. Load origin: Address of the origin assigned by the loader while loading the program for
execution.
* The linked and load origins may differ from the translated origin of a program due to one of
the reasons mentioned earlier.
Definition 5.2.1 (Program relocation): Program relocation is the process of modifying the
addresses used in the address sensitive instructions of a program such that the program can
execute correctly from the designed area of memory.
* If linked origin ≠ translated origin, relocation must be performed by the linker. If load origin ≠
linked origin, relocation must be performed by the loader.
* In general, a linker performs relocation, whereas some loaders do not.
* However, where the origin is assigned by the linker, it would be more precise to use the term 'linked origin'.
* Example
Statement Address Code
START 500
ENTRY TOTAL
EXTRN MAX,ALPHA
READ A 500) + 09 0 540
Loop 501)
.
.
MOVER AREG,ALPHA 518) + 04 1 000
BC ANY,MAX 519) + 06 6 000
.
.
BC LT,LOOP 538) + 06 1 501
STOP 539) + 00 0 000
A DS 1 540)
TOTAL DS 1 541)
END
Performing relocation
Let the translated and linked origins of program P be t_origin_p and l_origin_p, respectively.
Consider a symbol symb in P. Let its translation time address be t_symb and its link time
address be l_symb. The relocation factor of P is defined as

    relocation_factor_p = l_origin_p - t_origin_p        (5.1)

Note that relocation_factor_p can be positive, negative or zero.
Consider a statement which uses symb as an operand. The translator puts the address t_symb in
the instruction generated for it. Now,

    t_symb = t_origin_p + d_symb

where d_symb is the offset of symb in P. Hence

    l_symb = l_origin_p + d_symb

Using (5.1),

    l_symb = t_origin_p + relocation_factor_p + d_symb
           = t_origin_p + d_symb + relocation_factor_p
           = t_symb + relocation_factor_p               (5.2)

Let IRR_p designate the set of instructions requiring relocation in program P. Following (5.2),
relocation of program P can be performed by computing the relocation factor for P and adding it
to the translation time address(es) in every instruction i in IRR_p.
5.2.2 Linking
Consider an application program AP consisting of a set of program units SP = {Pi}. A
program unit Pi interacts with another program unit Pj by using addresses of Pj's instructions
and data in its own instructions. To realize such interactions, Pj and Pi must contain public
definitions and external references: a public definition is a symbol defined in a program unit
which may be referenced in other program units, and an external reference is a reference to a
symbol which is not defined in the program unit containing the reference.
Example: In the assembly program, the ENTRY statement indicates that a public definition of
TOTAL exists in the program. Note that LOOP and A are not public definitions even though
they are defined in the program. The EXTRN statement indicates that the program contains
external references to MAX and ALPHA. The assembler does not know the address of an external
symbol, hence it puts zeros in the address fields of the instructions corresponding to the
statements MOVER AREG, ALPHA and BC ANY, MAX. If the EXTRN statement did not
exist, the assembler would have flagged the references to MAX and ALPHA as errors.
A binary program is a machine language program comprising a set of program units SP such
that, for each Pi in SP:
1. Pi has been relocated to the memory area starting at its link origin, and
2. Linking has been performed for each external reference in Pi.
To form a binary program from a set of object modules, the programmer invokes the linker using
the command

    linker <link origin>, <object module names> [, <execution start address>]

* Here <link origin> specifies the memory address to be given to the first word of the binary
program. <execution start address> is usually a pair (program unit name, offset within the
unit); the linker converts it into the linked start address, which is stored along with the binary
program for use when the program is to be executed. If <execution start address> is omitted,
the execution start address is assumed to be the same as the linked origin.
* Note that a linker converts the object modules in the set of program units SP into a binary
program. Since we have assumed link address = load address, the loader simply loads the
binary program into the appropriate area of memory for the purpose of execution.
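The linking step itself can be sketched as two passes over the object modules: first collect the linked addresses of all public definitions, then patch each external reference. The following Python sketch is illustrative only; the dictionary layout and the module contents (loosely based on the example program above) are assumptions, not any real object module format.

```python
# A sketch of linking: gather public definitions from all object
# modules, then patch every external reference with the linked address
# of the matching symbol. Data layout is illustrative.

modules = {
    "P": {
        "link_origin": 900,
        "public_defs": {"TOTAL": 41},   # symbol -> offset in module
        "ext_refs": [("ALPHA", 18)],    # (symbol, offset of field to patch)
    },
    "Q": {
        "link_origin": 950,
        "public_defs": {"ALPHA": 5},
        "ext_refs": [],
    },
}

# Pass 1: build a table of linked addresses of all public definitions.
ntab = {}
for m in modules.values():
    for sym, off in m["public_defs"].items():
        ntab[sym] = m["link_origin"] + off

# Pass 2: for every external reference, record the patch
# (module, linked address of the field, linked address of the symbol).
patches = []
for name, m in modules.items():
    for sym, off in m["ext_refs"]:
        patches.append((name, m["link_origin"] + off, ntab[sym]))

print(ntab)      # {'TOTAL': 941, 'ALPHA': 955}
print(patches)   # [('P', 918, 955)]
```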
An object module contains all the information about a program that is needed to relocate it and
link it with other programs. It consists of four components:
1. Header: the translated origin, size and execution start address of the program.
2. Program: the machine language code of the program.
3. Relocation table (RELOCTAB): the translated addresses of the address sensitive
instructions in the program.
4. Linking table (LINKTAB): information concerning the public definitions and external
references in the program.
Example 6: For the assembly program of the figure, the object module records exactly this
information: the program's translated origin and size, its code, the addresses of its address
sensitive instructions, its public definition TOTAL, and its external references MAX and ALPHA.
5.3 SELF-RELOCATING PROGRAMS
Based on the manner in which relocation is performed, programs can be classified as:
1. Non-relocatable programs
2. Relocatable programs
3. Self-relocating programs
* A non-relocatable program is a program which cannot be executed in any memory area other
than the area starting at its translated origin.
* Non relocatability is the result of address sensitivity of a program and lack of information
concerning the address sensitive instructions in a program.
* The difference between a relocatable program and a non relocatable program is the availability
of information concerning the address sensitive instructions in it.
* A relocatable program can be processed to relocate it to a desired area of memory.
* Representative examples of non relocatable and relocatable programs are a hand coded
machine language program and an object module, respectively.
* A self relocating program is a program which can perform the relocation of its own address
sensitive instructions. It contains the following two provisions for this purpose:
1. A table of information concerning the address sensitive instructions exists as a part of the program.
2. Code to perform the relocation of address sensitive instructions also exists as a part of the
program. This is called the relocating logic.
* The start address of the relocating logic is specified as the execution start address of the
program.
* Thus the relocating logic gains control when the program is loaded in memory for execution.
* It uses the load address and the information concerning address sensitive instructions to
perform its own relocation.
* Execution control is now transferred to the relocated program.
* A self-relocating program can execute in any area of the memory.
* This is very important in time sharing operating systems where the load address of a program
is likely to be different for different executions.
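The self-relocation scheme described above can be sketched as follows. This is purely conceptual: the code array, the table of address-sensitive instructions, and the load address are illustrative stand-ins for what a real self-relocating program would carry in machine language.

```python
# A conceptual sketch of a self-relocating program: the program carries
# its own table of address sensitive instructions and relocating logic
# which runs first, using the actual load address.

ASSUMED_ORIGIN = 0                 # origin assumed when the code was built
address_sensitive = [0, 2]         # provision 1: table of address sensitive
                                   # instructions, part of the program
code = [("READ", 40), ("STOP", 0), ("BC", 1)]

def relocating_logic(load_origin):
    """Provision 2: relocating logic, also part of the program.
    It gains control first and fixes the program's own addresses."""
    factor = load_origin - ASSUMED_ORIGIN
    for i in address_sensitive:
        op, addr = code[i]
        code[i] = (op, addr + factor)
    # ...execution control would now be transferred to the relocated code.

# The loader places the program at address 2000 and transfers control
# to the relocating logic (its execution start address).
relocating_logic(2000)
print(code)   # address operands raised by 2000
```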
5.6.1.1 Program Design and Coding
A program generator generates a program which performs the set of functions described in its
specification. Use of a program generator saves substantial design effort, since a programmer
merely specifies what functions a program should perform rather than how the functions should
be implemented. Coding effort is saved since the program is generated rather than coded by
hand. A programming environment supports coding by incorporating awareness of the
programming language syntax and semantics in the language editor.
5.6.1.2 Program Entry and Editing
These tools are text editors or more sophisticated programs with text editors as front
ends. The editor functions in two modes. In the command mode, it accepts user commands
specifying the editing function to be performed. In the data mode, the user keys in the text to be
added to the file. Failure to recognize the current mode of the editor can lead to a mix-up of
commands and data. This can be avoided in two ways. In one approach, a quick exit is provided
from the data mode, e.g. by pressing the escape key, such that the editor enters the command
mode. Another approach is to use the screen mode, wherein the editor is in the data mode most
of the time. The user is provided special keys to move the cursor on the screen; a stroke of any
other key is taken to imply input of the corresponding character at the current cursor position.
Certain keys pressed along with the control key signify commands like erase character, delete
line, etc. Thus the end of data need not be explicitly indicated by the user. Most Turbo editors
on PCs use this approach.
5.6.1.3 Program Testing and Debugging
Important steps in program testing and debugging are selection of test data for the
program, analysis of test results to detect errors, and debugging, i.e. localization and removal of
errors. Software tools to assist the programmer in these steps come in the following forms:
1. Test data generators help the user in selecting test data for their program. Their use
helps in ensuring that a program is thoroughly tested.
2. Automated test drivers help in regression testing, wherein a program's correctness is
verified by subjecting it to a standard set of tests after every modification. Regression testing is
performed as follows: many sets of test data are prepared for the program and given as inputs
to the test driver. The driver selects one set of test data at a time and organizes execution of
the program on the data.
3. Debug monitors help in obtaining information for localization of errors.
4. Source code control systems help to keep track of modifications in the source code.
Test data selection uses the notion of an execution path, which is a sequence of program
statements visited during an execution. For testing, a program can be viewed as a set of
execution paths. A test data generator determines the conditions which must be satisfied by the
program's inputs for control to flow along a specific execution path. A test data is a set of input
values which satisfy these conditions.
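The automated test driver described in point 2 above can be sketched as follows. The program under test, the test sets, and the function names are all hypothetical; the point is only the driver's loop over stored test data.

```python
# A sketch of an automated test driver for regression testing: the
# driver runs the program under test on every stored test set and
# reports the cases whose output differs from the expected result.

def program_under_test(x, y):      # stand-in for the real program
    return x + y

test_sets = [                      # (inputs, expected output)
    ((2, 3), 5),
    ((0, 0), 0),
    ((-1, 1), 0),
]

def run_regression(program, tests):
    """Run every test set; return the list of failures."""
    failures = []
    for inputs, expected in tests:
        actual = program(*inputs)
        if actual != expected:
            failures.append((inputs, expected, actual))
    return failures

print(run_regression(program_under_test, test_sets))   # [] when all pass
```

After every modification of the program, the same stored test sets are replayed; a non-empty failure list signals a regression.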
Producing debug information
Classically, localization and removal of errors has been aided by special purpose debug
information. Such information can be produced statically by analyzing the source program or
dynamically during program execution. Statically produced debug information takes the form of
cross reference listings, lists of undefined variables and unreachable statements, etc. All these are
useful in determining the cause of program malfunction. Techniques of data flow analysis are
employed to collect such information.
Example: The data flow concept of reaching definitions can be used to determine whether
variable x may have a value when execution reaches statement 10 in the following program:

    10    sum := x + 10;

If no definition of x reaches statement 10, then x is surely undefined in statement 10. If some
definitions reach statement 10, then x may have a value when control reaches the statement.
Whether x is defined in a specific execution of the program would depend on how control flows
during the execution.
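The reaching-definitions idea in this example can be illustrated with a toy check over execution paths. This is a deliberate simplification: the program fragment and the path enumeration are hypothetical, and a real compiler would use iterative data flow analysis over the control flow graph rather than enumerating paths.

```python
# A toy illustration of the reaching-definitions example: each
# execution path leading to statement 10 is inspected for a definition
# of x. If some path carries no definition, x may be undefined there.

# Statements on each execution path leading to statement 10
# (hypothetical two-branch program).
paths_to_10 = [
    ["read y", "x := y * 2"],   # path 1: defines x
    ["read y"],                 # path 2: does not define x
]

def defines_x(stmt):
    return stmt.startswith("x :=")

# x may be undefined at 10 if some path reaches it with no definition.
may_be_undefined = any(not any(defines_x(s) for s in path)
                       for path in paths_to_10)
# x is surely defined at 10 only if every path defines it.
always_defined = all(any(defines_x(s) for s in path)
                     for path in paths_to_10)

print(may_be_undefined)   # True: path 2 reaches 10 without defining x
print(always_defined)     # False
```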
Dynamically produced debug information takes the form of value dumps and execution traces
produced during the execution of a program. This information helps to determine the execution
paths followed during an execution and the sequence of values assumed by a variable. Most
programming languages provide facilities to produce dynamic debug information.
5.6.1.4 Enhancement of program performance
Program efficiency depends on two factors: the efficiency of the algorithm and the
efficiency of its coding. An optimizing compiler can improve the efficiency of the code, but it
cannot improve the efficiency of the algorithm; only a program designer can do that, by
rewriting it. However, rewriting is a time-consuming process, hence some help should be
provided to improve its cost-effectiveness. For example, it is better to focus on only those
sections of a program which consume a considerable amount of execution time. A performance
tuning tool helps in identifying such sections. It is empirically observed that less than three
percent of program code generally accounts for more than 50 percent of program execution
time. This observation promises major economies of effort in improving a program.
Example: A program consists of three modules A, B and C. Sizes of the modules and the
execution times consumed by them in a typical execution are given in the table below.

Name    # of statements    % of total execution time
A            150                    4.00
B             80                    6.00
C             35                   90.00
It is seen that module C, which is roughly 13% of the total program size, consumes 90% of the
program execution time. Hence optimizing module C addresses 90% of the program's execution
time at only about 13% of the cost of optimizing the whole program.
A profile monitor is a software tool that collects information regarding the execution behavior of
a program, e.g. the amount of execution time consumed by its modules, and presents it in the
form of an execution profile. Using this information, the programmer can focus attention on the
program sections consuming a significant amount of execution time.
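A crude execution profile of the kind described above can be sketched in Python by timing each module and reporting its share of the total. The module functions are stand-ins for modules A, B and C of the example; a real profile monitor would instrument the program rather than wrap it like this.

```python
# A sketch of what a profile monitor reports: per-module execution time
# as a percentage of the total, so tuning effort can be focused on the
# hot spots. Module bodies are illustrative stand-ins.

import time

def profile(modules):
    """Run each module once and report its share of execution time."""
    times = {}
    for name, fn in modules.items():
        start = time.perf_counter()
        fn()
        times[name] = time.perf_counter() - start
    total = sum(times.values())
    return {name: 100.0 * t / total for name, t in times.items()}

# Stand-ins for modules A, B and C; C does far more work, as in the
# example table above.
modules = {
    "A": lambda: sum(range(1_000)),
    "B": lambda: sum(range(2_000)),
    "C": lambda: sum(range(500_000)),
}
shares = profile(modules)
print(max(shares, key=shares.get))   # C dominates the execution profile
```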
5.6.1.5 Program Documentation
Most programming projects suffer from lack of up-to-date documentation. Automatic
documentation tools are motivated by the desire to overcome this deficiency. These tools work
on the source program to produce different forms of documentation, e.g. flow charts, IO
specifications showing files and their records, etc.
5.6.1.6 Design of Software tools
Program preprocessing techniques are used to support static analysis of programs. Tools
generating cross reference listings and lists of unreferenced symbols, test data generators, and
documentation aids use this technique. Program instrumentation implies insertion of statements
in a program; the instrumented program is translated using a standard translator, and during
execution the inserted statements perform a set of desired functions. Profile and debug monitors
typically use this technique. In a profile monitor, an inserted statement updates a counter
indicating the number of times a statement is executed, whereas in a debug monitor an inserted
statement indicates that execution has reached a specific point in the source program.
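Program instrumentation can be sketched as follows. The source statements, the inserted-call syntax and the monitoring routine are all hypothetical; here the inserted calls update execution counters, as a profile monitor would, whereas a debug monitor would instead check for breakpoints at each call.

```python
# A sketch of program instrumentation: a call to a monitoring routine
# is inserted before every source statement; at run time those calls
# update per-statement execution counters.

source = ["a := 1", "b := a + 2", "print b"]

# Instrumentation step: insert "call debug_mon(i)" before statement i.
instrumented = []
for i, stmt in enumerate(source, start=1):
    instrumented.append(f"call debug_mon({i})")
    instrumented.append(stmt)

counts = {}
def debug_mon(stmt_no):
    """The inserted statement's action: bump the execution counter."""
    counts[stmt_no] = counts.get(stmt_no, 0) + 1

# Simulate one execution of the instrumented program: only the
# inserted monitoring calls are interpreted here.
for line in instrumented:
    if line.startswith("call debug_mon"):
        debug_mon(int(line[len("call debug_mon("):-1]))

print(counts)   # {1: 1, 2: 1, 3: 1}
```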
Example: A debug monitor instruments a program by inserting statements of the form

    call debug_mon(const_i);

before every statement in the program, where const_i is an integer constant indicating the serial
number of the statement in the program. During execution of the instrumented program, the
debug monitor receives control after every statement. Suppose the user has specified the debug
actions

    At 10, display total
    At 20, display term

Every time the debug monitor receives control, it checks whether statement 10 or 20 is about to
be executed, and then performs the debug action indicated by the user.
Program interpretation and program generation
Use of interpreters in software tools is motivated by the same reasons that motivate the
use of interpreters in program development. Since most requirements met by software tools are
ad hoc, it is useful to eliminate the translation phase. However, interpreter based tools suffer
from poor efficiency and poor portability, since an interpreter based tool is only as portable as
the interpreter it uses. A generated program is more efficient and can be made portable.
5.7 EDITORS
Text editors come in the following forms:
1. Line editors.
2. Stream editors
3. Screen editors
4. Word processors
5. Structure editors
* The scope of edit operations in a line editor is limited to a line of text. The line is designated
positionally, e.g. by specifying its serial number in the text, or contextually, e.g. by specifying
a context which uniquely identifies it.
* The primary advantage of line editors is their simplicity.
* A stream editor views entire text as a stream of characters.
* This permits edit operations to cross line boundaries.
* Stream editors typically support character, line and context oriented commands based on the
current editing context indicated by the position of a text pointer.
* The pointer can be manipulated using positioning or search commands.
* An editor's operation involves four tasks: travelling, editing, viewing and display. Travelling
implies movement of the editing context to a new position within the text; this may be done
explicitly by the user or may be implied in a user command. Viewing implies formatting the
text in a manner desired by the user.
* The display component maps this view into the physical characteristics of the display
device being used.
* This determines where a particular view may appear on the user's screen.
* The separation of viewing and display functions gives rise to interesting possibilities like
multiple windows on the same screen, concurrent edit operations using the same display
terminal, etc.
* A simple text editor may choose to combine the viewing and display functions.
* For a given position of the editing context, the editing filters operate on the internal
form of text to prepare the forms suitable for editing and viewing.
* The viewing and display manager makes provision for appropriate display of text.
* When the cursor position changes or editing is performed, the editing filter reflects the
changes into the internal form and updates the contents of the viewing buffer.
* Apart from the fundamental editing functions, most editors support an undo function to
nullify one or more of the previous edit operations performed by the user.
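The undo function mentioned above is commonly implemented with a stack of inverse operations. The following Python sketch is a minimal line-buffer illustration under that assumption; real editors record richer operation histories.

```python
# A minimal sketch of an editor's undo facility: each edit pushes its
# inverse onto a stack, and undo pops and replays the inverses.

class Buffer:
    def __init__(self):
        self.lines = []
        self.undo_stack = []

    def insert(self, pos, line):
        self.lines.insert(pos, line)
        self.undo_stack.append(("delete", pos))        # inverse of insert

    def delete(self, pos):
        removed = self.lines.pop(pos)
        self.undo_stack.append(("insert", pos, removed))  # inverse of delete

    def undo(self):
        """Nullify the most recent edit operation."""
        op = self.undo_stack.pop()
        if op[0] == "delete":
            self.lines.pop(op[1])
        else:
            self.lines.insert(op[1], op[2])

buf = Buffer()
buf.insert(0, "first line")
buf.insert(1, "second line")
buf.delete(0)
buf.undo()                 # restores "first line"
print(buf.lines)           # ['first line', 'second line']
```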
(Figure: Structure of an editor. The editing filter and the viewing filter operate on the internal
form of the text, maintaining the editing buffer and the viewing buffer respectively.)
5.8 DEBUG MONITORS
A debug monitor provides debugging support for a program. Debugging of a program under a
debug monitor proceeds as follows:
1. The user compiles the program under the debug option. The compiler produces
two files: the compiled code file and the debug information file.
2. The user activates the debug monitor and indicates the name of the program to be
debugged. The debug monitor opens the compiled code and debug information
files for the program.
3. The user specifies the debug requirements: a list of breakpoints and actions to be
performed at breakpoints. The debug monitor instruments the program, and builds
a debug table containing the (breakpoint, debug action) pairs.
4. The instrumented program gets control and executes up to a breakpoint.
5. A software interrupt is generated when the <SI_instrn> is executed. Control is
given to the debug monitor which consults the debug table and performs the
debug actions specified for the breakpoint. A debug conversation is now opened
during which the user may issue some debug commands or modify breakpoints
and debug actions associated with breakpoints.
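The breakpoint mechanism in steps 3 to 5 can be sketched as follows. The debug table contents and the dispatch function are illustrative; in a real debug monitor the inserted <SI_instrn> raises a software interrupt, whereas here an ordinary function call stands in for it.

```python
# A sketch of the debug-monitor flow above: breakpoints and their
# actions are recorded in a debug table; when execution reaches a
# breakpoint, the monitor performs the associated debug action.

debug_table = {                    # statement no. -> debug action
    2: "display total",
    3: "display term",
}

actions_performed = []

def debug_monitor(stmt_no):
    """Invoked at every instrumented statement (the <SI_instrn>
    analog); consults the debug table and performs any action."""
    if stmt_no in debug_table:
        actions_performed.append((stmt_no, debug_table[stmt_no]))
        # ...a debug conversation with the user could be opened here.

# Simulated execution of a 4-statement instrumented program.
for stmt_no in range(1, 5):
    debug_monitor(stmt_no)

print(actions_performed)   # [(2, 'display total'), (3, 'display term')]
```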