System Software and Compilers PDF
System Software
Semester : VI Course Code : 15CS63
Outcomes
The students should be able to:
1. Define system software such as assemblers and macroprocessors.
2. Define system software such as loaders and linkers.
3. Perform lexical analysis and syntax analysis, and become familiar with source file, object file and
executable file structures and libraries.
4. Describe the front-end and back-end phases of a compiler and their importance.
MODULE- 1
Introduction to System Software,
Machine Architecture of SIC and SIC/XE.
Assemblers: Basic assembler functions, machine dependent assembler features,
machine independent assembler features, assembler design options.
Macroprocessors: Basic macro processor functions. (10 Hours)
MACHINE ARCHITECTURE
System Software:
System software consists of a variety of programs that support the operation of a computer.
Application software focuses on an application or problem to be solved.
System software is machine-dependent software that allows the user to focus on the
application or problem to be solved, without worrying about the details of how the
machine works internally.
Examples: Operating system, compiler, assembler, macroprocessor, loader or linker, debugger, text
editor, database management systems, etc.
We discuss here the SIC machine architecture with respect to its Memory and Registers,
Data Formats, Instruction Formats, Addressing Modes, Instruction Set, Input and Output.
Memory:
There are 2^15 bytes in the computer memory, that is, 32,768 bytes. It uses Little Endian format to
store the numbers. Each location in memory contains an 8-bit byte, and 3 consecutive bytes form a word.
Registers:
There are five registers, each 24 bits in length. Their mnemonic, number and use are given in the
following table.
PC 8 Program counter
Data Formats:
Integers are stored as 24-bit binary numbers. 2's complement representation is used for negative
values. Characters are stored using their 8-bit ASCII codes. There is no floating-point hardware on the
standard version of SIC.
Instruction Formats:
All machine instructions on the standard version of SIC have the 24-bit format as shown above.
Addressing Modes:
Direct x= 0 TA = address
Instruction Set
Memory
Registers
Additional B, S, T, and F registers are provided by SIC/XE, in addition to the registers of SIC.
B 3 Base register
Instruction Formats :
The new set of instruction formats for the SIC/XE machine architecture is as follows.
Format 2 (2 bytes): first eight bits for operation code, next four for register 1 and following four for
register 2. The numbers for the registers go according to the numbers indicated at the registers
section (ie, register T is replaced by hex 5, F is replaced by hex 6).
Format 3 (3 bytes): First 6 bits contain operation code, next 6 bits contain flags, last 12 bits contain
displacement for the address of the operand. Operation code uses only 6 bits, thus the second hex
digit will be affected by the values of the first two flags (n and i). The flags, in order, are: n, i, x, b, p,
and e. Their functionality is explained in the next section. The last flag e indicates the instruction format
(0 for format 3 and 1 for format 4).
Format 4 (4 bytes): same as format 3 with an extra 2 hex digits (8 bits) for addresses that require
more than 12 bits to be represented.
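The bit layouts of formats 3 and 4 can be checked with a short Python sketch. The function names are hypothetical, and the opcode values used in the usage note (14 for STL, 48 for JSUB, both hexadecimal) are taken from the standard SIC/XE instruction set rather than from these notes.

```python
def _first_byte(opcode, n, i):
    # The 6-bit operation code sits in the upper bits; n and i take the lower two.
    return (opcode & 0xFC) | (n << 1) | i


def encode_format3(opcode, n, i, x, b, p, disp):
    """Pack a 3-byte format 3 instruction (e = 0) and return it as 6 hex digits."""
    if not 0 <= disp <= 0xFFF:
        raise ValueError("displacement must fit in 12 bits")
    word = _first_byte(opcode, n, i) << 16
    word |= (x << 15) | (b << 14) | (p << 13)  # the e bit (bit 12) stays 0
    word |= disp
    return f"{word:06X}"


def encode_format4(opcode, n, i, x, address):
    """Pack a 4-byte format 4 instruction (b = p = 0, e = 1) as 8 hex digits."""
    if not 0 <= address <= 0xFFFFF:
        raise ValueError("address must fit in 20 bits")
    word = _first_byte(opcode, n, i) << 24
    word |= (x << 23) | (1 << 20)  # e = 1 marks format 4
    word |= address
    return f"{word:08X}"
```

For example, `encode_format3(0x14, 1, 1, 0, 0, 1, 0x02D)` packs an STL instruction with PC-relative displacement 02D, and `encode_format4(0x48, 1, 1, 0, 0x01036)` packs an extended-format JSUB to address 1036.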
Addressing Modes:
1. Direct (x, b, and p all set to 0): operand address goes as it is. n and i are both set to the same
value, either 0 or 1. While in general that value is 1, if set to 0 for format 3 we can assume that the
rest of the flags (x, b, p, and e) are used as a part of the address of the operand, to make the format
compatible to the SIC format.
2. Relative (either b or p equal to 1 and the other one to 0): the displacement in the instruction is
added to the current value stored in the B register (if b = 1) or to the value stored in the PC register
(if p = 1) to obtain the operand address.
3. Immediate (i = 1, n = 0): the operand value is contained in the instruction itself (i.e., it lies in the
last 12/20 bits of the instruction).
4. Indirect (i = 0, n = 1): the target address points to a word that holds the address of the
operand value.
5. Indexed (x = 1): the value stored in register X is added when computing the actual address of
the operand. This can be combined with any of the previous modes except immediate.
The various flag bits used in the above formats have the following meanings
Bits x,b,p : Used to calculate the target address using relative, direct, and indexed addressing Modes.
Bits i and n: Specify how the target address is used.
When b and p are both set to 0, the disp field of a format 3 instruction is taken to be the target
address.
For a format 4 instruction, bits b and p are normally set to 0 and the 20-bit address is the target address.
i=1, n=0 Immediate addressing, TA: the TA itself is used as the operand value; no memory reference is made
i=0, n=1 Indirect addressing, ((TA)): the word at the TA is fetched, and its value is taken as the address
of the operand value
i=0, n=0 or i=1, n=1 Simple addressing, (TA): the TA is taken as the address of the operand value
Two new relative addressing modes are available for use with instructions assembled using format 3.
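The flag rules above can be sketched in Python as follows. This is a simplified model under stated assumptions: the function names are hypothetical, registers are passed in a dictionary, and memory is modelled as a dictionary from word address to word value instead of individual bytes.

```python
def target_address(disp, x, b, p, e, regs):
    """Compute the target address (TA) from the flag bits of an instruction."""
    if e == 1:
        ta = disp  # format 4: the 20-bit address field is used directly
    elif p == 1 and b == 0:
        # PC-relative: disp is a signed 12-bit value (-2048..2047)
        signed = disp - 0x1000 if disp & 0x800 else disp
        ta = regs["PC"] + signed
    elif b == 1 and p == 0:
        ta = regs["B"] + disp  # base-relative: disp is unsigned (0..4095)
    else:
        ta = disp  # b = p = 0: direct addressing
    if x == 1:
        ta += regs["X"]  # indexed addressing adds the contents of X
    return ta & 0xFFFFF


def operand_value(ta, n, i, memory):
    """Apply the n and i bits to decide how the TA is used."""
    if i == 1 and n == 0:
        return ta  # immediate: the TA itself is the operand
    if n == 1 and i == 0:
        return memory[memory[ta]]  # indirect: the word at TA holds the operand address
    return memory[ta]  # simple addressing
```

With PC = 3000, B = 6000 and X = 90 (all hexadecimal), a PC-relative displacement of 600 gives TA = 3600, and a base-relative, indexed displacement of 300 gives TA = 6390.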
Instruction Set:
SIC/XE provides all of the instructions that are available on the standard version. In addition, it
provides:
• Instructions to load and store the new registers: LDB, STB, etc.
• Floating-point arithmetic operations: ADDF, SUBF, MULF, DIVF
• Register move instruction: RMO
• Register-to-register arithmetic operations: ADDR, SUBR, MULR, DIVR
• Supervisor call instruction: SVC
There are I/O channels that can be used to perform input and output while the CPU is executing
other instructions. Allows overlap of computing and I/O, resulting in more efficient system
operation. The instructions SIO, TIO, and HIO are used to start, test and halt the operation of I/O
channels.
……..
……..
ONE WORD 1
ALPHA RESW 1
BEETA RESW 1
INCR RESW 1
Example 5: To transfer two hundred bytes of data from input device to memory
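        LDX   ZERO        INITIALIZE INDEX REGISTER TO 0
CLOOP   TD    INDEV       TEST INPUT DEVICE
        JEQ   CLOOP       LOOP UNTIL DEVICE IS READY
        RD    INDEV       READ ONE BYTE INTO REGISTER A
        STCH  RECORD,X    STORE THE BYTE INTO THE RECORD
        TIX   B200        ADD 1 TO INDEX, COMPARE WITH 200
        JLT   CLOOP       LOOP IF INDEX IS LESS THAN 200
.
.
INDEV   BYTE  X'F5'       INPUT DEVICE NUMBER
RECORD  RESB  200         200-BYTE BUFFER FOR THE DATA
ZERO    WORD  0
B200    WORD  200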
Assemblers - 1
A Simple Two-Pass Assembler
Main Functions
• It is a copy function that reads some records from a specified input device and then copies
them to a specified output device
– Reads a record from the input device (code F1)
– Copies the record to the output device (code 05)
– Repeats the above steps until encountering EOF.
– Then writes EOF to the output device
– Then call RSUB to return to the caller
Data transfer
– A record is a stream of bytes with a null character (hexadecimal 00) at the end.
– If a record is longer than 4096 bytes, only the first 4096 bytes are copied.
– EOF is indicated by a zero-length record (i.e., a byte stream with only a null
character).
– Because the speed of the input and output devices may be different, a buffer is used
to temporarily store the record
Subroutine call and return
– On line 10, “STL RETADDR” is executed to save the return address that is already stored
in register L.
– Otherwise, after calling RD or WR, COPY cannot return to its caller.
Assembler Directives
An Assembler’s Job
Header
Col. 1 H
Text
Col.1 T
End
Col.1 E
DC2079 2C1036
E 001000
NOTE: There is no object code corresponding to addresses 1033-2038. This storage is simply
reserved by the loader for use by the program during execution.
Pass 1
– Assign addresses to all statements in the program
– Save the values (addresses) assigned to all labels (including label and variable
names) for use in Pass 2 (deal with forward references)
– Perform some processing of assembler directives (e.g., BYTE, RESW, these can affect
address assignment)
Pass 2
– Assemble instructions (generate opcode and look up addresses)
– Generate data values defined by BYTE, WORD
– Perform processing of assembler directives not done in Pass 1
– Write the object program and the assembly listing
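A minimal Python sketch of the Pass 1 bookkeeping for standard SIC is given below. It assumes the source has already been split into (label, opcode, operand) triples, uses a deliberately tiny opcode table, and omits the intermediate file; the names `pass1` and `byte_length` are hypothetical.

```python
# A small subset of the SIC opcode table, enough for the example below.
OPTAB = {"LDA": 0x00, "STA": 0x0C, "LDX": 0x04, "ADD": 0x18, "RSUB": 0x4C}


def byte_length(operand):
    """Length in bytes of a BYTE constant such as C'EOF' or X'F1'."""
    body = operand[2:-1]
    return len(body) if operand[0] == "C" else len(body) // 2


def pass1(lines):
    """Assign addresses and build SYMTAB. Returns (symtab, program_length)."""
    symtab = {}
    start = 0
    if lines and lines[0][1] == "START":
        start = int(lines[0][2], 16)  # the START operand is a hexadecimal address
        lines = lines[1:]
    locctr = start
    for label, opcode, operand in lines:
        if opcode == "END":
            break
        if label:
            if label in symtab:
                raise ValueError(f"duplicate symbol {label}")
            symtab[label] = locctr  # save the address for use in Pass 2
        if opcode in OPTAB or opcode == "WORD":
            locctr += 3  # every SIC instruction and every WORD is 3 bytes
        elif opcode == "RESW":
            locctr += 3 * int(operand)
        elif opcode == "RESB":
            locctr += int(operand)
        elif opcode == "BYTE":
            locctr += byte_length(operand)
        else:
            raise ValueError(f"invalid operation code {opcode}")
    return symtab, locctr - start
```

Forward references cause no difficulty here because Pass 1 only records where each label is defined; operands are not resolved until Pass 2.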
Content
– The mapping between mnemonic and machine code. Also include the instruction
format, available addressing modes, and length information.
Characteristic
– Static table. The content will never change.
Implementation
– Array or hash table. Because the content will never change, we can optimize its
search speed.
In pass 1, OPTAB is used to look up and validate mnemonics in the source program.
In pass 2, OPTAB is used to translate mnemonics to machine instructions.
(Algorithm for Pass 1 of the two-pass assembler. Only fragments of the listing survive in this copy:
LOCCTR is initialized to 0 when there is no START statement, and 3 is added to LOCCTR for each
instruction found in OPTAB.)
(Algorithm for Pass 2 of the two-pass assembler. The listing is not recoverable in this copy.)
– Program relocation
– Literals
– Symbol-defining statements
– Expressions
– Program blocks
A SIC/XE Program
• Indirect addressing: op @m
• Immediate addressing: op #c
• Register-to-register instructions
– Can save one byte from using format 3 rather than format 4.
– Fetching a value stored in a register is much faster than fetching it from
memory.
• Let the assembled program start at address 0 so that later it can be easily moved to
any place in the physical memory.
PC or Base-Relative Modes
– Base-relative: 0~4095
– PC-relative: -2048~2047
• If the displacement cannot fit into 12 bits, format 4 then needs to be used. (E.g., line
15 and 125)
• The difference between PC and base relative addressing modes is that the assembler
knows the value of PC when it tries to use PC-relative mode to assemble an
instruction, but it does not know what value register B will hold at execution time.
– Therefore, the programmer must tell the assembler the value of register B.
– This is done through the use of the BASE directive. (line 13)
– Also, the programmer must load the appropriate value into register B
explicitly.
– Another BASE directive can appear later, this will tell the assembler to change
its notion of the current value of B.
– NOBASE can also be used to tell the assembler that no more base-relative
addressing mode should be used.
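The choice between the two relative modes can be sketched as a small Python function. The name `choose_displacement` is hypothetical; `pc` is assumed to be the address of the next instruction, and `base` is the value announced by the BASE directive (or `None` when NOBASE is in effect).

```python
def choose_displacement(target, pc, base=None):
    """Return (b, p, disp) for a format 3 instruction, or None if format 4 is needed."""
    disp = target - pc
    if -2048 <= disp <= 2047:
        # PC-relative is tried first; negative values are stored in 12-bit two's complement.
        return (0, 1, disp & 0xFFF)
    if base is not None:
        disp = target - base
        if 0 <= disp <= 4095:
            return (1, 0, disp)  # base-relative displacement is unsigned
    return None  # neither mode fits in 12 bits
```

For instance, a target at 0030 referenced from an instruction whose next-instruction address is 0003 gives a PC-relative displacement of 02D, while a backward jump from 001A to 0006 gives FEC.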
Relocatable Is Desired
• The program in Fig. 2.1 specifies that it must be loaded at address 1000 for correct
execution. This restriction is too inflexible for the loader.
• If the program is loaded at a different address, say 2000, its memory references will
access wrong data! For example:
• Thus, we want to make programs relocatable so that they can be loaded and executed
correctly at any place in the memory.
If we can use a hardware relocation register (MMU), software relocation can be avoided
here. However, when linking multiple object Programs together, software relocation is still
needed.
• Only those instructions that use absolute (direct) addresses to reference symbols need to be
modified.
• When the assembler generates an address for a symbol, the address to be inserted
into the instruction is relative to the start of the program.
• The assembler also produces a modification record, in which the address and length
of the need-to-be-modified address field are stored.
• The loader, when seeing the record, will then add the beginning address of the
loaded program to the address field stored in the record.
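What the loader does with one modification record can be sketched in Python. This is a simplified model: the program image is a `bytearray` indexed from the start of the program, the field length is given in half-bytes, and the function name is hypothetical.

```python
def apply_modification(image, load_address, field_address, half_bytes):
    """Add the load address to one address field of the program image, in place."""
    nbytes = (half_bytes + 1) // 2
    raw = int.from_bytes(image[field_address:field_address + nbytes], "big")
    # With an odd number of half-bytes the field starts in the low-order
    # 4 bits of its first byte, so the upper bits must be preserved.
    mask = (1 << (4 * half_bytes)) - 1
    relocated = ((raw & mask) + load_address) & mask
    raw = (raw & ~mask) | relocated
    image[field_address:field_address + nbytes] = raw.to_bytes(nbytes, "big")
```

If a format 4 instruction 4B101036 starts at relative address 0006, its 20-bit address field starts at relative address 0007 and is 5 half-bytes long; loading the program at 5000 changes the instruction to 4B106036.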
MODULE-2
These are the features which do not depend on the architecture of the machine. These are:
Literals
Expressions
Program blocks
Control sections
Literals
A literal is defined with a prefix = followed by a specification of the literal value.
Example:
The example above shows a 3-byte operand whose value is a character string EOF. The object code
for the instruction is also mentioned. It shows the relative displacement value of the location where
this value is stored. In the example the value is at location (002D) and hence the displacement value
is (010).
As another example the given statement below shows a 1-byte literal with the hexadecimal value
‘05’.
All the literal operands used in a program are gathered together into one or more literal pools. This
is usually placed at the end of the program. The assembly listing of a program containing literals
usually includes a listing of this literal pool, which shows the assigned addresses and the generated
data values. In some cases the pool is placed at some other location in the object program; the
assembler directive LTORG is used for this. Whenever LTORG is encountered, the assembler creates a
literal pool that contains all the literal operands used since the previous LTORG (or since the
beginning of the program). This is useful because it is better to place the literals close to the
instructions that use them.
A literal table is created for the literals which are used in the program. The literal table contains the
literal name, operand value and length. The literal table is usually created as a hash table on the
literal name.
Implementation of Literals:
During Pass-1:
The literal encountered is searched in the literal table. If the literal already exists, no action is taken;
if it is not present, the literal is added to the LITTAB and for the address value, it waits till it
encounters LTORG for literal definition. When Pass 1 encounters a LTORG statement or the end of
the program, the assembler makes a scan of the literal table. At this time each literal currently in the
table is assigned an address. As addresses are assigned, the location counter is updated to reflect
the number of bytes occupied by each literal.
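The Pass 1 handling of literals can be sketched in Python as follows. The class and method names are hypothetical, and only character (=C'...') and hexadecimal (=X'...') literals are handled.

```python
def literal_bytes(literal):
    """Convert a literal such as =C'EOF' or =X'05' to its data value."""
    kind, body = literal[1], literal[3:-1]
    return body.encode("ascii") if kind == "C" else bytes.fromhex(body)


class LiteralTable:
    def __init__(self):
        # literal name -> operand value, length, and address (None until a pool is placed)
        self.entries = {}

    def reference(self, literal):
        """Called when Pass 1 meets a literal operand; duplicates are ignored."""
        if literal not in self.entries:
            value = literal_bytes(literal)
            self.entries[literal] = {"value": value, "length": len(value), "address": None}

    def place_pool(self, locctr):
        """Called at LTORG or at the end of the program; returns the updated LOCCTR."""
        for entry in self.entries.values():
            if entry["address"] is None:
                entry["address"] = locctr
                locctr += entry["length"]
        return locctr
```

A Python dictionary plays the role of the hash table organized on the literal name. Literals that already received an address at an earlier LTORG are skipped, so each literal is stored only once.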
During Pass-2:
The assembler searches the LITTAB for each literal encountered in an instruction and uses the
literal's address as the operand address. The data values of the literals in each pool are generated as
if they had been defined by BYTE or WORD statements. If a literal represents an address in the
program, the assembler must generate a modification record for it, since its value is affected by
relocation. The following figure shows the difference between the SYMTAB and LITTAB.
Symbol-Defining Statements:
EQU Statement:
Most assemblers provide an assembler directive that allows the programmer to define symbols and
specify their values. The directive used for this is EQU (Equate). The general form of the statement is
symbol EQU value
This statement defines the given symbol (i.e., enters it in the SYMTAB) and assigns to it the
value specified. The value can be a constant or an expression involving constants and any
other symbol which is already defined. One common usage is to define symbolic names that can be
used to improve readability in place of numeric values.
For example
+LDT #4096
This loads the register T with the immediate value 4096, but it does not clearly show what this
value represents. If a statement is included as:
+LDT #MAXLEN
Then it clearly indicates that the value of MAXLEN is some maximum length value. When the
assembler encounters the EQU statement, it enters the symbol MAXLEN along with its value in the
symbol table. While assembling LDT, the assembler searches the SYMTAB for this entry and uses its
value as the operand in the instruction. The object code generated is the same for both options,
but the second form is easier to understand. It is also easier to maintain: if the maximum length is
changed from 4096 to 1024 and the value appears as an immediate operand wherever it is required,
we have to scan the whole program and change every occurrence of 4096. If the instructions refer to
the value through the symbol defined by EQU, we only have to change the value of MAXLEN in the
EQU statement (once).
ORG Statement:
This directive can be used to indirectly assign values to the symbols. The directive is usually called
ORG (for origin). Its general format is:
ORG value
where value is a constant or an expression involving constants and previously defined symbols.
When this statement is encountered during assembly of a program, the assembler resets its location
counter (LOCCTR) to the specified value. Since the values of symbols used as labels are taken from
LOCCTR, the ORG statement will affect the values of all labels defined until the next ORG is
encountered. ORG is used to control the assignment of storage in the object program. Sometimes
altering the values may result in incorrect assembly.
ORG can be useful in label definition. Suppose we need to define a symbol table with the following
structure:
SYMBOL 6 Bytes
VALUE 3 Bytes
FLAG 2 Bytes
The SYMBOL field contains a 6-byte user-defined symbol; VALUE is a one-word representation of the
value assigned to the symbol; FLAG is a 2-byte field that specifies the symbol type and other
information. The space for the table can be reserved by the statement:
STAB RESB 1100
If we want to refer to the entries of the table using indexed addressing, place the offset value of the
desired entry from the beginning of the table in the index register. To refer to the fields SYMBOL,
VALUE, and FLAG individually, we need to assign their values first as shown below:
SYMBOL EQU STAB
VALUE EQU STAB+6
FLAG EQU STAB+9
To retrieve the VALUE field from the table indicated by register X, we can write a statement:
LDA VALUE, X
The same thing can also be done using ORG statement in the following way:
STAB RESB 1100
ORG STAB
SYMBOL RESB 6
VALUE RESW 1
FLAG RESB 2
ORG STAB+1100
The first statement allocates 1100 bytes of memory to the label STAB. The ORG statement in the
second line sets the location counter to the value of STAB, so LOCCTR now points to STAB. The next
three lines assign appropriate memory storage to each of the SYMBOL, VALUE and FLAG symbols. The
last ORG statement resets LOCCTR to a new value after skipping the memory required for the table
STAB (i.e., STAB+1100).
While using ORG, the symbol occurring in the statement should be predefined as is required in EQU
statement. For example for the sequence of statements below:
ORG ALPHA
BYTE1 RESB 1
BYTE2 RESB 1
BYTE3 RESB 1
ORG
ALPHA RESB 1
This sequence cannot be processed because the symbol used to assign the new location counter
value is not yet defined. In the first pass, the assembler would not know what value to assign to
ALPHA, so the symbols on the following lines could not be entered in the symbol table either. This
is a forward-reference problem.
EXPRESSIONS:
Assemblers also allow use of expressions in place of operands in the instruction. Each such
expression must be evaluated to generate a single operand value or address. Assemblers generally
allow arithmetic expressions formed according to the normal rules using the operators +, -, *, /.
Division is usually defined to produce an integer result. Individual terms may be constants,
user-defined symbols, or special terms. The only special term used is * (the current value of the
location counter), which indicates the value of the next unassigned memory location. Thus the statement
BUFFEND EQU *
Assigns a value to BUFFEND, which is the address of the next byte following the buffer area. Some
values in the object program are relative to the beginning of the program and some are absolute
(independent of the program location, like constants). Hence, expressions are classified as either
absolute expression or relative expressions depending on the type of value they produce.
Absolute Expressions:
An expression that uses only absolute terms is an absolute expression. An absolute expression may
also contain relative terms, provided the relative terms occur in pairs with opposite signs for each pair.
Example:
In the above instruction the difference in the expression gives a value that does not depend on the
location of the program, and hence the value is absolute regardless of the relocation of the program.
The expression can also have only absolute terms. Example:
Relative Expressions: All the relative terms except one can be paired as described in “absolute”. The
remaining unpaired relative term must have a positive sign. Example:
Handling the type of expressions: to find the type of an expression, we must keep track of the type
of the symbols used. This can be achieved by recording the type in the symbol table against each
symbol as shown in the table below:
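The pairing rule for relative terms can be sketched in Python. The sketch assumes expressions that use only + and - (relative terms may not appear in multiplication or division), a hypothetical dictionary `symtab_types` mapping each symbol to "R" (relative) or "A" (absolute), and terms given as (sign, term) pairs.

```python
def expression_type(terms, symtab_types):
    """Classify an expression as 'absolute' or 'relative', or reject it."""
    net = 0  # number of unpaired relative terms, counted with their signs
    for sign, term in terms:
        # Integer constants are absolute; symbols are looked up in the table.
        if isinstance(term, str) and symtab_types[term] == "R":
            net += sign
    if net == 0:
        return "absolute"  # every relative term is paired with one of opposite sign
    if net == 1:
        return "relative"  # exactly one unpaired relative term, with a positive sign
    raise ValueError("illegal expression: neither absolute nor relative")
```

With BUFFER and BUFEND both relative, BUFEND-BUFFER is classified as absolute and BUFFER+10 as relative, while BUFEND+BUFFER is rejected.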
Program Blocks:
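Program blocks allow the generated machine instructions and data to appear in the object
program in a different order from the source program, by using separate blocks for code, data, stack,
and larger data areas. The assembler directive USE indicates which block the following statements
belong to: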
USE [blockname]
At the beginning, statements are assumed to be part of the unnamed (default) block. If no USE
statements are included, the entire program belongs to this single block. Each program block may
actually contain several separate segments of the source program. The assembler rearranges these
segments to gather together the pieces of each block and assigns addresses, placing the blocks in the
object program in a particular order. For example, a large buffer area can be moved to the end of the
object program. Program readability is better if data areas are placed in the source program close to
the statements that reference them.
Pass 1
Store the block name or number in the SYMTAB along with the assigned relative address of
the label
Indicate the block length as the latest value of LOCCTR for each block at the end of Pass1
Assign to each block a starting address in the object program by concatenating the program
blocks in a particular order
Pass 2
Calculate the address for each symbol relative to the start of the object program by adding
the location of the symbol relative to the start of its block to the starting address of that block
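The two address calculations can be sketched in Python. The block names and lengths in the usage note are hypothetical, as are the function names; SYMTAB entries are assumed to be (block name, relative address) pairs.

```python
def assign_block_starts(block_lengths):
    """End of Pass 1: concatenate the blocks in order of first appearance."""
    starts, address = {}, 0
    for name, length in block_lengths.items():
        starts[name] = address
        address += length
    return starts


def symbol_address(symtab, starts, name):
    """Pass 2: block starting address plus the symbol's address within its block."""
    block, relative = symtab[name]
    return starts[block] + relative
```

For blocks of lengths 0066, 000B and 1000 (hexadecimal), the starting addresses are 0000, 0066 and 0071, so a symbol at relative address 0003 in the second block has the address 0069 in the object program.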
Control Sections:
A control section is a part of the program that maintains its identity after assembly; each
control section can be loaded and relocated independently of the others. Different control sections
are most often used for subroutines or other logical subdivisions. The programmer can assemble,
load, and manipulate each of these control sections separately.
Because of this, there should be some means for linking control sections together. For
example, instructions in one control section may refer to the data or instructions of other control
sections. Since control sections are independently loaded and relocated, the assembler is unable to
process these references in the usual way. Such references between different control sections are
called external references.
The assembler generates the information about each of the external references that will
allow the loader to perform the required linking. When a program is written using multiple control
sections, the beginning of each control section is indicated by the assembler directive CSECT.
The syntax :
secname CSECT
Control sections differ from program blocks in that they are handled separately by the
assembler. Symbols that are defined in one control section may not be used directly in another control
section; they must be identified as external references for the loader to handle. The external
references are indicated by two assembler directives:
EXTDEF (external definition): this statement in a control section names symbols that are defined in
this section but may be used by other control sections. Control section names do not need to be
named in EXTDEF, as they are automatically considered external symbols.
EXTREF (external reference): this statement names symbols that are used in this section but are
defined in some other control section. The order in which these symbols are listed is not significant.
The assembler must include information about the external references in the object program that
will cause the loader to insert the proper values where they are required. To do this, the assembler
uses two new record types in the object program and a changed version of the modification record.
Define record
Col. 1 D
Refer record
Col. 1 R
Modification record
Col. 1 M
Col.11-16 External symbol whose value is to be added to or subtracted from the indicated field
A define record gives information about the external symbols that are defined in this
control section, i.e., symbols named by EXTDEF.
A refer record lists the symbols that are used as external references by the control section,
i.e., symbols named by EXTREF.
The new items in the modification record specify the modification to be performed: adding
or subtracting the value of some external symbol. The symbol used for modification may be defined
either in this control section or in another section.
The object program is shown below. There is a separate object program for each of the control
sections. In the Define Record and refer record the symbols named in EXTDEF and EXTREF are
included.
In the case of Define, the record also indicates the relative address of each external symbol within
the control section.
For EXTREF symbols, no address information is available. These symbols are simply named in the
Refer record.
– One-pass assembler
– Multi-pass assembler
One-Pass Assembler
• No loader is needed
• Good for computing center where most students reassemble their programs
each time.
Internal Implementation
• The assembler generates object code instructions as it scans the source program.
• If an instruction operand is a symbol that has not yet been defined, the operand address is
omitted when the instruction is assembled.
• The address of the operand field of the instruction that refers to the undefined symbol is
added to a list of forward references associated with the symbol table entry.
• When the definition of the symbol is encountered, the forward reference list for that symbol
is scanned, and the proper address is inserted into any instruction previously generated.
– On line 45, when the symbol ENDFIL is defined, the assembler places its value in the
SYMTAB entry.
– The assembler then inserts this value into the instruction operand field (at address
201C).
– From this point on, any references to ENDFIL would not be forward references and
would not be entered into a list.
• At the end of the processing of the program, any SYMTAB entries that are still marked with *
indicate undefined symbols.
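The forward-reference list can be sketched in Python. This is a simplified model: `memory` is a dictionary from operand-field address to the value stored there, the class and method names are hypothetical, and the value 2024 given to ENDFIL in the usage note is illustrative only.

```python
class OnePassSymtab:
    def __init__(self, memory):
        self.memory = memory  # operand-field address -> value placed there
        self.defined = {}     # symbol -> value
        self.pending = {}     # undefined symbol -> list of operand-field addresses

    def reference(self, symbol, field_address):
        """An instruction whose operand field is at field_address uses symbol."""
        if symbol in self.defined:
            self.memory[field_address] = self.defined[symbol]
        else:
            self.memory[field_address] = 0  # operand address omitted for now
            self.pending.setdefault(symbol, []).append(field_address)

    def define(self, symbol, value):
        """The definition of symbol is encountered: patch all earlier references."""
        self.defined[symbol] = value
        for field_address in self.pending.pop(symbol, []):
            self.memory[field_address] = value

    def undefined(self):
        """Symbols still undefined at the end of the program."""
        return sorted(self.pending)
```

After `reference("ENDFIL", 0x201C)` followed by `define("ENDFIL", 0x2024)`, the operand field at 201C holds 2024, and later references to ENDFIL are filled in immediately without being added to a list.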
Multi-Pass Assembler
DELTA RESW 1
• This is because ALPHA and BETA cannot be defined in pass 1. Actually, if we allow multi-pass
processing, DELTA is defined in pass 1, BETA is defined in pass 2, and ALPHA is defined in
pass 3, and the above definitions can be allowed.
• It is unnecessary for a multi-pass assembler to make more than two passes over the entire
program.
• Instead, only the parts of the program involving forward references need to be processed in
multiple passes.
• The method presented here can be used to process any kind of forward references.
Steps:
• Use a symbol table to store symbols that are not totally defined yet.
– We store the names and the number of undefined symbols which contribute to the
calculation of its value.
– We also keep a list of symbols whose values depend on the defined value of this
symbol.
• When a symbol becomes defined, we use its value to reevaluate the values of all of the
symbols that are kept in this list.
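These steps can be sketched in Python. The class and method names are hypothetical, and each symbol-defining expression is assumed to be supplied as a list of the symbols it uses plus a function that computes its value from the table of defined symbols.

```python
class MultiPassSymtab:
    def __init__(self):
        self.values = {}      # fully defined symbols
        self.waiting = {}     # symbol -> [count of undefined symbols, compute function]
        self.dependents = {}  # symbol -> symbols whose values depend on it

    def define_expr(self, name, uses, compute):
        """Define name by an expression that uses the symbols listed in uses."""
        missing = {s for s in uses if s not in self.values}
        if not missing:
            self._set(name, compute(self.values))
            return
        self.waiting[name] = [len(missing), compute]
        for symbol in missing:
            self.dependents.setdefault(symbol, []).append(name)

    def define(self, name, value):
        """Define name directly, e.g. a label whose address is now known."""
        self._set(name, value)

    def _set(self, name, value):
        self.values[name] = value
        # Re-examine every symbol that was waiting for this one.
        for dependent in self.dependents.pop(name, []):
            entry = self.waiting[dependent]
            entry[0] -= 1
            if entry[0] == 0:
                del self.waiting[dependent]
                self._set(dependent, entry[1](self.values))
```

If ALPHA is defined in terms of BETA and BETA in terms of DELTA, both stay in `waiting` until DELTA is defined; at that point BETA and then ALPHA are evaluated in a chain.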
MODULE-3
Lexical Analysis
Role of lexical analyzer
Specification of tokens
Recognition of tokens
Finite automata
1. Simplicity of design
A pattern is a description of the form that the lexemes of a token may take
A lexeme is a sequence of characters in the source program that matches the pattern for a
token
Example
Lexical errors
Some errors are beyond the power of the lexical analyzer to recognize:
o fi (a == f(x)) …
However, it may be able to recognize errors like:
o d = 2r
Such errors are recognized when no pattern for tokens matches a
character sequence
Error recovery
1. Panic mode: successive characters are ignored until we reach a well-formed token
2. Delete one character from the remaining input
3. Insert a missing character into the remaining input
Input buffering
Sentinels
Specification of tokens
1. In theory of compilation regular expressions are used to formalize the specification
of tokens
3. Example:
i. letter_(letter_ | digit)*
Regular expressions
1. Ɛ is a regular expression, L(Ɛ) = {Ɛ}
Regular definitions
1. d1 -> r1
2. d2 -> r2
3. …
4. dn -> rn
5. Example:
6. letter_ -> A | B | … | Z | a | b | … | z | _
7. digit -> 0 | 1 | … | 9
8. id -> letter_ (letter_ | digit)*
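The regular definition for id translates almost directly into a regular expression of Python's standard `re` module. This is only an illustration; the helper name `is_identifier` is hypothetical.

```python
import re

LETTER_ = r"[A-Za-z_]"  # letter_ -> A | B | ... | Z | a | b | ... | z | _
DIGIT = r"[0-9]"        # digit -> 0 | 1 | ... | 9
ID = re.compile(rf"{LETTER_}(?:{LETTER_}|{DIGIT})*")  # id -> letter_ (letter_ | digit)*


def is_identifier(text):
    """True if the whole string is a lexeme matching the pattern for id."""
    return ID.fullmatch(text) is not None
```

Here `is_identifier("count_1")` holds, while a string such as "2r" is rejected because it does not start with letter_.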
Extensions
One or more instances: (r)+
Example:
id -> letter_(letter|digit)*
Recognition of tokens
Starting point is the language grammar to understand the tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
|Ɛ
expr -> term relop term
| term
term -> id
| number
Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter|digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
We also need to handle whitespaces:
Transition diagrams
TOKEN getRelop()
{
    switch(state) {
        case 0: c = nextchar();
                if (c == '<') state = 1;
                break;
        case 1: …
        case 8: retract();
                retToken.attribute = GT;
                return(retToken);
    }
}
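A runnable Python sketch of a transition diagram for relop is given below. It is not the routine above: the state numbering is reduced to the three states that need to read a character, retraction is modelled by simply not advancing the position, and the function name and return convention are assumptions.

```python
def get_relop(text, pos=0):
    """Recognize a relational operator starting at text[pos].

    Returns (attribute, next_position), or None if no relop starts at pos.
    """
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", pos + 1)
            elif c == ">":
                state = 6
            else:
                return None  # the lexeme is not a relop
            pos += 1
        elif state == 1:  # "<" has been seen
            if c == "=":
                return ("LE", pos + 1)
            if c == ">":
                return ("NE", pos + 1)
            return ("LT", pos)  # retract: the character read belongs to the next token
        elif state == 6:  # ">" has been seen
            if c == "=":
                return ("GE", pos + 1)
            return ("GT", pos)  # retract
```

The calls `get_relop("<=")` and `get_relop("<a")` show the difference between accepting a two-character operator and retracting after a one-character operator.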
Finite Automata
o A set of states S
o A start state n
Transition
s1 a s2
Is read
If end of input
Example
Alphabet still { 0, 1 }
MODULE-4
Uses of grammars
E -> E + T | T
T -> T * F | F
F -> (E) | id
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
Error handling
Lexical errors
Syntactic errors
Semantic errors
Logical errors
Terminals
Nonterminals
Start symbol
Productions
Derivations
E -> E + E | E * E | -E | (E) | id
Parse trees
-(id+id)
Elimination of ambiguity
A -> A α|β
A -> β A’
A’ -> α A’ | ɛ
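The transformation for immediate left recursion can be sketched in Python. Production bodies are represented as lists of symbols, the empty list stands for ɛ, and the function name is hypothetical; indirect left recursion is not handled.

```python
def eliminate_left_recursion(head, bodies):
    """Rewrite A -> A α | β as A -> β A' and A' -> α A' | ɛ."""
    new_head = head + "'"
    alphas = [body[1:] for body in bodies if body and body[0] == head]
    betas = [body for body in bodies if not body or body[0] != head]
    if not alphas:
        return {head: bodies}  # no immediate left recursion to remove
    return {
        head: [beta + [new_head] for beta in betas],
        new_head: [alpha + [new_head] for alpha in alphas] + [[]],  # [] is ɛ
    }
```

Applied to E -> E + T | T, this yields E -> T E' and E' -> + T E' | ɛ, which matches the non-left-recursive expression grammar listed earlier.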
Left factoring
On seeing input if it is not clear for the parser which production to use
A -> αA’
A’ -> β1 | β2
Example: id+id*id
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
void A() {
for (i=1 to k) {
if (Xi is a nonterminal
Example
S->cAd
A->ab | a
Input: cad
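A recursive-descent parser with backtracking for this grammar can be sketched in Python using generators, so that each alternative is tried in turn when a later match fails. The function names are hypothetical, and the approach would loop forever on a left-recursive grammar.

```python
GRAMMAR = {
    "S": [["c", "A", "d"]],
    "A": [["a", "b"], ["a"]],
}


def parse(symbol, text, pos):
    """Yield every input position reachable after deriving symbol from text[pos:]."""
    if symbol not in GRAMMAR:  # terminal: must match the input exactly
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for body in GRAMMAR[symbol]:  # try each production in order
        yield from parse_sequence(body, text, pos)


def parse_sequence(symbols, text, pos):
    """Match a sequence of grammar symbols one after another."""
    if not symbols:
        yield pos
        return
    for middle in parse(symbols[0], text, pos):
        yield from parse_sequence(symbols[1:], text, middle)


def accepts(text):
    """True if the whole input can be derived from the start symbol S."""
    return any(end == len(text) for end in parse("S", text, 0))
```

On the input cad, the parser first tries A -> ab, fails to match b against d, backtracks, and then succeeds with A -> a.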
In predictive parsing when we have A-> α|β, if First(α) and First(β) are
disjoint sets then we can select appropriate A-production by looking
at the next input
Follow(A), for any nonterminal A, is set of terminals a that can appear immediately
after A in some sentential form
Computing First
To compute First(X) for all grammar symbols X, apply following rules until no more
terminals or ɛ can be added to any First set:
Example!
Computing follow
To compute Follow(A) for all nonterminals A, apply following rules until nothing can be
added to any follow set:
Example!
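Both computations repeat their rules until no set changes, which can be sketched in Python for the non-left-recursive expression grammar used earlier. The representation (lists of symbols, the string "ɛ" for the empty string, "$" as the end marker) and the function names are assumptions of this sketch.

```python
EPS = "ɛ"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F": [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """First set of a string of grammar symbols, given the current First sets."""
    result = set()
    for symbol in symbols:
        symbol_first = first[symbol] if symbol in GRAMMAR else {symbol}
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)  # every symbol can derive the empty string
    return result


def compute_first():
    first = {nonterminal: set() for nonterminal in GRAMMAR}
    changed = True
    while changed:  # repeat until no First set grows
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                new = first_of_string(body, first)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first


def compute_follow(first, start="E"):
    follow = {nonterminal: set() for nonterminal in GRAMMAR}
    follow[start].add("$")
    changed = True
    while changed:  # repeat until no Follow set grows
        changed = False
        for head, bodies in GRAMMAR.items():
            for body in bodies:
                for i, symbol in enumerate(body):
                    if symbol not in GRAMMAR:
                        continue
                    rest = first_of_string(body[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # what follows the head also follows this symbol
                        new |= follow[head]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow
```

Calling `compute_follow(compute_first())` gives, for example, Follow(E) = { ), $ } and Follow(F) = { +, *, ), $ }.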
LL(1) Grammars
Grammars for which we can create predictive parsers are called LL(1)
A grammar G is LL(1) if and only if whenever A -> α | β are two distinct productions of G,
the following conditions hold:
If α =>* ɛ then β does not derive any string beginning with a terminal in Follow(A).
If after performing the above, there is no production in M[A,a] then set M[A,a] to error .
Example
BOTTOMUP PARSING
Shift-reduce parser
The general idea is to shift some symbols of input to the stack until a reduction can be
applied
The key decisions during bottom-up parsing are about when to reduce and about what
production to apply
A reduction is a reverse of a step in a derivation
Handle pruning
Basic operations:
LR Parsing
Why LR parsers?
Table driven
The class of grammars for which we can construct LR parsers is a superset of those
for which we can construct LL parsers
States of an LR parser
An LR(0) item of G is a production of G with the dot at some position of the body:
A->.XYZ
A->X.YZ
A->XY.Z
A->XYZ.
In a state having A->.XYZ we hope to see a string derivable from XYZ next on the
input.
Augmented grammar:
If A->α.Bβ is in closure(I) and B->γ is a production then add the item B->.γ
to closure(I).
Example: E’->E
E -> E + T | T
Closure algorithm
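SetOfItems CLOSURE(I) {
    J = I;
    repeat
        for (each item A->α.Bβ in J)
            for (each production B->γ of G)
                if (B->.γ is not in J)
                    add B->.γ to J;
    until no more items are added to J on one round;
    return J;
}
GOTO Algorithm
SetOfItems GOTO(I,X) {
    J = empty;
    if (A->α.Xβ is in I)
        add CLOSURE(A->αX.β) to J;
    return J;
}
void items(G’) {
    C = CLOSURE({[S’->.S]});
    repeat
        for (each set of items I in C)
            for (each grammar symbol X)
                if (GOTO(I,X) is not empty and not in C)
                    add GOTO(I,X) to C;
    until no new sets of items are added to C on a round;
}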
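A Python sketch of the three routines for the augmented expression grammar is given below. An LR(0) item is represented as a (head, body, dot position) tuple and a set of items as a `frozenset`; these representations and the function names are assumptions of the sketch.

```python
GRAMMAR = {
    "E'": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("id",)],
}


def closure(items):
    """Add B -> .γ for every nonterminal B that appears right after a dot."""
    result = set(items)
    work = list(items)
    while work:
        head, body, dot = work.pop()
        if dot < len(body) and body[dot] in GRAMMAR:
            for production in GRAMMAR[body[dot]]:
                item = (body[dot], production, 0)
                if item not in result:
                    result.add(item)
                    work.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over symbol in every item where that is possible."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)


def canonical_collection(start_item=("E'", ("E",), 0)):
    """Build the collection of sets of LR(0) items."""
    symbols = {s for bodies in GRAMMAR.values() for body in bodies for s in body}
    states = [closure({start_item})]
    for state in states:  # the list grows while it is being traversed
        for symbol in sorted(symbols):
            target = goto(state, symbol)
            if target and target not in states:
                states.append(target)
    return states
```

For this grammar the initial set contains seven items and the collection contains twelve sets of items.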
LR parsing algorithm
if (ACTION[s,a] = shift t) {
Method
If any conflict appears then we say that the grammar is not SLR(1).
The initial state of the parser is the one constructed from the set of items
containing [S’->.S]
MODULE-5
Introduction
If the dependency graph has an edge from M to N then the attribute at M must be evaluated
before the attribute of N
Thus the only allowable orders of evaluation are those sequences of nodes N1,N2,…,Nk
such that if there is an edge from Ni to Nj then i<j
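Such an ordering is a topological sort of the dependency graph. A minimal Python sketch using the standard `graphlib` module (Python 3.9 or later) follows; the function name and the attribute names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def evaluation_order(edges):
    """Return one allowable evaluation order for the attributes.

    edges is an iterable of (M, N) pairs meaning that M must be evaluated before N.
    graphlib raises CycleError if the graph has a cycle, in which case no order exists.
    """
    sorter = TopologicalSorter()
    for m, n in edges:
        sorter.add(n, m)  # N depends on M
    return list(sorter.static_order())
```

For the edges ("F.val", "T'.inh"), ("T'.inh", "T'.syn") and ("T'.syn", "T.val"), the only possible order lists the four attributes in exactly that sequence.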
Example!
S-Attributed definitions
postorder(N) {
S-Attributed definitions can be implemented during bottom-up parsing without the need
to explicitly create parse trees
L-Attributed definitions
An SDD is L-attributed if the edges in the dependency graph go from left to right but
not from right to left.
Synthesized
Example:
Production          Semantic rule
E -> E1 + T
E -> E1 - T
E -> T              E.node = T.node
T -> (E)            T.node = E.node
T -> id
T -> num
An SDT is a Context Free grammar with program fragments embedded within production
bodies
Any SDT can be implemented by first building a parse tree and then performing the
actions in a left-to-right depth first order
Typically SDT’s are implemented during parsing without building a parse tree .
The simplest SDDs to implement are those whose grammar can be parsed bottom-up and which are
S-attributed
For such cases we can construct an SDT where each action is placed at the end of the
production and is executed along with the reduction of the body to the head of that
production
SDT’s with all actions at the right ends of the production bodies are called postfix SDT’s
In a shift-reduce parser we can easily implement semantic actions using the parser stack
For each nonterminal (or state) on the stack we can associate a record holding its
attributes
Then in a reduction step we can execute the semantic action at the end of a production
to evaluate the attribute(s) of the nonterminal on the left side of the production
and put the value on the stack in place of the right side of the production
EXAMPLE
L -> E n {print(stack[top-1].val);
top=top-1;}
E -> E1 + T {stack[top-2].val=stack[top-2].val+stack[top].val;
top=top-2;}
E -> T
T -> T1 * F {stack[top-2].val=stack[top-2].val*stack[top].val;
top=top-2;}
T -> F
F -> (E) {stack[top-2].val=stack[top-1].val;
top=top-2;}
F -> digit
Intermediate code is the interface between front end and back end in a compiler
Ideally the details of source language are confined to the front end and the details of
target machines to the back end (a m*n model)
This way we can easily show the common sub-expressions and then use that
knowledge during code generation
Example: a+a*(b-c)+(b-c)*d
Algorithm
Search the array for a node M with label op, left child l and right child r
If there is such a node, return the value number of M
If not, create in the array a new node N with label op, left child l, and right
child r and return its value number
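The algorithm can be sketched in Python. One deliberate difference from the description above: a dictionary is used to find an existing node instead of searching the array linearly. The class name and the representation of leaves as ("id", name) are assumptions of this sketch.

```python
class Dag:
    def __init__(self):
        self.nodes = []  # array of records; a node's index is its value number
        self.index = {}  # (op, left, right) -> value number, replaces the linear search

    def node(self, op, left=None, right=None):
        """Return the value number of the node (op, left, right), creating it if needed."""
        key = (op, left, right)
        if key in self.index:
            return self.index[key]  # an identical node already exists
        self.nodes.append(key)
        self.index[key] = len(self.nodes) - 1
        return self.index[key]


def build_example(dag):
    """Build the DAG for a+a*(b-c)+(b-c)*d and return the value numbers of interest."""
    a, b, c, d = (dag.node("id", name) for name in "abcd")
    first_diff = dag.node("-", b, c)
    left = dag.node("+", a, dag.node("*", a, first_diff))
    second_diff = dag.node("-", b, c)  # reuses the node built for the first b-c
    root = dag.node("+", left, dag.node("*", second_diff, d))
    return first_diff, second_diff, root
```

For the example expression both occurrences of b-c receive the same value number, so the DAG has nine nodes, whereas a syntax tree would need thirteen.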
In a three address code there is at most one operator at the right side of an
instruction
Example:
Quadruples
Triples
Temporaries are not used and instead references to instructions are made
Indirect triples
Type Expressions
Example: int[2][3]
array(2,array(3,integer))
A type expression can be formed by applying the array type constructor to a number and
a type expression.
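Applying the array constructor repeatedly can be sketched in Python, with type expressions represented as nested tuples. The function names and the tuple representation are assumptions of this sketch.

```python
def array_type(dimensions, base):
    """Build array(n1, array(n2, ... base)) from a list of dimensions."""
    expression = base
    for size in reversed(dimensions):  # the innermost array is built first
        expression = ("array", size, expression)
    return expression


def show(type_expression):
    """Render a type expression in the notation used above."""
    if isinstance(type_expression, tuple):
        _, size, element = type_expression
        return f"array({size},{show(element)})"
    return type_expression
```

For int[2][3], `show(array_type([2, 3], "integer"))` gives array(2,array(3,integer)).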
A type expression can be formed by using the type constructor → for function types
If s and t are type expressions, then their Cartesian product s*t is a type expression
Type expressions may contain variables whose values are type expressions.
Short-Circuit Code
Flow-of-Control Statements