CSC 2209 Notes
TECHNOLOGY
Course Objectives.
The course introduces students to the design of different kinds of systems software and
considers the implementation of such software on a variety of real machines.
Grading Policy:
Grades will be based on your performance on two in-class tests, a
comprehensive final examination and Course work.
40% Course work, which includes tests and course work assignments.
60% Final Examination.
N.B. Late homework decreases your overall score by 20% per day.
COURSE OUTLINE:
1. Review Micro Computer Architecture
1.1 CPU
1.2 Memory
1.3 The Intel 8085/8088 CPU
1.4 Machine Language Instructions
1.5 Instruction Formats and Addressing Modes
3. Assemblers
3.1 Assembler Tables and Logic
3.2 Instruction Formats and addressing Modes.
3.3 Program Relocation
3.4 Literals
3.5 Program Blocks and Control sections
5. Compilers
5.1 Basic Compiler Functions
5.1.1 Grammars
5.1.2 Lexical Analysis
5.1.3 Syntactic Analysis
5.1.4 Code Generation
5.2 Machine Dependent Compiler Features
Code Optimisation
5.3 Machine Independent Code Optimization
5.3.1 Storage Allocation
5.3.2 Structured Variables
5.4 Compiler Design Options
6. Macroprocessors
7.1 Macro Definitions
7.2 Macro Processors Tables and Logic
7.3 Macro expansions
SYSTEMS PROGRAMMING
A computer system is sometimes subdivided into two functional entities:
Hardware and Software.
The hardware of a computer consists of all the electronic components
and the electro mechanical devices that comprise the physical entity of
the device.
Software consists of the instructions and data that the computer
manipulates to perform various data processing tasks.
Three types of software exist:
1. Systems Software
2. Development Software
3. Application Software
The system software of a computer consists of a collection of programs
whose purpose is to make more effective use of the computer. They
control the operation of the machine and carry out the most basic
functions the computer performs. They control the way in which the
computer receives input, produces output, manages and stores data,
carries out or executes instructions of other programs etc.
Examples of systems programs include language processors (compilers
and assemblers, which accept human-readable languages and translate them
into machine language), loaders (which prepare machine language
programs for execution), macro processors (which allow programmers to use
abbreviations), operating systems and file systems (which allow flexible storage
and retrieval of information).
Application programs are written by the user for the purpose of solving
particular problems using a computer as a tool e.g. application
packages.
Development Software is used to create, update and maintain other
programs e.g. programming languages.
Systems software supports the operation and use of the computer itself
rather than any particular application. It is therefore closely tied to the
structure of the machine on which it is to run.
There are however some aspects of systems software that do not
directly depend upon the type of the computing system being supported
e.g. the general design and logic of an assembler is basically the same
on most computers.
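[Figure: block diagram of a microcomputer. The microprocessor (CPU) and its bus control logic connect through the system bus to the memory interface and memory module, and to the I/O interface and the mass storage device subsystem.]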
The microprocessor
At the centre of all operations is the MPU (Microprocessor Unit). In a
microcomputer the CPU is the microprocessor. Its purpose is to
decode the instructions and use them to control the
activities within the system.
It also performs the arithmetic (+, -, /, *) and logical
(>, >=, <, <=, =, !=) computations.
Timing Circuitry (Clock)
Used to synchronise the activities within the microprocessor and the bus
control logic.
Memory
Stores both data and instructions that are currently being
used. Memory is broken down into modules where each module
contains several thousand locations.
Each location is associated with an identifier called a
memory address.
System Bus.
A set of conductors that connect the CPU to its memory and I/O devices.
The bus conductors are normally separated into 3 groups:
1. The Data Lines: for transmitting information
2. Address Lines: Indicate where information is to come from or
where it is to be placed.
3. Control Lines: To regulate the activities on the bus.
Interface
[Figure: block diagram of a typical CPU showing the control unit, the instruction register, the stack pointer, the arithmetic registers, the arithmetic/logic unit and the bus.]
A typical CPU consists of the control unit which contains the following
registers:
1. Program Counter (PC): Holds the address of the main
memory location from which the next instruction is to be fetched.
2. Instruction Register (IR): Receives the instruction when it is brought
from memory and holds it while it is decoded and executed.
3. Processor Status Word (PSW): Contains condition flags which
indicate the current status of the CPU and the important
characteristics of the result of the previous instruction.
4. Stack Pointer (SP): Accesses a special part of memory called a
stack, which is used to temporarily store important information while
subroutines are being executed. It holds the address of the top of
the stack.
Working Registers
They are Arithmetic registers or accumulators and address registers.
(i) Arithmetic Registers: They temporarily hold the operands and the
result of the arithmetic operations.
Accessing a register is faster than accessing memory. If several
operations are to be performed it is better to have the operands in
registers than in memory and return only the result to memory. The
more arithmetic registers a CPU has the faster it can execute
computations.
Arithmetic/Logic Unit
EXAMPLES OF CPUs
The Intel 8085
Flags: S Z AC P C
Accumulator (8 bits)
General Purpose Registers:
B (8 bits) C (8 bits)
D (8 bits) E (8 bits)
H (8 bits) L (8 bits)
ALU
INSTRUCTION FORMATS
The arrangement of an instruction with respect to assigning meaning to
its various groups of bits is called its format.
The portion of the instruction that specifies what the instruction does is
called the operation code (opcode).
All instruction formats reserve the first bits of the instruction for at least
part of the opcode but beyond this the formats vary considerably from
one computer to the next. The remaining bits designate the operands or
their locations.
Instructions vary in length from 1 byte to 3 bytes on the 8085 and up to 6 bytes on the 8088.
e.g.
Register to register transfer
0 1 0 0 0 1 1 1
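The bit pattern above can be read field by field. The following Python sketch assumes the standard 8085 encoding of a register-to-register MOV (two fixed bits 01, then a 3-bit destination code and a 3-bit source code); the function name decode_mov and the register list are illustrative names, not part of the notes.

```python
# Register codes used by the 8085 MOV instruction (M means "memory via HL").
REGS = ["B", "C", "D", "E", "H", "L", "M", "A"]

def decode_mov(opcode: int) -> str:
    """Decode a 1-byte 8085 instruction of the form 01 DDD SSS."""
    if opcode >> 6 != 0b01:
        raise ValueError("not in the MOV group")
    if opcode == 0b01110110:
        # The pattern that would mean MOV M, M is used for HLT instead.
        return "HLT"
    dst = REGS[(opcode >> 3) & 0b111]   # bits 5..3: destination
    src = REGS[opcode & 0b111]          # bits 2..0: source
    return f"MOV {dst}, {src}"

print(decode_mov(0b01000111))  # MOV B, A
```

Under this reading the byte shown above is MOV B, A: the opcode bits come first and the remaining bits name the operands.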
ADDRESSING MODES
They are the methods used to locate and fetch an operand from an
internal CPU register or from a memory location.
Each processor has its own addressing modes.
1. Immediate Addressing: Information is part of the instruction. No
addressing is needed to get the information.
It is mostly used for quantities that are constants.
They are 2 byte instructions where the operand is the second byte.
2. Direct addressing: The address is part of the instruction.
3. Register addressing: The operand is in the register and the register’s
address is part of the instruction.
4. Indirect Addressing: The address is in the location whose address is
specified as part of the instruction. This location may be a register
(register indirect addressing) or it may be a memory location.
e.g., add contents of register R1 to the memory location whose
address is in register R2.
5. Base addressing: The address is formed by adding the contents of a
memory location or register to a number called a displacement
which is part of the instruction. It is used primarily to reference
arrays or in relocating a program in memory.
6. Indexing: It is a process of incrementing or decrementing an address
as the computer sequences through a set of consecutive or evenly
spaced addresses. This is done by successively changing an
ASSEMBLER LANGUAGE
It is a type of language that is closer to machine language instructions.
There is an assembler language instruction for each machine language
instruction.
An assembler converts Assembler Language into machine instructions.
There are 2 types of statements in assembler:
(i) Instructions: These are translated into machine code by the
assembler.
(ii) Directive: Gives directions to the assembler during the assembly
process but they are not translated into machine code.
Acronyms called mnemonics indicate the type of instruction.
Character strings called symbols or identifiers represent addresses and
perhaps numbers.
A typical assembler instruction would be
MOV A, M
A typical assembler directive would be
COST: DS 1
This directive causes the assembler to reserve a byte and associate the
symbol COST with it.
Example:
The assignment ANS := X + Y can be coded as follows for the 8085
microprocessor.
LDA X
MOV B, A
LDA Y
ADD B
STA ANS
Assembler Directives.
The directives direct the assembler during the assembly process.
The ASM 85 has 3 directives.
They have the format
Label: Mnemonic Operand, Operand
DS (Define Storage)
It is used to reserve memory and perhaps to assign a label to the first
byte of the reserved area. e.g. ARRAY: DS 20 reserves 20
bytes and assigns the label ARRAY to the byte with the lowest address.
DB (Define Byte)
Used to put values into (pre-assign values to) memory locations, as well
as to reserve space and assign labels.
It plays the same role as the DATA statement in Fortran. It can include up to 8
operands, where each operand is either a string constant of no more than
128 characters or a constant expression that evaluates to a 2's
complement number from -128 to 127.
e.g. NUM: DB 14H, 'ABC', 01101000B
reserves 5 bytes and associates the label NUM with the first byte.
14 41 42 43 68
NUM
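The byte layout above can be checked with a small Python sketch. The function define_bytes is a hypothetical helper that mimics what DB does with its operands; it is not how ASM 85 is implemented.

```python
def define_bytes(*operands):
    """Mimic DB: strings become ASCII codes, numbers become one byte each."""
    out = bytearray()
    for op in operands:
        if isinstance(op, str):
            out += op.encode("ascii")        # one byte per character
        else:
            if not -128 <= op <= 127:
                raise ValueError("DB operand out of range")
            out.append(op & 0xFF)            # 2's complement in 8 bits
    return bytes(out)

# NUM: DB 14H, 'ABC', 01101000B
num = define_bytes(0x14, "ABC", 0b01101000)
print(num.hex(" ").upper())  # 14 41 42 43 68
```

The string 'ABC' contributes three bytes, which is why the statement reserves five bytes in total.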
DW (Define Word)
Similar to DB except that it reserves words instead of bytes. Each of its
up to 8 operands should evaluate to a 16-bit number or be a single string
of one or two characters.
The low-order byte of the word is stored at the lower byte address and
the high-order byte at the higher byte address, e.g.
TASK1 and TASK2 are labels. Assuming that TASK1 and TASK2 have
been assigned memory locations 2010 and 108C respectively:
10 20 8C 10 2A 09
Table
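The byte order used by DW can be illustrated in Python. In this sketch define_words is a hypothetical helper, and the third operand value 092AH is inferred from the last two bytes shown above rather than stated in the notes.

```python
def define_words(*values):
    """Mimic DW: each 16-bit value is stored low-order byte first."""
    out = bytearray()
    for v in values:
        out += (v & 0xFFFF).to_bytes(2, "little")
    return bytes(out)

TASK1 = 0x2010   # address assumed for label TASK1, as in the text
TASK2 = 0x108C   # address assumed for label TASK2, as in the text
table = define_words(TASK1, TASK2, 0x092A)  # 092AH is an inferred operand
print(table.hex(" ").upper())  # 10 20 8C 10 2A 09
```

Each word appears with its two bytes swapped relative to the way the number is written, because the lower address holds the low-order byte.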
Registers
There are 5 registers, each with a special use. Each register is 24
bits long.
Mnemonic  Number  Use
A         0       Accumulator; used for arithmetic operations
X         1       Index register; used for addressing
L         2       Linkage register; the Jump to Subroutine (JSUB)
                  instruction stores the return address in this register
PC        8       Program counter; contains the address of the next
                  instruction to be fetched for execution
SW        9       Status word; contains the condition codes
Data Formats
Integers are stored as 24 bit binary numbers; 2’s complement
representation is used for negative numbers. Characters are stored
using their 8 bit ASCII codes. There is no floating point hardware on the
simple standard version.
Instruction formats
All machine instructions have the following 24-bit format.
Field:  opcode  x  address
Bits:   8       1  15
Addressing Modes
There are 2 addressing modes designated by the bit x. When x = 1 the
addressing mode is indexed, when it is 0 it is direct.
Mode Indication Target address calculation
Direct x=0 TA = address
Indexed x=1 TA = address + (X)
(X) means contents of register X.
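As a rough illustration of the 8/1/15 format and the two addressing modes, the Python sketch below splits a 24-bit instruction into its fields and computes the target address. The function name decode_sic is an illustrative choice; the sample instruction 549039 is taken from the object program later in these notes, and the value used for register X is an arbitrary assumption.

```python
def decode_sic(instr: int, x_reg: int):
    """Split a 24-bit standard SIC instruction and compute its target address."""
    opcode = (instr >> 16) & 0xFF    # first 8 bits
    x = (instr >> 15) & 1            # index flag
    address = instr & 0x7FFF         # last 15 bits
    ta = address + (x_reg if x else 0)   # TA = address [+ (X)]
    return opcode, x, ta

# 549039: opcode 54, x = 1, address 1039; assume (X) = 5
opcode, x, ta = decode_sic(0x549039, x_reg=5)
print(f"{opcode:02X} x={x} TA={ta:04X}")  # 54 x=1 TA=103E
```

With x = 0 the same call would return the address field unchanged, which is direct addressing.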
Instruction Set
Instructions available include LDA, LDX, STA, STX, ADD, SUB, MUL,
DIV etc.
All arithmetic operations involve register A and a word in memory; the
result is left in the register.
The instruction COMP compares the value in register A with a word in
memory and sets the condition code CC to indicate the result (<, = or >).
The conditional jump instructions are JLT, JEQ and JGT. For subroutine linkage,
JSUB (jump to subroutine) places the return address in
register L, and RSUB (return from subroutine) returns
by jumping to the address contained in register L.
= device is busy.
RD Read data: Reads the data from a device when the device is
ready otherwise the operation is delayed.
WR Writes data to a device.
The sequence is repeated for each byte of data to be read or written.
Registers
The following additional registers are provided
Mnemonic Number Use
B 3 Base register, used for addressing
S 4 General Working register, no special use.
T 5 General Working register, no special use.
F 6 Floating Point Accumulator.
Data Formats
In addition to the standard formats for the standard version there is a 48
bit floating point data type with the following format.
1 11 36
S exponent fraction
Instruction formats
Format 1 (1 byte)
8
Opcode
Format 2 (2 bytes)
8 4 4
opcode r1 r2
Format 3 (3 bytes)
6 1 1 1 1 1 1 12
opcode n i x b p e disp
Format 4 (4 bytes)
6 1 1 1 1 1 1 20
opcode n i x b p e address
Because the extended version has more memory, an address no longer fits into a 15-bit
field. Two options are available in the extended version:
use some form of relative addressing (format 3) or extend the address
field to 20 bits (format 4). There are also instructions that do not
reference memory at all (formats 1 and 2).
Bit e in formats 3 and 4 is used to distinguish between formats 3 and 4.
(e = 0 means format 3, e = 1 means format 4)
Addressing Modes
Two new relative addressing modes are provided by the extended
version of format 3: Base relative addressing and program counter
relative addressing
Mode Indication Target address calculation
Base relative b= 1, p = 0 TA =(B) + disp (0 <= disp <= 4095)
Program Counter b = 0, p = 1 TA = (PC) + disp (-2048 <= disp <= 2047)
If bits b and p are set to 0, the disp field in format 3 is taken to be the
target address. For format 4 bits b and p must be 0 and the target
address is taken from the address field of the instruction. This is called
direct addressing.
Any of these addressing modes can also be combined with indexed
addressing if bit x is set to 1. In such a case the contents of register X,
written (X), are added in the target address calculation.
Bits i and n are used to specify how the target address is used. If i = 1
and n = 0 the target address itself is used as the operand value. No
memory reference is made. This is immediate addressing.
If i = 0 and n = 1 the word at the location given by the target address is
fetched. The value in this word is taken as the address of the operand
value. This is indirect addressing. If bits i and n are both 0 or both = 1
the target address is taken as the location of the operand. This is
referred to as simple addressing. Indexing cannot be used with
immediate or indirect addressing.
If bits n and i are both 0 then bits b, p and e are considered to be part of
the address field of the instruction rather than flags indicating addressing
modes. This makes Instruction Format 3 identical to the format used on
the standard version of SIC.
b p Addressing Mode Target Address
0 0 Direct TA = Disp + (X)
0 1 PC Relative TA = (PC) + Disp + (X)
1 0 Base Relative TA = (B) + Disp + (X)
1 1 -----
Example
The figure below gives the different addressing modes available. The
contents of registers B, PC, and X and some selected memory locations
are shown.
The machine code for a series of LDA instructions is given. The target
address generated by each instruction and the value that is loaded into
register A are also shown.
(B) = 006000   (PC) = 003000   (X) = 000090
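The rules for the n, i, x, b, p and e bits can be put together in one short Python sketch. It uses the register contents given above; the function name target_address is illustrative, and the instruction encodings in the usage lines are LDA instructions (opcode 00) constructed for this sketch, not values recovered from the missing figure.

```python
B, PC, X = 0x006000, 0x003000, 0x000090   # register contents from the text

def target_address(instr_hex, b_reg=B, pc=PC, x_reg=X):
    """Return (addressing type, target address) for a format 3/4 instruction."""
    raw = bytes.fromhex(instr_hex)
    n, i = (raw[0] >> 1) & 1, raw[0] & 1
    x = (raw[1] >> 7) & 1
    if n == 0 and i == 0:
        # Standard SIC compatibility: b, p, e are part of a 15-bit address.
        addr = ((raw[1] & 0x7F) << 8) | raw[2]
        return "simple", addr + (x_reg if x else 0)
    b, p, e = (raw[1] >> 6) & 1, (raw[1] >> 5) & 1, (raw[1] >> 4) & 1
    if b and p:
        raise ValueError("b = 1 and p = 1 is not a valid combination")
    if e:   # format 4: 20-bit address field
        field = ((raw[1] & 0x0F) << 16) | (raw[2] << 8) | raw[3]
    else:   # format 3: 12-bit displacement
        field = ((raw[1] & 0x0F) << 8) | raw[2]
    if b:
        ta = b_reg + field               # 0 <= disp <= 4095
    elif p:
        if field >= 0x800:               # disp is signed for PC relative
            field -= 0x1000
        ta = pc + field
    else:
        ta = field                       # direct
    if x:
        ta += x_reg
    kind = {(1, 1): "simple", (0, 1): "immediate", (1, 0): "indirect"}[(n, i)]
    return kind, ta

for code in ["032600", "03C300", "022030", "010030", "003600", "0310C303"]:
    kind, ta = target_address(code)
    print(code, kind, f"{ta:06X}")
```

For an immediate instruction the returned target address is itself the operand; for an indirect one it is the address of the word that holds the operand's address.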
Addressing Modes
The following addressing modes apply to Format 3 and 4 instructions. Combinations of
addressing bits not included in this table are treated as errors.
Notes:
4 Format 4 instruction
D Direct addressing
A Assembler selects either program counter relative or base relative mode
Type       n i x b p e   Calculation of TA   Operand   Notes
Immediate  0 1 0 0 0 0   disp                TA        D
           0 1 0 0 0 1   addr                TA        4 D
           0 1 0 0 1 0   (PC) + disp         TA        A
           0 1 0 1 0 0   (B) + disp          TA        A
Instruction Set
All instructions in the standard version are still available. In addition there
are instructions to load and store the new registers (LDB, STB, etc) and
to perform floating point arithmetic. (ADDF, SUBF, MULF, DIVF).
Other instructions work on registers e.g. RMO, ADDR, SUBR, MULR,
DIVR.
In the instruction set table below, uppercase letters refer to specific registers. The notation
m indicates a memory address, n indicates an integer between 1 and 16, and r1 and r2
represent register identifiers.
Parentheses are used to indicate the contents of a register or memory location. Thus
A ← (m..m+2) specifies that the contents of memory locations m through m+2 are
loaded into register A; m..m+2 ← (A) specifies that the contents of register A are stored in
the word that begins at address m.
P Privileged instruction
DIRECTIVES
RESW Reserve the indicated number of words for a data area.
RESB Reserve the indicated number of bytes for a data area
WORD Generate a one word integer constant
BYTE Generate character or hexadecimal constant, occupying as
many bytes as needed to represent the constant
START Specifies the name and starting address for the program.
END Indicates the end of the source program and optionally
specifies the first executable instruction in the program.
EXAMPLES
ALPHA: RESW 1
FIVE : WORD 5
CHARZ: BYTE C’Z’
C1: RESB 1
ALPHA: RESW 1
C1: RESB 1
LDA ALPHA
ADD INCR
SUB ONE
STA BETA
LDA GAMMA
ADD INCR
SUB ONE
STA DELTA
ONE: WORD 1
ALPHA: RESW 1
BETA: RESW 1
GAMMA: RESW 1
DELTA: RESW 1
INCR: RESW 1
ALPHA: RESW 1
BETA : RESW 1
GAMMA: RESW 1
DELTA: RESW 1
INCR : RESW 1
LDX ZERO
MOVECH: LDCH STR1,X
STCH STR2,X
TIX ELEVEN
JLT MOVECH
LDA ZERO
STA INDEX
ADDLP: LDX INDEX
LDA ALPHA, X
ADD BETA, X
STA GAMMA, X
LDA INDEX
ADD THREE
STA INDEX
COMP K300
JLT ADDLP
INDEX: RESW 1
ALPHA: RESW 100
BETA : RESW 100
GAMMA: RESW 100
ZERO : WORD 0
K300: WORD 300
INLOOP: TD INDEV
JEQ INLOOP
RD INDEV
STCH DATA
OUTLP: TD OUTDEV
JEQ OUTLP
LDCH DATA
WD OUTDEV
JSUB READ
READ: LDX #0
LDT #100
RLOOP: TD INDEV
JEQ RLOOP
RD INDEV
STCH RECORD,X
TIXR T
JLT RLOOP
RSUB
ASSEMBLERS
Basic Assembler Functions
An assembler is a program that accepts an assembler language
program as input and produces its machine language equivalent, along with
information for the loader.
[Figure: an assembler language program is fed to the assembler, which produces machine language (passed on to the linker) and a listing.]
Header Record
Col. 1 H
Col 2-7 Program Name
Col 8-13 Starting address of the object program
Col 14-19 Length of object program in bytes.
Text Record
Col. 1 T
Col 2-7 Starting address for object code in this record
Col 8-9 Length of object code in this record in bytes.
Col 10-69 Object code in hexadecimal.
End Record
Col. 1 E
Col 2-7 Address of first executable instruction in object
program.
H^COPY ^001000^00107A
T^001000^1E^141033^482039^001036^281030^301015^482061^3C1003^00102A^0C1039^00102D
T^00101E^15^0C1036^482061^081033^4C0000^454F46^000003^000000
T^002039^1E^041030^001030^E0205D^30203F^D8205D^281030^302057^549039^2C205E^38203F
T^002057^1C^101036^4C0000^F1^001000^041030^E02079^302064^509039^DC2079^2C1036
T^002073^07^382064^4C0000^05
E^001000
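A loader has to pick these records apart before it can use them. The Python sketch below parses Header, Text and End records written with the ^ separators used in these notes (the separators are only for readability). The function parse_record and the dictionary keys are illustrative names.

```python
def parse_record(line: str) -> dict:
    """Parse one Header, Text or End record written with ^ separators."""
    fields = line.split("^")
    kind = fields[0]
    if kind == "H":
        return {"type": "H", "name": fields[1].strip(),
                "start": int(fields[2], 16), "length": int(fields[3], 16)}
    if kind == "T":
        code = bytes.fromhex("".join(fields[3:]))
        if len(code) != int(fields[2], 16):
            raise ValueError("length field does not match object code")
        return {"type": "T", "start": int(fields[1], 16), "code": code}
    if kind == "E":
        entry = int(fields[1], 16) if len(fields) > 1 else None
        return {"type": "E", "entry": entry}
    raise ValueError(f"unknown record type {kind!r}")

header = parse_record("H^COPY ^001000^00107A")
text = parse_record("T^002073^07^382064^4C0000^05")
print(header["name"], hex(header["length"]), len(text["code"]))
```

The length check in the Text branch is a cheap way to catch a damaged record, for example a length field in which the digit 1 has been misread as the letter I.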
The operand is too large to fit into 12 bits, so extended format is used.
0003 LDB #LENGTH 69202D
is also immediate addressing. The value of a symbol is the address
assigned to it, so this instruction loads the address of LENGTH into register B. Here
program counter relative addressing is combined with immediate addressing.
The instruction 002A J @RETADR 3E2003
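The two object codes quoted above can be reproduced with a short Python sketch of program counter relative assembly. The helper assemble_pc_relative is hypothetical; the opcodes 68 (LDB) and 3C (J) and the symbol addresses LENGTH = 0033 and RETADR = 0030 are assumptions chosen to be consistent with the object code shown.

```python
def assemble_pc_relative(opcode, target, next_addr, n, i, x=0):
    """Assemble a format 3 instruction using PC relative addressing."""
    disp = target - next_addr            # (PC) already points past this instruction
    if not -2048 <= disp <= 2047:
        raise ValueError("displacement does not fit in 12 bits")
    first = opcode | (n << 1) | i        # low two opcode bits hold n and i
    flags = (x << 3) | 0b0010            # x, b = 0, p = 1, e = 0
    word = (first << 16) | (flags << 12) | (disp & 0xFFF)
    return f"{word:06X}"

# 0003 LDB #LENGTH : immediate (n = 0, i = 1), next instruction at 0006
print(assemble_pc_relative(0x68, 0x0033, 0x0006, n=0, i=1))  # 69202D
# 002A J @RETADR   : indirect (n = 1, i = 0), next instruction at 002D
print(assemble_pc_relative(0x3C, 0x0030, 0x002D, n=1, i=0))  # 3E2003
```

If the displacement were outside the 12-bit range, the assembler would have to fall back on base relative addressing or the extended format.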
2. Program Relocation
It is often impossible to plan where a program will be placed in
memory, because other programs are sharing the machine. In such
a case the actual starting address of a program is not known until load
time.
The SIC program on page 41 is an example of an absolute program. It
must be loaded at address 1000 (the address that was specified at
assembly time) in order to execute properly.
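[Figure: examples of program relocation, showing the program (length 1076) loaded at starting addresses 5000 and 7420, with RDREC (object code B410) at 6036 in the first case.]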
1. When the assembler generates the object code for the JSUB
instruction it will insert the address of RDREC relative to the start of
the program.
2. The assembler will also produce a command for the loader
instructing it to add the beginning address of the program to the
address field in the JSUB instruction at load time.
The command for the loader must also be a part of the object program.
This is accomplished by having a modification record with the format:
Col. 1 M
Col 2-7 Starting location of the address field to be modified,
relative to the beginning of the program
Col 8-9 Length of the address field to be modified, in half-bytes
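What the loader does with such a record can be sketched in Python. The function apply_modification is a hypothetical helper; the example uses the JSUB instruction 4B101036, whose 20-bit address field starts one byte into the instruction, and assumes the program is loaded at address 5000.

```python
def apply_modification(code: bytearray, start: int, half_bytes: int, delta: int):
    """Add delta to the field of half_bytes hex digits starting at byte `start`.

    For an odd length the field occupies the low-order digits of its bytes,
    so the leading half-byte (the flag bits) is left untouched.
    """
    nbytes = (half_bytes + 1) // 2
    field = int.from_bytes(code[start:start + nbytes], "big")
    mask = (1 << (4 * half_bytes)) - 1
    new = (field & ~mask) | ((field + delta) & mask)
    code[start:start + nbytes] = new.to_bytes(nbytes, "big")

# JSUB RDREC assembled relative to the start of the program
code = bytearray.fromhex("4B101036")
# Modification: field starts 1 byte into the instruction, 5 half-bytes long
apply_modification(code, start=1, half_bytes=5, delta=0x5000)
print(code.hex().upper())  # 4B106036
```

Only the 20-bit address changes; the opcode byte and the flag half-byte are exactly as the assembler left them.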
1. Literals
During pass 1 the assembler gathers all the literals and puts them in the
LITTAB. During pass 2 the data values specified by the literals in each
literal pool are inserted at the appropriate places in the object program.
3. Expressions
Most assemblers allow the use of expressions wherever a single
operand is permitted. Each expression must be evaluated by the
assembler to produce a single operand address or value. Individual
terms in the expression may be constants, user-defined symbols or
special terms; e.g. the most common special term is the current value of
the location counter (designated by *). It represents the value of the next
unassigned memory location.
The statement BUFEND EQU * in the previous program gives
BUFEND the value that is the address of the next byte after the buffer
area.
Expressions are classified as either absolute expressions or relative
expressions depending upon the value they produce. An expression that
contains only absolute terms (independent of the program location like
constants) is an absolute expression. It may also contain relative terms
so long as the relative terms occur in pairs and the terms in each pair
have opposite signs.
A relative expression is one in which all the relative terms except one
can be paired. The remaining unpaired relative term must have a
positive sign. (None of the relative terms may enter into a multiplication or
division operation.) In the expression
MAXLEN EQU BUFEND - BUFFER
both BUFEND and BUFFER are relative terms, but the expression
represents an absolute value: the difference between the two addresses.
Expressions such as BUFEND + BUFFER, 100 - BUFFER or 3 * BUFFER
represent neither absolute values nor locations within the
program.
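The pairing rule can be checked mechanically by adding up the signs of the relative terms. The Python sketch below is one way an assembler might do this; the function classify and the (sign, is_relative) representation are assumptions made for illustration, and multiplication or division of relative terms is not modelled.

```python
def classify(terms):
    """Classify an expression given as a list of (sign, is_relative) terms.

    sign is +1 or -1. The net count of relative terms is 0 when they all
    pair off and +1 when exactly one positive relative term is left over.
    """
    net = sum(sign for sign, is_relative in terms if is_relative)
    if net == 0:
        return "absolute"
    if net == 1:
        return "relative"
    return "invalid"

print(classify([(+1, True), (-1, True)]))    # BUFEND - BUFFER  -> absolute
print(classify([(+1, True), (+1, True)]))    # BUFEND + BUFFER  -> invalid
print(classify([(+1, False), (-1, True)]))   # 100 - BUFFER     -> invalid
print(classify([(+1, True), (+1, False)]))   # BUFFER + 100     -> relative
```

The same count tells the assembler whether the resulting value needs to be relocated: only a relative result moves with the program.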
4. Program Blocks
They are segments of code that are re-arranged within a single object
unit.
In the example below three program blocks have been used. The first
unnamed block contains the executable instructions. The second block
CDATA contains data areas that are a few words in length, the third block
the value of the operand has relative address 0003 within the CDATA
block. The starting address for CDATA is 0066. The desired target
address for this instruction is therefore 0003 + 0066= 0069. The
instruction is to be assembled using program counter relative
addressing.
The address of the next instruction is 0009 within the default block. The
required displacement therefore is 0069 – 0009 = 60. Similar
calculations are performed during pass 2.
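The arithmetic in this example can be written out in Python. The table block_start and the helper names are illustrative; the block starting addresses and offsets are the ones quoted above.

```python
# Starting address assigned to each program block at the end of pass 1
block_start = {"default": 0x0000, "CDATA": 0x0066}

def absolute_address(block: str, offset: int) -> int:
    """Convert an address relative to a block into a program address."""
    return block_start[block] + offset

def pc_relative_disp(target: int, next_instruction: int) -> int:
    """Displacement for program counter relative addressing."""
    return target - next_instruction

target = absolute_address("CDATA", 0x0003)      # operand in CDATA
next_instr = absolute_address("default", 0x0009)  # instruction after this one
disp = pc_relative_disp(target, next_instr)
print(f"{target:04X} {disp:03X}")  # 0069 060
```

All values are hexadecimal, so the displacement 60 in the text is 0x60.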
Because the large buffer area is moved to the end of the object program
there is no need to use extended format instructions. Base register is
also no longer necessary.
defined in the same control section, the value of the expression can
therefore be calculated immediately by the assembler.
Since the assembler leaves room for inserting values for external
symbols it must also include information in the object program that will
cause the loader to insert the proper values where they are required.
Two new record types are defined, they are DEFINE and REFER.
A DEFINE record gives information about EXTDEF and a REFER record
lists the EXTREFs.
The Define record:
Col 1 D
Col 2-7 Name of external symbol defined in this control section
Col 8-13 Relative address of symbol within this section
Col 14-73 Repeat information in col 2-13 for other external symbols.
Col 2-7 Starting address of the field to be modified, relative to the beginning of the
control section
Col 8-9 Length of the field to be modified in half bytes.
Col 10 Modification flag (+ or -)
Col 11-16 External symbol whose value is to be added to or subtracted from the indicated field.
The figure below shows the object program corresponding to the source
code in the previous program. Note that there is a separate set of object
program records from Header through End for each section.
The modification record M^000004^05^+RDREC implies that the address of
RDREC is to be added onto this field in order to produce the correct
machine instruction for execution.
For the instruction at address 0028 both BUFEND and BUFFER are in a
different control section. The assembler generates an initial value of zero
for this word. The last two modification records in RDREC direct that the
address of BUFEND be added to this field and the address of BUFFER
be subtracted from it.
If an expression is to be used, all terms in an expression must be relative
within the same section because if the terms are in different sections
their difference has a value that is unpredictable.
H^COPY ^000000^001033
D^BUFFER^000033^BUFEND^001033^LENGTH^00002D
R^RDREC ^WRREC
T^000000^1D^172027^4B100000^032023^290000^332007^4B100000^3F2FEC^032016^0F2016
T^00001D^0D^010003^0F200A^4B100000^3E2000
T^000030^03^454F46
M^000004^05^+RDREC
M^000011^05^+WRREC
M^000024^05^+WRREC
E^000000
H^RDREC ^000000^00002B
R^BUFFER^LENGTH^BUFEND
T^000000^1D^B410^B400^B440^77201F^E3201B^332FFA^DB2015^A004^332009^57900000^B850
T^00001D^0E^3B2FE9^13100000^4F0000^F1^000000
M^000018^05^+BUFFER
M^000021^05^+LENGTH
M^000028^06^+BUFEND
M^000028^06^-BUFFER
E
H^WRREC ^000000^00001C
R^LENGTH^BUFFER
T^000000^1C^B410^77100000^E32012^332FFA^53900000^DF2008^B850^3B2FEE^4F0000^05
M^000003^05^+LENGTH
M^00000D^05^+BUFFER
E
[Figure: object code in memory and symbol table entries for the program below after the
instruction at address 2021. Columns: memory address, contents, symbol, value.]
[Figure: object code in memory and symbol table entries for the program below after the
instruction at address 2052. Columns: memory address, contents, symbol, value.]
H^COPY ^001000^00107A
T^001000^09^454F46^000003^000000
T^00200F^15^141009^480000^00100C^281006^300000^480000^3C2012
T^00201C^02^2024
T^002024^19^001000^0C100F^001003^0C100C^480000^081009^4C0000^F1^001000
T^002013^02^203D
T^00203D^1E^041006^001006^E02039^302043^D82039^281006^300000^54900F^2C203A^382043
T^002050^02^205B
T^00205B^07^10100C^4C0000^05
T^00201F^02^2062
T^002031^02^2062
T^002062^18^041006^E02061^302065^50900F^DC2061^2C100C^382065^4C0000
E^00200F
1. Relocation
Loaders that allow program relocation are called relocating or relative
loaders. There are two methods for specifying relocation as part of the
object program.
The first method uses modification records which describe each part of
the object code that must be changed when the program is relocated.
Using the program on page 44 the only portions that must be relocated
are those that contain actual addresses at addresses 0006, 0013 and
0026.
H^COPY ^000000^001077
T^000000^1D^17202D^69202D^4B101036^032026^290000^332007^4B10105D^3F2FEC^032010
T^00001D^13^0F2016^010003^0F200D^4B10105D^3E2003^454F46
T^001036^1D^B410^B400^B440^75101000^E32019^332FFA^DB2013^A004^332008^57C003^B850
T^001053^1D^3B2FEA^134000^4F0000^F1^B410^774000^E32011^332FFA^53C003^DF2008^B850
T^001070^07^3B2FEF^4F0000^05
M^000007^05^+COPY
M^000014^05^+COPY
M^000027^05^+COPY
E^000000
There is one modification record for each value that must be changed
during relocation. Each modification record specifies the starting address
and length of the field whose value is to be altered. It then specifies the
modification to be performed. Here all modifications add the value of the
symbol COPY which represents the starting address of the program.
This method is not well suited to a program that uses direct (absolute)
addressing throughout, since nearly every instruction would then need its own
modification record.
In the object code for the program above this mask is represented as
three hexadecimal digits. They are underlined for easier identification. If
the relocation bit corresponding to a word of object code is set to 1 the
program’s starting address is to be added to this word when the program
is to be relocated. A bit value of 0 indicates that no modification is
necessary. If a text record contains fewer than 12 words of object code,
the bits corresponding to the unused words are set to 0.
H^COPY ^000000^00107A
T^000000^1E^FFC^140033^481039^000036^280030^300015^481061^3C0003^00002A^0C0039^00002D
T^00001E^15^E00^0C0036^481061^080033^4C0000^454F46^000003^000000
T^001039^1E^FFC^040030^000030^E0105D^30103F^D8105D^280030^301057^548039^2C105E^38103F
T^001057^0A^800^100036^4C0000^F1^001000
T^001061^19^FE0^040030^E01079^301064^508039^DC1079^2C0036^381064^4C0000^05
E^000000
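The relocation bit technique can be sketched in Python as follows. The function relocate_text_record is a hypothetical helper that treats the object code as consecutive 3-byte words, with the leftmost mask bit belonging to the first word; the load address 5000 is an assumption.

```python
def relocate_text_record(record: str, load_addr: int):
    """Relocate one Text record that carries a 12-bit relocation mask.

    Returns the new starting address and the relocated object code.
    """
    fields = record.replace(" ", "").split("^")
    start = int(fields[1], 16)
    mask = int(fields[3], 16)                  # three hex digits = 12 bits
    code = bytearray.fromhex("".join(fields[4:]))
    for k in range(12):
        if mask & (0x800 >> k):                # bit k set: relocate word k
            word = int.from_bytes(code[3 * k:3 * k + 3], "big")
            word = (word + load_addr) & 0xFFFFFF
            code[3 * k:3 * k + 3] = word.to_bytes(3, "big")
    return start + load_addr, bytes(code)

rec = "T^00001E^15^E00^0C0036^481061^080033^4C0000^454F46^000003^000000"
addr, code = relocate_text_record(rec, 0x5000)
print(f"{addr:06X}", code.hex().upper())
```

With the mask E00 only the first three words are changed; the RSUB instruction and the constants that follow are copied as they are.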
Program Linking
Concepts of program linking were discussed under control sections.
The example below consists of three separately assembled programs,
each having a list of items: LISTA, LISTB and LISTC. Their ends are
marked by ENDA, ENDB and ENDC. The labels at the beginnings and
ends of the lists are external symbols. Each program has the same set
of references to these external symbols.
In PROGA, REF1 is a reference to a label within the program which is
assembled by program counter relative. No modification is necessary.
In PROGB, REF1 Refers to an external symbol. The assembler uses an
extended format instruction with the address field set to 000000. There is
a modification record in the object program instructing the loader to add
the value of LISTA to this address field when the program is linked.
REF2 and REF3 are explained similarly.
In PROGA the assembler can evaluate all of the expression in REF4
except for the value of LISTC. This results in an initial value of 000014
and one modification record.
H^PROGA ^000000^000063
D^LISTA ^000040^ENDA ^000054
R^LISTB ^ENDB ^LISTC ^ENDC
T^000020^0A^03201D^77100004^050014
T^000054^0F^000014^FFFFF6^00003F^000014^FFFFC0
M^000024^05^+LISTB
M^000054^06^+LISTC
M^000057^06^+ENDC
M^000057^06^-LISTC
M^00005A^06^+ENDC
M^00005A^06^-LISTC
M^00005A^06^+PROGA
M^00005D^06^-ENDB
M^00005D^06^+LISTB
M^000060^06^+LISTB
M^000060^06^-PROGA
E^000020
H^PROGB ^000000^00007F
D^LISTB ^000060^ENDB ^000070
R^LISTA ^ENDA ^LISTC ^ENDC
T^000036^0B^03100000^772027^05100000
T^000070^0F^000000^FFFFF6^FFFFFF^FFFFF0^000060
M^000037^05^+LISTA
M^00003E^05^+ENDA
M^00003E^05^-LISTA
M^000070^06^+ENDA
M^000070^06^-LISTA
M^000070^06^+LISTC
M^000073^06^+ENDC
M^000073^06^-LISTC
M^000076^06^+ENDC
M^000076^06^-LISTC
M^000076^06^+LISTA
M^000079^06^+ENDA
M^000079^06^-LISTA
M^00007C^06^+PROGB
M^00007C^06^-LISTA
E
H^PROGC ^000000^000051
D^LISTC ^000030^ENDC ^000042
R^LISTA ^ENDA ^LISTB ^ENDB
T^000018^0C^03100000^77100004^05100000
T^000042^0F^000030^000008^000011^000000^000000
M^000019^05^+LISTA
M^00001D^05^+LISTB
M^000021^05^+ENDA
M^000021^05^-LISTA
M^000042^06^+ENDA
M^000042^06^-LISTA
M^000042^06^+PROGC
M^000048^06^+LISTA
M^00004B^06^+ENDA
M^00004B^06^-LISTA
M^00004B^06^-ENDB
M^00004B^06^+LISTB
M^00004E^06^+LISTB
M^00004E^06^-LISTA
E
The same expression in PROGB contains no terms that can be
evaluated by the assembler. The object code therefore contains an initial
value of 000000 and three modification records.
In PROGC the assembler can supply the value of LISTC relative to the
beginning of the program which is not known until the program is loaded.
The initial value for this data word contains the relative address of LISTC
= 000030. The modification records instruct the loader to add the
beginning address of the program (PROGC), to add the value of ENDA
and to subtract the value of LISTA.
Assume that PROGA has been loaded starting at address 4000, followed
immediately by PROGB and PROGC. REF4 through REF8 then have
the same value after relocation and linking in all three programs.
E.g. REF4 in PROGA is located at address 4054. Its initial value is
000014. To this is added the address assigned to LISTC, which is 4112
(40E2 + 30). This results in the value 004126. The same value is
obtained at relative address 70 (40D3) in PROGB and at relative address 0042 in PROGC.
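The claim that REF4 comes out the same in all three programs can be verified with a few lines of Python. The table ESTAB below is filled in by hand from the load addresses assumed in the text (PROGA at 4000, PROGB at 4063, PROGC at 40E2); the function resolve is an illustrative name.

```python
# External symbol addresses after loading PROGA at 4000, then PROGB, then PROGC
ESTAB = {
    "PROGA": 0x4000, "LISTA": 0x4040, "ENDA": 0x4054,
    "PROGB": 0x4063, "LISTB": 0x40C3, "ENDB": 0x40D3,
    "PROGC": 0x40E2, "LISTC": 0x4112, "ENDC": 0x4124,
}

def resolve(initial: int, modifications) -> int:
    """Apply a list of (sign, symbol) modifications to a 24-bit word."""
    value = initial
    for sign, symbol in modifications:
        value += ESTAB[symbol] if sign == "+" else -ESTAB[symbol]
    return value & 0xFFFFFF

ref4 = {
    "PROGA": resolve(0x000014, [("+", "LISTC")]),
    "PROGB": resolve(0x000000, [("+", "ENDA"), ("-", "LISTA"), ("+", "LISTC")]),
    "PROGC": resolve(0x000030, [("+", "ENDA"), ("-", "LISTA"), ("+", "PROGC")]),
}
for name, value in ref4.items():
    print(name, f"{value:06X}")   # each line shows 004126
```

The initial values and the modification lists are the ones carried by the object programs above; only the split between what the assembler could compute and what is left to the loader differs from program to program.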
definition does not appear until later in the input, the required linking
cannot be performed until an address is assigned to the external symbol
involved.
A linking loader therefore makes two passes just like the assembler.
Pass 1 assigns addresses to all external symbols and pass 2 performs
the actual loading, relocation and linking.
The main data structure used is the External symbol table ESTAB which
is similar to the SYMTAB. It is used to store names and addresses of
each external symbol in the set of control sections being loaded. It also
indicates in which control section the symbol is defined.
The beginning address in memory where the linked program is loaded is
called PROGADDR. Its value is supplied to the loader by the operating
system.
CSADDR is the starting address assigned to the control section currently
being scanned by the loader. It is added to all relative addresses within
the control section to convert them to actual addresses.
During pass 1 the loader is concerned with only the Header and Define
record types. The beginning load address for the linked program
PROGADDR becomes the starting address CSADDR for the first control
section.
The control section name for the Header record is entered into ESTAB
with a value given by CSADDR. All the external symbols appearing in
the Define record are also entered into ESTAB. Their addresses are
obtained by adding the value specified in the Define record to CSADDR.
When the End record is read, the control section length CSLTH which
was saved by the header record is added to CSADDR. This calculation
gives the starting address of the next control section.
At the end of pass 1 ESTAB contains all external symbols defined in the
control sections together with the address assigned to each.
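Pass 1 as described above might be sketched as follows. This is a minimal sketch under stated assumptions, not a complete loader: records are taken to be text lines with fields separated by ^ as in the listings above, the function name pass1 is hypothetical, the control section NEXTCS is made up to show how CSADDR advances, and checks for duplicate symbols are omitted.

```python
def pass1(control_sections, progaddr):
    """Build ESTAB from the Header and Define records of each control section."""
    estab = {}
    csaddr = progaddr              # the first control section starts at PROGADDR
    for records in control_sections:
        cslth = 0
        for record in records:
            fields = [field.strip() for field in record.split("^")]
            kind = fields[0]
            if kind == "H":
                estab[fields[1]] = csaddr          # control section name
                cslth = int(fields[3], 16)         # save the length for later
            elif kind == "D":
                # Define records hold (symbol, relative address) pairs.
                for name, relative in zip(fields[1::2], fields[2::2]):
                    estab[name] = csaddr + int(relative, 16)
            elif kind == "E":
                csaddr += cslth                    # start of the next section
    return estab

PROGC = ["H^PROGC ^000000^000051", "D^LISTC ^000030^ENDC  ^000042", "E"]
NEXTCS = ["H^NEXTCS^000000^000010", "E"]           # hypothetical second section

estab = pass1([PROGC, NEXTCS], 0x40E2)
for name, address in estab.items():
    print(f"{name:8} {address:04X}")
```

Text and Modification records are simply skipped here, since pass 1 does not need them.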
2. Loader Options
Many loaders allow the user to specify options that modify the standard
processing of the program. Below are a few of them:
(i) An option that allows the selection of alternative sources
of input e.g. INCLUDE program_name (library_name) directs the loader
to read the designated object program from a library and treat it as
if it were part of the primary loader input.
(ii) An option to allow the user to delete external symbols or
entire control sections e.g. DELETE csect_name instructs the
loader to delete the named control section from the set of
programs being loaded.
(iii) An option to change external symbols e.g. CHANGE
name1, name2 causes the external symbol name1 to be changed to
name2.
3. Overlay Programs
Overlay programs are designed so that parts of the program that are not
needed in memory at the same time can share the same memory space: one
part executes first, and another part is then loaded into the same space
after the first has finished executing.
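Segment tree (segment number followed by its control sections; each
segment is indented under the segment that calls it):
1 A
    2 B
        3 F/G
        4 H
    5 C
    6 D/E
        7 I
        8 J
        9 K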
Control    Length      Control    Length
Section    (bytes)     Section    (bytes)
A          1000        G          400
B          1800        H          800
C          4000        I          1000
D          2800        J          800
E          800         K          2000
F          1000
In the example above the letters represent control sections and the lines
show control between the control sections. Control section A (the root)
can call B, C, or D/E etc. D/E means that control sections D and E are
closely related and they are always used together. The nodes in the tree
are called segments. The root segment (A) is loaded when execution of
the program begins and it remains in memory until the program ends.
The other segments are loaded as they are called.
If H is being executed both B and A should be in memory since H was
called by B, and B was called by A. Thus the three sections A, B and H
must be active. The other segments cannot be active since there is no
path from them to H. If for example the segment containing K was called
previously it must have returned to D/E and then to A before B could be
called by A.
Because segments at the same level e.g. B, C and D/E can be called
only from the level above, they cannot be required at the same time;
thus they can be assigned to the same locations in memory. If a
segment is loaded it overlays any segments at the same level and their
subordinate segments that may be in memory. The entire program
therefore can be executed in a smaller total amount of memory. This is
the main reason for the use of overlay structures.
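The saving can be estimated with a small sketch. It assumes the control section lengths in the table are hexadecimal (which is consistent with the segment addresses given later), and the names PARENT, active_segments and memory_for are hypothetical.

```python
# Control section lengths in bytes, read as hexadecimal (an assumption).
LENGTH = {"A": 0x1000, "B": 0x1800, "C": 0x4000, "D": 0x2800, "E": 0x800,
          "F": 0x1000, "G": 0x400, "H": 0x800, "I": 0x1000, "J": 0x800,
          "K": 0x2000}

# Overlay tree: each segment mapped to the segment that calls it.
PARENT = {"A": None, "B": "A", "C": "A", "D/E": "A",
          "F/G": "B", "H": "B", "I": "D/E", "J": "D/E", "K": "D/E"}

def active_segments(segment):
    """Segments that must be in memory while `segment` executes (root first)."""
    path = []
    while segment is not None:
        path.append(segment)
        segment = PARENT[segment]
    return path[::-1]

def segment_length(segment):
    """A segment such as D/E holds several control sections."""
    return sum(LENGTH[section] for section in segment.split("/"))

def memory_for(segment):
    return sum(segment_length(s) for s in active_segments(segment))

peak = max(memory_for(segment) for segment in PARENT)   # largest active path
total = sum(LENGTH.values())                            # everything at once

print(active_segments("H"))
print(f"peak {peak:X} bytes, without overlays {total:X} bytes")
```

With these figures the longest path (A, D/E, K) needs 6000 bytes, compared with EC00 bytes if every control section were resident at once.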
The structure of an overlay program is defined to the loader using the
following commands:
SEGMENT seg_name(control-section….) and
PARENT seg_name
Once the overlay structure has been defined the starting addresses for
the segments can be found because each segment begins immediately
after the end of its parent.
The figure below shows the length and the relative starting address of
each segment in our example. It assumes that the beginning load
address for the program is 8000.
Segment    Starting address       Length
           Relative    Actual
1          0000        8000       1000
2          1000        9000       1800
3          2800        A800       1400
4          2800        A800       800
5          1000        9000       4000
6          1000        9000       3000
7          4000        C000       1000
8          4000        C000       800
9          4000        C000       2000
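The table can be reproduced from the rule that each segment begins immediately after the end of its parent. The sketch below assumes hexadecimal lengths and uses hypothetical names (SEGMENTS, relative_start); it is an illustration of the rule rather than the loader's own algorithm.

```python
CS_LENGTH = {"A": 0x1000, "B": 0x1800, "C": 0x4000, "D": 0x2800, "E": 0x800,
             "F": 0x1000, "G": 0x400, "H": 0x800, "I": 0x1000, "J": 0x800,
             "K": 0x2000}

# segment number -> (parent segment number, control sections in the segment)
SEGMENTS = {
    1: (None, ["A"]),
    2: (1, ["B"]), 3: (2, ["F", "G"]), 4: (2, ["H"]),
    5: (1, ["C"]),
    6: (1, ["D", "E"]), 7: (6, ["I"]), 8: (6, ["J"]), 9: (6, ["K"]),
}

PROGADDR = 0x8000  # beginning load address used in the example

def segment_length(number):
    return sum(CS_LENGTH[section] for section in SEGMENTS[number][1])

def relative_start(number):
    """A segment starts right after the end of its parent; the root starts at 0."""
    parent = SEGMENTS[number][0]
    if parent is None:
        return 0
    return relative_start(parent) + segment_length(parent)

table = {n: (relative_start(n), PROGADDR + relative_start(n), segment_length(n))
         for n in SEGMENTS}

for n, (relative, actual, length) in table.items():
    print(f"{n}  {relative:04X}  {actual:04X}  {length:X}")
```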
During the execution of the program many different segments may be in
memory together. Below are some possibilities.
The loader can assign an actual starting address to every segment in the
overlay program once the initial load address is supplied. Thus the
addresses of all external symbols are known and all relocation and
linking operations can be performed.
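[Figure: memory maps for addresses 8000 to E000 showing some combinations
of segments that may be in memory together. The root A is present in every
case, for example together with B and H, or together with D and E.]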
The root segment can be loaded directly into memory; the other
segments with their linking information are loaded into a special working
file called SEGFILE that is created by the loader.
The actual loading of the segments during program execution is handled
by an overlay manager, OVLMGR. This is a special control section which is
automatically included in the root segment of the overlay program by the
loader. OVLMGR uses a segment table SEGTAB which has all the
information about the overlay program. SEGTAB also includes a special
transfer area for each segment except the root. If a segment is currently
2. Dynamic Linking
The linking function is performed at execution time. A subroutine is
loaded and linked to the rest of the program when it is first called. It
provides for the ability to load the routines only when (and if) they are
needed.
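As a loose analogy only, the idea of loading a routine on its first call can be sketched in Python using the standard importlib module. The class LazyRoutine is hypothetical and is not how an operating system implements dynamic linking; it only shows that nothing is loaded until the first call and that later calls reuse the loaded routine.

```python
import importlib

class LazyRoutine:
    """Stand-in for a routine that is loaded and linked on first call."""

    def __init__(self, module_name, routine_name):
        self.module_name = module_name
        self.routine_name = routine_name
        self._routine = None     # not loaded yet
        self.load_count = 0      # how many times loading actually happened

    def __call__(self, *args):
        if self._routine is None:
            # First call: load the module and link to the routine inside it.
            module = importlib.import_module(self.module_name)
            self._routine = getattr(module, self.routine_name)
            self.load_count += 1
        return self._routine(*args)

sqrt = LazyRoutine("math", "sqrt")
loads_before_first_call = sqrt.load_count
first = sqrt(16.0)     # triggers the load
second = sqrt(9.0)     # reuses the routine loaded by the first call
print(first, second, sqrt.load_count)
```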
COMPILERS
A compiler bridges the semantic gap between a Programming Language
domain and an execution domain. Two aspects of the compiler are:
1. To generate code that implements the meaning of a source program in
the execution domain and
2. To provide diagnostics for violations of the programming language
semantics in the source program.
For purposes of compiler construction a high level language is usually
described in terms of a grammar. The grammar specifies the form or
syntax of legal statements in the language.
For example an assignment statement might be defined by the grammar
as a variable name, followed by an assignment operator (:=) followed by
an expression. The problem of compilation becomes the matching of the
statements written by the programmer to structures defined by the
grammar, and generating the appropriate object code for each
statement.
The source program statements are regarded as sequences of tokens. Tokens
are the fundamental building blocks of the language. A token might be a
keyword, a variable name, an integer, an arithmetic operator etc. The task
of scanning the source statement, recognizing and classifying the various
tokens is known as lexical analysis. The part of the compiler that
performs this analytical function is called the scanner.
After the token scan, each statement in the program must be recognized
as some language construct, such as a declaration or an assignment
statement, described by the grammar. This process, which is called
syntactic analysis or parsing, is performed by a part of the compiler that
is called the parser. The last step in the basic translation process is
the generation of object code.
GRAMMARS
A grammar for a programming language is a formal description of the
syntax or form of programs and individual statements written in the
language. The grammar does not describe the semantics or meaning of
the various statements.
A number of different notations are used to write grammars. One of the
simplest and most widely used notations is BNF (Backus–Naur Form).
A BNF grammar consists of a set of rules each of which defines the
syntax of some construct in the programming language. Below is a BNF
grammar of a restricted Pascal Language.
1.  <prog>      ::= PROGRAM <prog-name> VAR <dec-list> BEGIN <stmt-list> END.
2.  <prog-name> ::= id
3.  <dec-list>  ::= <dec> | <dec-list> ; <dec>
4.  <dec>       ::= <id-list> : <type>
5.  <type>      ::= INTEGER
6.  <id-list>   ::= id | <id-list> , id
7.  <stmt-list> ::= <stmt> | <stmt-list> ; <stmt>
8.  <stmt>      ::= <assign> | <read> | <write> | <for>
9.  <assign>    ::= id := <exp>
10. <exp>       ::= <term> | <exp> + <term> | <exp> - <term>
11. <term>      ::= <factor> | <term> * <factor> | <term> DIV <factor>
12. <factor>    ::= id | int | ( <exp> )
13. <read>      ::= READ ( <id-list> )
14. <write>     ::= WRITE ( <id-list> )
15. <for>       ::= FOR <index-exp> DO <body>
16. <index-exp> ::= id := <exp> TO <exp>
17. <body>      ::= <stmt> | BEGIN <stmt-list> END
1.  PROGRAM STATS
2.  VAR
3.     SUM, SUMQ, I, VALUE, MEAN, VARIANCE : INTEGER
4.  BEGIN
5.     SUM := 0;
6.     SUMQ := 0;
7.     FOR I := 1 TO 100 DO
8.        BEGIN
9.           READ (VALUE);
10.          SUM := SUM + VALUE;
11.          SUMQ := SUMQ + VALUE * VALUE
12.       END;
13.    MEAN := SUM DIV 100;
14.    VARIANCE := SUMQ DIV 100 - MEAN * MEAN;
15.    WRITE (MEAN, VARIANCE)
16. END.
In rule 13 the non-terminal symbols are <read> and <id-list>, and the
terminal symbols are the tokens READ, (, and ). Thus the rule specifies
that a <read> consists of the token READ, followed by the token “(“ ,
followed by a language construct <id-list>, followed by the token “)”.
To recognize a <read> of course we need the definition of <id-list> which
is provided for in rule 6.
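Rules 13 and 6 can be turned into a small recognizer, sketched below. It assumes the scanner has already produced a list of (kind, value) pairs; this token format and the function names parse_read and parse_id_list are hypothetical. The left-recursive rule 6 is recognized with a loop, which accepts the same lists of identifiers.

```python
def parse_id_list(tokens, pos):
    """<id-list> ::= id | <id-list> , id   (recognized with a loop)."""
    if pos >= len(tokens) or tokens[pos][0] != "id":
        raise SyntaxError("identifier expected")
    names = [tokens[pos][1]]
    pos += 1
    while (pos + 1 < len(tokens)
           and tokens[pos][0] == ","
           and tokens[pos + 1][0] == "id"):
        names.append(tokens[pos + 1][1])
        pos += 2
    return names, pos

def expect(tokens, pos, kind):
    """Check that the token at `pos` has the given kind and step past it."""
    if pos >= len(tokens) or tokens[pos][0] != kind:
        raise SyntaxError(f"{kind} expected")
    return pos + 1

def parse_read(tokens):
    """<read> ::= READ ( <id-list> )   Returns the identifiers to be read."""
    pos = expect(tokens, 0, "READ")
    pos = expect(tokens, pos, "(")
    names, pos = parse_id_list(tokens, pos)
    pos = expect(tokens, pos, ")")
    return names

# READ (VALUE)
statement = [("READ", None), ("(", None), ("id", "VALUE"), (")", None)]
print(parse_read(statement))
```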
It is often convenient to display the analysis of a source statement in
terms of a grammar as a tree called the parse tree or syntax tree.
Below are parse trees for statement 9, READ (VALUE), and
statement 14, VARIANCE := SUMQ DIV 100 - MEAN * MEAN.
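Parse tree for READ (VALUE), with each node indented under its parent:
<read>
    READ
    (
    <id-list>
        id {VALUE}
    )

Parse tree for VARIANCE := SUMQ DIV 100 - MEAN * MEAN:
<assign>
    id {VARIANCE}
    :=
    <exp>
        <exp>
            <term>
                <term>
                    <factor>
                        id {SUMQ}
                DIV
                <factor>
                    int {100}
        -
        <term>
            <term>
                <factor>
                    id {MEAN}
            *
            <factor>
                id {MEAN}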
Lexical Analysis
This involves scanning the program to be compiled and recognizing the
tokens that make up the source statements. Scanners are usually
designed to recognize keywords, operators, identifiers, integers, floating
point numbers, character strings and other similar items that are written
as part of the source program.
Items such as identifiers and integers are usually recognized directly as
single tokens. Alternatively they could be defined as part of the grammar,
e.g.
<ident>  ::= <letter> | <ident> <letter> | <ident> <digit>
<letter> ::= A | B | C | D | ... | Z
<digit>  ::= 0 | 1 | 2 | 3 | ... | 9
Token      Code     Token    Code     Token    Code
PROGRAM    1        WRITE    9        -        17
VAR        2        TO       10       *        18
BEGIN      3        DO       11       DIV      19
END        4        ;        12       (        20
END.       5        :        13       )        21
INTEGER    6        ,        14       id       22
FOR        7        :=       15       int      23
READ       8        +        16
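A scanner that produces the token codes in the table above might look like the following sketch. It uses Python's standard re module; the function name scan and the (code, value) output format are assumptions made for illustration.

```python
import re

TOKEN_CODES = {
    "PROGRAM": 1, "VAR": 2, "BEGIN": 3, "END": 4, "END.": 5, "INTEGER": 6,
    "FOR": 7, "READ": 8, "WRITE": 9, "TO": 10, "DO": 11, ";": 12, ":": 13,
    ",": 14, ":=": 15, "+": 16, "-": 17, "*": 18, "DIV": 19, "(": 20, ")": 21,
}
ID, INT = 22, 23

# END. and := are listed before the shorter tokens they begin with.
TOKEN_RE = re.compile(
    r"\s*(END\.|[A-Za-z][A-Za-z0-9]*|[0-9]+|:=|[;:,+\-*()])", re.IGNORECASE
)

def scan(source):
    """Return a list of (token code, value) pairs for one source string."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unrecognized character at position {pos}")
        text = match.group(1).upper()
        pos = match.end()
        if text in TOKEN_CODES:
            tokens.append((TOKEN_CODES[text], None))   # keyword or operator
        elif text[0].isdigit():
            tokens.append((INT, int(text)))            # integer constant
        else:
            tokens.append((ID, text))                  # identifier
    return tokens

print(scan("SUM := SUM + VALUE;"))
```

Identifiers and integers carry their value along with the code, since the later phases need to know which variable or constant was seen.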
Syntactic Analysis
During syntactic analysis the source statements written by the
programmer are recognized as language constructs described by the
grammar being used. This may be regarded as building the parse tree
for the statements being translated. Parsing techniques are of two types;
bottom up and top down according to the way in which the parse tree is
being constructed.
Top down methods begin with the rule of the grammar that specifies the
goal of the analysis (i.e. the root of the tree), and attempt to construct
the tree so that the terminal nodes match the statements being
analyzed.
Bottom up methods begin with terminal nodes of the tree (the statements
being analyzed), and attempt to combine these into successively higher-
level nodes until the root is reached.
READ ( <N1> )
id {Value}
In part (i) the parser identifies the portion of the statement delimited by
the precedence relations < and >, which consists of the single token id.
This portion could be identified as a <factor> (rule 12), a <prog-name>
(rule 2) or an <id-list> (rule 6). The parser does not choose among these;
it simply interprets the token as some non-terminal symbol <N1>.
Precedence relations hold only between terminal symbols, so <N1> is
not involved in this process.
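2. VARIANCE := SUMQ DIV 100 - MEAN * MEAN ;
(i) id1 := id2 DIV int - id3 * id4 ;
< = < >
(ii) id1 := <N1> DIV int - id3 * id4 ;
< = < < >
where <N1> is the subtree formed from id2 {SUMQ}.
[Figure: the remaining steps of the parse, which reduce the statement
through further non-terminals up to <N7>.]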
Note that each portion of the parse tree is constructed from the terminal
nodes up towards the root, hence the term bottom up parsing.
There are a few differences between these parse trees and the first
ones. This is because the operator precedence parser is not concerned
with the names of the non-terminals, so it does not need to perform
additional steps such as recognizing an id first as a <factor> and then
as a <term>.
Code Generation
After the syntax has been analysed the last task of the compilation is the
generation of object code. A simple code generation technique is the
one that creates the object code for each part of the program as soon as
its syntax has been recognized.
The technique involves a set of routines one for each rule or alternative
rule in the grammar. When the parser recognizes a portion of the source
program according to some rule of the grammar, the corresponding
routine is executed. Such routines are called semantic routines because
the processing performed is related to the meaning associated with the
corresponding construct in the language. These semantic routines
generate object code directly so they can also be called code generation
routines.
The code to be generated depends upon the computer for which the
program is being compiled. We will illustrate the generation of object
code for the SIC/XE machine.
The code generation routines create segments of object code for the
compiled program which will be represented here using SIC assembler
language. The actual code generated is machine language not
assembler. As each piece of object code is generated a location counter
is updated to reflect the next available address in the compiled program.
Regardless of the method used to generate the parse tree, the parser
will always recognize at each step the left most substring of the input
that can be interpreted according to the rule of the grammar. In the
operator precedence method this recognition occurs when a substring of
the input is reduced to some non terminal <Ni>. The assembler code
below shows the symbolic representation of the object code to be
generated for the READ statement.
+JSUB XREAD
WORD 1
WORD VALUE
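The symbolic representation of the object code generated for the
assignment statement VARIANCE := SUMQ DIV 100 - MEAN * MEAN is:
LDA SUMQ
DIV #100
STA T1
LDA MEAN
MUL MEAN
STA T2
LDA T1
SUB T2
STA VARIANCE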
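The following sketch shows one way such code generation routines could be organized. It is not the routine set from any particular compiler: the class Generator, its method names and the tuple form used for expressions are all assumptions. Each operand is described by a one-element list holding either a name, an immediate value, or "rA" when the value is currently in register A; when A is needed for something else its contents are saved in a new temporary.

```python
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "DIV": "DIV"}

class Generator:
    def __init__(self):
        self.code = []         # generated instructions, in order
        self.temp_count = 0    # temporaries T1, T2, ... used so far
        self.in_a = None       # descriptor of the value now in register A

    def emit(self, op, operand):
        self.code.append(f"{op} {operand}")

    def save_a(self):
        """If register A holds an intermediate result, store it in a temporary."""
        if self.in_a is not None:
            self.temp_count += 1
            temp = f"T{self.temp_count}"
            self.emit("STA", temp)
            self.in_a[0] = temp
            self.in_a = None

    def load(self, descriptor):
        if descriptor[0] != "rA":
            self.save_a()
            self.emit("LDA", descriptor[0])

    def expression(self, node):
        if isinstance(node, str):
            return [node]              # identifier
        if isinstance(node, int):
            return [f"#{node}"]        # integer, used as an immediate operand
        op, left, right = node
        left_desc = self.expression(left)
        right_desc = self.expression(right)
        self.load(left_desc)
        self.emit(OPCODES[op], right_desc[0])
        result = ["rA"]
        self.in_a = result
        return result

    def assign(self, target, node):
        self.load(self.expression(node))
        self.emit("STA", target)
        self.in_a = None

    def read(self, names):
        self.emit("+JSUB", "XREAD")
        self.emit("WORD", len(names))
        for name in names:
            self.emit("WORD", name)

gen = Generator()
gen.assign("VARIANCE", ("-", ("DIV", "SUMQ", 100), ("*", "MEAN", "MEAN")))
print("\n".join(gen.code))
```

For the VARIANCE statement this sketch emits the same nine instructions as the listing above, including the two temporaries T1 and T2.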
Code-Optimization
Machine instructions that use registers as operands are usually faster
than the corresponding instructions that refer to locations in memory. It is
therefore better to keep in registers all variables and intermediate results
that will be used later in the program.
This corresponds to
LDA SUMQ
DIV #100
STA T1
LDA MEAN
MUL MEAN
STA T2
LDA T1
SUB T2
STA VARIANCE
corresponding to
LDA MEAN
MUL MEAN
STA T1
LDA SUMQ
DIV #100
SUB T1
STA VARIANCE
The resulting machine code requires fewer instructions and uses only
one temporary variable instead of two.
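That the two instruction sequences compute the same result can be checked with a toy interpreter. This is a sketch only: the function run models just the five instructions used here, treats DIV as integer division on non-negative values, and the sample values for SUMQ and MEAN are made up.

```python
ORIGINAL = ["LDA SUMQ", "DIV #100", "STA T1", "LDA MEAN", "MUL MEAN",
            "STA T2", "LDA T1", "SUB T2", "STA VARIANCE"]

REARRANGED = ["LDA MEAN", "MUL MEAN", "STA T1", "LDA SUMQ", "DIV #100",
              "SUB T1", "STA VARIANCE"]

def run(program, memory):
    """Execute the instructions with a single accumulator A."""
    memory = dict(memory)
    a = 0
    for line in program:
        op, operand = line.split()
        if op == "STA":
            memory[operand] = a
            continue
        # '#' marks an immediate operand; otherwise fetch from memory.
        value = int(operand[1:]) if operand.startswith("#") else memory[operand]
        if op == "LDA":
            a = value
        elif op == "DIV":
            a //= value
        elif op == "MUL":
            a *= value
        elif op == "SUB":
            a -= value
        else:
            raise ValueError(f"unsupported instruction {op}")
    return memory

start = {"SUMQ": 1250000, "MEAN": 100}      # made-up sample values
result_original = run(ORIGINAL, start)
result_rearranged = run(REARRANGED, start)
print(result_original["VARIANCE"], result_rearranged["VARIANCE"])
```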
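[Figure: static allocation of return addresses. (a) The system calls MAIN
(call 1) and the return address is stored in RETADR within MAIN. (b) MAIN
calls SUB (call 2) and the return address is stored in RETADR within SUB.
(c) SUB calls itself (call 3) and the return address is again stored in
RETADR within SUB.]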
If procedures may be called recursively, as in Pascal, static allocation
cannot be used. In the figure the program MAIN has been called by the
operating system (call 1). MAIN stores its return address at a fixed
location RETADR within MAIN.
MAIN calls SUB (call 2). The return address of this call is stored at a fixed
location within SUB. If SUB calls itself recursively as in fig (c) a problem
occurs because SUB stores the return address for call 3 into RETADR
from register L. This destroys the return address for call 2 and as a result
there is no possibility of ever making a correct return to MAIN.
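The loss of the return address can be mimicked in a few lines. This sketch models static allocation as one fixed RETADR slot per procedure; the dictionary and the function name call are hypothetical, and the "addresses" are just the names of the callers.

```python
# One fixed storage location per procedure, as with static allocation.
RETADR = {"MAIN": None, "SUB": None}

def call(procedure, return_to):
    """Calling a procedure stores the return address in its fixed slot."""
    RETADR[procedure] = return_to

call("MAIN", "SYSTEM")               # call 1: the system calls MAIN
call("SUB", "MAIN")                  # call 2: MAIN calls SUB
saved_for_call_2 = RETADR["SUB"]     # SUB can still return to MAIN here
call("SUB", "SUB")                   # call 3: SUB calls itself

# The slot now holds only the return address for call 3.
print(saved_for_call_2, RETADR["SUB"])
```

After call 3 no location holds the return address into MAIN any more, which is exactly the problem described above.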
A similar difficulty occurs with respect to any variables used by SUB.
When recursive calls are made variables within SUB may be set to new
values; however the previous values may be needed by call 2 of SUB
after the return from the recursive call.
It is therefore necessary to preserve the previous values of any variables
used in SUB including parameters, temporaries, return addresses,
register save areas etc.
This is usually accomplished by the dynamic storage allocation
technique where each procedure call creates an activation record that
contains storage for all the variables used by the procedure. If the
procedure is called recursively another activation record is created. Each
activation record is associated with a particular invocation of the
procedure. An activation record is not deleted until a return has been
made from the corresponding invocation. The starting address for the
current activation record is usually contained in a base register which is
used by the procedure to address its variables. In this way the values of
variables used by the different invocations of a procedure are kept
separate from one another.
Activation records are typically allocated on a stack, with the current
record on top of the stack.
In the diagram below, (a) MAIN has been called and its activation record
appears on the stack. The base register B is set to indicate the starting
address of the current activation record. The first word in an
activation record contains a pointer PREV to the previous record on the
stack.
[Figure: the stack of activation records, each containing PREV, NEXT,
RETADR and the variables of the procedure, with register B pointing to the
current record. (a) After the system calls MAIN (call 1). (b) After MAIN
calls SUB (call 2). (c) After SUB calls itself (call 3). (d) After SUB
returns from the recursive call.]
The second word of the activation record contains a pointer NEXT to the
first unused word of the stack, which will be the starting address for the
next activation record created. The third word contains the return
address for this invocation of the procedure, and the remaining words
contain the values of all the variables used by the procedure.
In diagram (b) MAIN has called SUB. On the top of the stack a new
activation record has been created with register B set to indicate the new
current record. The pointers PREV and NEXT are set as shown.
In (c) SUB has called itself recursively and another activation record has
been created.
When a procedure returns to its caller the current activation record is
deleted. The pointer PREV in the deleted record is used to reestablish
the previous activation record as the current one and execution
continues.
Fig (d) shows how the stack would appear after SUB returns from the
recursive call. Register B has been reset to point to the activation record
for the previous invocation of SUB. The return address and all the
variable values in this activation record are exactly the same as they
were before the recursive call.
This technique is called automatic allocation of storage. In this technique
the compiler generates code for references to variables using some sort
of relative addressing. The compiler assigns each variable an address
which is relative to the beginning of the activation record instead of an
actual location within the program. The address of the current activation
record is contained in register B, so a variable can be referenced using
base relative addressing: the displacement in the instruction is
the relative address of the variable within the activation record.
The compiler also generates additional code to manage the activation
records themselves. At the beginning of each procedure there must be
code to create a new activation record, linking it to the previous one and
setting the appropriate pointers. This code is often called the prologue
for the procedure. At the end of the procedure there must be code to delete
the current activation record, resetting pointers as needed. This code
is called the epilogue.
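The record layout and the work done by the prologue and epilogue can be sketched as below. The stack is modelled as a list of words with PREV, NEXT and RETADR in the first three words of each record, as described above; the class ActivationStack, the use of None for a null pointer and the sample values are assumptions made for the illustration.

```python
PREV, NEXT, RETADR, VARS = 0, 1, 2, 3      # word offsets within a record

class ActivationStack:
    def __init__(self, size=64):
        self.words = [None] * size
        self.b = None                      # register B: start of current record

    def next_free(self):
        """First unused word of the stack (NEXT of the current record)."""
        return 0 if self.b is None else self.words[self.b + NEXT]

    def prologue(self, return_address, variable_count):
        """Create a record for a new invocation and make it current."""
        start = self.next_free()
        self.words[start + PREV] = self.b
        self.words[start + NEXT] = start + VARS + variable_count
        self.words[start + RETADR] = return_address
        self.b = start

    def epilogue(self):
        """Delete the current record; return the saved return address."""
        return_address = self.words[self.b + RETADR]
        self.b = self.words[self.b + PREV]
        return return_address

    # Variables are addressed relative to register B.
    def set_var(self, displacement, value):
        self.words[self.b + VARS + displacement] = value

    def get_var(self, displacement):
        return self.words[self.b + VARS + displacement]

stack = ActivationStack()
stack.prologue("SYSTEM", 2)     # (a) call 1: MAIN
stack.prologue("MAIN", 1)       # (b) call 2: SUB
stack.set_var(0, 111)
stack.prologue("SUB", 1)        # (c) call 3: SUB calls itself
stack.set_var(0, 222)
returned_to = stack.epilogue()  # (d) return from the recursive call
print(returned_to, stack.get_var(0), stack.words[stack.b + RETADR])
```

After the recursive call returns, the variable and return address belonging to call 2 are unchanged, because each invocation used its own record.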
Other types of dynamic storage allocation allow the programmer to
specify when storage is to be assigned. In PL/I the statement
ALLOCATE (A) allocates storage for the variable A while FREE (A)
releases the storage assigned to A by the previous ALLOCATE. This
feature is called controlled storage in PL/I.
In Pascal the statement NEW(P) allocates storage for a variable and
sets the pointer P to indicate the variable just created. The statement
DISPOSE(P) releases the storage that was previously assigned to the
variable pointed to by P.
Structured Variables
These include arrays, records, strings, sets etc.
Consider an array A: ARRAY[1..10] of integer. If each integer variable
occupies one word of memory, then ten words have to be allocated to store
this array.
In general an array ARRAY[l..u] of integer needs an allocation of u-l+1
words of storage.
For a two-dimensional array like B: ARRAY[0..3,1..6] of integer, the first
subscript can take on 4 different values (0-3) and the second subscript can
take on 6 values. We need to allocate a total of 4 * 6 = 24 words to store
the array. In general an array ARRAY[l1..u1, l2..u2] of integer needs to be
allocated (u1-l1+1)*(u2-l2+1) words of storage.
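The storage requirement can be computed directly from the subscript bounds. The function name words_needed is hypothetical, and one word per element is assumed as in the text.

```python
def words_needed(bounds):
    """bounds is a list of (lower, upper) pairs, one per dimension."""
    total = 1
    for lower, upper in bounds:
        total *= upper - lower + 1
    return total

size_a = words_needed([(1, 10)])            # A: ARRAY[1..10]
size_b = words_needed([(0, 3), (1, 6)])     # B: ARRAY[0..3,1..6]
print(size_a, size_b)
```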
To generate code for array references it is important to know which array
element corresponds to each word of allocated storage. For a one
The arrangement in which all elements that have the same value of the
second subscript are stored together is called column-major order.
0,1 1,1 2,1 3,1 0,2 1,2 2,2 3,2 0,3 1,3 2,3 3,3 0,4 1,4 2,4 3,4 0,5 1,5 2,5 3,5 0,6 1,6 2,6 3,6
Compilers for most high level languages store arrays using row-major
order; FORTRAN compilers however store arrays in column-major order.
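The two storage orders can be listed with a short sketch. The function name storage_order is hypothetical; row-major order is taken in its usual sense, with all elements that have the same first subscript stored together.

```python
def storage_order(bounds, column_major):
    """Return the subscript pairs of a two-dimensional array in storage order."""
    (l1, u1), (l2, u2) = bounds
    if column_major:
        # The second subscript varies slowest.
        return [(i, j) for j in range(l2, u2 + 1) for i in range(l1, u1 + 1)]
    # Row-major: the first subscript varies slowest.
    return [(i, j) for i in range(l1, u1 + 1) for j in range(l2, u2 + 1)]

B_BOUNDS = ((0, 3), (1, 6))                 # B: ARRAY[0..3,1..6]
column_order = storage_order(B_BOUNDS, column_major=True)
row_order = storage_order(B_BOUNDS, column_major=False)
print(column_order[:5])
print(row_order[:5])
```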
To refer to an array element we calculate the address of the referenced
element relative to the base address of the array. The compiler will
generate code to place this relative address in an index register.
Assume a one dimensional array A: ARRAY[1..10] of integer and
suppose that a statement refers to an array element A[6]. There are five
array elements preceding A[6]; on a SIC machine each element will
occupy 3 bytes, thus the address of A[6] relative to the starting address
of the array is given by 5 x 3 = 15.
In general, for an element A[s] of a one-dimensional array A:
ARRAY[l..u] in which each element occupies w bytes of storage, the
address of the element relative to the start of the array is
w * (s - l)
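The formula can be written as a small function. The two-dimensional versions shown alongside it are the standard generalizations for row-major and column-major order rather than formulas taken from the text above, and all function names are hypothetical.

```python
def offset_1d(s, lower, w):
    """Relative address of A[s] in ARRAY[lower..u], w bytes per element."""
    return w * (s - lower)

def offset_row_major(s1, s2, bounds, w):
    """Relative address of B[s1,s2] when rows are stored together."""
    (l1, u1), (l2, u2) = bounds
    return w * ((s1 - l1) * (u2 - l2 + 1) + (s2 - l2))

def offset_column_major(s1, s2, bounds, w):
    """Relative address of B[s1,s2] when columns are stored together."""
    (l1, u1), (l2, u2) = bounds
    return w * ((s2 - l2) * (u1 - l1 + 1) + (s1 - l1))

# A[6] in ARRAY[1..10] with 3-byte elements, as in the SIC example.
a6 = offset_1d(6, 1, 3)

# B[0,2] in ARRAY[0..3,1..6] with 3-byte elements, under each ordering.
B_BOUNDS = ((0, 3), (1, 6))
b_row = offset_row_major(0, 2, B_BOUNDS, 3)
b_col = offset_column_major(0, 2, B_BOUNDS, 3)
print(a6, b_row, b_col)
```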
MACROPROCESSORS
A macro instruction (often abbreviated as a macro) represents a
commonly used group of statements in the source program language.
The macro processor replaces each macro instruction with the
corresponding group of source language statements. This is called
expanding the macros. Macro instructions therefore allow the
programmer to write a short hand version of a program. The functions of
a macro processor essentially involve the substitution of one group of
characters or lines for another.
The figure above shows the output that would be generated. In the
expanded form:
The macro instruction definitions have been deleted.
[Figure: the macro processor tables NAMTAB, DEFTAB and ARGTAB.]
ARGTAB
Position    Argument
1           F1
2           BUFFER
3           LENGTH
Parameters have been converted to a positional notation: the parameter
&INDEV has been converted to ?1, etc. The first argument in the figure
above is F1.
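Conversion to positional notation and substitution from ARGTAB might be sketched as follows. The macro body, the parameter names &BUFADR and &RECLTH, and the function names are hypothetical and chosen only for illustration; only &INDEV and the arguments F1, BUFFER and LENGTH come from the example.

```python
def define_macro(parameters, body):
    """Store the body with each parameter replaced by ?1, ?2, ... (as in DEFTAB)."""
    # Replace longer names first in case one name is a prefix of another.
    numbered = sorted(enumerate(parameters, start=1), key=lambda p: -len(p[1]))
    stored = []
    for line in body:
        for position, name in numbered:
            line = line.replace(name, f"?{position}")
        stored.append(line)
    return stored

def expand_macro(stored_body, arguments):
    """Substitute arguments from ARGTAB for the positional markers."""
    argtab = {position: arg for position, arg in enumerate(arguments, start=1)}
    expanded = []
    for line in stored_body:
        # Highest positions first so that ?10 is not mistaken for ?1.
        for position in sorted(argtab, reverse=True):
            line = line.replace(f"?{position}", argtab[position])
        expanded.append(line)
    return expanded

# Hypothetical macro body using three parameters.
BODY = ["TD =X'&INDEV'", "STCH &BUFADR,X", "STX &RECLTH"]
deftab_body = define_macro(["&INDEV", "&BUFADR", "&RECLTH"], BODY)
expansion = expand_macro(deftab_body, ["F1", "BUFFER", "LENGTH"])
print(deftab_body)
print(expansion)
```

Storing the body in positional form means that expansion needs only the argument's position, not a search for parameter names.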
expanded. For the first macro expansion in a program XX will have the
value AA. For succeeding macro expansions, XX will be set to AB, AC
etc.
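The sequence of XX values can be generated from a count of the expansions performed so far. This sketch assumes XX uses only the letters A to Z; how the sequence continues after AZ is not stated above, so the step to BA is an assumption, as is the function name.

```python
import string

def expansion_suffix(count):
    """Two-character value of XX; count is 0 for the first macro expansion."""
    letters = string.ascii_uppercase
    if not 0 <= count < len(letters) ** 2:
        raise ValueError("expansion count out of range")
    return letters[count // 26] + letters[count % 26]

suffixes = [expansion_suffix(n) for n in range(3)]
print(suffixes)
```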