SS Mod 2 Full


WWW.VIDYARTHIPLUS.COM

MODULE 2- Assemblers
ELEMENTS OF ASSEMBLY LANGUAGE PROGRAMMING

An assembly language is a machine-dependent, low-level programming language which is specific to a
certain computer system (or a family of computer systems). Compared to the machine language of a computer
system, it provides three basic features which simplify programming:

1. Mnemonic operation codes: Use of mnemonic operation codes (also called mnemonic opcodes) for
machine instructions eliminates the need to memorize numeric operation codes. It also enables the assembler
to provide helpful diagnostics, for example indication of misspelt operation codes.

2. Symbolic operands: Symbolic names can be associated with data or instructions. These symbolic names can
be used as operands in assembly statements. The assembler performs memory bindings to these names; the
programmer need not know any details of the memory bindings performed by the assembler. This leads to a
very important practical advantage during program modification as discussed in Section 4.1.2.

3. Data declarations: Data can be declared in a variety of notations, including the decimal notation. This
avoids manual conversion of constants into their internal machine representation, for example, conversion of
-5 into (1111 1010)₂ or 10.5 into (41A80000)₁₆.
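
The two bit patterns quoted in item 3 can be reproduced mechanically. The sketch below is only an illustration: the text does not say which internal representations it has in mind, so the code assumes an 8-bit ones' complement integer and a 32-bit hexadecimal floating-point layout (1 sign bit, 7-bit excess-64 exponent of 16, 24-bit fraction), because those assumptions yield exactly the quoted values. The function names are hypothetical.

```python
def ones_complement_8bit(value: int) -> str:
    """Return the 8-bit ones' complement bit pattern of a small integer (assumed format)."""
    if not -127 <= value <= 127:
        raise ValueError("value does not fit in 8-bit ones' complement")
    if value >= 0:
        bits = value
    else:
        bits = (~(-value)) & 0xFF  # invert every bit of the magnitude
    return format(bits, "08b")


def hex_float_32bit(value: float) -> str:
    """Encode a number in an assumed 32-bit hexadecimal floating-point layout:
    1 sign bit, 7-bit excess-64 exponent (power of 16), 24-bit fraction."""
    if value == 0:
        return "00000000"
    sign = 1 if value < 0 else 0
    magnitude = abs(value)
    exponent = 0
    while magnitude >= 1:        # scale down until the fraction is below 1
        magnitude /= 16
        exponent += 1
    while magnitude < 1 / 16:    # scale up until the first hex digit is non-zero
        magnitude *= 16
        exponent -= 1
    fraction = int(magnitude * (1 << 24))
    word = (sign << 31) | ((exponent + 64) << 24) | fraction
    return format(word, "08X")


print(ones_complement_8bit(-5))  # 11111010
print(hex_float_32bit(10.5))     # 41A80000
```

An assembler performs this kind of conversion for every declared constant, which is why the programmer can write -5 or 10.5 directly.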

Statement format

An assembly language statement has the following format:

[Label] <Opcode> <operand spec>[,<operand spec> ..]

where the notation [..] indicates that the enclosed specification is optional. If a label is specified in a
statement, it is associated as a symbolic name with the memory word(s) generated for the statement. <operand
spec> has the following syntax:

<symbolic name> [+ <displacement> ] [(<index register> )]

Thus, some possible operand forms are: AREA, AREA+5, AREA(4), and AREA+5(4). The first specification
refers to the memory word with which the name AREA is associated. The second specification refers to the
memory word 5 words away from the word with the name AREA. Here ‘5’ is the displacement or offset from
AREA. The third specification implies indexing with index register 4; that is, the operand address is obtained
by adding the contents of index register 4 to the address of AREA. The last specification is a combination of
the previous two specifications.
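
The operand syntax above can be sketched as a small parser. This is a minimal illustration, not the assembler's actual code: the regular expression, the function names, and the sample address of AREA and register contents are all assumptions made for the example.

```python
import re

# <symbolic name> [+ <displacement>] [(<index register>)]
OPERAND_RE = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z0-9]*)"   # symbolic name
    r"(?:\+(?P<disp>\d+))?"              # optional displacement
    r"(?:\((?P<index>\d+)\))?$"          # optional index register number
)


def parse_operand(spec):
    """Split an operand spec into (name, displacement, index register or None)."""
    m = OPERAND_RE.match(spec.strip())
    if m is None:
        raise ValueError(f"malformed operand spec: {spec!r}")
    index = int(m["index"]) if m["index"] is not None else None
    return m["name"], int(m["disp"] or 0), index


def operand_address(spec, symtab, index_regs):
    """Address = address of the name + displacement + contents of the index register."""
    name, disp, index = parse_operand(spec)
    address = symtab[name] + disp
    if index is not None:
        address += index_regs[index]
    return address


symtab = {"AREA": 100}   # hypothetical address of AREA
index_regs = {4: 10}     # hypothetical contents of index register 4
for spec in ("AREA", "AREA+5", "AREA(4)", "AREA+5(4)"):
    print(spec, parse_operand(spec), operand_address(spec, symtab, index_regs))
```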

A simple assembly language

A simple assembly language is used to illustrate features of assembly languages and techniques used in
assemblers. In this language, each statement has two operands. The first operand is always a register, which can
be any one of AREG, BREG, CREG and DREG. The second operand refers to a memory word using a
symbolic name and an optional displacement. (Note that indexing is not permitted.)


Figure 4.1 lists the mnemonic opcodes for machine instructions. The MOVE instructions move a value
between a memory word and a register. In the MOVER instruction the second operand is the source operand
and the first operand is the target operand. The converse is true for the MOVEM instruction. All arithmetic is
performed in a register (i.e. the result replaces the contents of a register) and sets a condition code. A
comparison instruction sets a condition code analogous to a subtract instruction without affecting the values of
its operands. The condition code can be tested by a Branch on Condition (BC) instruction. The assembly
statement corresponding to it has the format

BC <condition code spec>, <memory address>

It transfers control to the memory word with the address <memory address> if the current value of condition
code matches <condition code spec>. For simplicity, we assume <condition code spec> to be a character
string with obvious meaning, e.g. GT, EQ, etc. A BC statement with the condition code spec ANY implies
unconditional transfer of control. In a machine language program, we show all addresses and constants in
decimal rather than in octal or hexadecimal.

Figure 4.2 shows the machine instruction format. The opcode, register operand and memory operand occupy
2, 1 and 3 digits, respectively. The sign is not a part of the instruction. The condition code specified in a BC
statement is encoded into the first operand position using the codes 1-6 for the specifications LT, LE, EQ, GT,
GE and ANY, respectively. Figure 4.3 shows an assembly language program and an equivalent machine
language program.
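
The 2-1-3 digit layout can be sketched as a tiny encoder. The condition codes 1-6 are the ones listed above; the numeric opcodes and register codes are assumptions made for illustration, since Figure 4.1 is not reproduced in these notes. The name `encode` is hypothetical.

```python
# Assumed numeric opcodes and register codes (Figure 4.1 is not reproduced here).
OPCODES = {"STOP": 0, "ADD": 1, "SUB": 2, "MULT": 3, "MOVER": 4, "MOVEM": 5,
           "COMP": 6, "BC": 7, "DIV": 8, "READ": 9, "PRINT": 10}
REGISTERS = {"AREG": 1, "BREG": 2, "CREG": 3, "DREG": 4}
# Condition codes as given in the text.
CONDITIONS = {"LT": 1, "LE": 2, "EQ": 3, "GT": 4, "GE": 5, "ANY": 6}


def encode(mnemonic, first_operand, address):
    """Return the instruction as 'oo r aaa': 2-digit opcode, 1-digit register
    or condition code, 3-digit decimal memory address."""
    opcode = OPCODES[mnemonic]
    if mnemonic == "BC":
        first = CONDITIONS[first_operand]   # condition code goes in the register slot
    else:
        first = REGISTERS[first_operand]
    if not 0 <= address <= 999:
        raise ValueError("memory operand must fit in 3 decimal digits")
    return f"{opcode:02d} {first} {address:03d}"


print(encode("MOVER", "AREG", 217))  # 04 1 217
print(encode("BC", "ANY", 202))      # 07 6 202
```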

4.1.1 Assembly Language Statements

An assembly program contains three kinds of statements:

1. Imperative statements
2. Declaration statements
3. Assembler directives.

Imperative statements

An imperative statement indicates an action to be performed during the execution of the assembled program.
Each imperative statement typically translates into one machine instruction.

Declaration statements

The syntax of declaration statements is as follows:

[Label] DS <constant>

[Label] DC ‘<value>’

The DS (short for declare storage) statement reserves areas of memory and associates names with them.
Consider the following DS statements:

A DS 1
G DS 200

The first statement reserves a memory area of 1 word and associates the name A with it. The second statement
reserves a block of 200 memory words. The name G is associated with the first word of the block. Other
words in the block can be accessed through offsets from G, e.g. G+5 is the sixth word of the memory block,
etc.

The DC (short for declare constant) statement constructs memory words containing constants. The statement

ONE DC ‘1’

associates the name ONE with a memory word containing the value ‘1’. The programmer can declare
constants in different forms—decimal, binary, hexadecimal, etc. The assembler converts them to the
appropriate internal form.

Use of constants

Contrary to the name ‘declare constant’, the DC statement does not really implement constants; it merely
initializes memory words to given values. These values are not protected by the assembler; they may be
changed by moving a new value into the memory word. For example, in Fig. 4.3 the value of ONE can be
changed by executing an instruction MOVEM BREG, ONE.

An assembly program can use constants in the sense implemented in an HLL in two ways: as immediate
operands, and as literals. Immediate operands can be used in an assembly statement only if the architecture of
the target machine includes the necessary features. In such a machine, the assembly statement

ADD AREG,5

is translated into an instruction with two operands—AREG and the value ‘5’ as an immediate operand. Note
that our simple assembly language does not support this feature, whereas the assembly language of Intel 8086
supports it (see Section 4.5).

A literal is an operand with the syntax =‘<value>’. It differs from a constant because its location cannot be
specified in the assembly program. This helps to ensure that its value is not changed during execution of a
program. It differs from an immediate operand because no architectural provision is needed to support its use.
An assembler handles a literal by mapping its use into other features of the assembly language. Figure 4.4(a)
shows use of a literal =‘5’. Figure 4.4(b) shows an equivalent arrangement using a DC statement FIVE DC
‘5’. When the assembler encounters the use of a literal in the operand field of a statement, it handles the literal
using an arrangement similar to that shown in Fig. 4.4(b)—it allocates a memory word to contain the value of
the literal, and replaces the use of the literal in a statement by an operand expression referring to this word.
The value of the literal is protected by the fact that the name and address of this word are not known to the
assembly language programmer.

Assembler directives

Assembler directives instruct the assembler to perform certain actions during the assembly of a program.
Some assembler directives are described in the following.

START <constant>

This directive indicates that the first word of the target program generated by the assembler should be placed
in the memory word with address <constant>.

END [<operand spec>]

This directive indicates the end of the source program. The optional <operand spec> indicates the address of
the instruction where the execution of the program should begin. (By default, execution begins with the first
instruction of the assembled program.)

LTORG

Fig. 4.4 has shown how literals can be handled in two steps. First, the literal is treated as if it is a <value> in a
DC statement, i.e. a memory word containing the value of the literal is formed. Second, this memory word is
used as the operand in place of the literal. Where should the assembler place the word corresponding to the
literal? Obviously, it should be placed such that control never reaches it during the execution of a program.
The LTORG statement permits a programmer to specify where literals should be placed. By default, the
assembler places the literals after the END statement.

At every LTORG statement, as also at the END statement, the assembler allocates memory to the literals of a
literal pool. The pool contains all literals used in the program since the start of the program or since the last
LTORG statement.

Example 4.3 In Fig. 4.8, the literals =‘5’ and =‘1’ are added to the literal pool in statements 2 and 6,
respectively. The first LTORG statement (statement number 13) allocates the addresses 211 and 212 to the
values ‘5’ and ‘1’. A new literal pool is now started. The value ‘1’ is put into this pool in statement 15. This
value is allocated the address 219 while processing the END statement. The literal =‘1’ used in statement 15
therefore refers to location 219 of the second pool of literals rather than location 212 of the first pool. Thus,
all references to literals are forward references by definition.

LITTAB

The assembler handles literal operands using a basic data structure, the literal table (LITTAB). For each literal
used, LITTAB contains the literal name, the operand value and length, and the address assigned to the operand
when it is placed in the literal pool.

In addition to the mnemonic machine instructions, we have used the following assembler directives:

START Specify name and starting address for the program.

END Indicate the end of the source program and (optionally) specify the first executable instruction
in the program.

BYTE Generate character or hexadecimal constant, occupying as many bytes as needed to represent
the constant.

Eg: BYTE X’F1’

WORD Generate one-word integer constant.

Eg: THREE WORD 3

RESB Reserve the indicated number of bytes for a data area.

Eg: RESB 4096 reserves 4096 bytes for a data area.

RESW Reserve the indicated number of words for a data area.

Eg: RESW 5 reserves 5 words for a data area.

Program Blocks

In all of the examples we have seen so far the program being assembled was treated as a unit. The source
programs logically contained subroutines, data areas, etc. However, they were handled by the assembler as
one entity, resulting in a single block of object code. Within this object program the generated machine
instructions and data appeared in the same order as they were written in the source program.

Many assemblers provide features that allow more flexible handling of the source and object programs. Some
features allow the generated machine instructions and data to appear in the object program in a different order
from the corresponding source statements. Other features result in the creation of several independent parts of
the object program. These parts maintain their identity and are handled separately by the loader. We use the
term program blocks to refer to segments of code that are rearranged within a single object program unit, and
control sections to refer to segments that are translated into independent object program units. (This
terminology is, unfortunately, far from uniform. As a matter of fact, in some systems the same assembler
language feature is used to accomplish both of these logically different functions.) In this section we consider
the use of program blocks and how they are handled by the assembler. Section 2.3.5 discusses control sections
and their use.

Figure 2.11 shows our example program as it might be written using program blocks. In this case three blocks
are used. The first (unnamed) program block contains the executable instructions of the program. The second
(named CDATA) contains all data areas that are a few words or less in length. The third (named CBLKS)
contains all data areas that consist of larger blocks of memory. Some possible reasons for making such a
division are discussed later in this section.

The assembler directive USE indicates which portions of the source program belong to the various blocks. At
the beginning of the program, statements are assumed to be part of the unnamed (default) block; if no USE
statements are included, the entire program belongs to this single block. The USE statement on line 92 signals
the beginning of the block named CDATA.

Source statements are associated with this block until the USE statement on line 103, which begins the block
named CBLKS. The USE statement may also indicate a continuation of a previously begun block. Thus the
statement on line 123 resumes the default block, and the statement on line 183 resumes the block named
CDATA.

As we can see, each program block may actually contain several separate segments of the source program.
The assembler will (logically) rearrange these segments to gather together the pieces of each block. These
blocks will then be assigned addresses in the object program, with the blocks appearing in the same order in
which they were first begun in the source program. The result is the same as if the programmer had physically
rearranged the source statements to group together all the source lines belonging to each block.

The assembler accomplishes this logical rearrangement of code by maintaining, during Pass 1, a separate
location counter for each program block. The location counter for a block is initialized to 0 when the block is
first begun. The current value of this location counter is saved when switching to another block, and the saved
value is restored when resuming a previous block. Thus during Pass 1 each label in the program is assigned an
address that is relative to the start of the block that contains it.

When labels are entered into the symbol table, the block name or number is stored along with the assigned
relative address. At the end of Pass 1 the latest value of the location counter for each block indicates the length
of that block. The assembler can then assign to each block a starting address in the object program (beginning
with relative location 0). The algorithm for Pass 1 is shown in Fig. 2.12(b).

For code generation during Pass 2, the assembler needs the address for each symbol relative to the start of the
object program (not the start of an individual program block). This is easily found from the information in
SYMTAB. The assembler simply adds the location of the symbol, relative to the start of its block, to the
assigned block starting address (see Fig. 2.12c).
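
The bookkeeping described above (one location counter per block in Pass 1, then block start addresses, then absolute symbol addresses) can be sketched as follows. The class, its method names, and the labels PART1, PART2 and INPUT with their sizes are hypothetical; they are chosen only so that the block lengths and the position of LENGTH agree with the figures quoted in this section.

```python
class BlockAssembler:
    """Minimal Pass 1 bookkeeping for program blocks ("" is the default block)."""

    def __init__(self):
        self.locctr = {"": 0}   # block name -> current location counter
        self.order = [""]       # blocks in the order they were first begun
        self.current = ""
        self.symtab = {}        # label -> (block name, relative address)

    def use(self, name=""):
        """USE directive: begin a new block or resume an earlier one."""
        if name not in self.locctr:
            self.locctr[name] = 0
            self.order.append(name)
        self.current = name

    def define(self, label, size):
        """Enter a label at the current location and advance the counter."""
        self.symtab[label] = (self.current, self.locctr[self.current])
        self.locctr[self.current] += size

    def block_starts(self):
        """End of Pass 1: lay the blocks out one after another."""
        starts, address = {}, 0
        for name in self.order:
            starts[name] = address
            address += self.locctr[name]   # final counter value = block length
        return starts

    def address_of(self, label):
        """Pass 2: block start address + address relative to the block."""
        block, relative = self.symtab[label]
        return self.block_starts()[block] + relative


asm = BlockAssembler()
asm.define("PART1", 0x36)                      # hypothetical code segment
asm.use("CDATA"); asm.define("RETADR", 3); asm.define("LENGTH", 3)
asm.use("CBLKS"); asm.define("BUFFER", 0x1000)
asm.use();        asm.define("PART2", 0x30)    # default block resumed
asm.use("CDATA"); asm.define("INPUT", 5)       # hypothetical remaining data
print({k: hex(v) for k, v in asm.block_starts().items()})
print(hex(asm.address_of("LENGTH")))           # 0x69
```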

Figure 2.12(a) demonstrates this process applied to our sample program. The column headed Loc/Block
shows the relative address (within a program block) assigned to each source line and a block number
indicating which program block is involved (0 = default block, 1 = CDATA, 2 = CBLKS). This is essentially
the same information that is stored in SYMTAB for each symbol. Notice that the value of the symbol
MAXLEN (line 107) is shown without a block number. This indicates that MAXLEN is an absolute symbol,
whose value is not relative to the start of any program block.

At the end of Pass 1 the assembler constructs a table that contains the starting addresses and lengths for all
blocks. For our sample program, this table looks like

Block name   Block number   Address   Length
(default)    0              0000      0066
CDATA        1              0066      000B
CBLKS        2              0071      1000

SYMTAB shows the value of the operand (the symbol LENGTH) as relative location 0003 within program
block 1 (CDATA). The starting address for CDATA is 0066. Thus the desired target address for this
instruction is 0003 + 0066 = 0069. The instruction is to be assembled using program-counter relative
addressing. When the instruction is executed, the program counter contains the address of the following
instruction (line 25). The address of this instruction is relative location 0009 within the default block. Since
the default block starts at location 0000, this address is simply 0009. Thus the required displacement is
0069 - 0009 = 0060 (all values in hexadecimal). The calculation of the other addresses during Pass 2 follows
a similar pattern.
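
The displacement calculation above can be checked with a short sketch. The Format 3 bit layout used here (6-bit opcode, flags n i x b p e, 12-bit displacement) and the opcode value 00 for LDA are assumptions taken from the usual SIC/XE definition, since neither is stated in this passage; the function names are hypothetical.

```python
def pc_relative_disp(target, next_instruction):
    """Displacement for program-counter relative addressing (12-bit signed field)."""
    disp = target - next_instruction
    if not -2048 <= disp <= 2047:
        raise ValueError("target out of range for program-counter relative addressing")
    return disp & 0xFFF   # keep the 12-bit two's complement pattern


def format3(opcode, disp, n=1, i=1, x=0, b=0, p=1):
    """Assemble a Format 3 instruction as 6 hex digits (assumed bit layout)."""
    first_byte = (opcode & 0xFC) | (n << 1) | i
    flags = (x << 3) | (b << 2) | (p << 1)   # e = 0 for Format 3
    return f"{first_byte:02X}{flags:X}{disp:03X}"


block_start = {"CDATA": 0x0066}
target = block_start["CDATA"] + 0x0003   # LENGTH: relative 0003 in CDATA
disp = pc_relative_disp(target, 0x0009)  # PC holds the address of the next instruction
print(f"{target:04X} {disp:03X}")        # 0069 060
print(format3(0x00, disp))               # 032060 (LDA assumed to have opcode 00)
```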

We can immediately see that the separation of the program into blocks has considerably reduced our
addressing problems. Because the large buffer area is moved to the end of the object program, we no longer
need to use extended format instructions on lines 15, 35, and 65. Furthermore, the base register is no longer
necessary; we have deleted the LDB and BASE statements previously on lines 13 and 14. The problem of
placement of literals (and literal references) in the program is also much more easily solved. We simply
include a LTORG statement in the CDATA block to be sure that the literals are placed ahead of any large data
areas.


Of course the use of program blocks has not accomplished anything we could not have done by rearranging
the statements of the source program. For example, program readability is often improved if the definitions of
data areas are placed in the source program close to the statements that reference them. This could be
accomplished in a long subroutine (without using program blocks) by simply inserting data areas in any
convenient position. However, the programmer would need to provide Jump instructions to branch around the
storage thus reserved.

In the situation just discussed, machine considerations suggested that the parts of the object program appear in
memory in a particular order. On the other hand, human factors suggested that the source program should be
in a different order. The use of program blocks is one way of satisfying both of these requirements, with the
assembler providing the required reorganization.

It is not necessary to physically rearrange the generated code in the object program to place the pieces of each
program block together. The assembler can simply write the object code as it is generated during Pass 2 and
insert the proper load address in each Text record. These load addresses will, of course, reflect the starting
address of the block as well as the relative location of the code within the block. This process is illustrated in
Fig. 2.13. The first two Text records are generated from the source program lines 5 through 70. When the
USE statement on line 92 is recognized, the assembler writes out the current Text record (even though there is
still room left in it).

The assembler then prepares to begin a new Text record for the new program block. As it happens, the
statements on lines 95 through 105 result in no generated code, so no new Text records are created. The next
two Text records come from lines 125 through 180. This time the statements that belong to the next program
block do result in the generation of object code. The fifth Text record contains the single byte of data from
line 185. The sixth Text record resumes the default program block and the rest of the object program
continues in similar fashion.

It does not matter that the Text records of the object program are not in sequence by address; the loader will
simply load the object code from each record at the indicated address. When this loading is completed, the
generated code from the default block will occupy relative locations 0000 through 0065; the generated code
and reserved storage for CDATA will occupy locations 0066 through 0070; and the storage reserved for
CBLKS will occupy locations 0071 through 1070. Figure 2.14 traces the blocks of the example program
through this process of assembly and loading. Notice that the program segments marked CDATA(1) and
CBLKS(1) are not actually present in the object program. Because of the way the addresses are assigned,
storage will automatically be reserved for these areas when the program is loaded.
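
The point that Text records need not be in address order can be illustrated with a toy loader. This is a sketch only: records are modelled as (load address, bytes) pairs, the byte values are made up, and `load` is a hypothetical name rather than the loader of Chapter 3.

```python
def load(text_records, memory_size):
    """Copy each record's bytes to its own load address; record order is irrelevant."""
    memory = [None] * memory_size   # None marks bytes that no record has written
    for start, data in text_records:
        if start + len(data) > memory_size:
            raise ValueError("record does not fit in memory")
        memory[start:start + len(data)] = list(data)
    return memory


# Hypothetical records, deliberately not sorted by address.
records = [
    (0x0000, bytes.fromhex("172063")),   # default block
    (0x0070, bytes.fromhex("05")),       # a single data byte in CDATA
    (0x0003, bytes.fromhex("032060")),   # default block resumed
]
memory = load(records, 0x80)
print(bytes(memory[0:6]).hex().upper())  # 172063032060
print(memory[0x70], memory[0x66])        # 5 None
```

Locations such as 0066 receive no bytes from any record, yet they still exist in the loaded image; this is how reserved storage ends up allocated without appearing in the object program.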

[Figure 2.14: program blocks traced through assembly and loading. Columns: Source program, Object program, Program loaded in memory.]

To understand how the pieces of each program block are gathered together, you may also want to simulate (by
hand) the loading of the object program of Fig. 2.13. The algorithm is shown in Fig. 2.12(b).


THE SIMPLIFIED INSTRUCTIONAL COMPUTER (SIC)

The architecture of the Simplified Instructional Computer (SIC) is described here. This machine has been designed
to illustrate the most commonly encountered hardware features and concepts, while avoiding most of the
idiosyncrasies that are often found in real machines.

Like many other products, SIC comes in two versions: the standard model and an XE version (XE stands for
“extra equipment,” or perhaps “extra expensive”). The two versions have been designed to be upward
compatible—that is, an object program for the standard SIC machine will also execute properly on a SIC/XE
system. (Such upward compatibility is often found on real computers that are closely related to one another.)

SIC Machine Architecture

Memory

Memory consists of 8-bit bytes; any 3 consecutive bytes form a word (24 bits). All addresses on SIC are byte
addresses; words are addressed by the location of their lowest numbered byte. There are a total of 32,768 (2^15)
bytes in the computer memory.

Registers

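There are five registers, all of which have special uses. Each register is 24 bits in length. The following table
indicates the numbers, mnemonics, and uses of these registers. (The numbering scheme has been chosen for
compatibility with the XE version of SIC.)

Mnemonic   Number   Special use
A          0        Accumulator; used for arithmetic operations
X          1        Index register; used for addressing
L          2        Linkage register; JSUB stores the return address here
PC         8        Program counter; address of the next instruction to be fetched
SW         9        Status word; includes the condition code (CC)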

Data Formats

Integers are stored as 24-bit binary numbers; 2’s complement representation is used for negative values.
Characters are stored using their 8-bit ASCII codes (see Appendix B). There is no floating-point hardware on
the standard version of SIC.
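
A 24-bit two's complement word can be read from and written to byte-addressed memory as sketched below. The sketch assumes that the lowest numbered byte of a word holds its high-order bits; that byte order is an assumption of the example, not something stated above. The function names are hypothetical.

```python
def read_word(memory: bytes, address: int) -> int:
    """Read the 3-byte word whose lowest numbered byte is at `address`."""
    raw = memory[address:address + 3]
    if len(raw) != 3:
        raise ValueError("word extends past the end of memory")
    # Assumption: high-order byte first; negative values in two's complement.
    return int.from_bytes(raw, byteorder="big", signed=True)


def write_word(value: int) -> bytes:
    """Return the 3-byte two's complement form of an integer."""
    return value.to_bytes(3, byteorder="big", signed=True)


memory = write_word(10) + write_word(-5)
print(memory.hex().upper())   # 00000AFFFFFB
print(read_word(memory, 0))   # 10
print(read_word(memory, 3))   # -5
```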

Instruction Formats
All machine instructions on the standard version of SIC have the following 24-bit format:

opcode (8 bits) | x (1 bit) | address (15 bits)

The flag bit x is used to indicate indexed-addressing mode.

Addressing Modes

There are two addressing modes available, indicated by the setting of the x bit in the instruction. The
following table describes how the target address is calculated from the address given in the instruction.
Parentheses are used to indicate the contents of a register or a memory location. For example, (X) represents
the contents of register X.

Direct addressing mode

Example. LDA TEN

Effective address (EA) = 1000

The content of address 1000 is loaded into the accumulator.

Indexed addressing mode

Example. STCH BUFFER, X

Effective address (EA) = 1000 + (X)

= 1000 + contents of the index register X


The character held in the accumulator is stored at the effective address.

Instruction Set

SIC provides a basic set of instructions that are sufficient for most simple tasks. These include instructions
that load and store registers (LDA, LDX, STA, STX, etc.), as well as integer arithmetic operations (ADD,
SUB, MUL, DIV). All arithmetic operations involve register A and a word in memory, with the result being
left in the register. There is an instruction (COMP) that compares the value in register A with a word in
memory; this instruction sets a condition code CC to indicate the result (<, =, or >). Conditional jump
instructions (JLT, JEQ, JGT) can test the setting of CC, and jump accordingly. Two instructions are provided
for subroutine linkage. JSUB jumps to the subroutine, placing the return address in register L; RSUB returns
by jumping to the address contained in register L.

1.3.2 SIC/XE Machine Architecture


Memory

The memory structure for SIC/XE is the same as that previously described for SIC. However, the maximum
memory available on a SIC/XE system is 1 megabyte (2^20 bytes). This increase leads to a change in
instruction formats and addressing modes.

Registers

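The following additional registers are provided by SIC/XE:

Mnemonic   Number   Special use
B          3        Base register; used for addressing
S          4        General working register; no special use
T          5        General working register; no special use
F          6        Floating-point accumulator (48 bits)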

Data Formats

SIC/XE provides the same data formats as the standard version. In addition, there is a 48-bit floating-point
data type with the following format:

s (1 bit) | exponent (11 bits) | fraction (36 bits)

The fraction is interpreted as a value between 0 and 1; that is, the assumed binary point is immediately
before the high-order bit. For normalized floating-point numbers, the high-order bit of the fraction must be 1.
The exponent is interpreted as an unsigned binary number between 0 and 2047. If the exponent has value e
and the fraction has value f, the absolute value of the number represented is

f * 2^(e-1024)

The sign of the floating-point number is indicated by the value of s (0 = positive, 1 = negative). A value of
zero is represented by setting all bits (including sign, exponent, and fraction) to 0.
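
The interpretation of the floating-point fields can be sketched as a decoder. The field widths (1 sign bit, 11-bit exponent, 36-bit fraction) follow from the 48-bit size and the exponent range given above; the exponent bias of 1024 is taken from the usual SIC/XE definition and should be read as an assumption here. `decode_float48` is a hypothetical name.

```python
def decode_float48(bits: int) -> float:
    """Decode a 48-bit value laid out as sign (1) | exponent (11) | fraction (36)."""
    if bits == 0:
        return 0.0                            # all bits zero represents zero
    sign = (bits >> 47) & 0x1
    exponent = (bits >> 36) & 0x7FF           # unsigned, 0..2047
    fraction_bits = bits & ((1 << 36) - 1)
    fraction = fraction_bits / (1 << 36)      # binary point before the high-order bit
    magnitude = fraction * 2.0 ** (exponent - 1024)   # assumed bias of 1024
    return -magnitude if sign else magnitude


one = (1025 << 36) | (1 << 35)                      # 0.5 * 2^1
minus_six = (1 << 47) | (1027 << 36) | (0b11 << 34) # -(0.75 * 2^3)
print(decode_float48(one))        # 1.0
print(decode_float48(minus_six))  # -6.0
```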

Instruction Formats

The larger memory available on SIC/XE means that an address will (in general) no longer fit into a 15-bit
field; thus the instruction format used on the standard version of SIC is no longer suitable. There are two
possible options—either use some form of relative addressing, or extend the address field to 20 bits. Both of
these options are included in SIC/XE (Formats 3 and 4 in the following description). In addition, SIC/XE
provides some instructions that do not reference memory at all. Formats 1 and 2 in the following description
are used for such instructions.

The new set of instruction formats is as follows. The settings of the flag bits in Formats 3 and 4 are discussed
under Addressing Modes. Bit e is used to distinguish between Formats 3 and 4 (e = 0 means Format 3, e = 1
means Format 4). Appendix A indicates the format to be used with each machine instruction.

Example. RSUB (Return from subroutine)


Addressing Modes

Two new relative addressing modes are available for use with instructions assembled using Format 3. These
are described in the following table:

For base relative addressing, the displacement field disp in a Format 3 instruction is interpreted as a 12-bit
unsigned integer.

The content of the address 0033 is loaded to the index register X.

For program-counter relative addressing, this field is interpreted as a 12-bit signed integer, with negative
values represented in 2’s complement notation.

Linkage register contains the content of RETADR 0030.

If bits b and p are both set to 0, the disp field from the Format 3 instruction is taken to be the target address.
For a Format 4 instruction, bits b and p are normally set to 0, and the target address is taken from the address
field of the instruction. We will call this direct addressing, to distinguish it from the relative addressing modes
described above.


Accumulator contains the content of LENGTH 0033.

Any of these addressing modes can also be combined with indexed addressing: if bit x is set to 1, the term (X)
is added in the target address calculation. Notice that the standard version of the SIC machine uses only
direct addressing (with or without indexing).

Accumulator A contains the content of BUFFER 0036.

Bits i and n in Formats 3 and 4 are used to specify how the target address is used. If bit i = 1 and n = 0, the
target address itself is used as the operand value; no memory reference is performed. This is called immediate
addressing.

Accumulator contains 9.

If bit i = 0 and n = 1, the word at the location given by the target address is fetched; the value contained in this
word is then taken as the address of the operand value. This is called indirect addressing.

Jump to content of the address 0030 RETADR.

If bits i and n are both 0 or both 1, the target address is taken as the location of the operand; we will refer to
this as simple addressing. Indexing cannot be used with immediate or indirect addressing modes.

SIC/XE instructions that specify neither immediate nor indirect addressing are assembled with bits n and i
both set to 1. Assemblers for the standard version of SIC will, however, set the bits in both of these positions
to 0. (This is because the 8-bit binary codes for all of the SIC instructions end in 00.) All SIC/XE machines
have a special hardware feature designed to provide the upward compatibility mentioned earlier. If bits n and i
are both 0, then bits b, p, and e are considered to be part of the address field of the instruction (rather than
flags indicating addressing modes). This makes Instruction Format 3 identical to the format used on the
standard version of SIC, providing the desired compatibility.
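
The addressing rules of this section can be collected into one sketch. It is an illustration under assumptions rather than a full simulator: `field` stands for the 12-bit disp field of a Format 3 instruction (or the 20-bit address of Format 4), registers are passed in a dictionary, memory is modelled by a caller-supplied word reader, and all names are hypothetical.

```python
def target_address(n, i, x, b, p, e, field, regs):
    """Compute the target address from the flag bits and the disp/address field."""
    if x and n != i:
        raise ValueError("indexing cannot be combined with immediate or indirect addressing")
    if n == 0 and i == 0:
        # Standard SIC compatibility: b, p, e are the high-order address bits.
        ta = (b << 14) | (p << 13) | (e << 12) | field
    elif b == 1 and p == 0:
        ta = regs["B"] + field                  # base relative, unsigned disp
    elif b == 0 and p == 1:
        signed = field - 0x1000 if field & 0x800 else field
        ta = regs["PC"] + signed                # PC relative, signed disp
    elif b == 0 and p == 0:
        ta = field                              # direct
    else:
        raise ValueError("b and p cannot both be 1")
    if x:
        ta += regs["X"]                         # indexed: add (X)
    return ta


def operand_value(n, i, ta, read_word):
    """Apply immediate, indirect or simple addressing to a target address."""
    if i == 1 and n == 0:
        return ta                               # immediate: TA is the operand
    if n == 1 and i == 0:
        return read_word(read_word(ta))         # indirect: TA holds the operand's address
    return read_word(ta)                        # simple: TA is the operand's location


regs = {"PC": 0x0009, "B": 0x0000, "X": 0x0000}
print(hex(target_address(1, 1, 0, 0, 1, 0, 0x060, regs)))   # 0x69

words = {0x0030: 0x1000, 0x1000: 42}            # hypothetical memory contents
print(operand_value(0, 1, 9, words.__getitem__))            # 9
print(operand_value(1, 0, 0x0030, words.__getitem__))       # 42
```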

Instruction Set

SIC/XE provides all of the instructions that are available on the standard version. In addition, there are
instructions to load and store the new registers (LDB, STB, etc.) and to perform floating-point arithmetic
operations (ADDF, SUBF, MULF, DIVF). There are also instructions that take their operands from registers.
Besides the RMO (register move) instruction, these include register-to-register arithmetic operations (ADDR,
SUBR, MULR, DIVR). A special supervisor call instruction (SVC) is provided. Executing this instruction
generates an interrupt that can be used for communication with the operating system. There are also several
other new instructions.

Control Sections and Program Linking


In this section, we discuss the handling of programs that consist of multiple control sections. A control section
is a part of the program that maintains its identity after assembly; each such control section can be loaded and
relocated independently of the others. Different control sections are most often used for subroutines or other
logical subdivisions of a program. The programmer can assemble, load, and manipulate each of these control
sections separately. The resulting flexibility is a major benefit of using control sections. When control sections
form logically related parts of a program, it is necessary to provide some means for linking them together. For
example, instructions in one control section might need to refer to instructions or data located in another
section. Because control sections are independently loaded and relocated, the assembler is unable to process
these references in the usual way. The assembler has no idea where any other control section will be located at
execution time. Such references between control sections are called external references. The assembler
generates information for each external reference that will allow the loader to perform the required linking. In
this section we describe how external references are handled by our assembler. Chapter 3 discusses in detail
how the actual linking is performed.

Figure 2.15 shows our example program as it might be written using multiple control sections. In this case
there are three control sections: one for the main program and one for each subroutine. The START statement
identifies the beginning of the assembly and gives a name (COPY) to the first control section. The first section
continues until the CSECT statement on line 109. This assembler directive signals the start of a new control
section named RDREC. Similarly, the CSECT statement on line 193 begins the control section named
WRREC. The assembler establishes a separate location counter (beginning at 0) for each control section, just
as it does for program blocks.

Control sections differ from program blocks in that they are handled separately by the assembler. (It is not
even necessary for all control sections in a program to be assembled at the same time.) Symbols that are
defined in one control section may not be used directly by another control section; they must be identified as
EXTERNAL REFERENCES for the loader to handle. Figure 2.15 shows the use of two assembler
directives to identify such references:

EXTDEF (external definition) and EXTREF (external reference). The EXTDEF statement in a
control section names symbols, called external symbols, that are defined in this control section and may be
used by other sections. Control section names (in this case COPY, RDREC, and WRREC) do not need to be
named in an EXTDEF statement because they are automatically considered to be external symbols. The
EXTREF statement names symbols that are used in this control section and are defined elsewhere. For
example, the symbols BUFFER, BUFEND, and LENGTH are defined in the control section named COPY
and made available to the other sections by the EXTDEF statement on line 6. The third control section
(WRREC) uses two of these symbols, as specified in its EXTREF statement (line 207).

The order in which symbols are listed in the EXTDEF and EXTREF statements is not significant.
Now we are ready to look at how external references are handled by the assembler. Figure 2.16 shows the
generated object code for each statement in the program. Consider first the instruction

The operand (RDREC) is named in the EXTREF statement for the control section, so this is an external
reference. The assembler has no idea where the control section containing RDREC will be loaded, so it cannot
assemble the address for this instruction. Instead the assembler inserts an address of zero and passes
information to the loader, which will cause the proper address to be inserted at load time. The address of
RDREC will have no predictable relationship to anything in this control section; therefore relative addressing
is not possible. Thus an extended format instruction must be used to provide room for the actual address to be
inserted. This is true of any instruction whose operand involves an external reference.

Similarly, the instruction

is only slightly different. Here the value of the data word to be generated is specified by an expression
involving two external references: BUFEND and BUFFER. As before, the assembler stores this value as zero.
When the program is loaded, the loader will add to this data area the address of BUFEND and subtract from it
the address of BUFFER, which results in the desired value.
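
The load-time arithmetic for such a field can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name resolve_field, the load_addresses table, and the load addresses themselves are hypothetical and are not part of the assembler or loader described here.

```python
def resolve_field(initial, modifications, load_addresses, bits=24):
    """Apply signed external-symbol adjustments to one field.

    initial        -- value the assembler stored in the field (zero here)
    modifications  -- list of (sign, symbol) pairs, e.g. ("+", "BUFEND")
    load_addresses -- hypothetical table of addresses assigned at load time
    """
    value = initial
    for sign, symbol in modifications:
        address = load_addresses[symbol]
        value += address if sign == "+" else -address
    # Keep the result within the width of the field (a 3-byte word here).
    return value & ((1 << bits) - 1)


# Illustrative load addresses (assumed); only their difference matters.
load_addresses = {"BUFFER": 0x4033, "BUFEND": 0x5033}
maxlen = resolve_field(0, [("+", "BUFEND"), ("-", "BUFFER")], load_addresses)
print(f"{maxlen:06X}")  # 001000
```

Whatever addresses the loader assigns, the two adjustments cancel the dependence on the load address and leave the buffer length in the data word.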

Note the difference between the handling of the expression on line 190 and the similar expression on line 107.
The symbols BUFEND and BUFFER are defined in the same control section with the EQU statement on line
107. Thus the value of the expression can be calculated immediately by the assembler. This could not be done
for line 190; BUFEND and BUFFER are defined in another control section, so their values are unknown at
assembly time.

As we can see from the above discussion, the assembler must remember (via entries in SYMTAB) in which
control section a symbol is defined. Any attempt to refer to a symbol in another control section must be
flagged as an error unless the symbol is identified (using EXTREF) as an external reference. The assembler
must also allow the same symbol to be used in different control sections. For example, the conflicting
definitions of MAXLEN on lines 107 and 190 should cause no problem.
A reference to MAXLEN in the control section COPY would use the definition on line 107, whereas a
reference to MAXLEN in RDREC would use the definition on line 190.
So far we have seen how the assembler leaves room in the object code for the values of external symbols. The
assembler must also include information in the object program that will cause the loader to insert the proper
values where they are required. We need two new record types in the object program and a change in a
previously defined record type. As before, the exact format of these records is arbitrary; however, the same
information must be passed to the loader in some form.
The two new record types are Define and Refer. A Define record gives information about external symbols
that are defined in this control section— that is, symbols named by EXTDEF. A Refer record lists symbols
that are used as external references by the control section—that is, symbols named by EXTREF. The formats
of these records are as follows.

Define record:
Col. 1 D
Col. 2-7 Name of external symbol defined in this control section
Col. 8-13 Relative address of symbol within this control section (hexadecimal)
Col. 14-73 Repeat information in Col. 2-13 for other external symbols

Refer record:
Col. 1 R
Col. 2-7 Name of external symbol referred to in this control section
Col. 8-73 Names of other external reference symbols
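
As a rough sketch of how these two record types could be written out, the following Python fragment formats them using the column widths given above. The function names are hypothetical, and the relative addresses used for BUFFER, BUFEND, and LENGTH are illustrative values only.

```python
def define_record(symbols):
    """Build a Define record from (name, relative_address) pairs (EXTDEF symbols)."""
    # Each symbol occupies 6 name columns followed by 6 hexadecimal address columns.
    return "D" + "".join(f"{name:<6}{address:06X}" for name, address in symbols)


def refer_record(names):
    """Build a Refer record from the names listed in EXTREF."""
    # Each name is padded to 6 columns.
    return "R" + "".join(f"{name:<6}" for name in names)


# Illustrative addresses (assumed for this sketch).
d_rec = define_record([("BUFFER", 0x33), ("BUFEND", 0x1033), ("LENGTH", 0x2D)])
r_rec = refer_record(["RDREC", "WRREC"])
print(d_rec)  # DBUFFER000033BUFEND001033LENGTH00002D
print(repr(r_rec))
```

The sketch does not split long symbol lists across several records; a real assembler would start a new record once column 73 is reached.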

The other information needed for program linking is added to the Modification record type. The new format is
as follows.
Modification record (revised):

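Col. 1 M
Col. 2-7 Starting address of the field to be modified, relative to the beginning of the control section
(hexadecimal)
Col. 8-9 Length of the field to be modified, in half-bytes (hexadecimal)
Col. 10 Modification flag (+ or -)
Col. 11-16 External symbol whose value is to be added to or subtracted from the indicated field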

Program Relocation
As we mentioned before, it is often desirable to have more than one program at a time sharing the memory
and other resources of the machine. If we knew in advance exactly which programs were to be executed
concurrently in this way, we could assign addresses when the programs were assembled so that they would fit
together without overlap or wasted space. Most of the time, however, it is not practical to plan program
execution this closely. (We usually do not know exactly when jobs will be submitted, exactly how long they
will run, etc.) Because of this, it is desirable to be able to load a program into memory wherever there is room
for it. In such a situation the actual starting address of the program is not known until load time.

The program we considered in Section 2.1 is an example of an absolute program (or absolute assembly). This
program must be loaded at address 1000 (the address that was specified at assembly time) in order to execute
properly. To see this, consider the instruction

from Fig. 2.2. In the object program (Fig. 2.3), this statement is translated as 00102D, specifying that register
A is to be loaded from memory address 102D. Suppose we attempt to load and execute the program at address
2000 instead of address 1000. If we do this, address 102D will not contain the value that we expect—in fact, it
will probably be part of some other user’s program.

Obviously we need to make some change in the address portion of this instruction so we can load and execute
our program at address 2000. On the other hand, there are parts of the program (such as the constant 3
generated from line 85) that should remain the same regardless of where the program is loaded. Looking at the
object code alone, it is in general not possible to tell which values represent addresses and which represent
constant data items.

Since the assembler does not know the actual location where the program will be loaded, it cannot make the
necessary changes in the addresses used by the program. However, the assembler can identify for the loader
those parts of the object program that need modification. An object program that contains the information
necessary to perform this kind of modification is called a relocatable program.

To look at this in more detail, consider the program from Figs. 2.5 and 2.6. In the preceding section, we
assembled this program using a starting address of 0000. Figure 2.7(a) shows this program loaded beginning
at address 0000. The JSUB instruction from line 15 is loaded at address 0006. The address field of this
instruction contains 01036, which is the address of the instruction labeled RDREC. (These addresses are, of
course, the same as those assigned by the assembler.)

Now suppose that we want to load this program beginning at address 5000, as shown in Fig. 2.7(b). The
address of the instruction labeled RDREC is then 6036. Thus the JSUB instruction must be modified as shown
to contain this new address.

Likewise, if we loaded the program beginning at address 7420 (Fig. 2.7c), the JSUB instruction would need to
be changed to 4B108456 to correspond to the new address of RDREC. Note that no matter where the program
is loaded, RDREC is always 1036 bytes past the starting address of the program. This means that we can solve
the relocation problem in the following way:

1. When the assembler generates the object code for the JSUB instruction we are considering, it will insert the
address of RDREC relative to the start of the program. (This is the reason we initialized the location counter
to 0 for the assembly.)

2. The assembler will also produce a command for the loader, instructing it to add the beginning address of
the program to the address field in the JSUB instruction at load time.

The command for the loader, of course, must also be a part of the object program. We can accomplish this
with a Modification record having the following format:
Modification record:
Col. 1 M
Col. 2-7 Starting location of the address field to be modified, relative to the beginning of the program
(hexadecimal)
Col. 8-9 Length of the address field to be modified, in half-bytes (hexadecimal)
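
The modification the loader performs can be sketched as follows. This is a minimal illustration, assuming the extended-format JSUB is assembled as 4B101036 (opcode and flag bits followed by the 20-bit address field 01036); the function name relocate and its parameters are hypothetical, and the field position is counted in half-bytes within a single instruction rather than within the whole object program.

```python
def relocate(instruction_hex, field_start, half_bytes, load_address):
    """Add load_address to one address field of an assembled instruction.

    field_start -- offset of the field within the hex string, in half-bytes
    half_bytes  -- length of the field in half-bytes (hex digits)
    """
    end = field_start + half_bytes
    field = int(instruction_hex[field_start:end], 16)
    # Wrap around so the result still fits in the field.
    field = (field + load_address) % (16 ** half_bytes)
    return instruction_hex[:field_start] + f"{field:0{half_bytes}X}" + instruction_hex[end:]


jsub = "4B101036"  # assumed object code for +JSUB RDREC assembled at origin 0
print(relocate(jsub, 3, 5, 0x5000))  # 4B106036
print(relocate(jsub, 3, 5, 0x7420))  # 4B108456
```

Only the 5 half-bytes of the address field change; the opcode and flag bits in the first three half-bytes are left alone, which is why the record must give both the starting location and the length of the field.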
2.3.2 Symbol-Defining Statements

Up to this point the only user-defined symbols we have seen in assembler language programs have
appeared as labels on instructions or data areas. The value of such a label is the address assigned to the
statement on which it appears. Most assemblers provide an assembler directive that allows the programmer to
define symbols and specify their values. The assembler directive generally used is EQU (for “equate”). The
general form of such a statement is

symbol EQU value

This statement defines the given symbol (i.e., enters it into SYMTAB) and assigns to it the value specified.
The value may be given as a constant or as any expression involving constants and previously defined
symbols. We discuss the formation and use of expressions in the next section.
One common use of EQU is to establish symbolic names that can be used for improved readability in place of
numeric values. For example, on line 133 of the program in Fig. 2.5 we used the statement

+LDT #4096

to load the value 4096 into register T. This value represents the maximum-length record we could read with
subroutine RDREC. The meaning is not, however, as clear as it might be. If we include the statement
MAXLEN EQU 4096
in the program, we can write line 133 as
+LDT #MAXLEN
When the assembler encounters the EQU statement, it enters MAXLEN into SYMTAB (with value 4096).
During assembly of the LDT instruction, the assembler searches SYMTAB for the symbol MAXLEN, using
its value as the operand in the instruction. The resulting object code is exactly the same as in the original
version of the instruction; however, the source statement is easier to understand. It is also much easier to find
and change the value of MAXLEN if this becomes necessary—we would not have to search through the
source code looking for places where #4096 is used.
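
A small sketch may help show why the object code is unchanged. It assumes the SIC/XE conventions that LDT has opcode 74 and that an extended-format immediate instruction has n=0, i=1, x=b=p=0, e=1 followed by a 20-bit value; the function names are hypothetical.

```python
symtab = {}


def equ(symbol, value):
    """Process 'symbol EQU value' for a constant value."""
    if symbol in symtab:
        raise ValueError(f"duplicate symbol: {symbol}")
    symtab[symbol] = value


def ldt_immediate_extended(operand):
    """Assemble '+LDT #operand', where operand is a number or an EQU symbol."""
    value = symtab[operand] if operand in symtab else int(operand)
    opcode_ni = 0x74 | 0b01  # assumed LDT opcode with n=0, i=1 (immediate)
    # One half-byte for x, b, p, e (only e set), then the 20-bit operand value.
    return f"{opcode_ni:02X}1{value:05X}"


equ("MAXLEN", 4096)
print(ldt_immediate_extended("4096"))    # 75101000
print(ldt_immediate_extended("MAXLEN"))  # 75101000
```

Both forms produce the same instruction because the symbol is replaced by its SYMTAB value before the instruction is built.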
There is another common assembler directive that can be used to indirectly assign values to symbols. This
directive is usually called ORG (for “origin”). Its form is

ORG value

where value is a constant or an expression involving constants and previously defined symbols. When this
statement is encountered during assembly of a program, the assembler resets its location counter (LOCCTR)
to the specified value. Since the values of symbols used as labels are taken from LOCCTR, the ORG
statement will affect the values of all labels defined until the next ORG.

Sometimes, ORG can be useful in label definition. Suppose that we were defining a symbol table with the
following structure:

In this table, the SYMBOL field contains a 6-byte user-defined symbol; VALUE is a one-word representation
of the value assigned to the symbol; FLAGS is a 2-byte field that specifies symbol type and other information.

We could reserve space for this table with the statement

We might want to refer to entries in the table using indexed addressing (placing in the index register the offset
of the desired entry from the beginning of the table). Of course, we want to be able to refer to the fields
SYMBOL, VALUE, and FLAGS individually, so we must also define these labels. One way of doing this
would be with EQU statements:

2.3.3 Expressions

Our previous examples of assembler language statements have used single terms (labels, literals, etc.) as
instruction operands. Most assemblers allow the use of expressions wherever such a single operand is
permitted. Each such expression must, of course, be evaluated by the assembler to produce a single operand
address or value.

Assemblers generally allow arithmetic expressions formed according to the normal rules using the operators +,
-, *, and /. Division is usually defined to produce an integer result. Individual terms in the expression may be
constants, user-defined symbols, or special terms. The most common such special term is the current value of
the location counter (often designated by *). This term represents the value of the next unassigned memory
location. Thus in Fig. 2.9 the statement

BUFEND EQU *

gives BUFEND a value that is the address of the next byte after the buffer area.

In Section 2.2 we discussed the problem of program relocation. We saw that some values in the object
program are relative to the beginning of the program, while others are absolute (independent of program
location). Similarly, the values of terms and expressions are either relative or absolute. A constant is, of
course, an absolute term. Labels on instructions and data areas, and references to the location counter value,
are relative terms. A symbol whose value is given by EQU (or some similar assembler directive) may be
either an absolute term or a relative term depending upon the expression used to define its value.

Expressions are classified as either absolute expressions or relative expressions depending upon the type of
value they produce. An expression that contains only absolute terms is, of course, an absolute expression.
However, absolute expressions may also contain relative terms provided the relative terms occur in pairs and
the terms in each such pair have opposite signs. It is not necessary that the paired terms be adjacent to each
other in the expression; however, all relative terms must be capable of being paired in this way. None of the
relative terms may enter into a multiplication or division operation.

A relative expression is one in which all of the relative terms except one can be paired as described above; the
remaining unpaired relative term must have a positive sign. As before, no relative term may enter into a
multiplication or division operation. Expressions that do not meet the conditions given for either absolute or
relative expressions should be flagged by the assembler as errors.

Although the rules given above may seem arbitrary, they are actually quite reasonable. The expressions that
are legal under these definitions include exactly those expressions whose value remains meaningful when the
program is relocated. A relative term or expression represents some value that may be written as (S+ r), where
S is the starting address of the program and r is the value of the term or expression relative to the starting
address. Thus a relative term usually represents some location within the program. When relative terms are
paired with opposite signs, the dependency on the program starting address is canceled out; the result is an
absolute value. Consider, for example, the program of Fig. 2.9. In the statement

107 MAXLEN EQU BUFEND-BUFFER

both BUFEND and BUFFER are relative terms, each representing an address within the program. However,
the expression represents an absolute value: the difference between the two addresses, which is the length of
the buffer area in bytes. Notice that the assembler listing in Fig. 2.10 shows the value calculated for this
expression (hexadecimal 1000) in the Loc column. This value does not represent an address, as do most of the
other entries in that column. However, it does show the value that is associated with the symbol that appears
in the source statement (MAXLEN).

Expressions such as BUFEND + BUFFER, 100 - BUFFER, or 3 * BUFFER represent neither absolute values
nor locations within the program. The values of these expressions depend upon the program starting address
in a way that is unrelated to anything within the program itself. Because such expressions are very unlikely to
be of any use, they are considered errors.
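
The pairing rules can be checked mechanically by counting signed relative terms. The following Python sketch is one possible way to do this, not the method of any particular assembler; the function name classify and the symbol_types table ('R' for relative, 'A' for absolute) are assumptions, and the location-counter term * is not handled.

```python
import re


def classify(expression, symbol_types):
    """Return 'absolute', 'relative', or 'error' for an expression string."""
    # Split into signed additive terms: "BUFEND-BUFFER" -> ('', 'BUFEND'), ('-', 'BUFFER')
    terms = re.findall(r"([+-]?)([^+-]+)", expression.replace(" ", ""))
    net = 0  # positive relative terms minus negative relative terms
    for sign, term in terms:
        factors = re.split(r"[*/]", term)
        relative = [f for f in factors if not f.isdigit() and symbol_types[f] == "R"]
        if relative and len(factors) > 1:
            return "error"  # a relative term enters a multiplication or division
        if relative:
            net += -1 if sign == "-" else 1
    # All pairs cancel -> absolute; exactly one positive term left over -> relative.
    return {0: "absolute", 1: "relative"}.get(net, "error")


symbol_types = {"BUFFER": "R", "BUFEND": "R", "MAXLEN": "A"}
for text in ["BUFEND-BUFFER", "BUFFER+100", "BUFEND+BUFFER", "100-BUFFER", "3*BUFFER"]:
    print(text, "->", classify(text, symbol_types))
```

A net count of zero means every relative term has a partner of opposite sign, so the starting address S cancels; a net count of one leaves a single S + r, which is still a meaningful location.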

To determine the type of an expression, we must keep track of the types of all symbols defined in the
program. For this purpose we need a flag in the symbol table to indicate type of value (absolute or relative) in
addition to the value itself. Thus for the program of Fig. 2.10, some of the symbol table entries might be

With this information the assembler can easily determine the type of each expression used as an operand and
generate Modification records in the object program for relative values.

Basic Functions Of Assembler


• Translate mnemonic opcodes to machine language
• Assign machine addresses to symbolic labels used by the programmer

Translation of the source program to object code consists of the following steps:

• Convert mnemonic opcodes to their machine language equivalents.
• Convert symbolic operands to their equivalent machine addresses.
  o Handle forward references.
• Build machine instructions in the proper format.
• Convert data constants specified in the source program into their internal machine representation.
• Write the object program and the assembly listing.

Features of Assembler
• Machine dependent assembler features
  o Instruction formats and addressing modes
  o Program relocation
• Machine independent assembler features
  o Symbol-defining statements
  o Literals
  o Expressions
  o Program blocks
  o Control sections and program linking

Assembler Output Format

The assembler takes the source program as input and produces the object program, which is later loaded
into memory for execution. The simple object program format uses three types of records: Header, Text, and
End.

Header record:

Col. 1 H
Col. 2-7 Program name
Col. 8-13 Starting address of object program (hexadecimal)
Col. 14-19 Length of object program in bytes (hexadecimal)

Text record:

Col. 1 T
Col. 2-7 Starting address for object code in this record (hexadecimal)
Col. 8-9 Length of object code in this record in bytes (hexadecimal)
Col. 10-69 Object code, represented in hexadecimal (2 columns per byte of object code)

End record:

Col. 1 E
Col. 2-7 Address of first executable instruction in object program (hexadecimal)
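
The column layouts above translate directly into fixed-width formatting. The sketch below shows one way to produce the three record types in Python; the function names are hypothetical and the program length used in the example is an illustrative value.

```python
def header_record(name, start, length):
    """Col. 1 'H', Col. 2-7 name, Col. 8-13 start, Col. 14-19 length."""
    return f"H{name:<6}{start:06X}{length:06X}"


def text_record(start, object_code):
    """Col. 1 'T', Col. 2-7 start, Col. 8-9 byte count, Col. 10-69 object code."""
    length = len(object_code) // 2  # two hexadecimal columns per byte
    if length > 30:
        raise ValueError("a Text record holds at most 30 bytes of object code")
    return f"T{start:06X}{length:02X}{object_code}"


def end_record(first_instruction):
    """Col. 1 'E', Col. 2-7 address of the first executable instruction."""
    return f"E{first_instruction:06X}"


print(header_record("COPY", 0x1000, 0x107A))  # HCOPY  00100000107A
print(text_record(0x1000, "141033"))          # T00100003141033
print(end_record(0x1000))                     # E001000
```

The 30-byte limit follows from the 60 columns (10 through 69) reserved for object code.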

Assembler Design

A four-step approach is used to develop a design specification for an assembler:

1. Identify the information necessary to perform a task.

2. Design a suitable data structure to record the information.

3. Determine the processing necessary to obtain the information.

4. Determine the processing necessary to perform the task.

Assembling = Analysis of source program + Synthesis of target program

During assembling there are two phases.

• Analysis phase
• Synthesis phase

Data Structures of Assemblers

The assembler uses two major internal data structures:

1. Symbol table (SYMTAB)

2. Mnemonics table (OPTAB)

Each entry of the symbol table has two primary fields—name and address. The table is built by the analysis
phase. An entry in the mnemonics table has two primary fields—mnemonic and opcode. The synthesis phase
uses these tables to obtain the machine address with which a name is associated, and the machine opcode
corresponding to a mnemonic, respectively. Hence the tables have to be searched with the symbol name and
the mnemonic as keys.

The primary function performed by the analysis phase is the building of the symbol table. For this purpose it
must determine the addresses with which the symbolic names used in a program are associated. It is possible
to determine some addresses directly (e.g. the address of the first instruction in the program); however, others
must be inferred.

To implement memory allocation, a data structure called the location counter (LC) is introduced. The location
counter always contains the address of the next memory word in the target program. It is initialized
to the constant specified in the START statement. Whenever the analysis phase sees a label in an assembly
statement, it enters the label and the contents of LC in a new entry of the symbol table. It then finds the
number of memory words required by the assembly statement and updates the LC contents. (Hence the word
‘counter’ in ‘location counter’.) This ensures that LC points to the next memory word in the target program
even when machine instructions have different lengths and DS/DC statements reserve different amounts of
memory. To update the contents of LC, the analysis phase needs to know the lengths of different instructions.
This information depends only on the assembly language, so the mnemonics table can be extended to include
this information in a new field called length. We refer to the processing involved in maintaining the location
counter as LC processing.

Figure 4.6 illustrates the use of the data structures by the analysis and synthesis phases. Note that the
Mnemonics table is a fixed table which is merely accessed by the analysis and synthesis phases, while the
Symbol table is constructed during analysis and used during synthesis. The tasks performed by the analysis
and synthesis phases are as follows:

Analysis phase

1. Isolate the label, mnemonic opcode and operand fields of a statement.

2. If a label is present, enter the pair (symbol, <LC contents>) in a new entry of symbol table.

3. Check validity of the mnemonic opcode through a look-up in the Mnemonics table.

4. Perform LC processing, i.e. update the value contained in LC by considering the opcode and
operands of the statement.

Synthesis phase

1. Obtain the machine opcode corresponding to the mnemonic from the Mnemonics table.

2. Obtain address of a memory operand from the Symbol table.

3. Synthesize a machine instruction or the machine form of a constant, as the case may be.

Two Pass Assemblers

Two pass translation of an assembly language program can handle forward references easily. LC processing is
performed in the first pass and symbols defined in the program are entered into the symbol table. The second
pass synthesizes the target form using the address information found in the symbol table. In effect, the first
pass performs analysis of the source program while the second pass performs synthesis of the target program.

The first pass constructs an intermediate representation (IR) of the source program for use by the second pass.
This representation consists of two main components: data structures, e.g. the symbol table, and a processed
form of the source program. The latter component is called intermediate code (IC).

DESIGN OF A TWO PASS ASSEMBLER

Tasks performed by the passes of a two pass assembler are as follows:

Pass I

1. Separate the symbol, mnemonic opcode and operand fields.
2. Build the symbol table.
3. Validate the opcode.
4. Perform LC processing.
5. Construct the intermediate representation.

Pass II

1. Convert mnemonics to machine code.
2. Synthesize the target program.

Pass I performs analysis of the source program and synthesis of the intermediate representation while Pass II
processes the intermediate representation to synthesize the Target program.

Pass I Algorithm

LOCCTR is initialized to the starting address given in the START statement; if no starting address is specified,
LOCCTR is set to zero. Each line processed by Pass I is written to the intermediate file. If a statement has a
symbol in its label field, the assembler searches SYMTAB for it. If the label is already present, a duplicate
symbol error flag is set. If it is not found, the label and the current value of LOCCTR are entered into SYMTAB.
The opcode is also validated in Pass I. If the opcode is found in the mnemonics table (OPTAB), LOCCTR is
incremented by the size of the instruction. At the END statement, (LOCCTR - starting address) gives the length
of the program.
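
The description above can be sketched in Python as follows. This is a simplified illustration for the basic SIC machine (every instruction is 3 bytes), not a complete Pass I; the function name pass_one, the tuple layout of the source lines, and the tiny test program are all assumptions made for the example.

```python
OPTAB = {"STL", "LDA", "RSUB"}  # mnemonics only; each SIC instruction is 3 bytes


def pass_one(lines):
    """lines: (label, opcode, operand) tuples. Returns (symtab, intermediate, length, errors)."""
    symtab, intermediate, errors = {}, [], []
    start = locctr = 0  # stays 0 if there is no START statement
    for label, opcode, operand in lines:
        if opcode == "START":
            start = locctr = int(operand, 16)
            continue
        if opcode == "END":
            break
        intermediate.append((locctr, opcode, operand))
        if label:
            if label in symtab:
                errors.append(f"duplicate symbol {label}")
            else:
                symtab[label] = locctr
        if opcode in OPTAB or opcode == "WORD":
            locctr += 3
        elif opcode == "RESW":
            locctr += 3 * int(operand)
        elif opcode == "RESB":
            locctr += int(operand)
        elif opcode == "BYTE":  # C'..' is one byte per character, X'..' one per two digits
            body = len(operand) - 3
            locctr += body if operand[0] == "C" else body // 2
        else:
            errors.append(f"invalid opcode {opcode}")
    return symtab, intermediate, locctr - start, errors


program = [  # hypothetical fragment, not the COPY program of the text
    ("COPY", "START", "1000"), ("FIRST", "STL", "RETADR"), (None, "LDA", "THREE"),
    (None, "RSUB", None), ("THREE", "WORD", "3"), ("RETADR", "RESW", "1"),
    (None, "END", "FIRST"),
]
symtab, intermediate, length, errors = pass_one(program)
print({name: f"{addr:04X}" for name, addr in symtab.items()}, f"length={length:X}")
```

Note that the forward references to RETADR and THREE cause no difficulty here, because Pass I only records where labels are defined and never needs their values.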

Pass II Algorithm

Here the intermediate file is the input. On encountering the START statement, the assembler writes the Header
record to the object program. For each opcode found, the assembler checks for a symbol in the operand field and
replaces it with its value from SYMTAB. The resulting object code is written into the current Text record; if it
does not fit, a new Text record is started. After the last Text record is written, the End record is
generated, specifying the first executable instruction of the program.
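
A matching sketch of Pass II is given below. It is a simplification for the basic SIC machine, assuming an intermediate file of (location, opcode, operand) tuples and a SYMTAB already built by the first pass; the function name pass_two, the sample data, and the opcode value 4C for RSUB are assumptions of this example, and indexed addressing and undefined-symbol errors are not handled.

```python
OPTAB = {"STL": 0x14, "LDA": 0x00, "RSUB": 0x4C}


def pass_two(name, start, length, intermediate, symtab, entry):
    """Return the object program as a list of Header, Text, and End records."""
    records = [f"H{name:<6}{start:06X}{length:06X}"]
    text_start, text = None, ""

    def flush():
        """Write out the current Text record, if it holds any object code."""
        nonlocal text_start, text
        if text:
            records.append(f"T{text_start:06X}{len(text) // 2:02X}{text}")
        text_start, text = None, ""

    for loc, opcode, operand in intermediate:
        if opcode in OPTAB:
            address = symtab[operand] if operand else 0
            code = f"{OPTAB[opcode]:02X}{address:04X}"
        elif opcode == "WORD":
            code = f"{int(operand):06X}"
        else:  # RESW/RESB generate no object code and end the current Text record
            flush()
            continue
        if len(text) + len(code) > 60:  # 60 hex columns = 30 bytes per Text record
            flush()
        if text_start is None:
            text_start = loc
        text += code
    flush()
    records.append(f"E{symtab[entry]:06X}")
    return records


symtab = {"FIRST": 0x1000, "THREE": 0x1009, "RETADR": 0x100C}  # hypothetical Pass I result
intermediate = [(0x1000, "STL", "RETADR"), (0x1003, "LDA", "THREE"),
                (0x1006, "RSUB", None), (0x1009, "WORD", "3"), (0x100C, "RESW", "1")]
for record in pass_two("COPY", 0x1000, 0xF, intermediate, symtab, "FIRST"):
    print(record)
```

Each instruction is formed by joining the machine opcode with the operand address looked up in SYMTAB, which is exactly the synthesis step described earlier.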

A simple SIC Assembler program

Figure 2.1 shows an assembler language program for the basic version of SIC. We use variations of this
program throughout this chapter to show different assembler features. The line numbers are for reference only
and are not part of the program. These numbers also help to relate corresponding parts of different versions of
the program. The mnemonic instructions used are those introduced in Section 1.3.1 and Appendix A. Indexed
addressing is indicated by adding the modifier “,X” following the operand (see line 160). Lines beginning with
“.” contain comments only.

The program contains a main routine that reads records from an input device (identified with device code F1)
and copies them to an output device (code 05). This main routine calls subroutine RDREC to read a record
into a buffer and subroutine WRREC to write the record from the buffer to the output device. Each subroutine
must transfer the record one character at a time because the only I/O instructions available are RD and WD.
The buffer is necessary because the I/O rates for the two devices, such as a disk and a slow printing terminal,
may be very different. The end of each record is marked with a null character (hexadecimal 00). If a record is
longer than the length of the buffer (4096 bytes), only the first 4096 bytes are copied. (For simplicity, the
program does not deal with error recovery when a record containing 4096 bytes or more is read.) The end of
the file to be copied is indicated by a zero-length record. When the end of file is detected, the program writes
EOF on the output device and terminates by executing an RSUB instruction. We assume that this program
was called by the operating system using a JSUB instruction; thus, the RSUB will return control to the
operating system.

Forward Reference (necessity of two pass assembler)


The translation of a source program to object code requires us to accomplish the following functions:

1. Convert mnemonic operation codes to their machine language equivalents—e.g., translate STL to 14 (line
10).

2. Convert symbolic operands to their equivalent machine addresses— e.g., translate RETADR to 1033 (line
10).

3. Build the machine instructions in the proper format.

4. Convert the data constants specified in the source program into their internal machine representations—
e.g., translate EOF to 454F46 (line 80).

5. Write the object program and the assembly listing.

All of these functions except number 2 can easily be accomplished by sequential processing of the source
program, one line at a time. The translation of addresses, however, presents a problem. Consider the statement

10 1000 FIRST STL RETADR 141033

This instruction contains a forward reference—that is, a reference to a label (RETADR) that is defined later in
the program. If we attempt to translate the program line by line, we will be unable to process this statement
because we do not know the address that will be assigned to RETADR. Because of this, most assemblers
make two passes over the source program. The first pass does little more than scan the source program for
label definitions and assign addresses (such as those in the Loc column in Fig. 2.2). The second pass performs
most of the actual translation previously described.

ONE- PASS ASSEMBLERS

In this section we examine the structure and design of one-pass assemblers. As we discussed in Section 2.1,
the main problem in trying to assemble a program in one pass involves forward references. Instruction
operands often are symbols that have not yet been defined in the source program. Thus the assembler does not
know what address to insert in the translated instruction.

It is easy to eliminate forward references to data items; we can simply require that all such areas be defined in
the source program before they are referenced. This restriction is not too severe. The programmer merely
places all storage reservation statements at the start of the program rather than at the end. Unfortunately,
forward references to labels on instructions cannot be eliminated as easily. The logic of the program often
requires a forward jump— for example, in escaping from a loop after testing some condition. Requiring that
the programmer eliminate all such forward jumps would be much too restrictive and inconvenient. Therefore,
the assembler must make some special provision for handling forward references. To reduce the size of the
problem, many one-pass assemblers do, however, prohibit (or at least discourage) forward references to data
items.

There are two main types of one-pass assembler. One type produces object code directly in memory for
immediate execution; the other type produces the usual kind of object program for later execution. We use the
program in Fig. 2.18 to illustrate our discussion of both types. This example is the same as in Fig. 2.2, with all
data item definitions placed ahead of the code that references them. The generated object code shown in Fig.
2.18 is for reference only; we will discuss how each type of one-pass assembler would actually generate the
object program required.

We first discuss one-pass assemblers that generate their object code in memory for immediate execution. No
object program is written out, and no loader is needed. This kind of load-and-go assembler is useful in a
system that is oriented toward program development and testing. A university computing system for student
use is a typical example of such an environment. In such a system, a large fraction of the total workload
consists of program translation. Because programs are re-assembled nearly every time they are run, efficiency
of the assembly process is an important consideration. A load-and-go assembler avoids the overhead of
writing the object program out and reading it back in. This can be accomplished with either a one- or a two-pass
assembler. However, a one-pass assembler also avoids the overhead of an additional pass over the source
program.

Because the object program is produced in memory rather than being written out on secondary storage, the
handling of forward references becomes less difficult. The assembler simply generates object code
instructions as it scans the source program. If an instruction operand is a symbol that has not yet been defined,
the operand address is omitted when the instruction is assembled. The symbol used as an operand is entered
into the symbol table (unless such an entry is already present). This entry is flagged to indicate that the
symbol is undefined. The address of the operand field of the instruction that refers to the undefined symbol is
added to a list of forward references associated with the symbol table entry. When the definition for a symbol
is encountered, the forward reference list for that symbol is scanned (if one exists), and the proper address is
inserted into any instructions previously generated.
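
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm of Fig. 2.19(c): the names memory, symtab, reference, define and undefined_symbols are hypothetical, only operand fields are modelled, and the address 2030 in the usage lines is made up. The addresses 2013, 201C and 2024 are the ones used in the discussion of Fig. 2.19 and Fig. 2.20.

```python
# Hypothetical sketch of forward-reference handling in a load-and-go assembler.
memory = {}   # operand-field address -> operand address (None = not yet filled in)
symtab = {}   # symbol -> {"value": address or None, "refs": [operand-field addresses]}

def reference(symbol, operand_addr):
    """Assemble an operand field at operand_addr that refers to symbol."""
    entry = symtab.setdefault(symbol, {"value": None, "refs": []})
    if entry["value"] is None:
        memory[operand_addr] = None          # operand address omitted for now
        entry["refs"].append(operand_addr)   # remember where to patch later
    else:
        memory[operand_addr] = entry["value"]

def define(symbol, address):
    """Record the definition of symbol and fill in earlier forward references."""
    entry = symtab.setdefault(symbol, {"value": None, "refs": []})
    entry["value"] = address
    for operand_addr in entry["refs"]:
        memory[operand_addr] = address
    entry["refs"].clear()

def undefined_symbols():
    """Symbols still flagged as undefined; at end of program these are errors."""
    return sorted(s for s, e in symtab.items() if e["value"] is None)

# Usage, following the discussion of Fig. 2.19:
reference("RDREC", 0x2013)    # forward reference, left unresolved here
reference("ENDFIL", 0x201C)   # forward reference
define("ENDFIL", 0x2024)      # definition patches location 201C
reference("ENDFIL", 0x2030)   # made-up later use: no longer a forward reference
```

Keeping the list of operand-field addresses inside the symbol table entry means that a definition can patch every earlier use without rescanning the source program.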

An example should help to make this process clear. Figure 2.19(a) shows the object code and symbol table
entries as they would be after scanning line 40 of the program in Fig. 2.18. The first forward reference occurred
on line 15.
Since the operand (RDREC) was not yet defined, the instruction was assembled with no value assigned as the
operand address (denoted in the figure by - - - - - -). RDREC was then entered into SYMTAB as an undefined
symbol (indicated by *); the address of the operand field of the instruction (2013) was inserted in a list
associated with RDREC. A similar process was followed with the instructions on lines 30 and 35.

Now consider Fig. 2.19(b), which corresponds to the situation after scanning line 160. Some of the forward
references have been resolved by this time, while others have been added. When the symbol ENDFIL was
defined (line 45), the assembler placed its value in the SYMTAB entry; it then inserted this value into the
instruction operand field (at address 201C) as directed by the forward reference list. From this point on, any
references to ENDFIL would not be forward references, and would not be entered into a list. Similarly, the
definition of RDREC (line 125) resulted in the filling in of the operand address at location 2013. Meanwhile,
two new forward references have been added: to WRREC (line 65) and EXIT (line 155). You should continue
tracing through this process to the end of the program to show yourself that all of the forward references will
be filled in properly. At the end of the program, any SYMTAB entries that are still marked with * indicate
undefined symbols. These should be flagged by the assembler as errors.

When the end of the program is encountered, the assembly is complete. If no errors have occurred, the
assembler searches SYMTAB for the value of the symbol named in the END statement (in this case, FIRST)
and jumps to this location to begin execution of the assembled program. The algorithm for a one-pass
assembler is shown in Fig. 2.19(c).

We used an absolute program as our example because, for a load-and-go assembler, the actual address must
be known at assembly time. Of course it is not necessary for this address to be specified by the programmer; it
might be assigned by the system. In either case, however, the assembly process would be the same—the
location counter would be initialized to the actual program starting address.

One-pass assemblers that produce object programs as output are often used on systems where external
working-storage devices (for the intermediate file between the two passes) are not available. Such assemblers
may also be useful when the external storage is slow or is inconvenient to use for some other reason. One-pass
assemblers that produce object programs follow a slightly different procedure from that previously described.
Forward references are entered into lists as before. Now, however, when the definition of a symbol is
encountered, instructions that made forward references to that symbol may no longer be available in memory
for modification. In general, they will already have been written out as part of a Text record in the object
program. In this case the assembler must generate another Text record with the correct operand address. When
the program is loaded, this address will be inserted into the instruction by the action of the loader.

Figure 2.20 illustrates this process. The second Text record contains the object code generated from lines 10
through 40 in Fig. 2.18. The operand addresses for the instructions on lines 15, 30, and 35 have been
generated as 0000. When the definition of ENDFIL on line 45 is encountered, the assembler generates the
third Text record. This record specifies that the value 2024 (the address of ENDFIL) is to be loaded at
location 201C (the operand address field of the JEQ instruction on line 30). When the program is loaded,
therefore, the value 2024 will replace the 0000 previously loaded. The other forward references in the
program are handled in exactly the same way. In effect, the services of the loader are being used to complete
forward references that could not be handled by the assembler. Of course, the object program records must be
kept in their original order when they are presented to the loader.
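
A rough Python sketch of this scheme follows. It assumes a Text record layout of the letter T, a 6-digit hexadecimal start address, a 2-digit hexadecimal length in bytes, and then the object code. It also assumes 3-byte instructions made of a 1-byte opcode followed by a 2-byte operand address. The function names, the one-record-per-instruction simplification, and the opcode byte 30 are illustrative assumptions; the addresses 201C and 2024 come from the discussion of Fig. 2.20.

```python
# Hypothetical sketch: a one-pass assembler that writes Text records and lets
# the loader complete forward references.
records = []        # object program, in the order the records are written
symtab = {}         # symbol -> address, for symbols already defined
forward_refs = {}   # symbol -> operand-field addresses still to be patched

def text_record(start, code_hex):
    """Format a Text record: 'T', start address, length in bytes, object code."""
    return f"T{start:06X}{len(code_hex) // 2:02X}{code_hex}"

def reference(symbol, instr_addr, opcode):
    """Write an instruction whose operand is symbol (0000 if not yet defined)."""
    addr = symtab.get(symbol)
    if addr is None:
        forward_refs.setdefault(symbol, []).append(instr_addr + 1)
        addr = 0
    records.append(text_record(instr_addr, f"{opcode:02X}{addr:04X}"))

def define(symbol, address):
    """Define symbol and write one extra Text record per forward reference."""
    symtab[symbol] = address
    for operand_addr in forward_refs.pop(symbol, []):
        records.append(text_record(operand_addr, f"{address:04X}"))

def load(recs):
    """Toy loader: apply the records in order, later bytes overwriting earlier ones."""
    memory = {}
    for rec in recs:
        start, length = int(rec[1:7], 16), int(rec[7:9], 16)
        for offset, byte in enumerate(bytes.fromhex(rec[9:9 + 2 * length])):
            memory[start + offset] = byte
    return memory

reference("ENDFIL", 0x201B, 0x30)   # operand field at 201C is written as 0000
define("ENDFIL", 0x2024)            # a new Text record supplies 2024 for 201C
loaded = load(records)
```

Loading the same records in reverse order would leave 0000 at location 201C, which is why the records must reach the loader in their original order.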

In this section we considered only simple one-pass assemblers that handled absolute programs. Instruction
operands were assumed to be single symbols, and the assembled instructions contained the actual (not
relative) addresses of the operands. More advanced assembler features such as literals were not allowed.

2.4.2 Multi-Pass Assemblers

In our discussion of the EQU assembler directive, we required that any symbol used on the right-hand side
(i.e., in the expression giving the value of the new symbol) be defined previously in the source program. A
similar requirement was imposed for ORG. As a matter of fact, such a restriction is normally applied to all
assembler directives that (directly or indirectly) define symbols.

The reason for this is the symbol definition process in a two-pass assembler. Consider, for example, the
sequence

ALPHA EQU BETA

BETA EQU DELTA

DELTA RESW 1

The symbol BETA cannot be assigned a value when it is encountered during the first pass because DELTA
has not yet been defined. As a result, ALPHA cannot be evaluated during the second pass. This means that
any assembler that makes only two sequential passes over the source program cannot resolve such a sequence
of definitions.
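
The effect can be seen in a small Python sketch. It uses a deliberately generous model in which every sequential pass is allowed to evaluate an EQU whose operand is already defined. The starting location counter value 1000 (hexadecimal), the 3-byte word size, and the names SOURCE and sequential_pass are assumptions made for illustration.

```python
# Hypothetical sketch: sequential passes over the three definitions above.
WORD_SIZE = 3   # bytes reserved by RESW 1 (assumed word size)

# (label, operand); an operand of None stands for "RESW 1".
SOURCE = [("ALPHA", "BETA"), ("BETA", "DELTA"), ("DELTA", None)]

def sequential_pass(symtab, start=0x1000):
    """One in-order pass; an EQU is evaluated only if its operand is defined."""
    locctr = start
    for label, operand in SOURCE:
        if operand is None:
            symtab[label] = locctr       # storage definition: value is the location counter
            locctr += WORD_SIZE
        elif operand in symtab:
            symtab[label] = symtab[operand]
    return symtab

symtab = {}
sequential_pass(symtab)
after_pass_1 = sorted(symtab)   # only DELTA is defined
sequential_pass(symtab)
after_pass_2 = sorted(symtab)   # BETA is now defined, ALPHA still is not
sequential_pass(symtab)
after_pass_3 = sorted(symtab)   # a third pass is needed to define ALPHA
```

Each additional level of forward reference in a chain of definitions costs one more sequential pass, so no fixed number of passes is enough in general.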

Restrictions such as prohibiting forward references in symbol definition are not normally a serious
inconvenience for the programmer. As a matter of fact, such forward references tend to create difficulty for a
person reading the program as well as for the assembler. Nevertheless, some assemblers are designed to
eliminate the need for such restrictions. The general solution is a multi-pass assembler that can make as many
passes as are needed to process the definitions of symbols. It is not necessary for such an assembler to make
more than two passes over the entire program. Instead, the portions of the program that involve forward
references in symbol definition are saved during Pass 1. Additional passes through these stored definitions are
made as the assembly progresses. This process is followed by a normal Pass 2.

There are several ways of accomplishing the task outlined above. The method we describe involves storing
those symbol definitions that involve forward references in the symbol table. This table also indicates which
symbols are dependent on the values of others, to facilitate symbol evaluation.

Figure 2.21(a) shows a sequence of symbol-defining statements that involve forward references; the other
parts of the source program are not important for our discussion, and have been omitted. The following parts
of Fig. 2.21 show information in the symbol table as it might appear after processing each of the source
statements shown.

Figure 2.21(b) displays symbol table entries resulting from Pass 1 processing of the statement

HALFSZ EQU MAXLEN/2

MAXLEN has not yet been defined, so no value for HALFSZ can be computed. The defining expression for
HALFSZ is stored in the symbol table in place of its value. The entry &1 indicates that one symbol in the
defining expression is undefined. In an actual implementation, of course, this definition might be stored at
some other location. SYMTAB would then simply contain a pointer to the defining expression. The symbol
MAXLEN is also entered in the symbol table, with the flag * identifying it as undefined. Associated with this
entry is a list of the symbols whose values depend on MAXLEN (in this case, HALFSZ). (Note the similarity
to the way we handled forward references in a one-pass assembler.)

The same procedure is followed with the definition of MAXLEN [see Fig. 2.21(c)]. In this case there are two
undefined symbols involved in the definition: BUFEND and BUFFER. Both of these are entered into
SYMTAB with lists indicating the dependence of MAXLEN upon them. Similarly, the definition of PREVBT
causes this symbol to be added to the list of dependencies on BUFFER [as shown in Fig. 2.21(d)].

So far we have simply been saving symbol definitions for later processing. The definition of BUFFER on line
4 lets us begin evaluation of some of these symbols. Let us assume that when line 4 is read, the location
counter contains the hexadecimal value 1034. This address is stored as the value of BUFFER. The assembler
then examines the list of symbols that are dependent on BUFFER. The symbol table entry for the first symbol
in this list (MAXLEN) shows that it depends on two currently undefined symbols; therefore, MAXLEN
cannot be evaluated immediately. Instead, the &2 is changed to &1 to show that only one symbol in the
definition (BUFEND) remains undefined. The other symbol in the list (PREVBT) can be evaluated because it
depends only on BUFFER. The value of the defining expression for PREVBT is calculated and stored in
SYMTAB. The result is shown in Fig. 2.21(e).

The remainder of the processing follows the same pattern. When BUFEND is defined by line 5, its value is
entered into the symbol table. The list associated with BUFEND then directs the assembler to evaluate
MAXLEN, and entering a value for MAXLEN causes the evaluation of the symbol in its list (HALFSZ). As
shown in Fig. 2.21(f), this completes the symbol definition process. If any symbols remained undefined at the
end of the program, the assembler would flag them as errors.
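
The dependency lists and the &n counts can be sketched in Python as follows. Only HALFSZ EQU MAXLEN/2 and the location counter value 1034 are stated in the text above. The defining expressions used here for MAXLEN and PREVBT, the 4096-byte size of BUFFER, and all function and field names are assumptions made for illustration.

```python
# Hypothetical sketch of multi-pass symbol definition with dependency lists.
symtab = {}   # name -> {"value", "expr", "pending", "dependents"}

def entry(name):
    """Return the SYMTAB entry for name, creating an undefined one if needed."""
    return symtab.setdefault(
        name, {"value": None, "expr": None, "pending": 0, "dependents": []})

def resolve(name):
    """Evaluate name, then re-examine every symbol that was waiting on it."""
    e = symtab[name]
    e["value"] = e["expr"]({s: x["value"] for s, x in symtab.items()})
    for dep in e["dependents"]:
        symtab[dep]["pending"] -= 1          # e.g. &2 becomes &1
        if symtab[dep]["pending"] == 0:
            resolve(dep)
    e["dependents"] = []

def define(name, expr, operands=()):
    """Store the defining expression; evaluate it once all operands are known."""
    e = entry(name)
    e["expr"] = expr
    missing = [s for s in operands if entry(s)["value"] is None]
    e["pending"] = len(missing)              # the &n count
    for s in missing:
        entry(s)["dependents"].append(name)
    if not missing:
        resolve(name)

def undefined():
    """Symbols with no value at the end of the program (flagged as errors)."""
    return sorted(s for s, e in symtab.items() if e["value"] is None)

define("HALFSZ", lambda v: v["MAXLEN"] // 2, ["MAXLEN"])
define("MAXLEN", lambda v: v["BUFEND"] - v["BUFFER"], ["BUFEND", "BUFFER"])  # assumed expression
define("PREVBT", lambda v: v["BUFFER"] - 1, ["BUFFER"])                      # assumed expression
define("BUFFER", lambda v: 0x1034)                 # location counter value from the text
pending_after_buffer = symtab["MAXLEN"]["pending"]
define("BUFEND", lambda v: 0x1034 + 4096)          # assumed 4096-byte buffer
```

Under these assumptions, defining BUFFER evaluates PREVBT at once and leaves MAXLEN waiting on one symbol, and defining BUFEND then triggers MAXLEN and, through it, HALFSZ.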

The procedure we have just described applies to symbols defined by assembler directives like EQU.

Macro Assembler
Use of a macro preprocessor followed by a conventional assembler is an expensive way of handling macros
since the number of passes over the source program is large and many functions get duplicated. For example,
analysis of a source statement to detect macro calls requires us to process the mnemonic field. A similar
function is required in the first pass of the assembler. Similar functions of the preprocessor and the assembler
can be merged if macros are handled by a macro assembler which performs macro expansion and program
assembly simultaneously. This may also reduce the number of passes.

It might seem that macro expansion can always be performed in a single pass. This is not true, because
certain kinds of forward references in macros cannot be handled in a single pass.

This problem leads to the classical two pass organization for macro expansion. The first pass collects
information about the symbols defined in a program and the second pass performs macro expansion.

Pass structure of a macro-assembler

To design the pass structure of a macro-assembler we identify the functions of a macro preprocessor and the
conventional assembler which can be merged to advantage. After merging, the functions can be structured
into passes of the macro-assembler. This process leads to the following pass structure:

Pass I

1. Macro definition processing

2. SYMTAB construction.

Pass II

1. Macro expansion

2. Memory allocation and LC processing

3. Processing of literals

4. Intermediate code generation.

Pass III

1. Target code generation.

Pass II is large in size since it performs many functions. Further, since it performs macro expansion as well as
Pass I of a conventional assembler, all the data structures of the macro preprocessor and the conventional
assembler need to exist during this pass.

The pass structure can be simplified if attributes of actual parameters are not to be supported. The macro
preprocessor would then be a single pass program. Integrating Pass I of the assembler with the preprocessor
would give us the following two-pass structure:

Pass I

1. Macro definition processing

2. Macro expansion

3. Memory allocation, LC processing and SYMTAB construction

4. Processing of literals

5. Intermediate code generation.

Pass II

1. Target code generation.
