
System Software and Compilers PDF

The document describes the System Software course 15CS63. [1] It covers topics like assemblers, macroprocessors, loaders, linkers and operating systems. [2] The prerequisites for the course are basic microprocessor concepts. [3] Key outcomes include understanding assemblers, loaders, compilers and familiarizing with file structures and libraries.

Uploaded by

Prathiksha B.A
Copyright
© © All Rights Reserved

System Software 15CS63

System Software
Semester : VI Course Code : 15CS63

Course Title : System Software and Compiler Design

Faculty : Niranjan Murthy C

Dept : Computer Science & Engineering

Prerequisites: Basic concepts of microprocessors (10CS45)

Description: This course gives an introduction to the design and implementation of
various types of system software. A central theme of the course is the
relationship between machine architecture and system software. The
design of an assembler or an operating system is greatly influenced by
the architecture of the machine on which it runs. These influences are
emphasized and demonstrated through the discussion of actual pieces
of system software for a variety of real machines.

Outcomes
The students should be able to:
1. Define system software such as assemblers and macroprocessors.
2. Define system software such as loaders and linkers.
3. Explain lexical analysis and syntax analysis, and become familiar with source, object, and
executable file structures and libraries.
4. Describe the front-end and back-end phases of a compiler and their importance.

1 GMIT, Davangere Deepak D J



MODULE-1
• Introduction to System Software
• Machine Architecture of SIC and SIC/XE
• Assemblers: basic assembler functions, machine-dependent assembler features,
machine-independent assembler features, assembler design options
• Macroprocessors: basic macro processor functions
(10 Hours)
MACHINE ARCHITECTURE
System Software:

• System software consists of a variety of programs that support the operation of a computer.
• Application software focuses on an application or problem to be solved.
• System software is machine dependent. It allows the user to focus on the
application or problem to be solved without worrying about the details of how the
machine works internally.

Examples: Operating system, compiler, assembler, macroprocessor, loader or linker, debugger, text
editor, database management systems, etc.

Difference between System Software and application software

System Software                              Application Software

Is machine dependent.                        Is not dependent on the underlying hardware.
Focuses on the computing system.             Provides a solution to a problem.
Examples: operating system, compiler,        Examples: antivirus, Microsoft Office
assembler

SIC – Simplified Instructional Computer


Simplified Instructional Computer (SIC) is a hypothetical computer that includes the hardware
features most often found on real machines. There are two versions of SIC: the
standard model (SIC) and the extended version (SIC/XE, where XE stands for "extra equipment" or "extra expensive").

SIC Machine Architecture:

We discuss here the SIC machine architecture with respect to its Memory and Registers,
Data Formats, Instruction Formats, Addressing Modes, Instruction Set, Input and Output.

Memory:

There are 2^15 bytes in the computer memory, that is, 32,768 bytes. Memory consists of 8-bit bytes,
and any 3 consecutive bytes form a word (24 bits). A word is addressed by the location of its
lowest-numbered byte.

Registers:

There are five registers, each 24 bits in length. Their mnemonics, numbers, and uses are given in the
following table.

Mnemonic   Number   Use
A          0        Accumulator; used for arithmetic operations
X          1        Index register; used for addressing
L          2        Linkage register; JSUB stores the return address here
PC         8        Program counter
SW         9        Status word, including the condition code (CC)

Data Formats:

Integers are stored as 24-bit binary numbers, with 2's complement representation used for negative
values. Characters are stored using their 8-bit ASCII codes. There is no floating-point hardware on the
standard version of SIC.

Instruction Formats:

opcode (8 bits) | x (1 bit) | address (15 bits)

The flag bit x is used to indicate indexed-addressing mode.

All machine instructions on the standard version of SIC have the 24-bit format as shown above.

Addressing Modes:

Only two modes are supported: direct and indexed.

Mode      Indication   Target address calculation
Direct    x = 0        TA = address
Indexed   x = 1        TA = address + (X)

Parentheses are used to indicate the contents of a register.
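
The target address rule in the table above can be sketched in a few lines of Python. This is only an illustration: the function name `sic_target_address` is hypothetical, and wrapping the result to 15 bits is an assumption made here to keep the address inside the 32,768-byte memory.

```python
def sic_target_address(instruction: int, x_register: int = 0) -> int:
    """Return the target address (TA) of a 24-bit standard SIC instruction."""
    address = instruction & 0x7FFF         # low 15 bits: address field
    indexed = (instruction >> 15) & 0x1    # next bit up: the x flag
    if indexed:
        # Indexed mode: TA = address + (X); wrapping to 15 bits is an assumption
        return (address + x_register) & 0x7FFF
    return address                         # Direct mode: TA = address


# Direct mode: opcode 00, x = 0, address 1036
print(hex(sic_target_address(0x001036)))
# Indexed mode: opcode 50, x = 1, address 1039, with (X) = 4
print(hex(sic_target_address(0x509039, x_register=4)))
```

The second call uses the instruction word 509039, which also appears in the COPY object program later in these notes: the leading 9 of the address half sets the x bit, leaving 1039 as the address field.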

Instruction Set

• Load and store registers (LDA, LDX, STA, STX)
• Integer arithmetic (ADD, SUB, MUL, DIV); all involve register A and a word in memory.
• Comparison (COMP); involves register A and a word in memory.
• Conditional jumps (JLT, JEQ, JGT)
• Subroutine linkage (JSUB, RSUB)

Input and Output

• I/O is performed by transferring one byte at a time to or from the rightmost 8 bits of register A.
• Each device has a unique 8-bit ID code.
• Test device (TD): tests whether a device is ready to send or receive a byte of data.
• Read data (RD): reads a byte from the device into register A.
• Write data (WD): writes a byte from register A to the device.

SIC/XE Machine Architecture:

Memory

• The maximum memory available on a SIC/XE system is 1 megabyte (2^20 bytes).

Registers

• SIC/XE provides the additional registers B, S, T, and F, in addition to the registers of SIC.

Mnemonic   Number   Special use
B          3        Base register
S          4        General working register
T          5        General working register
F          6        Floating-point accumulator (48 bits)

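Floating-point data type:

• There is a 48-bit floating-point data type consisting of a 1-bit sign, an 11-bit exponent e, and a
36-bit fraction f. The absolute value of the number is f * 2^(e-1024).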

Instruction Formats :

The new set of instruction formats for the SIC/XE machine architecture is as follows.

Format 1 (1 byte): contains only the operation code (taken directly from the opcode table).

Format 2 (2 bytes): the first eight bits hold the operation code, the next four bits register 1, and the
last four bits register 2. Registers are identified by the numbers given in the register tables
(e.g., register T is encoded as hex 5 and F as hex 6).

Format 3 (3 bytes): the first 6 bits contain the operation code, the next 6 bits contain flags, and the
last 12 bits contain the displacement for the address of the operand. Because the operation code uses
only 6 bits, the second hex digit of the instruction is affected by the values of the first two flags
(n and i). The flags, in order, are n, i, x, b, p, and e. Their functions are explained in the next
section. The last flag, e, indicates the instruction format (0 for format 3 and 1 for format 4).

Format 4 (4 bytes): same as format 3 with an extra 2 hex digits (8 bits), giving a 20-bit address field
for addresses that require more than 12 bits to be represented.
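
As a rough illustration of format 2, the sketch below packs an opcode and two register numbers into a 2-byte instruction. The register numbers come from the tables in these notes. The opcode values (ADDR = 90, COMPR = A0, CLEAR = B4, TIXR = B8) are taken from the standard SIC/XE instruction set rather than from these notes, and the function name `encode_format2` is hypothetical.

```python
# Register numbers from the SIC and SIC/XE register tables
REGISTERS = {"A": 0, "X": 1, "L": 2, "B": 3, "S": 4, "T": 5, "F": 6, "PC": 8, "SW": 9}

# A few format 2 opcodes (assumed from the standard SIC/XE instruction set)
FORMAT2_OPCODES = {"ADDR": 0x90, "COMPR": 0xA0, "CLEAR": 0xB4, "TIXR": 0xB8}


def encode_format2(mnemonic: str, r1: str, r2: str = None) -> str:
    """Return the 4-hex-digit object code of a format 2 instruction."""
    opcode = FORMAT2_OPCODES[mnemonic]
    n1 = REGISTERS[r1]
    n2 = REGISTERS[r2] if r2 is not None else 0   # unused second register is 0
    word = (opcode << 8) | (n1 << 4) | n2         # 8 + 4 + 4 bits
    return f"{word:04X}"


print(encode_format2("ADDR", "S", "A"))   # register S is 4, register A is 0
print(encode_format2("TIXR", "T"))        # register T is 5
```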

Addressing Modes:

Five possible addressing modes plus the combinations are as follows.

1. Direct (x, b, and p all set to 0): the address field is used as the operand address as it is. n and i
are both set to the same value, either 0 or 1. That value is normally 1. If both are 0 in a format 3
instruction, the flag bits b, p, and e are treated as part of the address field, which makes the format
compatible with the standard SIC format.

2. Relative (either b or p equal to 1 and the other equal to 0): the displacement is added to the
current value of the B register (if b = 1) or of the PC register (if p = 1).

3. Immediate (i = 1, n = 0): the operand value is contained in the instruction itself (i.e., it lies in
the last 12 or 20 bits of the instruction).

4. Indirect (i = 0, n = 1): the target address holds the address of the operand value.

5. Indexed (x = 1): the value in register X is added to obtain the actual address of the operand. This
can be combined with the direct and relative modes, but not with immediate or indirect addressing.

The flag bits used in the above formats have the following meanings:

e: e = 0 means format 3; e = 1 means format 4.

Bits x, b, p: used to calculate the target address using relative, direct, and indexed addressing modes.

Bits i and n: specify how the target address is used.

b = 0 and p = 0: the disp field of a format 3 instruction is taken to be the target address.
For a format 4 instruction, bits b and p are normally set to 0 and the 20-bit address is the target address.

x = 1: the value of register X is added in the target address calculation.

i = 1, n = 0: immediate addressing, TA. The target address itself is used as the operand value; no memory
reference is made.

i = 0, n = 1: indirect addressing, ((TA)). The word at TA is fetched, and its value is taken as the
address of the operand value.

i = 0, n = 0 or i = 1, n = 1: simple addressing, (TA). TA is taken as the address of the operand value.

Two new relative addressing modes (base relative and program-counter relative) are available for use
with instructions assembled using format 3.
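
The flag rules above can be checked with a short Python sketch that splits a format 3 or format 4 instruction into its fields and computes the target address. The function names are hypothetical, and the sketch deliberately ignores the SIC-compatible case n = i = 0 (where b, p, and e belong to the address field). The sample instruction 032010 is the LDA =C'EOF' example that appears later in these notes.

```python
def decode_format34(hex_code: str) -> dict:
    """Split a format 3 (6 hex digits) or format 4 (8 hex digits) instruction."""
    value = int(hex_code, 16)
    nbits = len(hex_code) * 4                    # 24 or 32 bits
    fields = {"opcode": (value >> (nbits - 8)) & 0xFC}   # top 6 bits
    # The six flags follow the opcode, in the order n, i, x, b, p, e
    for offset, flag in enumerate("nixbpe", start=7):
        fields[flag] = (value >> (nbits - offset)) & 1
    width = 20 if fields["e"] else 12            # address or disp width
    fields["field"] = value & ((1 << width) - 1)
    return fields


def target_address(f: dict, pc: int = 0, base: int = 0, x_reg: int = 0) -> int:
    """Compute TA from decoded fields; pc is the address of the next instruction."""
    ta = f["field"]
    if not f["e"] and f["p"]:
        if ta >= 0x800:                          # disp is 12-bit two's complement
            ta -= 0x1000
        ta += pc
    elif not f["e"] and f["b"]:
        ta += base                               # disp is unsigned here
    if f["x"]:
        ta += x_reg
    return ta


def addressing_type(f: dict) -> str:
    if (f["n"], f["i"]) == (0, 1):
        return "immediate"
    if (f["n"], f["i"]) == (1, 0):
        return "indirect"
    return "simple"


f = decode_format34("032010")                    # instruction at 001A, so PC = 001D
print(addressing_type(f), hex(target_address(f, pc=0x001D)))
```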

Instruction Set:

SIC/XE provides all of the instructions that are available on the standard version. In addition, it
provides:

• Instructions to load and store the new registers: LDB, STB, etc.
• Floating-point arithmetic operations: ADDF, SUBF, MULF, DIVF
• Register move instruction: RMO
• Register-to-register arithmetic operations: ADDR, SUBR, MULR, DIVR
• Supervisor call instruction: SVC

Input and Output:

There are I/O channels that can be used to perform input and output while the CPU is executing
other instructions. This allows computing and I/O to overlap, resulting in more efficient system
operation. The instructions SIO, TIO, and HIO are used to start, test, and halt the operation of I/O
channels.

Example programs SIC:


Example 1: Simple data and character movement operation
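         LDA     FIVE          ; LOAD CONSTANT 5 INTO A
         STA     ALPHA         ; STORE IN ALPHA
         LDCH    CHARZ         ; LOAD CHARACTER 'Z' INTO A
         STCH    C1            ; STORE IN CHARACTER VARIABLE C1
ALPHA    RESW    1             ; ONE-WORD VARIABLE
FIVE     WORD    5             ; ONE-WORD CONSTANT
CHARZ    BYTE    C'Z'          ; ONE-BYTE CONSTANT
C1       RESB    1             ; ONE-BYTE VARIABLE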

Example 2: Arithmetic operations


         LDA     ALPHA         ; LOAD ALPHA INTO A
         ADD     INCR          ; ADD THE VALUE OF INCR
         SUB     ONE           ; SUBTRACT 1
         STA     BETA          ; STORE IN BETA
         ........
         ........
ONE      WORD    1             ; ONE-WORD CONSTANT
ALPHA    RESW    1             ; ONE-WORD VARIABLES
BETA     RESW    1
INCR     RESW    1

Example 3: Looping and Indexing operation


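         LDX     ZERO          ; X = 0
MOVECH   LDCH    STR1,X        ; LOAD CHARACTER FROM STR1 INTO A
         STCH    STR2,X        ; STORE CHARACTER INTO STR2
         TIX     ELEVEN        ; ADD 1 TO X, COMPARE WITH 11
         JLT     MOVECH        ; LOOP IF X IS LESS THAN 11
         ......
         ......
STR1     BYTE    C'HELLO WORLD'
STR2     RESB    11
ZERO     WORD    0
ELEVEN   WORD    11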

Example 4: Input and Output operation


INLOOP   TD      INDEV         ; TEST INPUT DEVICE
         JEQ     INLOOP        ; LOOP UNTIL DEVICE IS READY
         RD      INDEV         ; READ ONE BYTE INTO A
         STCH    DATA          ; STORE A TO DATA
         .
         .
OUTLP    TD      OUTDEV        ; TEST OUTPUT DEVICE
         JEQ     OUTLP         ; LOOP UNTIL DEVICE IS READY
         LDCH    DATA          ; LOAD DATA INTO A
         WD      OUTDEV        ; WRITE A TO OUTPUT DEVICE
         .
         .
INDEV    BYTE    X'F5'         ; INPUT DEVICE NUMBER
OUTDEV   BYTE    X'08'         ; OUTPUT DEVICE NUMBER
DATA     RESB    1             ; ONE-BYTE VARIABLE

Example 5: To transfer two hundred bytes of data from input device to memory
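         LDX     ZERO          ; X = 0
CLOOP    TD      INDEV         ; TEST INPUT DEVICE
         JEQ     CLOOP         ; LOOP UNTIL DEVICE IS READY
         RD      INDEV         ; READ ONE BYTE INTO A
         STCH    RECORD,X      ; STORE BYTE INTO RECORD
         TIX     B200          ; ADD 1 TO X, COMPARE WITH 200
         JLT     CLOOP         ; LOOP IF X IS LESS THAN 200
         .
         .
INDEV    BYTE    X'F5'         ; INPUT DEVICE NUMBER
RECORD   RESB    200           ; 200-BYTE BUFFER
ZERO     WORD    0
B200     WORD    200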


Example Programs (SIC/XE)


Example 1: Simple data and character movement operation
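         LDA     #5            ; LOAD VALUE 5 INTO A
         STA     ALPHA         ; STORE IN ALPHA
         LDA     #90           ; LOAD ASCII CODE FOR 'Z' INTO A
         .
         .
         .
ALPHA    RESW    1             ; ONE-WORD VARIABLE
C1       RESB    1             ; ONE-BYTE VARIABLE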

Example 2: Arithmetic operations


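         LDS     INCR          ; LOAD VALUE OF INCR INTO S
         LDA     ALPHA         ; LOAD ALPHA INTO A
         ADDR    S,A           ; ADD THE VALUE OF INCR (REGISTER TO REGISTER)
         SUB     #1            ; SUBTRACT 1
         STA     BETA          ; STORE IN BETA
         ........
         ........
ALPHA    RESW    1             ; ONE-WORD VARIABLES
BETA     RESW    1
INCR     RESW    1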

Example 3: Looping and Indexing operation


         LDT     #11           ; T = 11
         LDX     #0            ; X = 0
MOVECH   LDCH    STR1,X        ; LOAD A FROM STR1
         STCH    STR2,X        ; STORE A TO STR2
         TIXR    T             ; ADD 1 TO X, COMPARE WITH T
         JLT     MOVECH        ; LOOP IF X IS LESS THAN 11
         .
         .
STR1     BYTE    C'HELLO WORLD'
STR2     RESB    11

Assemblers - 1
A Simple Two-Pass Assembler

Main Functions

• Translate mnemonic operation codes to their machine language equivalents.
• Assign machine addresses to the symbolic labels used by the programmer.
• The design of an assembler depends heavily on the source language it translates and the machine
language it produces (e.g., the instruction formats and addressing modes).

Basic Functions of an Assembler


• The example program (COPY) is a copy function that reads records from a specified input device
and copies them to a specified output device. It:
– Reads a record from the input device (code F1)
– Copies the record to the output device (code 05)
– Repeats the above steps until end of file (EOF) is encountered
– Then writes EOF to the output device
– Then executes RSUB to return to the caller

RDREC and WRREC

• Data transfer
– A record is a stream of bytes with a null character (hex 00) at the end.
– If a record is longer than 4096 bytes, only the first 4096 bytes are copied.
– EOF is indicated by a zero-length record (i.e., a byte stream containing only a null
character).
– Because the speeds of the input and output devices may differ, a buffer is used
to store the record temporarily.
• Subroutine call and return
– On line 10, "STL RETADR" saves the return address that is already stored
in register L.
– Otherwise, after calling RDREC or WRREC, COPY could not return to its caller.

Assembler Directives

• Assembler directives are pseudo-instructions.
– They are not translated into machine instructions.
– They only provide instructions, directions, or information to the assembler.
• Basic assembler directives:
o START : Specifies the name and starting address of the program
o END : Indicates the end of the source program and (optionally) specifies the first executable
instruction in the program
o BYTE : Generates a character or hexadecimal constant, occupying as many bytes as
needed to represent the constant
o WORD : Generates a one-word integer constant
o RESB : Reserves the indicated number of bytes for a data area
o RESW : Reserves the indicated number of words for a data area

An Assembler's Job

• Convert mnemonic operation codes to their machine language codes.
• Convert symbolic operands (e.g., jump labels, variable names) to their machine addresses.
• Use proper addressing modes and formats to build efficient machine instructions.
• Translate data constants into internal machine representations.
• Output the object program and provide other information (e.g., for the linker and loader).

Object Program Format

• Header record

Col. 1        H
Col. 2~7      Program name
Col. 8~13     Starting address of object program (hex)
Col. 14~19    Length of object program in bytes (hex)

• Text record

Col. 1        T
Col. 2~7      Starting address for object code in this record (hex)
Col. 8~9      Length of object code in this record in bytes (hex)
Col. 10~69    Object code, represented in hex (2 columns per byte)

• End record

Col. 1        E
Col. 2~7      Address of first executable instruction in object program (hex)

The Object Code for COPY


H COPY 001000 00107A
T 001000 1E 141033 482039 001036 281030 301015 482061 3C1003 00102A 0C1039 00102D
T 00101E 15 0C1036 482061 081033 4C0000 454F46 000003 000000
T 002039 1E 041030 001030 E0205D 30203F D8205D 281030 302057 549039 2C205E 38203F
T 002057 1C 101036 4C0000 F1 001000 041030 E02079 302064 509039 DC2079 2C1036
T 002073 07 382064 4C0000 05
E 001000

NOTE: There is no object code corresponding to addresses 1033-2038. This storage is simply
reserved by the loader for use by the program during execution.
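
To make the column layout concrete, here is a minimal Python sketch that parses Header, Text, and End records. It assumes the records are written without the spaces used above for readability, so that the column positions match the format tables. The function name `parse_object_program` is hypothetical.

```python
def parse_object_program(lines):
    """Parse Header (H), Text (T), and End (E) records into a dictionary."""
    program = {"name": None, "start": None, "length": None, "text": [], "entry": None}
    for line in lines:
        kind = line[0]                                # column 1: record type
        if kind == "H":
            program["name"] = line[1:7].strip()       # columns 2-7
            program["start"] = int(line[7:13], 16)    # columns 8-13
            program["length"] = int(line[13:19], 16)  # columns 14-19
        elif kind == "T":
            address = int(line[1:7], 16)              # columns 2-7
            length = int(line[7:9], 16)               # columns 8-9, in bytes
            code = bytes.fromhex(line[9:9 + 2 * length])
            program["text"].append((address, code))
        elif kind == "E":
            program["entry"] = int(line[1:7], 16)     # columns 2-7
    return program


records = [
    "HCOPY  00100000107A",
    "T002073073820644C000005",
    "E001000",
]
copy = parse_object_program(records)
print(copy["name"], hex(copy["start"]), hex(copy["length"]))
```

Only the last Text record of COPY is included to keep the example short. A loader would place each `(address, code)` pair at the given address, which is why addresses with no Text record (the reserved storage mentioned in the note) are simply left untouched.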

Two Pass Assembler

• Pass 1
– Assign addresses to all statements in the program.
– Save the values (addresses) assigned to all labels (including statement labels and variable
names) for use in Pass 2. This deals with forward references.
– Perform some processing of assembler directives (e.g., BYTE and RESW, which can affect
address assignment).
• Pass 2
– Assemble instructions (generate opcodes and look up addresses).
– Generate data values defined by BYTE and WORD.
– Perform processing of assembler directives not done in Pass 1.
– Write the object program and the assembly listing.

A Simple Two Pass Assembler Implementation

Algorithms and Data Structures

Three Main Data Structures


• Operation Code Table (OPTAB)
• Location Counter (LOCCTR)
• Symbol Table (SYMTAB)

OPTAB (operation code table)

• Content
– The mapping between mnemonics and machine codes. It also includes the instruction
format, available addressing modes, and length information.
• Characteristic
– Static table. The content never changes.
• Implementation
– Array or hash table. Because the content never changes, its search speed can be
optimized.
• In Pass 1, OPTAB is used to look up and validate mnemonics in the source program.
• In Pass 2, OPTAB is used to translate mnemonics to machine instructions.

Location Counter (LOCCTR)


• This variable helps in the assignment of addresses.
• It is initialized to the beginning address specified in the START statement.
• After each source statement is processed, the length of the assembled instruction or data
area to be generated is added to LOCCTR.
• Thus, when we reach a label in the source program, the current value of LOCCTR gives the
address to be associated with that label.

Symbol Table (SYMTAB)


• Content
– Includes the label name and value (address) for each label in the source program.
– Includes type and length information (e.g., int64).
– Includes flags to indicate error conditions (e.g., a symbol defined in two places).
• Characteristic
– Dynamic table (i.e., symbols may be inserted, deleted, or searched in the table).
• Implementation
– A hash table can be used to speed up searching. Because variable names may be very similar
(e.g., LOOP1, LOOP2), the selected hash function must perform well with such non-random
keys.

The Pseudo Code for Pass 1


begin
    read first input line
    if OPCODE = 'START' then
        begin
            save #[OPERAND] as starting address
            initialize LOCCTR to starting address
            write line to intermediate file
            read next input line
        end {if START}
    else
        initialize LOCCTR to 0
    while OPCODE != 'END' do
        begin
            if this is not a comment line then
                begin
                    if there is a symbol in the LABEL field then
                        begin
                            search SYMTAB for LABEL
                            if found then
                                set error flag (duplicate symbol)
                            else
                                insert (LABEL, LOCCTR) into SYMTAB
                        end {if symbol}
                    search OPTAB for OPCODE
                    if found then
                        add 3 {instruction length} to LOCCTR
                    else if OPCODE = 'WORD' then
                        add 3 to LOCCTR
                    else if OPCODE = 'RESW' then
                        add 3 * #[OPERAND] to LOCCTR
                    else if OPCODE = 'RESB' then
                        add #[OPERAND] to LOCCTR
                    else if OPCODE = 'BYTE' then
                        begin
                            find length of constant in bytes
                            add length to LOCCTR
                        end {if BYTE}
                    else
                        set error flag (invalid operation code)
                end {if not a comment}
            write line to intermediate file
            read next input line
        end {while not END}
    write last line to intermediate file
    save (LOCCTR - starting address) as program length
end {Pass 1}
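
The Pass 1 logic can be sketched in Python as follows. This is a simplified illustration rather than a full assembler: source statements are assumed to be already split into (label, opcode, operand) tuples, comment lines and the intermediate file are omitted, and OPTAB holds only a few standard SIC opcodes. All function and variable names are hypothetical.

```python
# A small subset of the standard SIC opcode table (assumption: just enough for the example)
OPTAB = {"LDA": 0x00, "STA": 0x0C, "ADD": 0x18, "RSUB": 0x4C}


def byte_constant_length(operand: str) -> int:
    """Length in bytes of a BYTE operand such as C'EOF' or X'F1'."""
    body = operand[2:-1]                     # text between the quotes
    return len(body) if operand[0] == "C" else len(body) // 2


def pass1(source):
    """Return (symtab, start, program_length, errors) for a list of statements."""
    symtab, errors = {}, []
    start = locctr = 0
    lines = list(source)
    if lines and lines[0][1] == "START":
        start = locctr = int(lines[0][2], 16)   # operand is a hex address
        lines = lines[1:]
    for label, opcode, operand in lines:
        if opcode == "END":
            break
        if label:
            if label in symtab:
                errors.append(f"duplicate symbol {label}")
            else:
                symtab[label] = locctr          # label gets the current LOCCTR
        if opcode in OPTAB or opcode == "WORD":
            locctr += 3
        elif opcode == "RESW":
            locctr += 3 * int(operand)
        elif opcode == "RESB":
            locctr += int(operand)
        elif opcode == "BYTE":
            locctr += byte_constant_length(operand)
        else:
            errors.append(f"invalid operation code {opcode}")
    return symtab, start, locctr - start, errors


program = [
    ("SUM", "START", "1000"),
    ("FIRST", "LDA", "FIVE"),
    ("", "ADD", "FIVE"),
    ("", "STA", "ALPHA"),
    ("", "RSUB", ""),
    ("FIVE", "WORD", "5"),
    ("ALPHA", "RESW", "1"),
    ("EOF", "BYTE", "C'EOF'"),
    ("", "END", "FIRST"),
]
symtab, start, length, errors = pass1(program)
print({name: hex(addr) for name, addr in symtab.items()}, hex(length), errors)
```

Note that FIVE and ALPHA are used before they are defined. Pass 1 does not need their values, only the statement lengths, which is why forward references cause no difficulty here.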

The Pseudo Code for Pass 2


begin
    read first input line {from intermediate file}
    if OPCODE = 'START' then
        begin
            write listing line
            read next input line
        end {if START}
    write Header record to object program
    initialize first Text record
    while OPCODE != 'END' do
        begin
            if this is not a comment line then
                begin
                    search OPTAB for OPCODE
                    if found then
                        begin
                            if there is a symbol in the OPERAND field then
                                begin
                                    search SYMTAB for OPERAND
                                    if found then
                                        store symbol value as operand address
                                    else
                                        begin
                                            store 0 as operand address
                                            set error flag (undefined symbol)
                                        end
                                end {if symbol}
                            else
                                store 0 as operand address
                            assemble the object code instruction
                        end {if opcode found}
                    else if OPCODE = 'BYTE' or 'WORD' then
                        convert constant to object code
                    if object code will not fit into the current Text record then
                        begin
                            write Text record to object program
                            initialize new Text record
                        end
                    add object code to Text record
                end {if not comment}
            write listing line
            read next input line
        end {while not END}
    write last Text record to object program
    write End record to object program
    write last listing line
end {Pass 2}
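
A matching Python sketch of Pass 2 is given below, under the same simplifying assumptions as before: standard SIC instructions only, no listing file, a tiny OPTAB, and input statements given as (address, opcode, operand) tuples with START and END already removed. The names are hypothetical, and the End record simply reuses the starting address as the first executable instruction.

```python
OPTAB = {"LDA": 0x00, "STA": 0x0C, "ADD": 0x18, "RSUB": 0x4C}   # assumed subset


def assemble_line(opcode, operand, symtab, errors):
    """Return the object code (hex string) for one statement, or "" if none."""
    if opcode in OPTAB:
        address, indexed = 0, 0
        if operand:
            name = operand
            if operand.endswith(",X"):          # indexed addressing sets the x bit
                name, indexed = operand[:-2], 1
            if name in symtab:
                address = symtab[name]
            else:
                errors.append(f"undefined symbol {name}")
        return f"{(OPTAB[opcode] << 16) | (indexed << 15) | address:06X}"
    if opcode == "WORD":
        return f"{int(operand) & 0xFFFFFF:06X}"
    if opcode == "BYTE":
        body = operand[2:-1]
        return body.encode("ascii").hex().upper() if operand[0] == "C" else body.upper()
    return ""                                   # RESW, RESB: no object code


def pass2(statements, symtab, name, start, length):
    """Return (object program records, errors)."""
    errors = []
    records = [f"H{name:<6}{start:06X}{length:06X}"]
    rec_start, rec_code = None, ""

    def flush():
        nonlocal rec_start, rec_code
        if rec_code:
            records.append(f"T{rec_start:06X}{len(rec_code) // 2:02X}{rec_code}")
        rec_start, rec_code = None, ""

    for address, opcode, operand in statements:
        code = assemble_line(opcode, operand, symtab, errors)
        if not code:
            flush()                             # reserved storage ends the Text record
            continue
        if len(rec_code) + len(code) > 60:      # columns 10-69 hold at most 60 hex digits
            flush()
        if rec_start is None:
            rec_start = address
        rec_code += code
    flush()
    records.append(f"E{start:06X}")
    return records, errors


symtab = {"FIRST": 0x1000, "FIVE": 0x100C, "ALPHA": 0x100F, "EOF": 0x1012}
statements = [
    (0x1000, "LDA", "FIVE"), (0x1003, "ADD", "FIVE"), (0x1006, "STA", "ALPHA"),
    (0x1009, "RSUB", ""), (0x100C, "WORD", "5"), (0x100F, "RESW", "1"),
    (0x1012, "BYTE", "C'EOF'"),
]
records, errors = pass2(statements, symtab, "SUM", 0x1000, 0x15)
print("\n".join(records))
```

Starting a new Text record after RESW mirrors the COPY object program above, where no object code is generated for reserved storage.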

Machine dependent Assembler Features


Assembler Features

• Machine-Dependent Assembler Features
– Instruction formats and addressing modes (SIC/XE)
– Program relocation

• Machine-Independent Assembler Features
– Literals
– Symbol-defining statements
– Expressions
– Program blocks
– Control sections and program linking

A SIC/XE Program


SIC/XE Instruction Formats and Addressing Modes

• PC-relative or Base-relative (BASE directive needs to be used) addressing: op m

• Indirect addressing: op @m

• Immediate addressing: op #c

• Extended format (4 bytes): +op m

• Index addressing: op m,X

• Register-to-register instructions

Relative Addressing Modes

• PC-relative or base-relative addressing mode is preferred over direct addressing mode.

– It can save one byte by using format 3 rather than format 4.

• Reduces program storage space

• Reduces instruction fetch time

– Relocation is easier.

The Differences Between the SIC and SIC/XE Programs

• Register-to-register instructions are used whenever possible to improve execution speed.

– Fetching a value stored in a register is much faster than fetching it from memory.

• Immediate addressing mode is used whenever possible.

– The operand is already included in the fetched instruction, so there is no need to
fetch the operand from memory.

• Indirect addressing mode is used whenever possible.

– Just one instruction, rather than two, is enough.

The Object Code


Generate Relocatable Programs

• Let the assembled program start at address 0 so that it can later be moved easily to
any place in physical memory.

• In fact, as we have learned from virtual memory, every process
(executing program) now has a separate address space starting from 0.

• Assembling register-to-register instructions presents no problems (e.g., lines 125 and
150).

• Register mnemonics need to be converted to their corresponding
register numbers.

• This can easily be done by looking up a name table.

PC or Base-Relative Modes

• Format 3: 12-bit displacement field (3 bytes in total)

– Base-relative: 0 to 4095

– PC-relative: -2048 to 2047

• Format 4: 20-bit address field (4 bytes in total)

• The displacement must be calculated so that, when it is added to PC (which points to the
instruction following the current one once the current instruction has been fetched)
or to the base register (B), the resulting value is the target address.

• If the displacement cannot fit into 12 bits, format 4 must be used (e.g., lines
15 and 125).

– Bit e must be set to indicate format 4.

– The programmer must specify the use of format 4 by putting a + before the
instruction. Otherwise, it is treated as an error.
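
The displacement calculation can be sketched in Python as below. The sketch assumes the common convention of trying PC-relative addressing first and falling back to base-relative addressing only if a BASE value is known. The function names are hypothetical, and the opcode values in the examples (STL = 14, J = 3C, STCH = 54) come from the standard SIC/XE instruction set rather than from these notes.

```python
def choose_displacement(target: int, pc: int, base: int = None):
    """Return (b, p, disp) for format 3, or None if format 4 is required.

    pc is the address of the next instruction; base is the value given by a
    BASE directive, or None when base-relative addressing is not available.
    """
    disp = target - pc
    if -2048 <= disp <= 2047:
        return 0, 1, disp & 0xFFF            # PC-relative, 12-bit two's complement
    if base is not None and 0 <= target - base <= 4095:
        return 1, 0, target - base           # base-relative, unsigned
    return None


def assemble_format3(opcode, target, pc, base=None, n=1, i=1, x=0) -> str:
    """Assemble a format 3 instruction and return it as 6 hex digits."""
    choice = choose_displacement(target, pc, base)
    if choice is None:
        raise ValueError("displacement does not fit in 12 bits; format 4 (+op) is required")
    b, p, disp = choice
    word = ((opcode | (n << 1) | i) << 16) | (x << 15) | (b << 14) | (p << 13) | disp
    return f"{word:06X}"


# STL at address 0000 referring to a word at 0030: PC = 0003, disp = 02D
print(assemble_format3(0x14, target=0x0030, pc=0x0003))
# Backward jump from 0017 to 0006: PC = 001A, disp = -20 (FEC in two's complement)
print(assemble_format3(0x3C, target=0x0006, pc=0x001A))
# Target too far from PC, but within 4095 bytes of the base register value
print(assemble_format3(0x54, target=0x0036, pc=0x1051, base=0x0033, x=1))
```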


Base-Relative v.s. PC-Relative

• The difference between the PC-relative and base-relative addressing modes is that the assembler
knows the value of PC when it uses PC-relative mode to assemble an
instruction. However, when it uses base-relative mode to assemble an
instruction, the assembler does not know the value of the base register.

– Therefore, the programmer must tell the assembler the value of register B.

– This is done through the use of the BASE directive (line 13).

– The programmer must also load the appropriate value into register B
explicitly.

– Another BASE directive can appear later. It tells the assembler to change
its notion of the current value of B.

– NOBASE can also be used to tell the assembler that base-relative
addressing should no longer be used.


Relocatable Is Desired

• The program in Fig. 2.1 specifies that it must be loaded at address 1000 for correct
execution. This restriction is too inflexible for the loader.

• If the program is loaded at a different address, say 2000, its memory references will
access the wrong data. For example:

– 55 101B LDA THREE 00102D

• Thus, we want to make programs relocatable so that they can be loaded and executed
correctly at any place in memory.

Address Modification Is Required

If a hardware relocation register (MMU) is available, software relocation can be avoided
here. However, when multiple object programs are linked together, software relocation is still
needed.


Which Instructions Need to Be Modified?

• Only those instructions that use absolute (direct) addresses to reference symbols.

• The following need not be modified:

– Immediate addressing (no memory references)

– PC-relative or base-relative addressing (relocatability is one advantage of relative
addressing, among others)

– Register-to-register instructions (no memory references)

The Modification Record

• When the assembler generates an address for a symbol, the address inserted
into the instruction is relative to the start of the program.

• The assembler also produces a modification record, in which the address and length
of the address field that needs to be modified are stored.

• The loader, on seeing the record, adds the beginning address of the
loaded program to the address field identified by the record.
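
The loader's side of this scheme can be sketched in Python. The sketch assumes that each modification record gives the offset of the address field from the start of the program and the field length in half-bytes (hex digits), which is the usual SIC/XE convention but is not spelled out in these notes. The function name `relocate` and the sample values are for illustration only.

```python
def relocate(memory: bytearray, mod_records, load_address: int) -> None:
    """Apply modification records to a loaded program image, in place.

    Each record is (offset, half_bytes): offset is the address of the field
    relative to the start of the program, half_bytes its length in hex digits.
    """
    for offset, half_bytes in mod_records:
        nbytes = (half_bytes + 1) // 2
        field = int.from_bytes(memory[offset:offset + nbytes], "big")
        mask = (1 << (4 * half_bytes)) - 1
        # Only the low half_bytes hex digits belong to the address field;
        # any higher bits (e.g. the flag bits of format 4) are preserved.
        new_value = (field & ~mask) | ((field + load_address) & mask)
        memory[offset:offset + nbytes] = new_value.to_bytes(nbytes, "big")


# A format 4 instruction 4B101036 located at relative address 0006.
# Its 20-bit address field starts at relative address 0007 and is 5 half-bytes long.
image = bytearray(10)
image[6:10] = bytes.fromhex("4B101036")
relocate(image, [(7, 5)], load_address=0x5000)
print(image[6:10].hex().upper())
```

After relocation the address field holds 06036 instead of 01036, while the opcode and flag bits are unchanged.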


The Relocatable Object Code


MODULE-2

• Loaders and Linkers: Basic Loader Functions
• Machine-Dependent Loader Features
• Machine-Independent Loader Features
• Loader Design Options
• Implementation Examples

Machine Independent Assembler Features

These are the features that do not depend on the architecture of the machine. They are:
• Literals
• Expressions
• Program blocks
• Control sections

Literals
A literal is defined with the prefix = followed by a specification of the literal value.

Example:

45    001A    ENDFIL    LDA      =C'EOF'    032010

93                      LTORG
      002D    *         =C'EOF'             454F46

The example above shows a 3-byte operand whose value is the character string EOF. The object code
for the instruction is also shown. It contains the displacement, relative to the program counter, of
the location where the literal value is stored. In the example the literal is at location 002D and the
next instruction is at 001D, and hence the displacement value is 010.

As another example, the statement below uses a 1-byte literal with the hexadecimal value
05.

215   1062    WLOOP     TD       =X'05'     E32011

It is important to understand the difference between a constant defined as a literal and a
constant defined as an immediate operand. With a literal, the assembler generates the specified
value as a constant at some other memory location. In immediate mode, the operand value is
assembled as part of the instruction itself. Example:

55    0020              LDA      #03        010003

All the literal operands used in a program are gathered together into one or more literal pools. A
literal pool is usually placed at the end of the program. The assembly listing of a program containing
literals usually includes a listing of this literal pool, which shows the assigned addresses and the
generated data values. In some cases it is desirable to place literals at some other location in the
object program. The assembler directive LTORG is used for this purpose. Whenever LTORG is
encountered, the assembler creates a literal pool that contains all the literal operands used since the
previous LTORG (or since the beginning of the program). The literals are thus defined at the point
where LTORG appears. This makes it possible to keep literals close to the instructions that use them.

A literal table (LITTAB) is created for the literals used in the program. The literal table contains the
literal name, operand value and length, and the address assigned to the literal. The literal table is
usually organized as a hash table, using the literal name or value as the key.

Implementation of Literals:

During Pass-1:

Each literal encountered is searched for in the literal table. If the literal already exists, no action
is taken. If it is not present, the literal is added to LITTAB, but its address is left unassigned until
a LTORG statement (or the end of the program) is reached. When Pass 1 encounters a LTORG statement or
the end of the program, the assembler makes a scan of the literal table. At this time each literal
currently in the table is assigned an address. As addresses are assigned, the location counter is
updated to reflect the number of bytes occupied by each literal.
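
The Pass 1 handling of literals can be sketched in Python as below. The class and method names are hypothetical, and a plain dictionary stands in for the hash table. The first pool address, 002D, matches the =C'EOF' example above. The second address, 1076, is only an assumed value for illustration.

```python
def literal_bytes(literal: str) -> bytes:
    """Value of a literal such as =C'EOF' or =X'05'."""
    kind, body = literal[1], literal[3:-1]     # character after '=', text in quotes
    return body.encode("ascii") if kind == "C" else bytes.fromhex(body)


class Littab:
    def __init__(self):
        # literal name -> {"value": bytes, "address": int or None}
        self.entries = {}

    def reference(self, literal: str) -> None:
        """Called in Pass 1 whenever a literal operand is seen."""
        if literal not in self.entries:        # an existing literal needs no action
            self.entries[literal] = {"value": literal_bytes(literal), "address": None}

    def place_pool(self, locctr: int) -> int:
        """Called at LTORG or END: assign addresses and return the new LOCCTR."""
        for entry in self.entries.values():
            if entry["address"] is None:
                entry["address"] = locctr
                locctr += len(entry["value"])  # LOCCTR advances by the literal length
        return locctr


littab = Littab()
littab.reference("=C'EOF'")
littab.reference("=C'EOF'")                    # duplicate: ignored
locctr = littab.place_pool(0x002D)             # LTORG reached with LOCCTR = 002D
print(hex(littab.entries["=C'EOF'"]["address"]), hex(locctr))

littab.reference("=X'05'")
locctr = littab.place_pool(0x1076)             # end of program (assumed LOCCTR)
print(hex(littab.entries["=X'05'"]["address"]), hex(locctr))
```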

During Pass-2:

The assembler searches LITTAB for each literal operand encountered and uses the address found there
in the instruction. The data values specified by the literals in each literal pool are inserted into
the object program exactly as if they had been generated by BYTE or WORD statements. If a literal
represents an address in the program, the assembler must also generate the appropriate modification
record, because the value is affected by relocation. The following figure shows the difference
between the SYMTAB and LITTAB.

Symbol-Defining Statements:
EQU Statement:

Most assemblers provide an assembler directive that allows the programmer to define symbols and
specify their values. The directive used for this is EQU (equate). The general form of the statement is

Symbol EQU value

This statement defines the given symbol (i.e., enters it in SYMTAB) and assigns to it the
value specified. The value can be a constant or an expression involving constants and any
other symbols that are already defined. One common use is to define symbolic names that can be
used in place of numeric values to improve readability.

For example

+LDT #4096

This loads register T with the immediate value 4096, but it does not show clearly what the
value represents. If a statement is included as:

MAXLEN EQU 4096 and then

+LDT #MAXLEN

then it is clear that the value being loaded is some maximum length. When the
assembler encounters the EQU statement, it enters the symbol MAXLEN along with its value in the
symbol table. When assembling LDT, the assembler searches SYMTAB for the symbol and uses its value
as the operand in the instruction. The object code generated is the same for both versions,
but the second is easier to understand. It is also easier to maintain. If the maximum length is
changed from 4096 to 1024 and the value appears as an immediate operand wherever it is needed,
we have to scan the whole program and make changes wherever 4096 is used. If the value is
referenced through the symbol defined by EQU, we do not have to search the whole
program. We change only the value of MAXLEN in the EQU statement (only once).

ORG Statement:

This directive can be used to assign values to symbols indirectly. The directive is usually called
ORG (for origin). Its general format is:

ORG value

where value is a constant or an expression involving constants and previously defined symbols.

When this statement is encountered during assembly of a program, the assembler resets its location
counter (LOCCTR) to the specified value. Since the values of symbols used as labels are taken from
LOCCTR, the ORG statement affects the values of all labels defined until the next ORG is
encountered. Because ORG controls the assignment of storage in the object program, altering
LOCCTR carelessly may result in incorrect assembly.

ORG can be useful in label definition. Suppose we need to define a symbol table with the following
structure:

SYMBOL    6 bytes
VALUE     3 bytes
FLAGS     2 bytes

The table looks like the one given below.


The SYMBOL field contains a 6-byte user-defined symbol; VALUE is a one-word representation of the value assigned to the symbol; FLAG is a 2-byte field that specifies the symbol type and other information. Each entry therefore occupies 11 bytes, so 1100 bytes hold 100 entries. The space for the table can be reserved by the statement:

STAB RESB 1100

If we want to refer to the entries of the table using indexed addressing, we place the offset of the desired entry from the beginning of the table in the index register. To refer to the fields SYMBOL, VALUE, and FLAGS individually, we first assign their values as shown below:

SYMBOL EQU STAB

VALUE EQU STAB+6

FLAGS EQU STAB+9

To fetch the VALUE field from the table entry indicated by register X, we can write the statement:

LDA VALUE,X

The same thing can also be done using ORG statement in the following way:

STAB RESB 1100

ORG STAB

SYMBOL RESB 6

VALUE RESW 1

FLAG RESB 2

ORG STAB+1100

The first statement allocates 1100 bytes of memory and assigns the label STAB to it. The first ORG statement resets the location counter to the value of STAB, so LOCCTR now points to the beginning of the table. The next three lines define SYMBOL, VALUE and FLAG at the appropriate offsets within an entry. The last ORG statement sets LOCCTR to STAB+1100, the address of the first byte after the table, so that assembly continues past the space reserved for STAB.

While using ORG, any symbol that appears in the operand must already be defined, as is required for the EQU statement. For example, consider the sequence of statements below:

ORG ALPHA


BYTE1 RESB 1

BYTE2 RESB 1

BYTE3 RESB 1

ORG

ALPHA RESB 1

This sequence cannot be processed because ALPHA, the symbol used to set the new location counter value, is not yet defined when the first ORG is encountered. In the first pass the assembler does not know what value to assign to LOCCTR, so the symbols BYTE1, BYTE2 and BYTE3 on the following lines cannot be entered in the symbol table with addresses either. This is a forward-reference problem.

EXPRESSIONS:

Assemblers also allow expressions in place of operands in the instruction. Each such expression must be evaluated to generate a single operand value or address. Assemblers generally allow arithmetic expressions formed according to the normal rules using the operators +, -, *, /. Division is usually defined to produce an integer result. Individual terms may be constants, user-defined symbols, or special terms. The only special term used is * (the current value of the location counter), which indicates the value of the next unassigned memory location. Thus the statement

BUFEND EQU *

assigns to BUFEND the address of the next byte following the buffer area. Some values in the object program are relative to the beginning of the program and some are absolute (independent of the program location, like constants). Hence, expressions are classified as either absolute expressions or relative expressions depending on the type of value they produce.

Absolute Expressions:

An expression that contains only absolute terms is an absolute expression. An absolute expression may also contain relative terms, provided the relative terms occur in pairs and the two terms in each pair have opposite signs.
Example:

MAXLEN EQU BUFEND-BUFFER

In the above statement the difference gives a value that does not depend on the location of the program, so the result is absolute and is unaffected by relocation of the program. An expression may also consist only of absolute terms. Example:

MAXLEN EQU 1000

Relative Expressions: All the relative terms except one can be paired as described under absolute expressions. The remaining unpaired relative term must have a positive sign. Example:

STAB EQU OPTAB + (BUFEND - BUFFER)

Handling the type of expressions: to find the type of an expression, we must keep track of the types of the symbols used. This is done by recording the type (relative or absolute) of each symbol in the symbol table.
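The pairing rule can be checked mechanically. The Python sketch below is only an illustration: the SYMTAB contents, the addresses, and the function name classify are assumptions made for the example, and only the + and - operators are handled.

```python
# Hypothetical symbol table: name -> (value, type),
# where type is "R" (relative) or "A" (absolute).
SYMTAB = {
    "BUFFER": (0x0036, "R"),
    "BUFEND": (0x1036, "R"),
    "OPTAB":  (0x2000, "R"),
    "MAXLEN": (0x1000, "A"),
}

def classify(terms):
    """Evaluate a sum of signed terms and decide whether it is absolute or relative.

    terms is a list of (sign, operand) pairs; sign is +1 or -1 and
    operand is either an integer constant or a symbol name.
    """
    value = 0
    net_relative = 0  # +1 for each positive relative term, -1 for each negative one
    for sign, operand in terms:
        if isinstance(operand, int):
            term_value, term_type = operand, "A"
        else:
            term_value, term_type = SYMTAB[operand]
        value += sign * term_value
        if term_type == "R":
            net_relative += sign
    if net_relative == 0:
        return value, "A"   # all relative terms are paired with opposite signs
    if net_relative == 1:
        return value, "R"   # exactly one unpaired relative term, with positive sign
    raise ValueError("expression is neither absolute nor relative")
```

With these assumed values, BUFEND-BUFFER is classified as absolute and OPTAB+(BUFEND-BUFFER) as relative, while BUFEND+BUFFER is rejected.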


Program Blocks:
Program blocks allow the generated machine instructions and data to appear in the object
program in a different order from the corresponding source statements, for example by using separate blocks for code, data, stack, and large data areas.

Assembler Directive USE:

USE [blockname]

At the beginning, statements are assumed to be part of the unnamed (default) block. If no USE statements are included, the entire program belongs to this single block. Each program block may actually contain several separate segments of the source program. The assembler rearranges these segments to gather together the pieces of each block and assigns addresses, so that the blocks appear in the object program in a particular order. In this way a large buffer area can be moved to the end of the object program, while in the source program the data areas stay close to the statements that reference them, which improves readability.

In the example below three blocks are used:

• Default: executable instructions

• CDATA: all data areas that are short in length

• CBLKS: all data areas that consist of larger blocks of memory


Arranging code into program blocks:

Pass 1

• A separate location counter is maintained for each program block.

• LOCCTR is saved and restored when switching between blocks.

• At the beginning of a block, LOCCTR is set to 0.

• Each label is assigned an address relative to the start of its block.

• The block name or number is stored in SYMTAB along with the relative address of the label.

• At the end of Pass 1, the latest value of LOCCTR for each block gives the length of that block.

• Each block is assigned a starting address in the object program by concatenating the program blocks in a particular order.

Pass 2

• The address of each symbol relative to the start of the object program is calculated by adding the location of the symbol relative to the start of its block to the starting address of that block.
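The two address calculations can be sketched in a few lines of Python. The function names, the block lengths, and the single SYMTAB entry are assumptions chosen for illustration, not values taken from the missing figure.

```python
def assign_block_addresses(block_lengths):
    """End of Pass 1: concatenate blocks in the given order.

    block_lengths maps block name -> length (final LOCCTR of that block).
    Returns a dict mapping block name -> starting address.
    """
    starts = {}
    address = 0
    for name, length in block_lengths.items():
        starts[name] = address
        address += length
    return starts

def absolute_address(symtab, starts, symbol):
    """Pass 2: address of a symbol relative to the start of the object program."""
    block, offset = symtab[symbol]
    return starts[block] + offset

# Assumed lengths for the default block, CDATA and CBLKS.
lengths = {"default": 0x66, "CDATA": 0x0B, "CBLKS": 0x1000}
starts = assign_block_addresses(lengths)

# Assumed SYMTAB entry: MAXLEN is at offset 3 within block CDATA.
symtab = {"MAXLEN": ("CDATA", 0x0003)}
maxlen_address = absolute_address(symtab, starts, "MAXLEN")
```

Here CDATA starts immediately after the default block, so the offset 0x0003 becomes address 0x0069 in the object program.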

Control Sections:
A control section is a part of the program that maintains its identity after assembly; each
control section can be loaded and relocated independently of the others. Different control sections
are most often used for subroutines or other logical subdivisions. The programmer can assemble,
load, and manipulate each of these control sections separately.

Because of this, there should be some means for linking control sections together. For
example, instructions in one control section may refer to the data or instructions of other control
sections. Since control sections are independently loaded and relocated, the assembler is unable to
process these references in the usual way. Such references between different control sections are
called external references.

The assembler generates information about each external reference that allows the loader to perform the required linking. When a program is written using multiple control sections, the beginning of each control section is indicated by the assembler directive CSECT.

The syntax:

secname CSECT

A separate location counter is maintained for each control section.

Control sections differ from program blocks in that they are handled separately by the
assembler. Symbols that are defined in one control section may not be used directly in another control
section; they must be identified as external references for the loader to handle. External
references are indicated by two assembler directives:

EXTDEF (external Definition):


This statement names symbols that are defined in this control section and may be used by other control sections. Control section names do not need to be named in an EXTDEF statement, because they are automatically considered to be external symbols.

EXTREF (external Reference):

It names symbols that are used in this section but are defined in some other control section.

The order in which these symbols are listed is not significant. The assembler must include proper
information about the external references in the object program that will cause the loader to insert
the proper value where they are required.


For this purpose the assembler uses two new record types in the object program, Define and Refer, and a revised version of the Modification record.

Define record (EXTDEF)

Col. 1 D

Col. 2-7 Name of external symbol defined in this control section

Col. 8-13 Relative address within this control section (hexadecimal)

Col.14-73 Repeat information in Col. 2-13 for other external symbols

Refer record (EXTREF)

Col. 1 R

Col. 2-7 Name of external symbol referred to in this control section

Col. 8-73 Name of other external reference symbols


Modification record (revised)

Col. 1 M

Col. 2-7 Starting address of the field to be modified, relative to the beginning of the control section (hexadecimal)

Col. 8-9 Length of the field to be modified, in half-bytes (hexadecimal)

Col. 10 Modification flag (+ or -)

Col. 11-16 External symbol whose value is to be added to or subtracted from the indicated field

A define record gives information about the external symbols that are defined in this
control section, i.e., symbols named by EXTDEF.

A refer record lists the symbols that are used as external references by the control section,
i.e., symbols named by EXTREF.

The new items in the modification record specify the modification to be performed: adding
or subtracting the value of some external symbol. The symbol used for modification may be defined
either in this control section or in another section.

The object program is shown below. There is a separate object program for each of the control
sections. In the Define Record and refer record the symbols named in EXTDEF and EXTREF are
included.

In the case of Define, the record also indicates the relative address of each external symbol within
the control section.

For EXTREF symbols, no address information is available. These symbols are simply named in the
Refer record.
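As a rough illustration of the three record layouts, the Python sketch below formats one record of each kind. The function names and the sample symbols and addresses are assumptions for the example; splitting long symbol lists across several records is not handled.

```python
def define_record(symbols):
    """Build a Define record from a list of (name, relative_address) pairs.

    Each symbol takes 6 columns for the name and 6 hexadecimal columns
    for its relative address within the control section.
    """
    return "D" + "".join(f"{name:<6}{addr:06X}" for name, addr in symbols)

def refer_record(names):
    """Build a Refer record: names only, 6 columns each, no addresses."""
    return "R" + "".join(f"{name:<6}" for name in names)

def modification_record(address, half_bytes, flag, symbol):
    """Build a revised Modification record.

    address    -- start of the field to modify (relative to the section)
    half_bytes -- length of the field in half-bytes
    flag       -- "+" to add the symbol's value, "-" to subtract it
    """
    return f"M{address:06X}{half_bytes:02X}{flag}{symbol}"
```

For example, modification_record(0x000004, 5, "+", "RDREC") yields M00000405+RDREC, which asks the loader to add the value of RDREC to the 5 half-byte field starting at relative address 000004.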


Assembler Design Options


One and Multi-Pass Assembler

• So far, we have presented the design and implementation of a two-pass assembler.

• Here, we present the design and implementation of

– One-pass assembler

• Used when avoiding a second pass over the source program is necessary or desirable.

– Multi-pass assembler

• Allows forward references during symbol definition.

One-Pass Assembler

• The main problem is forward references.

• Forward references to data items can be eliminated easily.

– Simply require the programmer to define variables before using them.

• However, forward references to instructions cannot be eliminated as easily.

– Sometimes a program needs a forward jump.

– Restricting a program to backward jumps only is too limiting.


• There are two types of one-pass assembler:

– One type produces object code directly in memory for immediate execution

• No loader is needed

• Load-and-go, useful for program development and testing

• Good for a computing center where most students reassemble their programs
each time

• Saves the time needed to scan the source code a second time

– The other type produces the usual kind of object program for later execution

Internal Implementation


• The assembler generates object code instructions as it scans the source program.

• If an instruction operand is a symbol that has not yet been defined, the operand address is
omitted when the instruction is assembled.

• The symbol used as an operand is entered into the symbol table.

• This entry is flagged to indicate that the symbol is not yet defined.

• The address of the operand field of the instruction that refers to the undefined symbol is
added to a list of forward references associated with the symbol table entry.

• When the definition of the symbol is encountered, the forward reference list for that symbol
is scanned, and the proper address is inserted into any instruction previously generated.
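The forward-reference list can be sketched as follows. This is a simplified Python model, not the assembler's actual data layout: the class name, the method names, and the use of a dictionary to stand in for memory (operand-field address to stored value) are all assumptions made for illustration.

```python
class OnePassSymtab:
    """Symbol table for a load-and-go assembler with forward-reference lists."""

    def __init__(self):
        self.address = {}   # symbol -> address, once the symbol is defined
        self.pending = {}   # undefined symbol -> operand-field addresses to patch

    def reference(self, symbol, operand_field, memory):
        """An instruction whose operand field is at operand_field uses symbol."""
        if symbol in self.address:
            memory[operand_field] = self.address[symbol]
        else:
            memory[operand_field] = 0   # address omitted for now
            self.pending.setdefault(symbol, []).append(operand_field)

    def define(self, symbol, addr, memory):
        """The definition of symbol is encountered: patch every waiting field."""
        self.address[symbol] = addr
        for operand_field in self.pending.pop(symbol, []):
            memory[operand_field] = addr

    def undefined(self):
        """Symbols still undefined at the end of the program (errors)."""
        return sorted(self.pending)
```

A reference made before the definition leaves a zero in the operand field; the later call to define fills in the real address and removes the symbol from the pending list.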


• Between scanning line 40 and 160:

– On line 45, when the symbol ENDFIL is defined, the assembler places its value in the
SYMTAB entry.

– The assembler then inserts this value into the instruction operand field (at address
201C).

– From this point on, any references to ENDFIL would not be forward references and
would not be entered into a list.

• At the end of the processing of the program, any SYMTAB entries that are still marked with *
indicate undefined symbols.

– These should be flagged by the assembler as errors.

Multi-Pass Assembler

• If we use a two-pass assembler, the following symbol definition cannot be allowed.

ALPHA EQU BETA

BETA EQU DELTA

DELTA RESW 1

• This is because ALPHA and BETA cannot be defined in pass 1. Actually, if we allow multi-pass
processing, DELTA is defined in pass 1, BETA is defined in pass 2, and ALPHA is defined in
pass 3, and the above definitions can be allowed.

• This is the motivation for using a multi-pass assembler.


• It is unnecessary for a multi-pass assembler to make more than two passes over the entire
program.

• Instead, only the parts of the program involving forward references need to be processed in
multiple passes.

• The method presented here can be used to process any kind of forward references.

Multi-Pass Assembler Implementation

Steps:

• Use a symbol table to store symbols that are not totally defined yet.

• For an undefined symbol, we keep in its entry:

– The names and the number of undefined symbols that contribute to the
calculation of its value.

– A list of the symbols whose values depend on the value of this
symbol.

• When a symbol becomes defined, we use its value to reevaluate the values of all of the
symbols that are kept in this list.

• The above step is performed recursively.
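The steps above can be sketched for the simplest case, where each EQU operand is a single symbol or a constant. The function name resolve and the address used for DELTA are assumptions for the example; a real assembler would also handle expressions with several undefined symbols.

```python
def resolve(definitions):
    """Resolve chained symbol definitions in one scan of the source.

    definitions is a list of (symbol, operand) pairs in source order;
    operand is either an integer value or the name of another symbol.
    Returns a dict mapping each defined symbol to its value.
    """
    values = {}
    waiting = {}  # undefined symbol -> symbols whose values depend on it

    def define(symbol, value):
        values[symbol] = value
        # Re-evaluate every symbol that was waiting for this one (recursively).
        for dependent in waiting.pop(symbol, []):
            define(dependent, value)

    for symbol, operand in definitions:
        if isinstance(operand, int):
            define(symbol, operand)
        elif operand in values:
            define(symbol, values[operand])
        else:
            waiting.setdefault(operand, []).append(symbol)
    return values

# ALPHA EQU BETA / BETA EQU DELTA / DELTA RESW 1 (address assumed to be 0x1000)
result = resolve([("ALPHA", "BETA"), ("BETA", "DELTA"), ("DELTA", 0x1000)])
```

When DELTA becomes defined, BETA is resolved from its dependency list, which in turn resolves ALPHA, without rescanning the whole program.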


MODULE-3
Lexical Analysis
• Role of lexical analyzer

• Specification of tokens

• Recognition of tokens

• Lexical analyzer generator

• Finite automata

• Design of lexical analyzer generator

The role of lexical analyzer

Why separate lexical analysis from parsing?

1. Simplicity of design

2. Improving compiler efficiency

3. Enhancing compiler portability

Tokens, Patterns and Lexemes

• A token is a pair consisting of a token name and an optional attribute value

• A pattern is a description of the form that the lexemes of a token may take

• A lexeme is a sequence of characters in the source program that matches the pattern for a
token

Example

• Attributes for tokens


E = M * C ** 2
<id, pointer to symbol table entry for E>
<assign-op>
<id, pointer to symbol table entry for M>
<mult-op>
<id, pointer to symbol table entry for C>
<exp-op>
<number, integer value 2>
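The token stream above can be reproduced with a small pattern-driven scanner. This Python sketch is an assumption-laden illustration: the pattern list covers only the tokens of this one statement, and the lexeme of an identifier stands in for the pointer to its symbol table entry.

```python
import re

# Patterns are tried in order; "exp-op" must come before "mult-op"
# so that ** is not read as two separate * tokens.
TOKEN_SPEC = [
    ("number",    re.compile(r"[0-9]+")),
    ("id",        re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("exp-op",    re.compile(r"\*\*")),
    ("mult-op",   re.compile(r"\*")),
    ("assign-op", re.compile(r"=")),
]
WHITESPACE = re.compile(r"\s+")

def tokenize(text):
    """Return a list of (token name, attribute) pairs for text."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:                      # skip blanks between tokens
            pos = ws.end()
            continue
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                lexeme = m.group()
                if name == "id":
                    tokens.append((name, lexeme))       # stand-in for a symbol table pointer
                elif name == "number":
                    tokens.append((name, int(lexeme)))  # integer value as attribute
                else:
                    tokens.append((name, None))         # operators carry no attribute
                pos = m.end()
                break
        else:
            raise SyntaxError(f"no token pattern matches at position {pos}")
    return tokens
```

Calling tokenize("E = M * C ** 2") produces the seven tokens listed above, in the same order.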

• Lexical errors

Some errors are beyond the power of the lexical analyzer to recognize, for example:

fi (a == f(x)) …

Here fi is a valid identifier, so the lexical analyzer cannot tell that it is a misspelling of the keyword if. However, it may be able to recognize errors like:

d = 2r

Such errors are recognized when no token pattern matches a character sequence.

• Error recovery

1. Panic mode: successive characters are ignored until a well-formed token is reached

2. Delete one character from the remaining input

3. Insert a missing character into the remaining input

4. Replace a character by another character

5. Transpose two adjacent characters

• Input buffering

Sentinels

• Specification of tokens

1. In the theory of compilation, regular expressions are used to formalize the specification of tokens

2. Regular expressions are a means of specifying regular languages

3. Example: letter_ (letter_ | digit)*

4. Each regular expression is a pattern specifying the form of strings

• Regular expressions

1. Ɛ is a regular expression, L(Ɛ) = {Ɛ}

2. If a is a symbol in ∑ then a is a regular expression, L(a) = {a}

3. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)

4. (r)(s) is a regular expression denoting the language L(r)L(s)

5. (r)* is a regular expression denoting (L(r))*

6. (r) is a regular expression denoting L(r)

• Regular definitions

d1 -> r1
d2 -> r2
…
dn -> rn

Example:

letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*

• Extensions

One or more instances: (r)+

Zero or one instance: r?

Character classes: [abc]

Example:

letter_ -> [A-Za-z_]

digit -> [0-9]

id -> letter_ (letter_ | digit)*

• Recognition of tokens

The starting point is the language grammar, which tells us what the tokens are:

stmt -> if expr then stmt
      | if expr then stmt else stmt

expr -> term relop term
      | term
term -> id
      | number

The next step is to formalize the patterns:

digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>

We also need to handle whitespace:

ws -> (blank | tab | newline)+
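These patterns translate almost directly into the regular expression syntax of a programming language. The following Python sketch uses the standard re module; the constant names are assumptions, and the longer relational operators are listed first so that <= is not matched as < alone.

```python
import re

DIGITS = r"[0-9]+"

# number -> digits (. digits)? (E [+-]? digits)?
NUMBER = re.compile(rf"{DIGITS}(\.{DIGITS})?(E[+-]?{DIGITS})?")

# id -> letter (letter | digit)*, where letter includes the underscore
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

# relop -> < | > | <= | >= | = | <>
RELOP = re.compile(r"<=|>=|<>|<|>|=")

# ws -> (blank | tab | newline)+
WS = re.compile(r"[ \t\n]+")
```

For instance, NUMBER.fullmatch("6.02E+23") succeeds, while ID.fullmatch("2r") fails because an identifier cannot start with a digit.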

• Transition diagrams

• Transition diagram for whitespace

• Transition diagram for unsigned numbers

Architecture of a transition-diagram-based lexical analyzer

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0: c = nextchar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail(); /* lexeme is not a relop */
                break;
        case 1: …
        case 8: retract();
                retToken.attribute = GT;
                return(retToken);
        }
    }
}
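A complete version of this state machine can be written out as follows. The Python sketch assumes the usual numbering of the relop transition diagram (states 0 to 8) and the attribute names LT, LE, EQ, NE, GT, GE; instead of a global input pointer it takes the text and a position, and "retract" is modeled by simply not consuming the lookahead character.

```python
def get_relop(text, pos=0):
    """Recognize a relational operator starting at text[pos].

    Returns (attribute, next_pos), or None if the lexeme is not a relop.
    """
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                state = 5
            elif c == ">":
                state = 6
            else:
                return None              # fail(): lexeme is not a relop
            pos += 1
        elif state == 1:                 # seen "<"
            if c == "=":
                return ("LE", pos + 1)   # state 2
            if c == ">":
                return ("NE", pos + 1)   # state 3
            return ("LT", pos)           # state 4: retract, lookahead not consumed
        elif state == 5:                 # seen "="
            return ("EQ", pos)
        elif state == 6:                 # seen ">"
            if c == "=":
                return ("GE", pos + 1)   # state 7
            return ("GT", pos)           # state 8: retract, lookahead not consumed
```

For the input ">x" the function returns ("GT", 1): the character x was examined but left in the input for the next token.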

• Finite Automata

• Regular expressions = specification

• Finite automata = implementation

• A finite automaton consists of

o An input alphabet ∑

o A set of states S

o A start state n

o A set of accepting states F ⊆ S

o A set of transitions state →(input) state

• Transition

s1 →a s2

• Is read

In state s1 on input "a" go to state s2

• If end of input

• If in accepting state => accept, otherwise => reject

• If no transition possible => reject

Example

• Alphabet still { 0, 1 }

The operation of the automaton is not completely defined by the input

On input "11" the automaton could be in either state
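The remark that the automaton "could be in either state" can be made concrete by tracking the set of all states it might be in. The Python sketch below simulates an automaton without Ɛ-moves; the two-state machine used in the example is hypothetical (it is not the one from the missing figure) and accepts strings over {0, 1} that end in 1.

```python
def accepts(transitions, start, accepting, text):
    """Simulate a finite automaton by tracking the set of possible states.

    transitions maps (state, input symbol) -> set of next states.
    For a deterministic automaton every such set has at most one element.
    """
    current = {start}
    for symbol in text:
        current = {t for s in current for t in transitions.get((s, symbol), ())}
        if not current:
            return False          # no transition possible => reject
    return bool(current & accepting)   # accept iff some possible state is accepting

# Hypothetical automaton: in state A, input 1 may stay in A or move to B.
TRANSITIONS = {
    ("A", "0"): {"A"},
    ("A", "1"): {"A", "B"},
}
ACCEPTING = {"B"}
```

After reading "11" the set of possible states is {A, B}; since B is accepting, the input is accepted.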

MODULE-4


• Syntax Analysis: Introduction
• Role of Parsers, Context Free Grammars
• Writing a grammar
• Top Down Parsers
• Bottom-Up Parsers
• Operator-Precedence Parsing

The role of parser

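Uses of grammars

Left-recursive expression grammar (suitable for bottom-up parsing):
E -> E + T | T
T -> T * F | F
F -> (E) | id

Equivalent grammar without left recursion (suitable for top-down parsing):
E -> TE'
E' -> +TE' | Ɛ
T -> FT'
T' -> *FT' | Ɛ
F -> (E) | id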

Error handling

• Common programming errors

– Lexical errors

– Syntactic errors

– Semantic errors

– Logical errors

• Error handler goals

– Report the presence of errors clearly and accurately

– Recover from each error quickly enough to detect subsequent errors

– Add minimal overhead to the processing of correct programs

Context free grammars

• Terminals

• Nonterminals

• Start symbol

• Productions

Derivations

• Productions are treated as rewriting rules to generate a string

• Rightmost and leftmost derivations

• E -> E + E | E * E | -E | (E) | id

• Derivation of -(id+id):

• E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)

Parse trees

• Parse tree for -(id+id), corresponding to the derivation

• E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)

Elimination of ambiguity


Elimination of left recursion

• A grammar is left recursive if it has a nonterminal A such that there is a derivation A =>+ Aα for some string α

• Top-down parsing methods cannot handle left-recursive grammars

• A simple rule for eliminating direct left recursion: a rule of the form

• A -> A α | β

• may be replaced with

• A -> β A'

• A' -> α A' | ɛ
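The replacement rule can be applied mechanically. In the Python sketch below, a production body is a list of symbols and the empty list stands for ɛ; the function name and this representation are assumptions for illustration, and only direct (immediate) left recursion is handled.

```python
def eliminate_direct_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b A', A' -> a A' | epsilon.

    alternatives is a list of production bodies (lists of symbols).
    Returns a dict mapping each nonterminal to its new list of bodies.
    """
    # The alphas: what follows A in each left-recursive alternative.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # The betas: alternatives that do not start with A.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}   # nothing to do
    new_nt = nonterminal + "'"
    return {
        nonterminal: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],  # [] is epsilon
    }
```

Applied to E -> E + T | T, this produces E -> T E' and E' -> + T E' | ɛ.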

Left factoring

• Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive or top-down parsing.

• Consider the following grammar:

• stmt -> if expr then stmt else stmt

• | if expr then stmt

• On seeing the input token if, it is not clear to the parser which production to use

• We can easily perform left factoring:

• If we have A -> αβ1 | αβ2 then we replace it with

• A -> αA'

• A' -> β1 | β2

TOP DOWN PARSING

A top-down parser tries to create a parse tree from the root towards the leaves, scanning the
input from left to right

It can also be viewed as finding a leftmost derivation for an input string

Example: id+id*id

E -> TE’

E’ -> +TE’ | Ɛ

T -> FT’

T’ -> *FT’ | Ɛ

F -> (E) | id


Recursive descent parsing

Consists of a set of procedures, one for each nonterminal

Execution begins with the procedure for start symbol

A typical procedure for a non-terminal

void A() {
    choose an A-production, A -> X1 X2 … Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else /* an error has occurred */;
    }
}

Example


S->cAd

A->ab | a

Input: cad
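For this grammar the first alternative A -> ab fails on the input cad, so the parser must back up and try A -> a. The Python sketch below is one way to express that backtracking; the function names and the use of a generator to enumerate the alternatives are assumptions of this example.

```python
def parse_A(text, pos):
    """A -> ab | a. Yield every position where A could end if it starts at pos."""
    if text.startswith("ab", pos):   # first alternative
        yield pos + 2
    if text.startswith("a", pos):    # second alternative, tried on backtracking
        yield pos + 1

def parse_S(text):
    """S -> c A d. Return True if the whole text can be derived from S."""
    if not text.startswith("c"):
        return False
    for after_A in parse_A(text, 1):
        # After A we need exactly one d and then the end of the input.
        if text.startswith("d", after_A) and after_A + 1 == len(text):
            return True
    return False
```

On cad, parse_A cannot match ab and yields only the position after a, after which d is matched and the parse succeeds.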

First and Follow

• First(α) is the set of terminals that begin strings derived from α

• If α =>* ɛ then ɛ is also in First(α)

• In predictive parsing, when we have A -> α | β and First(α) and First(β) are disjoint sets, we can select the appropriate A-production by looking at the next input symbol

• Follow(A), for any nonterminal A, is the set of terminals a that can appear immediately after A in some sentential form

• If we have S =>* αAaβ for some α and β then a is in Follow(A)

• If A can be the rightmost symbol in some sentential form, then $ is in Follow(A)

Computing First

• To compute First(X) for all grammar symbols X, apply the following rules until no more terminals or ɛ can be added to any First set:

1. If X is a terminal then First(X) = {X}.

2. If X is a nonterminal and X -> Y1 Y2 … Yk is a production for some k >= 1, then place a in First(X) if for some i, a is in First(Yi) and ɛ is in all of First(Y1), …, First(Yi-1), that is, Y1 … Yi-1 =>* ɛ. If ɛ is in First(Yj) for all j = 1, …, k, then add ɛ to First(X).

3. If X -> ɛ is a production then add ɛ to First(X).

Computing Follow

• To compute Follow(A) for all nonterminals A, apply the following rules until nothing can be added to any Follow set:

1. Place $ in Follow(S), where S is the start symbol.

2. If there is a production A -> αBβ, then everything in First(β) except ɛ is in Follow(B).

3. If there is a production A -> αB, or a production A -> αBβ where First(β) contains ɛ, then everything in Follow(A) is in Follow(B).
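Both sets of rules are fixed-point computations, which the following Python sketch makes explicit for the non-left-recursive expression grammar. The grammar encoding (a dict of bodies, with the empty list for ɛ) and all function names are assumptions made for this example.

```python
EPS = "ɛ"
END = "$"

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_of_string(symbols, first):
    """First set of a string of grammar symbols, given the First sets so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal: First(a) = {a}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)   # every symbol can derive epsilon (or the string is empty)
    return result

def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                      # repeat until nothing can be added
        changed = False
        for nt, bodies in grammar.items():
            for body in bodies:
                new = first_of_string(body, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)              # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue        # Follow is defined for nonterminals only
                    rest = first_of_string(body[i + 1:], first)
                    new = rest - {EPS}              # rule 2
                    if EPS in rest:
                        new |= follow[head]         # rule 3
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
```

For this grammar the sketch gives First(E) = {(, id}, First(E') = {+, ɛ}, Follow(E) = {), $} and Follow(F) = {+, *, ), $}.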

LL(1) Grammars

Predictive parsers are recursive descent parsers that need no backtracking

Grammars for which we can create predictive parsers are called LL(1)

The first L means scanning the input from left to right

The second L means producing a leftmost derivation

And 1 stands for using one input symbol of lookahead

A grammar G is LL(1) if and only if whenever A -> α | β are two distinct productions of G,
the following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a

2. At most one of α and β can derive the empty string

3. If α =>* ɛ then β does not derive any string beginning with a terminal in Follow(A); likewise with the roles of α and β exchanged

Construction of predictive parsing table

For each production A -> α in the grammar do the following:

1. For each terminal a in First(α), add A -> α to M[A,a]

2. If ɛ is in First(α), then for each terminal b in Follow(A), add A -> α to M[A,b]. If ɛ is in First(α) and $ is in Follow(A), add A -> α to M[A,$] as well

If after performing the above there is no production in M[A,a], then set M[A,a] to error.

Example


Non-recursive predictive parsing

Predictive parsing algorithm

set ip to point to the first symbol of w;
set X to the top stack symbol;
while (X != $) { /* stack is not empty */
    let a be the input symbol pointed to by ip;
    if (X is a) pop the stack and advance ip;
    else if (X is a terminal) error();
    else if (M[X,a] is an error entry) error();
    else if (M[X,a] = X -> Y1 Y2 … Yk) {
        output the production X -> Y1 Y2 … Yk;
        pop the stack;
        push Yk, …, Y2, Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
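A runnable version of this loop is sketched below in Python for the non-left-recursive expression grammar. The parsing table is written out by hand for this example (an empty list stands for an ɛ body); the names TABLE, NONTERMINALS and predictive_parse are assumptions.

```python
# Parsing table M for: E -> TE', E' -> +TE' | ɛ, T -> FT', T' -> *FT' | ɛ, F -> (E) | id
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"], ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens, start="E"):
    """Return the productions of a leftmost derivation, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]
    stack = ["$", start]
    ip = 0
    output = []
    while stack[-1] != "$":
        X, a = stack[-1], tokens[ip]
        if X == a:                          # terminal on top matches the input
            stack.pop()
            ip += 1
        elif X not in NONTERMINALS:
            raise SyntaxError(f"expected {X!r} but found {a!r}")
        elif (X, a) not in TABLE:           # error entry
            raise SyntaxError(f"unexpected {a!r} while expanding {X}")
        else:
            body = TABLE[(X, a)]
            output.append((X, body))
            stack.pop()
            stack.extend(reversed(body))    # push Yk..Y1 so that Y1 ends up on top
    if tokens[ip] != "$":
        raise SyntaxError("input remains after the start symbol was fully expanded")
    return output
```

For the token list id + id * id the parser outputs eleven productions, beginning with E -> TE', which together form a leftmost derivation of the input.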


BOTTOM-UP PARSING

Shift-reduce parser

The general idea is to shift input symbols onto the stack until a reduction can be
applied

At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of the production

The key decisions during bottom-up parsing are about when to reduce and about what
production to apply

A reduction is the reverse of a step in a derivation

The goal of a bottom-up parser is to construct a derivation in reverse; for the input id*id:

E => T => T*F => T*id => F*id => id*id

Handle pruning

• A handle is a substring that matches the body of a production and whose reduction represents one step along the reverse of a rightmost derivation
Shift reduce parsing (cont.)

Basic operations: Shift, Reduce, Accept, Error

Example: id*id

LR Parsing

The most prevalent type of bottom-up parser

LR(k); we are mostly interested in parsers with k <= 1

Why LR parsers?

Table driven

Can be constructed to recognize virtually all programming language constructs

Most general non-backtracking shift-reduce parsing method

Can detect a syntactic error as soon as it is possible to do so

The class of grammars for which we can construct LR parsers is a superset of the class for
which we can construct LL parsers

States of an LR parser

States represent sets of items

An LR(0) item of G is a production of G with a dot at some position of the body:

For A->XYZ we have following items

A->.XYZ

A->X.YZ

A->XY.Z

A->XYZ.

In a state having A->.XYZ we hope to see a string derivable from XYZ next on the
input.

What about A->X.YZ?

Constructing canonical LR(0) item sets

Augmented grammar:

G with addition of a production: S’->S

Closure of item sets:

If I is a set of items, closure(I) is the set of items constructed from I by the following rules:

1. Add every item in I to closure(I)

2. If A->α.Bβ is in closure(I) and B->γ is a production, then add the item B->.γ
to closure(I) if it is not already there. Apply this rule until no more new items can be added.

Example:

E' -> E

E -> E + T | T

T -> T * F | F

F -> (E) | id


Closure algorithm

SetOfItems CLOSURE(I) {
    J = I;
    repeat
        for (each item A -> α.Bβ in J)
            for (each production B -> γ of G)
                if (B -> .γ is not in J)
                    add B -> .γ to J;
    until no more items are added to J on one round;
    return J;
}

GOTO Algorithm

SetOfItems GOTO(I, X) {
    J = empty;
    for (each item A -> α.Xβ in I)
        add A -> αX.β to J;
    return CLOSURE(J);
}

Canonical LR(0) items

void items(G') {
    C = { CLOSURE({[S' -> .S]}) };
    repeat
        for (each set of items I in C)
            for (each grammar symbol X)
                if (GOTO(I,X) is not empty and not in C)
                    add GOTO(I,X) to C;
    until no new sets of items are added to C on a round;
}
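The three procedures CLOSURE, GOTO and items can be sketched together in Python for the augmented expression grammar. The representation of an item as a pair (production index, dot position) and the function names are assumptions of this example.

```python
# Augmented expression grammar; production 0 is E' -> E.
GRAMMAR = [
    ("E'", ("E",)),
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]
NONTERMINALS = {head for head, _ in GRAMMAR}

def closure(items):
    """items is a set of (production index, dot position) pairs."""
    result = set(items)
    work = list(items)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        if dot < len(body) and body[dot] in NONTERMINALS:
            # The dot is before nonterminal B: add B -> .γ for every B-production.
            for index, (head, _) in enumerate(GRAMMAR):
                if head == body[dot] and (index, 0) not in result:
                    result.add((index, 0))
                    work.append((index, 0))
    return frozenset(result)

def goto(items, symbol):
    """Move the dot over symbol in every item where that is possible."""
    moved = {(prod, dot + 1) for prod, dot in items
             if dot < len(GRAMMAR[prod][1]) and GRAMMAR[prod][1][dot] == symbol}
    return closure(moved)

def canonical_collection():
    """Build the canonical collection of sets of LR(0) items."""
    symbols = {sym for _, body in GRAMMAR for sym in body}
    start = closure({(0, 0)})
    collection = [start]
    work = [start]
    while work:
        state = work.pop()
        for symbol in symbols:
            target = goto(state, symbol)
            if target and target not in collection:
                collection.append(target)
                work.append(target)
    return collection
```

For this grammar the initial set contains seven items and the collection contains twelve sets, usually labelled I0 to I11 (the order in which this sketch discovers them may differ from the textbook numbering).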


LR parsing algorithm

let a be the first symbol of w$;
while (1) { /* repeat forever */
    let s be the state on top of the stack;
    if (ACTION[s,a] = shift t) {
        push t onto the stack;
        let a be the next input symbol;
    } else if (ACTION[s,a] = reduce A -> β) {
        pop |β| symbols off the stack;
        let state t now be on top of the stack;
        push GOTO[t,A] onto the stack;
        output the production A -> β;
    } else if (ACTION[s,a] = accept) break; /* parsing is done */
    else call error-recovery routine;
}

Constructing SLR parsing table

Method

1. Construct C = {I0, I1, …, In}, the collection of sets of LR(0) items for G'

2. State i is constructed from Ii. The parsing actions for state i are determined as follows:

a. If [A->α.aβ] is in Ii and GOTO(Ii,a) = Ij, then set ACTION[i,a] to "shift j" (a must be a terminal)

b. If [A->α.] is in Ii, then set ACTION[i,a] to "reduce A->α" for all a in Follow(A) (A may not be S')

c. If [S'->S.] is in Ii, then set ACTION[i,$] to "accept"

If any conflicting actions result from these rules, we say that the grammar is not SLR(1).

3. If GOTO(Ii,A) = Ij then GOTO[i,A] = j

4. All entries not defined by the above rules are made "error"

5. The initial state of the parser is the one constructed from the set of items
containing [S'->.S]


MODULE-5

• Syntax Directed Translation
• Intermediate code generation
• Code generation

Introduction

• We can associate information with a language construct by attaching attributes to the grammar symbols.

• A syntax directed definition specifies the values of attributes by associating semantic rules with the grammar productions.

Ordering the evaluation of attributes

If the dependency graph has an edge from node M to node N, then the attribute corresponding to M must be evaluated before the
attribute of N

Thus the only allowable orders of evaluation are those sequences of nodes N1, N2, …, Nk
such that if there is an edge from Ni to Nj then i < j

Such an ordering is called a topological sort of the graph
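An evaluation order can be obtained with any topological sort routine. The Python sketch below uses graphlib.TopologicalSorter from the standard library (Python 3.9 or later); the function name and the attribute names in the example edge list are assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter, CycleError

def evaluation_order(edges):
    """Return one allowable order in which to evaluate the attributes.

    edges is an iterable of (M, N) pairs meaning that M must be
    evaluated before N. Raises CycleError if the graph has a cycle,
    in which case no evaluation order exists.
    """
    sorter = TopologicalSorter()
    for before, after in edges:
        sorter.add(after, before)   # add(node, *predecessors)
    return list(sorter.static_order())

# Hypothetical dependency edges between attribute instances.
EDGES = [("F.val", "T'.inh"), ("T'.inh", "T'.syn"), ("T'.syn", "T.val")]
ORDER = evaluation_order(EDGES)
```

Any order returned places each attribute after all the attributes it depends on; a cyclic dependency graph is reported as an error instead.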

S-Attributed definitions

An SDD is S-attributed if every attribute is synthesized

We can evaluate the attributes of an S-attributed definition by a post-order traversal of the parse tree:

postorder(N) {
    for (each child C of N, from the left) postorder(C);
    evaluate the attributes associated with node N;
}

S-attributed definitions can be implemented during bottom-up parsing without the need
to explicitly create parse trees

L-Attributed definitions

• An SDD is L-attributed if the edges in its dependency graph go from left to right but
not from right to left.

• More precisely, each attribute must be either

– Synthesized, or

– Inherited, but if there is a production A -> X1 X2 … Xn and there is an inherited attribute Xi.a computed by a rule associated with this production, then the
rule may only use:

• Inherited attributes associated with the head A

• Either inherited or synthesized attributes associated with the occurrences of symbols X1, X2, …, Xi-1 located to the left of Xi

• Inherited or synthesized attributes associated with this occurrence of Xi itself, but in such a way that there is no cycle in the graph

Application of Syntax Directed Translation

• Construction of syntax trees

• Leaf nodes: Leaf(op, val)

• Interior node: Node(op, c1, c2, …, ck)

Example:

Production        Semantic rule

E -> E1 + T       E.node = new Node('+', E1.node, T.node)

E -> E1 - T       E.node = new Node('-', E1.node, T.node)

E -> T            E.node = T.node

T -> (E)          T.node = E.node

T -> id           T.node = new Leaf(id, id.entry)

T -> num          T.node = new Leaf(num, num.val)

Syntax tree for L-attributed definition

Syntax directed translation schemes

An SDT is a Context Free grammar with program fragments embedded within production
bodies

Those program fragments are called semantic actions

They can appear at any position within production body

Any SDT can be implemented by first building a parse tree and then performing the
actions in a left-to-right depth first order

Typically SDT's are implemented during parsing, without building a parse tree.

Postfix translation schemes

The simplest SDD implementation occurs when the grammar can be parsed bottom-up and the
SDD is S-attributed

In that case we can construct an SDT in which each action is placed at the end of its
production and is executed along with the reduction of the body to the head of that
production

SDTs with all actions at the right ends of the production bodies are called postfix SDTs


Parse-Stack implementation of postfix SDT’s

In a shift-reduce parser, semantic actions can be implemented directly on the parser stack

Each grammar symbol (or the state that represents it) on the stack is associated with a
record holding its attributes

At a reduction step, the semantic action at the end of the production is executed to
evaluate the attribute(s) of the nonterminal on the left side of the production

The resulting record is then left on the stack in place of the records for the right side
of the production

EXAMPLE

L -> E n         { print(stack[top-1].val);
                   top = top - 1; }

E -> E1 + T      { stack[top-2].val = stack[top-2].val + stack[top].val;
                   top = top - 2; }

E -> T

T -> T1 * F      { stack[top-2].val = stack[top-2].val * stack[top].val;
                   top = top - 2; }

T -> F

F -> (E)         { stack[top-2].val = stack[top-1].val;
                   top = top - 2; }

F -> digit
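The stack manipulation in these actions can be traced with a small Python sketch. It is not a parser: the sequence of shift and reduce moves for the input 3 * 5 + 4 n is written out by hand, as an LR parser would produce it, and only the attribute handling is simulated. The names `shift`, `relabel`, `binary`, `ACTIONS` and `run` are assumptions for this illustration, and `stack[-1]` plays the role of `stack[top]`.

```python
import operator

def shift(stack, sym, val=None):
    stack.append({"sym": sym, "val": val})   # one record per grammar symbol

def relabel(head):
    """Unit productions such as E -> T: the value stays where it is."""
    def action(stack, out):
        stack[-1]["sym"] = head
    return action

def binary(head, fn):
    """E -> E1 + T and T -> T1 * F: combine stack[top-2] and stack[top]."""
    def action(stack, out):
        stack[-3]["val"] = fn(stack[-3]["val"], stack[-1]["val"])
        stack[-3]["sym"] = head
        del stack[-2:]                       # top = top - 2
    return action

def parens(stack, out):                      # F -> ( E )
    stack[-3]["val"] = stack[-2]["val"]
    stack[-3]["sym"] = "F"
    del stack[-2:]                           # top = top - 2

def print_line(stack, out):                  # L -> E n
    out.append(stack[-2]["val"])             # print(stack[top-1].val)
    stack[-2]["sym"] = "L"
    del stack[-1:]                           # top = top - 1

ACTIONS = {
    "L -> E n": print_line,
    "E -> E + T": binary("E", operator.add),
    "E -> T": relabel("E"),
    "T -> T * F": binary("T", operator.mul),
    "T -> F": relabel("T"),
    "F -> ( E )": parens,
    "F -> digit": relabel("F"),
}

def run(moves):
    stack, out = [], []
    for kind, *rest in moves:
        if kind == "shift":
            shift(stack, *rest)
        else:
            ACTIONS[rest[0]](stack, out)
    return stack, out

# Hand-written parser moves for the input: 3 * 5 + 4 n
moves = [
    ("shift", "digit", 3), ("reduce", "F -> digit"), ("reduce", "T -> F"),
    ("shift", "*"), ("shift", "digit", 5), ("reduce", "F -> digit"),
    ("reduce", "T -> T * F"), ("reduce", "E -> T"),
    ("shift", "+"), ("shift", "digit", 4), ("reduce", "F -> digit"),
    ("reduce", "T -> F"), ("reduce", "E -> E + T"),
    ("shift", "n"), ("reduce", "L -> E n"),
]
final_stack, printed = run(moves)
print(printed)
```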


Intermediate Code Generation

• Intermediate code is the interface between the front end and the back end of a compiler.

• Ideally, the details of the source language are confined to the front end and the details of the target machine to the back end. With a shared intermediate representation, m front ends and n back ends can then be combined into m × n compilers.

• In this chapter we study intermediate representations, static type checking and intermediate code generation.

Variants of syntax trees

• It is sometimes beneficial to create a DAG instead of a tree for expressions.

• This way we can easily identify the common subexpressions and then use that knowledge during code generation.

• Example: a+a*(b-c)+(b-c)*d, in which a and b-c each occur twice

SDD for creating DAGs


Value-number method for constructing DAG’s

• Algorithm

    • Search the array for a node M with label op, left child l and right child r

    • If there is such a node, return the value number of M

    • If not, create in the array a new node N with label op, left child l and right child r, and return its value number

• We may use a hash table to avoid searching the whole array
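A minimal Python sketch of the value-number method follows, applied to the earlier example a+a*(b-c)+(b-c)*d. The `Dag` class and its method names are assumptions for illustration; a Python dict stands in for the hash table, and the value number of a node is simply its index in the array.

```python
class Dag:
    def __init__(self):
        self.nodes = []    # the array of nodes; value number = index
        self.index = {}    # hash table: node signature -> value number

    def _lookup(self, signature):
        """Return the value number of an existing node, or create a new one."""
        if signature not in self.index:
            self.index[signature] = len(self.nodes)
            self.nodes.append(signature)
        return self.index[signature]

    def leaf(self, name):
        return self._lookup(("id", name))

    def node(self, op, left, right):
        # left and right are the value numbers of the children
        return self._lookup((op, left, right))

dag = Dag()
a = dag.leaf("a")
bc = dag.node("-", dag.leaf("b"), dag.leaf("c"))       # b - c
left = dag.node("+", a, dag.node("*", dag.leaf("a"), bc))
bc_again = dag.node("-", dag.leaf("b"), dag.leaf("c"))  # found, not created
root = dag.node("+", left, dag.node("*", bc_again, dag.leaf("d")))

print(bc == bc_again, len(dag.nodes))
```

The second request for b - c returns the value number of the existing node, so the DAG needs 9 nodes where the corresponding syntax tree would need 13.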

Three address code

• In three-address code there is at most one operator on the right side of an
instruction
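To make the one-operator restriction concrete, the sketch below flattens an expression tree into three-address instructions by introducing a compiler-generated temporary for every operator. The tuple representation of expressions, the temporary names t1, t2, … and the function names `gen` and `three_address` are assumptions for this illustration.

```python
import itertools

def gen(expr, code, temps):
    """Emit code for expr and return the name that holds its value."""
    if isinstance(expr, str):          # a variable name: nothing to emit
        return expr
    op, left, right = expr
    l = gen(left, code, temps)
    r = gen(right, code, temps)
    t = f"t{next(temps)}"              # fresh temporary for this operator
    code.append(f"{t} = {l} {op} {r}")
    return t

def three_address(expr):
    code = []
    gen(expr, code, itertools.count(1))
    return code

# x + y * z becomes two instructions, each with a single operator
for line in three_address(("+", "x", ("*", "y", "z"))):
    print(line)
```

This simple version works on a tree and therefore recomputes repeated subexpressions; generating code from a DAG would let a temporary be reused.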


Example:

Data structures for three address codes

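• Quadruples

    • Have four fields: op, arg1, arg2 and result

• Triples

    • Have three fields: op, arg1 and arg2; temporaries are not used, and a result is instead referred to by the position of the instruction that computes it

• Indirect triples

    • In addition to the triples, a list of pointers to triples gives the order of the instructions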
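The three representations can be compared on the assignment a = b * -c + b * -c. The Python below is a sketch under stated assumptions: the `Quad`, `Triple` and `Ref` types and the `quads_to_triples` function are hypothetical names, and the conversion assumes that every quadruple other than a copy writes a compiler-generated temporary.

```python
from typing import NamedTuple, Optional

class Quad(NamedTuple):
    op: str
    arg1: str
    arg2: Optional[str]
    result: str

class Ref(NamedTuple):
    index: int              # position of the triple that computes the value

class Triple(NamedTuple):
    op: str
    arg1: object
    arg2: object

quads = [
    Quad("minus", "c", None, "t1"),
    Quad("*", "b", "t1", "t2"),
    Quad("minus", "c", None, "t3"),
    Quad("*", "b", "t3", "t4"),
    Quad("+", "t2", "t4", "t5"),
    Quad("=", "t5", None, "a"),
]

def quads_to_triples(quads):
    where = {}              # temporary name -> index of its triple
    triples = []
    for q in quads:
        a1 = Ref(where[q.arg1]) if q.arg1 in where else q.arg1
        a2 = Ref(where[q.arg2]) if q.arg2 in where else q.arg2
        if q.op == "=":     # copy: the target goes in arg1, the value in arg2
            triples.append(Triple("=", q.result, a1))
        else:               # assumed to define a temporary
            where[q.result] = len(triples)
            triples.append(Triple(q.op, a1, a2))
    return triples

triples = quads_to_triples(quads)

# Indirect triples: a separate list of pointers fixes the execution order,
# so instructions can be reordered without renumbering any Ref.
instruction_list = list(range(len(triples)))

for i in instruction_list:
    print(i, triples[i])
```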

Type Expressions

Example: int[2][3]

array(2,array(3,integer))

A basic type is a type expression



A type name is a type expression

A type expression can be formed by applying the array type constructor to a number and
a type expression.

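A record is a data structure with named fields; a type expression can be formed by applying the record type constructor to the field names and their types

A type expression can be formed by using the type constructor → for function types

If s and t are type expressions, then their Cartesian product s × t is a type expression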

Type expressions may contain variables whose values are type expressions.
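Type expressions have a natural tree structure, which the following Python sketch represents with nested tuples. The helper names `array`, `array_type` and `show` and the tuple encoding are assumptions for this illustration; only the array constructor is covered.

```python
def array(n, elem):
    """Apply the array type constructor to a number and a type expression."""
    return ("array", n, elem)

def array_type(base, dims):
    """Type expression for a declaration such as base[d1][d2]...

    The innermost array type comes from the last dimension.
    """
    t = base
    for n in reversed(dims):
        t = array(n, t)
    return t

def show(t):
    """Render a type expression in constructor notation."""
    if not isinstance(t, tuple):
        return str(t)
    ctor, *args = t
    return f"{ctor}({', '.join(show(a) for a in args)})"

# int[2][3] is an array of 2 arrays of 3 integers
t = array_type("integer", [2, 3])
print(show(t))
```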

Short-Circuit Code

Flow-of-Control Statements
