Scanning and Parsing
SYSTEM PROGRAMMING
CONTENTS
Course Coordinator
Tapashi Kashyap Das, Assistant Professor, Computer Science, KKHSOU
Arabinda Saikia, Assistant Professor, Computer Science, KKHSOU
January 2011
© Krishna Kanta Handiqui State Open University.
No part of this publication, which is material protected by this copyright notice, may be reproduced, transmitted, utilized or stored in any form or by any means now known or hereinafter invented, electronic, digital or mechanical, including photocopying, scanning, recording or by any information storage or retrieval system, without prior written permission from the KKHSOU.
The university acknowledges with thanks the financial support provided
by the Distance Education Council, New Delhi, for the preparation
of this study material.
Printed and published by Registrar on behalf of the Krishna Kanta Handiqui State Open University.
Marks
Unit 1 : Introduction to Language Processor 20
Language Processor: Activities, Phase and Pass; Types of Language Processor:
Compiler, Interpreter, Assembler; Phases of Compiler; Programming Language
Grammar; Terminal Symbols, Alphabets and Strings.
Unit 2 : Assembler 20
Features of Assembly Language Programming, Statement Format, Assembly
Language statements, Advantage of Assembly Language, Design specification of
Assembler.
Unit 3 : Linker and Loader 20
Linker, Loader, Two-Pass Linking, Loader Schemes: General Loader Scheme, Compile-and-Go
Loader, Bootstrap Loader; Object Files; Object Code Library; Concept of
Relocation; Symbols and Symbol Resolution; Overlay; Dynamic Linking.
Unit 1 This unit introduces the concept of the language processor. Different types of language processors, like the compiler, interpreter and assembler, are briefly described in this unit.
Unit 2 This unit is on assembler. With this unit, learners will be acquainted with assembly language
programming along with the design specification of the assembler.
Unit 3 This unit deals with the concepts of the linker and the loader. Learners will be acquainted with different loader schemes, the object code library and the concept of relocation.
Unit 4 This unit focuses on the compiler and the interpreter. Concept of memory allocation is also
discussed in this unit.
Unit 5 This unit is the last unit of this course. Various important concepts like finite state automata, regular expressions etc. are discussed in this unit. The concept of the parse tree and different parsing techniques are also presented in this unit.
Each unit of this course includes some boxes alongside the main sections, to help you with difficult or unfamiliar terms. Some "EXERCISES" have also been included to help you apply your own thoughts. You may find some boxes marked "LET US KNOW"; these boxes provide additional interesting and relevant information. Again, you will get "CHECK YOUR PROGRESS" questions, which have been designed to let you self-check your progress of study. It will be helpful if you solve the problems in these boxes immediately after you go through the sections of the units and then match your answers with the "ANSWERS TO CHECK YOUR PROGRESS" given at the end of each unit.
UNIT 1 : INTRODUCTION TO LANGUAGE
PROCESSOR
UNIT STRUCTURE
1.2 INTRODUCTION
To understand the idea of a language processor, let us take a real-life example. Suppose you are talking with a stranger who knows only English, but you are not good at English. Then, whenever the stranger speaks, your brain will first process (try to understand) what he is saying and convert (make understandable) the language into your own language. So we have seen that the main purpose of a language processor is to understand the language being spoken to it. A similar activity goes on inside a computer. As we know, a computer does not understand our language. Computer scientists have therefore developed many languages (like the high-level languages C and C++) whose programs can be translated into a form the computer understands, so that we can interact with the computer to meet our requirements.
The synthesis phase, on the other hand, constructs the object code from the intermediate representation and the symbol table. Here, object code means the code produced after compilation; in other words, the low-level machine code is termed object code. Other objectives of this phase include: obtaining the machine opcode from the mnemonics table, obtaining the address of an operand from the symbol table, and synthesizing a machine instruction. We will discuss this phase further while discussing the compiler.
1.4.1 Compiler
Fig. 1.1
If there are errors in the source code, the compiler identifies the
errors at the end of compilation. Such errors must be removed to
enable the compiler to compile the source code successfully. The
object program generated after compilation can be executed a num-
ber of times without translating it again.
1.4.2 Interpreter
An interpreter does not produce target executable code for the computer. The operations performed by an interpreter are :
a) It directly executes the input code (i.e. the source code) line by line, using the given inputs and producing the desired outputs.
1.4.3 Assembler
Fig. 1.2
Symbol Table : A compiler uses the symbol table to keep track of scope and binding information about names found in the source code. The table is changed every time a name is encountered. Changes to this table occur i) if a new name is discovered, ii) if new information about an existing name is discovered.
Fig. 1.3
The lexical and syntactic features of any programming language are determined by its grammar. Here we are going to discuss the basics of the grammar required for any programming language. As we already know, a language L is nothing but a collection of meaningful sentences. Each sentence in the language consists of a sequence of valid words, and each word consists of letters and/or symbols of the particular language L. Such a language can be termed a formal language.
So, we can define the grammar of such a formal language L as the set of rules which precisely specify its sentences. You may think that our natural languages (English, or a modern Indian language) should also be called formal languages, but that is not true: we cannot call them formal, since a natural language is too rich for all of its vocabulary and usage to be captured by a fixed set of rules. But we can call programming languages formal languages.
The lower case letters a, b, c, d, etc. are used to denote the symbols in the alphabet. A symbol in the alphabet is termed a terminal symbol (denoted by T) of a language L.
The empty string is the string which has no symbols. To build larger strings from existing strings, the concatenation operation is used.
Production rules are often written in the form head → body; e.g., the rule z0 → z1 specifies that z0 can be replaced by z1.
For example, suppose the alphabet consists of a and b, the start symbol is S, and the production rules are :
1. S → aSb
2. S → b
Here a, b are the terminals and S is a non-terminal as well as the start symbol. Suppose, using the above production rules, we want to generate the string aabbb. Starting with S we get :
S ⇒ aSb ⇒ aaSbb ⇒ aabbb
Likewise, by using a specific ordering of the production rules (if more than one production rule exists), we can generate any string of a particular language.
UNIT 2 : ASSEMBLER
UNIT STRUCTURE
2.2 INTRODUCTION
We have already come across the concept of language processor and its
types. We are also acquainted with compiler, interpreter and assembler
with their functionalities. We have also learnt the basic idea of system
programming, their components and phases, how they processed, what
are different rules like semantic, syntax and lexical, their grammar, sym-
bol, alphabets etc.
where <Op. Spec> is the operand specification and has the following syntax:
We can also use combinations of these forms, like NUM+4(4). The following are the mnemonic codes used in assembly language for machine instructions :
The mnemonics ADD, SUB, DIV and MULT are for arithmetic operations, which are performed in a register. The comparison instruction sets a condition code, analogous to a subtract instruction, without affecting the values of its operands. The condition code can be tested by the BC (branch on condition) instruction. The assembly instruction format of BC is
BC <condition code spec>, <memory address>
The condition code specifications are character codes used for simplicity. They are :
LT    1    less than
LE    2    less than or equal
EQ    3    equal
GT    4    greater than
GE    5    greater than or equal
ANY   6    unconditional transfer
Imperative Statements:
Declarative Statements:
A    DS    1
N    DC    '1'
Here DS (declare storage) reserves one memory word for A, and DC (declare constant) reserves a memory word for N initialized to the value 1. Using DS we can also reserve a block of memory. To reserve a block of 100 memory words, the statement would be
A    DS    100
If we want to access the fifth word of the block already reserved, we write A+4, where A is the address of the memory block and 4 is the offset to the 5th position.
For DC, the programmer can write the constant in different forms, like decimal, binary, hexadecimal etc. The assembler converts them to the appropriate internal form. DC is mainly used to initialize memory words to given values. These values may change during execution if the program assigns new values to them; the assembler provides no protection against that.
When it is translated into machine language, it will have two operands: AREAG, and the value '5' as an immediate operand. This feature is not supported by the simple assembly languages of machines other than the Intel 8086.
Literals are operands that differ from constants in that the value of a literal cannot be changed during the execution of the program, and its location cannot be specified in the assembly program. A literal is written as :
ADD AREAG, ='5'
Assembler Directives:
1) Machine language programs are very difficult to read, write and understand, whereas assembly language programs are easy to read, write and understand.
There are two intermediate phases in obtaining the target program from the source program. So, the general model for the translation process can be represented as follows :
Analysis phase:
Our first task is to separate the fields of the statement. Here we find that AGAIN appears in the label field, LOAD in the opcode mnemonic field, and RESULT + 4 in the operand field. Next we apply the rules that give meaning to these codes. This completes the analysis of the source statement; then the work of the synthesis phase starts.
Synthesis phase :
For the assembly statement considered in the analysis phase, the synthesis phase first selects the appropriate machine opcode for the mnemonic LOAD and places it in the machine instruction's opcode field. It then evaluates the address corresponding to the operand expression 'RESULT + 4' and places it in the address field of the machine instruction. This completes the translation of this assembly language statement.
Mnemonic   Opcode   Length
ADD        01       1
SUB        02       1
Mnemonic table

Symbol   Address
AGAIN    104
N        113
Symbol table
UNIT 3 : LINKER AND LOADER
UNIT STRUCTURE
3.2 INTRODUCTION
In this unit, we will discuss the concepts of linking and loading. Linking is the process of combining various pieces of code, modules or data together to form a single executable unit that can be loaded in memory. Linking can be done at compile time, at load time (by loaders) and also at run time (by application programs). A loader is a program that takes an object program and prepares it for execution. Once such an executable file has been generated, the actual object module generated by the linker is deleted. Thus, although the linker itself produces an intermediate relocatable module, the more usual result of a linker call is an executable file that is subsequently loaded by the loader.
3.3 LINKER
A linker reads all of the symbol tables in the input module and extracts the
useful information, which is sometimes all of the incoming information but
frequently just what’s needed to link. Then it builds the link-time symbol
tables and uses those to guide the linking process. Depending on the
output file format, the linker may place some or all of the symbol informa-
tion in the output file. Linkers handle a variety of symbols. All linkers handle
symbolic references from one module to another. Each input module in-
cludes a symbol table. The symbols include the following:
Global symbols defined, and perhaps referenced, in the module.
Global symbols referenced but not defined in the module (generally called externals).
Segment names, which are usually also considered to be global symbols defined to be at the beginning of the segment.
Non-global symbols, usually for debuggers and crash dump analysis (optional). These are not really symbols needed for the linking process.
3.4 LOADER
Loader is the program that accomplished the loading task. Loading is the
process of bringing a program into main memory so that it can run. On
most modern systems, each program is loaded into a fresh address space,
which means that all programs are loaded at a known fixed address, and
can be linked for that address. Loading is pretty simple from this point and
requires the following steps:
Read enough header information from the object file to find out how much address space is needed.
Zero out any BSS (Block Started by Symbol, a portion of the data segment containing statically allocated variables) space at the end of the program if the virtual memory system does not do so automatically.
If the program is not mapped through the virtual memory system, reading
in the object file just means reading in the file with normal read system
calls. On systems that support shared read-only code segments, the sys-
tem needs to check whether there’s already a copy of the code segment
loaded in and uses that rather than making another copy.
Let us consider two program files, a.c and b.c. As we invoke GCC on a.c at the shell prompt, the following actions take place :
gcc a.c
cpp <other command-line options> a.c /tmp/a.i
cc1 <other command-line options> /tmp/a.i -o /tmp/a.s
as <other command-line options> /tmp/a.s -o /tmp/a.o
cpp, cc1 and as are GNU's preprocessor, compiler proper and assembler, respectively. They are a part of the standard GCC distribution.
If we repeat the above steps for file b.c, we have another object file, b.o.
The linker’s job is to take these input object files (a.o and b.o) and gener-
ate the final executable :
ld other-command-line-options /tmp/a.o /tmp/b.o -o a.out
The final executable (a.out) is then ready to be loaded. To run the execut-
able, we type its name at the shell prompt :
./a.out
The shell invokes the loader function, which copies the code and data in the executable file a.out into memory, and then transfers control to the beginning of the program. The loader is a program called execve, which loads the code and data of the executable object file into memory and then runs the program by jumping to the first instruction.
The name a.out was originally coined as the Assembler OUTput of early a.out object files. Since then, object formats have changed considerably, but the name continues to be used.
Linkers and loaders perform several related but conceptually separate actions.
Each input file contains a set of segments, contiguous blocks of code or data to be placed in the output file. Each input file also contains at least one symbol table. Some symbols are exported, that is, defined within the file for use in other files; these are generally the names of routines within the file that can be called from elsewhere. Other symbols are imported, that is, used in the file but not defined in it; these are generally the names of routines called from, but not present in, the file.
When a linker runs, it first has to scan the input files to find the sizes of the segments and to collect the definitions and references of all the symbols. It creates a segment table listing all the segments defined in the input files, and a symbol table with all the symbols imported or exported. Using the data from the first pass, the linker assigns numeric locations to symbols, determines the sizes and locations of the segments in the output address space, and figures out where everything goes in the output file.
The second pass uses the information collected in the first pass to control the actual linking process. It reads and relocates the object code, substituting numeric addresses for symbol references and adjusting memory addresses in code and data to reflect relocated segment addresses, and writes the relocated code to the output file. It then writes the output file, generally with header information, the relocated segments and symbol table information. If the program uses dynamic linking, the symbol table also contains the information the run-time linker will need to resolve dynamic symbols.
Advantages :
The program need not be retranslated each time it is run. This is because when the source program is first translated, an object program is generated. If the program is not modified, the loader can make use of this object program to convert it to executable form.
In this type of loader, each instruction is read, its machine code is obtained, and the code is directly placed in main memory at some known address. That is, the assembler runs in one part of memory, and the assembled machine instructions and data are placed directly into their assigned memory locations. After completion of the assembly process, the starting address of the program is assigned to the location counter. A typical example is WATFOR-77, a FORTRAN compiler which uses such a "load and go" scheme. This loading scheme is also called "assemble and go".
Advantages :
Disadvantages:
Advantages :
Disadvantages :
The list of all the symbols that are defined in the current segment but can be referred to by other segments.
The list of symbols which are not defined in the current segment but are used in the current segment is stored in a table. The USE table holds information such as the name of the symbol, its address, and its address relativity.
Executable object file, which contains binary code and data in a form
that can be directly loaded into memory and executed.
Not all object formats contain all of these kinds of information, and
it’s possible to have quite useful formats with little or no information
beyond the object code.
All linkers support object code libraries in one form or another, with most also providing support for various kinds of shared libraries. The basic principle of an object code library is simple enough (Fig. 4.1). A library is little more than a set of object code files. (Indeed, on some systems you can literally concatenate a group of object files together and use the result as a link library.) If any imported names remain undefined after the linker processes all of the regular input files, it runs through the library or libraries and links in any of the files in the library that export one or more undefined names.
Shared libraries complicate this task a little by moving some of the work from link time to load time. The linker identifies the shared libraries that resolve the undefined names in a linker run, but rather than linking anything into the program, it notes in the output file the names of the libraries in which the symbols were found, so that the shared library can be bound in when the program is loaded.
[Fig. 4.1 : A linker resolving undefined names against object code libraries (Library 1 and Library 2) while producing the executable file.]
Once a linker has scanned all of the input files to determine segment
sizes and symbol definitions and symbol references, figured out which
library modules to include, and decided where in the output address space
all of the segments will go, the next stage is the heart of the linking pro-
cess: relocation. We use the term relocation to refer both to the process of
adjusting program addresses to account for nonzero segment origins and
to the process of resolving references to external symbols, because the
two are frequently handled together.
40 System Programming
Linker and Loader Unit 3
The linker’s first pass lays out the positions of the various segments and
collects the segment-relative values of all global symbols in the program.
Once the linker determines the position of each segment, it potentially
needs to fix up all storage addresses to reflect the new locations of the
segments. On most architectures, addresses in data are absolute, while those embedded in instructions may be absolute or relative. The linker needs to fix them up accordingly. The linker also resolves stored references to global symbols to the symbols' addresses.
Every relocatable object file has a symbol table and associated symbols.
In the context of a linker, the following kinds of symbols are present :
3.12 OVERLAY
Overlays are a technique that dates back to before 1960 and is still in use in some memory-constrained environments. Several MS-DOS linkers in the 1980s supported them in a form nearly identical to that used 25 years earlier on mainframe computers. Although overlays are now little used on conventional architectures, the techniques that linkers use to create and manage overlays remain interesting. Also, the inter-segment call tricks developed for overlays point the way to dynamic linking. Overlaid programs divide the code into a tree of segments, such as the one in Fig. 3.1.
The programmer manually assigns object files or individual object code
segments to overlay segments. Sibling segments in the overlay tree share
the same memory. In the example, segments A and D share the same
memory, B and C share the same memory, and E and F share the same
memory. The sequence of segments that lead to a specific segment is
called a path, so the path for E includes the root, D, and E.
When the program starts, the system loads the root segment which con-
tains the entry point of the program. Each time a routine makes a “down-
ward” inter-segment call, the overlay manager ensures that the path to
the call target is loaded. For example, if the root calls a routine in segment
42 System Programming
Linker and Loader Unit 3
A, the overlay manager loads section A if it’s not already loaded. If a rou-
tine in A calls a routine in B the manager has to ensure that B is loaded,
and if a routine in the root calls a routine in B, the manager ensures that
both A and B are loaded. Upwards calls don’t require any linker help, since
the entire path from the root is already loaded.
For example, linking the following two programs produces a link-time error:

/* first.c */
int foo () {
    return 0;
}

int main () {
    foo ();
}

/* second.c */
int foo () {
    return 1;
}

The linker will generate an error message because foo (a strong symbol, as it is a global function) is defined twice:

gcc first.c second.c
/tmp/ccM1DKre.o: In function 'foo':
/tmp/ccM1DKre.o(.text+0x0): multiple definition of 'foo'
/tmp/ccIhvEMn.o(.text+0x0): first defined here
collect2: ld returned 1 exit status

collect2 is a wrapper over the linker ld that is called by GCC.
[Fig. 3.1 : An overlay tree. The root has children A and D; A has children B and C; D has children E and F.]
Dynamic linking defers much of the linking process until a program starts
running or sometimes even later. It provides a variety of benefits that are
hard to get otherwise :
In this unit we have learnt the working strategy of linker and loader. Link-
ing is the process of combining various pieces of code and data together
to form a single executable that can be loaded in memory. Linking can be
done at compile time, at load time (by loaders) and also at run time (by
application programs). Loading is the process of bringing a program into main memory so that it can run. After that, we became acquainted with the concept of relocation. We have also learnt various important concepts like symbol resolution, overlay and dynamic linking.
UNIT 4 : COMPILERS AND INTERPRETERS
UNIT STRUCTURE
4.2 INTRODUCTION
From the introduction we came to know that a compiler is like a bridge between the programming language (PL) domain and the execution domain. The main aspects are :
Data types
These are specifications of the legal values for variables and the legal operations on the legal values of a variable. For example, in the C language, to declare a variable x we write
int x ;
where 'int' is the data type and x is the variable. The legal operations are the assignment operation and data manipulation operations. A compiler can ensure that data types are manipulated through legal operations by performing the following tasks :
When a compiler compiles a data structure used by the PL, it develops a memory mapping to access the memory words allocated to each element. A record is a heterogeneous data structure that requires complex memory mapping.
(Data structures used by a PL are arrays, stacks, queues, lists, records, etc.)
The control structure of a language is the collection of language features for altering the flow of control during the execution of a program. These include sequence control, procedure calls, conditional transfers, etc. The compiler must ensure that the source program does not violate the semantics of a control structure.
Scope Rules
Lexical analysis
Syntax analysis
Semantic analysis
Lexical Analysis :
Fig. 4.1
[Fig. 4.2 : Symbol table]
Fig. 4.3
Identifiers : position, initial, rate
Operators : :=, +, *
Constants : 60
Fig. 4.4
From the parse tree we can say that an identifier or a constant (here 60) is individually an expression. In addition, we can conclude that initial + rate * 60 is an expression. Finally, the entire construct is an assignment statement, which the parse tree shows to be valid.
Semantic analysis :
For example, while any noun phrase followed by a verb phrase makes a syntactically correct English sentence, a semantically correct sentence has subject-verb agreement and proper use of gender, and its components go together to express an idea that makes sense.
int main()
{
    ...
    int kkhsou_var, result;
    kkhsou_var = 10;
    result = kkhsou_var + 100; // assignment statement
    ...
}
c) Some three-address codes have fewer than three operands.
Code optimization :
During this phase, the code optimizer improves the code produced by the intermediate code generator in terms of time and space. The main objective of this phase is to improve the intermediate code so that faster-running machine code results. Code optimization is the process of transforming a piece of code to make it more efficient (either in terms of time or space) without changing its output.
Propagating the value of x (14 in this example) yields:
int y = 7 - 14 / 2;
return y * (28 / 14 + 2);
Code generation :
The code generator takes the output of the previously executed phases as input and generates the machine-level code. For an assignment such as x = y + 13 (the statement implied by the code below), the code generator produces the following :
load y
add constant 13
store x
(Memory binding is an association between the 'memory address' attribute of a data item and the address of a memory area.)
Thus, static allocation implies that storage is allocated to all program variables before the start of program execution. The binding of data items also takes place at compile time. Since no memory is allocated or deallocated during execution, variables are permanent in a program. On the other hand, in dynamic storage allocation, binding takes place during program execution. Static allocation is widely used in scientific languages like FORTRAN, and dynamic allocation is used by block-structured languages like PL/I, ALGOL, PASCAL, etc.
[Fig 4.6 : A schematic for managing dynamic storage allocation (code and data areas for procedures A and B, with a free area pointer).]
Postfix notation
In postfix notation, an operator appears immediately after its last operand, so operators can be evaluated in the order in which they appear in the string. Postfix is a popular intermediate code in non-optimizing compilers owing to its ease of generation and use. Code generation is performed using a stack of operand descriptors: operand descriptors are pushed on the stack as operands appear in the string; when an operator appears, its operand descriptors are popped from the stack, and a descriptor for the partial result generated by the operator is pushed. Consider, for example, an expression beginning a + b * c, whose postfix form begins a b c * +.
Here the operand stack will contain descriptors for a, b and c when the operator '*' is encountered; it will then contain the descriptors for the operand a and the partial result of b * c when '+' is encountered. The stack occupation during this code generation is as follows:
[Fig 4.5 : Some steps of code generation for the above postfix string (the operand stack holding descriptors for a, b, c, then partial results in the accumulator and temporary locations).]
No.   Operator   Operand 1   Operand 2
1     *          b           c
2     +          1           a
3                e           f
4     *          d           3
5     +          2           4
The symbol 1 in the operand field of triple 2 indicates that the operand is the value of b * c, represented by triple number 1.
The result name designates the result of the evaluation. It can be used as the operand of another quadruple. This is more convenient than using a number, as triples do.
No.   Operator   Operand 1   Operand 2   Result
1     *          b           c           t1
2     +          t1          a           t2
3                e           f           t3
4     *          d           t3          t4
5     +          t2          t4          t5
Here t1, t2, ..., t5 are result names, not temporary locations for holding partial results. Now, even if a quadruple is moved, it can still be referenced by its result name, which remains unchanged. Some of the result names become temporary locations if common subexpressions are detected.
Expression Trees
(a)                 (b)
LOAD A              LOAD C
ADD B               SUB D
STORE TEMP1         STORE TEMP1
LOAD C              LOAD A
SUB D               ADD B
STORE TEMP2         DIV TEMP1
LOAD TEMP1
DIV TEMP2
Both instruction sequences (a) and (b) evaluate the same expression (A + B) / (C - D) on a machine having a single register; the second sequence is shorter and faster. LOAD-STORE optimization is the process of minimizing the number of load and store instructions by utilizing CPU registers and instruction peculiarities. Instead of always generating code in a right-to-left manner, it can sometimes be more efficient to generate code left-to-right. The guiding rule to identify such possibilities is: "if the register requirements of both subtrees of an operator are identical, first evaluate the right subtree". This rule reduces the LOAD-STORE requirements because the result of the left subtree can then be used immediately in the evaluation of the parent operator.
4.7 INTERPRETERS
Interpreter
HLL
Program
+
Data
I/O Devices
Use of Interpreter :
[Figure: interpretation schemes (a) and (b); in scheme (b) a preprocessor converts the source program to intermediate code, which the interpreter then executes on the data to produce results]
1. a) T b) F c) T d) F e) T
2. a) bridge b) data types c) memory mapping
d) heterogeneous e) complex f) control structure
3. a) memory binding b) before
c) program controlled dynamic allocation d) block
4. a) F b) T c) F d) T
e) T f) F
5. a) F b) T c) T d) F e) T
6. a) stack b) Triple c) expression trees
d) compilers e) Intermediate code
7. a) interpreter b) efficiency, simplicity c) data item.
8. a) F b) T
UNIT 5 : SCANNING AND PARSING
UNIT STRUCTURE
So the lexical analysis phase tokenizes the input statement int a = 2 ; into
the tabulated token types and stores them in the symbol table.
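The tokenization step can be sketched as follows. In this Python fragment (an illustration only; the token class names KEYWORD, ID, OP, NUM and PUNCT are ours, not from the text), regular expressions split the statement into tokens:

```python
import re

# Sketch of a scanner for the statement "int a = 2 ;".
# Token class names below are illustrative, not from the text.
TOKEN_SPEC = [
    ("KEYWORD", r"\bint\b"),
    ("NUM",     r"\d+"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"="),
    ("PUNCT",   r";"),
    ("SKIP",    r"\s+"),          # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (token_class, lexeme) pairs for the input text."""
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("int a = 2 ;"))
```

In a real compiler the identifier 'a' would additionally be entered into the symbol table, as the text describes.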
[Figure: transition diagram of a finite automaton with a start state, states 0 to 12 and transitions on the inputs a and b]
In a transition diagram, a circle implies a state, a labelled arrow implies a
transition on that input, and a double circle implies an accepting state.
[Figure: a transition diagram with states q0, q1, q2, q3 and transitions on the input 1]
Example : Draw a DFA for regular expression 01|10.
Solution :
[Figure: a DFA with states q0 to q4 accepting exactly the strings denoted by 01|10]
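The DFA for 01|10 can be checked by simulating it. The sketch below (state names are ours) drives a transition table over the input string and accepts exactly the two strings 01 and 10:

```python
# Sketch: simulate a DFA for the regular expression 01|10.
# State names are illustrative; 'dead' acts as a trap state for any
# transition not listed in the table.
TRANSITIONS = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "1"): "acc", ("q2", "0"): "acc",
}

def accepts(s):
    """Return True if the DFA accepts the string s."""
    state = "q0"
    for ch in s:
        state = TRANSITIONS.get((state, ch), "dead")
    return state == "acc"

for s in ["01", "10", "00", "011"]:
    print(s, accepts(s))
```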
Here the value is the language consisting of all strings starting with a 0 or a
1 followed by any number of 0s. Let us analyse expression (2). The symbols
0 and 1 are shorthand for the sets {0} and {1}, so (0 U 1) means ({0} U
{1}). The value of ({0} U {1}) is {0,1}. The part 0* means {0}*, and its
value is the language consisting of all strings containing any number of
0s. A language denoted by a regular expression is said to be a regular set.
If two regular expressions r and s denote the same language, we say that
r and s are equivalent.
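The set denoted by (0 U 1)0* can be confirmed with a regular-expression engine. The Python sketch below writes the expression as (0|1)0* and tests a few strings against it:

```python
import re

# The language described above: a 0 or a 1 followed by any number of 0s.
pattern = re.compile(r"(0|1)0*")

def in_language(s):
    """Return True if the whole string s belongs to the set (0 U 1)0*."""
    return pattern.fullmatch(s) is not None

for s in ["0", "1", "100", "000", "11", ""]:
    print(repr(s), in_language(s))
```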
5.5 PARSING
The parser examines the sequence of tokens returned by the lexical
analyser and extracts the constructs of the language appearing in
the sequence. Thus, the role of the parser is divided into two
categories–
[Figure: the parser and its interaction with the symbol table]
2. Top-down parsing builds the parse tree starting from the root node
and works down to the leaves.
3. Bottom-up parsing builds the parse tree starting from the leaves and
works up to the root node.
For these two derivations, we can draw the following parse tree–
[Figure: two parse trees for the string id + id * id, one deriving it as id + (id * id) and the other as (id + id) * id]
LET US KNOW
For the input string ab, the recursive descent parser starts by con-
structing a parse tree representing S → abA, as shown in figure 5.3(a).
In figure 5.3(b) the tree expands with the production A → cd. Since it
does not match the string ab, the parser backtracks and then tries
the alternative A → c, as shown in figure 5.3(c). Here also the parse
tree does not match the string ab, so the parser backtracks and
tries the alternative A → ε. This time, it finds a match [figure 5.3(d)].
Thus, the parsing is complete and successful.
[Figure 5.3: (a)-(d) the successive parse trees rooted at S, with A expanded first to cd, then to c, and finally to ε]
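The backtracking behaviour described above can be sketched as a recogniser for the grammar S → abA, A → cd | c | ε. The Python fragment below (our own illustration) tries A's alternatives in the same order as figure 5.3, moving on to the next alternative whenever the current one fails to match the remaining input:

```python
# Sketch: a backtracking recursive-descent recogniser for
#   S -> a b A        A -> c d | c | epsilon

def alternatives_A(s, i):
    """Yield the end position after each alternative of A, in the order tried."""
    if s[i:i + 2] == "cd":
        yield i + 2           # A -> c d
    if s[i:i + 1] == "c":
        yield i + 1           # A -> c
    yield i                   # A -> epsilon (always applicable)

def parse_S(s):
    # S -> a b A : match 'ab', then try A's alternatives, backtracking
    # (i.e. moving to the next alternative) until the whole input matches.
    if not s.startswith("ab"):
        return False
    return any(end == len(s) for end in alternatives_A(s, 2))

for s in ["ab", "abc", "abcd", "abd"]:
    print(s, parse_S(s))
```

For the input ab, the alternatives cd and c both fail, and ε succeeds, mirroring the backtracking sequence of figure 5.3.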
[Figure: model of a predictive parser, with an input buffer (a + b $), a stack (X, Y, Z, $), the predictive parsing program, a parsing table, and the output]
In bottom-up parsing, the parse tree for the input string is constructed
beginning at the bottom nodes (leaves) and working up towards the root,
the reverse of the top-down parsing approach. Thus, in bottom-up
parsing, instead of expanding successive non-terminals according to the
production rules to predict the legal next symbol, the current string (a
right sentential form) is collapsed each time until the start non-terminal
is reached. This can be regarded as a series of reductions. The approach
is also known as shift-reduce parsing. It is the primary method for many
compilers, mainly because of its speed and because tools exist that
automatically generate such a parser from the grammar. For example,
consider the following set of productions:
E → E + E
E → E * E
E → (E)
E → id
Parse tree representation for the input id + id * id is shown in fig. 5.5:
[Fig. 5.5: bottom-up construction of the parse tree for id + id * id through successive reductions of id to E, and of E * E and E + E to E]
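The series of reductions for id + id * id can be sketched as follows. This Python fragment (a demonstration only; it locates handles textually rather than with a stack) applies one reduction at a time until only the start symbol E remains:

```python
# Sketch: the series of reductions a shift-reduce parser performs on
# id + id * id with the grammar E -> E + E | E * E | (E) | id.
# Handles are found by plain substring search, which is a demonstration,
# not a real parser.

def reduce_once(form):
    """Apply one reduction to the sentential form, or return None."""
    # Alternatives tried in an order that mirrors fig. 5.5.
    for handle, lhs in [("id", "E"), ("E*E", "E"), ("E+E", "E")]:
        if handle in form:
            return form.replace(handle, lhs, 1)
    return None

form = "id+id*id"
steps = [form]
while (nxt := reduce_once(form)) is not None:
    form = nxt
    steps.append(form)
print(steps)
```

The printed sequence of sentential forms ends with the start symbol E, i.e. the input is accepted.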
        +     -     *     /     id    (     )     $
 +      ·>    ·>    <·    <·    <·    <·    ·>    ·>
 -      ·>    ·>    <·    <·    <·    <·    ·>    ·>
 *      ·>    ·>    ·>    ·>    <·    <·    ·>    ·>
 /      ·>    ·>    ·>    ·>    <·    <·    ·>    ·>
 id     ·>    ·>    ·>    ·>                ·>    ·>
 (      <·    <·    <·    <·    <·    <·
 )      ·>    ·>    ·>    ·>                ·>    ·>
 $      <·    <·    <·    <·    <·    <·
2. Scan towards the left over all equal precedences until the first <·
precedence is encountered.
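Handle-finding with precedence relations can be sketched in code. The fragment below (our own illustration; it encodes only the +, *, id and $ part of the table and omits the equal-precedence relation used for parentheses) computes the relations between adjacent terminals and extracts the leftmost handle, the span between the last <· before the first ·>:

```python
# Sketch: leftmost-handle detection with operator precedence relations.
# Only a fragment of the relation table is encoded; '<' stands for <.
# and '>' for .> ; the equal-precedence relation is not needed here.
PREC = {"$": 0, "+": 1, "*": 2, "id": 3}

def relation(a, b):
    """Precedence relation between adjacent terminals a and b."""
    if b == "id":
        return "<"
    if a == "id":
        return ">"
    return "<" if PREC[a] < PREC[b] else ">"

def leftmost_handle(tokens):
    """Span between the last <. before the first .> (cf. step 2 above)."""
    rels = [relation(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
    end = rels.index(">")                               # first .>
    start = max(i for i in range(end + 1) if rels[i] == "<")
    return tokens[start + 1 : end + 1]

print(leftmost_handle(["$", "id", "+", "id", "*", "id", "$"]))
```

For $ id + id * id $ the first handle found is the leftmost id, which would be reduced to E before parsing continues.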
5.9 LR PARSING
[Figure: the LR parser model, an input buffer and a stack driven by a parsing program that consults a parsing table divided into action and goto parts]
Fig. 5.7 : Model of an LR parser
The basic LR parsing model consists of two parts, action and goto. The
algorithm used is much the same as in the shift-reduce parsing method.
The stack of an LR parser holds grammar symbols and states. Each state
summarizes what is below it on the stack, and pairing the state on the
top of the stack with the next input symbol indexes into the parsing table
to determine the next action. The action portion consists of
1. Shift action, which pushes the next input symbol and the next state
onto the stack.
LET US KNOW