waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to retranslate his program with each execution, thus wasting translation time. To overcome this problem of wasted translation time and memory, system programmers developed another component called the loader.
“A loader is a program that places programs into memory and prepares them for execution.” It would be more efficient if subroutines could be translated into an object form that the loader could “relocate” directly behind the user's program. The task of adjusting programs so that they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
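The idea of relocation can be sketched in a few lines of Python; the object-code format and all names here are illustrative, not those of any particular loader:

    # Minimal sketch of relocation (illustrative, not a real loader's format).
    # Object code is modeled as a list of words; the relocation table lists the
    # indices of words that hold addresses and must be adjusted by the load origin.

    def relocate(code, relocation_table, load_origin):
        """Return a copy of `code` with every address word shifted by load_origin."""
        relocated = list(code)
        for index in relocation_table:
            relocated[index] += load_origin  # adjust an assembled-at-zero address
        return relocated

    # A program assembled at origin 0: word 2 holds the address of word 4.
    code = [10, 20, 4, 30, 99]
    print(relocate(code, relocation_table=[2], load_origin=1000))
    # [10, 20, 1004, 30, 99] -- ready to run at core location 1000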
TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent ML (machine language) program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
TYPES OF TRANSLATORS:
INTERPRETER
COMPILER
PREPROCESSOR
Lexical Analysis: The lexical analyzer, or scanner, reads the source program one character at a time. On reading the character stream of the source program, it groups the characters into meaningful sequences called “lexemes”. For each lexeme the analyzer produces an output called a token:
<token_name, attribute_value>
Here token_name is an abstract symbol used during parsing, and attribute_value points to the entry in the symbol table for this token.
A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on).
Example: newval := oldval + 12 yields the tokens:
newval -> identifier
:= -> assignment operator
oldval -> identifier
+ -> add operator
12 -> a number
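A minimal scanner for this example can be sketched in Python; the token categories and regular expressions below are illustrative assumptions, not a complete lexical specification:

    import re

    # Minimal scanner sketch: groups the character stream into lexemes and emits
    # <token_name, attribute_value> pairs for the example newval := oldval + 12.
    TOKEN_SPEC = [
        ("NUMBER",     r"\d+"),
        ("ASSIGN_OP",  r":="),
        ("ADD_OP",     r"\+"),
        ("IDENTIFIER", r"[A-Za-z_]\w*"),
        ("SKIP",       r"\s+"),
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def scan(source):
        symbol_table = {}
        for match in MASTER.finditer(source):
            kind, lexeme = match.lastgroup, match.group()
            if kind == "SKIP":
                continue
            if kind == "IDENTIFIER":
                # attribute value: pointer (index) into the symbol table
                attr = symbol_table.setdefault(lexeme, len(symbol_table))
            else:
                attr = lexeme
            yield (kind, attr)

    print(list(scan("newval := oldval + 12")))
    # [('IDENTIFIER', 0), ('ASSIGN_OP', ':='), ('IDENTIFIER', 1), ('ADD_OP', '+'), ('NUMBER', '12')]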
Syntax Analysis: The second stage of translation is called syntax analysis or parsing. In this phase, expressions, statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is aided by techniques based on the formal grammar of the programming language.
A syntax analyzer creates the syntactic structure (generally a parse tree) of the given program. A syntax analyzer is also called a parser. A parse tree describes the syntactic structure of the program.
Semantic Analysis: Uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It gathers type information and saves it in either the syntax tree or the symbol table for use in intermediate code generation.
Type checking: the compiler checks whether each operator has matching operands.
Coercions: the language specification may permit some type conversions, for example widening an integer operand to floating point, as sketched below.
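A toy sketch of both checks, assuming a tiny tuple-based syntax tree and only the types int and float (all names illustrative):

    # Sketch of semantic checking on a tiny syntax tree (illustrative types only).
    # Type checking verifies operand types match the operator; a coercion rule
    # permits int -> float conversion, mirroring the "Coercions" note above.

    def check(node):
        """Return the type of an expression tree node, applying coercions."""
        if isinstance(node, int):
            return "int"
        if isinstance(node, float):
            return "float"
        op, left, right = node          # e.g. ("+", 2, 3.14)
        lt, rt = check(left), check(right)
        if lt == rt:
            return lt
        if {lt, rt} == {"int", "float"}:
            return "float"              # coercion: widen the int operand to float
        raise TypeError(f"operator {op!r} has mismatched operands: {lt}, {rt}")

    print(check(("+", 2, ("*", 3.14, 5))))   # float, after coercing the ints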
Code Optimization: This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation: The last phase of translation is code generation. A number of optimizations to reduce the length of the machine language program are carried out during this phase. The output of the code generator is the machine language program for the specified computer.
Let L be a regular language. Then there exists a constant ‘n’ (which depends on L) such that for every string ‘w’ in L with |w| ≥ n, we can break w into three strings, w = xyz, such that:
1. |y| ≥ 1 (y is not empty)
2. |xy| ≤ n
3. For all k ≥ 0, the string xy^k z is also in L
Let w = a1a2…am, where m ≥ n and each ai is an input symbol. Processing these ‘m’ input symbols takes the DFA through ‘m+1’ states in sequence q0, q1, q2, …, qm, where q0 is the start state and qm is a final state. Since |w| ≥ n, by the pigeonhole principle these states cannot all be distinct, since there are only ‘n’ different states. So some state must repeat, forming a loop. Thus we can find two different integers i and j with 0 ≤ i < j ≤ n such that qi = qj. Now we can break the string w = xyz as follows:
x = a1a2a3…ai
y = ai+1ai+2…aj (the loop string, where qi = qj)
z = aj+1aj+2…am
The relationships among the strings and states: x takes the DFA from q0 to qi, y loops from qi back to qi (= qj), and z takes it from qj to qm.
‘x’ may be empty, in the case that i = 0. Also, ‘z’ may be empty if j = n = m. However, y cannot be empty, since ‘i’ is strictly less than ‘j’.
Thus for any k ≥ 0, xy^k z is also accepted by the DFA ‘A’; that is, for the language L to be regular, xy^k z must be in L for all k ≥ 0.
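The pigeonhole argument can be made concrete with a small Python sketch that runs a DFA (given as a transition dictionary, an illustrative encoding) on w and returns the split x, y, z at the first repeated state:

    # Sketch of the pigeonhole argument: run a DFA (transition dict) on w and find
    # the first repeated state among q0..q|w|; the loop between the repeats is y.

    def pump_split(delta, start, w):
        """Return (x, y, z) with w == x + y + z and y the loop string."""
        state, seen = start, {start: 0}
        for j, symbol in enumerate(w, start=1):
            state = delta[(state, symbol)]
            if state in seen:                  # q_i == q_j: pigeonhole hit
                i = seen[state]
                return w[:i], w[i:j], w[j:]
            seen[state] = j
        raise ValueError("no repeated state; |w| must be >= number of states")

    # DFA over {a} with 2 states accepting (aa)*: a loop appears within 2 symbols.
    delta = {("q0", "a"): "q1", ("q1", "a"): "q0"}
    x, y, z = pump_split(delta, "q0", "aaaa")
    print(x, y, z)   # '' 'aa' 'aa' -- xy^k z stays in the language for every k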
Applications of the pumping lemma:
1. It is useful for proving that certain languages are non-regular.
2. It can be used to check whether a language accepted by an FA is finite or infinite.
Show that L = {a^n b^n | n ≥ 0} is not regular.
Assume L is a regular language and let ‘n’ be the number of states in the FA. Choose w = a^n b^n.
Since |w| = n + n = 2n ≥ n, we can split ‘w’ into xyz such that |xy| ≤ n and |y| ≥ 1,
where |x| = n-1 and |y| = 1, so that |xy| = n-1+1 = n ≤ n, which is true. Then y is a single ‘a’, and pumping it gives xy^2 z = a^(n+1) b^n, which has more a’s than b’s and hence is not in L. This contradicts the pumping lemma, so L is not regular.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered, it indicates the end of the lexeme. In the above example, as soon as the forward pointer encounters a blank space, the lexeme is identified.
The forward pointer (fp) is moved ahead when it sees white space; that is, when fp encounters white space it ignores it and moves ahead. Then both fp and the lexeme-begin pointer (bp) are set at the next token.
Two buffering schemes are used:
1. One-buffer scheme
2. Two-buffer scheme
One-buffer scheme:
Here only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary. To scan the remaining part of the lexeme, the buffer has to be refilled, which overwrites the first part of the lexeme and may result in loss of data.
Two-buffer scheme:
Why is a two-buffer scheme used in lexical analysis? Explain.
Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized two-buffer techniques have been developed to reduce the amount of overhead required to process a single input character.
Here a buffer (array) is divided into two N-character halves, where N is the number of characters in one disk block (e.g., 4096 bytes). If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file; it is different from any input character.
One read command is used to read N characters. Two pointers are maintained: the beginning-of-lexeme pointer and the forward pointer. Initially, both pointers point to the first character of the next lexeme.
Using this method we can overcome the problem faced by the one-buffer scheme: even if the input is lengthy, the scanner knows where to continue in the next buffer half, since the contents of the previous half remain available. Thus there is no scope for loss of any data.
Sentinels:
In the two-buffer scheme we must check the forward pointer each time it is incremented. Thus we make two tests: one for the end of the buffer, and one to determine what character was read. We can combine these two tests if we place a sentinel character at the end of each buffer half, as sketched below.
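A rough Python sketch of the two halves with sentinels; the buffer layout, the half size N, and all names are illustrative assumptions (a real scanner works on raw disk blocks):

    # Sketch of the two-buffer scheme with sentinels. Each N-character half ends
    # with an EOF sentinel, so the inner loop needs only one test per character;
    # only on a sentinel hit do we decide "end of half" vs "end of input".

    EOF, N = "\0", 8                       # sentinel character, half size

    def load_half(stream, buf, base):
        """Fill one half starting at `base`; plant the sentinel after the data."""
        data = stream.read(N)
        buf[base:base + len(data)] = list(data)
        buf[base + len(data)] = EOF        # lands mid-half only at real end of input

    import io
    stream = io.StringIO("position := initial + rate * 60")
    buf = [EOF] * (2 * N + 2)              # halves at 0..N-1 and N+1..2N, sentinels at N and 2N+1
    load_half(stream, buf, 0)

    fwd, out = 0, []
    while True:
        ch = buf[fwd]
        if ch != EOF:                      # the single per-character test
            out.append(ch); fwd += 1
        elif fwd == N:                     # sentinel ending the first half
            load_half(stream, buf, N + 1); fwd += 1
        elif fwd == 2 * N + 1:             # sentinel ending the second half
            load_half(stream, buf, 0); fwd = 0
        else:
            break                          # sentinel mid-half: genuine end of input
    print("".join(out))                    # the reconstructed character stream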
Note: The derivation process may end whenever one of the following happens:
i. The working string no longer contains any nonterminal symbols (including, as a special case, when the working string is ε), i.e., the working string has been generated.
ii. There are nonterminal symbols in the working string but none matches the left-hand side of any rule in the grammar. For example, if the working string were AaBb, this would happen if the only left-hand side were C.
Left Most Derivation (LMD): In the derivation process, if the leftmost variable is replaced at every step, then the derivation is said to be leftmost.
Example: E → E+E | E*E | a | b
Let us derive a string a+b*a by applying LMD.
E => E*E
=> E+E*E
=> a+E*E
=> a+b*E
=> a+b*a
Right Most Derivation (RMD): In the derivation process, if the rightmost variable is replaced at every step, then the derivation is said to be rightmost.
Example: E → E+E | E*E | a | b
Let us derive a string a+b*a by applying RMD.
E => E+E
=> E+E*E
=> E+E*a
=> E+b*a
=> a+b*a
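Both derivation orders can be mechanized with a small Python sketch for this grammar; the production bodies applied at each step are supplied by hand, and treating uppercase letters as variables is an illustrative convention:

    # Sketch: replay a derivation mechanically for E -> E+E | E*E | a | b.
    # At each step the leftmost (or rightmost) nonterminal is replaced by the
    # production body given in `steps`; nonterminals are uppercase letters.

    def derive(start, steps, leftmost=True):
        sentential = start
        print(sentential)
        for body in steps:
            picks = [i for i, s in enumerate(sentential) if s.isupper()]
            i = picks[0] if leftmost else picks[-1]   # leftmost vs rightmost variable
            sentential = sentential[:i] + body + sentential[i + 1:]
            print("=>", sentential)
        return sentential

    # LMD of a+b*a: E => E*E => E+E*E => a+E*E => a+b*E => a+b*a
    derive("E", ["E*E", "E+E", "a", "b", "a"], leftmost=True)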
Sentential form: For a context-free grammar G, any string ‘w’ in (V U T)* which appears in some derivation step is called a sentential form; a sentential form consisting only of terminals is a sentence.
Sentential forms arise in two ways:
i. Left sentential form
ii. Right sentential form
Example: S => AB
=> aAbB
=> abB
=> abbB
=> abb
Here {S, AB, aAbB, abB, abbB, abb} can be obtained from the start symbol S; each string in the set is called a sentential form.
Left sentential form: For a context-free grammar G, any string ‘w’ in (V U T)* which appears in some step of a leftmost derivation is called a left sentential form.
Example: E => E*E
=> E+E*E
=> a+E*E
=> a+b*E
=> a+b*a
Left sentential forms = {E, E*E, E+E*E, a+E*E, a+b*E, a+b*a}
Right sentential form: For a context-free grammar G, any string ‘w’ in (V U T)* which appears in some step of a rightmost derivation is called a right sentential form.
Example: E => E+E
=> E+E*E
=> E+E*a
=> E+b*a
=> a+b*a
Right sentential forms = {E, E+E, E+E*E, E+E*a, E+b*a, a+b*a}
PARSE TREE (DERIVATION TREE):
What is a parse tree?
The derivation process can be shown in the form of a tree. Such trees are called derivation trees or parse trees.
Example: E → E+E | E*E | a | b
The parse tree for the LMD of the string a+b*a is as shown below:
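(The original figure is not reproduced here; the tree below is reconstructed from the LMD above, where the root expands by E → E*E and the left subtree by E → E+E.)

                E
              / | \
             E  *  E
           / | \   |
          E  +  E  a
          |     |
          a     b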
For formalizing computability, Turing assumed that, while computing, a person writes symbols on a one-dimensional paper (instead of a two-dimensional paper, as is usually done), which can be viewed as a tape divided into cells.
One can scan the cells one at a time, usually performing one of three simple operations:
1. Writing a new symbol in the cell being currently scanned,
2. Moving to the cell left of the present cell, and
3. Moving to the cell right of the present cell.
With these observations in mind, Turing proposed his 'computing machine', now called the Turing machine.
Define Turing machine.
A Turing machine M is a 7-tuple, namely (Q, ∑, Γ, δ, q0, B, F), where
• Q is a finite nonempty set of states,
• Γ is a finite nonempty set of tape symbols,
• B ∈ Γ is the blank symbol,
• ∑ is a nonempty set of input symbols and is a subset of Γ, with B ∉ ∑,
• δ is the transition function mapping (q, x) onto (q′, y, D), where D denotes the direction of movement of the R/W head: D = L or R according as the movement is to the left or right,
δ: Q × Γ → Q × Γ × {L, R}
• q0 ∈ Q is the initial state, and
• F ⊆ Q is the set of final states.
TURING MACHINE MODEL
With neat diagram explain the working principle of a basic Turing machine.
A Turing machine is the 7-tuple M = (Q, ∑, Γ, δ, q0, B, F) defined above.
The Turing machine model uses an infinite tape as its unlimited memory. The input symbols occupy some of the tape's cells and can be preceded and followed by an infinite number of blank (B) characters. Each cell can store only one symbol. The input to and the output from the finite state automaton are effected by the R/W head, which can examine one cell at a time.
A move of the Turing machine is a function of the state of the finite control and the tape symbol scanned. In one move, the TM will change state; the next state may optionally be the same as the current state.
At each step of the computation, the machine can:
1. Read/scan the symbol below the R/W head,
2. Write a new symbol below the R/W head,
3. Move the R/W head one step left, or
4. Move the R/W head one step right.
The finite control is a kind of FSM which has:
• an initial state,
• final (accepting) states, and
• a rejecting state.
• Rejecting state
Computation can either: Halt and ACCEPT or Halt and REJECT or LOOP (the machine fails to
HALT).
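A minimal simulator sketch in Python following the 7-tuple definition above; the example machine, which complements a binary string, is illustrative:

    # Minimal sketch of a Turing machine simulator following the 7-tuple
    # definition above: delta maps (state, tape symbol) -> (state', symbol', L/R).
    B = "B"                                          # blank symbol

    def run_tm(delta, q0, finals, tape_input):
        tape = dict(enumerate(tape_input))           # infinite tape, blank elsewhere
        state, head = q0, 0
        while state not in finals:
            x = tape.get(head, B)
            if (state, x) not in delta:
                return False, tape                   # no move: halt and reject
            state, y, direction = delta[(state, x)]
            tape[head] = y                           # write
            head += 1 if direction == "R" else -1    # move the R/W head
        return True, tape                            # halted in a final state

    # Illustrative machine: complement a binary string, accept on reaching blank.
    delta = {
        ("q0", "0"): ("q0", "1", "R"),
        ("q0", "1"): ("q0", "0", "R"),
        ("q0", B):   ("qf", B, "R"),
    }
    accepted, tape = run_tm(delta, "q0", {"qf"}, "0110")
    print(accepted, "".join(tape.get(i, B) for i in range(4)))   # True 1001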
REPRESENTATION OF TURING MACHINES
We can describe a Turing machine by employing
1. Instantaneous descriptions (IDs) using move relations,
2. A transition table, and
3. A transition diagram (transition graph).
CODE GENERATION
The code generator phase generates the target code, taking the intermediate code as input. The output of the intermediate code generator may be given directly to code generation or may pass through code optimization before code generation.
Issues in the Design of Code Generation:
Explain the issues in the design of a code generator. (10)
The main issues in the design of code generation are:
i. Intermediate representation.
ii. Target Code.
iii. Instruction selection.
iv. Register Allocation.
v. Evaluation Order.
Intermediate representation:
The input to the code generator is the intermediate representation of the source program produced by the front-end phases of the compiler, along with information in the symbol table. This may be a linear representation such as postfix notation or three-address code (quadruples), or a graphical representation such as a syntax tree or DAG. We assume that the input to the code generator has already been type checked and is free of errors.
Target code:
The target code may be absolute code, relocatable machine code, or assembly language code. Absolute code can be executed immediately, as the addresses are fixed. Relocatable code requires a linker and loader to place the code in the appropriate location and to map (link) the required library functions. If the compiler generates assembly-level code, then an assembler is needed to convert it into machine-level code before execution. Relocatable code provides a great deal of flexibility, as functions can be compiled separately before generation of the object code.
Instruction Selection:
The code generator must map the intermediate-representation program (three-address code) into a sequence of instructions that can be executed by the target machine. This mapping is determined by factors such as the uniformity and completeness of the instruction set, instruction speeds, and machine idioms.
Register Allocation:
If the operands are in registers, execution is faster; hence the set of variables whose values are required at a given point in the program should be retained in registers. Familiarity with the target machine and its instruction set is a prerequisite for designing a good code generator.
Consider a hypothetical byte-addressable machine as the target machine. It has n general-purpose registers R1, R2, …, Rn. The machine instructions are two-address instructions of the form:
op-code source-address, destination-address
Example:
MOV R0, R1
ADD R1, R2
The target machine supports the following addressing modes:
a. Absolute addressing mode.
Example: MOV R0, M, where M is the address of the memory location of one of the operands; MOV R0, M moves the contents of register R0 to memory location M.
b. Register addressing mode, where both operands are in registers.
Example: ADD R0, R1
c. Immediate addressing mode: the operand value appears in the instruction.
Example: ADD #1, R0
d. Indexed addressing mode: this is of the form c(R), where the address of the operand is c + contents(R).
Example: MOV 4(R0), M; the operand is located at address 4 + contents(R0).
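A toy Python sketch of instruction selection for this two-address target; the three-address input format, the fresh-register policy, and all names are illustrative assumptions:

    # Sketch: translate three-address statements t = x op y into the two-address
    # target described above (op-code source, destination), with a naive register
    # allocator that assigns each result a fresh register.

    OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL"}

    def gen(three_address_code):
        loc, next_reg, out = {}, 0, []           # loc: where each name lives now
        def operand(name):
            return loc.get(name, name)           # register if loaded, else memory
        for dest, x, op, y in three_address_code:
            reg = f"R{next_reg}"; next_reg += 1
            out.append(f"MOV {operand(x)}, {reg}")           # load first operand
            out.append(f"{OPCODES[op]} {operand(y)}, {reg}")  # reg := reg op y
            loc[dest] = reg                      # the result now lives in reg
        return out

    # a := b + c ; d := a * e
    for line in gen([("a", "b", "+", "c"), ("d", "a", "*", "e")]):
        print(line)
    # MOV b, R0 / ADD c, R0 / MOV R0, R1 / MUL e, R1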
Evaluation Order:
The order in which computations are performed can affect the efficiency of the target code. Some computation orders require fewer registers to hold intermediate results than others. However, picking the best order in the general case is a difficult, NP-complete problem.
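One standard way to see why order matters is Sethi-Ullman-style labeling, which counts the registers an expression tree needs when the more register-hungry subtree is evaluated first; the sketch below is a simplified variant (every leaf counted as needing one register), not the exact algorithm, and is an illustrative aid rather than part of the notes above:

    # Sketch of register need via simplified Sethi-Ullman labeling: a leaf needs
    # one register; an interior node needs the larger child count if the two
    # children differ, and one more than either child if they are equal.

    def need(node):
        if isinstance(node, str):                # leaf: a variable
            return 1
        _, left, right = node
        l, r = need(left), need(right)
        return max(l, r) if l != r else l + 1

    # (a + b) * (c + d) needs 3 registers, but the left-deep chain
    # ((a + b) + c) + d computes the same number of additions with only 2.
    print(need(("*", ("+", "a", "b"), ("+", "c", "d"))))   # 3
    print(need(("+", ("+", ("+", "a", "b"), "c"), "d")))   # 2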