Compiler II Lecture Note For GOUNI
Oguike
Course Contents
1. Grammar and language
2. Recognizer
3. Top down and bottom up parsing
4. Production language
5. Run time storage organization
6. The use of display in run time storage allocation
7. L-R grammar and analyzer
8. Construction of L-R table
9. Organisation of symbol tables
10. Allocation of storage to runtime variables
11. Code generation
12. Optimization
13. Translator writing system
Analysis:
This task determines the structure of the source code and its meaning. The analysis task
deals with the properties of the programming language used to write the source code. It
converts the source code into an abstract representation, based on the syntax of the
programming language. This abstract representation is usually implemented as a tree. The
analysis task is further broken into two major compiler tasks, which are:
o Structural Analysis
  This task determines the static structure of the source code. It is further divided
  into the following tasks:
    Lexical Analysis
      This task identifies the basic symbols or words or tokens of the source
      code. The process of performing this task is called lexical analysis, and the
      part of the compiler that does this task is called the lexical analyzer or scanner.
    Syntactic Analysis
      This task ensures that the source code conforms to the syntax of the source
      programming language.
o Semantic Analysis
Synthesis
This task creates a target code equivalent of the source code. This task begins from the
developed abstract representation (tree), by providing additional information that relates to
the mapping of the source code to target code. Synthesis consists of these two major tasks:
o Code Generation
  This task transforms the abstract representation into an equivalent target code.
o Assembly
  This task resolves target addressing and converts the target code into an
  appropriate output format accepted by the link editor (linker).
However, errors can be detected and reported at any stage of the compilation process. This
means that error handling is part of the tasks that a compiler performs at every stage of the
compilation process.
Finally, code optimization is another task that a compiler performs. It attempts to improve
the target code based on a specific measure of cost, e.g. code size or execution speed. The diagram
that follows shows the various tasks of a compiler.
[Diagram: Compilation, divided into Analysis and Synthesis]
2. Syntactic Analysis
2.0 Introduction
This is one of the tasks in the compilation process, which ensures that source codes conform to
the syntax of the source programming language.
The tokens of a programming language fall into the following categories:
Reserved words: These are sometimes called the keywords of the programming language.
In the Java programming language, reserved words or keywords include the following:
class, public, static, void, if, while, etc.
Constants or literals: Depending on the data types of the programming language, these can be
any of the following: numeric constants, e.g. 34, 657; string constants, e.g. “hello”,
“welcome”; Boolean constants, e.g. true, false.
Special symbols: These are the special symbols that the programming language permits.
Examples of special symbols in the Java programming language are: {, }, <, =, >, +, -, etc.
Identifiers: These are names that the programmer chooses for user-defined entities. The
following can serve as identifiers in Java: myjavaprogram, my_program, etc. A reserved
word cannot be used as an identifier. In some programming languages, identifiers have a
fixed maximum size, while in others, identifiers can have any length.
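The principle of longest substring is used to determine the end of a token or word: the scanner
collects the longest sequence of characters that can form a single token. As a consequence, two
adjacent tokens that could otherwise be read as one must be separated by a token delimiter or
white space.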
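The principle of longest substring can be illustrated with a small scanner. The following is a minimal sketch in Python, not a full Java scanner: the names `tokenize`, `KEYWORDS` and `TOKEN_PATTERNS`, the chosen regular expressions, and the short keyword list are all assumptions made for this illustration.

```python
import re

# Hypothetical, deliberately small token definitions for illustration only.
KEYWORDS = {"if", "while", "public", "static", "void"}
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("NAME", re.compile(r"[A-Za-z_][A-Za-z_0-9]*")),
    ("SYMBOL", re.compile(r"<=|>=|==|[{}<>=+\-()]")),
]


def tokenize(text):
    """Split text into (category, lexeme) pairs using the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            # White space only separates tokens; it is not a token itself.
            pos += 1
            continue
        best = None
        for category, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            # Keep the candidate that consumes the most characters.
            if match and (best is None or match.end() > best[1].end()):
                best = (category, match)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        category, match = best
        lexeme = match.group()
        if category == "NAME" and lexeme in KEYWORDS:
            category = "RESERVED"
        tokens.append((category, lexeme))
        pos = match.end()
    return tokens


print(tokenize("ifx<=10"))
print(tokenize("if x <= 10"))
```

In this sketch, `ifx` is read as one identifier because the scanner keeps consuming characters as long as they still form a token, whereas `if x` is read as a reserved word followed by an identifier because white space ends the first token.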
Consider the following context free grammar in BNF:
<exp> := <exp> + <exp> | <exp> . <exp>
| (<exp> ) | <number>
<number> := <number> <digit> | <digit>
<digit> := 0|1|2|3|4|5|6|7|8|9
The first production means that an expression, <exp>, consists of an <exp> followed by a + sign,
followed by an <exp>, OR an <exp> followed by a . sign, followed by an <exp>, OR an <exp>
enclosed in ( ) brackets, OR a <number>.
In the second production, a <number> consists of a <number> followed by a <digit>,
OR a <number> consists of a single <digit>.
In the third production, a <digit> is 0 or 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9.
[Syntax diagram for noun-phrase, containing the terminal "the" and the non-terminal noun]
Consider each of the following production rules in EBNF with the corresponding syntax diagram.
[Syntax diagram for number, built from digit]
<digit> := 0|1|2|3|4|5|6|7|8|9
[Syntax diagram for digit]
The railroad diagram for the production rule <finalyear> in EBNF is shown below:
[Railroad diagram for <finalyear>, containing <fourthyear> and <external>]
Furthermore, the context free grammar for the <if-statement> is shown below in EBNF, followed by
its syntax diagram.
<if-statement> := IF <condition> THEN <statement> [ELSE <statement>]
In conclusion, syntax diagrams are best drawn from context free grammars that are written in EBNF.
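The EBNF rule for <if-statement> maps quite directly onto code, because the optional part in square brackets becomes an ordinary conditional. The sketch below is a simplified illustration in Python and rests on assumptions that are not in the notes: the input is already a list of tokens, a <condition> is a single token, and a <statement> is either a nested IF or a single token. The function names are hypothetical.

```python
def expect(tokens, pos, word):
    """Check that tokens[pos] is the given keyword and step past it."""
    if pos >= len(tokens) or tokens[pos] != word:
        raise SyntaxError(f"expected {word} at token {pos}")
    return pos + 1


def take(tokens, pos):
    """Return the token at pos together with the next position."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    return tokens[pos], pos + 1


def parse_statement(tokens, pos):
    """<statement>: a nested if-statement or (by assumption) one token."""
    if pos < len(tokens) and tokens[pos] == "IF":
        return parse_if(tokens, pos)
    return take(tokens, pos)


def parse_if(tokens, pos=0):
    """IF <condition> THEN <statement> [ELSE <statement>]"""
    pos = expect(tokens, pos, "IF")
    condition, pos = take(tokens, pos)
    pos = expect(tokens, pos, "THEN")
    then_part, pos = parse_statement(tokens, pos)
    else_part = None
    # The [ ... ] of EBNF: the ELSE part is parsed only if it is present.
    if pos < len(tokens) and tokens[pos] == "ELSE":
        else_part, pos = parse_statement(tokens, pos + 1)
    return ("if", condition, then_part, else_part), pos


print(parse_if("IF c THEN s1 ELSE s2".split()))
print(parse_if("IF c THEN s1".split()))
```

Each call returns the tree that was built and the position of the first unread token, so the two calls above differ only in whether the fourth component of the tree is filled in.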
2.2 Parse Trees
The context free grammar is used to generate the parse tree. The interior nodes of the parse tree
are the non-terminals that are defined in the context free grammar, while the leaves of the
parse tree are the terminals. Reading the leaves from left to right gives the input string of tokens
or words that is being parsed. The following examples illustrate how a parse tree is generated
from the context free grammar. Consider the context free grammar for the non-terminal
<sentence>, which is given as:
Consider this example: Suppose we want to generate the parse tree that will be used to parse the
sentence, “The girl sees a dog.” The parse tree is given as:
[Figure 2.2.1: Parse tree for “The girl sees a dog.”, with root <sentence> and children
<nounphrase>, <verbphrase> and the terminal "."]
Observe that the leaves of the parse tree are the terminals, while the interior nodes of the parse
tree are the non-terminals. When the leaves of the parse tree are recombined from left to right,
you get back the sentence that you are parsing.
Consider another example: Suppose the context free grammar in BNF for the non-terminal
<number> is given below as:
<number> := <number> <digit> | <digit>
<digit> := 0|1|2|3|4|5|6|7|8|9
The parse tree for the number 2345 is given as:
                     <number>
                    /        \
             <number>        <digit>
            /        \          |
     <number>        <digit>    5
    /        \          |
<number>     <digit>    4
   |            |
<digit>         3
   |
   2
Figure 2.2.3
When we remove all the non-terminals in the parse tree, we obtain the syntax tree, which is
shown below as:
[Figure 2.2.4: Syntax tree for the number 2345]
When the leaves of the tree are recombined from left to right, we obtain the number 2345.
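The way the left recursive rule <number> := <number> <digit> | <digit> builds a parse tree can be sketched in a few lines of Python. This is only an illustration: representing a node as a tuple whose first element is the non-terminal, and the function names `parse_number` and `leaves`, are assumptions made here.

```python
def parse_number(digits):
    """Build the parse tree for <number> := <number> <digit> | <digit>."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("a <number> is a non-empty string of digits")
    # Base case of the grammar: a <number> that is a single <digit>.
    tree = ("<number>", ("<digit>", digits[0]))
    # Each further digit applies <number> := <number> <digit> once more,
    # so the tree built so far becomes the left child of the new node.
    for ch in digits[1:]:
        tree = ("<number>", tree, ("<digit>", ch))
    return tree


def leaves(tree):
    """Collect the terminals (leaves) of a parse tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:  # tree[0] is the non-terminal labelling the node
        result.extend(leaves(child))
    return result


tree = parse_number("2345")
print(tree)
print("".join(leaves(tree)))
```

Reading the leaves of the resulting tree from left to right reproduces the digits of the number that was parsed.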
Consider another example: Suppose the context free grammar for the non-terminal <exp> is given
below as:
<exp> := <exp> + <exp> | <exp> . <exp>
| (<exp> ) | <number>
<number> := <number> <digit> | <digit>
<digit> := 0|1|2|3|4|5|6|7|8|9
We want to use the context free grammar to generate the parse tree for the <exp>, 3 + 4.5:
            <exp>
          /   |   \
     <exp>    +    <exp>
       |          /  |  \
   <number>  <exp>   .   <exp>
       |       |           |
    <digit> <number>    <number>
       |       |           |
       3    <digit>     <digit>
               |           |
               4           5
Figure 2.2.5
When we remove all the non-terminals, we obtain the syntax tree, as shown below:
      +
     / \
    3   .
       / \
      4   5
Figure 2.2.6
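The same grammar also admits a second parse tree for 3 + 4.5, in which the . sign is at the root:
               <exp>
             /   |   \
        <exp>    .    <exp>
       /  |  \          |
  <exp>   +   <exp>  <number>
    |           |       |
 <number>   <number> <digit>
    |           |       |
 <digit>    <digit>     5
    |           |
    3           4
When we remove all the non-terminals, we obtain the syntax tree, as shown below:
        .
       / \
      +   5
     / \
    3   4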
Exercise 2.2
Use the above context free grammar in BNF to draw the parse tree and the syntax tree for the
following expressions:
a. 345.65 + 64.3
[Parse tree diagram for 345.65 + 64.3, with root <exp> and children <exp>, + and <exp>]
b. 20 + (15.3 + 5)
[Figure 2.3.1: First parse tree for the <exp>, 4 – 3 – 2]
[Figure 2.3.2: Second parse tree for the <exp>, 4 – 3 – 2]
In the first parse tree, the <exp>, 4 – 3 – 2 is parsed as 4 – (3 – 2); this is called right
associative. In the second parse tree, the <exp>, 4 – 3 – 2 is parsed as (4 – 3) – 2; this is
called left associative. A grammar that allows two different parse trees for the same <exp> is
ambiguous, and here the two parse trees lead to different results: 4 – (3 – 2) = 3, while
(4 – 3) – 2 = –1. In order to remove the ambiguity, we need to redefine the grammar rule:
<exp> := <exp> - <exp> | <exp> . <exp>
| (<exp> ) | <number>
We redefine it so that the recursion appears on only one side of the symbol, -, either the left
hand side or the right hand side. This leads to a left recursive grammar and a right recursive
grammar, respectively.
The right recursive version of the above grammar rule is given below as:
<exp> := <number> - <exp> | <exp> . <exp>
| (<exp> ) | <number>
This production rule will parse the <exp>, 4 – 3 – 2 as right associative.
The left recursive version of the original production rule for <exp> is given below as:
<exp> := <exp> - <number> | <exp> . <exp>
| (<exp> ) | <number>
This production rule will parse the <exp>, 4 – 3 – 2 as left associative. Therefore, rewriting the
rule as either right recursive or left recursive removes the ambiguity that the double recursive
rule causes for the - symbol.
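The effect of the two rewritten rules on the value of an <exp> can be checked with a short Python sketch. It is an illustration only: it assumes the <exp> has already been reduced to its list of numbers, all joined by the - symbol, and the function names `eval_left` and `eval_right` are made up for this example.

```python
def eval_left(numbers):
    """Evaluate n1 - n2 - ... using <exp> := <exp> - <number> | <number>."""
    if not numbers:
        raise ValueError("an <exp> needs at least one <number>")
    if len(numbers) == 1:
        return numbers[0]
    # Left recursion: everything before the last '-' is itself an <exp>.
    return eval_left(numbers[:-1]) - numbers[-1]


def eval_right(numbers):
    """Evaluate n1 - n2 - ... using <exp> := <number> - <exp> | <number>."""
    if not numbers:
        raise ValueError("an <exp> needs at least one <number>")
    if len(numbers) == 1:
        return numbers[0]
    # Right recursion: everything after the first '-' is itself an <exp>.
    return numbers[0] - eval_right(numbers[1:])


print(eval_left([4, 3, 2]))   # (4 - 3) - 2
print(eval_right([4, 3, 2]))  # 4 - (3 - 2)
```

Each rule gives exactly one grouping for 4 – 3 – 2, so the language designer's choice of rule fixes the meaning of the expression.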
The complete revised context free grammar for the <exp> can be defined as a left recursive
grammar, as shown below:
<exp> := <exp> - <term> | <term>
<term> := <term> . <factor> | <factor>
<factor> := ( <exp> ) | <number>
<number> := <number> <digit> | <digit>
<digit> := 0|1|2|3|4|5|6|7|8|9
It can also be defined with a right recursive rule for <exp>, as shown below:
<exp> := <term> - <exp> | <term>
<term> := <term> . <factor> | <factor>
<factor> := ( <exp> ) | <number>
<number> := <number> <digit> | <digit>
<digit> := 0|1|2|3|4|5|6|7|8|9
Either of the two removes the ambiguity that existed in the left-right (double) recursive
grammar. The choice depends on what the language designer wants.
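The revised left recursive grammar can be turned into a small parser that builds syntax trees. The sketch below is one possible way to do this in Python and is not taken from the notes: it replaces each left recursive rule by a loop (a common technique in hand-written recursive descent parsers), represents a syntax tree as a nested tuple, and leaves the . symbol uninterpreted. The names `split_tokens`, `Parser` and `parse` are hypothetical.

```python
import re


def split_tokens(text):
    """Return the numbers and symbols in text; other characters such as spaces are skipped."""
    return re.findall(r"[0-9]+|[-.()]", text)


class Parser:
    def __init__(self, text):
        self.tokens = split_tokens(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        if token is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return token

    def exp(self):
        # <exp> := <exp> - <term> | <term>, with the left recursion as a loop.
        tree = self.term()
        while self.peek() == "-":
            self.advance()
            tree = ("-", tree, self.term())
        return tree

    def term(self):
        # <term> := <term> . <factor> | <factor>
        tree = self.factor()
        while self.peek() == ".":
            self.advance()
            tree = (".", tree, self.factor())
        return tree

    def factor(self):
        # <factor> := ( <exp> ) | <number>
        token = self.advance()
        if token == "(":
            tree = self.exp()
            if self.advance() != ")":
                raise SyntaxError("expected )")
            return tree
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    parser = Parser(text)
    tree = parser.exp()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("4 - 3 - 2"))
print(parse("4 - 3 . 2"))
print(parse("4 - (3 - 2)"))
```

Because <term> sits below <exp> in the grammar, the . symbol groups more tightly than the - symbol, and the loop in `exp` makes - left associative, which matches the left recursive grammar above.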
Exercise 2.3
Use the revised left recursive and right recursive grammar in BNF, which is shown above to
generate parse tree and abstract syntax tree for each of the following <exp>:
1. 2 – 5 – 8 – 3
2. 43.7 – (34 – 23.76)
3. Semantic Analysis
3.0 Introduction
This is one of the tasks of the compilation process that determines the meaning of the source code.
One of the fundamental mechanisms in any programming language is the use of names, or
identifiers assigned by the programmer to denote language entities or constructs, like variables,
procedures (methods), constants etc. Therefore, a fundamental step in semantic analysis is to
describe the convention that determines the meaning of such names used in a program. Another
aspect of the semantics of a programming language is the concept of location and value. Values are
storable quantities, like integers, reals, etc., while locations are like memory addresses where
values are stored and retrieved from.
program example;
var x : integer;
    y : boolean;
procedure p;
var x : boolean;
  procedure q;
  var y : integer;
  begin (* q *)
    ...
  end; (* q *)
begin (* p *)
  ...
end; (* p *)
begin (* main *)
  ...
end. (* main *)
The names in the above program are example, x, y, p and q, but x and y are each associated with
two different declarations that have different scopes. In the tables below, each name has a stack
of bindings; the most recent declaration is written first and hides the ones after it. After the
processing of the declarations of p, the symbol table can be represented as:
Names   Attributes / Binding
x       boolean, local to p  ->  integer, global
y       boolean, global
p       procedure, global
However, after the processing of the declarations of q, the stack that is used to maintain the
symbol table will change as follows:
Names   Attributes / Binding
x       boolean, local to p  ->  integer, global
y       integer, local to q  ->  boolean, global
p       procedure, global
q       procedure, local to p
After the processing of the body of q, during the processing of the body of p, the
symbol table, which is implemented as a stack, looks as shown below:
Names   Attributes / Binding
x       boolean, local to p  ->  integer, global
y       boolean, global
p       procedure, global
q       procedure, local to p
Finally, after the processing of the procedure p, the symbol table is shown below:
Names   Attributes / Binding
x       integer, global
y       boolean, global
p       procedure, global
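The behaviour of the symbol table for the program example can be imitated with a few lines of Python. This is a sketch under assumptions, not the data structure of any particular compiler: the class `SymbolTable` and its methods are hypothetical, each name maps to a stack of attribute strings, and a second stack remembers which names were declared in each open scope so that they can be removed when the scope ends.

```python
class SymbolTable:
    def __init__(self):
        self.bindings = {}   # name -> stack of attributes, newest last
        self.scopes = [[]]   # names declared in each open scope, innermost last

    def declare(self, name, attributes):
        self.bindings.setdefault(name, []).append(attributes)
        self.scopes[-1].append(name)

    def enter_scope(self):
        self.scopes.append([])

    def exit_scope(self):
        # Remove every binding that was created in the scope being closed.
        for name in self.scopes.pop():
            self.bindings[name].pop()
            if not self.bindings[name]:
                del self.bindings[name]

    def lookup(self, name):
        # The newest binding hides any older binding of the same name.
        return self.bindings[name][-1]


table = SymbolTable()
table.declare("x", "integer, global")
table.declare("y", "boolean, global")
table.declare("p", "procedure, global")

table.enter_scope()                          # declarations inside p
table.declare("x", "boolean, local to p")
table.declare("q", "procedure, local to p")

table.enter_scope()                          # declarations inside q
table.declare("y", "integer, local to q")
inside_q = (table.lookup("x"), table.lookup("y"))
table.exit_scope()                           # end of q

inside_p = (table.lookup("x"), table.lookup("y"))
table.exit_scope()                           # end of p

after_p = (table.lookup("x"), table.lookup("y"))
print(inside_q)
print(inside_p)
print(after_p)
```

The three printed pairs correspond to the bindings of x and y while the body of q, the body of p, and the main program are being processed.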
Exercise
Consider the following Pascal program:
program example;
var x : integer;
    y : boolean;
procedure p;
var x : boolean;
  procedure q;
  var y : integer;
    procedure r;
    var y : boolean;
    begin (* r *)
      ...
    end; (* r *)
  begin (* q *)
    ...
  end; (* q *)
begin (* p *)
  ...
end; (* p *)
begin (* main *)
  ...
end. (* main *)
[Figure 3.3a: Stack data structure maintaining the environment of the program, with the unused
part of the stack marked as free memory]
A pointer called an environment pointer is used to maintain the current environment in the stack
data structure.
Consider a program that contains several procedures that are not nested, as shown below:
program eg1;
procedure p;
begin
  ...
end;
procedure q;
begin
  p;
end;
begin
  q;
end.
In the above program, each call of a procedure is called an activation. Whenever there is an
activation, an activation record is created and pushed onto the stack based environment. Each
activation record contains the storage locations of all the variables of that procedure.
Whenever an activation record is created and pushed onto the stack based environment, the
environment pointer is set to point to the new activation record. When the procedure finishes
execution, the current activation record is popped off the stack based environment and the
environment pointer points again to the activation record of the procedure that made the call.
To make this possible, the current activation record must maintain a pointer to
the activation record of its caller; this pointer is called the control link.
Therefore, each activation record other than the first record that was pushed onto the stack based
environment contains the storage locations of all the local and global variables of that
procedure, together with the control link.
At the beginning of execution of program eg1, the stack based environment will be the same as
figure 3.3a. However, after the call to procedure q, an activation record will be pushed into the
stack based environment, as shown in figure 3.3b.
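+----------------------------------+
| Global variables of eg1          |
+----------------------------------+
| Global and local variables of q  |  Activation record of q
| Control link                     |  (the environment pointer points here)
+----------------------------------+
| Free memory                      |
+----------------------------------+
Figure 3.3b Stack based activation record after the call to procedure q.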
Furthermore, when p is called from q, an activation record for p is pushed onto the stack based
environment, as shown in Figure 3.3c below.
[Figure 3.3c: Stack based environment after p is called from q, showing the global variables of
eg1, the activation record of p with its control link, and free memory]
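The pushing and popping of activation records for program eg1 can be imitated in Python. The sketch below only models the bookkeeping described above; the classes `ActivationRecord` and `Environment`, and the `trace` list used to record the state of the stack, are assumptions made for this illustration.

```python
class ActivationRecord:
    def __init__(self, procedure, control_link):
        self.procedure = procedure
        self.variables = {}               # storage for the procedure's variables
        self.control_link = control_link  # activation record of the caller


class Environment:
    def __init__(self):
        self.current = None  # the environment pointer
        self.trace = []      # snapshots of the stack, taken after each push

    def chain(self):
        """Follow the control links from the current record to the oldest one."""
        names = []
        record = self.current
        while record is not None:
            names.append(record.procedure)
            record = record.control_link
        return names

    def call(self, procedure, body):
        # Push: the new record remembers its caller through the control link.
        record = ActivationRecord(procedure, self.current)
        self.current = record
        self.trace.append(self.chain())
        body()
        # Pop: the environment pointer goes back to the caller's record.
        self.current = record.control_link


env = Environment()


def p():
    pass  # body of p


def q():
    env.call("p", p)  # q calls p


env.call("q", q)  # the main program calls q
print(env.trace)
```

The first snapshot corresponds to Figure 3.3b (only q is active) and the second to Figure 3.3c (p is on top of q); after both procedures return, the environment pointer no longer refers to any activation record.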
program eg2;
var x : integer;
procedure p(y : integer);
var i : integer;
    b : boolean;
begin
  i := x;
end;
procedure q(a : integer);
var x : integer;
begin
  p(1);
end;
begin
  q(2);
end.
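The compiler will maintain the following stack based environment at various stages of execution
of the program:
+----------------------+
| Global environment   |
+----------------------+
| a                    |
| x                    |  Activation record of q
| control link         |
| access link          |
+----------------------+
| y                    |
| i                    |
| b                    |  Activation record of p
| control link         |
| access link          |
+----------------------+
| Free memory          |
+----------------------+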
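The difference between the control link and the access link in program eg2 can be shown with a short Python sketch. It assumes the usual static scoping of Pascal, in which the access link of an activation record points to the environment where the procedure was declared (here the global environment for both p and q). The class `Record`, the two lookup functions, and the values 10 and 20 given to the two variables named x are hypothetical, chosen only so that the two lookups can be told apart.

```python
class Record:
    def __init__(self, name, variables, control_link=None, access_link=None):
        self.name = name
        self.variables = variables
        self.control_link = control_link  # record of the caller
        self.access_link = access_link    # record of the defining environment


def lookup(record, name):
    """Find a variable by following access links (static scoping)."""
    while record is not None:
        if name in record.variables:
            return record.variables[name]
        record = record.access_link
    raise NameError(name)


def lookup_via_control_link(record, name):
    """For comparison only: search the callers instead of the defining environment."""
    while record is not None:
        if name in record.variables:
            return record.variables[name]
        record = record.control_link
    raise NameError(name)


# Assumed values: the global x holds 10 and the x local to q holds 20.
global_env = Record("eg2", {"x": 10})
q_record = Record("q", {"a": 2, "x": 20},
                  control_link=global_env, access_link=global_env)
p_record = Record("p", {"y": 1, "i": None, "b": None},
                  control_link=q_record, access_link=global_env)

# The statement i := x inside p refers to the global x, found via the access link.
p_record.variables["i"] = lookup(p_record, "x")
print(p_record.variables["i"])
print(lookup_via_control_link(p_record, "x"))
```

Following the access link from p reaches the global x, while following the control link would wrongly reach the x that is local to q; this is why the activation record of p needs both links.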