CSC411 Compiler Construction - MO Onyesolu and OU Ekwealor - First Semester 2020/2021 Session
Introduction
Compiler construction is an area of computer science that deals with the theory and practice of
developing programming languages and their associated compilers. The theoretical portion is primarily
concerned with the syntax, grammar and semantics of programming languages. Computers are a balanced
blend of software and hardware. Hardware is just a piece of machinery whose functions are
controlled by compatible software. Hardware understands instructions in the form of electronic charge,
which is the counterpart of binary language. Binary language has only two alphabets, 0 and 1. To
instruct the hardware, codes must be written in binary format, which is simply a series of 1s and 0s. It
would be a difficult and cumbersome task for computer programmers to write such codes directly, which is why
we have compilers to generate them. Programs are written in a high-level language, which is easier for
humans to understand and remember. These programs are then fed into a series of tools and operating
system (OS) components to get the desired code that can be used by the machine. This is known as the
Language Processing System.
When it executes (runs), the compiler first parses (analyzes) all of the language statements
syntactically, one after the other, and then, in one or more successive stages or passes, builds the output
code, making sure that statements that refer to other statements are referred to correctly in the final
code. The output of the compilation is called object code or sometimes an object module. The object
code is machine code that the processor can execute one instruction at a time.
A compiler works with what are sometimes called 3GL (third generation language) and higher-level
languages. It serves as an interface between human-understandable language and machine-understandable
language by transforming the former into the latter.
Error messages
In a compiler, the source code is translated to object code successfully only if it is free of errors. When
there are errors in the source code, the compiler reports them, with line numbers, at the end of
compilation. The errors must be removed before the compiler can successfully recompile the source
code.
A program that translates from a low-level language to a higher-level one is a decompiler. A program that
translates between high-level languages is usually called a language translator, source-to-source
translator, or language converter. A language rewriter is usually a program that translates the form of
expressions without a change of language.
A compiler is likely to perform many or all of the following operations, namely, lexical analysis,
preprocessing, parsing, semantic analysis (syntax-directed translation), code generation, and code
optimization. Program faults caused by incorrect compiler behavior can be very difficult to track down
and work around; therefore, compiler implementers invest significant effort to ensure the correctness
of their software. The term compiler-compiler is sometimes used to refer to a parser generator, a tool
often used to help create the lexer and parser.
Most compilers are not a single tool; a compiler is a combination of five different but necessary
tools, namely:
(a) Editor (b) Debugger (c) Compiler (d) Linker (e) Loader
Types of Compiler
There are many different types of compilers which produce output in different useful forms. They
include:
a. Native Code Compiler – This compiler is used to compile source code for the same type of platform
only. The output generated by this type of compiler can only be run on the same type of computer
system and operating system that the compiler itself runs on.
b. Cross Compiler – This type of compiler is used to compile source code for a different kind of
platform. It is used in making software for embedded systems that can be used on multiple platforms. A
compiler that runs on platform (A) and is capable of generating executable code for platform (B) is
called a cross-compiler.
c. Source to Source Compiler – This type of compiler takes high-level language code as input and
outputs source code of another high-level language only. Unlike other compilers, which convert a
high-level language into low-level machine language, it can take code written in, say, Pascal and
transform it into C: the conversion of one high-level language into another high-level language at the
same level of abstraction. Thus, it is also known as a transpiler.
d. One Pass Compiler – This is a type of compiler that completes the whole compilation process in only one pass.
e. Threaded Code Compiler – This is a type of compiler that simply replaces a string by an appropriate
binary code.
f. Incremental Compiler – This compiler compiles only the changed lines of the source code and
updates the object code.
g. Source Compiler – This compiler converts the source code (high-level language code) into assembly
language only.
h. Just-in-time (JIT) Compiler – This is a type of compiler that defers compilation until runtime. It is
used for languages such as Python and JavaScript, and generally runs inside an interpreter.
Compilers are not the only language processor used to transform source programs. Others are
assemblers and interpreters.
Assembler
An assembler is a program that converts an assembly language program into the machine-level
instructions that can be executed by a computer. An assembler enables software and application
developers to access, operate and manage a computer's hardware architecture and components. It is
sometimes referred to as the compiler of assembly language. Some assemblers can also provide the
services of an interpreter.
The output of an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.
An assembler works by assembling and converting the source code of assembly language into object
code, or an object file, that constitutes a stream of zeros and ones of machine code, which is directly
executable by the processor.
Assemblers are classified, based on the number of times they read the source code before
translating it, into single-pass and multi-pass assemblers. A single-pass assembler scans the
program source only once and creates the equivalent binary program, substituting all of the symbolic
instructions with machine code in one pass. A multi-pass assembler is an assembler which uses more than
one pass in the assembly process; it goes through the assembly language source several
times and then generates the object code. The last pass is called the synthesis pass, and the assembler
keeps the program in an intermediate form between passes. A multi-pass assembler is comparatively
slower than a single-pass assembler, as some actions are performed more than once, i.e., duplicated.
Some high-end assemblers provide enhanced functionality by enabling the use of control
statements and data abstraction services, and by providing support for object-oriented programming
structures.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The
difference lies in the way they read the source code. A compiler reads the whole source code at
once, creates tokens, checks semantics, generates intermediate code, translates the whole program and
may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an
intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an
interpreter stops execution and reports it, whereas a compiler reads the whole program even if it
encounters several errors.
3
CSC411 Compiler Construction – MO Onyesolu and OU Ekwealor - First Semester 2020/2021 Session
An interpreter is a computer program that is used to directly execute program instructions written in
one of the many high-level programming languages. The interpreter either transforms the high-level program
into an intermediate language that it then executes, or parses the high-level source code and
performs the commands directly, line by line or statement by statement. An
interpreter directly executes instructions written in a programming or scripting language without
requiring them to have been previously compiled into a machine language program.
Humans write programs in high-level languages, which constitute source code. Computers, on the
other hand, can only understand programs written in binary languages, so either an interpreter or a
compiler is required.
Programming languages are implemented in two ways: interpretation or compilation. As the name
suggests, an interpreter transforms or interprets high-level programming code into code that can be
understood by the machine (machine code), or into an intermediate language that can be easily
executed as well.
The interpreter reads each statement of code and then converts or executes it directly. In contrast, an
assembler or a compiler converts a high-level source code into native (compiled) code that can be
executed directly by the operating system (e.g. by creating a .exe program).
Both compilers and interpreters have their advantages and disadvantages and are not mutually
exclusive; they can be used in conjunction, as most integrated development
environments employ both compilation and interpretation for some high-level languages.
In most cases, a compiler is preferable since its output runs much faster compared to a line-by-line
interpretation. Rather than scanning the whole program and translating it into machine code like a
compiler does, the interpreter translates code one statement at a time.
While an interpreter reduces the time needed to analyze source code, especially for a particularly
large program, overall execution time under an interpreter is comparatively slower than for compiled
code. On the other hand, since interpretation happens per line or statement, execution can be stopped
in the middle to allow for either code modification or debugging.
Compilers must generate intermediate object code that requires additional memory to be linked, in
contrast to interpreters, which tend to use memory more efficiently.
Since an interpreter reads and then executes code in a single process, it is very useful for scripting and
other small programs. As such, it is commonly installed on Web servers, which run a lot of executable
scripts. It is also used during the development stage of a program to test small chunks of code one by
one rather than having to compile the whole program every time.
Every source statement is executed line by line during execution, which is particularly appreciated
for debugging, since errors are recognized immediately. Interpreters are also used for educational
purposes, since they can be used to show students how to program one script at a time.
Programming languages that use interpreters include Python, Ruby, and JavaScript, while programming
languages that use compilers include Java, C++, and C.
An interpreter generally uses one of the following strategies for program execution:
a. parse the source code and perform its behavior directly;
b. translate the source code into some efficient intermediate representation and immediately execute it;
c. explicitly execute stored precompiled code, made by a compiler that is part of the interpreter system.
COMPILATION
Compilers enabled the development of programs that are machine-independent. Before the
development of FORTRAN (FORmula TRANslator), the first higher-level language, in the 1950s,
machine-dependent assembly language was widely used. While assembly language produces more
reusable and relocatable programs than machine code on the same architecture, it has to be modified
or rewritten if the program is to be executed on a different computer hardware architecture. With the
advance of high-level programming languages that followed FORTRAN, such as COBOL, C, and BASIC,
programmers could write machine-independent source programs. A compiler translates the high-level
source programs into target programs in machine languages for the specific hardware. Once the target
program is generated, the user can execute the program.
STRUCTURE OF A COMPILER
Compilers bridge source programs in high-level languages with the underlying hardware. Building a compiler
requires (1) determining the correctness of the syntax of programs, (2) generating correct and efficient
object code, (3) run-time organization, and (4) formatting output according to assembler and/or linker
conventions.
A compiler consists of three main parts: (a) the frontend, (b) the middle-end, and (c) the backend.
The front end checks whether the program is correctly written in terms of the programming language
syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a
useful way. Type checking is also performed by collecting type information. The frontend then
generates an intermediate representation or IR of the source code for processing by the middle-end.
The middle end is where optimization takes place. Typical transformations for optimization are removal
of useless or unreachable code, discovery and propagation of constant values, relocation of
computation to a less frequently executed place (e.g., out of a loop), or specialization of computation
based on the context. The middle-end generates another IR for the following backend. Most
optimization effort is concentrated in this part of the compiler.
The back end is responsible for translating the IR from the middle-end into assembly code. The target
instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the
program variables where possible. The backend utilizes the hardware by figuring out how to keep
parallel execution units busy, filling delay slots, and so on. Although many of the problems underlying
optimization are NP-hard, heuristic techniques for solving them are well developed.
Compilation is broadly divided into two phases, based on the way compilers work: (a) the analysis
phase and (b) the synthesis phase.
In the analysis phase, an intermediate representation is created from the given source code. This
contains: ➢ Lexical Analyzer, ➢ Syntax Analyzer, and ➢ Semantic Analyzer.
In the synthesis phase, the equivalent target program is created from this intermediate representation.
This contains: ➢ Intermediate Code Generator, ➢ Code Optimizer, and ➢ Code Generator.
[Figure: The front-end performs analysis, translating source code into an intermediate code representation; the back-end performs synthesis, translating the intermediate representation into machine code.]
PHASES OF COMPILER
The compilation process is a sequence of various phases. Each phase takes input from its previous stage,
has its own representation of source program, and feeds its output to the next phase of the compiler.
Let us understand the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of
characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the
form of tokens as:
<token-name, attribute-value>
Lexical analysis is the first phase of a compiler. It takes modified source code from language
preprocessors, written in the form of sentences, and breaks it
into a series of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with
the syntax analyzer: it reads character streams from the source code, checks for legal tokens, and
passes the data to the syntax analyzer on demand.
Tokens
A lexeme is a sequence of (alphanumeric) characters matched as a token. There are some predefined
rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by
means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of
regular expressions.
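To make the <token-name, attribute-value> pairing concrete, here is a minimal tokenizer sketch in Python; the token names and patterns are invented for illustration and are not part of any particular language definition.

import re

# Illustrative token patterns, tried in order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SKIP",   r"[ \t]+"),   # whitespace is discarded, not turned into a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))   # <token-name, attribute-value>
    return tokens

print(tokenize("position = initial + rate * 60"))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('STAR', '*'), ('NUMBER', '60')]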
[Figure: Phases of a compiler — Source Code → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Machine-Independent Code Optimiser → Code Generator → Machine-Dependent Code Optimiser → Target Code, with the Symbol Table and Error Handler shared by all phases.]
Syntax Analysis
Syntax Analysis, or Parsing, is the second phase, after lexical analysis. It takes the tokens produced by
lexical analysis as input and generates a data structure called a parse tree or syntax tree. The parse tree
is constructed by using the pre-defined grammar of the language and the input string. If the given input
string can be produced with the help of the syntax tree (in the derivation process), the input string is
found to be in the correct syntax; if not, an error is reported by the syntax analyzer. In this phase, token
arrangements are checked against the source code grammar, i.e., the syntax analyzer checks whether
the given input is in the correct syntax of the language in which the input has been written.
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in the
code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a
parse tree as the output of the phase. Parsers are expected to parse the whole code even if some errors
exist in the program, and they use error-recovery strategies to do so.
Derivation
A derivation is basically a sequence of production rule applications used to obtain the input string. During
parsing, we take two decisions for some sentential form of the input:
• Deciding the non-terminal which is to be replaced.
• Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace, and by which production rule, we have two options.
Left-Most Derivation - If the sentential form of an input is scanned and replaced from left to right, it is
called a left-most derivation. The sentential form derived by the left-most derivation is called the
left-sentential form.
Right-Most Derivation - If we scan and replace the input with production rules from right to left, it is
known as a right-most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
Parse Tree - A parse tree is a graphical depiction of a derivation. It is convenient for seeing how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree.
Let us see this by continuing the example above, taking the left-most derivation of id + id * id:
Step 1: E → E * E
Step 2: E → E + E * E
Step 3: E → id + E * E
Step 4: E → id + id * E
Step 5: E → id + id * id
In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives the original input string.
A parse tree depicts the associativity and precedence of operators. The deepest sub-tree is traversed first,
so the operator in that sub-tree gets precedence over the operator in the parent nodes.
Ambiguity - A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.
Example
E→E+E
E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees:
Semantic Analysis
Semantic analysis is the third phase of the compiler. It checks whether the parse tree constructed follows
the rules of the language: for example, that the assignment of values is between compatible data types
and that a string is not being added to an integer. The semantic analyzer also keeps track of identifiers,
their types and expressions, whether identifiers are declared before use, and so on. The semantic
analyzer produces an annotated syntax tree as its output.
It makes sure that the declarations and statements of the program are semantically correct. It is a collection of
procedures called by the parser as and when required by the grammar. Both the syntax tree of the
previous phase and the symbol table are used to check the consistency of the given code. Type checking is
an important part of semantic analysis, where the compiler makes sure that each operator has matching
operands.
Semantic Analyzer
It uses the syntax tree and symbol table to check whether the given program is semantically consistent with
the language definition. It gathers type information and stores it in either the syntax tree or the symbol table.
This type information is subsequently used by the compiler during intermediate-code generation.
Semantic Errors
Errors recognized by the semantic analyzer are as follows:
a. Type mismatch
b. Undeclared variables
c. Reserved identifier misuse
INTERMEDIATE CODE GENERATION
After semantic analysis, the compiler generates an intermediate code of the source code for the target
machine. It represents a program for some abstract machine, and lies in between the high-level language
and the machine language. This intermediate code should be generated in such a way that it makes it
easier to be translated into the target machine code.
a. High Level IR - A high-level intermediate code representation is very close to the source language
itself. It can be easily generated from the source code, and code modifications to enhance
performance can be applied easily. For target machine optimization, however, it is less preferred.
b. Low Level IR - This one is close to the target machine, which makes it suitable for register and
memory allocation, instruction set selection, etc. It is good for machine-dependent optimizations.
Intermediate code can be either language-specific (e.g., bytecode for Java) or language-independent.
a. Three-Address Code - The intermediate code generator receives input from its predecessor phase, the
semantic analyzer, in the form of an annotated syntax tree. That syntax tree can then be converted into
a linear representation, e.g., postfix notation. Intermediate code tends to be machine-independent
code. Therefore, the code generator assumes an unlimited number of memory locations (registers) to be
available when generating code.
For example: a = b + c * d;
The intermediate code generator will try to divide this expression into sub-expressions and then
generate the corresponding code.
r1 = c * d; r2 = b + r1; a = r2
r1 and r2 being registers used in the target program.
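A minimal sketch of how an intermediate code generator might linearize the syntax tree of this statement into three-address code; the nested-tuple tree format and the r-register naming are assumptions made for illustration.

# Expression tree for a = b + c * d, written as nested tuples:
# (operator, left-subtree, right-subtree); leaves are plain variable names.
tree = ("+", "b", ("*", "c", "d"))

temp_count = 0
code = []

def gen(node):
    """Emit three-address code for node; return the name holding its value."""
    global temp_count
    if isinstance(node, str):          # leaf: a variable name
        return node
    op, left, right = node
    l = gen(left)
    r = gen(right)
    temp_count += 1
    result = f"r{temp_count}"          # fresh register for the sub-expression
    code.append(f"{result} = {l} {op} {r}")
    return result

code.append(f"a = {gen(tree)}")
print("\n".join(code))
# r1 = c * d
# r2 = b + r1
# a = r2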
A three-address code has at most three address locations to calculate the expression. A three-address
code can be represented in two forms: quadruples and triples.
(i) Quadruples - Each instruction in the quadruples presentation is divided into four fields: operator,
arg1, arg2, and result. The above example is represented below in quadruples format:
Quadruples format
Op   arg1   arg2   result
*    c      d      r1
+    b      r1     r2
=    r2            a
(ii) Triples - Each instruction in the triples presentation has three fields: op, arg1, and arg2, as shown
below. The result of each sub-expression is denoted by the position of the expression that computes it.
Triples bear a similarity to DAGs and syntax trees; they are equivalent to a DAG while
representing expressions.
Triples format
Op   arg1   arg2
*    c      d
+    b      (0)
=    (1)
Triples face the problem of code immovability during optimization: the results are positional, and
changing the order or position of an expression may cause problems.
b. Indirect Triples - This representation is an enhancement over the triples representation. It uses pointers
instead of positions to store results. This enables the optimizers to freely re-position a sub-expression
to produce optimized code.
CODE OPTIMIZATION
The next phase performs code optimization on the intermediate code. Optimization is a program
transformation technique which tries to improve the code by making it consume fewer resources (i.e.,
CPU, memory) and deliver higher speed.
In optimization, high-level general programming constructs are replaced by very efficient low-level
programming codes. Optimization can be assumed as something that removes unnecessary code lines,
and arranges the sequence of statements in order to speed up the program execution without wasting
resources (CPU, memory).
A code optimizing process must follow the three rules given below:
a. The output code must not, in any way, change the meaning of the program.
b. Optimization should increase the speed of the program and, if possible, the program should
demand fewer resources.
c. Optimization should itself be fast and should not delay the overall compiling process.
Efforts towards an optimized code can be made at various levels of the compilation process.
a. At the beginning, users can change/rearrange the code or use better algorithms to write the
code.
b. After generating intermediate code, the compiler can modify the intermediate code by address
calculations and improving loops.
c. While producing the target machine code, the compiler can make use of memory hierarchy and
CPU registers.
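As a toy illustration of a machine-independent optimization, the following sketch performs constant folding over a list of quadruples; the tuple format mirrors the quadruples table above, and the specific instructions are invented for illustration.

# Each quadruple is (op, arg1, arg2, result), matching the table above.
quads = [
    ("*", "4", "2",  "r1"),    # both operands constant: can be folded
    ("+", "b", "r1", "r2"),
    ("=", "r2", None, "a"),
]

def fold_constants(quads):
    """Replace operations whose operands are all constants with their value."""
    known = {}                              # temp name -> folded constant value
    out = []
    for op, a1, a2, res in quads:
        a1 = known.get(a1, a1)              # substitute previously folded values
        a2 = known.get(a2, a2)
        if op in "+-*" and a1.isdigit() and a2 and a2.isdigit():
            known[res] = str(eval(a1 + op + a2))   # compute at "compile time"
        else:
            out.append((op, a1, a2, res))
    return out

print(fold_constants(quads))
# [('+', 'b', '8', 'r2'), ('=', 'r2', None, 'a')] -- one instruction removed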
CODE GENERATION
Code generation can be considered the final phase of compilation. In this phase, the code generator
takes the optimized representation of the intermediate code and maps it to the target machine
language. The code generator translates the intermediate code into a sequence of (generally)
relocatable machine code; this sequence of machine instructions performs the same task as the
intermediate code would. The code generated by the compiler is object code in some lower-level
programming language, for example, assembly language. We have seen that the source code written in
a higher-level language is transformed into a lower-level language that results in a lower-level object
code, which should have the following minimum properties:
a. It should carry the exact meaning of the source code.
b. It should be efficient in terms of CPU usage and memory management.
The code generator should take the following things into consideration to generate the code:
• Target language - The code generator has to be aware of the nature of the target language for
which the code is to be transformed. That language may facilitate some machine-specific
instructions to help the compiler generate the code in a more convenient way. The target
machine can have either CISC or RISC processor architecture.
• IR Type - Intermediate representation has various forms. It can be in Abstract Syntax Tree (AST)
structure, Reverse Polish Notation, or 3-address code.
• Selection of instruction - The code generator takes Intermediate Representation as input and
converts (maps) it into target machine’s instruction set. One representation can have many ways
(instructions) to convert it, so it becomes the responsibility of the code generator to choose the
appropriate instructions wisely.
• Register allocation - A program has a number of values to be maintained during the execution.
The target machine’s architecture may not allow all of the values to be kept in the CPU memory
or registers. Code generator decides what values to keep in the registers. Also, it decides the
registers to be used to keep these values.
• Ordering of instructions - At last, the code generator decides the order in which the instructions
will be executed, creating a schedule for their execution.
Descriptors - The code generator has to track both the registers (for availability) and addresses (location
of values) while generating the code. For both of them, the following two descriptors are used:
• Register descriptor - Register descriptor is used to inform the code generator about the
availability of registers. Register descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this descriptor is consulted for
register availability.
• Address descriptor - Values of the names (identifiers) used in the program might be stored at
different locations while in execution. Address descriptors are used to keep track of memory
locations where the values of identifiers are stored. These locations may include CPU registers,
heaps, stacks, memory or a combination of the mentioned locations.
The code generator keeps both descriptors updated in real time. For a load statement, LD R1, x, the code
generator:
• updates the register descriptor to show that R1 holds the value of x, and
• updates the address descriptor of x to show that one instance of x is now in R1.
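A minimal sketch of the two descriptors as plain Python dictionaries (the structure is an assumption for illustration), showing exactly the two updates described above for LD R1, x:

# Register descriptor: register -> set of names whose value it currently holds.
# Address descriptor: name -> set of locations holding the name's current value.
register_desc = {"R1": set(), "R2": set()}
address_desc = {"x": {"mem"}}          # x currently lives only in memory

def load(reg, name):
    """Model the effect of LD reg, name on both descriptors."""
    register_desc[reg] = {name}        # reg now holds the value of name
    address_desc[name].add(reg)        # one instance of name is now in reg

load("R1", "x")
print(register_desc)   # {'R1': {'x'}, 'R2': set()}
print(address_desc)    # {'x': {'mem', 'R1'}}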
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names, along
with their types, are stored here. The symbol table makes it easier for the compiler to quickly search for an
identifier's record and retrieve it. The symbol table is also used for scope management.
A symbol table may serve the following purposes depending upon the language in hand:
• to store the names of all entities in a structured form in one place;
• to verify if a variable has been declared;
• to implement type checking, by verifying that assignments and expressions in the source code are semantically correct;
• to determine the scope of a name (scope resolution).
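A minimal sketch of a scope-managed symbol table, kept as a chain of dictionaries; the interface is an assumption for illustration, not a prescribed design.

class SymbolTable:
    """Nested scopes kept as a stack of dicts: define, lookup, enter/exit scope."""
    def __init__(self):
        self.scopes = [{}]                      # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def define(self, name, type_):
        self.scopes[-1][name] = type_

    def lookup(self, name):
        for scope in reversed(self.scopes):     # innermost scope first
            if name in scope:
                return scope[name]
        return None                             # undeclared identifier

table = SymbolTable()
table.define("x", "int")
table.enter_scope()
table.define("x", "float")                      # shadows the outer x
print(table.lookup("x"))                        # float
table.exit_scope()
print(table.lookup("x"))                        # int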
Error handler: If the source program is not written as per the syntax of the language, syntax errors
are detected and reported by the error handler associated with the compiler. Each phase of the compiler can
encounter errors. A compiler that stops when it finds the first error is not as helpful as it could be. The
syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the
compiler. The lexical phase can detect errors where the characters remaining in the input do not form
any token of the language. Errors when the token stream violates the syntax of the language are
determined by the syntax analysis phase. During semantic analysis the compiler tries to detect
constructs that have the right syntactic structure but no meaning to the operation involved.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong
to the language in hand. It searches for the patterns defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings
of symbols. The grammar defined by regular expressions is known as regular grammar. The language
defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of
strings, so regular expressions serve as names for a set of strings. Programming language tokens can be
described by regular languages. The specification of regular expressions is an example of a recursive
definition. Regular languages are easy to understand and have efficient implementation.
The first step of compilation, called lexical analysis, converts the input from a simple sequence of
characters into a list of tokens of different kinds, such as numerical and string constants, variable
identifiers, and programming language keywords. The purpose of lex is to generate such lexical analyzers.
Regular expressions are often used to describe the tokens of a language. They specify exactly what
values are legal for the tokens to assume. Some tokens are simply keywords, like if, else, and for.
Others, like identifiers, can be any sequence of letters and digits provided that they do not match a
keyword and do not start with a digit. Typically, an identifier is a variable name such as current, flag2, or
windowStatus. In general, an identifier is a letter followed by any combination of digits and letters.
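That identifier rule corresponds to the regular expression [A-Za-z][A-Za-z0-9]*. The sketch below (with an invented, illustrative keyword set) classifies a lexeme as a keyword, an identifier, or neither:

import re

KEYWORDS = {"if", "else", "for"}                 # illustrative subset only
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*$")     # a letter, then letters/digits

def classify(lexeme):
    if lexeme in KEYWORDS:
        return "keyword"
    if IDENT.match(lexeme):
        return "identifier"
    return "invalid"

for lex in ["current", "flag2", "windowStatus", "if", "2fast"]:
    print(lex, "->", classify(lex))
# current, flag2 and windowStatus are identifiers; if is a keyword;
# 2fast is invalid because it starts with a digit.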
If an identifier is found in the program, then the action corresponding to identifiers is taken; perhaps
some information is added to the symbol table. If a keyword such as 'if' is
recognized, a different action is taken.
In the literary sense of the term, grammars denote syntactical rules for conversation in natural
languages. Linguists have attempted to define grammars since the inception of natural languages like
English, Sanskrit, and Mandarin.
The theory of formal languages finds its applicability extensively in the fields of Computer Science.
Noam Chomsky gave a mathematical model of grammar in 1956 which is effective for describing computer
languages.
Grammar
A grammar G can be formally written as a 4-tuple (N, T, S, P) where −
• N or VN is a set of variables or non-terminal symbols.
• T or ∑ is a set of terminal symbols.
• S is a special variable called the start symbol, S ∈ N.
• P is a set of production rules for terminals and non-terminals. A production rule has the form
α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
Example
Grammar G1
({S, A, B}, {a, b}, S, {S → AB, A → a, B → b})
Here,
• S, A, and B are Non-terminal symbols;
• a and b are Terminal symbols
• S is the Start symbol, S ∈ N
• Productions, P : S → AB, A → a, B → b
Example
Grammar G2 −
({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε})
Here,
• S and A are Non-terminal symbols.
• a and b are Terminal symbols.
• ε is an empty string.
• S is the Start symbol, S ∈ N
• Production P : S → aAb, aA → aaAb, A → ε
Example
Language Generated by a Grammar
The set of all strings that can be derived from a grammar is said to be the language generated from that
grammar.
Example
If there is a grammar
G: N = {S, A, B}, T = {a, b}, P = {S → AB, A → a, B → b}
Here S produces AB, and we can replace A by a, and B by b. Here, the only accepted string is ab, i.e.,
L(G) = {ab}
Example
Suppose we have the following grammar − G: N = {S, A, B}, T = {a, b}, P = {S → AB, A → aA|a, B → bB|b}
The language generated by this grammar –
L(G) = {ab, a^2b, ab^2, a^2b^2, …} = {a^m b^n | m ≥ 1 and n ≥ 1}
Example
Problem − Suppose L(G) = {a^m b^n | m ≥ 0 and n > 0}. We have to find the grammar G which
produces L(G).
Solution
Since L(G) = {a^m b^n | m ≥ 0 and n > 0}, the set of strings accepted can be rewritten as
L(G) = {b, ab, bb, aab, abb, …}
Here, the start symbol has to produce at least one 'b', preceded by any number of 'a' (including none).
To accept the string set {b, ab, bb, aab, abb, …}, we have taken the productions
S → aS, S → B, B → b and B → bB
S → B → b (Accepted)
S → B → bB → bb (Accepted)
S → aS → aB → ab (Accepted)
S → aS → aaS → aaB → aab(Accepted)
S → aS → aB → abB → abb (Accepted)
Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.
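These acceptance checks can be mechanized. Below is a minimal sketch (the dict encoding of the productions is an assumption for illustration) that derives, breadth-first, all sentences of the grammar above up to a length bound:

from collections import deque

# Productions S -> aS | B and B -> b | bB, encoded as
# non-terminal -> list of right-hand sides.
P = {"S": ["aS", "B"], "B": ["b", "bB"]}

def language(start="S", max_len=4):
    """Breadth-first search over sentential forms; collect terminal strings."""
    strings, queue = set(), deque([start])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue                     # prune forms that grew too long
        nts = [c for c in form if c in P]
        if not nts:                      # no non-terminals left: a sentence
            strings.add(form)
            continue
        i = form.index(nts[0])           # expand the leftmost non-terminal
        for rhs in P[nts[0]]:
            queue.append(form[:i] + rhs + form[i + 1:])
    return sorted(strings, key=lambda s: (len(s), s))

print(language())
# ['b', 'ab', 'bb', 'aab', 'abb', 'bbb', ...] -- exactly a^m b^n, m >= 0, n > 0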
Example
Problem − Suppose L(G) = {a^m b^n | m > 0 and n ≥ 0}. We have to find the grammar G which
produces L(G).
Solution −
Since L(G) = {a^m b^n | m > 0 and n ≥ 0}, the set of strings accepted can be rewritten as
L(G) = {a, aa, ab, aaa, aab, abb, …}
Here, the start symbol has to produce at least one 'a', followed by any number of 'b' (including none).
To accept the string set {a, aa, ab, aaa, aab, abb, …}, we have taken the productions
S → aA, A → aA, A → B, B → bB, B → λ
S → aA → aB → aλ → a (Accepted)
S → aA → aaA → aaB → aaλ → aa (Accepted)
S → aA → aB → abB → abλ → ab (Accepted)
S → aA → aaA → aaaA → aaaB → aaaλ → aaa (Accepted)
S → aA → aaA → aaB → aabB → aabλ → aab (Accepted)
S → aA → aB → abB → abbB → abbλ → abb (Accepted)
Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.
Hence the grammar −
G: ({S, A, B}, {a, b}, S, {S → aA, A → aA | B, B → λ | bB })
TYPES OF GRAMMAR
Noam Chomsky classified grammars into four types. The table below summarizes each of Chomsky's four
types of grammars, the class of language it generates, the type of automaton that recognizes it, and the
form its rules must have.

Grammar    Language class            Recognizing automaton       Production form
Type-0     Recursively enumerable    Turing machine              α → β (unrestricted)
Type-1     Context-sensitive         Linear-bounded automaton    αAβ → αγβ
Type-2     Context-free              Pushdown automaton          A → γ
Type-3     Regular                   Finite automaton            A → a or A → aB
Type - 3 Grammar
Type-3 grammars generate the regular languages. Such a grammar restricts its rules to a single
non-terminal on the left-hand side and a right-hand side consisting of a single terminal, possibly followed
by a single non-terminal (right regular). Alternatively, the right-hand side of a rule can consist of
a single terminal, possibly preceded by a single non-terminal (left regular). These generate the same
languages.
The rule S → ε is allowed if S does not appear on the right side of any rule.
Example
X→ε
X → a | aY
Y→b
Type - 2 Grammar
Type-2 grammars generate the context-free languages. The productions must be in the form A → γ,
where A ∈ N (a single non-terminal) and γ ∈ (T ∪ N)* (a string of terminals and non-terminals).
Type - 1 Grammar
Type-1 grammars generate the context-sensitive languages. The productions must be in the form
αAβ → αγβ, where A ∈ N (a non-terminal), α, β, γ ∈ (T ∪ N)* (strings of terminals and
non-terminals), and γ is non-empty.
Example
AB → AbBc
A → bcA
B→b
Type - 0 Grammar
Type-0 grammars generate the recursively enumerable languages. The productions have no restrictions;
they form any phrase structure grammar, including all formal grammars.
They generate the languages that are recognized by a Turing machine.
The productions can be in the form of α → β where α is a string of terminals and nonterminals with at
least one non-terminal and α cannot be null. β is a string of terminals and non-terminals.
Example
S → ACaB
Bc → acB
CB → DB
aD → Db
PARSING
A parser is a compiler or interpreter component that breaks data into smaller elements for easy
translation into another language. A parser takes input in the form of a sequence of tokens, interactive
commands, or program instructions and breaks them up into parts that can be used by other
components in programming.
A parser usually checks all data provided to ensure it is sufficient to build a data structure in the form of
a parse tree or an abstract syntax tree.
In order for the code written in human-readable form to be understood by a machine, it must be
converted into machine language. This task is usually performed by a translator (interpreter or
compiler). The parser is commonly used as a component of the translator that organizes linear text in a
structure that can be easily manipulated (parse tree). To do so, it follows a set of defined rules called
“grammar”.
The overall process of parsing involves three stages:
1. Lexical Analysis: A lexical analyzer is used to create tokens from the stream of input characters.
2. Syntactic Analysis: Checks whether the tokens form an allowable expression according to the grammar.
3. Semantic Parsing: The final parsing stage in which the meaning and implications of the validated
expression are determined and necessary actions are taken.
Types of Parsing
1. Top-Down Parsing
This involves searching a parse tree to find the left-most derivation of an input stream by using a
top-down expansion. Parsing begins with the start symbol, which is transformed into the input symbols
until all symbols are derived and a parse tree for the input string is constructed. Examples include LL
parsers and recursive-descent parsers. Top-down parsing is also called predictive parsing or recursive
parsing.
Top-down parsers are classified into 2 types: the recursive descent parser and the non-recursive descent parser.
(i) Recursive descent parser:
It is also known as the brute force parser or the backtracking parser. It basically generates the
parse tree by using brute force and backtracking, as in the sketch below.
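A minimal sketch of such a backtracking recursive-descent parser, for the toy grammar S → aSb | ab; the grammar and the position-returning interface are assumptions for illustration.

def parse_S(s, i=0):
    """Try each production of S -> aSb | ab at position i; backtrack on failure.
    Returns the position just past a successful match, or None."""
    # Production 1: S -> a S b
    if s[i:i+1] == "a":
        j = parse_S(s, i + 1)
        if j is not None and s[j:j+1] == "b":
            return j + 1
    # Backtrack and try production 2: S -> a b
    if s[i:i+2] == "ab":
        return i + 2
    return None

def accepts(s):
    return parse_S(s) == len(s)

for w in ["ab", "aabb", "aaabbb", "aab"]:
    print(w, accepts(w))   # True, True, True, False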
(ii) Non-recursive descent parser:
It is also known as the LL(1) or predictive parser; instead of backtracking, it uses a parsing table to
choose the production to apply.
2. Bottom-Up Parsing
This involves rewriting the input back to the start symbol. It acts in reverse, tracing out the rightmost
derivation of a string until the parse tree is constructed up to the start symbol. This type of parsing is
also known as shift-reduce parsing. Bottom-up parsers are classified into 2 types: the LR parser and the
operator precedence parser.
(i) LR parser:
The LR parser is a bottom-up parser which generates the parse tree for the given string by using
unambiguous grammar. It follows the reverse of the rightmost derivation.
Advantages
1. LR parsers can recognize virtually all programming language constructs for which context-free
grammars can be written.
2. It is an efficient, non-backtracking, shift-reduce parsing method.
Disadvantages
1. It is too much work to construct an LR parser by hand for a programming language grammar; a
specialized tool called an LR parser generator is needed.
(ii) Operator precedence parser:
It generates the parse tree from the given grammar and string, on the condition that two consecutive
non-terminals and ε never appear on the right-hand side of any production.
Advantages
1. Easy to implement
2. Once an operator precedence relation is made between all pairs of terminals of the grammar,
the grammar can be ignored; it is not referred to anymore during implementation.
Disadvantages
1. It is hard to handle tokens like the minus sign (-), which has two different precedences (unary and binary).
2. Only a small class of grammar can be parsed using operator precedence parser.
ERROR RECOVERY
A parser should be able to detect and report any error in the program. It is expected that when an error
is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The parser
is mostly expected to check for errors, but errors may be encountered at various stages of the
compilation process. A program may have the following kinds of errors at the various stages:
• Lexical : name of some identifier typed incorrectly
• Syntactical : missing semicolon or unbalanced parenthesis
• Semantical : incompatible value assignment
• Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in a parser to deal with
errors in the code.
a. Panic Mode - When a parser encounters an error anywhere in the statement, it ignores the rest of
the statement by not processing input from the erroneous token up to a delimiter, such as a semicolon.
This is the easiest way of error recovery and it also prevents the parser from developing infinite loops
(a minimal sketch follows this list).
b. Statement Mode - When a parser encounters an error, it tries to take corrective measures so that
the rest of inputs of statement allow the parser to parse ahead. For example, inserting a missing
semicolon, replacing comma with a semicolon etc. Parser designers have to be careful here because
one wrong correction may lead to an infinite loop.
c. Error Productions - Some common errors that may occur in the code are known to the compiler
designers. The designers can create an augmented grammar to be used, with productions that
generate the erroneous constructs, so these errors are recognized when encountered.
d. Global Correction - The parser considers the program in hand as a whole and tries to figure out
what the program is intended to do and tries to find out a closest match for it, which is error-free.
When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-free
statement Y. This may allow the parser to make minimal changes in the source code, but due to the
complexity (time and space) of this strategy, it has not been implemented in practice yet.
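A minimal sketch of panic-mode recovery; the toy statement form ID = ID ; and the token list are assumptions for illustration. On an error, the parser discards tokens up to the next semicolon, then resumes with the following statement.

def parse_statement(tokens, i):
    """Recognize the toy statement form ID '=' ID ';'; return the next position."""
    stmt = tokens[i:i + 4]
    if (len(stmt) == 4 and stmt[0].isidentifier() and stmt[1] == "="
            and stmt[2].isidentifier() and stmt[3] == ";"):
        return i + 4
    raise SyntaxError(f"bad statement near token {i}: {' '.join(stmt)}")

def parse_program(tokens):
    i, errors = 0, []
    while i < len(tokens):
        try:
            i = parse_statement(tokens, i)
        except SyntaxError as e:
            errors.append(str(e))
            while i < len(tokens) and tokens[i] != ";":
                i += 1                   # panic: discard the erroneous input
            i += 1                       # skip the delimiter and resume parsing
    return errors

print(parse_program(["x", "=", "y", ";", "z", "+", ";", "a", "=", "b", ";"]))
# reports one error for 'z + ;' yet still parses 'a = b ;' afterwards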
SCANNING
Scanning is the process of identifying tokens in the raw text source code of a program. At first glance,
scanning might seem trivial; after all, identifying words in a natural language is as simple as looking for
spaces between letters. However, identifying tokens in source code requires the language designer to
clarify many fine details, so that it is clear what is permitted and what is not. Most languages will have
tokens in these categories:
a. Keywords are words in the language structure itself, like while or class or true. Keywords must
be chosen carefully to reflect the natural structure of the language, without interfering with the
likely names of variables and other identifiers.
b. Identifiers are the names of variables, functions, classes, and other code elements chosen by
the programmer. Typically, identifiers are arbitrary sequences of letters and possibly numbers.
Some languages require identifiers to be marked with a sentinel (like the dollar sign in Perl) to
clearly distinguish identifiers from keywords.
c. Numbers could be formatted as integers, or floating point values, or fractions, or in alternate
bases such as binary, octal or hexadecimal. Each format should be clearly distinguished, so that
the programmer does not confuse one with the other.
d. Strings are literal character sequences that must be clearly distinguished from keywords or
identifiers. Strings are typically quoted with single or double quotes, but also must have some
facility for containing quotations, newlines, and unprintable characters.
e. Comments and white space are used to format a program to make it visually clear, and in some
cases (like Python) are significant to the structure of a program. When designing a new
language, or designing a compiler for an existing language, the first job is to state precisely what
characters are permitted in each type of token.
RECOGNIZER
A recognizer is a parser that does not perform syntax-directed translation; it only tells you whether the
input string is in the language described by the grammar. A recognizer for a language is a program
that decides whether an input belongs to the language defined by the grammar. Decide here means it
answers either yes or no: it takes a string x as input and answers "yes" if x is a sentence of the
language and "no" otherwise. It can also detect whether an input exhibits an ambiguity in the grammar
and report it to the user.
One can compile any regular expression into a recognizer by constructing a generalized transition
diagram called a finite automaton.
Finite automata come in two types - Deterministic Finite Automata (DFA) and Nondeterministic
Finite Automata (NFA).
In a DFA, for a particular input character, the machine goes to one state only. A transition function is
defined on every state for every input symbol. Also, in a DFA, null (or ε) moves are not allowed, i.e., a DFA
cannot change state without consuming an input character.
For example, a DFA with Σ = {0, 1} can accept all strings ending with 0.
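Such a DFA can be simulated directly from its transition table. A minimal sketch (the state names are illustrative):

# DFA accepting all binary strings that end with 0.
# q1 means "the last symbol read was 0"; q0 is the start state.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}
START, ACCEPT = "q0", {"q1"}

def dfa_accepts(w):
    state = START
    for ch in w:
        state = DELTA[(state, ch)]   # exactly one next state: deterministic
    return state in ACCEPT

for w in ["0", "10", "1101", "100"]:
    print(w, dfa_accepts(w))   # True, True, False, True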
NFA
One important thing to note is that in an NFA, if any path for an input string leads to a final state, then the
input string is accepted. For example, in an NFA for the same language, there are multiple paths for the
input string “00”; since one of those paths leads to a final state, “00” is accepted by the NFA.
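A minimal sketch of an NFA for the same language, simulated by tracking the set of all states reachable so far; the transition table is an assumption for illustration.

# NFA accepting binary strings that end with 0. From q0, reading 0 can
# either stay in q0 or move to the final state q1, so "00" has multiple paths.
DELTA = {
    ("q0", "0"): {"q0", "q1"},   # the nondeterministic choice
    ("q0", "1"): {"q0"},
    # q1 has no outgoing transitions
}
START, ACCEPT = "q0", {"q1"}

def nfa_accepts(w):
    states = {START}                     # all states reachable on the input so far
    for ch in w:
        states = set().union(*(DELTA.get((s, ch), set()) for s in states))
    return bool(states & ACCEPT)         # accept if ANY path reaches a final state

print(nfa_accepts("00"))   # True: the path q0 -> q0 -> q1 ends in a final state
print(nfa_accepts("01"))   # False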
RUN-TIME ENVIRONMENT
When the target program executes, it runs in its own logical address space, in which each
program value has a location. The logical address space is shared among the compiler, operating system
and target machine for its management and organization. The operating system maps the logical
addresses into physical addresses, which are usually spread throughout memory.
Storage Organization
The target program runs in its own logical address space. The size of the generated code is usually fixed
at compile time, unless code is loaded or produced dynamically, so the compiler can place the executable
at fixed addresses.
Memory locations for code and static data are determined at compile time, while data objects on the
stack and heap are dynamically allocated at runtime. Runtime storage comes in blocks, where a byte is
the smallest unit of addressable memory.
Both stack storage allocation and heap storage allocation are forms of dynamic storage allocation.