CSC411 Compiler Construction - MO Onyesolu and OU Ekwealor - First Semester 2020/2021 Session
Introduction
Compiler construction is an area of computer science that deals with the theory and practice of
developing programming languages and their associated compilers. The theoretical portion is primarily
concerned with the syntax, grammar and semantics of programming languages. Computers are a balanced
blend of software and hardware. Hardware is just a piece of machinery whose functions are
controlled by compatible software. Hardware understands instructions in the form of electronic charge,
which is the counterpart of binary language. Binary language has only two alphabets, 0 and 1. To
instruct the hardware, codes must be written in binary format, which is simply a series of 1s and 0s. It
would be a difficult and cumbersome task for computer programmers to write such codes directly, which is why
we have compilers to generate them. Programs are written in a high-level language, which is easier for
humans to understand and remember. These programs are then fed into a series of tools and operating
system (OS) components to get the desired code that can be used by the machine. This is known as the
Language Processing System.
When it executes (runs), the compiler first parses (analyzes) all of the language statements
syntactically, one after the other, and then, in one or more successive stages or passes, builds the output
code, making sure that statements that refer to other statements are referred to correctly in the final
code. The output of the compilation is called object code or sometimes an object module. The object
code is machine code that the processor can execute one instruction at a time.
A compiler works with what are sometimes called 3GL (third generation language) and higher-level
languages. It serves as an interface between human-understandable language and machine-understandable
language by transforming the former into the latter.
Error messages
In a compiler, the source code is translated to object code successfully only if it is free of errors. When
there are errors in the source code, the compiler reports them, with line numbers, at the end of
compilation. The errors must be removed before the compiler can successfully recompile the source
code.
A program that translates from a low-level language to a higher-level one is a decompiler. A program that
translates between high-level languages is usually called a language translator, source-to-source
translator, or language converter. A language rewriter is usually a program that translates the form of
expressions without a change of language.
A compiler is likely to perform many or all of the following operations, namely, lexical analysis,
preprocessing, parsing, semantic analysis (syntax-directed translation), code generation, and code
optimization. Program faults caused by incorrect compiler behavior can be very difficult to track down
and work around; therefore, compiler implementers invest significant effort to ensure the correctness
of their software. The term compiler-compiler is sometimes used to refer to a parser generator, a tool
often used to help create the lexer and parser.
Most compilers are not a single tool; a compiler is a combination of five different but necessary
tools, namely:
(a) Editor (b) Debugger (c) Compiler (d) Linker (e) Loader
Types of Compiler
There are many different types of compilers which produce output in different useful forms. They
include:
a. Native Code Compiler – This compiler is used to compile source code for the same type of platform
only. The output generated by this type of compiler can only be run on the same type of computer
system and operating system that the compiler itself runs on.
b. Cross Compiler – This type of compiler is used to compile source code for a different kind of
platform. It is used in making software for embedded systems that can be used on multiple platforms. A
compiler that runs on platform (A) and is capable of generating executable code for platform (B) is
called a cross-compiler.
c. Source to Source Compiler – This type of compiler takes high-level language code as input and
outputs source code of another high-level language only. Unlike other compilers, which convert a
high-level language into low-level machine language, it can take code written in, say, Pascal and
transform it into C: the conversion of one high-level language into another high-level language at the
same level of abstraction. Thus, it is also known as a transpiler.
d. One Pass Compiler – This is a type of compiler that completes the whole compilation process in only one pass.
e. Threaded Code Compiler – This is a type of compiler that simply replaces a string by an appropriate
binary code.
f. Incremental Compiler – This compiler compiles only the changed lines of the source code and
updates the object code.
g. Source Compiler – This compiler converts the source code (high-level language code) into assembly
language only.
h. Just-in-time (JIT) Compiler – This is a type of compiler that defers compilation until runtime. It is
used for languages such as Python and JavaScript, and generally runs inside an interpreter.
Compilers are not the only language processor used to transform source programs. Others are
assemblers and interpreters.
Assembler
An assembler is a program that converts an assembly language program into the machine-level
instructions that can be executed by a computer. An assembler enables software and application
developers to access, operate and manage a computer's hardware architecture and components. It is
sometimes referred to as the compiler of assembly language. Some assemblers can also provide the
services of an interpreter.
The output of an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.
An assembler works by assembling and converting the source code of assembly language into object
code, or an object file, that constitutes a stream of zeros and ones of machine code, which is directly
executable by the processor.
Assemblers are classified, based on the number of times they read the source code before
translating it, into single-pass and multi-pass assemblers. A single-pass assembler scans the
program source only once and creates the equivalent binary program, substituting all of the symbolic
instructions with machine code in one pass. A multi-pass assembler is an assembler which uses more than
one pass in the assembly process; it goes through the assembly language source several
times and then generates the object code. The last pass is called the synthesis pass, and the assembler
keeps the program in an intermediate form between passes. A multi-pass assembler is comparatively
slower than a single-pass assembler, as some actions are performed more than once, i.e., duplicated.
Some high-end assemblers provide enhanced functionality by enabling the use of control
statements and data abstraction services, and by providing support for object-oriented programming
structures.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language. The
difference lies in the way they read the source code. A compiler reads the whole source code at
once, creates tokens, checks semantics, generates intermediate code, translates the whole program and
may involve many passes. In contrast, an interpreter reads a statement from the input, converts it to an
intermediate code, executes it, then takes the next statement in sequence. If an error occurs, an
interpreter stops execution and reports it, whereas a compiler reads the whole program even if it
encounters several errors.
3
CSC411 Compiler Construction – MO Onyesolu and OU Ekwealor - First Semester 2020/2021 Session
An interpreter is a computer program that is used to directly execute program instructions written in
one of the many high-level programming languages. The interpreter either transforms the high-level program
into an intermediate language that it then executes, or parses the high-level source code and
performs the commands directly, line by line or statement by statement. An
interpreter directly executes instructions written in a programming or scripting language without
requiring them to have been previously compiled into a machine language program.
Humans write programs in high-level languages, which constitute source code. Computers, on the
other hand, can only understand programs written in binary languages, so either an interpreter or a
compiler is required.
Programming languages are implemented in two ways: interpretation or compilation. As the name
suggests, an interpreter transforms or interprets high-level programming code into code that can be
understood by the machine (machine code), or into an intermediate language that can be easily
executed as well.
The interpreter reads each statement of code and then converts or executes it directly. In contrast, an
assembler or a compiler converts a high-level source code into native (compiled) code that can be
executed directly by the operating system (e.g. by creating a .exe program).
Both compilers and interpreters have their advantages and disadvantages and are not mutually
exclusive; they can be used in conjunction, as most integrated development
environments employ both compilation and interpretation for some high-level languages.
In most cases, a compiler is preferable since its output runs much faster compared to a line-by-line
interpretation. Rather than scanning the whole program and translating it into machine code like a
compiler does, the interpreter translates code one statement at a time.
While an interpreter reduces the time needed to analyze source code, especially for a particularly
large program, overall execution time under an interpreter is comparatively slower than for compiled
code. On the other hand, since interpretation happens per line or statement, execution can be stopped
in the middle to allow for either code modification or debugging.
Compilers must generate intermediate object code that requires additional memory to be linked, in
contrast to interpreters, which tend to use memory more efficiently.
Since an interpreter reads and then executes code in a single process, it is very useful for scripting and
other small programs. As such, it is commonly installed on Web servers, which run a lot of executable
scripts. It is also used during the development stage of a program to test small chunks of code one by
one rather than having to compile the whole program every time.
Every source statement is executed line by line during execution, which is particularly appreciated
for debugging, since errors are recognized immediately. Interpreters are also used for educational
purposes, since they can be used to show students how to program one script at a time.
Programming languages that use interpreters include Python, Ruby, and JavaScript, while programming
languages that use compilers include Java, C++, and C.
An interpreter generally uses one of the following strategies for program execution:
a. parse the source code and perform its behavior directly;
b. translate the source code into some efficient intermediate representation and immediately execute it;
c. explicitly execute stored precompiled code, made by a compiler that is part of the interpreter system.
COMPILATION
Compilers enabled the development of programs that are machine-independent. Before the
development of FORTRAN (FORmula TRANslator), the first higher-level language, in the 1950s,
machine-dependent assembly language was widely used. While assembly language produces more
reusable and relocatable programs than machine code on the same architecture, it has to be modified
or rewritten if the program is to be executed on a different computer hardware architecture. With the
advance of high-level programming languages that followed FORTRAN, such as COBOL, C, and BASIC,
programmers could write machine-independent source programs. A compiler translates the high-level
source programs into target programs in machine languages for the specific hardware. Once the target
program is generated, the user can execute the program.
STRUCTURE OF A COMPILER
Compilers bridge source programs in high-level languages with the underlying hardware. Building a compiler
requires (1) determining the correctness of the syntax of programs, (2) generating correct and efficient
object code, (3) run-time organization, and (4) formatting output according to assembler and/or linker
conventions.
A compiler consists of three main parts: (a) the frontend, (b) the middle-end, and (c) the backend.
The front end checks whether the program is correctly written in terms of the programming language
syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a
useful way. Type checking is also performed by collecting type information. The frontend then
generates an intermediate representation or IR of the source code for processing by the middle-end.
The middle end is where optimization takes place. Typical transformations for optimization are removal
of useless or unreachable code, discovery and propagation of constant values, relocation of
computation to a less frequently executed place (e.g., out of a loop), or specialization of computation
based on the context. The middle-end generates another IR for the following backend. Most
optimization effort is concentrated in this part of the compiler.
The back end is responsible for translating the IR from the middle-end into assembly code. The target
instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the
program variables where possible. The backend utilizes the hardware by figuring out how to keep
parallel execution units busy, filling delay slots, and so on. Although many of the problems underlying
optimization are NP-hard, heuristic techniques for solving them are well developed.
Compilation is broadly divided into two phases, based on the way compilers work: (a) the analysis
phase and (b) the synthesis phase.
In the analysis phase, an intermediate representation is created from the given source code. This
contains: ➢ Lexical Analyzer, ➢ Syntax Analyzer, and ➢ Semantic Analyzer.
In the synthesis phase, the equivalent target program is created from this intermediate representation.
This contains: ➢ Intermediate Code Generator, ➢ Code Optimizer, and ➢ Code Generator.
[Figure: The front-end performs analysis, translating source code into an intermediate code representation; the back-end performs synthesis, translating the intermediate representation into machine code.]
PHASES OF COMPILER
The compilation process is a sequence of various phases. Each phase takes input from its previous stage,
has its own representation of source program, and feeds its output to the next phase of the compiler.
Let us understand the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of
characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the
form of tokens as:
<token-name, attribute-value>
Lexical analysis is the first phase of a compiler. It takes modified source code from language
preprocessors, written in the form of sentences, and breaks it
into a series of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely with
the syntax analyzer: it reads character streams from the source code, checks for legal tokens, and
passes the data to the syntax analyzer on demand.
Tokens
A lexeme is a sequence of (alphanumeric) characters matched as a token. There are some predefined
rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by
means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of
regular expressions.
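To make the <token-name, attribute-value> pairing concrete, here is a minimal tokenizer sketch in Python; the token names and patterns are invented for illustration and are not part of any particular language definition.

import re

# Illustrative token patterns, tried in order.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SKIP",   r"[ \t]+"),   # whitespace is discarded, not turned into a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))   # <token-name, attribute-value>
    return tokens

print(tokenize("position = initial + rate * 60"))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('STAR', '*'), ('NUMBER', '60')]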
[Figure: Phases of a compiler — Source Code → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Machine-Independent Code Optimiser → Code Generator → Machine-Dependent Code Optimiser → Target Code, with the Symbol Table and Error Handler shared by all phases.]
Syntax Analysis
Syntax Analysis, or Parsing, is the second phase, after lexical analysis. It takes the tokens produced by
lexical analysis as input and generates a data structure called a parse tree or syntax tree. The parse tree
is constructed by using the pre-defined grammar of the language and the input string. If the given input
string can be produced with the help of the syntax tree (in the derivation process), the input string is
found to be in the correct syntax; if not, an error is reported by the syntax analyzer. In this phase, token
arrangements are checked against the source code grammar, i.e., the syntax analyzer checks whether
the given input is in the correct syntax of the language in which the input has been written.
A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in the
code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks: parsing the code while looking for errors, and generating a
parse tree as the output of the phase. Parsers are expected to parse the whole code even if some errors
exist in the program, and they use error-recovery strategies to do so.
Derivation
A derivation is basically a sequence of production rule applications used to obtain the input string. During
parsing, we take two decisions for some sentential form of the input:
• Deciding the non-terminal which is to be replaced.
• Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace, and by which production rule, we have two options.
Left-Most Derivation - If the sentential form of an input is scanned and replaced from left to right, it is
called a left-most derivation. The sentential form derived by the left-most derivation is called the
left-sentential form.
Right-Most Derivation - If we scan and replace the input with production rules from right to left, it is
known as a right-most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.
Example
Production rules:
E→E+E
E→E*E
E → id
Input string: id + id * id
Parse Tree - A parse tree is a graphical depiction of a derivation. It is convenient for seeing how strings are
derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree.
Let us see this by continuing the example above, taking the left-most derivation of id + id * id:
Step 1: E → E * E
Step 2: E → E + E * E
Step 3: E → id + E * E
Step 4: E → id + id * E
Step 5: E → id + id * id
In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are non-terminals.
• In-order traversal gives the original input string.
A parse tree depicts the associativity and precedence of operators. The deepest sub-tree is traversed first,
so the operator in that sub-tree gets precedence over the operator in the parent nodes.
Ambiguity - A grammar G is said to be ambiguous if it has more than one parse tree (left or right
derivation) for at least one string.
Example
E→E+E
E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees:
Semantic Analysis
Semantic analysis is the third phase of the compiler. It checks whether the parse tree constructed follows
the rules of the language: for example, that the assignment of values is between compatible data types
and that a string is not being added to an integer. The semantic analyzer also keeps track of identifiers,
their types and expressions, whether identifiers are declared before use, and so on. The semantic
analyzer produces an annotated syntax tree as its output.
It makes sure that the declarations and statements of the program are semantically correct. It is a collection of
procedures called by the parser as and when required by the grammar. Both the syntax tree of the
previous phase and the symbol table are used to check the consistency of the given code. Type checking is
an important part of semantic analysis, where the compiler makes sure that each operator has matching
operands.
Semantic Analyzer
It uses the syntax tree and symbol table to check whether the given program is semantically consistent with
the language definition. It gathers type information and stores it in either the syntax tree or the symbol table.
This type information is subsequently used by the compiler during intermediate-code generation.
Semantic Errors
Errors recognized by the semantic analyzer are as follows:
a. Type mismatch
b. Undeclared variables
c. Reserved identifier misuse
INTERMEDIATE CODE GENERATION
After semantic analysis, the compiler generates an intermediate code of the source code for the target
machine. It represents a program for some abstract machine, and lies in between the high-level language
and the machine language. This intermediate code should be generated in such a way that it makes it
easier to be translated into the target machine code.
a. High Level IR - A high-level intermediate code representation is very close to the source language
itself. It can be easily generated from the source code, and code modifications to enhance
performance can be applied easily. For target machine optimization, however, it is less preferred.
b. Low Level IR - This one is close to the target machine, which makes it suitable for register and
memory allocation, instruction set selection, etc. It is good for machine-dependent optimizations.
Intermediate code can be either language-specific (e.g., bytecode for Java) or language-independent.
a. Three-Address Code - The intermediate code generator receives input from its predecessor phase, the
semantic analyzer, in the form of an annotated syntax tree. That syntax tree can then be converted into
a linear representation, e.g., postfix notation. Intermediate code tends to be machine-independent
code. Therefore, the code generator assumes an unlimited number of memory locations (registers) to be
available when generating code.
For example: a = b + c * d;
The intermediate code generator will try to divide this expression into sub-expressions and then
generate the corresponding code.
r1 = c * d; r2 = b + r1; a = r2
r1 and r2 being registers used in the target program.
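A minimal sketch of how an intermediate code generator might linearize the syntax tree of this statement into three-address code; the nested-tuple tree format and the r-register naming are assumptions made for illustration.

# Expression tree for a = b + c * d, written as nested tuples:
# (operator, left-subtree, right-subtree); leaves are plain variable names.
tree = ("+", "b", ("*", "c", "d"))

temp_count = 0
code = []

def gen(node):
    """Emit three-address code for node; return the name holding its value."""
    global temp_count
    if isinstance(node, str):          # leaf: a variable name
        return node
    op, left, right = node
    l = gen(left)
    r = gen(right)
    temp_count += 1
    result = f"r{temp_count}"          # fresh register for the sub-expression
    code.append(f"{result} = {l} {op} {r}")
    return result

code.append(f"a = {gen(tree)}")
print("\n".join(code))
# r1 = c * d
# r2 = b + r1
# a = r2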
A three-address code has at most three address locations to calculate the expression. A three-address
code can be represented in two forms: quadruples and triples.
(i) Quadruples - Each instruction in the quadruples presentation is divided into four fields: operator,
arg1, arg2, and result. The above example is represented below in quadruples format:
Quadruples format
Op   arg1   arg2   result
*    c      d      r1
+    b      r1     r2
=    r2            a
(ii) Triples - Each instruction in the triples presentation has three fields: op, arg1, and arg2, as shown
below. The result of each sub-expression is denoted by the position of the expression that computes it.
Triples bear a similarity to DAGs and syntax trees; they are equivalent to a DAG while
representing expressions.
Triples format
Op   arg1   arg2
*    c      d
+    b      (0)
=    (1)
Triples face the problem of code immovability during optimization: the results are positional, and
changing the order or position of an expression may cause problems.
b. Indirect Triples - This representation is an enhancement over the triples representation. It uses pointers
instead of positions to store results. This enables the optimizers to freely re-position a sub-expression
to produce optimized code.
CODE OPTIMIZATION
The next phase performs code optimization on the intermediate code. Optimization is a program
transformation technique which tries to improve the code by making it consume fewer resources (i.e.,
CPU, memory) and deliver higher speed.
In optimization, high-level general programming constructs are replaced by very efficient low-level
programming codes. Optimization can be assumed as something that removes unnecessary code lines,
and arranges the sequence of statements in order to speed up the program execution without wasting
resources (CPU, memory).
A code optimizing process must follow the three rules given below:
a. The output code must not, in any way, change the meaning of the program.
b. Optimization should increase the speed of the program and, if possible, the program should
demand fewer resources.
c. Optimization should itself be fast and should not delay the overall compiling process.
Efforts towards an optimized code can be made at various levels of the compilation process.
a. At the beginning, users can change/rearrange the code or use better algorithms to write the
code.
b. After generating intermediate code, the compiler can modify the intermediate code by address
calculations and improving loops.
c. While producing the target machine code, the compiler can make use of memory hierarchy and
CPU registers.
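As a toy illustration of a machine-independent optimization, the following sketch performs constant folding over a list of quadruples; the tuple format mirrors the quadruples table above, and the specific instructions are invented for illustration.

# Each quadruple is (op, arg1, arg2, result), matching the table above.
quads = [
    ("*", "4", "2",  "r1"),    # both operands constant: can be folded
    ("+", "b", "r1", "r2"),
    ("=", "r2", None, "a"),
]

def fold_constants(quads):
    """Replace operations whose operands are all constants with their value."""
    known = {}                              # temp name -> folded constant value
    out = []
    for op, a1, a2, res in quads:
        a1 = known.get(a1, a1)              # substitute previously folded values
        a2 = known.get(a2, a2)
        if op in "+-*" and a1.isdigit() and a2 and a2.isdigit():
            known[res] = str(eval(a1 + op + a2))   # compute at "compile time"
        else:
            out.append((op, a1, a2, res))
    return out

print(fold_constants(quads))
# [('+', 'b', '8', 'r2'), ('=', 'r2', None, 'a')] -- one instruction removed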
CODE GENERATION
Code generation can be considered the final phase of compilation. In this phase, the code generator
takes the optimized representation of the intermediate code and maps it to the target machine
language. The code generator translates the intermediate code into a sequence of (generally)
relocatable machine code; this sequence of machine instructions performs the same task as the
intermediate code would. The code generated by the compiler is object code in some lower-level
programming language, for example, assembly language. We have seen that the source code written in
a higher-level language is transformed into a lower-level language that results in a lower-level object
code, which should have the following minimum properties:
a. It should carry the exact meaning of the source code.
b. It should be efficient in terms of CPU usage and memory management.
The code generator should take the following things into consideration to generate the code:
• Target language - The code generator has to be aware of the nature of the target language for
which the code is to be transformed. That language may facilitate some machine-specific
instructions to help the compiler generate the code in a more convenient way. The target
machine can have either CISC or RISC processor architecture.
• IR Type - Intermediate representation has various forms. It can be in Abstract Syntax Tree (AST)
structure, Reverse Polish Notation, or 3-address code.
• Selection of instruction - The code generator takes Intermediate Representation as input and
converts (maps) it into target machine’s instruction set. One representation can have many ways
(instructions) to convert it, so it becomes the responsibility of the code generator to choose the
appropriate instructions wisely.
• Register allocation - A program has a number of values to be maintained during the execution.
The target machine’s architecture may not allow all of the values to be kept in the CPU memory
or registers. Code generator decides what values to keep in the registers. Also, it decides the
registers to be used to keep these values.
• Ordering of instructions - At last, the code generator decides the order in which the instructions
will be executed, creating a schedule for their execution.
Descriptors - The code generator has to track both the registers (for availability) and addresses (location
of values) while generating the code. For both of them, the following two descriptors are used:
• Register descriptor - Register descriptor is used to inform the code generator about the
availability of registers. Register descriptor keeps track of values stored in each register.
Whenever a new register is required during code generation, this descriptor is consulted for
register availability.
• Address descriptor - Values of the names (identifiers) used in the program might be stored at
different locations while in execution. Address descriptors are used to keep track of memory
locations where the values of identifiers are stored. These locations may include CPU registers,
heaps, stacks, memory or a combination of the mentioned locations.
The code generator keeps both descriptors updated in real time. For a load statement, LD R1, x, the code
generator:
• updates the register descriptor to show that R1 holds the value of x, and
• updates the address descriptor of x to show that one instance of x is now in R1.
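A minimal sketch of the two descriptors as plain Python dictionaries (the structure is an assumption for illustration), showing exactly the two updates described above for LD R1, x:

# Register descriptor: register -> set of names whose value it currently holds.
# Address descriptor: name -> set of locations holding the name's current value.
register_desc = {"R1": set(), "R2": set()}
address_desc = {"x": {"mem"}}          # x currently lives only in memory

def load(reg, name):
    """Model the effect of LD reg, name on both descriptors."""
    register_desc[reg] = {name}        # reg now holds the value of name
    address_desc[name].add(reg)        # one instance of name is now in reg

load("R1", "x")
print(register_desc)   # {'R1': {'x'}, 'R2': set()}
print(address_desc)    # {'x': {'mem', 'R1'}}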
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names, along
with their types, are stored here. The symbol table makes it easier for the compiler to quickly search for an
identifier's record and retrieve it. The symbol table is also used for scope management.
A symbol table may serve the following purposes depending upon the language in hand:
• to store the names of all entities in a structured form in one place;
• to verify if a variable has been declared;
• to implement type checking, by verifying that assignments and expressions in the source code are semantically correct;
• to determine the scope of a name (scope resolution).
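A minimal sketch of a scope-managed symbol table, kept as a chain of dictionaries; the interface is an assumption for illustration, not a prescribed design.

class SymbolTable:
    """Nested scopes kept as a stack of dicts: define, lookup, enter/exit scope."""
    def __init__(self):
        self.scopes = [{}]                      # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def define(self, name, type_):
        self.scopes[-1][name] = type_

    def lookup(self, name):
        for scope in reversed(self.scopes):     # innermost scope first
            if name in scope:
                return scope[name]
        return None                             # undeclared identifier

table = SymbolTable()
table.define("x", "int")
table.enter_scope()
table.define("x", "float")                      # shadows the outer x
print(table.lookup("x"))                        # float
table.exit_scope()
print(table.lookup("x"))                        # int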
Error handler: If the source program is not written as per the syntax of the language, syntax errors
are detected and reported by the error handler associated with the compiler. Each phase of the compiler can
encounter errors. A compiler that stops when it finds the first error is not as helpful as it could be. The
syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the
compiler. The lexical phase can detect errors where the characters remaining in the input do not form
any token of the language. Errors when the token stream violates the syntax of the language are
determined by the syntax analysis phase. During semantic analysis the compiler tries to detect
constructs that have the right syntactic structure but no meaning to the operation involved.
Regular Expressions
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes that belong
to the language in hand. It searches for the patterns defined by the language rules.
Regular expressions have the capability to express finite languages by defining a pattern for finite strings
of symbols. The grammar defined by regular expressions is known as regular grammar. The language
defined by regular grammar is known as regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches a set of
strings, so regular expressions serve as names for a set of strings. Programming language tokens can be
described by regular languages. The specification of regular expressions is an example of a recursive
definition. Regular languages are easy to understand and have efficient implementation.
The first step of compilation, called lexical analysis, converts the input from a simple sequence of
characters into a list of tokens of different kinds, such as numerical and string constants, variable
identifiers, and programming language keywords. The purpose of lex is to generate such lexical analyzers.
Regular expressions are often used to describe the tokens of a language. They specify exactly what
values are legal for the tokens to assume. Some tokens are simply keywords, like if, else, and for.
Others, like identifiers, can be any sequence of letters and digits provided that they do not match a
keyword and do not start with a digit. Typically, an identifier is a variable name such as current, flag2, or
windowStatus. In general, an identifier is a letter followed by any combination of digits and letters.
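That identifier rule corresponds to the regular expression [A-Za-z][A-Za-z0-9]*. The sketch below (with an invented, illustrative keyword set) classifies a lexeme as a keyword, an identifier, or neither:

import re

KEYWORDS = {"if", "else", "for"}                 # illustrative subset only
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*$")     # a letter, then letters/digits

def classify(lexeme):
    if lexeme in KEYWORDS:
        return "keyword"
    if IDENT.match(lexeme):
        return "identifier"
    return "invalid"

for lex in ["current", "flag2", "windowStatus", "if", "2fast"]:
    print(lex, "->", classify(lex))
# current, flag2 and windowStatus are identifiers; if is a keyword;
# 2fast is invalid because it starts with a digit.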
If an identifier is found in the program, then the action corresponding to identifiers is taken; perhaps
some information is added to the symbol table. If a keyword such as 'if' is
recognized, a different action is taken.
In the literary sense of the term, grammars denote syntactical rules for conversation in natural
languages. Linguists have attempted to define grammars since the inception of natural languages like
English, Sanskrit, and Mandarin.
The theory of formal languages finds its applicability extensively in the fields of Computer Science.
Noam Chomsky gave a mathematical model of grammar in 1956 which is effective for describing computer
languages.
Grammar
A grammar G can be formally written as a 4-tuple (N, T, S, P) where −
• N or VN is a set of variables or non-terminal symbols.
• T or ∑ is a set of terminal symbols.
• S is a special variable called the start symbol, S ∈ N.
• P is a set of production rules for terminals and non-terminals. A production rule has the form
α → β, where α and β are strings over VN ∪ ∑ and at least one symbol of α belongs to VN.
Example
Grammar G1
({S, A, B}, {a, b}, S, {S → AB, A → a, B → b})
Here,
• S, A, and B are Non-terminal symbols;
• a and b are Terminal symbols
• S is the Start symbol, S ∈ N
• Productions, P : S → AB, A → a, B → b
Example
Grammar G2 −
({S, A}, {a, b}, S, {S → aAb, aA → aaAb, A → ε})
Here,
• S and A are Non-terminal symbols.
• a and b are Terminal symbols.
• ε is an empty string.
• S is the Start symbol, S ∈ N
• Production P : S → aAb, aA → aaAb, A → ε
Example
Language Generated by a Grammar
The set of all strings that can be derived from a grammar is said to be the language generated from that
grammar.
Example
If there is a grammar
G: N = {S, A, B}, T = {a, b}, P = {S → AB, A → a, B → b}
Here S produces AB, and we can replace A by a, and B by b. Here, the only accepted string is ab, i.e.,
L(G) = {ab}
Example
Suppose we have the following grammar − G: N = {S, A, B}, T = {a, b}, P = {S → AB, A → aA|a, B → bB|b}
The language generated by this grammar –
L(G) = {ab, a^2b, ab^2, a^2b^2, …} = {a^m b^n | m ≥ 1 and n ≥ 1}
Example
Problem − Suppose L(G) = {a^m b^n | m ≥ 0 and n > 0}. We have to find the grammar G which
produces L(G).
Solution
Since L(G) = {a^m b^n | m ≥ 0 and n > 0}, the set of strings accepted can be rewritten as
L(G) = {b, ab, bb, aab, abb, …}
Here, the start symbol has to produce at least one 'b', preceded by any number of 'a' (including none).
To accept the string set {b, ab, bb, aab, abb, …}, we have taken the productions
S → aS, S → B, B → b and B → bB
S → B → b (Accepted)
S → B → bB → bb (Accepted)
S → aS → aB → ab (Accepted)
S → aS → aaS → aaB → aab(Accepted)
S → aS → aB → abB → abb (Accepted)
Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.
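These acceptance checks can be mechanized. Below is a minimal sketch (the dict encoding of the productions is an assumption for illustration) that derives, breadth-first, all sentences of the grammar above up to a length bound:

from collections import deque

# Productions S -> aS | B and B -> b | bB, encoded as
# non-terminal -> list of right-hand sides.
P = {"S": ["aS", "B"], "B": ["b", "bB"]}

def language(start="S", max_len=4):
    """Breadth-first search over sentential forms; collect terminal strings."""
    strings, queue = set(), deque([start])
    while queue:
        form = queue.popleft()
        if len(form) > max_len:
            continue                     # prune forms that grew too long
        nts = [c for c in form if c in P]
        if not nts:                      # no non-terminals left: a sentence
            strings.add(form)
            continue
        i = form.index(nts[0])           # expand the leftmost non-terminal
        for rhs in P[nts[0]]:
            queue.append(form[:i] + rhs + form[i + 1:])
    return sorted(strings, key=lambda s: (len(s), s))

print(language())
# ['b', 'ab', 'bb', 'aab', 'abb', 'bbb', ...] -- exactly a^m b^n, m >= 0, n > 0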
Example
Problem − Suppose L(G) = {a^m b^n | m > 0 and n ≥ 0}. We have to find the grammar G which
produces L(G).
Solution −
Since L(G) = {a^m b^n | m > 0 and n ≥ 0}, the set of strings accepted can be rewritten as
L(G) = {a, aa, ab, aaa, aab, abb, …}
Here, the start symbol has to produce at least one 'a', followed by any number of 'b' (including none).
To accept the string set {a, aa, ab, aaa, aab, abb, …}, we have taken the productions
S → aA, A → aA, A → B, B → bB, B → λ
S → aA → aB → aλ → a (Accepted)
S → aA → aaA → aaB → aaλ → aa (Accepted)
S → aA → aB → abB → abλ → ab (Accepted)
S → aA → aaA → aaaA → aaaB → aaaλ → aaa (Accepted)
S → aA → aaA → aaB → aabB → aabλ → aab (Accepted)
S → aA → aB → abB → abbB → abbλ → abb (Accepted)
Thus, we can prove every single string in L(G) is accepted by the language generated by the production
set.
Hence the grammar −
G: ({S, A, B}, {a, b}, S, {S → aA, A → aA | B, B → λ | bB })
TYPES OF GRAMMAR
Noam Chomsky classified grammars into four types. The table below summarizes each of Chomsky's four
types of grammars, the class of language it generates, the type of automaton that recognizes it, and the
form its rules must have.

Grammar    Language class            Recognizing automaton       Production form
Type-0     Recursively enumerable    Turing machine              α → β (unrestricted)
Type-1     Context-sensitive         Linear-bounded automaton    αAβ → αγβ
Type-2     Context-free              Pushdown automaton          A → γ
Type-3     Regular                   Finite automaton            A → a or A → aB
Type - 3 Grammar
Type-3 grammars generate the regular languages. Such a grammar restricts its rules to a single
non-terminal on the left-hand side and a right-hand side consisting of a single terminal, possibly followed
by a single non-terminal (right regular). Alternatively, the right-hand side of a rule can consist of
a single terminal, possibly preceded by a single non-terminal (left regular). These generate the same
languages.
The rule S → ε is allowed if S does not appear on the right side of any rule.
Example
X→ε
X → a | aY
Y→b
Type - 2 Grammar
Type-2 grammars generate the context-free languages. The productions must be in the form A → γ,
where A ∈ N (a single non-terminal) and γ ∈ (T ∪ N)* (a string of terminals and non-terminals).
Type - 1 Grammar
Type-1 grammars generate the context-sensitive languages. The productions must be in the form
αAβ → αγβ, where A ∈ N (a non-terminal), α, β, γ ∈ (T ∪ N)* (strings of terminals and
non-terminals), and γ is non-empty.
Example
AB → AbBc
A → bcA
B→b
Type - 0 Grammar
Type-0 grammars generate the recursively enumerable languages. The productions have no restrictions;
they form any phrase structure grammar, including all formal grammars.
They generate the languages that are recognized by a Turing machine.
The productions can be in the form of α → β where α is a string of terminals and nonterminals with at
least one non-terminal and α cannot be null. β is a string of terminals and non-terminals.
Example
S → ACaB
Bc → acB
CB → DB
aD → Db
PARSING
A parser is a compiler or interpreter component that breaks data into smaller elements for easy
translation into another language. A parser takes input in the form of a sequence of tokens, interactive
commands, or program instructions and breaks them up into parts that can be used by other
components in programming.
A parser usually checks all data provided to ensure it is sufficient to build a data structure in the form of
a parse tree or an abstract syntax tree.
In order for the code written in human-readable form to be understood by a machine, it must be
converted into machine language. This task is usually performed by a translator (interpreter or
compiler). The parser is commonly used as a component of the translator that organizes linear text in a
structure that can be easily manipulated (parse tree). To do so, it follows a set of defined rules called
“grammar”.
The overall process of parsing involves three stages:
1. Lexical Analysis: A lexical analyzer is used to create tokens from the stream of input characters.
2. Syntactic Analysis: Checks whether the tokens form an allowable expression according to the grammar.
3. Semantic Parsing: The final parsing stage in which the meaning and implications of the validated
expression are determined and necessary actions are taken.
Types of Parsing
1. Top-Down Parsing
This involves searching a parse tree to find the left-most derivation of an input stream by using a
top-down expansion. Parsing begins with the start symbol, which is transformed into the input symbols
until all symbols are derived and a parse tree for the input string is constructed. Examples include LL
parsers and recursive-descent parsers. Top-down parsing is also called predictive parsing or recursive
parsing.
Top-down parsers are classified into 2 types: the recursive descent parser and the non-recursive descent parser.
(i) Recursive descent parser:
It is also known as the brute force parser or the backtracking parser. It basically generates the
parse tree by using brute force and backtracking, as in the sketch below.
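A minimal sketch of such a backtracking recursive-descent parser, for the toy grammar S → aSb | ab; the grammar and the position-returning interface are assumptions for illustration.

def parse_S(s, i=0):
    """Try each production of S -> aSb | ab at position i; backtrack on failure.
    Returns the position just past a successful match, or None."""
    # Production 1: S -> a S b
    if s[i:i+1] == "a":
        j = parse_S(s, i + 1)
        if j is not None and s[j:j+1] == "b":
            return j + 1
    # Backtrack and try production 2: S -> a b
    if s[i:i+2] == "ab":
        return i + 2
    return None

def accepts(s):
    return parse_S(s) == len(s)

for w in ["ab", "aabb", "aaabbb", "aab"]:
    print(w, accepts(w))   # True, True, True, False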
(ii) Non-recursive descent parser:
It is also known as the LL(1) or predictive parser; instead of backtracking, it uses a parsing table to
choose the production to apply.
2. Bottom-Up Parsing
This involves rewriting the input back to the start symbol. It acts in reverse, tracing out the rightmost
derivation of a string until the parse tree is constructed up to the start symbol. This type of parsing is
also known as shift-reduce parsing. Bottom-up parsers are classified into 2 types: the LR parser and the
operator precedence parser.
(i) LR parser:
The LR parser is a bottom-up parser which generates the parse tree for the given string by using
unambiguous grammar. It follows the reverse of the rightmost derivation.
Advantages
1. LR parsers can recognize virtually all programming language constructs for which context-free
grammars can be written.
2. It is an efficient, non-backtracking, shift-reduce parsing method.
Disadvantages
1. It is too much work to construct an LR parser by hand for a programming language grammar; a
specialized tool called an LR parser generator is needed.
(ii) Operator precedence parser:
It generates the parse tree from the given grammar and string, on the condition that two consecutive
non-terminals and ε never appear on the right-hand side of any production.
Advantages
1. Easy to implement
2. Once an operator precedence relation is made between all pairs of terminals of the grammar,
the grammar can be ignored; it is not referred to anymore during implementation.
Disadvantages
1. It is hard to handle tokens like the minus sign (-), which has two different precedences (unary and binary).
2. Only a small class of grammar can be parsed using operator precedence parser.
ERROR RECOVERY
A parser should be able to detect and report any error in the program. It is expected that when an error
is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The parser
is mostly expected to check for errors, but errors may be encountered at various stages of the
compilation process. A program may have the following kinds of errors at the various stages:
• Lexical : name of some identifier typed incorrectly
• Syntactical : missing semicolon or unbalanced parenthesis
• Semantical : incompatible value assignment
• Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in a parser to deal with
errors in the code.
a. Panic Mode - When a parser encounters an error anywhere in the statement, it ignores the rest of
the statement by not processing input from the erroneous token up to a delimiter, such as a semicolon.
This is the easiest way of error recovery and it also prevents the parser from developing infinite loops
(a minimal sketch follows this list).
b. Statement Mode - When a parser encounters an error, it tries to take corrective measures so that
the rest of inputs of statement allow the parser to parse ahead. For example, inserting a missing
semicolon, replacing comma with a semicolon etc. Parser designers have to be careful here because
one wrong correction may lead to an infinite loop.
c. Error Productions - Some common errors that may occur in the code are known to the compiler
designers. The designers can create an augmented grammar to be used, with productions that
generate the erroneous constructs, so these errors are recognized when encountered.
d. Global Correction - The parser considers the program in hand as a whole and tries to figure out
what the program is intended to do and tries to find out a closest match for it, which is error-free.
When an erroneous input (statement) X is fed, it creates a parse tree for some closest error-free
statement Y. This may allow the parser to make minimal changes in the source code, but due to the
complexity (time and space) of this strategy, it has not been implemented in practice yet.
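A minimal sketch of panic-mode recovery; the toy statement form ID = ID ; and the token list are assumptions for illustration. On an error, the parser discards tokens up to the next semicolon, then resumes with the following statement.

def parse_statement(tokens, i):
    """Recognize the toy statement form ID '=' ID ';'; return the next position."""
    stmt = tokens[i:i + 4]
    if (len(stmt) == 4 and stmt[0].isidentifier() and stmt[1] == "="
            and stmt[2].isidentifier() and stmt[3] == ";"):
        return i + 4
    raise SyntaxError(f"bad statement near token {i}: {' '.join(stmt)}")

def parse_program(tokens):
    i, errors = 0, []
    while i < len(tokens):
        try:
            i = parse_statement(tokens, i)
        except SyntaxError as e:
            errors.append(str(e))
            while i < len(tokens) and tokens[i] != ";":
                i += 1                   # panic: discard the erroneous input
            i += 1                       # skip the delimiter and resume parsing
    return errors

print(parse_program(["x", "=", "y", ";", "z", "+", ";", "a", "=", "b", ";"]))
# reports one error for 'z + ;' yet still parses 'a = b ;' afterwards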
SCANNING
Scanning is the process of identifying tokens in the raw text source code of a program. At first glance,
scanning might seem trivial; after all, identifying words in a natural language is as simple as looking for
spaces between letters. However, identifying tokens in source code requires the language designer to
clarify many fine details, so that it is clear what is permitted and what is not. Most languages will have
tokens in these categories:
a. Keywords are words in the language structure itself, like while or class or true. Keywords must
be chosen carefully to reflect the natural structure of the language, without interfering with the
likely names of variables and other identifiers.
b. Identifiers are the names of variables, functions, classes, and other code elements chosen by
the programmer. Typically, identifiers are arbitrary sequences of letters and possibly numbers.
Some languages require identifiers to be marked with a sentinel (like the dollar sign in Perl) to
clearly distinguish identifiers from keywords.
c. Numbers could be formatted as integers, or floating point values, or fractions, or in alternate
bases such as binary, octal or hexadecimal. Each format should be clearly distinguished, so that
the programmer does not confuse one with the other.
d. Strings are literal character sequences that must be clearly distinguished from keywords or
identifiers. Strings are typically quoted with single or double quotes, but also must have some
facility for containing quotations, newlines, and unprintable characters.
e. Comments and white space are used to format a program to make it visually clear, and in some
cases (like Python) are significant to the structure of a program. When designing a new
language, or designing a compiler for an existing language, the first job is to state precisely what
characters are permitted in each type of token.
RECOGNIZER
A recognizer is a parser that does not perform syntax-directed translation; it only tells you whether the
input string is in the language described by the grammar. A recognizer for a language is a program
that decides whether an input belongs to the language defined by the grammar. Decide here means it
answers either yes or no: it takes a string x as input and answers "yes" if x is a sentence of the
language and "no" otherwise. It can also detect whether an input exhibits an ambiguity in the grammar
and report it to the user.
One can compile any regular expression into a recognizer by constructing a generalized transition
diagram called a finite automaton.
Finite automata come in two types - Deterministic Finite Automata (DFA) and Nondeterministic
Finite Automata (NFA).
In a DFA, for a particular input character, the machine goes to one state only. A transition function is
defined on every state for every input symbol. Also, in a DFA, null (or ε) moves are not allowed, i.e., a DFA
cannot change state without consuming an input character.
For example, a DFA with Σ = {0, 1} can accept all strings ending with 0.
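Such a DFA can be simulated directly from its transition table. A minimal sketch (the state names are illustrative):

# DFA accepting all binary strings that end with 0.
# q1 means "the last symbol read was 0"; q0 is the start state.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q0",
}
START, ACCEPT = "q0", {"q1"}

def dfa_accepts(w):
    state = START
    for ch in w:
        state = DELTA[(state, ch)]   # exactly one next state: deterministic
    return state in ACCEPT

for w in ["0", "10", "1101", "100"]:
    print(w, dfa_accepts(w))   # True, True, False, True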
NFA
One important thing to note is that in an NFA, if any path for an input string leads to a final state, then the
input string is accepted. For example, in an NFA for the same language, there are multiple paths for the
input string “00”; since one of those paths leads to a final state, “00” is accepted by the NFA.
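A minimal sketch of an NFA for the same language, simulated by tracking the set of all states reachable so far; the transition table is an assumption for illustration.

# NFA accepting binary strings that end with 0. From q0, reading 0 can
# either stay in q0 or move to the final state q1, so "00" has multiple paths.
DELTA = {
    ("q0", "0"): {"q0", "q1"},   # the nondeterministic choice
    ("q0", "1"): {"q0"},
    # q1 has no outgoing transitions
}
START, ACCEPT = "q0", {"q1"}

def nfa_accepts(w):
    states = {START}                     # all states reachable on the input so far
    for ch in w:
        states = set().union(*(DELTA.get((s, ch), set()) for s in states))
    return bool(states & ACCEPT)         # accept if ANY path reaches a final state

print(nfa_accepts("00"))   # True: the path q0 -> q0 -> q1 ends in a final state
print(nfa_accepts("01"))   # False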
RUN-TIME ENVIRONMENT
When the target program executes, it runs in its own logical address space, in which each
program value has a location. The logical address space is shared among the compiler, operating system
and target machine for its management and organization. The operating system maps the logical
addresses into physical addresses, which are usually spread throughout memory.
Storage Organization
The target program runs in its own logical address space. The size of the generated code is usually fixed
at compile time, unless code is loaded or produced dynamically, so the compiler can place the executable
at fixed addresses.
Memory locations for code and static data are determined at compile time, while data objects on the
stack and heap are dynamically allocated at runtime. Runtime storage comes in blocks, where a byte is
the smallest unit of addressable memory.
Both stack storage allocation and heap storage allocation are forms of dynamic storage allocation.