Compiler Design
Compiler design refers to the process of creating a special software tool called a compiler, which helps in translating and converting high-level programming code written by humans into a format that can be understood and executed by a computer. It involves various steps like analyzing the code's structure, checking for errors, optimizing the code for better performance, and generating the final executable program. In simpler terms, compiler design is all about building a translator that turns human-readable code into instructions that a computer can follow.

A compiler is a special tool that helps translate code written in a high-level programming language (like Python or Java) into a language that the computer's processor can understand, which is called machine language.

The high-level language is what developers use to write code because it's easier for humans to understand. The machine language, on the other hand, is what the computer actually understands and can execute.

When you write and run a program, the compiler checks the code for errors and makes sure it follows the rules of the programming language. Its main job is to convert the code written in one language into another language without changing the program's meaning.

The program execution happens in two parts :-

1. First, the compiler takes the code you wrote (source program) and translates it into a lower-level language called the object program. This translation makes the code easier for the computer to work with.

2. Then, the object program is further translated into the target program using a tool called an assembler. The target program is the final version that the computer can directly understand and execute.

In simpler terms, a compiler is like a language translator that helps turn human-readable code into a language that the computer can understand and run. It makes sure the code is correct, and then it converts it into a form that the computer can execute.

PHASES OF COMPILER

• Lexical Analysis: This is the first phase of the compiler. It reads the source code character by character and groups them into meaningful units called lexemes. For example, it recognizes keywords, identifiers, numbers, and symbols in the code. These lexemes are then converted into tokens, which are representations of these meaningful units.

• Syntax Analysis: In this phase, the compiler takes the tokens generated in the lexical analysis phase and checks if they form valid expressions according to the rules of the programming language's syntax. It constructs a parse tree, which represents the structure and relationships between the components of the code. The parser ensures that the code is syntactically correct.

• Semantic Analysis: The semantic analysis phase checks whether the parse tree, generated in the syntax analysis phase, follows the rules of the programming language in terms of meaning and context. It verifies the types of identifiers, expressions, and statements. This phase also performs tasks like type checking and symbol table management to ensure that the code has meaningful semantics.

• Intermediate Code Generation: Here, the compiler converts the source code into an intermediate representation that is closer to the machine language but still independent of the target machine. This intermediate code should be easy to translate into the final machine code. It helps in further analysis and optimizations before generating the actual target code.

• Code Optimization: This phase is optional but aims to improve the intermediate code generated in the previous phase. It analyzes the code for opportunities to make it more efficient and optimized in terms of execution speed and memory usage. This can involve eliminating redundant code, rearranging the sequence of statements, or applying mathematical transformations to simplify expressions.

In simpler terms, the compiler goes through several steps: analyzing the code's structure and meaning, generating an intermediate representation, optimizing the code if needed, and finally translating it into the machine language of the target computer. Each phase plays a specific role in converting human-readable code into executable machine code.
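To make the hand-offs between phases concrete, here is a minimal sketch in Python. Every function below is a toy stand-in with an illustrative name; a real compiler implements each phase in far more depth:

def lexical_analysis(source):
    # toy stand-in: split on whitespace instead of real lexeme recognition
    return source.split()

def syntax_analysis(tokens):
    # toy stand-in: wrap the tokens as a one-node parse "tree"
    return ("statement", tokens)

def semantic_analysis(tree):
    # toy stand-in: assume every semantic check passes
    return tree

def intermediate_code_generation(tree):
    # toy stand-in: a single intermediate instruction
    return [("expr", " ".join(tree[1]))]

def code_optimization(ir):
    # optional phase: nothing to improve in this toy IR
    return ir

def code_generation(ir):
    # toy stand-in: pretend to emit target machine code
    return "; target code for " + repr(ir)

def compile_source(source):
    tree = semantic_analysis(syntax_analysis(lexical_analysis(source)))
    return code_generation(code_optimization(intermediate_code_generation(tree)))

print(compile_source("x = 10 + y"))

The point of the sketch is only the shape: each phase consumes the previous phase's result and produces input for the next.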
During lexical analysis, the compiler reads the source code character by character and groups them into lexemes. Lexemes represent the smallest meaningful units in a programming language, such as keywords, identifiers, operators, constants, and punctuation symbols.

The lexical analyzer, also known as the lexer or scanner, is responsible for recognizing these lexemes by applying predefined rules and patterns specific to the programming language. It scans the source code, identifies lexemes based on these rules, and generates a sequence of tokens.

Tokens are abstract representations of lexemes and carry more meaning in the context of the programming language. They provide a higher-level view of the code, categorizing the lexemes into different types like keywords, identifiers, operators, or literals.

The purpose of lexical analysis is to transform the continuous stream of characters in the source code into a structured sequence of tokens. This structured representation makes it easier for subsequent compiler phases, such as syntax analysis and semantic analysis, to analyze and understand the code's structure and meaning.

In simpler terms, lexical analysis is like breaking down the source code into meaningful chunks and assigning labels to these chunks. It helps the compiler recognize and categorize different parts of the code, which is important for understanding and processing the code correctly.

Lexical analysis is the first phase of the compilation process in which the compiler reads the source code character by character. Its main job is to identify and extract meaningful units called lexemes from the code. Think of the source code as a long string of characters: the lexical analyzer breaks this string down into its individual lexemes.

To convert lexemes into tokens, the compiler uses a set of predefined rules and patterns specific to the programming language. These rules define the different categories or types of tokens that can be encountered in the code. Each token has a specific meaning and role in the language's syntax.

For example, let's consider the following line of code in the Python programming language:

x = 10 + y

During the lexical analysis phase, the lexical analyzer will recognize and extract the following lexemes :-

• The identifier 'x'
• The assignment operator '='
• The number '10'
• The addition operator '+'
• The identifier 'y'

Once the lexemes are identified, they are converted into tokens, which are abstract representations of these lexemes. These tokens will be used in later stages of the compiler to analyze the code's syntax and semantics.
Now, let's see how these lexemes are converted into tokens:

• The identifier lexeme "x" becomes an identifier token.
• The assignment operator lexeme "=" becomes an assignment token.
• The number lexeme "10" becomes a numeric constant token.
• The addition operator lexeme "+" becomes an addition operator token.

These tokens provide a higher-level representation of the code. They carry information about the type of the lexeme and its role in the programming language. The tokens are then used in subsequent phases of the compiler, such as syntax analysis and semantic analysis, to understand the structure and meaning of the code.
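A lexical analyzer of this kind can be sketched with regular expressions. The token names below mirror the bullets above; a real scanner would cover many more categories and report errors for unrecognized characters:

import re

TOKEN_SPEC = [
    ("NUMERIC_CONSTANT", r"\d+"),          # e.g. 10
    ("IDENTIFIER",       r"[A-Za-z_]\w*"), # e.g. x, y
    ("ASSIGNMENT",       r"="),
    ("ADDITION_OP",      r"\+"),
    ("WHITESPACE",       r"\s+"),          # recognized, then skipped below
]
PATTERN = "|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC)

def tokenize(source):
    # scan left to right, turning each lexeme into a (token type, lexeme) pair
    return [(m.lastgroup, m.group())
            for m in re.finditer(PATTERN, source)
            if m.lastgroup != "WHITESPACE"]

print(tokenize("x = 10 + y"))
# [('IDENTIFIER', 'x'), ('ASSIGNMENT', '='), ('NUMERIC_CONSTANT', '10'),
#  ('ADDITION_OP', '+'), ('IDENTIFIER', 'y')]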
programming language. The tokens are then used
in subsequent phases of the compiler, such as Syntax analysis typically constructs a parse tree
syntax analysis and semantic analysis, to or syntax tree, which is a hierarchical
understand the structure and meaning of the representation of the code's structure. The parse
code. tree shows how the different parts of the code
relate to each other based on the language's
syntax rules.
int x = 10;
Syntax analysis is crucial because it ensures that the code follows the correct structure and grammar of the programming language. By constructing the parse tree, it provides a structural representation of the code that can be used for subsequent analysis, optimization, and code generation phases of the compiler.

SEMANTIC ANALYZER

Semantic analysis is the phase of the compilation process that follows syntax analysis. It focuses on understanding the meaning and correctness of the code beyond its structure. Semantic analysis checks for semantic errors and ensures that the code makes sense according to the rules and constraints of the programming language.

Semantic analysis ensures that the code is semantically valid based on the information provided by the parse tree generated during the syntax analysis phase. It performs various checks and validations to ensure that the code makes sense and follows the rules and constraints of the programming language. Let's explore some of the tasks performed during semantic analysis in more detail :-

1. Type Checking: Semantic analysis verifies that the operations and expressions in the code are applied to compatible types. It checks that variables, constants, and expressions are used in a manner consistent with their declared types.
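As a toy illustration of such checks, the sketch below walks a small hand-built tree for the right-hand side of x = 10 + y, flagging undeclared identifiers and incompatible operand types. The node shapes and type names are invented for the example:

def check(node, symbol_table):
    # node is ("num", value), ("var", name), or ("add", left, right)
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "var":
        if node[1] not in symbol_table:
            raise NameError("undeclared identifier: " + node[1])
        return symbol_table[node[1]]
    if kind == "add":
        left = check(node[1], symbol_table)
        right = check(node[2], symbol_table)
        if left != right:
            raise TypeError("cannot add %s and %s" % (left, right))
        return left

symbols = {"x": "int", "y": "int"}
expr = ("add", ("num", 10), ("var", "y"))   # right-hand side of  x = 10 + y
print(check(expr, symbols))                  # int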
TARGET CODE GENERATION

Here's a detailed explanation of the target code generator and examples to illustrate its workings :-

1. Instruction Selection: The target code generator selects appropriate machine instructions based on the intermediate code instructions. It maps each operation and expression in the intermediate code to a corresponding sequence of machine instructions. For example:

Intermediate Code:

t1 = a + b

Target Machine Code (x86 assembly):

ADD eax, ebx

In this example, the target code generator selects the ADD instruction in x86 assembly to perform the addition operation.

2. Register Allocation: The target code generator assigns intermediate variables and values to specific registers or memory locations. For example, for the statement x = y + z:

Target Machine Code (MIPS assembly):

lw $t0, y           ; Load y into register $t0
lw $t1, z           ; Load z into register $t1
add $s0, $t0, $t1   ; Add y and z, store result in $s0
sw $s0, x           ; Store result in x

In this example, the target code generator uses the lw instruction to load values from memory into registers and the sw instruction to store the result back into memory.

3. Optimization and Target-Specific Features: The target code generator may perform additional optimizations specific to the target machine architecture. It can exploit pipeline features, vectorization, or other architectural characteristics for improved performance. The generated code may also include target-specific instructions or extensions. For example:

Intermediate Code:

x = a * b
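A table-driven instruction selector for three-address statements like these can be sketched as follows. The mnemonics are illustrative x86-style instructions, not the output of a real code generator:

OPCODE_TABLE = {"+": "ADD", "-": "SUB", "*": "IMUL"}

def select_instructions(dest, left, op, right):
    # map one three-address statement  dest = left op right
    # to a load / operate / store sequence
    return [
        "MOV eax, %s" % left,
        "%s eax, %s" % (OPCODE_TABLE[op], right),
        "MOV %s, eax" % dest,
    ]

for line in select_instructions("x", "a", "*", "b"):   # x = a * b
    print(line)
# MOV eax, a
# IMUL eax, b
# MOV x, eax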
Lexical analysis, also known as scanning or tokenization, is an important phase in the compilation process. Its purpose is to break down the source code into smaller, meaningful units called tokens. Consider a simple declaration such as:

int x = 10;

During lexical analysis, the following tokens would be generated :-

• Keyword token: int
• Identifier token: x
• Assignment operator token: =
• Numeric constant token: 10
• Punctuation token: ;

Tokens can also be tallied per category. For example, a slightly larger code segment might yield the following counts :-

1. Identifiers:
• x: 3
• y: 2
• sum: 2

2. Punctuation:
• (: 1
• ): 1
• {: 1
• }: 1
• ;: 3

3. Operators:
• =: 3
• +: 1

4. Numeric Literals:
• 5: 1
• 10: 1

By tallying the counts, we can determine exactly how many tokens of each category the code segment contains.
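Once a token stream exists, such tallies are a one-liner with a counter. A sketch over a small invented snippet (not the code segment tallied above):

from collections import Counter
import re

def tokenize(source):
    # crude scanner: numbers, identifiers/keywords, single-character symbols
    return re.findall(r"\d+|[A-Za-z_]\w*|[^\s\w]", source)

counts = Counter(tokenize("int sum; sum = 5 + 10;"))
print(counts)   # e.g. Counter({'sum': 2, ';': 2, 'int': 1, '=': 1, ...})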
CHAPTER THREE

SYNTAX ANALYSIS

The second phase of compiler design is syntax analysis, also known as parsing. The primary goal of syntax analysis is to ensure that the source code follows the rules and structure specified by the language's grammar. It involves analyzing the sequence of tokens generated by the lexical analysis phase and constructing a parse tree or syntax tree that represents the syntactic structure of the program.

Syntax analysis verifies the correctness of the program's syntax by checking if the sequence of tokens can be derived from the grammar of the programming language. This process involves applying a set of production rules defined by a context-free grammar (CFG) or a similar formalism. The CFG describes the syntax rules of the language and determines how different language constructs can be combined and nested.

During syntax analysis, the following steps are typically performed :-

1. Tokenization: The input source code is divided into tokens using the rules specified by the lexical analyzer. Each token represents a meaningful unit of the programming language, such as identifiers, keywords, operators, and literals.

2. Parsing: The tokens are processed and organized hierarchically to create a parse tree or syntax tree. The parse tree represents the syntactic structure of the program and shows how the various language constructs are related to each other. There are different parsing techniques, such as top-down parsing (e.g., recursive descent parsing) and bottom-up parsing (e.g., LR parsing), which systematically apply the grammar rules to construct the parse tree.

3. Error Handling: If any syntax errors are encountered during parsing, error messages are generated to inform the programmer about the specific issues in the code. These errors may include missing semicolons, mismatched parentheses, or incorrect usage of language constructs.

The output of the syntax analysis phase is either a parse tree or an abstract syntax tree (AST). An AST simplifies and abstracts the parse tree by removing unnecessary details and focuses on the essential elements of the program's syntax. The resulting parse tree or AST is then used in subsequent phases of the compiler, such as semantic analysis and code generation.

Syntax analysis is a crucial step in the compilation process as it ensures that the source code adheres to the language's grammar and can be further processed and transformed into executable code.

THE ROLE OF PARSER

The parser plays a crucial role in the syntax analysis phase of the compiler. Its main task is to take the sequence of tokens generated by the lexical analyzer and determine whether this sequence adheres to the grammar rules of the source language. The grammar used by the parser is typically a Context-Free Grammar (CFG), which provides a set of production rules for generating valid program structures.

The parser's primary objective is to ensure that the input program is well-formed and syntactically correct. It accomplishes this by constructing a parse tree, which represents the hierarchical structure of the program based on the grammar rules. The parse tree serves as a visual representation of how the different language constructs are nested within each other.

There are two main types of parsers used in compilers: top-down parsers and bottom-up parsers.
1. Top-Down Parsing: In top-down parsing, the parser starts from the root of the parse tree and works its way down to the leaves. It begins with the start symbol of the grammar and recursively applies production rules to generate the parse tree. The most common top-down parsing method is Recursive Descent Parsing, where each non-terminal in the grammar is associated with a separate procedure or function.

2. Bottom-Up Parsing: In bottom-up parsing, the parser starts from the input tokens and gradually builds the parse tree by reducing the tokens according to the grammar rules. Bottom-up parsing algorithms, such as LR (Left-to-Right, Rightmost derivation) or LALR (Look-Ahead LR) parsing, are commonly used to construct the parse tree.

Regardless of the parsing method used, the parser scans the input program from left to right, examining one symbol (token or non-terminal) at a time. If the parser encounters any syntax errors, such as an invalid sequence of tokens or a violation of the grammar rules, it reports these errors to the programmer in a meaningful and understandable way. Error recovery mechanisms may also be implemented in the parser to handle common syntax errors and allow the compilation process to continue.

Once the parser constructs the parse tree successfully, it passes the parse tree or an Abstract Syntax Tree (AST) to the subsequent phases of the compiler for further processing. The parse tree or AST serves as the foundation for subsequent analysis, such as semantic analysis, optimization, and code generation.

Overall, the parser is responsible for verifying the syntactic correctness of the program by applying the grammar rules, constructing the parse tree, reporting syntax errors, and passing the resulting tree structure to the next phases of the compiler.

CONTEXT-FREE GRAMMARS

In the field of compiler design, context-free grammars play a crucial role in specifying the syntactic structure of programming languages. A context-free grammar describes a set of strings, or language, and provides rules for composing these strings from various syntactic elements. These elements include terminals, non-terminals, start symbols, and production rules.

Terminals represent the basic symbols from which strings are formed. They correspond to the fundamental building blocks of a programming language, such as identifiers, keywords, operators, and literals. Non-terminals, on the other hand, act as placeholders that can be replaced by other terminals or non-terminals to create larger structures.

The start symbol is a designated non-terminal that indicates the entry point or top-level construct of the language's syntax. It defines the language that the grammar represents. By starting from the start symbol and applying production rules, valid programs or expressions in the language can be derived.

Production rules specify how terminals and non-terminals can be combined to form valid strings in the language. Each production rule consists of a head (left side), an arrow symbol (→), and a body (right side). The head represents the construct being defined, while the body describes the components that can be used to construct valid strings.

With these components, context-free grammars provide a precise and formal way to describe the syntactic structure of programming languages. They help compilers understand and analyze the structure of programs during the syntax analysis phase, enabling the detection of syntax errors and the construction of parse trees for further processing.
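Inside a compiler-construction tool, a grammar like this is often held as plain data. A minimal sketch of one common representation (head mapped to its alternative bodies), using the expression grammar examined in the next section; the variable names are illustrative:

grammar = {
    "E": [["E", "-", "E"], ["E", "*", "E"], ["a"], ["b"], ["c"]],
}
start_symbol = "E"
nonterminals = set(grammar)
terminals = {symbol for bodies in grammar.values()
             for body in bodies for symbol in body} - nonterminals
print(terminals)   # {'-', '*', 'a', 'b', 'c'}  (set order may vary)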
1. Terminals: Terminals are the basic symbols from which strings are formed. They represent the fundamental building blocks of a language. In the context of a compiler, terminals are often referred to as token names. Tokens are generated by the lexical analyzer and serve as input to the syntax analyzer. Each token corresponds to a specific lexeme, such as identifiers, keywords, operators, or literals.

2. Non-Terminals: Non-terminals, also known as syntactic variables, represent sets of strings of terminals. They act as placeholders that can be replaced by other terminals or non-terminals according to the grammar rules. Non-terminals are used to define the structure and syntax of a programming language. They are typically represented by uppercase letters or symbols.

3. Start Symbol: In a context-free grammar, one non-terminal is designated as the start symbol. The start symbol specifies the language that the grammar defines. It represents the top-level construct or entry point of the language's syntax. All valid programs or expressions in the language can be derived by starting from the start symbol and applying the production rules.

4. Production Rules: Production rules specify the different ways in which terminals and non-terminals can be combined to form strings in the language. Each production rule consists of three parts :-

a) Head or Left Side: The head of a production rule is a non-terminal that defines the strings generated by that rule. It represents the current construct being defined or expanded.

b) Arrow Symbol (→): The arrow symbol separates the head from the body of the production rule. It indicates that the head can be replaced by the elements in the body.

c) Body or Right Side: The body of a production rule consists of zero or more terminals and non-terminals. It describes one possible way in which strings can be constructed from the non-terminal at the head. The components of the body can be terminals (tokens) or non-terminals, and they specify the structure of the language.

Production rules provide the building blocks for constructing valid strings in the language. By applying the production rules recursively, starting from the start symbol, a parser can generate the syntax tree or parse tree for a given program.

Overall, context-free grammars are used to specify the syntactic structure of a programming language. They define the valid combinations and arrangements of terminals and non-terminals, which are necessary for parsing and understanding the structure of a program during the syntax analysis phase of the compiler.

For example, take the following grammar and input string: E → E - E | E * E | a | b | c, with the input string "a - b * c".

In the given example, we have a context-free grammar with the start symbol E, terminals -, *, a, b, c, and a single non-terminal E. The grammar consists of several production rules of the general form A → α1 | α2 | ... | αk; here the head A is E, with five alternative bodies.

The production rules for this grammar are as follows :-

1. E → E - E
2. E → E * E
3. E → a
4. E → b
5. E → c

The first two production rules have the same head E, and their bodies represent alternative ways to construct expressions involving subtraction and multiplication. The third, fourth, and fifth production rules define the possible terminals that can be directly derived from E, which are the single characters a, b, and c.

To generate a parse tree for the input string "a - b * c" using this grammar, the parser applies these production rules in a way that matches the input string. The parser starts with the start symbol E and applies the production rules to derive the input string. Here's a step-by-step breakdown :-

1. E → E - E (using the first production rule): E is expanded into E - E.
2. E - E → a - E (using the third production rule): The left E is expanded into a.
3. a - E → a - E * E (using the second production rule): The remaining E is expanded into E * E.
4. a - E * E → a - b * E (using the fourth production rule): The left E of the product is expanded into b.
5. a - b * E → a - b * c (using the fifth production rule): The final E is expanded into c, yielding the input string.

The production rules in a context-free grammar define the structure of valid strings in the language it represents. By applying these rules, parsers can analyze the syntax of programs and construct parse trees that capture the hierarchical relationships between the language constructs.

DERIVATIONS

In the process of parsing, a derivation refers to a sequence of production rules that are applied to transform a given input string according to the grammar of the language. During parsing, two decisions are made: selecting the non-terminal to be replaced and determining the production rule to be used for the replacement.

To illustrate the concept of derivation, let's consider a non-terminal symbol A surrounded by grammar symbols α and β, such that the current sentential form is αAβ. Suppose we have a production rule A → γ. In this case, we can write the derivation as αAβ ⟹ αγβ, indicating that A has been replaced by γ. For instance, with productions such as E → E + E, E → E * E, and E → id, a sequence of such replacement steps can derive the string id + id * id from E.
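The replacement step αAβ ⟹ αγβ is mechanical enough to script. The sketch below replays the leftmost derivation of a - b * c under the grammar above; the helper name is invented:

def leftmost_step(form, head, body):
    # replace the leftmost occurrence of non-terminal `head` by `body`
    i = form.index(head)
    return form[:i] + body + form[i + 1:]

form = ["E"]
for head, body in [("E", ["E", "-", "E"]),   # rule 1
                   ("E", ["a"]),             # rule 3
                   ("E", ["E", "*", "E"]),   # rule 2
                   ("E", ["b"]),             # rule 4
                   ("E", ["c"])]:            # rule 5
    form = leftmost_step(form, head, body)
    print(" ".join(form))
# E - E
# a - E
# a - E * E
# a - b * E
# a - b * c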
It's worth noting that regular expressions can describe only regular languages, while context-free grammars can describe both regular and non-regular languages. Regular expressions are typically used in lexical analysis for tokenizing input strings, while grammars are used in syntax analysis to analyze the hierarchical structure of a language.

Top-down parsing has advantages such as simplicity and ease of understanding. It allows for straightforward error reporting and recovery since it detects errors as soon as they occur during the parsing process. However, it can be inefficient in cases where backtracking is required due to ambiguous or non-deterministic grammars. To address this, optimization techniques like memoization or lookahead may be employed to improve the efficiency of top-down parsers.

During this process, a backtracking top-down parser alternates between two activities :-

Matching:
• Compare the symbol pointed to by the tree pointer with the symbol pointed to by the input pointer.
• If they match, advance both pointers to the right.
• If they don't match, backtrack to the step before the non-terminal expansion and try another production.

Recursive Expansion: When the tree pointer reaches a non-terminal, expand it using one of that non-terminal's productions and continue matching from the first symbol of the chosen body.

For example, consider the grammar:

S → cAd
A → ab | a

To construct a parse tree for the input string w = cad, begin with a tree consisting of a single node labeled S, and the input pointer pointing to c, the first symbol of w. S has only one production, so we use it to expand S and obtain the tree of Fig. 3.2 (a). The leftmost leaf, labeled c, matches the first symbol of input w, so we advance the input pointer to a, the second symbol of w, and consider the next leaf, labeled A.
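The expand-match-backtrack cycle just described can be written out directly. Below is a minimal backtracking recognizer for this grammar; using generators for the backtracking is one design choice among several:

def parse(w):
    # Recognizer for:  S -> c A d,  A -> a b | a
    def expand_A(i):
        # try A's alternatives in order; yield every position that matches
        for alternative in ("ab", "a"):
            if w.startswith(alternative, i):
                yield i + len(alternative)

    def expand_S(i):
        # S -> c A d: match c, expand A, then match d
        if w.startswith("c", i):
            for j in expand_A(i + 1):
                if w.startswith("d", j):
                    yield j + 1

    return any(end == len(w) for end in expand_S(0))

print(parse("cad"))    # True: A -> ab leaves no room for d, so A -> a is tried
print(parse("cabd"))   # True: A -> ab succeeds directly
print(parse("cd"))     # False: no alternative of A matches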
LR PARSING

LR parsing is a bottom-up parsing technique that constructs a parse tree for the input string by performing a series of reductions (replacing the right-hand side of a production with its head), tracing out a rightmost derivation in reverse. The name stands for "left-to-right scan, rightmost derivation," and LR parsing is known for its power and efficiency in parsing a wide range of context-free languages.

In LR parsing, the parser reads the input symbols from left to right while maintaining a stack. Input symbols are shifted onto the stack, and when the symbols on top of the stack form the complete right-hand side of a production rule, they are reduced to that production's head.
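A hand-worked trace makes the shift-reduce rhythm visible. For the toy grammar E → E - E | a | b and the input a - b, one possible sequence of moves is:

Stack        Remaining input    Action
$            a - b $            shift a
$ a          - b $              reduce by E → a
$ E          - b $              shift -
$ E -        b $                shift b
$ E - b      $                  reduce by E → b
$ E - E      $                  reduce by E → E - E
$ E          $                  accept

Read bottom-up, the reductions spell out the rightmost derivation E ⟹ E - E ⟹ E - b ⟹ a - b in reverse.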
CHAPTER FIVE

SYMBOL TABLE, SYNTAX-DIRECTED TRANSLATION AND TYPE CHECKING

A symbol table is a data structure used by compilers to store and manage information about identifiers (such as variables, functions, classes) encountered in the source code of a program. It acts as a central repository for collecting and organizing information about these program constructs.

The symbol table is built incrementally during the analysis phases of a compiler and utilized during the synthesis phases for generating the target code.
Each entry in the symbol table contains relevant information about an identifier, such as its name (lexeme), data type, memory location, scope, and other properties specific to the programming language.

The symbol table plays a crucial role in maintaining scope and binding information. It helps in resolving references to identifiers and ensures that they are used correctly and consistently within the program. When a name is encountered in the source code, the symbol table is searched to retrieve information about that name. If a new identifier is discovered or additional information about an existing identifier is found, the symbol table is updated accordingly.

The primary purposes of a symbol table are :-

1. Storing entity names: It provides a structured form to store the names of all entities (variables, functions, classes, etc.) encountered in the program. This allows for efficient access and retrieval of information.

2. Declaration verification: The symbol table is used to check if a variable or any other entity has been properly declared before its usage. It helps in ensuring that the program follows the language's scoping and declaration rules.

3. Type checking: The symbol table is involved in the process of type checking, which ensures that assignments and expressions in the source code are semantically correct. It verifies that the operations performed on variables or expressions are compatible with their declared types.

4. Scope resolution: The symbol table helps determine the scope of a name, i.e., where the name is valid and accessible within the program. It assists in resolving conflicts or ambiguities that may arise due to the presence of the same name in different scopes.

Efficient symbol table mechanisms are essential to support fast insertion and retrieval of entries. Depending on the size and complexity of the language being compiled, symbol tables may be implemented using different data structures such as hash tables, binary search trees, or other suitable data structures to ensure efficient symbol management during the compilation process.
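A chain of nested scopes is one classic way to organize such a table. The sketch below stores only a small attribute dictionary per entry; a production symbol table would record far more:

class SymbolTable:
    def __init__(self, parent=None):
        self.entries = {}        # name -> attributes (type, location, ...)
        self.parent = parent     # enclosing scope, or None for the global scope

    def declare(self, name, attributes):
        if name in self.entries:
            raise NameError("redeclaration of " + name)
        self.entries[name] = attributes

    def lookup(self, name):
        # search the current scope, then each enclosing scope in turn
        scope = self
        while scope is not None:
            if name in scope.entries:
                return scope.entries[name]
            scope = scope.parent
        raise NameError("undeclared identifier: " + name)

globals_ = SymbolTable()
globals_.declare("x", {"type": "int"})
inner = SymbolTable(parent=globals_)   # a nested scope, e.g. a function body
print(inner.lookup("x"))               # {'type': 'int'}, found in enclosing scope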
INTRODUCTION TO TYPE CHECKING

Type checking is a crucial process performed by a compiler to ensure that the source program adheres to the syntactic and semantic rules of the programming language. It is a static checking process that occurs during compilation, prior to the actual execution of the program.

The main objective of type checking is to detect and prevent type errors, which occur when an operation or expression is applied to incompatible data types. Type errors can lead to unexpected behavior or runtime errors in the compiled program. By performing type checks, the compiler ensures that the program operates with consistent and compatible data types, promoting safer and more reliable code execution.

Static type checking involves examining the program's structure, expressions, and statements to verify their compatibility and correctness according to the language's type system. Here are some key aspects of type checking :-

1. Type Compatibility: The type checker verifies that the operands and operators used in expressions or statements are compatible. For example, adding an integer and a string would be flagged as a type error.

2. Type Inference: In languages with type inference, the type checker deduces the types of variables or expressions based on their usage and context. This allows for implicit type declarations without explicit type annotations.

3. Type Consistency: The type checker ensures that the assigned value or result of an expression is consistent with the declared or expected type. For instance, assigning a floating-point value to an integer variable may require a type conversion or trigger a type error.
4. Function and Procedure Calls: The type checker validates that function and procedure calls match the expected number, order, and types of arguments. It checks that the arguments provided correspond to the defined parameter types (a small sketch of this check follows the list).

5. Data Structures and Operations: Type checking ensures that operations such as indexing, dereferencing, or member access are applied to compatible data structures. For example, accessing an element in an array requires an integer index, and accessing a field in a struct requires the correct field name.

6. Type Safety: Type checking promotes type safety by preventing implicit type conversions that may lead to data loss or unpredictable behavior. It enforces explicit type conversions or casting when necessary.
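A minimal sketch of the call check from aspect 4, assuming signatures are recorded in a symbol-table-like mapping (the signature format is invented for the example):

def check_call(name, argument_types, signatures):
    # signatures: name -> (parameter types, return type)
    if name not in signatures:
        raise NameError("undeclared function: " + name)
    parameter_types, return_type = signatures[name]
    if list(argument_types) != list(parameter_types):
        raise TypeError("%s expects %s, got %s"
                        % (name, parameter_types, list(argument_types)))
    return return_type

signatures = {"max": (["int", "int"], "int")}
print(check_call("max", ["int", "int"], signatures))    # int
# check_call("max", ["int", "string"], signatures) would raise a TypeError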
The type checker analyzes the program's structure, symbol table entries, and type declarations to perform these checks. If any type errors are detected, the compiler reports them as compilation errors or warnings, allowing the programmer to address and fix them before executing the program.

Overall, type checking plays a vital role in ensuring the integrity and correctness of the program's type system, promoting robustness and reliability in the compiled code.

In the context of type checking, the design of a type checker relies on information about the syntactic constructs in the language, the concept of types, and the rules for assigning types to different language constructs. The type system defines the properties and behavior of these types and how they interact with each other.

In many programming languages, including Pascal and C, types are classified as either basic or constructed.

1. Basic Types: Basic types are atomic types that do not have any internal structure as far as the programmer is concerned. They represent fundamental data types supported by the language. Examples of basic types in Pascal include boolean, character, integer, and real. Additionally, Pascal allows the construction of other basic types, such as enumerated types (e.g., (violet, indigo, blue, green, yellow, orange, red)) and subrange types (e.g., 1...10). Basic types provide the building blocks for constructing more complex types.

2. Constructed Types: Constructed types are formed by combining basic types and other constructed types. They have a composite structure and are used to represent more complex data structures. Common examples of constructed types are arrays, records (structs in C), and sets. Arrays are collections of elements of the same type, while records consist of multiple fields or members, each with its own type. Sets represent a collection of distinct values from a predefined range.