Compiler Construction Notes - Hamza
NOTES
SUBJECT :
Compiler Construction
CLASS :
BSCS 6th Semester
WRITTEN BY :
(CR) KASHIF MALIK
INTRODUCTION TO COMPILERS
• Single-Pass, Two-Pass, and Multipass Compilers
Single Pass Compiler
When all the phases of the compiler are combined in a single module, so that the source is processed in one pass, it is simply called a single-pass compiler. It performs the work of converting source code to machine code in that one pass.
Two Pass Compiler
A two-pass compiler is a compiler in which the program is processed twice: the first pass (the front end) analyses the source program, and the second pass (the back end) synthesises the target program.
Multipass Compiler
When several intermediate representations are created and the syntax tree is processed many times, the compiler is called a multipass compiler. It breaks the translation into several smaller passes.
In the lexical analysis phase, a statement such as
x = y + 10
is broken into the following tokens:

Lexeme    Token
x         identifier
=         assignment operator
y         identifier
+         addition operator
10        number
Example
For example, the statement
total = count + rate * 5
is translated, using the three-address code method, into the intermediate code:
t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
Can become:
b = c * 10.0
f = e + b
Phase 6: Code Generation
Code generation is the last and final phase of a compiler. It gets its input from the code optimization phase and produces the target code or object code as a result. The objective of this phase is to allocate storage and generate relocatable machine code. It also allocates memory locations for the variables. The instructions in the intermediate code are converted into machine instructions. This phase converts the optimized intermediate code into the target language.
The target language is the machine code. Therefore, all the memory locations and registers are also selected and allotted during this phase. The code generated by this phase is executed to take inputs and generate the expected outputs.

Example:
a = b + 60.0
might be translated into register-based code as:
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
GROUPING OF PHASES
1. Front End phases: The front end consists of those phases, or parts of phases, that are source-language dependent and target-machine independent. These generally consist of lexical analysis, syntactic analysis, semantic analysis, symbol table creation, and intermediate code generation. A little code optimization can also be included in the front-end part. The front end also includes the error handling that goes along with each of these phases.
2. Back End phases: The back end includes those portions of the compiler that depend on the target machine rather than on the source language, generally code optimization and code generation, along with the necessary error handling and symbol table operations.
Grouping
Several phases are grouped together into a pass so that the pass can read an input file and write an output file.
1. One-Pass – In one pass, all the phases are grouped into a single pass; all six phases are included in that one pass.
2. Two-Pass – In two passes, the phases are divided into two parts, i.e., the analysis or front-end part of the compiler and the synthesis or back-end part of the compiler.
Example of Translation:
Consider a simple assignment statement in a programming
language:
int result = a + b * c;
3. Semantic Analysis:
- Ensure that the statement adheres to the language's semantic
rules. For example, check if the variables are declared before
use.
4. Intermediate Code Generation:
- Translate the statement into three-address code:
t1 = b * c
t2 = a + t1
result = t2
6. Code Generation:
- Translate the intermediate code into the target machine code or
another intermediate representation.
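One possible register-level translation of the three-address code above, in the same register style used elsewhere in these notes (the mnemonics, operand order, and register names are assumptions for illustration):
MOV b, R1
MUL c, R1
ADD a, R1
MOV R1, result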
• PARSING
The process of transforming data from one format to another is called parsing. This process is accomplished by the parser, a component of the translator that organises the linear text structure according to a set of defined rules known as a grammar.
Types of Parsing:
Parsing is of two types: top-down parsing, which builds the parse tree from the root (start symbol) down to the leaves, and bottom-up parsing, which builds it from the leaves up to the root.
What is a token?
A token is the smallest meaningful unit of the source program that the lexical analyzer produces, such as a keyword, identifier, operator, or constant (a fuller definition follows below).
Example of Non-Tokens:
• Comments, preprocessor directives, macros, blanks, tabs, newlines, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding token, or a sequence of input characters that comprises a single token, is called a lexeme, e.g. "float", "abs_zero_Kelvin", "=", "-", "273", ";".
• INPUT BUFFERING
Input buffering is a technique that allows the compiler to read
input in larger chunks, which can improve performance and
reduce overhead.
The basic idea behind input buffering is to read a block of input from the source code into a buffer, and then process that buffer before reading the next block.
The lexical analyzer scans the input from left to right, one character at a time. It uses two pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned. Initially, both pointers point to the first character of the input string, as in the sketch below.
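A minimal C sketch of this two-pointer scan, assuming a single fixed-size buffer and whitespace-delimited lexemes (the names bp and fp follow the text; real lexers use paired buffers with sentinels and refill logic):

#include <stdio.h>

#define BUF_SIZE 4096

int main(void) {
    char buf[BUF_SIZE];
    /* Read one block of source text into the buffer. */
    size_t n = fread(buf, 1, BUF_SIZE - 1, stdin);
    buf[n] = '\0';                       /* sentinel marking end of buffer */

    char *bp = buf;                      /* begin pointer: start of current lexeme */
    char *fp = buf;                      /* forward pointer: scans ahead */
    while (*fp != '\0') {
        if (*fp == ' ' || *fp == '\t' || *fp == '\n') {
            if (fp > bp)
                printf("lexeme: %.*s\n", (int)(fp - bp), bp);
            bp = ++fp;                   /* both pointers move past the delimiter */
        } else {
            fp++;                        /* forward pointer scans ahead */
        }
    }
    if (fp > bp)                         /* emit the final lexeme, if any */
        printf("lexeme: %.*s\n", (int)(fp - bp), bp);
    return 0;
}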
What is a Token?
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens. For example, in the C language, the variable declaration line
int value = 100;
contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant), and ; (symbol).
Lexeme    Token
=         EQUAL_OP
*         MULT_OP
,         COMMA
(         LEFT_PAREN
Specifications of Tokens:
Let us understand how the language theory undertakes the
following terms:
1. Alphabets
2. Strings
3. Special symbols
4. Language
5. Longest match rule
6. Operations
7. Notations
8. Representing valid tokens of a language in regular
expression
9. Finite automata
1. Alphabets: Any finite set of symbols
• {0,1} is a set of binary alphabets,
• {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal
alphabets,
• {a-z, A-Z} is a set of English language alphabets.
2. Strings: Any finite sequence of alphabets is called a string.
3. Special symbols: A typical high-level language contains the
following symbols:
Assignment =
Preprocessor #
6. Operations: The various operations on languages are:
• Union of two languages L(r) and L(s): L(r) ∪ L(s)
• Concatenation of L(r) and L(s): L(r)L(s)
• Kleene closure of L(r): (L(r))*
8. Representing valid tokens of a language in regular expression: If x is a regular expression, then:
• x* means zero or more occurrences of x.
• x+ means one or more occurrences of x.
9. Finite automata: A state machine that takes a string of symbols as input and changes its state accordingly; it serves as a recognizer for regular expressions. See the FINITE AUTOMATA section below for the full definition.
• LEX
o Lex is a program that generates a lexical analyzer. It is used together with the YACC parser generator.
o The lexical analyzer is a program that transforms an input stream into a sequence of tokens.
o It reads the input stream and produces, as output, C source code that implements the lexical analyzer.
The function of Lex is as follows:
o First, the lexical-analyzer specification is written as a program lex.l in the Lex language. The Lex compiler then runs on lex.l and produces a C program lex.yy.c.
o Next, the C compiler compiles lex.yy.c and produces an object program a.out.
o a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
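A minimal illustrative Lex specification (the token set and actions are assumptions; this is just a sketch of the lex.l → lex.yy.c → a.out flow described above):

%{
#include <stdio.h>
%}
%%
[0-9]+                { printf("NUMBER: %s\n", yytext); }
[a-zA-Z][a-zA-Z0-9]*  { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]               { /* skip whitespace */ }
.                     { printf("OTHER: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }

Building it follows exactly the steps above: lex lex.l produces lex.yy.c, and cc lex.yy.c produces a.out.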
• FINITE AUTOMATA
Finite automata is a state machine that takes a string of symbols
as input and changes its state accordingly. Finite automata is a
recognizer for regular expressions. When a regular expression
string is fed into finite automata, it changes its state for each
literal. If the input string is successfully processed and the
automata reaches its final state, it is accepted, i.e., the string just
fed was said to be a valid token of the language in hand.
The mathematical model of finite automata consists of:
• Finite set of states (Q)
• Finite set of input symbols (Σ)
• One Start state (q0)
• Set of final states (qf)
• Transition function (δ)
The transition function (δ) maps a state from the finite set of states (Q) together with an input symbol from (Σ) to a next state: δ : Q × Σ → Q.
Finite Automata Construction
L = {a, aa, aaa, aaaa, aaaaa, ba, bba, bbbaa, aba, abba, aaba,
abaa}
Above is a simple subset of the acceptable strings; there can be many other strings that end with 'a' and contain only the symbols {a, b}.
DFA transition table (q1 is the final state):
State    a    b
q0       q1   q0
q1       q1   q0

Equivalent NFA transition table:
State    a          b
q0       {q0, q1}   {q0}
q1       ∅          ∅
One important thing to note is that, in an NFA, if any path for an input string leads to a final state, then the input string is accepted. For example, in the above NFA, there are multiple paths for the input string "aa". Since one of the paths leads to a final state, "aa" is accepted by the above NFA.
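A small C sketch that simulates the DFA table above for strings over {a, b} ending in 'a' (the state encoding and the test string are illustrative):

#include <stdio.h>

int main(void) {
    /* delta[state][symbol]: symbol 0 = 'a', 1 = 'b'
       Row q0: on 'a' go to q1, on 'b' stay in q0.
       Row q1: on 'a' stay in q1, on 'b' go to q0. */
    int delta[2][2] = { {1, 0}, {1, 0} };
    const char *s = "abba";
    int state = 0;                           /* start state q0 */
    for (const char *p = s; *p; p++)
        state = delta[state][*p == 'b'];
    printf("\"%s\" %s\n", s, state == 1 ? "accepted" : "rejected");
    return 0;
}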
• LEXICAL ANALYZER GENERATOR
1. Input Specification:
- Define a formal language to specify lexical rules, often using
regular expressions or similar formalisms.
- Allow users to provide a description of tokens, patterns, and
associated actions.
3. Token Definition:
- Define structures to represent tokens and their attributes (see the C sketch after this list).
5. Code Generation:
- Generate code for the lexical analyzer based on the processed
lexical rules.
- Output code should typically be in a language like C, Java, or
another programming language.
6. Error Handling:
- Implement mechanisms to handle errors gracefully, providing
meaningful error messages when lexical errors are
encountered.
8. Performance Considerations:
- Optimize generated code for speed and efficiency.
9. Customization Options:
- Allow users to customize the generated lexer, for example, by
specifying additional code snippets to be included.
10. Documentation:
- Provide comprehensive documentation for users, explaining
the input format, customization options, and integration steps.
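As referenced in item 3 above, one possible C representation of a token (the field names and sizes are illustrative assumptions):

typedef enum { TOK_IDENTIFIER, TOK_NUMBER, TOK_OPERATOR, TOK_EOF } TokenType;

typedef struct {
    TokenType type;        /* token class */
    char lexeme[64];       /* the matched characters */
    int line;              /* source line number, useful for error messages */
} Token;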
• PATTERN MATCHING
Pattern matching in compiler construction involves identifying
and recognizing specific patterns within the source code. This is
crucial for tasks like lexical analysis, where tokens need to be
matched against predefined patterns. Here's a simple example
using regular expressions for pattern matching in a lexical
analyzer:
1. Token Definitions:
- Identifiers: Any sequence of letters and digits, starting with a
letter.
- Numbers: Integer or floating-point numbers.
- Arithmetic Operators: '+', '-', '*', '/'
2. Regular Expressions:
- Identifier Pattern: [a-zA-Z][a-zA-Z0-9]*
- Number Pattern: \d+(\.\d+)?
- Arithmetic Operator Patterns: +, -, *, /
3. Example Source Code:
sum = 10 + 20 * 3;
average = sum / 2.0;
4. Lexical Analysis:
- The lexical analyzer scans the source code character by
character.
- It uses the defined regular expressions to match and identify
tokens.
5. Identified Tokens:
- For the given source code, the lexical analyzer might produce the following sequence of identified tokens:
- Identifier: sum
- Assignment Operator: =
- Number: 10
- Arithmetic Operator: +
- Number: 20
- Arithmetic Operator: *
- Number: 3
- Punctuation: ;
- Identifier: average
- Assignment Operator: =
- Identifier: sum
- Arithmetic Operator: /
- Number: 2.0
- Punctuation: ;
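A minimal sketch of this kind of pattern matching in C using the POSIX regex library (the lexeme list is illustrative; POSIX extended syntax is assumed, so \d is written as [0-9]):

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t ident, number;
    /* The token patterns from step 2, anchored to match whole lexemes. */
    regcomp(&ident, "^[a-zA-Z][a-zA-Z0-9]*$", REG_EXTENDED);
    regcomp(&number, "^[0-9]+(\\.[0-9]+)?$", REG_EXTENDED);

    const char *lexemes[] = { "sum", "10", "2.0", "+" };
    for (int i = 0; i < 4; i++) {
        if (regexec(&ident, lexemes[i], 0, NULL, 0) == 0)
            printf("Identifier: %s\n", lexemes[i]);
        else if (regexec(&number, lexemes[i], 0, NULL, 0) == 0)
            printf("Number: %s\n", lexemes[i]);
        else
            printf("Operator/other: %s\n", lexemes[i]);
    }
    regfree(&ident);
    regfree(&number);
    return 0;
}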
The constructed NFA has only one accepting state, but this state, having no out-transitions, is not an important state. By concatenating a unique right end marker # to a regular expression r, we give the accepting state for r a transition on #, making it an important state of the NFA for (r)#. In other words, by using the augmented regular expression (r)#, we can forget about accepting states as the subset construction proceeds; when the construction is complete, any state with a transition on # must be an accepting state.
An ambiguous CFG can generate the string "x + y * z" with two different parse trees.
In operator precedence parsing, a ≐ b means that the terminals "a" and "b" have the same precedence. The precedence table lists the precedence relations (⋖, ≐, ⋗) between pairs of terminals, and the input string is then processed with the help of this table.
• LR PARSERS
An LR parser is a bottom-up parser for context-free grammars that is widely used by compilers for programming languages and other associated tools. An LR parser reads its input from left to right and produces a rightmost derivation in reverse. It is called a bottom-up parser because it attempts to reduce the input toward the top-level grammar productions by building up from the leaves. LR parsers are the most powerful of all deterministic parsers used in practice.
Description of LR parser:
In the term LR(k) parser, L refers to the left-to-right scanning of the input, R refers to the rightmost derivation in reverse, and k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions. Typically, k is 1 and is often omitted. A context-free grammar is called LR(k) if an LR(k) parser exists for it. The parser first reduces the sequence of tokens at the left end of the input; reading the sequence of reductions backwards gives a rightmost derivation that expands non-terminals from the top.
2. Stack –
The stack holds grammar symbols and state symbols. The combination of the state symbol on top of the stack and the current input symbol is used to index the parsing table in order to make the parsing decisions.
Parsing Table:
The parsing table is divided into two parts: the action table and the go-to table. The action table tells the parser which step to take for the given current state and current terminal in the input stream. There are four kinds of entries in the action table: shift, reduce, accept, and error.
LR parser diagram:
Example-
1. Consider the production shown below –
S -> aSbS | bSaS | ε
Say, we want to generate the string “abab” from the above
grammar. We can observe that the given string can be derived
using two parse trees. So, the above grammar is ambiguous.
The grammars which have only one derivation tree or parse tree
are called unambiguous grammars.
2. Consider the productions shown below –
S -> AB
A -> Aa | a
B -> b
For the string "aab" we have only one parse tree for the above grammar, sketched below.
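A sketch of that parse tree in text form (A derives "aa" through A -> Aa and A -> a, and B derives "b"):

S
├── A
│   ├── A
│   │   └── a
│   └── a
└── B
    └── b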
• PARSER GENERATOR
A parser generator is a tool used in compiler construction to
automatically generate parsers for a given formal grammar.
Here's a brief overview of how a parser generator works:
1. *Grammar Specification:*
- Define the grammar of the programming language using a
formalism such as BNF (Backus-Naur Form) or EBNF
(Extended Backus-Naur Form). This grammar describes the
syntactic structure of the language.
4. *Types of Parsers:*
- Parser generators can produce different types of parsers, such
as LL parsers, LR parsers, or LALR parsers, depending on the
algorithm used.
5. *Table Construction:*
- Construct the parsing tables (for example, the ACTION and GOTO tables of an LR parser) from the grammar; these tables drive the generated parser.
6. *Code Output:*
- The parser generator outputs source code for the generated
parser. This code typically includes functions or methods to
parse the input source code and build a parse tree or abstract
syntax tree.
8. *Error Handling:*
- The generated parser includes mechanisms for error detection
and reporting, helping to provide meaningful feedback to
developers when syntax errors are encountered.
10. *Optimizations:*
- The generated parsing tables can be compressed and the parser code tuned so that the resulting parser is small and fast.
With Yacc, for example, a grammar specification translate.y is turned into a working parser by:
yacc translate.y
cc y.tab.c -ly
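A minimal illustrative Yacc specification for translate.y (the grammar, a one-digit-per-operand adder, is an assumption for this sketch; the file name comes from the commands above):

%{
#include <stdio.h>
#include <ctype.h>
int yylex(void);
void yyerror(const char *s);
%}
%token DIGIT
%%
line : expr '\n'       { printf("%d\n", $1); }
     ;
expr : expr '+' term   { $$ = $1 + $3; }
     | term
     ;
term : DIGIT
     ;
%%
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;              /* pass '+' and '\n' through as themselves */
}
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }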
• PROBLEM SOLVING
In compiler construction, various challenges and problems arise
that developers need to address to create effective and efficient
compilers. Here are some common problems encountered and
solved in the process:
1. *Ambiguity in Grammar:*
- *Problem:* Ambiguous grammars can lead to multiple interpretations of the same input, causing challenges for parser generators.
- *Solution:* Refine the grammar to remove ambiguity or
use techniques like associativity and precedence to clarify
parsing rules.
2. *Error Handling:*
- *Problem:* Detecting and recovering from errors in the
source code is crucial for providing meaningful feedback to
developers.
- *Solution:* Implement robust error-handling mechanisms
within the lexer and parser to identify and report errors
gracefully.
3. *Efficient Parsing:*
- *Problem:* Some parsing algorithms can be
computationally expensive, affecting compiler performance.
- *Solution:* Choose parsing algorithms carefully (e.g., LL,
LR, LALR) based on the characteristics of the language
grammar. Implement optimizations to enhance parsing speed.
5. *Optimizations:*
- *Problem:* Generating efficient target code while
maintaining correctness can be challenging.
- *Solution:* Implement various optimization techniques,
including constant folding, loop optimization, and register
allocation, to enhance the performance of the generated code.
8. *Debugging Information:*
- *Problem:* Generating debug information that aids
developers in identifying issues in the source code can be
complex.
- *Solution:* Implement mechanisms to include debugging
information in the generated code, such as line numbers and
variable names, to facilitate debugging.
9. *Security Concerns:*
- *Problem:* Compilers need to be designed with security in
mind to prevent vulnerabilities like buffer overflows or injection
attacks.
- *Solution:* Implement secure coding practices, conduct
rigorous testing, and incorporate security checks in the compiler
construction process.
• SYNTAX DIRECTED DEFINITIONS
Example:
E --> E1 + T { E.val = E1.val + T.val }
In this rule, E.val derives its value from E1.val and T.val.
S --> E
E --> E1 + T
E --> T
T --> T1 * F
T --> F
F --> digit
Now consider a grammar for declarations:
S --> T L
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The SDD for the above grammar can be written as follows, using an inherited attribute to pass the declared type down the identifier list.
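A standard formulation of that SDD, using the inherited attribute L.inh and the usual symbol-table helper addtype (both names follow common textbook convention and are assumptions here):

Production        Semantic rules
S --> T L         L.inh = T.type
T --> int         T.type = integer
T --> float       T.type = float
T --> double      T.type = double
L --> L1, id      L1.inh = L.inh; addtype(id.entry, L.inh)
L --> id          addtype(id.entry, L.inh)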
2. *Attribute Computation:*
- At each node of the syntax tree, compute the synthesized and
inherited attributes according to the S-attributed definitions
associated with the grammar rules.
3. *Bottom-Up Evaluation:*
- Start the evaluation process from the leaves of the syntax tree and proceed towards the root.
- For each node, compute its synthesized attributes from the attribute values of its children. If the node has inherited attributes, they are computed first.
- Continue this process until the attributes for the root node are computed.
4. *Inherited Attributes:*
- If a node has inherited attributes, compute them based on the
values inherited from its parent or other ancestors. These
attributes may depend on values computed during the bottom-
up evaluation of its siblings.
5. *Semantic Actions:*
- Incorporate semantic actions into the evaluation process. These
actions are associated with the grammar rules and are executed
during attribute computation. They are responsible for assigning
values to attributes.
6. *Example:*
- Consider a simple S-attributed definition for a programming language construct like arithmetic expressions. In an S-attributed definition all attributes are synthesized; here the synthesized attribute is the value of the expression.
- An S-attributed rule could be: E --> E1 + T { E.val = E1.val + T.val }, exactly as in the example above.
• BOTTOM UP EVALUATION
Bottom-up Parsing: Bottom-up parsing works just the reverse of top-down parsing: it traces a rightmost derivation of the input in reverse, reducing the input until it reaches the start symbol.
Shift-Reduce Parsing: Shift-reduce parsing works in two steps: the shift step and the reduce step.
Shift step: The shift step advances the input pointer to the next input symbol, and the shifted symbol is pushed onto the stack.
Reduce step: When the parser finds a complete right-hand side of a grammar rule on top of the stack, it replaces it with the corresponding left-hand-side non-terminal.
LR Parsing: The LR parser is one of the most efficient syntax-analysis techniques, and it works with context-free grammars. In LR parsing, L stands for left-to-right scanning of the input and R stands for constructing a rightmost derivation in reverse, as shown in the trace below.
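A worked shift-reduce trace for the illustrative grammar E -> E + id | id on the input id + id:

Stack       Input        Action
$           id + id $    shift
$ id        + id $       reduce E -> id
$ E         + id $       shift
$ E +       id $         shift
$ E + id    $            reduce E -> E + id
$ E         $            accept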
• RECURSION
Recursion is defined as a process which calls itself directly
or indirectly and the corresponding function is called a
recursive function.
Recursion plays a crucial role in various aspects of compiler
construction, particularly in parsing and semantic analysis. Here
are some key areas where recursion is commonly used:
1. *Parsing:*
- *Recursive Descent Parsing:* In top-down parsing, recursive descent parsing involves implementing parsing functions that correspond to grammar rules. Each parsing function may call other parsing functions recursively to handle subexpressions (see the C sketch at the end of this section).
- *Bottom-Up Parsing:* In bottom-up parsing, techniques like
LR parsing involve recognizing and reducing parts of the input
by applying production rules. LR parsers often use recursive
techniques to handle the reduction steps.
4. *Code Generation:*
- *Expression Evaluation:* Recursive algorithms are utilized
for evaluating expressions during code generation. This involves
traversing the AST and generating machine code or intermediate
code for arithmetic and logical operations.
- *Function Calls:* Code generation for function calls often
involves recursion, as the compiler needs to generate code for the
called function and handle the return values.
5. *Optimization:*
- *Recursive Algorithms for Analysis:* Some optimization
techniques, such as data flow analysis or constant folding, use
recursive algorithms to analyze and transform code for improved
performance.
- *Loop Optimization:* Recursive algorithms may be
employed to analyze and optimize loops, identifying
opportunities for unrolling or other loop transformations.
6. *Error Handling:*
- *Error Recovery:* Recursive descent parsers often
incorporate recursive error recovery mechanisms. These
mechanisms involve backtracking or other recursive strategies to
handle syntax errors and resume parsing.
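The C sketch referenced above: a recursive-descent parser-evaluator for the illustrative grammar E -> T { '+' T }, T -> F { '*' F }, F -> '(' E ')' | digit (the grammar and function names are assumptions; error handling is elided):

#include <stdio.h>
#include <ctype.h>

static const char *input;   /* next unread character */

static int expr(void);      /* forward declaration: factor() recurses into expr() */

/* F -> '(' E ')' | digit */
static int factor(void) {
    if (*input == '(') {
        input++;                          /* consume '(' */
        int v = expr();                   /* recursive call for the subexpression */
        if (*input == ')') input++;       /* consume ')' */
        return v;
    }
    if (isdigit((unsigned char)*input))
        return *input++ - '0';
    return 0;                             /* error handling elided in this sketch */
}

/* T -> F { '*' F } */
static int term(void) {
    int v = factor();
    while (*input == '*') { input++; v *= factor(); }
    return v;
}

/* E -> T { '+' T } */
static int expr(void) {
    int v = term();
    while (*input == '+') { input++; v += term(); }
    return v;
}

int main(void) {
    input = "(2+3)*4";
    printf("%d\n", expr());               /* prints 20 */
    return 0;
}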
• TYPE CHECKING
Type checking is the process of verifying and enforcing constraints of types in values. A compiler must check that the source program follows both the syntactic and semantic conventions of the source language, and it must also check the type rules of the language. Type checking allows the programmer to limit what types may be used in certain circumstances and assigns types to values. The type-checker determines whether these values are used appropriately or not.
• It checks the types of objects and reports a type error in the case of a violation; where the language rules allow it, mismatched types are corrected by implicit conversion. Whatever compiler we use, while it is compiling the program, it has to follow the type rules of the language. Every language has its own set of type rules. We know that the information about data types is maintained and computed by the compiler.
• The information about data types like INTEGER, FLOAT,
CHARACTER, and all the other data types is maintained
and computed by the compiler. The compiler contains
modules, where the type checker is a module of a compiler
and its task is type checking.
• TYPE SYSTEM
A type system in compiler construction is a set of rules that
govern the usage of types in a programming language. It
enforces constraints on how different types of data can be
manipulated, combined, and assigned within a program. The
primary goals of a type system include catching errors at
compile-time, enhancing program reliability, and facilitating
optimizations. Here are key components and aspects of a type
system:
1. *Type Checking:*
- *Static Type Checking:* Performed at compile-time, ensuring type correctness before the program runs. Common in languages like Java or C++.
- *Dynamic Type Checking:* Performed at runtime, allowing more flexibility but may lead to runtime errors. Common in languages like Python or JavaScript.
2. *Data Types:*
- Define basic data types (integers, floating-point numbers,
characters, etc.) and user-defined types (structures, classes,
enums).
- Specify the size, representation, and behavior of each type.
3. *Type Inference:*
- Automatically deducing types without explicit declarations.
Helps reduce redundancy and enhance code readability.
- Common in languages like ML, Haskell, or Rust.
4. *Type Compatibility:*
- Define rules for how different types can be used together in
expressions or assignments.
- Implicit conversions, coercion, or casting may be allowed
based on the language.
5. *Polymorphism:*
- *Parametric Polymorphism:* Enables writing generic code that works with different types.
- *Ad-hoc Polymorphism (Overloading):* Allows using the same function or operator with different types.
6. *Type Hierarchies:*
- Define relationships between types, such as inheritance or
interfaces in object-oriented languages.
- Hierarchies may include base types and derived types.
7. *Type Safety:*
- Prevents operations that could result in runtime errors or
unexpected behaviors.
- Enforces rules to catch type-related errors during compilation.
8. *Type Annotations:*
- Allow programmers to provide explicit type information for
variables, parameters, and functions.
- Facilitates understanding and improves tool support.
1. *Structural Equivalence:*
- *Definition:* Two types are structurally equivalent if their internal structures match.
- *Example:* Consider two record types:
type Person1 = { name: String, age: Int }
type Person2 = { name: String, age: Int }
Under structural equivalence, Person1 and Person2 denote the same type.
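For contrast, a small C sketch: C uses name equivalence for struct types, so the two Person types below, though structurally identical, are distinct and incompatible:

#include <stdio.h>

struct Person1 { const char *name; int age; };
struct Person2 { const char *name; int age; };

int main(void) {
    struct Person1 p1 = { "Ada", 36 };
    /* struct Person2 p2 = p1;  -- rejected: incompatible types under
       name equivalence, even though the structures match */
    printf("%s is %d\n", p1.name, p1.age);
    return 0;
}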
• TYPE CONVERSION
Type conversion: In type conversion, a value of one data type is automatically converted into another data type by the compiler at compile time. In such a conversion, the destination data type cannot be smaller than the source data type, which is why it is also called a widening conversion. It can only be applied between compatible data types.
Example: in the expression count + rate, where count is an int and rate a float, count is implicitly widened from int to float before the addition, just like the int_to_float step shown earlier in these notes.
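A small C sketch of the widening conversion just described (the variable names are illustrative):

#include <stdio.h>

int main(void) {
    int i = 7;
    double d = i + 2.5;   /* i is implicitly widened from int to double */
    printf("%f\n", d);    /* prints 9.500000 */
    return 0;
}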
OPTIMIZATION
The code optimization in the synthesis phase is a program transformation technique which tries to improve the intermediate code by making it consume fewer resources (i.e. CPU, memory), so that faster-running machine code will result. The compiler optimizing process should meet the following objectives:
• The optimization must be correct: it must not change the meaning of the program in any way.
• It should increase the speed and overall performance of the program.
• The optimization itself should not significantly delay the overall compilation process.
Example: in x = (a + b * c) / (a – b * c), the subexpression b * c occurs twice; common-subexpression elimination computes it only once, as sketched below.
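The same example in three-address code (the temporary names t1–t3 are illustrative):

t1 = b * c
t2 = a + t1
t3 = a - t1
x = t2 / t3

The optimizer computes b * c once into t1 and reuses it in both the numerator and the denominator.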
END