Compiler Easy Notes - Hamza Zahoor
NOTES
SUBJECT :
Compiler
CLASS :
BSCS 6TH Semester
WRITTEN BY :
Hamza Zahoor
Lexical analysis is the first phase of compiler construction. Its primary goal is to convert
the source code into a stream of tokens, which are the smallest units of meaning in
programming. The lexical analyzer, also called a lexer or scanner, reads the source code
character by character and groups them into meaningful sequences known as lexemes,
which are then classified into tokens.
The main tasks of the lexical analyzer are:
1. Tokenization: The lexer groups characters into lexemes and classifies each lexeme as a token.
2. Removing Whitespace and Comments: Blanks, newlines, and comments are discarded because later phases do not need them.
3. Error Detection: The lexer checks for illegal characters and informs the compiler of any lexical errors.
Example:
int a = 5;
During lexical analysis, the code is broken down into the following tokens:
Lexeme Token
int KEYWORD
a IDENTIFIER
= OPERATOR
5 CONSTANT
; SEMICOLON
Token Types:
• Keywords: Reserved words in the programming language (e.g., int, if, for).
• Identifiers: Names of variables, functions, and other user-defined items (e.g., a).
• Operators: Symbols that perform operations (e.g., =, +, *).
• Constants: Literal values such as numbers (e.g., 5).
• Punctuation: Symbols such as the semicolon (;) that separate or terminate statements.
Working:
1. The lexer reads the first three characters i, n, t and recognizes them as the keyword int.
2. The whitespace is skipped, and a is recognized as an identifier.
3. = is recognized as an operator.
4. 5 is recognized as a constant.
5. ; is recognized as a semicolon, ending the statement.
Lexical analyzers often use regular expressions to define patterns for recognizing tokens.
For example:
• Identifier: [a-zA-Z_][a-zA-Z0-9_]*
• Number: [0-9]+
• Keyword: int | if | for | while
If a character matches none of the defined patterns, the lexer reports a lexical error. Consider:
int a = 5$;
The lexer will flag $ as an illegal character since it does not belong to the valid token set of the language.
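To make the idea concrete, here is a minimal tokenizer sketch in Python. The regex patterns in TOKEN_SPEC are hypothetical and cover only the tokens in the example above; a real lexer would include many more.
import re

# Hypothetical token patterns; a real lexer needs many more.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(int|if|for|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("SEMICOLON",  r";"),
    ("SKIP",       r"\s+"),           # whitespace is discarded
]

def tokenize(source):
    pos = 0
    tokens = []
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            match = re.match(pattern, source[pos:])
            if match:
                lexeme = match.group(0)
                if name != "SKIP":
                    tokens.append((lexeme, name))
                pos += len(lexeme)
                break
        else:
            # No pattern matched: report a lexical error (e.g., for '$')
            raise SyntaxError(f"Illegal character {source[pos]!r} at position {pos}")
    return tokens

print(tokenize("int a = 5;"))
# [('int', 'KEYWORD'), ('a', 'IDENTIFIER'), ('=', 'OPERATOR'),
#  ('5', 'CONSTANT'), (';', 'SEMICOLON')]
# tokenize("int a = 5$;") would raise a SyntaxError for '$'.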
Lexical analysis plays a critical role by cleaning the input and preparing it for the next phase
of the compiler, which is syntax analysis.
Reading the source program character by character from a file can be slow because each
character read from the disk involves an I/O operation. Frequent disk accesses can
significantly impact performance. To mitigate this, compilers use input buffering to
optimize reading operations.
1. Buffer Setup: Instead of reading one character at a time, the compiler reads
the input file in chunks or blocks of data into a buffer in memory. This reduces the number
of disk I/O operations because multiple characters are loaded into memory in a single
read.
2. Dual Buffering: A common strategy is the use of two buffers (called double
buffering). The source code is divided into two parts, each part being loaded into one of the
buffers:
• When the first buffer is exhausted (all characters have been processed), the
second buffer is loaded with the next chunk of the input file.
• While one buffer is being processed, the other is being loaded from disk in
parallel, improving efficiency.
This setup avoids the need to wait for the disk to load the next portion of the input, making
the processing of the source code continuous.
• The source file is loaded in two buffers. Suppose each buffer holds 1KB of
data.
• The lexical analyzer reads characters sequentially from the first buffer. When
it reaches the end, it switches to the second buffer, which has already been filled.
• When the first buffer has been exhausted and processing has moved to the second buffer, the first buffer is reloaded with the next part of the input file while the second buffer is being processed.
1. Fewer I/O Operations: Reading the file in large blocks greatly reduces the number of expensive disk accesses.
2. Parallelism: With double buffering, one buffer is loaded while the other is being processed, reducing idle time for the compiler.
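A simplified sketch of the idea in Python is shown below. It reads the file in two alternating chunks (a hypothetical 1 KB buffer size); in a real compiler the refill would happen in parallel with lexing rather than sequentially as here.
BUFFER_SIZE = 1024   # assumed buffer size for illustration

def read_with_double_buffer(path):
    with open(path, "r") as f:
        buffers = [f.read(BUFFER_SIZE), f.read(BUFFER_SIZE)]  # preload both halves
        active = 0
        while buffers[active]:
            for ch in buffers[active]:      # the lexer consumes the active buffer
                yield ch
            # Active buffer exhausted: refill it and switch to the other buffer,
            # which already holds the next chunk of the file.
            buffers[active] = f.read(BUFFER_SIZE)
            active = 1 - active

# Example usage (hypothetical file name):
# for ch in read_with_double_buffer("program.c"):
#     process(ch)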
Challenges
• Buffer Overruns: Care must be taken to ensure that the lexical analyzer
doesn’t exceed the buffer’s boundaries.
• End of File Handling: Proper management is required when the file’s end is
reached, ensuring no data is missed or redundant characters are read.
Conclusion
Input buffering is crucial in compiler design for optimizing the reading and processing of
source code. It reduces overhead, ensures efficient memory management, and speeds up
the overall compilation process by reducing the number of direct I/O operations.
In compiler construction, token specification and recognition are key steps in the lexical
analysis phase, where the source code is converted into tokens.
1. Token Specification
A token is a string of characters grouped together as a meaningful element. For
example, keywords (if, while), operators (+, -), identifiers (variable names), and punctuation
(semicolon ;) are tokens.
Tokens are specified by regular expressions (regex). A regular expression defines a pattern
that matches specific sequences of characters in the input. Each token type, like an
identifier or an operator, has its own regular expression. For example:
1. Keywords: if, while, int, return (matched literally)
2. Identifiers: [a-zA-Z_][a-zA-Z0-9_]*
3. Operators: +, -, *, ==
4. Punctuation: ;, ,, (, )
2. Token Recognition
Once tokens are specified by regular expressions, the lexical analyzer (also known as the
scanner) reads the source code character by character to match these patterns and
produce a sequence of tokens. This process is called token recognition.
1. Input Scanning: The lexical analyzer scans the source code from left to right, matching the input against the token patterns.
For example, for the statement:
int x = 10;
• int is recognized as a keyword.
• x is recognized as an identifier.
• = is an operator.
• 10 is a literal.
• ; is a punctuation token.
Summary:
These tokens are then used by the parser for syntax analysis.
1. Terminals: These are the basic symbols from which strings are formed. In
programming languages, terminals are typically tokens like keywords, operators, and
symbols.
2. Non-terminals: These are syntactic variables that represent sets of strings and define the structure of language constructs (e.g., expressions, statements).
3. Production Rules: These are rules that describe how terminals and non-terminals can be combined to form valid strings. Each rule specifies how a non-terminal can be replaced by a combination of terminals and non-terminals.
4. Start Symbol: This is the initial non-terminal from which production begins. It
serves as the root of the derivation tree that represents the structure of the code.
Structure of a CFG
A CFG is formally written as G = (V, T, P, S), where:
• V: Set of non-terminals.
• T: Set of terminals.
• P: Set of production rules.
• S: Start symbol.
Example
• Terminals: id, +, *, (, )
• Non-terminals: E, T, F
• Start Symbol: E
• Production Rules:
• E → E + T | T
• T → T * F | F
• F → ( E ) | id
Here:
• E represents expressions,
• T represents terms,
• F represents factors.
In compilers, CFGs are used to define the syntax of the source language. The syntax
analyzer, or parser, uses the CFG to verify that the source code follows the syntactic
structure. This helps the compiler to detect errors early in the process and build a parse
tree or syntax tree, which serves as the foundation for semantic analysis and further
compilation stages.
Understanding CFGs is fundamental for compiler construction because they allow the
compiler to understand the rules governing the structure of valid statements, expressions,
and program blocks in the language.
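As a small illustration, the example grammar can be written down directly as data and used to derive sentences of the language. This is only a sketch of how productions expand non-terminals, not part of any real parser.
import random

# The example grammar encoded as a Python dictionary (non-terminal -> productions).
grammar = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["T", "*", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}

def derive(symbol, depth=0):
    """Expand a symbol into a string of terminals using the productions."""
    if symbol not in grammar:                 # terminal: return as-is
        return [symbol]
    # Beyond a small depth, prefer the last (non-recursive) production so the
    # derivation terminates.
    production = grammar[symbol][-1] if depth > 3 else random.choice(grammar[symbol])
    result = []
    for sym in production:
        result.extend(derive(sym, depth + 1))
    return result

print(" ".join(derive("E")))   # e.g. "id + id * id" -- one sentence generated by the CFG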
• Each construct will have specific syntax rules that need to be defined in the
grammar.
• Terminals: These are the basic symbols (tokens) of the language, like
keywords (if, while, etc.), operators (+, -, *, etc.), and punctuation (commas, semicolons,
etc.).
• Define production rules that describe how terminals and non-terminals can
combine to form valid code structures.
• Expression: E → E + T | T
• Term: T → T * F | F
• Factor: F → ( E ) | id | num
• Choose a start symbol, the top-level non-terminal, which serves as the root
of the parse tree. This represents the complete program or main structure.
• The start symbol enables the parser to start building the syntax tree from the
top and expand through other production rules.
• Ambiguity occurs when a grammar allows multiple parse trees for the same
string, making it unclear how to interpret certain statements. This can lead to issues in
parsing.
For a language with basic arithmetic and assignment, a possible grammar might look like
this:
• Terminals: id, num, =, +, *, (, ), ;
• Non-terminals: Stmt, Expr, Term, Factor
• Start Symbol: Stmt
• Production Rules:
• Stmt → id = Expr ;
• Expr → Expr + Term | Term
• Term → Term * Factor | Factor
• Factor → ( Expr ) | id | num
Defining grammar in compiler construction helps create a parser that can systematically break down source code into its syntactic components, check its correctness, and construct a syntax tree or parse tree. This process enables the compiler to:
• detect syntax errors early in compilation,
• build a parse tree or syntax tree for later phases, and
• provide a solid foundation for semantic analysis and code generation.
In *compiler construction*, various tools are used to automate different parts of the compilation process, making it easier to develop, maintain, and extend compilers. These tools help in tasks like lexical analysis, parsing, code generation, optimization, and error handling.
Here’s a detailed look at the common *compiler construction tools*, along with examples
of each.
- *How It Works*: The developer specifies patterns for tokens using *regular expressions*, and the tool generates a lexical analyzer based on these patterns.
- *Example*:
- *Lex* and *Flex* (Fast Lexical Analyzer) are popular lexer tools.
- *Process*: You define a set of token patterns like keywords, operators, identifiers, etc.
The tool generates C code to recognize and return these tokens.
- *Usage*: For instance, a C-like token definition in Flex might look like:
digit    [0-9]
letter   [a-zA-Z]
%%
"int"                          { return INT; }
{letter}({letter}|{digit})*    { return IDENTIFIER; }
{digit}+                       { return NUMBER; }
%%
This would recognize keywords like int, identifiers, and numbers in a source program.
- *Purpose*: Automates the generation of the *parser*, which checks the syntactic structure of the code and builds the *parse tree* or *syntax tree*.
- *How It Works*: The developer defines the *grammar* of the language using *context-free grammar (CFG)* rules. The tool generates a parser that recognizes sentences (programs) that match the grammar.
- *Example*:
- *Yacc* (Yet Another Compiler Compiler) and *Bison* (GNU version of Yacc) are widely
used parser generators.
- *Usage*: A grammar for arithmetic expressions might be written as:
expr : expr '+' term
     | term;
term : term '*' factor
     | factor;
factor : '(' expr ')'
     | NUMBER;
This would generate a parser that understands arithmetic expressions involving addition
and multiplication.
- *How It Works*: These tools extend the parser to not only check for syntactic correctness but also perform *actions* (like generating intermediate code) while parsing.
- *Example*: Syntax-directed translation engines like Yacc and Bison allow actions to be
associated with grammar rules. For instance:
expr : expr '+' term { $$ = $1 + $3; }
     | term { $$ = $1; };
Here, an action ($$ = $1 + $3;) is triggered when the parser matches the expr + term rule,
meaning intermediate code or values can be generated as parsing proceeds.
- *Purpose*: Tools that help generate an *intermediate representation (IR)* of the source code.
- *How It Works*: These tools transform the syntax tree or parse tree generated by the
parser into an intermediate form that is easier to optimize and target across different
machine architectures.
- *Purpose*: Automates the generation of the final *machine code* or *assembly code* from the intermediate representation.
- *How It Works*: These tools take the intermediate code and translate it into the specific
machine instructions for a target architecture.
- *Example*: LLVM's backend generates machine-specific code (e.g., x86, ARM) from the
intermediate representation. Other code generators include GCC (GNU Compiler
Collection), which produces optimized machine code for different platforms.
- *Purpose*: Tools that help improve the intermediate or final machine code by applying *optimizations* to enhance performance or reduce resource usage.
- *How It Works*: These tools take the intermediate or machine code and apply various transformations such as *loop unrolling*, *dead code elimination*, and *constant propagation*.
- *Example*: The *GCC optimizer* (-O1, -O2, -O3 flags) optimizes the code for different levels of performance. LLVM also includes powerful optimization passes for intermediate code.
For instance, constant folding would transform:
int x = 2 * 3;
into:
int x = 6;
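As a toy illustration of this kind of transformation, the Python sketch below folds a constant multiplication in a small assignment using Python's own ast module; real optimizers perform this on their intermediate representation, not on source text.
import ast

def fold_constants(source):
    """Fold constant multiplications such as 2 * 3 into 6 (illustration only)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        for field, child in ast.iter_fields(node):
            if (isinstance(child, ast.BinOp)
                    and isinstance(child.left, ast.Constant)
                    and isinstance(child.right, ast.Constant)
                    and isinstance(child.op, ast.Mult)):
                # Replace the whole sub-expression with its computed value.
                setattr(node, field, ast.Constant(child.left.value * child.right.value))
    return ast.unparse(tree)      # requires Python 3.9+

print(fold_constants("x = 2 * 3"))   # prints: x = 6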
- *How It Works*: These tools generate error messages and diagnostics that help
developers understand the nature of the errors in their code.
- *Example*: In Bison, an error production can be added to a rule to report problems:
| term { $$ = $1; }
| error { yyerror("syntax error in expression"); };
This would trigger an error message if the expression does not follow the correct
grammar.
- *Purpose*: Tools that help in converting the *assembly code* into *machine code* and linking various code modules together to form an executable.
- *How It Works*: After code generation, the assembly code needs to be converted into an
object file and linked with libraries and other code modules.
- *Example*:
- *Assembler*: Converts assembly language into machine code. Examples include GNU
as (assembler).
- *Linker*: Combines object files and libraries into an executable. Examples include GNU
ld (linker).
- *Purpose*: Tools used to *profile* the performance of the code and *debug* it by tracing errors, runtime exceptions, and crashes.
- *Example*:
- *GDB* (GNU Debugger) allows you to debug programs by setting breakpoints, stepping
through the code, and examining variables.
- *Valgrind* is a profiling tool that helps detect memory leaks and analyze performance.
---
Here’s how these tools might work together to build a simple compiler:
1. *Lexical Analysis* (Flex): Source code int x = 10; is broken into tokens like int, x, =, 10, and ;.
2. *Syntax Analysis* (Bison): The tokens are parsed according to a grammar that recognizes the structure of variable declarations and statements.
3. *Intermediate Code Generation* (LLVM IR): The declaration is lowered to intermediate code such as:
%1 = alloca i32
4. *Optimization* (LLVM): Optimization passes simplify the intermediate code where possible.
5. *Code Generation* (LLVM backend): The optimized code is translated into target instructions such as:
mov eax, 10
---
### Conclusion
Together, these tools automate the major stages of compilation, from scanning and parsing through optimization, code generation, and linking, which makes compilers easier to build, maintain, and retarget.

Consider the expression a + b * c. Lexical analysis produces the tokens:
• a (identifier)
• + (plus operator)
• b (identifier)
• * (multiplication operator)
• c (identifier)
The intermediate code generator then produces three-address code:
t1 = b * c
t2 = a + t1
Finally, the code generator translates this into assembly-like target code:
MOV R1, b
MUL R1, c
MOV R2, a
ADD R2, R1
The following Python sketch shows how such a translator can be implemented; it splits on the lowest-precedence operator first so that * binds more tightly than +:
class Translator:
    def __init__(self, expression):
        self.expression = expression
        self.temp_counter = 0
        self.intermediate_code = []

    def generate_temp(self):
        # Create a new temporary name: t1, t2, ...
        self.temp_counter += 1
        return f"t{self.temp_counter}"

    def translate(self, expr):
        # Split on '+' first (lowest precedence), then on '*'.
        if '+' in expr:
            left, right = expr.split('+', 1)
            left_temp = self.translate(left.strip())
            right_temp = self.translate(right.strip())
            temp = self.generate_temp()
            self.intermediate_code.append(f"{temp} = {left_temp} + {right_temp}")
            return temp
        elif '*' in expr:
            left, right = expr.split('*', 1)
            left_temp = self.translate(left.strip())
            right_temp = self.translate(right.strip())
            temp = self.generate_temp()
            self.intermediate_code.append(f"{temp} = {left_temp} * {right_temp}")
            return temp
        else:
            # A plain operand (identifier or number)
            return expr.strip()

    def print_code(self):
        for code in self.intermediate_code:
            print(code)

# Example usage:
expr = "a + b * c"
translator = Translator(expr)
translator.translate(expr)
translator.print_code()
Output:
t1 = b * c
t2 = a + t1
This basic structure can be extended to handle more complex expressions, optimizations,
and different target languages.
In programming, compilers need to keep track of identifiers, their attributes, and their scopes. The symbol table facilitates:
• quick lookup of an identifier's attributes (type, scope, memory location),
• detection of duplicate or undeclared identifiers, and
• scope management and type checking during compilation.
Symbol tables are typically implemented as hash tables, binary search trees, or other data
structures that allow efficient insertions and lookups.
Example
int a = 10;
int b = 20;

void foo() {
    int c = 30;
    a = a + b;
}
How it Works:
1. Global Scope: When the compiler encounters the global variables a and b, it
adds them to the symbol table with their data types (int), scope (Global), memory
addresses (assigned during code generation), and initial values (10 and 20).
2. Function Declaration: The function foo is added to the table in the global scope, and it introduces a new local scope; further analysis of its contents takes place when the function body is compiled.
3. Local Variables: Inside the function foo, the local variable c is added to the
table with its scope marked as Local (in foo).
4. Usage: When the compiler processes the statement a = a + b;, it checks the
symbol table for a and b, retrieves their types, and generates code based on their memory
addresses.
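A minimal sketch of such a table in Python, using a stack of dictionaries (one per scope), is shown below; the stored attributes are kept deliberately simple for illustration.
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                     # index 0 = global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, typ):
        # Record the identifier in the innermost (current) scope.
        self.scopes[-1][name] = {"type": typ, "scope_level": len(self.scopes) - 1}

    def lookup(self, name):
        # Search the innermost scope first, then the enclosing scopes.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

table = SymbolTable()
table.declare("a", "int")        # global
table.declare("b", "int")        # global
table.enter_scope()              # entering foo()
table.declare("c", "int")        # local to foo
print(table.lookup("a"))         # found in the global scope
table.exit_scope()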
Conclusion:
The symbol table plays a crucial role in resolving identifiers during the different phases of
compilation, ensuring correct code generation and type checking. It helps the compiler
know which identifiers refer to which variables or functions, even when they have local or
global scope distinctions.
The *phases of a compiler* represent the distinct stages a compiler goes through in order to translate high-level programming code (source code) into machine code (target code). Each phase processes the input from the previous phase, and together they ensure correct transformation from human-readable code to a format that a machine can execute. The phases are often categorized into two main groups: *analysis* and *synthesis*.
1. *Lexical Analysis*
- *Objective:* To break the source code into *tokens* (basic syntactical units such as keywords, operators, identifiers, etc.).
- *Example:* For the input int x = 5;, the output tokens might be:
- int (keyword)
- x (identifier)
- = (assignment operator)
- 5 (literal)
- ; (delimiter)
2. *Syntax Analysis*
- *Objective:* To analyze the token stream and verify that it follows the grammatical structure of the language.
- *Process:* It checks if the source program follows the correct syntactic structure (based on rules like BNF or CFG).
- *Output:* A parse tree or abstract syntax tree (AST), which represents the hierarchical
structure of the code.
- *Example:* For int x = 5;, the syntax analysis will check if this is a valid variable
declaration according to the language’s grammar.
3. *Semantic Analysis*
- *Objective:* To check the *semantic consistency* of the code, ensuring that operations and data types make sense together.
- *Process:* This phase involves type checking, verifying variable declarations, and
function calls for correctness.
- *Output:* The AST is annotated with type information, ensuring that semantic rules (e.g.,
no addition of integers with strings) are followed.
- *Example:* The semantic analyzer checks if int x = "hello"; is valid, and would flag it as
an error because of type mismatch.
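A toy illustration of this kind of check in Python (a hypothetical helper handling only int and string literals, not a real semantic analyzer):
def check_declaration(declared_type, literal):
    """Flag a type mismatch between a declared type and a literal initializer."""
    literal_type = "string" if literal.startswith('"') else "int"
    if declared_type != literal_type:
        raise TypeError(f"cannot initialize {declared_type} variable "
                        f"with a {literal_type} literal")

check_declaration("int", "5")            # valid: int x = 5;
# check_declaration("int", '"hello"')    # raises TypeError: int x = "hello";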
4. *Intermediate Code Generation*
- *Objective:* To produce an *intermediate representation (IR)* of the source code.
- *Process:* This IR is closer to machine code but is still abstract and can be optimized more easily.
- *Example:* For x = a + b * c, the three-address code might be:
t1 = b * c
t2 = a + t1
x = t2
5. *Code Optimization*
- *Objective:* To improve the intermediate code so that the resulting program runs faster or uses fewer resources.
- *Types of Optimization:* machine-independent optimizations (e.g., constant folding, dead code elimination, loop optimizations) and machine-dependent optimizations (e.g., register allocation, instruction scheduling).
6. *Code Generation*
- *Objective:* To convert the optimized intermediate code into *target code* (assembly or machine code) for the specific machine architecture.
- *Example:* For x = a + b;, the generated code might be:
MOV R1, a
ADD R1, b
MOV x, R1
7. *Code Linking and Loading*
- *Objective:* To handle external function calls, link libraries, and assign memory addresses to program variables and functions.
- *Process:* The linker resolves external references and combines object code into a
single executable.
- *Output:* The final *executable machine code*, ready for execution by the operating
system.
In summary, the phases are:
1. *Lexical Analysis*: Breaks the source code into tokens.
2. *Syntax Analysis*: Builds a parse tree or abstract syntax tree from the tokens.
3. *Semantic Analysis*: Checks the code for semantic consistency (types, declarations).
4. *Intermediate Code Generation*: Produces an intermediate representation of the program.
5. *Code Optimization*: Improves the intermediate code for speed and size.
6. *Code Generation*: Translates the optimized intermediate code into machine code.
7. *Code Linking and Loading*: Links external references and produces the final
executable.
These phases ensure that a compiler can correctly and efficiently translate high-level code
into machine code.
The *grouping of phases* in compiler design refers to how different phases of the compilation process can be logically grouped based on the nature of the tasks they perform. Generally, the phases of a compiler are grouped into two major categories: the *analysis* phase (front-end) and the *synthesis* phase (back-end).
The goal of the analysis phase is to *understand* and *validate* the source code. This
phase takes the high-level source code and breaks it down into a form that the compiler
can more easily process, while also ensuring that it is syntactically and semantically
correct.
- *Lexical Analysis*
- *Syntax Analysis*
- *Semantic Analysis*
- *Lexical Analysis*: Converts source code into tokens, which are the basic building
blocks of the code (keywords, operators, identifiers, etc.).
- *Syntax Analysis*: Checks that the token stream follows the grammar of the language and builds a parse tree or abstract syntax tree.
- *Semantic Analysis*: Ensures that the syntax follows the rules of meaning (semantics) for the programming language, such as type checking and variable declarations.
After the analysis phase, the compiler has a clear understanding of the structure and
meaning of the source code. The output of this phase is typically a well-formed *abstract
syntax tree (AST)*, possibly annotated with type and scope information.
The goal of the synthesis phase is to *translate* the intermediate representation into
target code, often with optimizations to improve performance or resource efficiency.
- *Intermediate Code Generation*
- *Code Optimization*
- *Code Generation*
- *Code Linking and Loading*
- *Intermediate Code Generation*: Converts the high-level abstract syntax tree into an intermediate form, which is easier to manipulate and optimize.
- *Code Optimization*: Improves the intermediate code by removing redundancies and applying transformations such as constant folding and dead code elimination.
- *Code Generation*: Produces the final machine code or assembly code from the optimized intermediate representation.
- *Code Linking and Loading*: Combines external libraries and object files, resolving
references and creating the final executable.
- *Clarity and Modularity*: By grouping the phases into analysis and synthesis, it becomes easier to conceptualize how the compiler works. The analysis phase is all about *understanding* the source code, while the synthesis phase focuses on *translating* and *optimizing* it for execution.
- *Separation of Concerns*: This grouping allows for better *separation of concerns*. The analysis phase is concerned with ensuring that the source code is correct and meaningful, whereas the synthesis phase focuses on translating and improving performance.
- *Modular Compiler Design*: In practice, many modern compilers are built in a modular
fashion where the analysis and synthesis phases are separated. This allows for easier
debugging, code reuse, and even support for multiple target architectures (through
separate synthesis phases).
- *Analysis Phase*:
1. *Lexical Analysis*: Converts the source code into tokens.
2. *Syntax Analysis*: Builds a parse tree from the tokens.
3. *Semantic Analysis*: Ensures the parse tree adheres to the language's semantics.
- *Synthesis Phase*:
1. *Intermediate Code Generation*: Produces an intermediate representation from the analyzed code.
2. *Code Optimization*: Improves the intermediate code.
3. *Code Generation*: Converts the optimized code into target machine code.
- *Front-End (Analysis)*:
- This includes *lexical analysis*, *syntax analysis*, and *semantic analysis*.
- The front-end checks for correctness and builds an intermediate representation of the
code.
- *Back-End (Synthesis)*:
- This includes *intermediate code generation*, *code optimization*, and *code generation*.
- The back-end takes the intermediate representation and produces optimized machine code for the target architecture.
### Conclusion
In summary, the *grouping of phases* into *analysis* and *synthesis* helps streamline the
compilation process. The *analysis phase* handles understanding and validating the
source code, while the *synthesis phase* deals with translating it into efficient machine
code. This separation simplifies the design and implementation of compilers, making them
more modular and easier to manage.
1. Recursive Descent Parsing:
• Recursive descent parsers are easy to implement but only work on grammars that are free of left recursion. If the grammar has left recursion, it can lead to infinite recursion.
2. Predictive Parsing:
• Predictive parsers use lookahead tokens to decide which rule to apply. They
typically need the grammar to be LL(1), meaning they only require a single lookahead token
to make parsing decisions.
• Predictive parsers can handle a restricted set of grammars but are efficient
for many practical languages.
3. Translation Process:
1. The parser starts with Expr and tries to match the input tokens.
2. Each non-terminal is expanded according to its production rules, with the next input token guiding which rule to apply.
3. For each recognized rule, the parser can translate the expression directly into intermediate code, as sketched below.
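The sketch below shows what such a top-down translator might look like in Python for a small expression grammar (Expr → Term (+ Term)*, Term → Factor (* Factor)*, Factor → identifier or number). Tokenization is simplified to whitespace splitting purely for illustration; it is not a full predictive parser.
class Parser:
    def __init__(self, text):
        self.tokens = text.split()
        self.pos = 0
        self.temp = 0
        self.code = []

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def new_temp(self):
        self.temp += 1
        return f"t{self.temp}"

    def parse_expr(self):
        # Expr -> Term ('+' Term)*
        left = self.parse_term()
        while self.peek() == "+":
            self.pos += 1
            right = self.parse_term()
            t = self.new_temp()
            self.code.append(f"{t} = {left} + {right}")
            left = t
        return left

    def parse_term(self):
        # Term -> Factor ('*' Factor)*
        left = self.parse_factor()
        while self.peek() == "*":
            self.pos += 1
            right = self.parse_factor()
            t = self.new_temp()
            self.code.append(f"{t} = {left} * {right}")
            left = t
        return left

    def parse_factor(self):
        # Factor -> identifier or number (no validation in this sketch)
        tok = self.peek()
        self.pos += 1
        return tok

p = Parser("a + b * c")
p.parse_expr()
print("\n".join(p.code))   # t1 = b * c, then t2 = a + t1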
Pros:
• Simple to implement by hand and easy to read and maintain.
• Can produce intermediate code directly as grammar rules are recognized.
Cons:
• Generally less efficient than bottom-up parsers for more complex languages,
especially those requiring significant lookahead.
Practical Use
Top-down translation is often used in compilers for languages with a simpler syntax or in
specific stages of larger compilers (e.g., expression parsing). It’s widely seen in interpreters
or lightweight language processors where fast implementation and readability are more
important than parsing complex syntax.
In compiler design, Syntax-Directed Definitions (SDDs) are used to specify the semantic
rules associated with the grammar of a language. These rules define how the syntax of a
language is translated into intermediate representations or how computations are carried
out during parsing.
Components of SDD
1. Attributes: Values associated with grammar symbols, such as a synthesized attribute val that holds the computed value of an expression.
2. Semantic Rules: Rules that define how the attributes are evaluated.
Example grammar for arithmetic expressions:
E → E1 + T
E→T
T → T1 * F
T→F
F → (E)
F → num
Now, we add semantic rules to compute the value of an expression. Let’s use a
Synthesized Attribute val to hold the value of a symbol.
Syntax-Directed Definition:
1. E → E1 + T { E.val = E1.val + T.val }
2. E → T { E.val = T.val }
3. T → T1 * F { T.val = T1.val * F.val }
4. T → F { T.val = F.val }
5. F → ( E ) { F.val = E.val }
6. F → num { F.val = num.lexval }
Explanation: Each production computes the val attribute of the non-terminal on its left-hand side from the val attributes of the symbols on its right-hand side.
Input: 2 + 3 * 4
1. Parse tree:
              E
            / | \
           E  +  T
           |    / | \
           T   T  *  F
           |   |     |
           F   F    num (4)
           |   |
          num num
          (2) (3)
2. Attribute evaluation:
• num = 2 → F.val = 2 → T.val = 2 → E.val (left operand) = 2
• num = 3 → F.val = 3 → T.val = 3
• num = 4 → F.val = 4
• T.val = 3 * 4 = 12
• E.val = 2 + 12 = 14
Output: 14
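A small Python sketch of this bottom-up evaluation, where each parse-tree node computes its val attribute from its children (the tree for 2 + 3 * 4 is built by hand here):
class Node:
    def __init__(self, op=None, value=None, children=()):
        self.op, self.value, self.children = op, value, children

    def val(self):
        if self.op is None:              # F -> num : val comes from the lexeme
            return self.value
        left, right = (c.val() for c in self.children)
        return left + right if self.op == "+" else left * right

# Parse tree for 2 + 3 * 4: E -> E + T, with T -> T * F on the right.
tree = Node("+", children=(Node(value=2),
                           Node("*", children=(Node(value=3), Node(value=4)))))
print(tree.val())   # 14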
Applications:
• Evaluating expressions during parsing.
• Type checking and other semantic analysis.
• Generating intermediate code or building syntax trees.