CSC 307 Compiling Techniques
Course Outline
Review of Compilers, Assemblers and Interpreters, Structure and functional aspects of a typical
compiler, Syntax, Semantics, and pragmatics, functional relationship between lexical analysis,
expression analysis and code generation, internal form of the source program. Use of a standard
compiler as a working example. Error detection and recovery, grammars and languages, the
parsing problem and the scanner. Pre-requisite: CSC 206.
Introduction
A compiler is a software tool that translates high-level programming language code (e.g., C, C++)
into machine code or binary executable that the computer’s hardware can execute. Role:
Compilers read the entire source code at once, analyze it, and generate a standalone executable
file, which can then be run independently without needing the compiler after the code is compiled.
Example: GCC (GNU Compiler Collection): GCC is commonly used to compile C/C++ code into
an executable. If we write a simple C program, `hello.c`, with the code:
```c
#include <stdio.h>

int main() {
    printf("Hello, World!\n");
    return 0;
}
```
Using the GCC compiler (`gcc hello.c -o hello`), this code is translated into an executable file
`hello`, which can be run on the command line to print "Hello, World!".
Key Advantage: Once compiled, the code runs faster since it is in machine-readable form.
An assembler is a utility that converts assembly language, which is a low-level language with a
close correspondence to machine code, into actual machine code. Role: Assemblers translate
mnemonics (like `MOV`, `ADD`, `SUB`) into binary opcodes understandable by the CPU.
Example: NASM (Netwide Assembler) for x86 assembly on Linux (the snippet below uses NASM
syntax; Microsoft's MASM plays the same role on Windows). NASM takes assembly code like:

```asm
section .data
msg db "Hello, World!", 0xA   ; the string plus a newline (14 bytes)
section .text
global _start
_start:
    mov eax, 4       ; sys_write system call
    mov ebx, 1       ; file descriptor (stdout)
    mov ecx, msg     ; address of the string
    mov edx, 14      ; length in bytes
    int 0x80         ; invoke the kernel
    mov eax, 1       ; sys_exit system call
    int 0x80
```

Using an assembler, this code is converted to machine code and executed by the CPU. It outputs
"Hello, World!" to the terminal.
An interpreter: is a program that executes code line-by-line or statement-by-statement without
converting it into an independent executable. Role: Interpreters read the source code, parse it, and
execute each line on the fly, typically providing immediate feedback, making it ideal for scripting
languages.
Example:
Python Interpreter: In Python, the code:

```python
print("Hello, World!")
```
is executed directly by the Python interpreter, without needing a compilation step. The interpreter
reads and executes the command to output "Hello, World!" immediately.
Key Advantage: Easier debugging and interactive execution, ideal for development, scripting, and
rapid prototyping.
Examples and Comparison
Compiled Language (C):
For example, the code snippet below in C:
```c
#include <stdio.h>

int main() {
    printf("Hello, World!\n");
    return 0;
}
```
When compiled using GCC (`gcc hello.c -o hello`), this creates an executable file `hello`. Running
this file directly from the command line (`./hello`) outputs "Hello, World!" without further
interaction from the compiler.
Interpreted Language (Python): in contrast, writing and running a similar "Hello, World!" program
in Python:
```python
print("Hello, World!")
```
The Python interpreter reads and executes the line, producing the output immediately without
creating an intermediate executable file.
In assembly, the "Hello, World!" code is closer to hardware and needs an assembler to translate
the human-readable assembly language instructions into machine language that can be executed
by the CPU. This requires knowledge of specific CPU instruction sets.
High-Level Architecture of Compiler vs. Interpreter
A diagram here can clarify the differences in workflows:
1. Compiler Architecture:
- Source Code → [Compiler] → Machine Code/Executable → [CPU Execution]
2. Interpreter Architecture:
- Source Code → [Interpreter] → [CPU Execution of Each Statement]
The interpreter translates and executes each line directly, while the compiler generates an
executable for the entire code. This distinction impacts performance, with compiled programs
often being faster due to pre-compiled binaries, while interpreted code benefits from flexibility
and ease of debugging.
2. Structure and Functional Aspects of a Typical Compiler
The process of compilation in a typical compiler involves several distinct phases, each responsible
for a specific part of transforming human-readable code into machine-executable code. Let’s go
through each phase in detail with examples, using a simple C statement as a running example.
Phases of Compilation
1. Lexical Analysis– Tokenizing the Source Code
Role: Lexical analysis (or scanning) is the first phase of compilation. It reads the source code
character by character and groups them into tokens — the smallest units of meaning, such as
keywords, identifiers, operators, and symbols.
Example:
For the statement `int a = b + c;`, the lexical analyzer generates tokens like `<keyword, int>`,
`<identifier, a>`, `<operator, =>`, `<identifier, b>`, `<operator, +>`, `<identifier, c>`, and
`<delimiter, ;>`.
Output: A sequence of tokens that are passed to the next phase for syntactic analysis.
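To make this concrete, here is a minimal tokenizer sketch in Python; the token categories and
regular expressions are illustrative assumptions, not any production compiler's actual tables:

```python
import re

# Illustrative token specification: (token type, regular expression).
TOKEN_SPEC = [
    ("KEYWORD",    r"\bint\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("DELIMITER",  r";"),
    ("SKIP",       r"\s+"),
]

def tokenize(source):
    """Scan the source left to right, yielding (type, value) tokens."""
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC)
    for match in re.finditer(pattern, source):
        if match.lastgroup != "SKIP":   # drop whitespace
            yield (match.lastgroup, match.group())

print(list(tokenize("int a = b + c;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'a'), ('OPERATOR', '='),
#  ('IDENTIFIER', 'b'), ('OPERATOR', '+'), ('IDENTIFIER', 'c'),
#  ('DELIMITER', ';')]
```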
2. Syntax Analysis– Building a Syntax Tree
Role: Syntax analysis (or parsing) takes the tokens produced by the lexical analyzer and organizes
them into a syntax tree based on the grammar rules of the programming language. This tree (often
called a parse tree) represents the structure of the code.
Example:
For `int a = b + c;`, a parse tree is created where `int` is a type declaration, `a` is an identifier,
`=` is an assignment, and `b + c` is an expression.
Output: A syntax tree that validates whether the code follows the correct syntactical rules.
3. Semantic Analysis– Ensuring Logical Correctness
Role: In semantic analysis, the compiler verifies that the syntax tree adheres to the language’s
semantic rules, such as type compatibility, variable declarations, and scope resolution.
Example:
For `int a = b + c;`, semantic analysis checks that:
`a`, `b`, and `c` are declared and in scope.
`b` and `c` are compatible for addition, and their result can be assigned to `a` (both should
ideally be of type `int` or compatible types).
Output: An annotated syntax tree with type information and other semantic details, or errors if
there are semantic violations.
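A hedged sketch of these checks in Python, assuming a hypothetical symbol table `symbols`
filled in by earlier declarations:

```python
# Hypothetical symbol table produced while processing declarations.
symbols = {"a": "int", "b": "int", "c": "int"}

def check_assignment(target, left, right):
    """Check declarations and type compatibility for `target = left + right`."""
    for name in (target, left, right):
        if name not in symbols:
            raise NameError(f"'{name}' undeclared")
    if symbols[left] != symbols[right]:
        raise TypeError("incompatible operand types for '+'")
    if symbols[target] != symbols[left]:
        raise TypeError("result type does not match the target's type")
    return symbols[target]   # the type annotation attached to the tree

print(check_assignment("a", "b", "c"))   # int
```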
4. Intermediate Code Generation – Converting to Intermediate Representation (IR)
Role: This phase translates the syntax tree into an intermediate representation (IR), which is a
lower-level, machine-independent code. The IR is closer to machine code but remains generic
enough to be optimized before the final machine code is generated.
Example
For `int a = b + c;`, an IR might look like:
T1 = b + c ; temporary variable T1 stores the result of b + c
a = T1 ; assigns the result to a
Output: Intermediate code (IR), often in a form like three-address code, that makes subsequent
optimization easier.
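A minimal sketch in Python of how a compiler might emit such three-address code; the
tuple-shaped AST `("+", "b", "c")` is an assumed representation for illustration:

```python
counter = 0

def new_temp():
    """Return a fresh temporary name: T1, T2, ..."""
    global counter
    counter += 1
    return f"T{counter}"

def gen_ir(node, code):
    """Walk a tuple AST ('+', left, right); append IR lines, return the value's name."""
    if isinstance(node, str):        # a plain variable or constant
        return node
    op, left, right = node
    l, r = gen_ir(left, code), gen_ir(right, code)
    temp = new_temp()
    code.append(f"{temp} = {l} {op} {r}")
    return temp

code = []
result = gen_ir(("+", "b", "c"), code)   # AST for b + c
code.append(f"a = {result}")
print("\n".join(code))                   # T1 = b + c, then a = T1
```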
5. Code Optimization – Refining IR for Performance
Role: The compiler optimizes the IR to improve performance without changing the functionality.
Optimizations include removing redundant instructions, minimizing memory usage, and
improving execution speed.
Example:
For `int a = b + c;`, if `b` and `c` are constants (e.g., `b = 2`, `c = 3`), the code can be optimized
by pre-calculating `b + c = 5`, thus simplifying the IR:
a = 5
Output: Optimized IR ready for code generation, which should be faster and more efficient.
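Constant folding, one of the simplest such optimizations, can be sketched over the same
tuple-shaped AST (again an illustrative sketch, not any particular compiler's pass):

```python
def fold(node):
    """Replace an operator node with a constant when both operands are known."""
    if not isinstance(node, tuple):
        return node                      # a variable name or a constant
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return {"+": left + right,
                "-": left - right,
                "*": left * right}[op]   # pre-compute at compile time
    return (op, left, right)

print(fold(("+", 2, 3)))              # 5: b + c folded when b = 2, c = 3
print(fold(("+", "b", ("*", 2, 3))))  # ('+', 'b', 6): partial folding
```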
6. Code Generation– Producing Machine Code
Role: In this phase, the optimized IR is translated into machine code specific to the target CPU
architecture. The machine code consists of low-level instructions that the hardware can directly
execute.
Example:
For `int a = b + c;`, code generation converts the IR into assembly or machine code instructions.
For instance, on an x86 architecture:
MOV EAX, [b] ; Load the value of b into register EAX
ADD EAX, [c] ; Add the value of c to EAX
MOV [a], EAX ; Store the result in a
7. Linking and Assembly – Producing the Final Executable
Role: If the program includes external libraries or functions (like `printf` in C), the linker
resolves these references, creating a complete executable. The assembly language instructions are
converted to machine instructions ready for execution.
Output: An executable file, like `a.out` or `hello.exe`, that can be run on the target system.
Diagram: Layered Diagram of Each Compilation Phase
The following diagram provides a high-level overview of the compilation process:
Source Code → [Lexical Analysis] → [Syntax Analysis] → [Semantic Analysis] →
[Intermediate Code Generation] → [Code Optimization] → [Code Generation] →
[Linking and Assembly] → Executable File
To summarize how `int a = b + c;` is transformed:
1. Lexical Analysis: Generates tokens like `<keyword, int>`, `<identifier, a>`, `<operator, =>`, `<identifier, b>`, `<operator, +>`, `<identifier, c>`,
`<delimiter, ;>`.
2. Syntax Analysis: Builds a syntax tree where `=` is the root, with `a` on the left and the expression `b + c` on the right.
3. Semantic Analysis: Checks type compatibility, ensuring `b` and `c` are integers and compatible for addition.
4. Intermediate Code Generation: Translates the tree into three-address code such as `T1 = b + c` followed by `a = T1`.
5. Code Optimization: Simplifies if possible, e.g., pre-computing values if `b` and `c` are constants.
6. Code Generation: Produces machine code such as `MOV EAX, [b]; ADD EAX, [c]; MOV [a], EAX;`.
7. Linking and Assembly: Combines with other modules to create a runnable executable.
This step-by-step breakdown and real example demonstrate how a high-level statement is
systematically transformed through each compiler phase into executable code. This structured
process helps optimize, validate, and ensure the program functions as intended.
3. Syntax, Semantics, and Pragmatics in Programming Languages
Understanding syntax, semantics, and pragmatics is essential in compiler design because they help
the compiler enforce the rules of a language, ensure logical correctness, and optimize code for
practical use. Here’s a detailed look at each concept with examples.
1. Syntax
Definition: Syntax is the set of grammatical rules that define the structure and form of statements in a
programming language. These rules dictate how symbols, keywords, and identifiers should be
arranged.
Role in Compilation: During the syntax analysis (parsing) phase, the compiler checks whether
code adheres to the correct structural rules of the language.
Examples
Valid Syntax:
```c
int x = 5;
```
Here, the statement follows C syntax rules for variable declaration and assignment. `int` is a type
specifier, `x` is an identifier, `=` is an assignment operator, and `5` is an integer literal. The
semicolon `;` indicates the end of the statement.
Invalid Syntax:
```c
int = 5 x;
```
This statement has incorrect syntax. The identifier `x` must come immediately after `int`, and
the assignment operator `=` should follow the identifier. The compiler would raise a syntax error
here.
Purpose: Syntax is essential for ensuring that the program’s structure conforms to the rules and
can be parsed into a syntax tree.
2. Semantics
Definition: Semantics refers to the meaning behind syntactically correct statements. Semantic rules
ensure that statements make sense logically and that operations are valid for the data types
involved.
Role in Compilation: In the semantic analysis phase, the compiler verifies that the program’s
meaning aligns with the language’s rules. It checks things like type compatibility, correct function
calls, and the logical validity of operations.
Examples
Valid Semantics
```c
int x = 5;
int y = x + 3;
```
Here, the statement `int y = x + 3;` is semantically correct because `x` and `3` are compatible
integer types, and `+` is a valid operation for integers.
Invalid Semantics
```c
int x = 5;
int y = x + "text";
```
This statement would produce a semantic error because the operation `x + "text"` tries to add
an integer (`x`) to a string (`"text"`), which is incompatible in languages like C. Even though the
syntax is correct, the compiler flags this as a type error during semantic analysis.
Purpose: Semantic checks are crucial to ensure logical correctness, preventing operations that
would otherwise cause runtime errors or undefined behavior.
3. Pragmatics
Definition: Pragmatics deals with the practical aspects of executing code, such as efficiency and
optimization, to ensure that the program performs well in real-world scenarios.
Role in Compilation: Pragmatics comes into play during the code optimization phase, where the
compiler applies transformations to improve performance without altering the program’s intended
outcome.
Examples
Pragmatic Optimization
```c
int x = 5;
int y = x + x + x;
```
The expression `x + x + x` could be optimized to `3 * x` by the compiler, replacing repeated
additions with a single multiplication.
Variable Frequency:
If the compiler detects that certain variables are frequently accessed, it may keep these variables
in CPU registers (register allocation) instead of repeatedly storing and retrieving them from
memory, reducing access time and improving performance.
Purpose: Pragmatic optimization makes code more efficient and resource-friendly without
changing its meaning or output.
Example Breakdown
Let’s examine an example with each concept applied:
```c
int a = 10;
int b = 20;
int c = a + b;
```
1. Syntax:
Syntax rules confirm that:
`int` is a recognized data type.
Each declaration follows the pattern `int <identifier> = <value>;` and ends with a semicolon.
2. Semantics:
Semantic checks confirm that `a` and `b` are declared integers, so `a + b` is a valid operation
whose result can be assigned to the integer `c`.
3. Pragmatics:
Since `a` and `b` hold known constants, a compiler may pre-compute `a + b = 30` and assign
`c = 30` directly.
Diagram: Syntax Tree for a Simple Arithmetic Expression
Here’s a syntax tree for the expression `a + b * c`, showing the syntactic structure and hierarchy.

            +
           / \
       ID(a)   *
              / \
          ID(b) ID(c)

Tree Structure:
The root node is the addition operator `+`, with `ID(a)` on the left and the subtree for `b * c` on
the right.
The expression `a + (b * c)` shows the precedence of operators: multiplication (`*`) is evaluated
before addition (`+`).
This overview illustrates how syntax, semantics, and pragmatics contribute to the compilation
process. Syntax provides the structural foundation, semantics ensures logical correctness, and
pragmatics optimizes performance. Together, these elements are fundamental in translating
human-readable code into efficient, executable programs.
4. Functional Relationship Between Lexical Analysis, Expression Analysis, and Code
Generation
In a compiler, Lexical Analysis, Expression Analysis, and Code Generation work sequentially and
are interdependent. Each phase builds upon the work of the previous one, transforming high-level
code into machine-executable instructions. Here’s a detailed explanation of each phase and how
they work together, along with an example.
1. Lexical Analysis
Lexical analysis, often called “scanning,” is the first phase of compilation. It reads the source
code and converts it into tokens — the smallest meaningful units, such as keywords, identifiers,
operators, and symbols.
Role in Compilation: The lexical analyzer tokenizes the source code and removes whitespace
and comments. The tokens generated are essential for the next phase, as they represent the
foundational elements of the language.
Example:
Consider the statement `int x = a + b * c;`. The scanner emits the token stream `[int]`, `[x]`,
`[=]`, `[a]`, `[+]`, `[b]`, `[*]`, `[c]`, `[;]`.
Output: A sequence of tokens is produced and passed to the expression analysis phase.
2. Expression Analysis
Expression analysis, often part of syntax and semantic analysis, is responsible for validating and
structuring tokens into meaningful expressions according to language rules. It builds syntax trees
and checks semantic correctness, like type compatibility.
Role in Compilation: Expression analysis interprets the structure of tokens by organizing them
into syntax trees and validating expressions. It ensures that expressions make sense logically and
follow the rules for operator precedence and associativity.
Example:
From the tokens generated by lexical analysis, expression analysis builds a syntax tree for the
expression `a + b * c`, based on operator precedence.
Syntax Tree:

            =
           / \
       ID(x)   +
              / \
          ID(a)   *
                 / \
             ID(b) ID(c)
The tree shows that `b * c` is evaluated first, followed by `a + (result of b * c)`, and then assigned
to `x`.
Output: A syntax tree or an intermediate representation (IR) for the expressions in the code.
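Python's standard `ast` module makes this visible for the analogous Python statement: the dump
below shows the `b * c` multiplication nested inside the addition, exactly as operator precedence
requires (the `indent` argument needs Python 3.9 or later).

```python
import ast

# Parse the analogous assignment and dump the right-hand expression.
tree = ast.parse("x = a + b * c")
print(ast.dump(tree.body[0].value, indent=2))
# Output (abbreviated):
# BinOp(left=Name(id='a', ...), op=Add(),
#       right=BinOp(left=Name(id='b', ...), op=Mult(), right=Name(id='c', ...)))
```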
3. Code Generation
Code generation translates the syntax tree or IR produced by expression analysis into low-level
code, often in assembly or machine language, that the computer’s processor can execute.
Role in Compilation: Code generation transforms the abstract structures created by expression
analysis into specific, executable instructions for the target machine.
Example:
For `int x = a + b * c;`, the code generator translates the syntax tree into assembly language.
For instance, on an x86 architecture (a sketch in the style of the earlier example):

MOV EAX, [b]    ; Load the value of b into register EAX
IMUL EAX, [c]   ; Multiply EAX by the value of c
ADD EAX, [a]    ; Add the value of a
MOV [x], EAX    ; Store the result in x

Output: The syntax tree is converted into assembly instructions that perform the operations
defined by the expression, ready to run on the target hardware.
Flowchart of Data from Lexical Analysis to Code Generation
Source Code (`int x = a + b * c;`) → [Lexical Analysis] →
Tokens ([int], [x], [=], [a], [+], [b], [*], [c], [;]) → [Expression Analysis] →
Syntax Tree / Intermediate Representation → [Code Generation] →
Executable Code
This breakdown of Lexical Analysis, Expression Analysis, and Code Generation illustrates the
functional relationship and dependency between each phase in a compiler. Each phase
incrementally translates and structures the code, ensuring it becomes efficient, executable machine
code.
5. Internal Form of the Source Program
The internal form of a program refers to the structured representation that a compiler creates and
processes after analyzing source code. This internal form includes elements like Abstract Syntax
Trees (AST), Symbol Tables, and Intermediate Representation (IR), each of which plays a crucial
role in optimizing and generating efficient machine code.
1. Abstract Syntax Trees (AST)
An Abstract Syntax Tree (AST) is a hierarchical, tree-like structure that represents the logical
structure of the source code, abstracting away syntactic details while focusing on the essential
elements and their relationships.
Role in Compilation: The AST serves as an intermediary representation of the program, used for
further analysis and transformations. It helps the compiler understand the structure and flow of the
code without extra syntax like parentheses or delimiters.
Example
Consider the following simple `if-else` statement:

```c
if (x > 0) {
    y = 1;
} else {
    y = -1;
}
```

The corresponding AST:

            IF-ELSE
           /        \
    Condition      Statements
        |           /      \
     (x > 0)     THEN      ELSE
                   |          |
                 y = 1     y = -1
2. Symbol Tables
A symbol table is a data structure in which the compiler records every identifier in the program
together with attributes such as its type and scope.
Role in Compilation: The symbol table is consulted throughout semantic analysis and code
generation to resolve names, check types, and track scopes.
Example:
For the above code, the symbol table might contain entries like:
+----------+------+--------+
| Variable | Type | Scope  |
+----------+------+--------+
| x        | int  | global |
| y        | int  | global |
+----------+------+--------+
The table records `x` and `y` as integer variables in the global scope. This information helps the
compiler enforce type rules and manage variable lifetimes.
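In code, a symbol table is often just a dictionary keyed by identifier; the field names below
mirror the table above and are an illustrative assumption:

```python
# One entry per declared identifier, mirroring the table above.
symbol_table = {
    "x": {"type": "int", "scope": "global"},
    "y": {"type": "int", "scope": "global"},
}

def lookup(name):
    """Fetch an identifier's attributes, or report it as undeclared."""
    try:
        return symbol_table[name]
    except KeyError:
        raise NameError(f"'{name}' undeclared") from None

print(lookup("x")["type"])   # int
```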
3. Intermediate Representation (IR)
The Intermediate Representation (IR) is a simplified, lower-level form of code between the high-
level source code and the final machine code. IR is easier for a compiler to analyze and optimize.
Role in Compilation: The compiler transforms the AST into IR, which is more uniform and thus
easier to optimize. It enables platform-independent optimizations before final code generation.
Example
For the above `if-else` code, the IR might look something like this in three-address code:

t1 = x > 0       // Evaluate condition
if t1 goto L1    // Jump to the if-branch when the condition is true
y = -1           // Else branch: y = -1
goto L2          // Skip over the if-branch
L1: y = 1        // If branch: y = 1
L2:              // End of the if-else

`t1 = x > 0`: The condition `x > 0` is evaluated and stored in a temporary variable `t1`.
`if t1 goto L1`: If `t1` is true, the control jumps to `L1`, where `y = 1`.
Otherwise, the code executes `y = -1` (the `else` branch), followed by a jump to `L2`, marking
the end.

Combined Example: AST, Symbol Table, and IR for a Simple `if-else` Statement
Let’s put it all together using the `if-else` statement example.
1. Source Code:
```c
if (x > 0) {
    y = 1;
} else {
    y = -1;
}
```

2. Abstract Syntax Tree (AST):

            IF-ELSE
           /        \
    Condition      Statements
        |           /      \
     (x > 0)     THEN      ELSE
                   |          |
                 y = 1     y = -1
3. Symbol Table
+----------+------+--------+
| Variable | Type | Scope  |
+----------+------+--------+
| x        | int  | global |
| y        | int  | global |
+----------+------+--------+
4. Intermediate Representation (IR):
t1 = x > 0       // Evaluate condition
if t1 goto L1    // Jump to the if-branch when the condition is true
y = -1           // Else branch: y = -1
goto L2          // Skip over the if-branch
L1: y = 1        // If branch: y = 1
L2:              // End of the if-else
The IR is a lower-level form that simplifies the program into a sequence of steps for optimization
and eventual translation to machine code.
These internal forms are essential in translating source code to efficient executable code, forming
the backbone of compiler functionality.
6. Use of a Standard Compiler as a Working Example
Using a standard compiler like GCC (GNU Compiler Collection) offers a practical way to
understand the steps involved in translating high-level code into machine-executable code. GCC
is widely used to compile C, C++, and other languages and provides various options to observe
different compilation stages, making it ideal for a detailed walkthrough.
1. Example Compiler: GCC (GNU Compiler Collection)
What is GCC?
GCC is a free, open-source compiler that supports multiple programming languages, including
C, C++, Objective-C, Fortran, and more.
It is known for its flexibility, reliability, and optimization capabilities and is commonly used on
Unix-like systems, including Linux.
Installation
GCC can be installed on most systems using package managers. For instance:
On Debian/Ubuntu:

```bash
sudo apt-get install gcc
```

2. Sample Program (`hello.c`):

```c
#include <stdio.h>

int main() {
    printf("Hello, World!\n");
    return 0;
}
```
Compilation Steps: To observe each phase of compilation, we can use the `-v` (verbose) option
to show detailed information during compilation. Additional flags can break down the phases.
Step-by-Step Breakdown of Compilation Stages.
1. Lexical Analysis and Syntax Analysis:
Command:

```bash
gcc -E hello.c -o hello.i
```

The `-E` option tells GCC to stop after the preprocessing stage, producing a file with all
macros expanded (`hello.i`).
During preprocessing, GCC expands all macros, processes directives like `#include <stdio.h>`,
and removes comments.
Output:
The output file (`hello.i`) contains the source code after preprocessing, including expanded
library content and macros. This is also where lexical analysis and initial syntax checks occur.
2. Syntax Analysis and Parsing:
Command:

```bash
gcc -S hello.c -o hello.s
```
The `-S` option instructs GCC to compile the source code into assembly code, producing a
`.s` file. At this stage, GCC parses the code, builds an Abstract Syntax Tree (AST), and performs
syntax and semantic checks.
Output:
The output assembly file (`hello.s`) represents the code in assembly language, showing the
program logic in terms of low-level instructions.
3. Intermediate Code Generation and Optimization:
Command:

```bash
gcc -c hello.c -o hello.o
```
The `-c` flag compiles the code into an object file (`.o`), which is in machine-readable
format. During this phase, GCC generates intermediate code, optimizes it, and then translates it
into binary code.
Output:
The output object file (`hello.o`) contains binary code, which is platform-specific and ready for
linking.
4. Linking and Code Generation:
Command:

```bash
gcc hello.o -o hello
```
Without any specific flags, GCC performs linking. It combines the object file (`hello.o`) with
standard libraries and other dependencies to produce the final executable (`hello`).
Output:
The final output is the executable file (`hello`), which can be run on the system:
```bash
./hello
```
Output on execution:
Hello, World!
Detailed Walkthrough of Compilation Phases with GCC
Lexical and Syntax Analysis (Preprocessing and Tokenization): In this phase, GCC preprocesses
the code, removing comments, handling macros, and expanding includes. Example: If we compile
with `gcc -E hello.c -o hello.i`, we can see the preprocessed code where `#include <stdio.h>` has
been expanded with the full content of the standard I/O library.
AST Generation and Semantic Analysis (Parsing):
GCC checks the syntax and semantics of the code. Errors like undeclared variables, type
mismatches, and other logical inconsistencies are caught here. Example: Adding an undeclared
variable, like `x = 10;`, will trigger a semantic error (`error: 'x' undeclared`) when we try to compile
it.
Intermediate Representation (IR) and Code Optimization: GCC generates an IR, which is
optimized before being converted to assembly. For example, in loops or conditions, GCC may
optimize repeated calculations.
Example: The command `gcc -S hello.c -o hello.s` generates an assembly representation in
`hello.s`.
Code Generation and Linking:
In this final phase, the optimized IR is converted into machine code, and linking occurs to
resolve function calls (e.g., `printf` from the C standard library).
Example: The executable is produced by combining the object files with any necessary library
functions.
GCC Output Stages (Illustrated)
Here’s an outline of each output generated by GCC during each stage:
1. Source Code (`hello.c`):
#include <stdio.h>
int main() {
printf("Hello, World!\n");
return 0;
}
2. Preprocessed Code (`hello.i`): Expanded to include all macros and standard I/O library content.
3. Assembly Code (`hello.s`):
.section __TEXT,__text,regular,pure_instructions
main:
pushq %rbp
movq %rsp, %rbp
leaq L_.str(%rip), %rdi
callq _printf
xorl %eax, %eax
popq %rbp
retq
4. Object Code (`hello.o`): Binary format, not human-readable.
5. Executable (`hello`):
When run, it outputs:
Hello, World!
By using a standard compiler like GCC and observing each phase with specific commands, we
gain a clear, practical understanding of how source code transforms from human-readable text into
machine-executable instructions.
7. Error Detection and Recovery in Compilation
Error detection and recovery are essential aspects of compilation, enabling a compiler to handle
code errors effectively. GCC and other compilers detect various types of errors at different stages
of the compilation process, from lexical analysis to semantic analysis. Once an error is detected,
recovery techniques allow the compiler to continue parsing the code, providing feedback for
multiple errors in a single pass.
1. Error Types in Compilation
1. Lexical Errors
Lexical errors occur when invalid tokens (such as misspelled keywords or incorrect symbols) are
present in the code.
Example:
Faulty Code:
```c
inta x = 5; // Mistyped "int" as "inta"
```
Error Message: `error: unknown type name 'inta'`
During lexical analysis, `inta` is not recognized as a valid keyword, and the compiler reports an
unknown type name when it is used as one.
2. Syntax Errors
Syntax errors are violations of the grammatical rules of the language, such as missing
semicolons, unmatched parentheses, or incorrect structuring of statements.
Example:
Faulty Code:
```c
int x = 5
printf("Hello, World!\n");
```
Error Message: `error: expected ';' before 'printf'`
The parser detects that the declaration `int x = 5` is missing its terminating semicolon.
3. Semantic Errors
Semantic errors occur when syntactically valid code violates the meaning rules of the language,
such as type mismatches. For example, the declaration `int x = "text";` is well-formed
syntactically but invalid semantically.
Error Message: `error: incompatible types when initializing type ‘int’ with an expression of type
‘char *’`
The semantic analyzer detects that the string `"text"` cannot be assigned to an integer
variable `x` due to a type mismatch.
2. Error Recovery Techniques
Once an error is detected, the compiler can attempt to recover and continue analyzing the rest of
the code. Common recovery techniques include panic mode recovery and phrase level recovery.
1. Panic Mode Recovery
In panic mode, the compiler discards tokens until it reaches a known synchronization point,
such as a semicolon or a brace (`{` or `}`), allowing it to continue parsing from that point.
Example
Faulty Code:
int x = 5
printf("Hello, World!\n");
int y = 10;
After detecting the missing semicolon in `int x = 5`, the compiler skips tokens until it
reaches the `printf` statement, synchronizing at the semicolon at the end of the line. This enables
the compiler to report further errors if any are found in `int y = 10`.
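A sketch of panic-mode recovery over a flat token list; the token layout and the choice of
synchronization tokens are assumptions for illustration:

```python
SYNC_TOKENS = {";", "}"}   # assumed synchronization points

def panic_recover(tokens, pos):
    """Discard tokens until a synchronization point, then resume just past it."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1
    return pos + 1

# Tokens for: int x = 5 printf("Hello, World!\n"); int y = 10;
tokens = ["int", "x", "=", "5", "printf", "(", '"Hello, World!\\n"', ")", ";",
          "int", "y", "=", "10", ";"]
# The parser detects the error at 'printf' (index 4), skips to the ';', and
# resumes with the next statement:
print(tokens[panic_recover(tokens, 4):])   # ['int', 'y', '=', '10', ';']
```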
2. Phrase Level Recovery
Phrase level recovery attempts to fix errors locally by making small adjustments,
such as inserting, replacing, or deleting tokens, so that the compiler can proceed without
discarding larger portions of code.
Example:
Faulty Code:
int x = 5;
prinf("Hello, World!\n"); // "printf" misspelled
The compiler detects that `prinf` is unrecognized. In some cases, it may suggest
corrections, such as replacing `prinf` with `printf`. Phrase level recovery helps the compiler avoid
missing additional errors that follow.
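The familiar "did you mean `printf`?" style of suggestion can be produced by fuzzy-matching the
unknown name against known identifiers; a sketch using Python's standard `difflib`:

```python
import difflib

known_functions = ["printf", "scanf", "putchar", "puts"]   # assumed symbol list

def suggest(name):
    """Return the closest known identifier to a misspelled one, if any."""
    matches = difflib.get_close_matches(name, known_functions, n=1)
    return matches[0] if matches else None

print(suggest("prinf"))   # printf
```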
3. Practical Example: Compilation of Faulty Code and Error Messages
Suppose we compile the following C code with GCC:

```c
#include <stdio.h>

int main() {
    int x = 5          // Missing semicolon: syntax error
    prinf("Hello\n");  // Misspelled printf
    int y = 10;
    return 0;
}
```

GCC first reports the missing semicolon (a message like `error: expected ';' before 'prinf'`)
and, after recovering, can also flag the unknown function `prinf`, illustrating how recovery lets
the compiler surface multiple errors in a single pass.
Flowchart of Error Detection and Recovery
Here’s a simplified flowchart of error detection and recovery within a compiler:

+--------------------------+
|    Start Compilation     |
+--------------------------+
             |
             v
+--------------------------+
|     Lexical Analysis     |
+--------------------------+
             |
             v
+--------------------------+
|     Syntax Analysis      |
+--------------------------+
             |
             v
+--------------------------+
|    Semantic Analysis     |
+--------------------------+
             |
      (error detected)
             |
             v
+--------------------------+
|      Error Recovery      |
+--------------------------+
             |
             v
+--------------------------+
|      Report Errors       |
+--------------------------+
             |
             v
+--------------------------+
|   Continue Compilation   |
+--------------------------+
Error detection and recovery are essential for a compiler to provide meaningful feedback on code
errors. Compilers like GCC detect different types of errors—lexical, syntax, and semantic—each
with its unique characteristics. Through error recovery techniques like panic mode and phrase level
recovery, the compiler can continue processing code and provide feedback on multiple issues,
aiding developers in resolving errors efficiently.
8. Grammars and Languages in Compiler Design
Grammars and languages are foundational in defining the syntax of programming languages,
allowing compilers to understand and process code correctly. This section explores Formal
Language Theory, Context-Free Grammars (CFG), and how they are used to validate and parse
code structures.
1. Formal Language Theory
Formal language theory provides the mathematical framework for defining the syntax of
programming languages. Key concepts include:
Alphabet: A finite set of symbols. For programming languages, the alphabet may include
characters, digits, and symbols (e.g., `+`, `-`, `*`, `/`).
String: A sequence of symbols from the alphabet (e.g., `x + y`).
Language: A set of strings formed from the alphabet that are valid according to specified rules
(syntax).
Grammar: A set of production rules that define the syntax of a language, determining which
strings are valid in that language.
Using formal language theory, we define a programming language's grammar and rules, allowing
a compiler to distinguish valid code from invalid code.
2. Context-Free Grammars (CFG)
Context-Free Grammars (CFG) are a common type of grammar used to define the syntax of
programming languages. CFGs consist of:
Non-terminal Symbols: Abstract symbols representing patterns or structures in the language
(e.g., `<expression>`, `<term>`, `<factor>`).
Terminal Symbols: Symbols from the language's alphabet that represent actual values or
operations (e.g., numbers, operators).
Production Rules: Rules that define how non-terminal symbols can be replaced by terminal or
other non-terminal symbols to form valid expressions.
Start Symbol: The initial non-terminal symbol from which parsing begins, often representing a
complete statement or expression.
Example CFG for Arithmetic Expressions: Let’s define a CFG that describes simple arithmetic
expressions with addition, subtraction, multiplication, and parentheses.
1. Grammar Rules:
<expression> ::= <expression> + <term> | <expression> - <term> | <term>
<term> ::= <term> * <factor> | <term> / <factor> | <factor>
<factor> ::= ( <expression> ) | <number>
<number> ::= [0-9]+
Explanation:
- `<expression>`: Represents an entire arithmetic expression. It can be an addition or subtraction
of terms or a single term.
- `<term>`: Represents parts of the expression separated by `*` or `/` operators, forming
individual factors.
- `<factor>`: Represents either a number or a grouped expression in parentheses.
- `<number>`: Represents a single or sequence of digits (0-9).
2. Sample Expression and Derivation:
Let’s take the expression `3 + 5 * (2 - 1)` and see how it can be derived using the grammar.
Derivation Steps (a leftmost derivation):
Start with the `<expression>` symbol:
<expression>
Apply the rule `<expression> ::= <expression> + <term>`:
=> <expression> + <term>
Derive the left operand down to the number `3` (via `<expression> ::= <term>`,
`<term> ::= <factor>`, `<factor> ::= <number>`):
=> <term> + <term>
=> <factor> + <term>
=> 3 + <term>
Apply `<term> ::= <term> * <factor>` and derive `5`:
=> 3 + <term> * <factor>
=> 3 + <factor> * <factor>
=> 3 + 5 * <factor>
Apply `<factor> ::= ( <expression> )` and derive `2 - 1` inside the parentheses:
=> 3 + 5 * ( <expression> )
=> 3 + 5 * ( <expression> - <term> )
=> 3 + 5 * ( 2 - 1 )
3. Example Parse Tree for Arithmetic Expression
A parse tree is a visual representation of how an expression is derived from the CFG, showing the
relationships between symbols and production rules.
Parse Tree for `3 + 5 * (2 - 1)`:

                 <expression>
                /      |     \
      <expression>    '+'    <term>
            |                /  |  \
         <term>       <term>  '*'  <factor>
            |            |         /   |   \
        <factor>     <factor>   '('  <expression>  ')'
            |            |           /     |     \
           '3'          '5'  <expression> '-'  <term>
                                   |             |
                                <term>       <factor>
                                   |             |
                               <factor>         '1'
                                   |
                                  '2'
In this parse tree:
The root node is `<expression>`, representing the full arithmetic expression.
Each branch represents the application of a production rule, breaking down the expression into
individual terms and factors.
Leaf nodes represent terminal symbols (numbers or operators).
Practical Example Using CFG
Consider the code for a simple expression evaluator. By implementing a parser based on the CFG
for arithmetic expressions, the evaluator can determine if an input string is a valid expression and
then compute its value.
Example Code:

```python
def evaluate_expression(expression):
    # eval stands in for CFG-based parsing and evaluation here
    return eval(expression)

expression = "3 + 5 * (2 - 1)"
result = evaluate_expression(expression)
print(result)  # prints 8
```
Here, we use Python’s `eval` to simplify evaluation, but a real compiler uses custom parsing and
evaluation methods according to CFG rules.
Grammars and languages form the foundation of programming language syntax, enabling
compilers to parse and validate code. Context-Free Grammars, in particular, are crucial for
defining rules that establish valid syntax for expressions and statements. By using parse trees and
CFGs, we can systematically represent, validate, and understand code structures, making
grammars and languages an essential part of compiler design.
9. The Parsing Problem and the Scanner in Compiler Design
Parsing is a critical step in compiling that involves analyzing the structure of code to ensure it
conforms to the grammar rules of the programming language. Parsing converts a sequence of
tokens (provided by the scanner) into a syntax tree, representing the hierarchical structure of the
code. This process involves two main types of parsing algorithms, each with unique approaches:
Top-Down Parsing and Bottom-Up Parsing. The scanner plays a crucial role in identifying tokens
for the parser, feeding it a clean, structured sequence of elements to process.
1. Parsing Algorithms
Parsing algorithms determine how to traverse and apply grammar rules to a given sequence of
tokens. Here, we’ll focus on two primary types: Top-Down Parsing and Bottom-Up Parsing.
A. Top-Down Parsing
Top-down parsing begins at the start symbol (the highest level of the grammar) and attempts to
expand it down to the individual tokens by applying production rules.
Approach:
It proceeds from the root of the parse tree and expands each non-terminal based on the grammar
rules.
Common types of top-down parsers include LL parsers (where "L" stands for "left-to-right" and
the second "L" stands for "leftmost derivation").
Example of LL Parsing:
Grammar for a simple arithmetic expression:
<expression> ::= <term> | <expression> + <term>
<term> ::= <factor> | <term> * <factor>
<factor> ::= ( <expression> ) | number
Expression: `3 + 5 * 2`
Top-Down Parsing Steps:
1. Start with `<expression>`.
2. Expand `<expression>` to `<expression> + <term>`.
3. Expand `<expression>` to `<term>`, then `<term>` to `<factor>`.
4. Resolve `<factor>` to `3`, then expand the remaining parts similarly.
Result: A parse tree rooted at `<expression>`.
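In practice, an LL parser is often written as a recursive-descent parser with one function per
non-terminal. The sketch below implements the grammar above in Python, rewriting the
left-recursive rules as iteration (a standard transformation) and returning the parse tree as
nested tuples:

```python
import re

def parse(source):
    tokens = re.findall(r"\d+|[()+*]", source)   # trivial scanner
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expression():            # <expression> ::= <term> { + <term> }
        nonlocal pos
        node = term()
        while peek() == "+":
            pos += 1             # consume '+'
            node = ("+", node, term())
        return node

    def term():                  # <term> ::= <factor> { * <factor> }
        nonlocal pos
        node = factor()
        while peek() == "*":
            pos += 1             # consume '*'
            node = ("*", node, factor())
        return node

    def factor():                # <factor> ::= ( <expression> ) | number
        nonlocal pos
        if peek() == "(":
            pos += 1             # consume '('
            node = expression()
            pos += 1             # consume ')'
            return node
        node = peek()            # a number token
        pos += 1
        return node

    return expression()

print(parse("3 + 5 * 2"))        # ('+', '3', ('*', '5', '2'))
```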
B. Bottom-Up Parsing
Bottom-up parsing begins from the leaves (individual tokens) and attempts to construct the parse
tree by working up to the root symbol, reversing the production rules.
Approach:
Commonly used by LR parsers (where "L" stands for "left-to-right" and "R" stands for "rightmost
derivation in reverse"). The parser reduces groups of tokens and symbols to non-terminals until it
reaches the start symbol.
Example of LR Parsing:
Grammar: Use the same grammar for arithmetic expressions as above.
Expression: `3 + 5 * 2`
Bottom-Up Parsing Steps:
1. Start with tokens `3`, `+`, `5`, `*`, and `2`.
2. Reduce `5 * 2` to a `<term>`.
3. Then reduce `3 + <term>` to `<expression>`.
Result: A parse tree rooted at `<expression>`, similar to the top-down result but constructed from
leaves to the root.
2. Scanner Role
The scanner, also known as the lexical analyzer, is responsible for breaking the source code into a
stream of tokens, such as keywords, operators, and literals, that the parser can process. Each token
represents a meaningful unit of the code, helping the parser by eliminating irrelevant characters
like whitespace and comments.
Scanner Process: Reads the source code character-by-character. Groups characters into tokens
based on predefined patterns, such as identifiers, numbers, operators, etc.
Emits a sequence of tokens, including token type and value, to the parser.
Example:
Given a line of code: `int x = 5 + y;`
The scanner would produce a token stream:
[int, IDENTIFIER(x), =, NUMBER(5), +, IDENTIFIER(y), ;]
Each token is passed sequentially to the parser, enabling it to apply grammar rules to generate a
parse tree.
3. Example: Parsing a Simple Expression
Consider the expression `3 + 5 * 2` and parse it using both top-down and bottom-up methods.
Grammar:
<expression> ::= <term> | <expression> + <term>
<term> ::= <factor> | <term> * <factor>
<factor> ::= number
Top-Down Parsing (LL Parser)
1. Start with `<expression>`.
2. Expand `<expression>` to `<expression> + <term>`.
3. Expand `<expression>` to `<term>`, then `<term>` to `<factor>`, resulting in `3`.
4. Expand the remaining part of `<term> + <factor>` with `5 * 2`.
5. Complete the parse by building a tree from the start symbol `<expression>`.
Bottom-Up Parsing (LR Parser)
1. Start with individual tokens `3`, `+`, `5`, `*`, and `2`.
2. Reduce `5 * 2` to `<term>`.
3. Then reduce `3 + <term>` to `<expression>` by applying rules in reverse.
4. The final parse tree is the same as with top-down parsing but constructed from the leaves up.
Comparison of Top-Down and Bottom-Up Parsing
Both strategies produce the same parse tree for `3 + 5 * 2`:

            <expression>
           /      |      \
  <expression>   '+'    <term>
        |               /  |  \
     <term>      <term>  '*'  <factor>
        |           |           |
    <factor>    <factor>       '2'
        |           |
       '3'         '5'

A top-down (LL) parser builds this tree from the root `<expression>` down to the leaves, while a
bottom-up (LR) parser builds it from the leaf tokens up to the root.
Parsing is essential for syntax analysis in compilers, transforming token streams into parse trees
or syntax trees. Top-Down Parsing works from the root down to the leaves, while Bottom-Up
Parsing builds from the leaves up to the root. The scanner supports parsing by transforming source
code into tokens, simplifying the parser's task. Both parsing methods are foundational, and each is
suitable for different types of grammars, making them core techniques in compiler design.