
Unit I: Introduction to Compilers

Structure of a Compiler:
Lexical Analysis – Role of Lexical Analyzer

The lexical analyzer, also known as a scanner, plays a crucial role as the first stage in compiler design. It acts as an interface between the programmer's source code and the rest of the compiler. Here are some key aspects of its function:

Breaking Down the Code:


● The lexical analyzer reads the source
code character by character. It groups
these characters into meaningful units
called tokens. These tokens are the
smallest building blocks that have a
specific meaning in the programming
language.
● Examples of tokens include keywords
(like if, for), identifiers (variable names
like x, total), operators (arithmetic
operators like +, -), and literals (numeric
or string values like 10, "hello").

● Input Pre-processing: It cleans up the input by removing comments, whitespace, and other non-essential characters from the source text, preparing it for lexical analysis.
● Tokenization: This is the process of breaking the input text into a sequence of tokens. A token is a string of characters that is treated as a single logical entity. For example, keywords, identifiers, constants, operators, and special symbols are all considered tokens.
● Token Classification: After tokenization, each token is classified into categories such as keywords, identifiers, numbers, operators, and separators. This classification is based on the rules of the programming language.
● Token Validation: The Lexical Analyzer checks each token to ensure it is valid according to the programming language's rules. This step is essential to prevent syntax errors at later stages.
● Output Generation: Finally, the Lexical Analyzer generates the output of the lexical analysis process, which is typically a list of tokens. These tokens are then passed to the Syntax Analyzer (Parser) for the next phase of compilation.
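For a concrete illustration (a standard example, not from the original notes), consider the statement below and the token stream a lexical analyzer might emit for it:

Source line:   total = count + 10;
Token stream:  identifier(total), operator(=), identifier(count),
               operator(+), literal(10), separator(;)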

Input Buffering
In compiler design, input buffering is an
optimization technique used to improve the
efficiency of the lexical analysis stage. The
lexical analyzer, or scanner, is responsible for
reading the source code and breaking it down
into tokens (keywords, identifiers, operators,
etc.).

Here's how input buffering works:


● Reading in Chunks: Instead of reading the
source code one character at a time, the
compiler reads a larger block of characters
(the buffer) into memory. This buffer size
can be fixed or variable depending on the
compiler design.
● Processing the Buffer: The lexical analyzer
then processes the characters in the buffer,
identifying tokens. It searches for patterns
within the buffer that match token
definitions.

Benefits of Input Buffering:


Reduced I/O Overhead: One of the main advantages of
input buffering is that it reduces the number of
input/output (I/O) operations required. Reading one
character at a time involves frequent calls to the operating
system, which can be slow. By reading in larger chunks,
the compiler minimizes these calls, improving
performance.

Lookahead Capability: Having a buffer allows the lexical analyzer to peek ahead at upcoming characters. This lookahead capability can be helpful for identifying certain tokens or resolving ambiguities.
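The following C program is a minimal sketch of this idea (the buffer size and function names are illustrative, not from the notes): it refills a fixed-size buffer only when it has been consumed, and supports a one-character lookahead.

#include <stdio.h>

#define BUF_SIZE 4096                 /* assumed buffer size */

static char buf[BUF_SIZE];
static int  len = 0, pos = 0;         /* valid bytes and current position */

/* Return the next character, refilling the buffer from the file only
   when it has been fully consumed (one fread per BUF_SIZE characters). */
static int next_char(FILE *src) {
    if (pos == len) {
        len = (int)fread(buf, 1, BUF_SIZE, src);
        pos = 0;
        if (len == 0) return EOF;     /* no more input */
    }
    return (unsigned char)buf[pos++];
}

/* Peek at the upcoming character without consuming it (lookahead). */
static int peek_char(FILE *src) {
    int c = next_char(src);
    if (c != EOF) pos--;              /* push it back into the buffer */
    return c;
}

int main(void) {
    while (peek_char(stdin) != EOF) { /* look ahead, then consume */
        putchar(next_char(stdin));    /* echo input read through the buffer */
    }
    return 0;
}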
In compiler design, there are three specifications
for tokens: strings, language, and regular
expressions:

● Strings: A finite set of characters or symbols


● Language: A set of strings
● Regular expressions: Used to specify the different
types of patterns that can form tokens.

Tokens are defined based on patterns of characters in


the programming language. These patterns are
typically expressed using:

○ Keywords: Reserved words with specific


meanings in the language (e.g., if, for,
while).
○ Identifiers: User-defined names for variables,
functions, or other entities (e.g., x, total,
calculateArea). They follow specific naming
rules defined by the language.
○ Operators: Symbols used for performing operations (e.g., +, -, *, /).
● Regular Expressions: These are patterns that describe the structure of tokens. For example, a regular expression might specify that identifiers must start with a letter followed by any combination of letters and digits.
● String Language: This refers to the set of all strings (sequences of characters) that a token can represent. For instance, the string language for a number might include any sequence of digits.
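For instance (standard textbook patterns, not taken from these notes), token patterns are often written as regular expressions, where letter stands for [A-Za-z] and digit for [0-9]:

identifier  ->  letter ( letter | digit )*
number      ->  digit digit*          (that is, one or more digits)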

Finite Automata (FA): These are simple state machines


that can recognize patterns of characters. An FA has a
set of states, transitions between states triggered by
specific characters, and designated start and accepting
states. The scanner reads the source code character by
character and transitions between states based on the
character. If it reaches an accepting state and the
consumed characters match the defined pattern, a
token is identified.

Lexical Analyzer Generators (LAGs): These are tools


that automate the process of building a scanner based
on a set of token specifications. The programmer
provides the token definitions (often using regular
expressions), and the LAG generates the code for the
scanner that can recognize these tokens efficiently.
Lex in compiler design

In compiler design, Lex is a computer program that


generates lexical analyzers, also known as scanners
or tokenizers. Lexical analyzers convert a stream of
characters into tokens, which identify lexical patterns
in input programs and convert input text into a
sequence of tokens. Lex is often used with the YACC
(Yet Another Compiler Compiler) parser generator.

Lex
Lex is a tool or a computer program that generates Lexical Analyzers (it converts a stream of characters into tokens). The Lex tool itself is a compiler: the Lex compiler takes a Lex specification as input and turns its patterns into a working scanner. It is commonly used with YACC (Yet Another Compiler Compiler).
A Lex program has three parts:

● Declarations: Includes declarations of variables


● Translation rules: Consists of pattern and action
● Auxiliary procedures: Holds auxiliary functions used in
the actions
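A minimal Lex specification sketch illustrating this three-part layout is shown below; the token codes NUM and ID are assumptions made here for the example (in practice they often come from a YACC-generated header):

%{
/* Declarations: C includes and (assumed) token codes */
#include <stdio.h>
#define NUM 1
#define ID  2
%}

%%
 /* Translation rules: pattern  action */
[ \t\n]+              ;                        /* skip whitespace */
[0-9]+                { return NUM; }
[A-Za-z][A-Za-z0-9]*  { return ID; }
.                     { printf("unexpected: %s\n", yytext); }
%%

/* Auxiliary procedures used alongside the actions */
int yywrap(void) { return 1; }

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)               /* yylex returns 0 at end of input */
        printf("token code: %d\n", tok);
    return 0;
}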
Finite Automata

Finite automata (FA), also known as finite-state machines


(FSM), are a fundamental concept in computer science,
particularly automata theory and compiler design. They are
abstract models of computation that can be used to
recognize patterns in strings.

Here's a breakdown of the key aspects of finite automata:

Components:

● States: A finite set of states that the automaton can be


in at any given time. One state is designated as the
start state, where the automaton begins processing
the input. Additionally, there can be one or more
accepting states that signify successful recognition of
a pattern.
● Input Alphabet: A set of symbols that the automaton
can read as input. This alphabet can include
characters, digits, or any other relevant symbols
depending on the application.
● Transition Function: A function that defines how the
automaton transitions between states based on the
current state and the input symbol it reads. This
function essentially determines the rules that govern
the automaton's behavior.

How it Works:

1. The automaton starts in the designated start state.


2. It reads an input symbol from the input string.
3. Based on the current state and the input symbol, the
transition function determines the next state for the
automaton.
4. This process of reading, transitioning, and reading
again continues until the entire input string is
consumed.
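As a small hedged sketch (the particular pattern and names are illustrative, not from the notes), the C program below simulates a DFA that accepts identifiers of the form letter (letter | digit)*:

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = accepting (letter seen, then letters/digits), 2 = dead */
static int step(int state, unsigned char c) {
    switch (state) {
    case 0:  return isalpha(c) ? 1 : 2;
    case 1:  return isalnum(c) ? 1 : 2;
    default: return 2;                    /* dead state traps all further input */
    }
}

static int accepts(const char *s) {
    int state = 0;                        /* start state */
    for (; *s != '\0'; s++)
        state = step(state, (unsigned char)*s);
    return state == 1;                    /* accept iff we end in an accepting state */
}

int main(void) {
    printf("%d\n", accepts("total3"));    /* prints 1: valid identifier  */
    printf("%d\n", accepts("3total"));    /* prints 0: starts with digit */
    return 0;
}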

Types of Finite Automata:

● Deterministic Finite Automata (DFA): In a DFA, for any


given state and input symbol, there is always a single
unique next state. This makes DFAs simpler to
analyze and implement.
● Nondeterministic Finite Automata (NFA): In an NFA, a
state might have multiple possible transitions for a
single input symbol. This allows for more flexibility in
pattern recognition but can be more complex to
analyze.

Minimizing a DFA (Deterministic Finite


Automaton) is the process of transforming a
given DFA into an equivalent DFA with the
minimum number of states. Here's why it's
important:
● Reduced Complexity: A minimized DFA
is simpler and easier to understand. This
is beneficial for debugging, analyzing,
and potentially modifying the automaton.
● Improved Efficiency: Minimization can
lead to a smaller memory footprint for
storing the automaton and potentially
faster execution for pattern recognition
tasks.

● Flexibility: Regular expressions offer a convenient way to define complex patterns that can be systematically converted into efficient FAs.

The Minimization Process:


There are several algorithms for minimizing
DFAs. Here's a common approach that
involves two main steps:
1. State Equivalence Partitioning: group together states that cannot be distinguished by any input string, starting by separating accepting from non-accepting states and then refining the groups.
2. DFA Reconstruction: build a new DFA whose states are these groups, taking the transitions from any representative of each group.
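A tiny worked example (not from the notes): take a DFA over the alphabet {a} with states q0 (start), q1, q2, accepting states {q1, q2}, and transitions q0 -a-> q1, q1 -a-> q2, q2 -a-> q2.

Step 1 (partitioning): start with the blocks {q0} (non-accepting) and {q1, q2} (accepting). On input a, both q1 and q2 move into the accepting block, so the partition cannot be refined further and q1, q2 are equivalent.
Step 2 (reconstruction): merge q1 and q2 into a single state qf, giving the two-state DFA q0 -a-> qf, qf -a-> qf, which accepts the same language (one or more a's).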
Unit V: Code Optimization:

Code Optimization in compiler design is a crucial phase


where the intermediate code generated by the compiler is
transformed to improve efficiency and speed of the
executed program. The goal is to enhance the performance
of the code without altering its functionality. Here are
some key points about code optimization:
● Objectives: The primary objectives of code
optimization are to reduce the runtime, memory
usage, and power consumption of the program. It
should also ensure that the optimization does not
change the meaning of the program and keeps the
compilation time reasonable.
● When to Optimize: Typically, code optimization is
performed at the end of the development stage

● Types of Code Optimization:


○ Machine Independent Optimization: This type focuses on improving the intermediate code without considering the specifics of the target machine's architecture. It does not involve CPU registers or absolute memory locations.
○ Machine Dependent Optimization: This type is performed after the target code has been generated and is tailored according to the target machine's architecture. It often involves CPU registers and absolute memory references.
● Techniques:
○ Compile Time Evaluation: Calculating expressions at compile time rather than at runtime to save execution time.
○ Variable Propagation: Replacing variables with their known values to simplify expressions and reduce the number of instructions.
○ Constant Propagation: Substituting variables with constant values whenever possible to eliminate unnecessary calculations.
○ Code Movement: Moving code outside of loops when it does not depend on the loop iteration, thus reducing the number of executions (see the sketch below).
○ Dead Code Elimination: Removing code that does not affect the program output, such as instructions after a return statement.
○ Strength Reduction: Replacing expensive operations with cheaper ones, like using bit shifts instead of multiplication.
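For instance, a hedged sketch of Code Movement (the variable names are made up), in the same style as the examples below:

Initial code:
for (i = 0; i < n; i++) {
    limit = 4 * size;        // does not depend on i
    a[i] = a[i] + limit;
}

Optimized code:
limit = 4 * size;            // computed once, before the loop
for (i = 0; i < n; i++) {
    a[i] = a[i] + limit;
}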

Peephole Optimization:

Peephole optimization is a type of code optimization performed on a small part of the code, i.e., on a very small set of instructions in a segment of code. It works on the principle of replacement: a part of the code is replaced by shorter and faster code without a change in output. Peephole optimization is machine-dependent.

Objectives of Peephole Optimization:

The objective of peephole optimization is as follows:

1. To improve performance
2. To reduce memory footprint

3. To reduce code size

Limitations of Peephole Optimization:


● Limited Scope: Peephole optimization
only looks at small instruction
sequences and might miss broader
optimization opportunities.
● Machine Dependence: The
optimization rules are specific to the
target processor architecture and
instruction set.

Peephole Optimization Techniques

A. Redundant load and store elimination: In this technique,


redundancy is eliminated.

Initial code:
y = x + 5;
i = y;
z = i;
w = z * 3;

Optimized code:
y = x + 5;
w = y * 3; // there is no i now

// We've removed two redundant variables, i and z,
// whose values were just being copied from one
// another.

B. Constant folding: Expressions whose values can be computed at compile time are simplified there, so the computation does not have to be repeated at runtime.

Initial code:
x = 2 * 3;

Optimized code:
x = 6;
Strength Reduction: The operators that consume higher
execution time are replaced by the operators consuming
less execution time.

Initial code:
y = x * 2;

Optimized code:
y = x + x; or y = x << 1;

Null sequences / Simplify algebraic expressions: Useless operations are deleted.
a := a + 0;
a := a * 1;
a := a/1;
a := a - 0;

Directed Acyclic Graph:

The Directed Acyclic Graph (DAG) is used to represent the structure of basic blocks, to visualize the flow of values between basic blocks, and to support optimization techniques within a basic block. To apply an optimization technique to a basic block, a DAG is constructed from the three-address code produced during intermediate code generation.

● Directed acyclic graphs are a type of data structure, and they are used to apply transformations to basic blocks.
● The Directed Acyclic Graph (DAG) facilitates the transformation of basic blocks.
● DAG is an efficient method for identifying common sub-expressions.

Structure:

● A DAG is a directed graph, meaning edges (also called arcs) have a direction, indicating a flow or relationship between elements.
● Unlike a general graph, a DAG has no cycles. This means you cannot follow directed edges and end up back at the starting vertex (node) you began from.

Components:
● Vertices (or nodes): Represent entities or data points in the graph.
● Edges (or arcs): Represent directed connections between vertices, showing the flow of information or relationship between them.
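For example (a standard textbook illustration, not from these notes), consider the basic block:

a = b + c
d = b + c
e = a * d

Its DAG contains a single + node with children b and c; both a and d label that one node, and the * node for e takes the shared + node as both of its operands. The DAG therefore exposes b + c as a common sub-expression that needs to be computed only once.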


Optimization of Basic Blocks – Global Data Flow Analysis

Optimization of Basic Blocks using Global Data Flow Analysis
Optimizing code within basic blocks (sequences of
instructions without jumps) is a common task in compiler
design. While peephole optimization focuses on small
instruction sequences, a more comprehensive approach
called global data flow analysis can be used to identify
optimization opportunities across basic blocks.

Global Data Flow Analysis (GDFA):

● GDFA is a technique that analyzes how data flows


throughout a program. It tracks how the values of
variables are defined and used at different points in
the code. This information is essential for various
compiler optimizations.
● There are different types of data flow analysis, but
some common ones include:
○ Reaching Definitions Analysis: Identifies which
definitions (assignments) of a variable can
potentially reach a specific point in the program.
○ Live Variable Analysis: Determines which
variables hold a value that might still be needed
for future computations at a specific point in the
program.
○ Available Expressions Analysis: Identifies
expressions that have already been computed
and their results are still available for reuse at a
specific point in the program.
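A small annotated fragment (illustrative only) shows the kind of facts these analyses report:

d1: x = 5          // reaching definitions: d1 reaches d2 and d3
d2: y = x + 1      // live variables: x is live here because d3 still uses it
d3: z = x + 1      // available expressions: x + 1 from d2 can be reused (z = y)
d4: x = 7          // d4 kills d1, so d1 reaches nothing beyond this point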

Optimizations using GDFA:

By leveraging the information obtained from GDFA,


compilers can perform various optimizations within basic
blocks:

● Dead Code Elimination: If GDFA reveals that a variable


assignment is never used later in the program (not
live), the compiler can safely remove the assignment
and the associated code, reducing unnecessary
computations.
● Common Subexpression Elimination: If GDFA
identifies the same expression being calculated
multiple times within a basic block, the compiler can
calculate it once and store the result, eliminating
redundant computations.
● Strength Reduction: If a complex operation (e.g.,
division) is used on a variable that always has a
constant value, GDFA can help identify this and
replace the complex operation with a simpler one
(e.g., multiplication) based on the constant value.


Benefits of using GDFA:

● Improved Optimization: GDFA provides a broader view


of data flow compared to peephole optimization,
enabling more comprehensive optimizations across
basic blocks.
● Reduced Code Size: Eliminating dead code and
redundant calculations can lead to a smaller and more
efficient machine code.
● Faster Execution: By reducing unnecessary
computations, optimizations based on GDFA can
improve the program's execution speed.

Challenges of GDFA:

● Complexity: Implementing and analyzing GDFA


algorithms can be more complex compared to simpler
techniques like peephole optimization.
● Potential for Errors: Incorrect data flow information
can lead to unintended consequences during
optimization.
Unit IV: Run-Time Environment and Code
Generation:

Storage Organization
In the context of Compiler Design (CD), storage
organization refers to how data is allocated
and managed within the program's memory
space during its execution. This primarily
focuses on the memory layout of the compiled
code and its associated data structures. Here's
a breakdown of the key aspects:
Memory Regions:

A compiled program typically uses distinct memory


regions:

● Code: This region stores the machine code


instructions generated by the compiler. These
instructions are read by the CPU for execution.
● Static Data: This region holds global and static
variables that are declared outside functions and
have a fixed lifetime throughout the program's
execution.
● Heap: This is a dynamically allocated memory
region where memory can be requested and
released during program execution using
functions like malloc and free. It's commonly
used for data structures that grow or shrink at
runtime.
● Stack: This region is used for function calls, local
variables, and parameter passing. The stack
follows a Last-In-First-Out (LIFO) principle, where
data is pushed onto the stack when a function is
called and popped off when the function returns.
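A short hedged C example (names are arbitrary) showing where each kind of data typically lives; the machine code for main itself sits in the code region:

#include <stdlib.h>

int counter = 0;                          /* static data region: global variable */

int main(void) {
    int local = 10;                       /* stack: local variable               */
    int *p = malloc(100 * sizeof(int));   /* heap: dynamically allocated memory  */
    if (p != NULL) {
        p[0] = local + counter;
        free(p);                          /* release the heap block when done    */
    }
    return 0;                             /* main's stack frame is popped here   */
}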

Stack Allocation Space

1. Static Area (Like a Fixed Shelf):

○ This is where the compiler places global


variables and static variables (special
variables that stay around the entire program).
○ Advantage: Fastest access since the location
is fixed and known beforehand.
○ Disadvantage: Not suitable for data that grows
or shrinks as the program runs.
2. Stack Area (Like a Pile of Papers):

○ This area is used for local variables (created


inside functions) and function parameters.
○ Advantage: Easy to manage as space is
automatically freed when the function is done.
○ Disadvantage: Limited size and not suitable
for data that needs to persist after a function
call.
3. Heap Area (Like a Big Box of Stuff):

○ This is a flexible space where you can request


memory during runtime using functions like
malloc (similar to renting storage space).
○ Advantage: Good for data structures that grow
or shrink as needed.
○ Disadvantage: Slower access and requires
careful management by the programmer to
avoid running out of memory or memory leaks
(forgetting to return unused space).

Choosing the Right Storage:

● Use static allocation for global variables that are


always needed throughout the program.
● Use stack allocation for local variables that only
need to exist within a function.
● Use heap allocation for data structures that change
size or need to be shared between functions (use
with caution!).

Access to Non-local Data on the Stack


Access to non-local data on the stack is a concept in compiler design
that deals with the scenario where a function (or a nested scope) needs
to access data that is not within its local scope. This situation
commonly arises in programming languages that support nested
functions. Here’s how it’s typically handled:

1. Activation Record: Each function call creates an activation record (also known as a stack frame) on the stack, which contains its local variables, parameters, return address, and other necessary information.
2. Non-local Variables: These are variables that are not defined within the current function but are accessible due to lexical scope, usually because they are defined in an enclosing function.
3. Access Links: To access non-local variables, compilers often use a chain of access links. An access link is a pointer in an activation record that points to the activation record of the enclosing scope. This allows a function to follow the chain of access links to find the non-local variables it needs.
4. Displays: Another method for accessing non-local data is the use of displays, which are arrays of pointers to activation records. Each entry in the display corresponds to a nesting level of scopes. This allows direct access to the activation records of enclosing scopes without following a chain.
5. Static Chain: The chain of access links forms a static chain that reflects the static (lexical) nesting structure of the program. When a function needs to access a non-local variable, it can traverse up the static chain to the activation record where the variable is located.
6. Dynamic Chain: The dynamic chain, on the other hand, reflects the actual call history at runtime. It is used to return control to the caller when a function exits.

How does a compiler handle access to non-local data on


the stack?
The compiler generates code to set up an access link in the current
stack frame, pointing to the calling function's stack frame or subroutine.
This allows nested functions or subroutines to access non-local
variables by following the chain of stack frames using access links.

What is the role of access links in accessing non-local data


on the stack?
Access links are pointers that follow the chain of stack frames to locate
non-local variables. They point to the stack frame of the calling function
or subroutine, allowing nested functions or subroutines to access
variables in the calling function's stack frame and variables in the stack
frames of any other calling functions up the chain.

How does the depth of nesting affect access to non-local


data on the stack?
The depth of nesting refers to the number of levels of nested functions
or subroutines. Each level of nesting requires an additional link in the
chain of stack frames, which must be followed to access non-local data
on the stack. Therefore, deeper levels of nesting can lead to longer
chains of stack frames and potentially slower access times.
Heap Management:
Heap management refers to the techniques used to
allocate and deallocate memory on the heap during
program execution. The heap is a dynamic memory pool
that allows programs to request memory at runtime, unlike
the stack which has a fixed size and automatic
allocation/deallocation. Effective heap management is
crucial for avoiding memory-related errors and ensuring
efficient program execution.

Memory Allocation:
● Programs use functions like malloc (allocate) and calloc (allocate and initialize to zero) to request memory from the heap.
● These functions return a pointer to the allocated block; this pointer is used by the program to access and manipulate the allocated memory.

Challenges of Heap Management:

● Memory Leaks: As mentioned earlier, forgetting to


deallocate memory with free can lead to leaks. This
can significantly impact program performance over
time as available memory dwindles.
● Fragmentation: Over time, frequent allocation and deallocation of memory can lead to fragmentation, where free memory is split into many small, non-contiguous blocks. This can make it difficult to allocate larger blocks of memory later, even if there's enough free space in total.

● Performance: Heap allocation and deallocation are


generally slower than stack allocation due to the
complexity of managing the free memory blocks.

The heap memory manager performs the following


fundamental memory operations:

● Allocation: Performed by malloc and calloc


● Deallocation: Performed by free
● Reallocation: Performed by realloc
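A short hedged C example of these three operations (the sizes are arbitrary):

#include <stdlib.h>

int main(void) {
    int *a = malloc(10 * sizeof(int));        /* allocation                          */
    if (a == NULL) return 1;

    int *b = realloc(a, 20 * sizeof(int));    /* reallocation: grow the block        */
    if (b == NULL) { free(a); return 1; }     /* on failure the old block is intact  */
    a = b;

    free(a);                                  /* deallocation                        */
    return 0;
}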
Issues in Code Generation

Issues in code generation within compiler design are


multifaceted and can arise at various stages of the process.
Here are some of the key issues:
1. Input to the Code Generator: The code generator must handle intermediate representations of the source program, such as syntax trees or three-address code, and assume that they are free from syntactic and semantic errors.
2. Target Program: The output of the code generator can be in different forms like assembly language, relocatable machine language, or absolute machine language, each with its own challenges.
3. Memory Management: Properly mapping symbol table entries to memory addresses is crucial. This includes managing local variables on the stack and global variables in static areas.
4. Instruction Selection: Choosing the most efficient machine instructions based on the target machine's instruction set is vital for the performance of the generated code.
5. Register Allocation: Efficiently allocating registers for variables is important as registers are faster than memory. This involves selecting which variables should reside in registers and assigning them appropriately.
● Other concerns include debugging difficulties, the form of the target program, and the representation of data in the target machine.
Syntax Directed Definitions (SDDs) are a powerful tool in
compiler design for associating semantic information with the
syntactic structure of a program. They extend Context-Free
Grammars (CFGs) by attaching semantic rules to productions.
These rules define how attributes associated with grammar
symbols are computed during parsing.

Key Components of SDDs:

● Attributes: These are additional symbols attached to


grammar symbols (terminals and non-terminals) that
hold semantic information. Attributes can represent
various aspects like type information, values, or code
generation details.
● Semantic Rules: These are code snippets associated with
grammar productions. They define how the attributes of a
non-terminal on the left-hand side (LHS) of the
production are computed based on the attributes of the
symbols on the right-hand side (RHS).

Evaluation Orders for Syntax Directed Definitions (SDDs)

As we discussed earlier, evaluation order in SDDs is crucial for


ensuring the correctness of semantic analysis during parsing.
Here's a deeper dive into the two main types of SDDs based on
their evaluation orders:

1. S-attributed SDDs (Synthesized Attributes):

● Focus: S-attributed SDDs deal only with synthesized


attributes. These attributes are computed based on the
information flowing upwards from the children nodes in
the parse tree towards the root.

Evaluation Approach:

● Semantic rules are placed at the end of grammar


productions.
● Evaluation follows a bottom-up approach.

● As the parser finishes a non-terminal node, its synthesized attributes are computed from the attributes of its children.

Example:

Consider the grammar for expressions with addition and


subtraction:

E -> E + T | E - T | T
T -> int
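A minimal sketch of semantic rules for this grammar (the attribute name val and the scanner-supplied int.lexval are assumptions made here), showing that every attribute is synthesized from the children:

E -> E1 + T     { E.val = E1.val + T.val }
E -> E1 - T     { E.val = E1.val - T.val }
E -> T          { E.val = T.val }
T -> int        { T.val = int.lexval }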

2. L-attributed SDDs (Inherited Attributes):

● Focus: L-attributed SDDs can have both synthesized


and inherited attributes. Inherited attributes flow
downwards from the parent node to its children in the
parse tree.
● Evaluation Approach:
○ Semantic rules can be placed at the beginning,
end, or both at the beginning and end of
productions depending on the dependencies
between attributes.
○ Evaluation can involve a combination of
top-down and bottom-up approaches:

Example:

Consider a grammar for assignment statements:

S -> var = E
E -> T
T -> int
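The grammar above lists only productions; a classic illustration of an inherited attribute (a standard textbook sketch, not from these notes) is type propagation in declaration lists, where L.inh flows down from T and addType / id.entry are assumed symbol-table helpers:

D -> T L        { L.inh = T.type }
T -> int        { T.type = integer }
T -> float      { T.type = float }
L -> L1 , id    { L1.inh = L.inh;  addType(id.entry, L.inh) }
L -> id         { addType(id.entry, L.inh) }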

Intermediate Languages: Diving Deeper

Intermediate Languages (IL) come in various forms, each with its strengths and weaknesses. Here's a breakdown of some common representations used within the realm of ILs.

Syntax Tree:

● Structure: A syntax tree is a hierarchical


representation of the program's grammatical
structure. It resembles a tree, with the root node
representing the program itself and internal nodes
representing operators, function calls, and control
flow statements. Leaves of the tree hold terminal
symbols like variables and constants.

Advantages:

● Clearly reflects the program's syntactic structure.


● Useful for code analysis and manipulation tasks.
● Can serve as an intermediate representation before
further optimization or code generation.

Disadvantages:

● Can be cumbersome for representing complex control


flow or data structures.
● Not directly translatable to machine code.
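For instance (a standard example), the expression a + b * c has the syntax tree sketched below; * sits beneath + because of operator precedence:

        +
       / \
      a   *
         / \
        b   c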

Three-Address Code (TAC):

● Structure: TAC is a linear representation of the


program. It consists of a sequence of instructions,
each typically containing three operands (source 1,
source 2, and destination) and an operator. These
instructions represent basic operations like
assignments, arithmetic calculations, and control flow
transfers.

Advantages:

● Easier to translate to machine code compared to


syntax trees.
● Enables efficient code optimization like register
allocation and instruction scheduling.
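For example (the temporaries t1 and t2 are illustrative), the assignment x = (a + b) * c translates into the TAC sequence:

t1 = a + b
t2 = t1 * c
x  = t2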

Types and Declarations:

● Importance: Type information and variable


declarations are crucial aspects of any programming
language.
○ Types define the data types of variables (e.g.,
integer, float, string), which helps ensure type
safety and program correctness.
○ Declarations specify the names, types, and
storage locations of variables.
● Representation in IL: In an intermediate language,
○ Type information can be associated with symbols
in the syntax tree
○ Declarations can be translated into instructions
for allocating memory and initializing variables
during code generation.

Translation of Expressions and Type Checking in


Compilers
During compilation, translating expressions and
performing type checking are crucial steps that ensure the
program's correctness and efficiency. Here's a breakdown
of these processes:

1. Translation of Expressions:

● Goal: Transform expressions written in the source


language into a format suitable for code generation in
the target machine code.
● Process:
○ The parser identifies expressions within the
source code.
○ The semantic analyzer performs type checking
on the expression's operands (variables, literals,
etc.) to ensure compatibility.
○ Based on the expression type and operator, the
expression is translated into an equivalent
representation in the chosen intermediate
language (IL). This could be:
■ Three-Address Code (TAC): The expression
is broken down into a sequence of TAC
instructions representing the operations
involved (e.g., a = b + c translated to t1 =
b + c; a = t1).
■ Bytecode: The expression is converted into
a sequence of bytecode instructions specific
to the virtual machine for which the IL is
targeted.

2. Type Checking:

● Goal: Verify that the operands in an expression are of


compatible types and that the overall expression
results in a valid type according to the programming
language's rules.
● Benefits:
○ Prevents type errors that could lead to program
crashes or unexpected behavior.
○ Improves code clarity and maintainability by
enforcing type consistency.
● Techniques:
○ Static Type Checking: This is typically done
during compilation. The compiler analyzes the
source code to infer the types of variables and
expressions based on their declarations and
usage.
○ Dynamic Type Checking: This happens at
runtime. The program checks the types of
operands at runtime before performing the
operation. (Less common in statically typed
languages).
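A tiny C fragment (illustrative) of the kind of error static type checking rejects at compile time:

int n = 4;
char *s = "hello";
int y = n * s;        /* rejected: '*' is not defined for an int and a pointer */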
Unit II: Syntax Analysis:

Parser

● Function: A parser is a program that is part of a compiler and is responsible for parsing, the process of transforming data from one format to another. In compiler design, the parser is the phase that converts a token string into an Intermediate Representation (IR) using a set of rules called a grammar.
● Process
○ It compares the token stream against the
grammar rules of the programming language,
essentially trying to fit the tokens into a valid
structure defined by the grammar.
○ If the token sequence matches a production rule
(a rule defining how symbols can be arranged),
the parser proceeds, building a parse tree (a
hierarchical representation of the program's
structure).
○ If a mismatch occurs (unexpected token or
missing element), the parser encounters a syntax
error.

There are two main types of parsers:

● Top-down parser: Builds the parse tree from the start symbol down to the terminals
● Bottom-up parser: Builds the parse tree from the terminals up to the start symbol; also known as a Shift-Reduce parser
2. Grammars:

● Definition: Grammars are the formal rules that define


the valid syntax (structure) of a programming
language. They act as the blueprint for the parser.
● Types: Context-Free Grammars (CFGs) are commonly
used in compiler design. These grammars consist of:
○ Terminal symbols: Represent basic elements like
keywords, identifiers, operators, and punctuation
(e.g., int, x, +, ;).
○ Non-terminal symbols: Represent higher-level
program constructs like statements, expressions,
and blocks (e.g., statement, expression,
block).
○ Production rules: Define how non-terminal
symbols can be rewritten using terminal and
non-terminal symbols (e.g., statement -> if (
expression ) statement else statement).
● Role in Parsing: The parser uses the grammar rules to
determine if the sequence of tokens it encounters is a
valid program structure according to the language's
definition.
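As a short worked example (a standard expression grammar, not from these notes), the parser can confirm that id + id * id fits the language by deriving it from the start symbol E:

Grammar:     E -> E + T | T
             T -> T * F | F
             F -> id

Derivation:  E => E + T => T + T => F + T => id + T
               => id + T * F => id + F * F => id + id * F => id + id * id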

3. Error Handling:

● Importance: Syntax errors can render a program


non-functional. Effective error handling is crucial for
identifying and reporting these errors to the
programmer.
● Strategies:
○ Error Detection: The parser should be able to
detect syntax errors as soon as they occur
during the parsing process. This might involve
techniques like lookahead (checking upcoming
tokens) to anticipate potential issues.
○ Error Reporting: Upon encountering an error, the
parser should provide informative error
messages that pinpoint the location of the error
(line number, token) and offer a clear explanation
of the issue. This helps the programmer locate
and rectify the problem.
○ Error Recovery: In some cases, the parser might
attempt to recover from an error and continue
parsing by skipping certain parts of the code or
assuming reasonable defaults. However, this is a
complex area and should be implemented with
caution to avoid introducing new errors.
