Compiler Design Notes

UNIT - I: Introduction to Compilers and Parsing

Introduction to Compilers

What is a Compiler?

A compiler is a program that converts code written in a high-level programming language (like C, Java) into a low-level language (like machine code) that a computer can understand and execute. It translates the entire program at once before running it.

What is an Interpreter?

An interpreter is a program that executes code line by line without translating the entire
program into machine code first. It reads, translates, and runs each line one by one.

Differences Between Compiler and Interpreter

Feature           Compiler                                    Interpreter
Execution         Translates entire code at once              Executes code line by line
Speed             Faster execution (after compilation)        Slower execution (translates on-the-fly)
Output            Generates executable file (machine code)    No separate executable file
Error Detection   Shows all errors after compilation          Shows errors line by line
Example           C, C++ compilers                            Python, JavaScript interpreters

Phases of a Compiler

A compiler works in several steps, called phases, to convert source code into executable
code:

1. Lexical Analysis: Breaks the code into small pieces called tokens (like keywords,
symbols).

2. Syntax Analysis: Checks if tokens follow the grammar rules of the language.

3. Semantic Analysis: Ensures the code makes sense (e.g., variables are declared
before use).
4. Intermediate Code Generation: Creates a middle-level code that is easier to
optimize.

5. Code Optimization: Improves the intermediate code to make it faster or smaller.

6. Code Generation: Converts optimized code into machine code.

7. Symbol Table Management: Keeps track of variables, functions, and their details.

8. Error Handling: Detects and reports errors in each phase.

Role of Lexical Analyzer

The lexical analyzer (or scanner) is the first phase of a compiler. It:

• Reads the source code character by character.

• Groups characters into meaningful units called tokens (e.g., int, +, variable_name).

• Removes unnecessary things like comments and extra spaces.

• Passes tokens to the next phase (syntax analyzer).

Regular Expressions

A regular expression (regex) is a way to describe patterns in text. For example:

• a* means "zero or more 'a' characters" (like "", "a", "aa").

• [0-9]+ means "one or more digits" (like "123").


Regular expressions are used to define the rules for tokens in a lexical analyzer.

Finite Automata

A finite automaton (FA) is a simple machine that recognizes patterns in text. It has:

• States: Different stages of processing.

• Transitions: Rules to move between states based on input characters.

• Accepting State: If the machine reaches this state, the input matches the pattern.

There are two types:

1. Deterministic Finite Automata (DFA): Only one possible transition for each input.

2. Non-deterministic Finite Automata (NFA): Multiple possible transitions for an input.

From Regular Expressions to Finite Automata


• Regular expressions can be converted into an NFA or DFA.

• An NFA is built first because it’s easier to construct from a regex.

• The NFA can then be converted into a DFA for faster processing.

• This process helps the lexical analyzer recognize tokens based on regex patterns.
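For example, converting the regex [0-9]+ from the earlier section yields a DFA that can be written as a transition table (an illustrative sketch):

State            digit (0-9)    anything else
S0 (start)       go to S1       reject
S1 (accepting)   stay in S1     reject

The input matches if the machine ends in the accepting state S1, i.e., it saw one or more digits and nothing else.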

Pass and Phases of Translation

• A pass is one complete run through the source code by the compiler.

• A compiler may need multiple passes to complete all phases (e.g., one pass for
lexical analysis, another for code generation).

• Some compilers combine phases into a single pass to save time.

Bootstrapping

Bootstrapping is the process of writing a compiler for a language using the same language. For example:

• Writing a C compiler in C.

• The initial compiler is written in a different language or a subset of the target language, then it is used to compile itself.

LEX - Lexical Analyzer Generator

LEX is a tool that automatically creates a lexical analyzer. You give it:

• A set of regular expressions for tokens.

• Actions to perform when a token is found.

LEX generates a C program that scans the input and produces tokens for the compiler.
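A minimal LEX specification in the same style as the YACC example later in these notes (the token classes and messages are illustrative):

%{
#include <stdio.h>
%}

%%
"int"                    { printf("KEYWORD: %s\n", yytext); }
[0-9]+                   { printf("NUMBER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*   { printf("IDENTIFIER: %s\n", yytext); }
"+"                      { printf("PLUS\n"); }
[ \t\n]+                 ;   /* skip whitespace */
.                        { printf("UNKNOWN: %s\n", yytext); }
%%

int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }

Running lex (or flex) on this file produces lex.yy.c; compiled, it reads standard input and prints one line per token.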

Parsing

What is Parsing?

Parsing is the process of analyzing the structure of the source code to check if it follows
the grammar rules of the programming language. It’s done by the parser in the syntax
analysis phase.

Role of Parser

The parser:
• Takes tokens from the lexical analyzer.

• Checks if the tokens form valid sentences according to the language’s grammar.

• Builds a parse tree (a diagram showing the structure of the code).

• Reports syntax errors (e.g., missing semicolon).

Context-Free Grammar (CFG)

A context-free grammar is a set of rules that defines the syntax of a programming language. It has:

• Terminals: Basic symbols (like tokens: int, +).

• Non-terminals: Symbols that represent groups of terminals (like expression).

• Productions: Rules that describe how non-terminals are formed (e.g., expr → expr +
term).

• Start Symbol: The main non-terminal that represents the entire program.

Derivations

A derivation is the process of applying grammar rules to create a valid sentence. For
example:

• Grammar: S → aS | b

• Derivation: Start with S, apply rules to get aab (S → aS → aaS → aab).

Parse Trees

A parse tree is a tree-like diagram that shows how a sentence is derived from the grammar:

• Root: Start symbol.

• Nodes: Non-terminals and terminals.

• Leaves: Tokens from the input.


It helps visualize the structure of the code.

Ambiguity

A grammar is ambiguous if a single sentence can have multiple parse trees. For example:

• Grammar: expr → expr + expr | num

• Sentence: 2 + 3 + 4

• Possible parse trees: (2 + 3) + 4 or 2 + (3 + 4).

Ambiguity causes confusion for the parser, so it must be eliminated.

Elimination of Left Recursion

Left recursion happens when a grammar rule starts with the same non-terminal (e.g., A →
Aα | β). This causes problems for some parsers. To eliminate it:

• Rewrite the rule: A → βA', A' → αA' | ε.

• Example:

o Original: E → E + T | T

o After: E → TE', E' → +TE' | ε

Left Factoring

Left factoring removes common prefixes from grammar rules to make parsing easier. For
example:

• Original: A → aB | aC

• After: A → aD, D → B | C

Eliminating Ambiguity from Dangling-Else Grammar

The dangling-else problem occurs in grammars for if-else statements, where it’s unclear
which if an else belongs to. For example:

• Code: if (cond1) if (cond2) stmt1 else stmt2

• Ambiguity: Does else stmt2 belong to the first or second if?


To fix:

• Rewrite the grammar to enforce that else binds to the nearest if.

• Example grammar:

o stmt → matched_stmt | open_stmt

o matched_stmt → if (expr) matched_stmt else matched_stmt | other

o open_stmt → if (expr) stmt | if (expr) matched_stmt else open_stmt
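Applied to the code above, this grammar allows only one derivation: stmt → open_stmt → if (cond1) stmt, and the inner stmt must expand through matched_stmt → if (cond2) stmt1 else stmt2. So the else can attach only to the nearest (inner) if, and the ambiguity is gone.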

Classes of Parsing

There are two main types of parsing:

1. Top-Down Parsing: Starts from the start symbol and builds the parse tree downward.

2. Bottom-Up Parsing: Starts from the tokens and builds the parse tree upward.

Top-Down Parsing

In top-down parsing, the parser starts with the start symbol and tries to derive the input
sentence.

Backtracking

• The parser tries different grammar rules and backtracks if a choice leads to a dead
end.

• Example: If a rule doesn’t work, it goes back and tries another.

• Problem: Slow and inefficient for large programs.

Recursive Descent Parsing

• A type of top-down parsing where each non-terminal has a function.

• The function recursively calls other functions to match the input.

• Example: For grammar E → T + E | T, there’s a function for E that calls functions for T
and E.
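A minimal recursive descent sketch in C for the grammar E → T + E | T, with a single digit standing in for T (the function and variable names are illustrative):

#include <ctype.h>
#include <stdio.h>

const char *input;   /* the input characters, standing in for the token stream */
int pos;             /* current position in the input */

int T(void);

/* E -> T + E | T : parse a T, then if the next token is '+', commit to the
   first alternative and parse another E; otherwise the T alone was the E */
int E(void) {
    if (!T()) return 0;
    if (input[pos] == '+') {
        pos++;
        return E();
    }
    return 1;
}

/* T -> digit : one digit plays the role of the num token */
int T(void) {
    if (isdigit((unsigned char)input[pos])) {
        pos++;
        return 1;
    }
    return 0;
}

int main(void) {
    input = "2+3+4";
    if (E() && input[pos] == '\0')
        printf("accepted\n");
    else
        printf("syntax error at position %d\n", pos);
    return 0;
}

Because the parser decides between the two alternatives by looking at the next character after T, it never needs to backtrack on this grammar.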

Predictive Parsers

• A predictive parser guesses the next rule to apply based on the current token.

• It uses a parsing table to decide which rule to use.

• Fast and efficient but requires the grammar to be suitable (e.g., LL(1)).

LL(1) Grammars

• LL(1) stands for "Left-to-right, Leftmost derivation, 1-token lookahead."

• The parser reads the input from left to right, builds a leftmost derivation, and looks
at one token ahead to make decisions.

• Requirements:

o No ambiguity.

o No left recursion.
o Rules must be left-factored.

• A grammar is LL(1) if the parser can always choose the correct rule by looking at the
next token.
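For the left-factored, non-left-recursive grammar E → TE', E' → +TE' | ε, T → num, the LL(1) parsing table looks like this ($ marks end of input):

            num          +             $
E           E → TE'      error         error
E'          error        E' → +TE'     E' → ε
T           T → num      error         error

At each step the parser looks at the non-terminal on top of its stack and the next input token; the table names exactly one rule to apply, and empty (error) cells signal syntax errors.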


UNIT - II: Bottom-Up Parsing

Introduction to Bottom-Up Parsing

What is Bottom-Up Parsing?

Bottom-up parsing is a method of parsing where the parser starts with the input tokens
(the "bottom") and builds the parse tree upward until it reaches the start symbol of the
grammar (the "top"). It tries to construct the parse tree by combining tokens into larger
structures based on the grammar rules.

• Key Idea: It works by reducing tokens into non-terminals using grammar rules,
moving from the input to the start symbol.

• Example: For a grammar S → aB and input aB, the parser starts with aB and reduces
it to S.

Handles

A handle is a part of the input string that matches the right-hand side of a grammar rule
and can be reduced to a non-terminal.

• Example: For grammar S → aB, if the input is aB, the handle is aB because it can be
reduced to S.

• The parser identifies handles and replaces them with the corresponding non-
terminal.

Handle Pruning

Handle pruning is the process of repeatedly finding and reducing handles in the input
string until the start symbol is reached.

• Steps:

1. Find a handle in the input.

2. Reduce the handle using a grammar rule.


3. Repeat until the entire input is reduced to the start symbol.

• This process "prunes" the parse tree from the bottom up.

Shift-Reduce Parsing

Stack Implementation of Shift-Reduce Parsing

Shift-reduce parsing is a common technique for bottom-up parsing. It uses a stack to keep track of tokens and non-terminals. The parser performs two main actions:

1. Shift: Push the next input token onto the stack.

2. Reduce: If the top of the stack matches the right-hand side of a grammar rule (a
handle), pop those symbols and push the corresponding non-terminal.

How It Works:

• The parser maintains a stack and an input buffer.

• It shifts tokens from the input to the stack until it finds a handle.

• When a handle is found, it reduces the handle to a non-terminal.

• This continues until the input is empty and the stack contains only the start symbol.

Example:

• Grammar: S → aB, B → b

• Input: ab

• Steps:

1. Stack: [], Input: ab → Shift a → Stack: [a]

2. Stack: [a], Input: b → Shift b → Stack: [a, b]

3. Stack: [a, b] → Reduce b to B (using B → b) → Stack: [a, B]

4. Stack: [a, B] → Reduce aB to S (using S → aB) → Stack: [S]

5. Accept: The input is parsed successfully.

Conflicts in Shift-Reduce Parsing

Sometimes, the parser cannot decide whether to shift or reduce. These situations are
called conflicts:

1. Shift-Reduce Conflict:
o The parser can either shift the next token or reduce the stack.

o Example: For grammar S → aS | a, if the stack has a and the next token is a,
the parser doesn’t know whether to shift a or reduce a to S.

2. Reduce-Reduce Conflict:

o The parser can reduce the stack using two or more different grammar rules.

o Example: For grammar S → aB, B → c, C → c, if the stack has ac, the parser
doesn’t know whether to reduce c to B or C.

Conflicts occur in ambiguous grammars or poorly designed grammars. They can be resolved by modifying the grammar or using parser tools like YACC.

LR Grammars and Parsers

LR Grammars

LR grammars are a class of grammars that can be parsed by an efficient bottom-up parser called an LR parser. The name LR stands for Left-to-right scan of the input, Rightmost derivation (constructed in reverse).

• LR parsers are powerful and can handle a wide range of grammars, including some
ambiguous ones.

• They use a parsing table to decide whether to shift or reduce based on the current
state and the next token.

Types of LR Parsers

There are three main types of LR parsers:

1. Simple LR (SLR):

o The simplest type of LR parser.

o Uses a basic parsing table with limited lookahead information.

o Can handle fewer grammars because it doesn’t consider enough context.

2. Canonical LR (CLR):

o The most powerful LR parser.

o Uses a larger parsing table with full lookahead information.

o Can handle more complex grammars but requires more memory and time.

3. Look-Ahead LR (LALR):
o A middle ground between SLR and CLR.

o Uses a smaller parsing table than CLR but can handle more grammars than
SLR.

o Most commonly used (e.g., in YACC).

Comparison:

Parser   Power    Table Size   Speed
SLR      Low      Small        Fast
LALR     Medium   Medium       Medium
CLR      High     Large        Slow

Error Recovery in Parsing

When a parser finds a syntax error (e.g., a missing semicolon), it needs to recover so it can
continue parsing. Common error recovery techniques:

1. Panic Mode:

o Skip tokens until a recognizable token (e.g., a semicolon) is found.

o Example: If the parser expects a ) but sees a +, it skips tokens until it finds a
valid one.

2. Phrase-Level Recovery:

o Insert or delete a token to fix the error (e.g., add a missing semicolon).

o The parser makes a guess to continue parsing.

3. Error Productions:

o Add special grammar rules to handle common errors.

o Example: Add a rule like stmt → error ; to handle invalid statements.

4. Global Recovery:

o Rearrange the parse tree to fix large-scale errors (less common).

Good error recovery helps the compiler report multiple errors instead of stopping at the
first one.
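In YACC (introduced below), error productions are written with the predefined error token; a typical sketch:

stmt : expr ';'
     | error ';'   { yyerrok; fprintf(stderr, "bad statement skipped\n"); }
     ;

After a syntax error, the parser discards input until it can complete the error ';' rule, and yyerrok tells it to resume normal error reporting afterward.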

Parsing Ambiguous Grammars


An ambiguous grammar allows multiple parse trees for the same input, which can
confuse the parser. For example:

• Grammar: E → E + E | num

• Input: 2 + 3 + 4

• Possible parse trees: (2 + 3) + 4 or 2 + (3 + 4).

LR parsers can handle ambiguous grammars by:

• Using precedence and associativity rules to resolve conflicts (e.g., + is left-associative, so (2 + 3) + 4 is chosen).

• Modifying the grammar to remove ambiguity (e.g., rewriting the grammar to enforce
precedence).

• Adding special directives in tools like YACC to guide the parser.

YACC - Automatic Parser Generator

YACC (Yet Another Compiler-Compiler) is a tool that automatically generates a bottom-up (LALR) parser from a grammar specification. How it works:

• You provide:

o A grammar with rules (e.g., S: a B { action }).

o Actions to perform when a rule is reduced (e.g., create a parse tree node).

• YACC generates:

o A C program that implements an LALR parser.

o A parsing table to guide the shift and reduce operations.

• YACC works with LEX (lexical analyzer generator) to get tokens.

• Features:

o Handles ambiguous grammars using precedence rules.

o Supports error recovery.

o Widely used for building compilers (e.g., for C, Java).

Example YACC Input:

%{
#include <stdio.h>
%}

%token NUM
%left '+'   /* + is left-associative, resolving the grammar's conflicts */

%%

E : E '+' E { printf("Found addition\n"); }
  | NUM     { printf("Found number\n"); }
  ;

%%

• This defines a simple grammar for expressions (E → E + E | NUM).

• The grammar is ambiguous, so the %left '+' directive tells YACC to resolve its shift-reduce conflicts by treating + as left-associative.

• YACC generates a parser that recognizes inputs like 2 + 3.
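On a typical Unix system, the YACC output is combined with a LEX scanner roughly like this (the file names are illustrative):

yacc -d expr.y       # writes y.tab.c and the token definitions in y.tab.h
lex expr.l           # writes lex.yy.c
cc y.tab.c lex.yy.c -o expr

If the sources do not define main, yyerror, or yywrap themselves, linking with -ly and -ll supplies default versions.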


UNIT - III: Syntax Directed Translation and Intermediate Code Generation

Syntax Directed Translation

What is Syntax Directed Translation (SDT)?

Syntax Directed Translation (SDT) is a method used by compilers to translate source code
into another form (like intermediate code or machine code) while parsing the code. It
attaches rules or actions to the grammar rules to perform translations during parsing.

• Key Idea: Each grammar rule has associated actions (called semantic actions) that
describe what to do when the rule is applied.

• Example: For a grammar rule E → E1 + E2, an SDT might generate code to add E1 and
E2.

Syntax Directed Definition (SDD)

A Syntax Directed Definition (SDD) is a formal way to define the translation process. It
consists of:

• A context-free grammar with production rules.

• Attributes for each grammar symbol:

o Synthesized Attributes: Computed from child nodes (bottom-up).

o Inherited Attributes: Computed from parent or sibling nodes (top-down).

• Semantic Rules: Instructions that calculate attribute values for each production rule.

Example:

Grammar: E → E1 + E2

• Attributes: E.val (synthesized attribute for the value of E).

• Semantic Rule: E.val = E1.val + E2.val

• Meaning: The value of E is the sum of the values of E1 and E2.
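In YACC, this semantic rule is written directly in the production's action, with $$ standing for E.val and $1, $3 for E1.val and E2.val:

E : E '+' E   { $$ = $1 + $3; }   /* E.val = E1.val + E2.val */
  ;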

Construction of Syntax Trees

A syntax tree (or parse tree) is a tree that represents the structure of the source code
according to the grammar. SDT can be used to build syntax trees by:

• Associating semantic actions with grammar rules to create tree nodes.

• Example:

o Grammar: E → E1 + E2

o Semantic Action: E.node = new Node('+', E1.node, E2.node)

o This creates a tree node for + with E1 and E2 as children.
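A C sketch of the node type behind that semantic action (the field and function names are illustrative):

#include <stdlib.h>

struct Node {
    char op;                      /* operator or token label, e.g. '+' */
    struct Node *left, *right;    /* children; NULL for leaf nodes */
};

/* corresponds to new Node('+', E1.node, E2.node) in the semantic action */
struct Node *new_node(char op, struct Node *l, struct Node *r) {
    struct Node *n = malloc(sizeof *n);
    n->op = op;
    n->left = l;
    n->right = r;
    return n;
}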

S-Attributed Definitions

An S-attributed definition uses only synthesized attributes. These attributes are computed bottom-up, from child nodes to parent nodes.

• Characteristics:

o Easy to implement in bottom-up parsers (like LR parsers).

o All attributes are calculated after the children are processed.

• Example:

o Grammar: E → E1 + E2

o S-attributed Rule: E.val = E1.val + E2.val

o The value of E depends only on the values of its children.


L-Attributed Definitions

An L-attributed definition uses both synthesized and inherited attributes, but inherited
attributes are computed in a left-to-right order.

• Characteristics:

o Suitable for top-down parsers (like LL parsers).

o Attributes can depend on the parent or left siblings but not on right siblings.

• Example:

o Grammar: D → T id

o Inherited Rule: id.type = T.type (type of id is inherited from T).

o Synthesized Rule: D.code = T.code || "declare " || id.name

Translation Schemes

A translation scheme is an SDT where semantic actions are embedded within the
grammar rules to specify when the actions should be executed during parsing.

• Key Idea: Actions (enclosed in {}) are placed inside the production to indicate the order of execution; an action runs as soon as the symbols to its left have been parsed.

• Example:

o Grammar: E → E1 + E2

o Translation Scheme: E → E1 + E2 { E.temp = new_temp(); emit(E.temp = E1.val + E2.val); }

o The action is written after E2 because it uses E2's value, so it must run only once both operands have been parsed. It then emits the intermediate code for the addition.

Emitting a Translation

Emitting a translation means producing the output of the translation process (e.g.,
intermediate code, machine code). The translation scheme controls when and how the
output is generated.

• Example: For a + b, the translation scheme might emit:

t1 = a + b

where t1 is a temporary variable holding the result.


Intermediate Code Generation

What is Intermediate Code?

Intermediate code is a simplified, machine-independent representation of the source program that sits between the high-level source code and the target machine code. A common form is three-address code, where each instruction has at most one operator (e.g., t1 = a + b). Generating it first makes the code easier to optimize and to retarget to different machines.


UNIT - V: Code Optimization and Code Generation

Code Optimization

What is Code Optimization?

Code optimization is the process of improving the intermediate code (or machine code) to
make it run faster, use less memory, or consume less power, while producing the same
output. It’s an optional phase in a compiler but very important for performance.

• Goal: Make the program more efficient without changing its behavior.

• Example: Replace x = 2 + 3 with x = 5, so the addition never happens at run time.

Organization of Code Optimizer

A code optimizer is a part of the compiler that applies optimization techniques. It typically
works on intermediate code and is organized as:

• Front End: Analyzes the intermediate code and builds data structures (like flow graphs).

• Optimization Passes: Applies specific optimization techniques (e.g., loop optimization, constant folding).

• Back End: Outputs the optimized intermediate code for code generation.

• Multiple passes may be made to apply different optimizations.

Basic Blocks and Flow Graphs

• Basic Block:

o A sequence of instructions with only one entry point (the first instruction) and
one exit point (the last instruction).

o No jumps or branches inside except at the start or end.


o Example:

t1 = a + b
t2 = t1 * c

This is a basic block because it has no jumps.

• Flow Graph:

o A graph where each node is a basic block, and edges represent control flow (jumps or branches) between blocks.

o Helps analyze the program's structure for optimization.

o Example:

B1: t1 = a + b
    goto B2

B2: t2 = t1 * c

Here, B1 and B2 are basic blocks, and there's an edge from B1 to B2.

Optimization of Basic Blocks

Optimizations within a basic block are called local optimizations. Common techniques
include:

1. Constant Folding:

o Evaluate constant expressions at compile time.

o Example: Replace x = 2 + 3 with x = 5.

2. Constant Propagation:

o Replace variables with their constant values.

o Example: If x = 5, replace y = x + 1 with y = 6.

3. Common Subexpression Elimination:

o Reuse previously computed values.

o Example: If t1 = a * b and t2 = a * b, reuse t1 instead of recomputing.

4. Dead Code Elimination:

o Remove code that doesn't affect the program's output.

o Example: If x = 5 is assigned but never used, remove it.

5. Strength Reduction:

o Replace expensive operations with cheaper ones.

o Example: Replace x = y * 2 with x = y + y.
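Applying several of these techniques to one small three-address fragment (an illustrative example):

Before:
t1 = 2 + 3
t2 = a * b
t3 = a * b
t4 = t2 + t3
t5 = t4 * 2
t6 = b - b        (never used afterwards)
x  = t5 + t1

After:
t2 = a * b
t4 = t2 + t2      (common subexpression elimination: t3 repeated t2)
t5 = t4 + t4      (strength reduction: t4 * 2 becomes t4 + t4)
x  = t5 + 5       (constant folding and propagation: t1 = 2 + 3 = 5)

Dead code elimination removed t6, which never affected the output.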

Principal Sources of Optimization

The main opportunities for optimization come from:

1. Redundant Computations:

o Eliminate repeated calculations (e.g., common subexpression elimination).

2. Unreachable Code:

o Remove code that can never be executed.

3. Loop Optimizations:

o Improve loops by moving invariant code outside or reducing loop iterations.

o Example: Move x = 5 outside a loop if it doesn’t change.

4. Algebraic Simplifications:

o Simplify expressions (e.g., x + 0 becomes x).

5. Function Inlining:

o Replace function calls with the function’s body to avoid call overhead.

Directed Acyclic Graph (DAG) Representation of Basic Block

A Directed Acyclic Graph (DAG) is a graphical representation of a basic block used for
optimization.

• Nodes: Represent variables, constants, or operations.

• Edges: Show dependencies (e.g., an edge from a to + means a is an operand).

• Uses:

o Identifies common subexpressions (nodes shared by multiple operations).

o Detects redundant computations.

o Simplifies code generation.


• Example:

o Code:

t1 = a + b
t2 = a + b
t3 = t1 * t2

o In the DAG, a + b becomes a single shared node, so the repeated computation in t2 is exposed as a common subexpression, and t3 can be rewritten to use the shared result (t3 = t1 * t1).
