Short Notes


1. INTRODUCTION TO COMPILERS AND ITS PHASES

A compiler is a program that takes a program written in a source language and translates it into an
equivalent program in a target language.

Source program → COMPILER → Target program

This subject discusses the various techniques used to achieve this objective. Beyond the
development of compilers, the techniques used in compiler design are applicable to many
other problems in computer science.
o Techniques used in a lexical analyzer can be used in text editors, information
retrieval systems, and pattern recognition programs.
o Techniques used in a parser can be used in a query processing system such as
SQL.
o Software with a complex front-end may need techniques used in compiler design.
For example, a symbolic equation solver that takes an equation as input must
parse that equation.
o Most of the techniques used in compiler design can be used in Natural Language
Processing (NLP) systems.

1.1 Major Parts of a Compiler

There are two major parts of a compiler: Analysis and Synthesis


• In the analysis phase, an intermediate representation is created from the given source program.
– Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the phases in this part.
• In the synthesis phase, the equivalent target program is created from this intermediate
representation.
– Intermediate Code Generator, Code Optimizer, and Code Generator are the phases in this
part.

1.2 Phases of a Compiler

Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program

Each phase transforms the source program from one representation into another representation.
They communicate with error handlers and the symbol table.

1.2.1 Lexical Analyzer

• Lexical Analyzer reads the source program character by character and returns the tokens
of the source program.
• A token describes a pattern of characters having the same meaning in the source program
(such as identifiers, operators, keywords, numbers, delimiters and so on).

Example:
In the line of code newval := oldval + 12, tokens are:
newval (identifier)
:= (assignment operator)
oldval (identifier)
+ (add operator)
12 (a number)
• Puts information about identifiers into the symbol table.
• Regular expressions are used to describe tokens (lexical constructs).
• A (Deterministic) Finite State Automaton can be used in the implementation of a lexical
analyzer.
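
The bullets above can be sketched in a few lines of Python. This is only an illustrative sketch: the token names and the patterns in TOKEN_SPEC are assumptions chosen to cover the example line newval := oldval + 12, not the definition of any particular language, and Python's re module is used in place of a hand-built DFA.

```python
import re

# Hypothetical token patterns, just enough for: newval := oldval + 12
TOKEN_SPEC = [
    ("number",     r"\d+"),
    ("identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("assign",     r":="),
    ("add",        r"\+"),
    ("skip",       r"\s+"),          # whitespace is matched but not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Scan text left to right and return a list of (token, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("newval := oldval + 12"))
```

A real lexical analyzer would also enter each identifier into the symbol table and hand tokens to the parser one at a time rather than building a list.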

1.2.2 Syntax Analyzer

• A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given
program.
• A syntax analyzer is also called a parser.
• A parse tree describes a syntactic structure.
Example:
For the line of code newval := oldval + 12, the parse tree will be:

assignment
├── identifier
│   └── newval
├── :=
└── expression
    ├── expression
    │   └── identifier
    │       └── oldval
    ├── +
    └── expression
        └── number
            └── 12

• The syntax of a language is specified by a context free grammar (CFG).


• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or
not.
– If it satisfies, the syntax analyzer creates a parse tree for the given program.

Example:
The CFG used for the above parse tree is:
assignment → identifier := expression
expression → identifier
expression → number
expression → expression + expression

• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing,
– Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves, and proceeds towards the root.

– Efficient bottom-up parsers are normally created with the help of software tools.
– Bottom-up parsing is also known as shift-reduce parsing.
– Operator-Precedence Parsing – simple, restrictive, easy to implement
– LR Parsing – a much more general form of shift-reduce parsing; variants: LR, SLR, LALR

1.2.3 Semantic Analyzer

• A semantic analyzer checks the source program for semantic errors and collects the type
information for the code generation.
• Type-checking is an important part of the semantic analyzer.
• Normally semantic information cannot be represented by a context-free language used in
syntax analyzers.
• Context-free grammars used in the syntax analysis are integrated with attributes (semantic
rules). The results are syntax-directed translation and attribute grammars.

Example:
In the line of code newval := oldval + 12, the type of the identifier newval must match
with type of the expression (oldval+12).

1.2.4 Intermediate Code Generation

• A compiler may produce an explicit intermediate code representing the source program.
• This intermediate code is generally machine-architecture independent, but its level is
close to the level of machine code.

Example:

newval := oldval * fact + 1
⇓
id1 := id2 * id3 + 1
⇓
MULT id2, id3, temp1
ADD temp1, #1, temp2
MOV temp2, id1

The last form is the intermediate code (quadruples).

1.2.5 Code Optimizer

• The code optimizer optimizes the code produced by the intermediate code generator in
terms of time and space.

Example:
The above piece of intermediate code can be reduced as follows:

MULT id2, id3, temp1
ADD temp1, #1, id1

1.2.6 Code Generator

• Produces the target language in a specific architecture.
• The target program is normally a relocatable object file containing the machine code.
Example:
Assuming that we have an architecture whose instructions have at least one operand in a
machine register, the final code for our line of code will be:

MOVE id2, R1
MULT id3, R1
ADD #1, R1
MOVE R1, id1

1.3 Phases v/s Passes

Phases of a compiler are the sub-tasks that must be performed to complete the compilation
process. Passes refer to the number of times the compiler has to traverse through the entire
program.

1.4 Bootstrapping and Cross-Compiler

There are three languages involved in a single compiler: the source language (S), the target
language (A) and the language in which the compiler is written (L). Such a compiler is denoted
by C with subscript L and superscript SA:

C_L^SA

Usually the language of the compiler and the target language are both the language of the
computer (A) on which it runs:

C_A^SA

If a compiler is written in its own source language (L = S), the problem is how to compile the
first compiler. For this we take a language R, which is a small subset of language S, and write
a compiler for R in the language of the computer A. The compiler for S is then written in R and
compiled with the compiler for R, which gives a full-fledged compiler for S. This is known as
Bootstrapping.

C_R^SA compiled with C_A^RA gives C_A^SA

A Cross Compiler is a compiler that runs on one machine (A) and produces code for another
machine (B).

C_A^SB

2. LEXICAL ANALYSIS

The Lexical Analyzer reads the source program character by character to produce tokens.
Normally a lexical analyzer does not return a list of all tokens in one shot; it returns one
token each time the parser asks for the next token.

2.1 Token

• A token represents a set of strings described by a pattern. For example, the token identifier
represents the set of strings that start with a letter and continue with letters and digits. The
actual string is called a lexeme.
• Since a token can represent more than one lexeme, additional information should be held for
that specific lexeme. This additional information is called the attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information for that
token. For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds
the actual attributes for that token.
• Examples:
– <identifier, attribute> where attribute is pointer to the symbol table
– <assignment operator> no attribute is needed
– <number, value> where value is the actual value of the number
• Token type and its attribute uniquely identify a lexeme.
• Regular expressions are widely used to specify patterns.

2.2 Languages

2.2.1 Terminology

• Alphabet : a finite set of symbols (ASCII characters)


• String : finite sequence of symbols on an alphabet
– Sentence and word are also used as synonyms for string
– ε is the empty string
– |s| is the length of string s.
• Language: sets of strings over some fixed alphabet
– ∅ the empty set is a language.
– {ε} the set containing empty string is a language
– The set of all possible identifiers is a language.
• Operators on Strings:
– Concatenation: xy represents the concatenation of strings x and y. sε = s and εs = s
– Exponentiation: s^n = s s s .. s (n times), s^0 = ε

2.2.2. Operations on Languages

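• Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation: L^0 = {ε}, L^1 = L, L^2 = LL, and in general L^i = L^(i-1) L
• Kleene Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (the union of L^i for all i ≥ 0)
• Positive Closure: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (the union of L^i for all i ≥ 1)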

Examples:
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• L1^3 = all strings of length three (using a,b,c,d)
• L1* = all strings using letters a,b,c,d, including the empty string
• L1+ = all non-empty strings using letters a,b,c,d (it doesn’t include the empty string)
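
These operations can be tried out with Python sets of strings. The sketch below is illustrative only: the function names are made up for this note, the empty string "" stands for ε, and because L* is infinite whenever L contains a non-empty string, kleene_up_to only builds the finite part L^0 ∪ ... ∪ L^k.

```python
def concat(l1, l2):
    """L1L2: every string of L1 followed by every string of L2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}


def power(lang, n):
    """L^n, starting from L^0 = {ε} (the empty string "")."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def kleene_up_to(lang, k):
    """Finite approximation of L*: the union of L^0 .. L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(lang, i)
    return result


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(concat(L1, L2)))      # the eight strings a1, a2, ..., d2
print(sorted(L1 | L2))             # union
print(len(power(L1, 3)))           # number of strings of length three
```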

2.3 Regular Expressions and Finite Automata

2.3.1 Regular Expressions

• We use regular expressions to describe tokens of a programming language.


• A regular expression is built up of simpler regular expressions (using defining rules)
• Each regular expression denotes a language.
• A language denoted by a regular expression is called as a regular set.

For Regular Expressions over alphabet Σ

Regular Expression Language it denotes


ε {ε}
a∈ Σ {a}
(r1) | (r2) L(r1) ∪ L(r2)
(r1) (r2) L(r1) L(r2)
(r)* (L(r))*
(r) L(r)

• (r)+ = (r)(r)*
• (r)? = (r) | ε
• We may remove parentheses by using precedence rules.
– * highest
– concatenation next
– | lowest
• ab*|c means (a(b)*)|(c)
Examples:
– Σ = {0,1}
– 0|1 = {0,1}
– (0|1)(0|1) = {00,01,10,11}
– 0* = {ε, 0, 00, 000, 0000, ....}
– (0|1)* = all strings of 0s and 1s, including the empty string

2.3.2 Finite Automata

• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
• We call the recognizer of the tokens as a finite automaton.
• A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
• This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
• Both deterministic and non-deterministic finite automaton recognize regular sets.
• Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.

• First, we define regular expressions for tokens; then we convert them into a DFA to get a
lexical analyzer for our tokens.

2.3.3 Non-Deterministic Finite Automaton (NFA)


• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
– S - a set of states
– Σ - a set of input symbols (alphabet)
– move - a transition function move to map state-symbol pairs to sets of states.
– s0 - a start (initial) state
– F- a set of accepting states (final states)
• ε- transitions are allowed in NFAs. In other words, we can move from one state to another one
without consuming any symbol.
• A NFA accepts a string x, if and only if there is a path from the starting state to one of
accepting states such that edge labels along this path spell out x.

Example:

[Transition graph: start → 0; 0 -a→ 0, 0 -b→ 0, 0 -a→ 1, 1 -b→ 2; state 2 is final]

0 is the start state s0


{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}

Transition Function:
a b
0 {0,1} {0}
1 {} {2}
2 {} {}

The language recognized by this NFA is (a|b)*ab
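
The acceptance rule can be checked by tracking the set of all states the NFA could be in after each symbol. The following is a minimal sketch for this particular NFA; the names MOVE, START, FINAL and nfa_accepts are chosen for illustration, and ε-transitions are not handled because this example has none.

```python
# Transition function of the example NFA: (state, symbol) -> set of states.
# Missing entries stand for the empty set {}.
MOVE = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
START = 0
FINAL = {2}


def nfa_accepts(text):
    """Return True if some path labelled by text leads from START to a final state."""
    current = {START}
    for symbol in text:
        next_states = set()
        for state in current:
            next_states |= MOVE.get((state, symbol), set())
        current = next_states
    return bool(current & FINAL)


for word in ["ab", "aab", "babab", "", "a", "aba"]:
    print(repr(word), nfa_accepts(word))
```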

2.3.4 Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of a NFA.


• No state has ε- transition
• For each symbol a and state s, there is at most one labeled edge a leaving s. i.e. transition
function is from pair of state-symbol to state (not set of states)

Example:

The DFA to recognize the language (a|b)* ab is as follows.

[Transition graph: start → 0; 0 -a→ 1, 0 -b→ 0, 1 -a→ 1, 1 -b→ 2, 2 -a→ 1, 2 -b→ 0; state 2 is final]

0 is the start state s0


{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}

Transition Function:

a b
0 1 0
1 1 2
2 1 0

Note that the entries in this function are single value and not set of values (unlike NFA).
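
Because every table entry is a single state, running a DFA needs only one current state. A minimal sketch of this table-driven recognizer follows; the dictionary layout and the name dfa_accepts are assumptions made for illustration.

```python
# Transition table of the example DFA: state -> symbol -> next state.
TRANSITIONS = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
START = 0
FINAL = {2}


def dfa_accepts(text):
    """Run the DFA over text; reject on any symbol outside the alphabet."""
    state = START
    for symbol in text:
        if symbol not in TRANSITIONS[state]:
            return False
        state = TRANSITIONS[state][symbol]
    return state in FINAL


for word in ["ab", "aab", "babab", "", "a", "aba"]:
    print(repr(word), dfa_accepts(word))
```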

2.3.5 Converting RE to NFA (Thompson’s Construction)

• This is one way to convert a regular expression into an NFA.
• There can be other (more efficient) ways to do the conversion.
• Thompson’s Construction is a simple and systematic method.
• It guarantees that the resulting NFA will have exactly one final state, and one start state.
• Construction starts from simplest parts (alphabet symbols).
• To create a NFA for a complex regular expression, NFAs of its sub-expressions are
combined to create its NFA.
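• To recognize an empty string ε:

i -ε→ f

• To recognize a symbol a in the alphabet Σ:

i -a→ f

• For regular expression r1 | r2:

i -ε→ N(r1) -ε→ f
i -ε→ N(r2) -ε→ f

N(r1) and N(r2) are NFAs for regular expressions r1 and r2.

• For regular expression r1 r2:

i → N(r1) → N(r2) → f

Here, the start state of N(r1) becomes the start state of N(r1r2), and the final state of
N(r2) becomes the final state of N(r1r2).

• For regular expression r*:

i -ε→ N(r) -ε→ f
i -ε→ f (zero occurrences of r)
final state of N(r) -ε→ start state of N(r) (repetition)
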
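Example:
For the RE (a|b)*a, the NFA is constructed in steps.

[Figure: NFAs for a and for b; these are combined into the NFA for (a|b); adding the ε-transitions for closure gives the NFA for (a|b)*; concatenating the NFA for a gives the NFA for (a|b)*a.]
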
2.3.6 Converting NFA to DFA (Subset Construction)

We merge together NFA states by looking at them from the point of view of the input characters:

• From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an ε-transition will be represented
by the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can
regard a transition on a symbol as moving from a state to a set of states (ie. the union of
all those states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.

To perform this operation, let us define two functions:

• The ε-closure function takes a state and returns the set of states reachable from it
using zero or more ε-transitions. Note that this will always include the state itself.
We should be able to get from a state to any state in its ε-closure without consuming
any input.
• The function move takes a state and a character, and returns the set of states reachable
by one transition on this character.

We can generalise both these functions to apply to sets of states by taking the union of
the application to individual states.

For example, if A, B and C are states, move({A,B,C},`a') = move(A,`a') ∪ move(B,`a') ∪ move(C,`a').

The Subset Construction Algorithm is as follows:

put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1,a))
        if (S2 is not in DS) then
            add S2 into DS as an unmarked state
        transfunc[S1,a] ← S2
    end
end

• a state S in DS is an accepting state of DFA if a state in S is an accepting state of NFA


• the start state of DFA is ε-closure({s0})
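Example:

NFA for (a|b)*a, with start state 0 and accepting state 8:
ε-transitions: 0→1, 0→7, 1→2, 1→4, 3→6, 5→6, 6→1, 6→7
a-transitions: 2→3, 7→8
b-transitions: 4→5
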
S0 = ε-closure({0}) = {0,1,2,4,7}    S0 into DS as an unmarked state
⇓ mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1    S1 into DS
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2    S2 into DS
transfunc[S0,a] ← S1    transfunc[S0,b] ← S2
⇓ mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1    transfunc[S1,b] ← S2
⇓ mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1    transfunc[S2,b] ← S2

S0 is the start state of DFA since 0 is a member of S0 = {0,1,2,4,7}
S1 is an accepting state of DFA since 8 is a member of S1 = {1,2,3,4,6,7,8}

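[Transition graph of the resulting DFA: S0 -a→ S1, S0 -b→ S2, S1 -a→ S1, S1 -b→ S2, S2 -a→ S1, S2 -b→ S2; S1 is the accepting state]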
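
The worked example can be reproduced with a short program. This is a sketch under stated assumptions: the NFA is encoded by hand in the EPS and SYM dictionaries (names invented here) from the transitions used in the computation above, and a queue of unmarked states replaces the marking step. If move returned an empty set, the sketch would add the empty set as a dead state; that does not happen for this NFA.

```python
from collections import deque

# NFA of the example, encoded by hand.
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}   # ε-transitions
SYM = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}}       # labelled transitions
NFA_START, NFA_FINAL, ALPHABET = 0, {8}, ("a", "b")


def eps_closure(states):
    """All states reachable from states using zero or more ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for nxt in EPS.get(state, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)


def move(states, symbol):
    """Union of the one-symbol transitions from every state in states."""
    result = set()
    for state in states:
        result |= SYM.get((state, symbol), set())
    return result


def subset_construction():
    start = eps_closure({NFA_START})
    dfa_states = [start]              # DS, in order of discovery
    transfunc = {}
    unmarked = deque([start])
    while unmarked:
        s1 = unmarked.popleft()       # taking S1 off the queue marks it
        for symbol in ALPHABET:
            s2 = eps_closure(move(s1, symbol))
            if s2 not in dfa_states:
                dfa_states.append(s2)
                unmarked.append(s2)
            transfunc[(s1, symbol)] = s2
    accepting = [s for s in dfa_states if s & NFA_FINAL]
    return start, dfa_states, transfunc, accepting


start, dfa_states, transfunc, accepting = subset_construction()
for index, state in enumerate(dfa_states):
    print(f"S{index} =", sorted(state))
```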

2.4 Lexical Analyzer Generator

Regular Expressions → Lexical Analyzer Generator → Lexical Analyzer

Source Program → Lexical Analyzer → Tokens

LEX is an example of Lexical Analyzer Generator.

2.4.1 Input to LEX

• The input to LEX consists primarily of Auxiliary Definitions and Translation Rules.
• Writing a regular expression for some languages can be difficult, because the
expression can become quite complex. In those cases, we may use Auxiliary Definitions.
• We can give names to regular expressions, and we can use these names as symbols
to define other regular expressions.
• An Auxiliary Definition is a sequence of the definitions of the form:
d1 → r1
d2 → r2
.
.
dn → rn

where di is a distinct name and ri is a regular expression over the symbols
in Σ ∪ {d1, d2, ..., di-1}, i.e. the basic symbols and the previously defined names.



Example:
For Identifiers in Pascal
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter (letter | digit ) *

If we try to write the regular expression representing identifiers without using regular
definitions, that regular expression will be complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *

Example:
For Unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit +
opt-fraction → ( . digits ) ?
opt-exponent → ( E (+|-)? digits ) ?
unsigned-num → digits opt-fraction opt-exponent

• Translation Rules comprise an ordered list of Regular Expressions, each paired with the
Program Code to be executed when that Regular Expression is matched.

R1 P1
R2 P2
.
.
Rn Pn

• The list is ordered i.e. the RE’s should be checked in order. If a string matches more than
one RE, the RE occurring higher in the list should be given preference and its Program
Code is executed.

2.4.2 Implementation of LEX

•The Regular Expressions are converted into NFA’s. The final states of each NFA correspond
to some RE and its Program Code.
•Different NFA’s are then converted to a single NFA with epsilon moves. Each final state of
the NFA corresponds one-to-one to some final state of individual NFA’s i.e. some RE and its
Program Code. The final states have an order according to the corresponding RE’s. If more
than one final state is entered for some string, then the one that is higher in order is selected.
•This NFA is then converted to DFA. Each final state of DFA corresponds to a set of states
(having at least one final state) of the NFA. The Program Code of each final state (of the DFA)
is the program code corresponding to the final state that is highest in order out of all the final
states in the set of states (of NFA) that make up this final state (of DFA).

Example:

AUXILIARY DEFINITIONS
(none)

TRANSLATION RULES

a {Action1}
abb {Action2}
a*b+ {Action3}
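First we construct an NFA for each RE:

a { action1 }: start → 1 -a→ 2 (2 is final)
abb { action2 }: start → 3 -a→ 4 -b→ 5 -b→ 6 (6 is final)
a*b+ { action3 }: start → 7 -b→ 8, with 7 -a→ 7 and 8 -b→ 8 (8 is final)

These are then combined into a single NFA by adding a new start state 0 with
ε-transitions to states 1, 3 and 7.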

This NFA is now converted into a DFA. The transition table of the resulting DFA is as follows:

State   a     b     Token found
0137    247   8     None
247     7     58    a
8       -     8     a*b+
7       7     8     None
58      -     68    a*b+
68      -     8     abb

3. BASICS OF SYNTAX ANALYSIS

• Syntax Analyzer creates the syntactic structure of the given source program.
• This syntactic structure is mostly a parse tree.
• Syntax Analyzer is also known as parser.
• The syntax of a programming language is described by a context-free grammar (CFG). We will
use BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a context-free grammar or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
• A context-free grammar
– gives a precise syntactic specification of a programming language.
– the design of the grammar is an initial phase of the design of a compiler.
– a grammar can be directly converted into a parser by some tools.

3.1 Parser

• Parser works on a stream of tokens.


• The smallest item is a token.

source program → Lexical Analyzer → (token) → Parser → parse tree
                 Lexical Analyzer ← (get next token) ← Parser

• We categorize the parsers into two groups:


• Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
• Bottom-Up Parser
– the parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers scan the input from left to right (one symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for sub-classes of
context-free grammars.
– LL for top-down parsing
– LR for bottom-up parsing

3.2 Context Free Grammars

• Inherently recursive structures of a programming language are defined by a context-free
grammar.
• In a context-free grammar, we have:
– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic-variables)
– A finite set of production rules of the following form
A → α where A is a non-terminal and
α is a string of terminals and non-terminals (including the empty string)


– A start symbol (one of the non-terminal symbol)
• L(G) is the language of G (the language generated by G) which is a set of sentences.
• A sentence of L(G) is a string of terminal symbols of G.
• If S is the start symbol of G then
ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G
(⇒+ means "derives in one or more steps").
• If G is a context-free grammar, L(G) is a context-free language.
• Two grammars are equivalent if they produce the same language.
• S ⇒* α (α is derived from S in zero or more steps)
- If α contains non-terminals, it is called a sentential form of G.
- If α does not contain non-terminals, it is called a sentence of G.

3.2.1 Derivations

Example:
E → E + E | E – E | E * E | E / E | - E
E → ( E )
E → id

• E ⇒ E+E means that E derives E+E
– we can replace E by E+E
– to be able to do this, we must have a production rule E→E+E in our grammar.
• E ⇒ E+E ⇒ id+E ⇒ id+id is a sequence of replacements of non-terminal symbols; such a
sequence is called a derivation of id+id from E.
• In general a derivation step is
αAβ ⇒ αγβ if there is a production rule A→γ in our grammar
where α and β are arbitrary strings of terminal and non-terminal
symbols

α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1 or α1 derives αn )

• At each derivation step, we can choose any of the non-terminal in the sentential form of G for
the replacement.
• If we always choose the left-most non-terminal in each derivation step, this derivation is called
as left-most derivation.

Example:

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

• If we always choose the right-most non-terminal in each derivation step, this derivation is
called as right-most derivation.

Example:

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

• We will see that the top-down parsers try to find the left-most derivation of the given source
program.
• We will see that the bottom-up parsers try to find the right-most derivation of the given source
program in the reverse order.

3.2.2 Parse Tree

• Inner nodes of a parse tree are non-terminal symbols.


• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.

Example:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

[Figure: the parse tree grows with each derivation step. The final parse tree for -(id+id) is:]

E
├── -
└── E
    ├── (
    ├── E
    │   ├── E
    │   │   └── id
    │   ├── +
    │   └── E
    │       └── id
    └── )

3.2.3 Ambiguity

• A grammar that produces more than one parse tree for some sentence is called an ambiguous
grammar.
• For most parsers, the grammar must be unambiguous.
• An unambiguous grammar gives a unique parse tree for each sentence.
• We should eliminate the ambiguity in the grammar during the design phase of the compiler.
• To do so, we choose one of the parse trees of an ambiguous sentence as the preferred one
and rewrite the grammar so that it allows only this choice.
• Ambiguous grammars (because of ambiguous operators) can be disambiguated according to
the precedence and associativity rules.

Example:

To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use precedence of
operators as follows:

^ (right to left)
* (left to right)
+ (left to right)

We get the following unambiguous grammar:


E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)

3.3 Left Recursion


• A grammar is left recursive if it has a non-terminal A such that there is a derivation:

A ⇒+ Aα for some string α (a derivation of one or more steps)


• Top-down parsing techniques cannot handle left-recursive grammars.
• So, we have to convert our left-recursive grammar into an equivalent grammar which is not
left-recursive.
• The left-recursion may appear in a single step of the derivation (immediate left-recursion), or
may appear in more than one step of the derivation.

3.3.1 Immediate Left-Recursion


A→Aα| β where β does not start with A
⇓ Eliminate immediate left recursion
A → β A’
A’ → α A’ | ε an equivalent grammar

In general,
A → A α1 | ... | A αm | β1 | ... | βn where β1 ... βn do not start with A
⇓ Eliminate immediate left recursion
A → β1 A’ | ... | βn A’
A’ → α1 A’ | ... | αm A’ | ε an equivalent grammar

Example:

E → E+T | T
T → T*F | F
F → id | (E)

⇓ Eliminate immediate left recursion

E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
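
The rewriting rule can be expressed as a small function. This is an illustrative sketch, not a complete grammar tool: the representation (each alternative is a list of symbols, the empty list stands for ε) and the name eliminate_immediate are assumptions, and the new non-terminal is written with an ASCII apostrophe.

```python
def eliminate_immediate(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn
    as      A  -> b1 A' | ... | bn A'
            A' -> a1 A' | ... | am A' | ε   (ε is the empty list)."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}      # nothing to do
    new_name = nonterminal + "'"
    return {
        nonterminal: [beta + [new_name] for beta in others],
        new_name: [alpha + [new_name] for alpha in recursive] + [[]],
    }


print(eliminate_immediate("E", [["E", "+", "T"], ["T"]]))
print(eliminate_immediate("T", [["T", "*", "F"], ["F"]]))
print(eliminate_immediate("F", [["id"], ["(", "E", ")"]]))
```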

• A grammar may not be immediately left-recursive and yet still be left-recursive.
• So by eliminating only the immediate left-recursion, we may not get a grammar that is
free of left-recursion.

Example:

S → Aa | b
A → Sc | d

This grammar is not immediately left-recursive, but it is still left-recursive.

S ⇒ Aa ⇒ Sca

Or

A ⇒ Sc ⇒ Aac

Either derivation shows a left-recursion.

• So, we have to eliminate all left-recursions from our grammar.

3.3.2 Elimination

Arrange non-terminals in some order: A1 ... An

for i from 1 to n do {
    for j from 1 to i-1 do {
        replace each production
            Ai → Aj γ
        by
            Ai → α1 γ | ... | αk γ
        where Aj → α1 | ... | αk
    }
    eliminate immediate left-recursions among Ai productions
}

Example:

S → Aa | b
A → Ac | Sd | f

Case 1: Order of non-terminals: S, A

for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.

for A:
- Replace A → Sd with A → Aad | bd
So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A
A → bdA’ | fA’
A’ → cA’ | adA’ | ε

So, the resulting equivalent grammar which is not left-recursive is:


S → Aa | b
A → bdA’ | fA’
A’ → cA’ | adA’ | ε

Case 2: Order of non-terminals: A, S

for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A → SdA’ | fA’
A’ → cA’ | ε

for S:
- Replace S → Aa with S → SdA’a | fA’a
So, we will have S → SdA’a | fA’a | b
- Eliminate the immediate left-recursion in S
S → fA’aS’ | bS’
S’ → dA’aS’ | ε

So, the resulting equivalent grammar which is not left-recursive is:


S → fA’aS’ | bS’

S’ → dA’aS’ | ε
A → SdA’ | fA’
A’ → cA’ | ε
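
Both cases can be reproduced with a short program that follows the algorithm step by step. The sketch below makes its own assumptions: a grammar is a dictionary from non-terminal to a list of alternatives, each alternative is a list of symbols, the empty list stands for ε, and primed names use an ASCII apostrophe. The function names are invented for this illustration.

```python
def eliminate_immediate(nonterminal, alternatives):
    """A -> A a | b  becomes  A -> b A',  A' -> a A' | ε (empty list)."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}
    new_name = nonterminal + "'"
    return {
        nonterminal: [beta + [new_name] for beta in others],
        new_name: [alpha + [new_name] for alpha in recursive] + [[]],
    }


def eliminate_left_recursion(grammar, order):
    """Apply the ordering algorithm; order lists the non-terminals A1 .. An."""
    grammar = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            expanded = []
            for alt in grammar[a_i]:
                if alt and alt[0] == a_j:
                    # Ai -> Aj γ  is replaced by  Ai -> α1 γ | ... | αk γ
                    expanded.extend(alpha + alt[1:] for alpha in grammar[a_j])
                else:
                    expanded.append(alt)
            grammar[a_i] = expanded
        grammar.update(eliminate_immediate(a_i, grammar[a_i]))
    return grammar


GRAMMAR = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], ["f"]]}
print(eliminate_left_recursion(GRAMMAR, ["S", "A"]))   # Case 1
print(eliminate_left_recursion(GRAMMAR, ["A", "S"]))   # Case 2
```

The two orders give different but equivalent grammars, matching the two cases worked out by hand.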

3.4 Left Factoring


• A predictive parser (a top-down parser without backtracking) insists that the grammar
must be left-factored.

grammar → a new equivalent grammar suitable for predictive parsing

stmt → if expr then stmt else stmt | if expr then stmt

• when we see if, we cannot know which production rule to choose to re-write stmt in the
derivation
• In general,

A → αβ1 | αβ2 where α is non-empty and the first symbols
of β1 and β2 (if they have one) are different.
• when processing α we cannot know whether to expand
A to αβ1 or
A to αβ2
• But, if we re-write the grammar as follows
A → αA’
A’ → β1 | β2 so, we can immediately expand A to αA’

3.4.1 Algorithm

• For each non-terminal A with two or more alternatives (production rules) with a common
non-empty prefix α, say

A → αβ1 | ... | αβn | γ1 | ... | γm

convert it into

A → αA’ | γ1 | ... | γm
A’ → β1 | ... | βn

Example:

A → abB | aB | cdg | cdeB | cdfB
⇓
A → aA’ | cdg | cdeB | cdfB
A’ → bB | B
⇓
A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB

Example:

A → ad | a | ab | abc | b
⇓
A → aA’ | b
A’ → d | ε | b | bc
⇓
A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c
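
Both examples can be reproduced with a small program. The sketch below is one possible way to implement the algorithm, under assumptions of its own: alternatives are lists of symbols, the empty list stands for ε, alternatives are grouped by their first symbol, each group is factored on its longest common prefix, and new non-terminals are named by appending ASCII apostrophes until the name is unused. The function names are invented for illustration.

```python
def common_prefix(alternatives):
    """Longest sequence of symbols shared by the start of every alternative."""
    prefix = []
    for symbols in zip(*alternatives):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(grammar):
    grammar = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    pending = list(grammar)
    while pending:
        name = pending.pop(0)
        groups = {}                   # first symbol -> alternatives starting with it
        for alt in grammar[name]:
            groups.setdefault(alt[0] if alt else None, []).append(alt)
        new_alts = []
        for first, group in groups.items():
            if first is None or len(group) == 1:
                new_alts.extend(group)            # nothing to factor
                continue
            prefix = common_prefix(group)
            fresh = name + "'"
            while fresh in grammar:               # find an unused primed name
                fresh += "'"
            grammar[fresh] = [alt[len(prefix):] for alt in group]
            pending.append(fresh)                 # the new rules may need factoring too
            new_alts.append(prefix + [fresh])
        grammar[name] = new_alts
    return grammar


print(left_factor({"A": [["a", "b", "B"], ["a", "B"], ["c", "d", "g"],
                         ["c", "d", "e", "B"], ["c", "d", "f", "B"]]}))
print(left_factor({"A": [["a", "d"], ["a"], ["a", "b"], ["a", "b", "c"], ["b"]]}))
```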

3.5 YACC

YACC generates C code for a syntax analyzer, or parser. YACC uses grammar rules that allow it to
analyze tokens from LEX and create a syntax tree. A syntax tree imposes a hierarchical structure on
tokens. For example, operator precedence and associativity are apparent in the syntax tree. The next
step, code generation, does a depth-first walk of the syntax tree to generate code. Some compilers
produce machine code, while others output assembly.

YACC takes a default action when there is a conflict. For shift-reduce conflicts, YACC will shift. For
reduce-reduce conflicts, it will use the first rule in the listing. It also issues a warning message whenever
a conflict exists. The warnings may be suppressed by making the grammar unambiguous.

... definitions ...


%%
... rules ...
%%
... subroutines ...

Input to YACC is divided into three sections. The definitions section consists of token declarations,
and C code bracketed by “%{“ and “%}”. The BNF grammar is placed in the rules section, and user
subroutines are added in the subroutines section.
