Short Notes
A compiler is a program that takes a program written in a source language and translates it into an
equivalent program in a target language.
This subject discusses the various techniques used to achieve this objective. In addition to the
development of a compiler, the techniques used in compiler design can be applicable to many
problems in computer science.
o Techniques used in a lexical analyzer can be used in text editors, information
retrieval systems, and pattern recognition programs.
o Techniques used in a parser can be used in a query processing system such as
SQL.
o Software with a complex front-end may need techniques used in compiler
design. For example, a symbolic equation solver takes an equation as
input; such a program must parse the given input equation.
o Most of the techniques used in compiler design can be used in Natural Language
Processing (NLP) systems.
Each phase of a compiler transforms the source program from one representation into another.
All phases communicate with the error handler and the symbol table.
• Lexical Analyzer reads the source program character by character and returns the tokens
of the source program.
• A token describes a pattern of characters having the same meaning in the source program.
(such as identifiers, operators, keywords, numbers, delimiters and so on)
Example:
In the line of code newval := oldval + 12, tokens are:
newval (identifier)
:= (assignment operator)
oldval (identifier)
+ (add operator)
12 (a number)
• Puts information about identifiers into the symbol table.
• Regular expressions are used to describe tokens (lexical constructs).
• A (Deterministic) Finite State Automaton can be used in the implementation of a lexical
analyzer.
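As a sketch of how regular expressions drive a lexical analyzer, the example line can be tokenized with Python's re module. The token names below are hypothetical; a real lexer would be generated from such patterns by a tool like LEX:

```python
import re

# Hypothetical token patterns for the example line newval := oldval + 12.
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),           # a number
    ('ASSIGN', r':='),            # assignment operator
    ('PLUS',   r'\+'),            # add operator
    ('IDENT',  r'[A-Za-z_]\w*'),  # identifier
    ('SKIP',   r'\s+'),           # whitespace (not a token)
]
MASTER = re.compile('|'.join(f'(?P<{name}>{pat})' for name, pat in TOKEN_SPEC))

def tokenize(line):
    """Yield (token-name, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(line):
        if m.lastgroup != 'SKIP':
            yield (m.lastgroup, m.group())

print(list(tokenize('newval := oldval + 12')))
# [('IDENT', 'newval'), ('ASSIGN', ':='), ('IDENT', 'oldval'),
#  ('PLUS', '+'), ('NUMBER', '12')]
```

The order of the patterns matters: earlier patterns are preferred, just as LEX prefers the rule occurring higher in its listing.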
• A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given
program.
• A syntax analyzer is also called a parser.
• A parse tree describes a syntactic structure.
Example:
For the line of code newval := oldval + 12, the parse tree will be:

assignment
├── identifier (newval)
├── :=
└── expression
    ├── expression
    │   └── identifier (oldval)
    ├── +
    └── expression
        └── number (12)
Example:
The CFG used for the above parse tree is:
assignment → identifier := expression
expression → identifier
expression → number
expression → expression + expression
• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing,
– Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves, and proceeds towards the root.
– Normally efficient bottom-up parsers are created with the help of some software tools.
– Bottom-up parsing is also known as shift-reduce parsing.
– Operator-Precedence Parsing – simple, restrictive, easy to implement
– LR Parsing – a more general form of shift-reduce parsing: LR, SLR, LALR
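As a sketch of a hand-written top-down parser, here is a recursive predictive parser for a small expression grammar (E → T E', E' → +T E' | ε, T → F T', T' → *F T' | ε, F → id | (E)); the token stream is assumed to be already produced by a lexer:

```python
# A recursive predictive parser: one function per non-terminal,
# choosing a production by looking at the next token.
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f'expected {tok!r}, got {peek()!r}')
        pos += 1

    def E():                        # E  -> T E'
        T(); Eprime()

    def Eprime():                   # E' -> + T E' | ε
        if peek() == '+':
            eat('+'); T(); Eprime()

    def T():                        # T  -> F T'
        F(); Tprime()

    def Tprime():                   # T' -> * F T' | ε
        if peek() == '*':
            eat('*'); F(); Tprime()

    def F():                        # F  -> ( E ) | id
        if peek() == '(':
            eat('('); E(); eat(')')
        else:
            eat('id')

    E()
    return pos == len(tokens)       # True iff the whole input was derived

print(parse(['id', '+', 'id', '*', 'id']))   # True
```

Note that the grammar has no left recursion; a left-recursive rule such as E → E + T would make E() call itself forever, which is why top-down parsers require left recursion to be eliminated first (see Section 3.3).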
• A semantic analyzer checks the source program for semantic errors and collects the type
information for the code generation.
• Type-checking is an important part of the semantic analyzer.
• Normally, semantic information cannot be represented by the context-free grammars used in
syntax analysis.
• Context-free grammars used in the syntax analysis are integrated with attributes (semantic
rules). The result is a syntax-directed translation, described by attribute grammars.
Example:
In the line of code newval := oldval + 12, the type of the identifier newval must match
with type of the expression (oldval+12).
• A compiler may produce an explicit intermediate code representing the source program.
• These intermediate codes are generally machine architecture independent. But the level of
intermediate codes is close to the level of machine codes.
Example:
For an assignment of the form id1 := id2 * id3 + 1 (the form the final code below is
generated from), a typical three-address intermediate code is:
temp1 := id2 * id3
temp2 := temp1 + 1
id1 := temp2
• The code optimizer optimizes the code produced by the intermediate code generator in the
terms of time and space.
Example:
The above piece of intermediate code can be reduced as follows:
temp1 := id2 * id3
id1 := temp1 + 1
• The target program is normally a relocatable object file containing the machine codes.
Example:
Assuming an architecture whose instructions have at least one operand in a
machine register, the final code for our assignment will be:
MOVE id2, R1
MULT id3, R1
ADD #1, R1
MOVE R1, id1
Phases of a compiler are the sub-tasks that must be performed to complete the compilation
process. Passes refer to the number of times the compiler has to traverse through the entire
program.
There are three languages involved in a single compiler- the source language (S), the target
language (A) and the language in which the compiler is written (L).
C_L^{S→A} (a compiler from S to A, written in L)
The language of the compiler and the target language are usually the language of the computer
on which it is working.
C_A^{S→A}
If a compiler is written in its own language (L = S), the problem is how to compile the first
compiler. For this we take a language R which is a small part of language S. We write a
compiler for R in the language of the computer A. The compiler for S is written in R and compiled
with the compiler of R to make a full-fledged compiler of S. This is known as Bootstrapping.
A Cross Compiler is a compiler that runs on one machine (A) and produces code for another
machine (B).
C_A^{S→B}
2. LEXICAL ANALYSIS
Lexical Analyzer reads the source program character by character to produce tokens.
Normally a lexical analyzer does not return a list of tokens in one shot; it returns a token
when the parser asks for one.
2.1 Token
2.2 Languages
2.2.1 Terminology
Examples:
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• L1^3 = all strings of length three over {a,b,c,d}
• L1* = all strings over {a,b,c,d}, including the empty string
• L1+ = same as L1*, but does not include the empty string
• (r)+ = (r)(r)*
• (r)? = (r) | ε
• We may remove parentheses by using precedence rules.
– * highest
– concatenation next
– | lowest
• ab*|c means (a(b)*)|(c)
Examples:
– Σ = {0,1}
– 0|1 = {0,1}
– (0|1)(0|1) = {00,01,10,11}
– 0* = {ε, 0, 00, 000, 0000, ....}
– (0|1)* = all strings with 0 and 1, including the empty string
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a
sentence of that language, and “no” otherwise.
• The recognizer for tokens is implemented as a finite automaton.
• A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
• This means that we may use a deterministic or non-deterministic automaton as a lexical
analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; Then we convert them into a DFA to get a
lexical analyzer for our tokens.
Example:
An NFA (start state 0, accepting state 2) recognizing strings that end in ab:
[transition graph omitted]
Transition Function:
        a       b
0     {0,1}   {0}
1     {}      {2}
2     {}      {}
Example:
The equivalent DFA (start state 0, accepting state 2):
[transition graph omitted]
Transition Function:
      a   b
0     1   0
1     1   2
2     1   0
Note that the entries in this function are single states and not sets of states (unlike an NFA).
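Assuming the transition table above (for an automaton accepting strings that end in ab), a table-driven recognizer is only a few lines of Python:

```python
# Table-driven DFA recognition: start in state 0, follow one
# transition per input character, accept if we end in state 2.
TRANS = {
    (0, 'a'): 1, (0, 'b'): 0,
    (1, 'a'): 1, (1, 'b'): 2,
    (2, 'a'): 1, (2, 'b'): 0,
}
ACCEPTING = {2}

def recognize(s):
    state = 0
    for ch in s:
        if (state, ch) not in TRANS:   # no transition on this character
            return False
        state = TRANS[(state, ch)]
    return state in ACCEPTING

print(recognize('aab'))   # True
print(recognize('aba'))   # False
```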
• For ε: an NFA with a start state i, an accepting state f, and one ε-transition from i to f.
• For a symbol a: an NFA with a start state i, an accepting state f, and one a-transition from i to f.
• For regular expression r1 | r2: a new start state i with ε-transitions to the start states of
N(r1) and N(r2), and ε-transitions from their accepting states to a new accepting state f.
N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
• For regular expression r1 r2: N(r1) and N(r2) are connected in sequence: i is the start state
of N(r1), the accepting state of N(r1) is merged with the start state of N(r2), and f is the
accepting state of N(r2).
• For regular expression (r)*: a new start state i and a new accepting state f, with
ε-transitions from i to the start state of N(r) and to f, and from the accepting state of
N(r) back to its start state and to f.
[the corresponding NFA diagrams are omitted]
Example:
For the RE (a|b)*a, the NFA is constructed bottom-up: first the NFAs for a and for b, then
N(a|b) by the union rule, then N((a|b)*) by the star rule, and finally N((a|b)*a) by
concatenating with the NFA for a. [construction diagrams omitted]
2.3.6 Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input characters:
• From the point of view of the input, any two states that are connected by an ε-transition
may as well be the same, since we can move from one to the other without consuming
any character. Thus states which are connected by an ε-transition will be represented
by the same states in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can
regard a transition on a symbol as moving from a state to a set of states (ie. the union of
all those states reachable by a transition on the current symbol). Thus these states will be
combined into a single DFA state.
• The ε-closure function takes a state and returns the set of states reachable from it
based on (one or more) ε-transitions. Note that this will always include the state itself.
We should be able to get from a state to any state in its ε-closure without consuming
any input.
• The function move takes a state and a character, and returns the set of states reachable
by one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of
the application to individual states.
S0 ← ε-closure({s0}); put S0 into DS as an unmarked state
while (there is an unmarked state S1 in DS) do begin
  mark S1
  for each input symbol a do begin
    S2 ← ε-closure(move(S1,a))
    if (S2 is not in DS) then
      add S2 into DS as an unmarked state
    transfunc[S1,a] ← S2
  end
end
Example:
The NFA for (a|b)*a (start state 0, accepting state 8) has the transitions:
0 –ε→ 1, 7    1 –ε→ 2, 4    2 –a→ 3    4 –b→ 5
3 –ε→ 6    5 –ε→ 6    6 –ε→ 1, 7    7 –a→ 8
S0 = ε-closure({0}) = {0,1,2,4,7}    S0 into DS as an unmarked state
⇓ mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1    S1 into DS
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2    S2 into DS
transfunc[S0,a] ← S1    transfunc[S0,b] ← S2
⇓ mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S1,a] ← S1    transfunc[S1,b] ← S2
⇓ mark S2
ε-closure(move(S2,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1
ε-closure(move(S2,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2
transfunc[S2,a] ← S1    transfunc[S2,b] ← S2
S0 is the start state of DFA since 0 is a member of S0={0,1,2,4,7}
S1 is an accepting state of DFA since 8 is a member of S1 = {1,2,3,4,6,7,8}
The resulting DFA:
S0 –a→ S1    S0 –b→ S2
S1 –a→ S1    S1 –b→ S2
S2 –a→ S1    S2 –b→ S2
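The subset construction above can be sketched directly in Python. The NFA encoded below is the one for (a|b)*a from the example (start state 0, accepting state 8); the helper names are illustrative:

```python
# ε-transitions and symbol transitions of the example NFA for (a|b)*a.
NFA_EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
NFA_SYM = {(2, 'a'): {3}, (4, 'b'): {5}, (7, 'a'): {8}}

def eps_closure(states):
    """All states reachable from `states` via ε-transitions (incl. themselves)."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA_EPS.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(states, a):
    """States reachable from `states` by one transition on symbol a."""
    return set().union(*(NFA_SYM.get((s, a), set()) for s in states))

def subset_construction(start, symbols):
    s0 = eps_closure({start})
    dstates, unmarked, transfunc = {s0}, [s0], {}
    while unmarked:
        s1 = unmarked.pop()                    # mark S1
        for a in symbols:
            s2 = eps_closure(move(s1, a))
            if s2 not in dstates:              # add S2 as an unmarked state
                dstates.add(s2)
                unmarked.append(s2)
            transfunc[(s1, a)] = s2
    return s0, dstates, transfunc

s0, dstates, trans = subset_construction(0, 'ab')
print(sorted(s0))                 # [0, 1, 2, 4, 7]        = S0
print(sorted(trans[(s0, 'a')]))   # [1, 2, 3, 4, 6, 7, 8]  = S1
print(len(dstates))               # 3 DFA states: S0, S1, S2
```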
• The input to LEX consists primarily of Auxiliary Definitions and Translation Rules.
• Writing the regular expressions for some languages can be difficult, because those regular
expressions can be quite complex. In such cases, we may use Auxiliary Definitions.
• We can give names to regular expressions, and we can use these names as symbols
to define other regular expressions.
• An Auxiliary Definition is a sequence of the definitions of the form:
d1 →
r1 d2
→ r2
.
.
dn → rn
If we try to write the regular expression representing identifiers without using regular
definitions, that regular expression will be complex.
(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) ) *
Example:
For unsigned numbers in Pascal:
digit → 0 | 1 | ... | 9
digits → digit+
opt-fraction → ( . digits )?
opt-exponent → ( E (+|-)? digits )?
unsigned-num → digits opt-fraction opt-exponent
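The regular definition above transliterates almost directly into a Python regular expression (a sketch, keeping only the uppercase E exponent marker from the definition):

```python
import re

# Build the regex from the named sub-definitions, mirroring
# digit / digits / opt-fraction / opt-exponent / unsigned-num.
digit        = r'[0-9]'
digits       = digit + r'+'
opt_fraction = r'(\.' + digits + r')?'
opt_exponent = r'(E[+-]?' + digits + r')?'
unsigned_num = re.compile(digits + opt_fraction + opt_exponent)

for s in ['6', '3.14', '6.02E23', '1E-5', '.5']:
    print(s, bool(unsigned_num.fullmatch(s)))
# 6 True / 3.14 True / 6.02E23 True / 1E-5 True / .5 False
```

Note that '.5' is rejected: the definition requires digits before the optional fraction.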
• Translation Rules comprise an ordered list of Regular Expressions and the Program Code to
be executed when the corresponding Regular Expression is encountered.
R1 P1
R2 P2
.
.
Rn Pn
• The list is ordered i.e. the RE’s should be checked in order. If a string matches more than
one RE, the RE occurring higher in the list should be given preference and its Program
Code is executed.
• The Regular Expressions are converted into NFAs. The final state of each NFA corresponds
to some RE and its Program Code.
• The different NFAs are then combined into a single NFA with epsilon moves. Each final state
of this NFA corresponds one-to-one to some final state of the individual NFAs, i.e. to some RE
and its Program Code. The final states are ordered according to the corresponding REs. If more
than one final state is entered for some string, the one that is higher in order is selected.
• This NFA is then converted to a DFA. Each final state of the DFA corresponds to a set of states
(containing at least one final state) of the NFA. The Program Code of each final state (of the
DFA) is the program code of the final state that is highest in order among all the final
states of the NFA that make up this final state of the DFA.
Example:
AUXILIARY DEFINITIONS
(none)
TRANSLATION RULES
a      {Action1}
abb    {Action2}
a*b+   {Action3}
First we construct an NFA for each RE, and then combine them into a single NFA by adding a
new start state 0 with ε-transitions to the start state of each individual NFA.
[NFA diagrams omitted]
This NFA is then converted into a DFA. [transition table omitted]
• Syntax Analyzer creates the syntactic structure of the given source program.
• This syntactic structure is mostly a parse tree.
• Syntax Analyzer is also known as parser.
• The syntax of a programming language is described by a context-free grammar (CFG). We will
use BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a context-free grammar or not.
– If it satisfies, the parser creates the parse tree of that program.
– Otherwise the parser gives the error messages.
• A context-free grammar
– gives a precise syntactic specification of a programming language.
– the design of the grammar is an initial phase of the design of a compiler.
– a grammar can be directly converted into a parser by some tools.
3.1 Parser
3.2.1 Derivations
Example:
E → E + E | E – E | E * E | E / E | - E
E → ( E )
E → id
• At each derivation step, we can choose any of the non-terminals in the sentential form of G for
the replacement.
• If we always choose the left-most non-terminal in each derivation step, the derivation is called
a left-most derivation.
Example:
• If we always choose the right-most non-terminal in each derivation step, the derivation is
called a right-most derivation.
Example:
• We will see that the top-down parsers try to find the left-most derivation of the given source
program.
• We will see that the bottom-up parsers try to find the right-most derivation of the given source
program in the reverse order.
The derivation, and the corresponding growth of its parse tree:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
[the partial parse-tree figures are omitted; the final tree has root E over - ( E ), with the
inner E deriving E + E and each of these Es deriving id]
3.2.3 Ambiguity
• A grammar that produces more than one parse tree for some sentence is called an ambiguous
grammar.
• For most parsers, the grammar must be unambiguous.
• Unambiguous grammar ⇒ unique selection of the parse tree for a sentence
• We should eliminate the ambiguity in the grammar during the design phase of the compiler.
• An unambiguous grammar should be written to eliminate the ambiguity.
• To disambiguate an ambiguous grammar, we prefer one of the parse trees of a sentence and
restrict the grammar to this choice.
• Ambiguous grammars (because of ambiguous operators) can be disambiguated according to
the precedence and associativity rules.
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use precedence of
operators as follows:
^ (right to left)
* (left to right)
+ (left to right)
In general,
A → A α1 | ... | A αm | β1 | ... | βn where β1 ... βn do not start with A
⇓ Eliminate immediate left recursion
A → β1 A’ | ... | βn A’
A’ → α1 A’ | ... | αm A’ | ε an equivalent grammar
Example:
E → E+T | T
T → T*F | F
F → id | (E)
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
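The immediate left-recursion transformation can be sketched in a few lines of Python. Productions are represented as lists of symbols, with the empty list standing for ε; the names are illustrative:

```python
# Eliminate immediate left recursion:
#   A -> A a1 | ... | A am | b1 | ... | bn
# becomes
#   A  -> b1 A' | ... | bn A'
#   A' -> a1 A' | ... | am A' | ε
def eliminate_immediate(head, productions):
    alphas = [p[1:] for p in productions if p and p[0] == head]
    betas  = [p for p in productions if not p or p[0] != head]
    if not alphas:                     # no immediate left recursion
        return {head: productions}
    new = head + "'"
    return {
        head: [beta + [new] for beta in betas],
        new:  [alpha + [new] for alpha in alphas] + [[]],   # [] is ε
    }

# E -> E + T | T   becomes   E -> T E'  and  E' -> + T E' | ε
print(eliminate_immediate('E', [['E', '+', 'T'], ['T']]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}
```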
Example:
S → Aa | b
A → Sc | d
S ⇒ Aa ⇒ Sca
or
A ⇒ Sc ⇒ Aac
causes an indirect left-recursion.
3.3.2 Elimination
- Arrange the non-terminals in some order: A1 ... An
for i from 1 to n do {
  for j from 1 to i-1 do {
    replace each production Ai → Aj γ by
      Ai → α1 γ | ... | αk γ
    where Aj → α1 | ... | αk
  }
  eliminate the immediate left-recursions among the Ai productions
}
Example:
S → Aa | b
A → Ac | Sd | f
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A → Sd with A → Aad | bd
So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A:
A → bdA' | fA'
A' → cA' | adA' | ε
If the non-terminals are instead ordered as A, S:
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A
A → SdA’ | fA’
A’ → cA’ | ε
for S:
- Replace S → Aa with S → SdA'a | fA'a
So, we will have S → SdA'a | fA'a | b
- Eliminate the immediate left-recursion in S:
S → fA'aS' | bS'
S' → dA'aS' | ε
The resulting equivalent grammar is:
S → fA'aS' | bS'
S' → dA'aS' | ε
A → SdA' | fA'
A' → cA' | ε
3.4.1 Algorithm
• For each non-terminal A with two or more alternatives (production rules) with a common
non-empty prefix α, let us say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA' | γ1 | ... | γm
A' → β1 | ... | βn
Example:
A → ad | a | ab | abc | b
⇓
A → aA’ | b
A’ → d | ε | b | bc
⇓
A → aA’ | b
A’ → d | ε | bA’’ A’’
→ε|c
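One left-factoring step can be sketched as grouping alternatives by their first symbol. This is a simplification: it factors a single leading symbol at a time, so it must be repeated until no common prefixes remain, exactly as in the example above; the fresh-name generation is naive:

```python
from collections import defaultdict

# One step of left factoring: alternatives that share a first symbol are
# replaced by a single alternative deferring the rest to a new
# non-terminal A'.  [] stands for ε.
def left_factor_once(head, alts):
    groups = defaultdict(list)
    for alt in alts:
        groups[alt[0] if alt else None].append(alt)
    out = {head: []}
    for key, group in groups.items():
        if key is None or len(group) == 1:
            out[head].extend(group)          # nothing to factor here
        else:
            new = head + "'"                 # naive fresh-name choice
            out[head].append([key, new])
            out[new] = [alt[1:] for alt in group]
    return out

# A -> ad | a | ab | abc | b  becomes  A -> aA' | b,  A' -> d | ε | b | bc
print(left_factor_once('A', [['a','d'], ['a'], ['a','b'], ['a','b','c'], ['b']]))
```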
3.5 YACC
YACC generates C code for a syntax analyzer, or parser. YACC uses grammar rules that allow it to
analyze tokens from LEX and create a syntax tree. A syntax tree imposes a hierarchical structure on
tokens. For example, operator precedence and associativity are apparent in the syntax tree. The next
step, code generation, does a depth-first walk of the syntax tree to generate code. Some compilers
produce machine code, while others output assembly.
YACC takes a default action when there is a conflict. For shift-reduce conflicts, YACC will shift. For
reduce-reduce conflicts, it will use the first rule in the listing. It also issues a warning message whenever
a conflict exists. The warnings may be suppressed by making the grammar unambiguous.
Input to YACC is divided into three sections. The definitions section consists of token declarations
and C code bracketed by "%{" and "%}". The BNF grammar is placed in the rules section, and user
subroutines are added in the subroutines section.
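A minimal sketch of this three-section layout (a hypothetical grammar accepting sums of NUMBER tokens; the NUMBER token and yylex are assumed to come from a LEX-generated lexer):

```yacc
%{
/* definitions section: C code and token declarations */
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%%
/* rules section: BNF grammar */
expr : expr '+' NUMBER
     | NUMBER
     ;
%%
/* subroutines section */
int main(void) { return yyparse(); }
```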