CD Unit 1
Compiler Design
Jeya R 2
Preliminaries Required
• Basic knowledge of programming languages.
• Knowledge of a high-level programming language for the programming assignments.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 1986.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
• Syntax-Directed Translation
• Attribute Definitions
• Run-Time Organization
• Code Optimization
• Code Generation
Compiler - Introduction
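• A compiler is a program that can read a program in one language (the source language) and translate it into an equivalent program in another language (the target language).
• While translating, the compiler reports any errors it detects in the source program as error messages.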
Compiler vs Interpreter
Compiler Applications
• Machine Code Generation
– Convert source language program to machine understandable one
– Takes care of semantics of varied constructs of source language
– Considers limitations and specific features of target machine
– Automata theory helps in syntactic checks that separate valid from invalid programs
– Compilation also generates code for syntactically correct programs
Other Applications
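• In addition to the development of a compiler, the techniques used in compiler
design apply to many other problems in computer science, for example query processing systems such as
SQL.
• Software with a complex front end may also need techniques used in
compiler design.
• Example: a symbolic equation solver, which takes an equation as input, has to parse that equation.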
Analysis and Synthesis
• In the analysis phase, an intermediate representation is created from the source program.
• In the synthesis phase, the target program is created from this intermediate representation.
Phases of A Compiler
Lexical Analyzer
• Lexical Analyzer reads the source program character by character and returns
the tokens of the source program.
• A token describes a pattern of characters having the same meaning in the source
program (such as identifiers, operators, keywords, numbers, delimiters and so
on).
Ex: newval := oldval + 12 => tokens:
    newval   identifier
    :=       assignment operator
    oldval   identifier
    +        add operator
    12       a number
• For each lexeme, the lexical analyzer produces as output a token of the
form <token-name, attribute-value>.
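The token list for the example statement can be reproduced with a small scanner. This is a minimal sketch, not the course's implementation: the token names (`identifier`, `assign_op`, `add_op`, `number`) and the function `tokenize` are assumptions made for illustration. It is written as a generator, so tokens are handed over one at a time as the consumer asks for them.

```python
import re

# Assumed token names and patterns for the toy example. Order matters:
# at each position the first alternative that matches is taken.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("assign_op", r":="),
    ("add_op", r"\+"),
    ("skip", r"\s+"),  # white space separates lexemes but is not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Yield (token_name, lexeme) pairs, reading source left to right."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())


for token in tokenize("newval := oldval + 12"):
    print(token)
```

Here the lexeme itself plays the role of the attribute value; a real compiler would more likely store a pointer to a symbol-table entry for identifiers.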
Lexical Analysis
• Lexical analysis breaks up a program into tokens
• Grouping characters into inseparable units (tokens)
• Changing a stream of characters into a stream of tokens
Symbol Table
• Not part of the final code, but used as a reference by all phases of the compiler
• Typical information stored there includes the name, type, size, and relative offset of variables
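Syntax Analysis
• The syntax analyzer (parser) creates the syntactic structure of the given program, generally a parse tree (or a syntax tree).
• A parse tree describes a syntactic structure.
• In this phase, token arrangements are checked against the source language grammar, i.e. the parser checks whether the sequence of tokens is syntactically correct.
• A syntax analyzer checks whether a given program satisfies the rules implied by
a CFG (context-free grammar).
• If it does, the syntax analyzer creates a parse tree for the given program.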
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing techniques.
• Top-Down Parsing,
• Bottom-Up Parsing
• Top-Down Parsing:
• Construction of the parse tree starts at the root, and proceeds towards the leaves.
• Bottom-Up Parsing:
• Construction of the parse tree starts at the leaves, and proceeds towards the root.
• Normally efficient bottom-up parsers are created with the help of some software tools.
• The syntax analyzer works on the smallest meaningful units (tokens) in a source program.
• The semantic analyzer uses the syntax tree and the information in the symbol
table to check the source program for semantic consistency with the language
definition.
• It also gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking
Phases of Compiler: Semantic Analysis
• Suppose that position, initial, and rate have been declared to be floating-point
numbers and that the lexeme 60 by itself forms an integer.
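• The type checker in the semantic analyzer discovers that the operator * is applied to a floating-point number (rate) and an integer (60). In this case, the integer may be converted into a floating-point number.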
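The coercion step can be sketched on a tiny syntax tree. This is an illustration under assumptions: the expression is taken to be `initial + rate * 60`, and the node classes (`Num`, `Var`, `BinOp`, `IntToFloat`) and the function `check` are hypothetical names, not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class Num:
    value: object  # an int or float literal


@dataclass
class Var:
    name: str


@dataclass
class BinOp:
    op: str
    left: object
    right: object


@dataclass
class IntToFloat:  # explicit conversion node inserted by the type checker
    operand: object


def check(node, symtab):
    """Return (typed_node, type), where type is 'int' or 'float'."""
    if isinstance(node, Num):
        return node, ("int" if isinstance(node.value, int) else "float")
    if isinstance(node, Var):
        return node, symtab[node.name]  # declared type from the symbol table
    if isinstance(node, BinOp):
        left, lt = check(node.left, symtab)
        right, rt = check(node.right, symtab)
        if lt == rt:
            return BinOp(node.op, left, right), lt
        # Mixed operands: wrap the integer side in a conversion node.
        if lt == "int":
            left = IntToFloat(left)
        else:
            right = IntToFloat(right)
        return BinOp(node.op, left, right), "float"
    raise TypeError(f"unknown node {node!r}")


symtab = {"position": "float", "initial": "float", "rate": "float"}
tree = BinOp("+", Var("initial"), BinOp("*", Var("rate"), Num(60)))
typed_tree, result_type = check(tree, symtab)
print(result_type)
print(typed_tree)
```

The declared types come from the symbol table, which is why the semantic analyzer needs both the syntax tree and the table.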
Assembler
• Assembly code is a mnemonic version of machine code, in which names are used instead of binary codes for operations, and names are also given to memory addresses.
Two-Pass Assembler
• This is the simplest form of assembler
• In the first pass, all the identifiers that denote storage
locations are found and stored in a symbol table.
Consider b = a + 2:
    Identifier   Address
    a            0
    b            4
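The first pass can be sketched as a scan that records each storage identifier the first time it appears. Everything concrete here is an assumption for illustration: the three-instruction assembly form of b = a + 2, the register naming (`R0`, `R1`, ...), the `#` prefix for immediates, and one 4-byte word per identifier (which matches the addresses 0 and 4 in the table).

```python
import re

WORD_SIZE = 4  # assumed: each identifier occupies one 4-byte word


def first_pass(lines):
    """Collect identifiers that denote storage locations and assign addresses."""
    symtab = {}
    for line in lines:
        _mnemonic, _, rest = line.partition(" ")
        for operand in rest.split(","):
            operand = operand.strip()
            # Registers (R0, R1, ...) and immediates (#2) are not storage names.
            if not operand or operand.startswith("#") or re.fullmatch(r"R\d+", operand):
                continue
            if operand not in symtab:
                symtab[operand] = len(symtab) * WORD_SIZE
    return symtab


# Hypothetical assembly code for b = a + 2
program = ["MOV a, R1", "ADD #2, R1", "MOV R1, b"]
for name, address in first_pass(program).items():
    print(name, address)
```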
Loader/Link Editor
• Loading: the loader places the relocatable machine code at the
proper location in memory
• The link editor allows us to make a single program from
several files of relocatable machine code
• Specification of tokens
• Recognition of tokens
• Finite automata
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
By Nagadevi
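[Figure: the parser calls getNextToken; the lexical analyzer reads the source program and returns a token; the parser passes its output on to semantic analysis; both the lexical analyzer and the parser use the symbol table.]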
CS416 Compiler Design 45
Lexical Analyzer
• Lexical Analyzer reads the source program character by character to
produce tokens.
• Normally a lexical analyzer doesn't return a list of tokens in one shot; it returns the next token each time the parser asks for one.
Lexical errors
• Some errors are beyond the power of the lexical analyzer to
recognize:
• fi (a == f(x)) …
• However, it may be able to recognize errors such as:
• d = 2r
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
Token
• Token represents a set of strings described by a pattern.
• Identifier represents the set of strings which start with a letter and continue with letters and
digits
• The actual string (e.g. newval) is called the lexeme.
• Since a token can represent more than one lexeme, additional information should be held
for that specific lexeme. This additional information is called the attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information
Token
• Some attributes:
Example
Terminology of Languages
• Alphabet : a finite set of symbols (ASCII characters)
• String : a finite sequence of symbols drawn from the alphabet
Terminology of Languages
• Operators on Strings:
• Concatenation: xy represents the concatenation of strings x and y. sε = s and εs = s
• s^n = s s s .. s (n times), s^0 = ε
Input buffering
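• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to
return
• In Fortran: DO 5 I = 1.25 (blanks are not significant, so this assigns 1.25 to the variable DO5I; only DO 5 I = 1,25 is a loop, and the scanner cannot tell until it reaches the dot or comma)
• A two-buffer scheme is used to handle large lookaheads
safely
[Figure: input buffer holding E = M * C * * 2 eof]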
Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
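The same idea can be sketched as runnable code. This is an illustration under assumptions, not the textbook's implementation: the sentinel is taken to be the NUL character (so the source text must not contain it), each buffer half holds only `N = 8` characters to make reloads easy to observe, and the names `TwoBufferReader` and `read_all` are hypothetical. It ignores the lexeme-begin pointer and only shows how one test per character suffices in the common case.

```python
import io

EOF = "\0"  # assumed sentinel: a character that cannot occur in the source
N = 8       # assumed size of each buffer half (tiny, to exercise reloads)


class TwoBufferReader:
    def __init__(self, stream):
        self.stream = stream
        # Layout: [half 0: N chars][sentinel][half 1: N chars][sentinel]
        self.buf = [EOF] * (2 * N + 2)
        self._load(0)
        self.forward = 0

    def _load(self, half):
        start = half * (N + 1)
        data = self.stream.read(N)
        for i in range(N):
            # Unused slots are filled with EOF, marking the real end of input.
            self.buf[start + i] = data[i] if i < len(data) else EOF

    def next_char(self):
        """Return the next character, or None at the end of input."""
        c = self.buf[self.forward]
        self.forward += 1
        if c != EOF:
            return c  # common case: a single comparison per character
        if self.forward == N + 1:  # passed the sentinel of the first half
            self._load(1)
            return self.next_char()
        if self.forward == 2 * N + 2:  # passed the sentinel of the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None  # eof within a buffer half: end of input


def read_all(text):
    reader = TwoBufferReader(io.StringIO(text))
    chars = []
    while (c := reader.next_char()) is not None:
        chars.append(c)
    return "".join(chars)


print(read_all("E = M * C ** 2"))
```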
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means of specifying regular
languages
• Example:
• letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of
strings
Regular expressions
• ε is a regular expression, L(ε) = {ε}
• If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Example:
Recognition of tokens
• Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | ε
expr -> term relop term
      | term
term -> id
      | number
Operations on Languages
• Concatenation:
• L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union
• L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
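• Exponentiation:
• L^0 = {ε}, L^1 = L, L^2 = LL, and in general L^i = L^(i-1) L
• Kleene Closure
• L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (zero or more concatenations of L)
• Positive Closure
• L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (one or more concatenations of L)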
Example
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
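These operations map directly onto finite sets of strings. The sketch below is illustrative only: the function names `concat`, `power` and `closure_up_to` are assumptions, and because L* and L+ are infinite for any non-empty language other than {ε}, the closure is truncated at a chosen maximum power.

```python
from itertools import product


def concat(l1, l2):
    """L1L2 = { s1s2 | s1 in L1 and s2 in L2 }"""
    return {s1 + s2 for s1, s2 in product(l1, l2)}


def power(lang, n):
    """L^n, with L^0 = {ε} (ε is the empty string "")."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def closure_up_to(lang, max_power, positive=False):
    """Finite approximation of L* (or of L+ when positive=True)."""
    result = set()
    for i in range(1 if positive else 0, max_power + 1):
        result |= power(lang, i)
    return result


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(concat(L1, L2)))
print(sorted(L1 | L2))
print(sorted(closure_up_to({"0"}, 3)))
```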
Regular Expressions
• We use regular expressions to describe tokens of a
programming language.
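• A regular expression is built up of simpler regular expressions (using defining rules).
• Each regular expression denotes a language; a language denoted by a regular expression is called a
regular set.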
• (r)+ = (r)(r)*
• (r)? = (r) | ε
• Ex:
• Σ = {0,1}
• 0|1 => {0,1}
• (0|1)(0|1) => {00,01,10,11}
• 0* => {ε ,0,00,000,0000,....}
• (0|1)* => all strings with 0 and 1, including the empty string
Regular Definitions
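• Writing a regular expression for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions.
• A regular definition gives names to regular expressions, and these names can be used as symbols to define other
regular expressions.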
digit → 0 | 1 | ... | 9
digits → digit +
opt-fraction → ( . digits ) ?
opt-exponent → ( E (+|-)? digits ) ?
unsigned-num → digits opt-fraction opt-exponent
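The definition of unsigned-num can be mirrored with named string fragments that are combined into one pattern. This is a sketch using Python's `re` module; the variable names follow the regular definition above, and `is_unsigned_num` is a hypothetical helper.

```python
import re

# Each name stands for a regular expression and is reused in later definitions.
digit = r"[0-9]"
digits = rf"{digit}+"
opt_fraction = rf"(\.{digits})?"
opt_exponent = rf"(E[+-]?{digits})?"
unsigned_num = re.compile(rf"{digits}{opt_fraction}{opt_exponent}")


def is_unsigned_num(s):
    """True if the whole string s matches unsigned-num."""
    return unsigned_num.fullmatch(s) is not None


for s in ["12", "3.14", "6.02E23", "1E-5", ".5", "1.", "E5"]:
    print(s, is_unsigned_num(s))
```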
Transition diagrams
• Transition diagram for relop
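A transition diagram translates almost mechanically into a loop over states. The sketch below assumes the relational operators <, <=, <>, =, > and >= (the set used in the textbook's running example); the state numbers, attribute names (`LT`, `LE`, ...) and the function `relop` are illustrative assumptions, since the diagram itself is not reproduced here.

```python
def relop(text, pos=0):
    """Return (attribute, next_pos) for a relop starting at pos, else None."""
    state = 0
    while True:
        c = text[pos] if pos < len(text) else ""  # "" stands for end of input
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", pos + 1)
            elif c == ">":
                state = 6
            else:
                return None  # not a relational operator
            pos += 1
        elif state == 1:  # seen "<"
            if c == "=":
                return ("LE", pos + 1)
            if c == ">":
                return ("NE", pos + 1)
            return ("LT", pos)  # retract: c is not part of the lexeme
        elif state == 6:  # seen ">"
            if c == "=":
                return ("GE", pos + 1)
            return ("GT", pos)  # retract: c is not part of the lexeme


for s in ["<", "<=", "<>", "=", ">", ">=", "<a"]:
    print(repr(s), relop(s))
```

The two "retract" returns correspond to accepting states that are reached only after reading one character too many.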
Design of a Lexical Analyzer
• LEX is a software tool that automatically constructs a lexical
analyzer from a Lex program
• The Lex program is a list of patterns with associated actions, of the form
P1 {action 1}
P2 {action 2}
…
Example
Consider the patterns
a    {action A1 for pattern p1}
abb  {action A2 for pattern p2}
a*b* {action A3 for pattern p3}
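When several patterns match, Lex prefers the longest lexeme and, on a tie, the rule listed first. The sketch below imitates that behaviour for the three patterns above; it is an illustration built on Python's `re` module rather than on generated automata, and `scan` is a hypothetical name. For these particular patterns greedy matching yields the longest match, which would not hold for arbitrary regular expressions.

```python
import re

# Rules in the order they are listed; earlier rules win ties.
RULES = [
    (re.compile(r"a"), "A1"),
    (re.compile(r"abb"), "A2"),
    (re.compile(r"a*b*"), "A3"),
]


def scan(text):
    """Split text into (lexeme, action) pairs: longest match, then first rule."""
    pos, out = 0, []
    while pos < len(text):
        best_len, best_action = 0, None
        for pattern, action in RULES:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when lengths are equal,
            # and ignores empty matches (a*b* can match the empty string).
            if m and len(m.group()) > best_len:
                best_len, best_action = len(m.group()), action
        if best_action is None:
            raise ValueError(f"no pattern matches at position {pos}")
        out.append((text[pos:pos + best_len], best_action))
        pos += best_len
    return out


print(scan("abb"))
print(scan("aabba"))
```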
LEX in use
• An input file, which we call lex.l, is
written in the Lex language and
describes the lexical analyzer to be
generated.
• The Lex compiler transforms lex.l
to a C program, in a file that is
always named lex.yy.c.
• The latter file is compiled by the C
compiler into a file called a.out.
• The C-compiler output is a working
lexical analyzer that can take a
stream of input characters and
produce a stream of tokens.
General format
1. The declarations section includes
declarations of variables, manifest
constants (identifiers declared to stand for
a constant, e.g., the name of a token), and
regular definitions
2. The translation rules each have the form
Pattern { Action }
• Each pattern is a regular expression, which
may use the regular definitions of the
declarations section.
• The actions are fragments of code, typically
written in C, although many variants of Lex
using other languages have been created.
3. The third section holds whatever additional
functions are used in the actions.
Lexical Analyzer Generator - Lex
Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• Recognizer: a recognizer for a language is a program that takes as input
a string x and answers ‘yes’ if x is a sentence of the language and ‘no’ otherwise.
Finite Automata
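• A finite automaton consists of
• An input alphabet Σ
• A set of states S
• A start state n
• A set of accepting states F ⊆ S
• A set of transitions, each leading from a state, on an input symbol, to a state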
Finite Automata
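• Transition
s1 --a--> s2
• Is read: in state s1 on input a, go to state s2
• If end of input: accept if in an accepting state, otherwise reject
• If no transition is possible: reject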
Finite Automata State Graphs
• A state: drawn as a circle
• An accepting state: drawn as a double circle
• A transition: drawn as an arrow between two states, labeled with an input symbol (here a)
Finite Automata
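• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA). This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Which one? A DFA is a faster recognizer but may take more space; an NFA is slower but may take less space.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
• Algorithm 1: Regular Expression → NFA → DFA (two steps: first to an NFA, then to a DFA)
• Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)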
Non-Deterministic Finite Automaton (NFA)
• A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
• S - a set of states
• Σ - a set of input symbols (alphabet)
• move – a transition function move to map state-symbol pairs to sets of states.
• s0 - a start (initial) state
• F – a set of accepting states (final states)
• ε- transitions are allowed in NFAs. In other words, we can move from one state to
another one without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the start state to one of the
accepting states such that the edge labels along this path spell out x.
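The acceptance condition can be checked by tracking the set of states the NFA could be in after each symbol. The sketch below is a minimal illustration: `move` is represented as a dictionary from (state, symbol) pairs to sets of states, the empty string is used as the label of ε-transitions, and the example automaton (strings over {a, b} that end in a) is a hypothetical one chosen for the demonstration.

```python
EPS = ""  # label used for ε-transitions in this sketch


def eps_closure(states, move):
    """All states reachable from `states` using ε-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in move.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(x, move, s0, F):
    """True if some path from s0 to a state in F spells out x."""
    current = eps_closure({s0}, move)
    for symbol in x:
        nxt = set()
        for s in current:
            nxt |= move.get((s, symbol), set())
        current = eps_closure(nxt, move)
    return bool(current & F)


# Hypothetical NFA: 0 --ε--> 1, 1 loops on a and b, 1 --a--> 2 (accepting)
move = {
    (0, EPS): {1},
    (1, "a"): {1, 2},
    (1, "b"): {1},
}
for x in ["a", "ba", "abba", "", "ab"]:
    print(repr(x), nfa_accepts(x, move, 0, {2}))
```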
Deterministic Finite Automaton (DFA)
• One transition per input symbol per state
• No ε-moves
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
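This automaton is small enough to write out as a transition table and simulate. The state names `start` and `accept` and the function `dfa_accepts` are assumptions made for this sketch; a missing table entry is treated as "no transition possible", which rejects.

```python
# Transition table: (state, symbol) -> next state.
TRANSITIONS = {
    ("start", "1"): "start",   # any number of 1's
    ("start", "0"): "accept",  # followed by a single 0
}


def dfa_accepts(x, transitions=TRANSITIONS, start="start",
                accepting=frozenset({"accept"})):
    state = start
    for symbol in x:
        if (state, symbol) not in transitions:
            return False  # no transition possible: reject
        state = transitions[(state, symbol)]
    return state in accepting  # end of input: accept only in an accepting state


for x in ["0", "10", "1110", "", "1", "100", "01"]:
    print(repr(x), dfa_accepts(x))
```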
[Figure: NFA for r1 | r2. A new start state i has ε-transitions into N(r1) and N(r2); each of these has an ε-transition to a new accepting state f.]
[Figure: NFA for r1 r2.]
[Figure: NFA for r*. New states i and f are connected to N(r) by ε-transitions.]
[Figure: step-by-step construction of the NFA for (a|b)*a.]
[Figure: transition graph with states S0, S1 and S2 and edges labeled a and b.]