Compiler Design - Unit I
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 1986.
Compiler - Introduction
• A compiler is a program that can read a program in one language - the
source language - and translate it into an equivalent program in
another language - the target language.
• A compiler acts as a translator, transforming human-oriented
programming languages into computer-oriented machine languages.
• It hides machine-dependent details from the programmer.
COMPILERS
• A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.
[Diagram: source program → compiler → target program; the compiler also reports error messages.]
Compiler vs Interpreter
• A compiler translates the entire source program into a target program before execution, while an interpreter executes the source program directly, statement by statement.
Compiler Applications
• Machine Code Generation
– Converts the source language program into a machine-understandable form
– Takes care of the semantics of the varied constructs of the source language
– Considers limitations and specific features of the target machine
– Automata theory helps in syntactic checks, distinguishing valid and invalid programs
– Compilation also generates code for syntactically correct programs
Structure of a Compiler
Phases of a Compiler
Lexical Analyzer
• Lexical Analyzer reads the source program character by character and returns
the tokens of the source program.
• A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
Ex: newval := oldval + 12  =>  tokens:
    newval    identifier
    :=        assignment operator
    oldval    identifier
    +         add operator
    12        number
• This phase scans the source code as a stream of characters and converts it
into meaningful lexemes.
Symbol Table
• Not part of the final code; however, it is used as a reference by all phases of a compiler
• Typical information stored there includes the name, type, size, and relative offset of variables
• Generally created by the lexical analyzer and syntax analyzer
• Good data structures are needed to minimize searching time
• The data structure may be flat or hierarchical (a flat, hash-based sketch follows)
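As a rough illustration (not from the slides), a flat symbol table is often a chained hash table; the entry fields below mirror the information listed above, but all names and sizes are illustrative assumptions.

/* Minimal sketch of a flat, chained-hash symbol table.
   All names and sizes here are illustrative assumptions. */
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 211

typedef struct Symbol {
    char *name;               /* the lexeme */
    int   type;               /* a type code */
    int   size;               /* size in bytes */
    int   offset;             /* relative offset of the variable */
    struct Symbol *next;      /* collision chain */
} Symbol;

static Symbol *table[NBUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

Symbol *lookup(const char *name) {
    for (Symbol *p = table[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    return NULL;              /* not declared yet */
}

Symbol *insert(const char *name, int type, int size, int offset) {
    unsigned h = hash(name);
    Symbol *p = malloc(sizeof *p);
    p->name = strdup(name);   /* strdup is POSIX; copy manually otherwise */
    p->type = type; p->size = size; p->offset = offset;
    p->next = table[h];
    table[h] = p;
    return p;
}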
Syntax Analyzer
• A syntax analyzer creates the syntactic structure (generally a parse tree) of the given program.
• A syntax analyzer is also called a parser.
• A parse tree describes the syntactic structure of the program.
• It takes the tokens produced by lexical analysis as input and generates a parse tree according to the source code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Semantic Analyzer
• Semantic analysis checks whether the parse tree follows the rules of the language.
• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking.
Phases of Compiler - Semantic Analysis
• Suppose that position, initial, and rate have been declared to be floating-point numbers and that the lexeme 60 by itself forms an integer.
• The type checker in the semantic analyzer discovers that the operator * is applied to a floating-point number (rate) and an integer (60); the integer may be converted to a floating-point number, e.g., by applying an inttofloat conversion (as sketched below).
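A minimal sketch of that coercion check, assuming a toy syntax-tree node type; all names here are illustrative, not the book's code.

/* Sketch: when '*' mixes an int operand with a float operand,
   wrap the int side in an inttofloat node. Names are illustrative. */
#include <stdlib.h>

typedef enum { T_INT, T_FLOAT } Type;

typedef struct Node {
    const char *op;                 /* "*", "inttofloat", "num", "id", ... */
    Type type;
    struct Node *left, *right;
} Node;

static Node *inttofloat(Node *n) {
    Node *c = calloc(1, sizeof *c);
    c->op = "inttofloat"; c->type = T_FLOAT; c->left = n;
    return c;
}

/* Type-check a '*' node: result is float if either side is float. */
void check_mul(Node *mul) {
    if (mul->left->type != mul->right->type) {
        if (mul->left->type == T_INT)  mul->left  = inttofloat(mul->left);
        if (mul->right->type == T_INT) mul->right = inttofloat(mul->right);
    }
    mul->type = mul->left->type;
}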
Reasons for separating lexical analysis from parsing:
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
[Diagram: interaction between the lexical analyzer and the parser — the parser calls getNextToken; the lexical analyzer reads the source program and returns the next token to the parser, which passes its result on to semantic analysis; both components consult the symbol table.]
Lexical Analyzer
• Lexical Analyzer reads the source program character by character to
produce tokens.
• Normally a lexical analyzer doesn't return a list of tokens at one shot; it returns the next token each time the parser asks for one.
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  • fi (a == f(x)) …
  (the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or a valid function identifier)
• However, it may be able to recognize errors like:
  • d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Other possible error-recovery actions:
  • Delete one character from the remaining input
  • Insert a missing character into the remaining input
  • Replace a character by another character
  • Transpose two adjacent characters
Token
• A token represents a set of strings described by a pattern.
  • Identifier represents the set of strings which start with a letter and continue with letters and digits.
  • The actual string (e.g., newval) is called a lexeme.
• Tokens: identifier, number, addop, delimiter, …
• Since a token can represent more than one lexeme, additional information should be held for that specific lexeme. This additional information is called the attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information for that token.
  • For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token.
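As a sketch, a token with a single attribute can be modeled in C as a tagged value; the token names below are illustrative.

/* Sketch: a token as a <type, attribute> pair. Names are illustrative. */
struct Symbol;                    /* symbol-table entry, defined elsewhere */

typedef enum { ID, NUM, ASSGOP, ADDOP, DELIM } TokenType;

typedef struct {
    TokenType type;
    union {
        struct Symbol *entry;     /* for ID: pointer into the symbol table */
        int value;                /* for NUM: the actual numeric value */
    } attr;                       /* ASSGOP etc. carry no attribute */
} Token;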
Token
• Some attributes:
• <id, attr> where attr is a pointer to the symbol table
• <assgop, _> no attribute is needed (if there is only one assignment operator)
• <num, val> where val is the actual value of the number
• A token type and its attribute uniquely identify a lexeme.
• Regular expressions are widely used to specify patterns.
Terminology of Languages
• Alphabet : a finite set of symbols (ASCII characters)
• String:
  • A finite sequence of symbols over an alphabet
  • The terms sentence and word are also used for strings
  • ε is the empty string
  • |s| is the length of string s.
• Language: a set of strings over some fixed alphabet
• ∅ the empty set is a language.
• {ε} the set containing empty string is a language
• The set of well-formed C programs is a language
• The set of all possible identifiers is a language.
Terminology of Languages
• Operators on Strings:
• Concatenation: xy represents the concatenation of strings x and y
  • sε = εs = s
• Exponentiation: s^n = s s … s (n times),  s^0 = ε
Input buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide about the token to return
  • In C: we need to look at the character after -, = or < to decide what token to return
  • In Fortran: DO 5 I = 1.25 (which cannot be distinguished from the loop header DO 5 I = 1,25 until the . or , is seen)
• We need to introduce a two-buffer scheme to handle large look-aheads safely
Example buffer contents:   E = M * C * * 2 eof
Sentinels
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
}
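For concreteness, here is a runnable sketch of the two-buffer scheme, using '\0' (rather than a special eof character) as the sentinel and assuming the source text never contains '\0'; all names and sizes are illustrative.

/* Sketch: two buffer halves, each ending in a '\0' sentinel.
   next_char() returns the next input character, reloading halves on demand. */
#include <stdio.h>

#define HALF 4096                      /* size of each buffer half */
static char buf[2 * HALF + 2];         /* two halves, one sentinel slot each */
static char *forward;
static FILE *src;

static void load(int half) {
    size_t n = fread(&buf[half * (HALF + 1)], 1, HALF, src);
    buf[half * (HALF + 1) + n] = '\0'; /* sentinel after the data read */
}

void init(FILE *f) { src = f; load(0); forward = buf; }

int next_char(void) {
    char c = *forward++;
    if (c != '\0') return (unsigned char)c;
    if (forward == &buf[HALF + 1]) {            /* sentinel ends half 0 */
        load(1);                                /* forward now starts half 1 */
        return next_char();
    }
    if (forward == &buf[2 * HALF + 2]) {        /* sentinel ends half 1 */
        load(0);
        forward = buf;
        return next_char();
    }
    return EOF;            /* sentinel inside a half: true end of input */
}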
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means for specifying regular languages
• Example: letter_(letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings
Regular expressions
• ε is a regular expression, L(ε) = {ε}
• If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
• If r and s are regular expressions, then:
  • r | s is a regular expression, L(r | s) = L(r) ∪ L(s)
  • rs is a regular expression, L(rs) = L(r) L(s)
  • r* is a regular expression, L(r*) = (L(r))*
  • (r) is a regular expression, L((r)) = L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instances: r?
• Character classes: [abc]
• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_(letter_ | digit)*
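These character-class patterns can be tried out directly with the POSIX regex API; a small sketch (the test lexemes are arbitrary):

/* Sketch: testing the id pattern [A-Za-z_][A-Za-z_0-9]* with POSIX regex. */
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    regcomp(&re, "^[A-Za-z_][A-Za-z_0-9]*$", REG_EXTENDED | REG_NOSUB);
    const char *tests[] = { "newval", "_tmp1", "2r" };
    for (int i = 0; i < 3; i++)
        printf("%-8s %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "id" : "not an id");
    regfree(&re);
    return 0;
}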
Recognition of tokens
• Starting point is the language grammar to understand the
tokens:
stmt -> if expr then stmt
| if expr then stmt else stmt
| ε
expr -> term relop term
| term
term -> id
| number
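The relop token in this grammar (taking Pascal-style relational operators <, <=, <>, =, >=, > as an assumed example) is classically recognized with a transition diagram; a hand-coded C sketch:

/* Sketch: hand-coded recognizer for Pascal-style relops
   (<, <=, <>, =, >=, >), following a transition diagram. */
typedef enum { LT, LE, NE, EQ, GE, GT, NONE } Relop;

Relop scan_relop(const char **pp) {
    const char *s = *pp;
    switch (*s) {
    case '<':
        if (s[1] == '=') { *pp = s + 2; return LE; }
        if (s[1] == '>') { *pp = s + 2; return NE; }
        *pp = s + 1; return LT;
    case '=':
        *pp = s + 1; return EQ;
    case '>':
        if (s[1] == '=') { *pp = s + 2; return GE; }
        *pp = s + 1; return GT;
    default:
        return NONE;   /* no relop here; caller retries another pattern */
    }
}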
Operations on Languages
• Concatenation:
• L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union
• L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation:
  • L0 = {ε}   L1 = L   L2 = LL
• Kleene Closure:
  • L* = L0 ∪ L1 ∪ L2 ∪ …
• Positive Closure:
  • L+ = L1 ∪ L2 ∪ L3 ∪ …
Example
• L1 = {a,b,c,d} L2 = {1,2}
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
Regular Definitions
• Writing regular expressions for some languages can be difficult, because their regular expressions can be quite complex. In those cases, we may use regular definitions.
• We can give names to regular expressions, and we can use these names as symbols to define other regular expressions.
• A regular definition is a sequence of definitions of the form:
  d1 → r1
  d2 → r2
  …
  dn → rn
  where each di is a distinct name, and each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}, i.e., the basic symbols plus the previously defined names.
Design of a Lexical Analyzer
• LEX is a software tool that automatically constructs a lexical analyzer from a specification program
• The lexical analyzer specification will be of the form:
  p1 { action 1 }
  p2 { action 2 }
  …
General format
• The declarations section includes declarations of variables, manifest constants (identifiers declared to stand for a constant, e.g., the name of a token), and regular definitions.
• The translation rules each have the form
  Pattern { Action }
• Each pattern is a regular expression, which may use the regular definitions of the declarations section.
• The actions are fragments of code, typically written in C, although many variants of Lex using other languages have been created.
• The third section holds whatever additional functions are used in the actions.
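Putting the three sections together, a minimal hypothetical Lex specification might look like this; the token codes and patterns are made-up illustrations, not taken from the slides.

%{
/* Declarations section: C code and manifest constants (hypothetical). */
#define ID  1
#define NUM 2
%}

letter_   [A-Za-z_]
digit     [0-9]

%%
{letter_}({letter_}|{digit})*   { return ID;  }
{digit}+                        { return NUM; }
[ \t\n]                         { /* skip whitespace */ }
%%

/* Auxiliary functions section. */
int yywrap(void) { return 1; }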
Lexical Analyzer Generator - Lex
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out → sequence of tokens
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• Recognizer: a recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
Finite Automata
• The transition s1 --a--> s2 is read: in state s1, on input "a", go to state s2
• If at end of input:
  • if in an accepting state => accept, otherwise => reject
• If no transition is possible => reject
Finite Automata State Graphs
• A state: drawn as a circle
• An accepting state: drawn as a double circle
• A transition: a directed edge between states, labeled with an input symbol (e.g., a)
Finite Automata
• A recognizer for a language is a program that takes a string x, and answers "yes" if x is a sentence of that language, and "no" otherwise.
• We call the recognizer of the tokens a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
  • deterministic – faster recognizer, but it may take more space
  • non-deterministic – slower, but it may take less space
• Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
  • Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
  • Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
Non-Deterministic Finite Automaton (NFA)
• ε-transitions are allowed in NFAs. In other words, we can move from one state to another one without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
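A small sketch of NFA simulation in C, tracking the current set of states as a bitmask; the example NFA (for (a|b)*a, written without ε-transitions) and all names are illustrative. A full simulator would additionally take ε-closures after each step.

#include <stdbool.h>

#define NSTATES 2
enum { SYM_A, SYM_B, NSYMS };

/* delta[s][x] = bitmask of successor states of s on symbol x.
   NFA for (a|b)*a: state 0 on a -> {0,1}, on b -> {0}; state 1 accepting. */
static const unsigned delta[NSTATES][NSYMS] = {
    { 0x3, 0x1 },
    { 0x0, 0x0 },
};
static const unsigned accepting = 0x2;   /* state 1 */

static unsigned step(unsigned set, int sym) {
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s)) next |= delta[s][sym];
    return next;
}

bool nfa_accepts(const char *input) {
    unsigned set = 0x1;                  /* start in state 0 */
    for (; *input; input++)
        set = step(set, *input == 'a' ? SYM_A : SYM_B);
    return (set & accepting) != 0;       /* accept iff some path ends in 1 */
}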
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
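For this example, the corresponding DFA is easy to code directly; a minimal sketch:

#include <stdbool.h>

/* DFA for 1*0: any number of 1's followed by a single 0. */
bool accepts(const char *s) {
    int state = 0;                  /* 0: start, 1: accepting, 2: dead */
    for (; *s; s++) {
        if (state == 0) state = (*s == '1') ? 0 : (*s == '0') ? 1 : 2;
        else            state = 2;  /* any symbol after the 0 rejects */
    }
    return state == 1;
}
/* accepts("0") and accepts("1110") are true; accepts("100") is false. */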
NFA and Transition Table
[Slides: an example NFA, its state diagram, and its transition table; figures not reproduced.]
NFA for r1 | r2
[Diagram: a new start state i has ε-edges to the start states of N(r1) and N(r2); the accepting states of N(r1) and N(r2) have ε-edges to a new final state f.]
NFA for r*
[Diagram: a new start state i has ε-edges to the start state of N(r) and directly to a new final state f; the accepting state of N(r) has ε-edges back to N(r)'s start state and to f.]
NFA for (a|b)*
[Diagram: the alternation NFA for a|b wrapped in the r* construction, with ε-edges forming the loop and the bypass.]
NFA for (a|b)*a
[Diagram: the NFA for (a|b)* followed by an a-transition into a new final state.]
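For the NFA → DFA step of Algorithm 1, the subset construction can be sketched compactly when NFA state sets fit in a machine word; the table below reuses the ε-free (a|b)*a example from earlier, and all names are illustrative. For NFAs with ε-transitions, each subset would first be extended to its ε-closure.

#include <stdio.h>

#define NSTATES 2                    /* NFA states */
#define NSYMS   2                    /* symbols a, b */
#define MAXD    (1 << NSTATES)       /* at most 2^NSTATES DFA states */

static const unsigned delta[NSTATES][NSYMS] = {
    { 0x3, 0x1 },                    /* ε-free NFA for (a|b)*a */
    { 0x0, 0x0 },
};

static unsigned move(unsigned set, int sym) {
    unsigned next = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1u << s)) next |= delta[s][sym];
    return next;
}

int main(void) {
    unsigned dstates[MAXD];          /* each DFA state is a subset (bitmask) */
    int ndfa = 0;
    dstates[ndfa++] = 0x1;           /* start subset {0} */
    for (int i = 0; i < ndfa; i++) {        /* worklist over DFA states */
        for (int sym = 0; sym < NSYMS; sym++) {
            unsigned t = move(dstates[i], sym);
            int j = 0;
            while (j < ndfa && dstates[j] != t) j++;
            if (j == ndfa) dstates[ndfa++] = t; /* unseen subset: new state */
            printf("D%d --%c--> D%d  (subset 0x%X)\n", i, "ab"[sym], j, t);
        }
    }
    /* A DFA state is accepting iff its subset contains NFA state 1 (0x2). */
    return 0;
}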
[Diagram: an example DFA with states S0, S1, S2 and transitions on a and b; figures for the construction steps not reproduced.]
Minimization of DFA
[Diagrams: the minimization procedure illustrated step by step; figures not reproduced.]
Example - Minimization of DFA
[Diagram: a worked minimization example; figure not reproduced.]
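Since the minimization slides are diagrams, here is a minimal sketch of the underlying partition-refinement idea on a small, hypothetical table-driven DFA: start from the {final, non-final} split and keep splitting classes whose members disagree on some successor class.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define N 4                           /* DFA states (hypothetical DFA) */
#define S 2                           /* symbols a, b */

static const int dtran[N][S] = { {1,2}, {1,3}, {1,2}, {1,3} };
static const bool is_final[N] = { false, false, false, true };

int main(void) {
    int cls[N], nxt[N];
    for (int s = 0; s < N; s++) cls[s] = is_final[s] ? 1 : 0;

    int prev = -1;
    for (;;) {
        int ncls = 0;
        for (int s = 0; s < N; s++) {   /* group s with an earlier state t iff
                                           same class and same successor
                                           classes on every symbol */
            int t = 0;
            for (; t < s; t++) {
                bool same = (cls[t] == cls[s]);
                for (int a = 0; same && a < S; a++)
                    same = (cls[dtran[t][a]] == cls[dtran[s][a]]);
                if (same) { nxt[s] = nxt[t]; break; }
            }
            if (t == s) nxt[s] = ncls++;     /* opens a new class */
        }
        memcpy(cls, nxt, sizeof cls);
        if (ncls == prev) break;             /* no class split: fixpoint */
        prev = ncls;
    }
    for (int s = 0; s < N; s++)
        printf("state %d -> class %d\n", s, cls[s]);  /* here 0 and 2 merge */
    return 0;
}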