Module 1
Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of FSA (finite-state automata) and CFG (context-free grammars).
• Knowledge of a high-level programming language for the programming assignments.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 1986.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
• Context Free Grammars
• Top-Down Parsing, LL Parsing
• Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
• Attribute Definitions
• Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
• Code Optimization
• Code Generation
Compiler - Introduction
• A compiler is a program that can read a program in one language - the source language - and
translate it into an equivalent program in another language - the target language.
• A compiler acts as a translator, transforming human-oriented programming languages into
computer-oriented machine languages.
• It hides machine-dependent details from the programmer.
COMPILERS
• A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.

source program (normally a program written in a high-level programming language)
        |
        v
    COMPILER ──> error messages
        |
        v
target program (normally the equivalent program in machine code – a relocatable object file)
Compiler vs Interpreter
• An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.
Other Applications
• In addition to compiler development itself, the techniques used in compiler design are applicable to many other problems in computer science.
• Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, and pattern recognition programs.
• Techniques used in a parser can be used in query processing systems such as SQL.
• Many software tools that have a complex front end need techniques used in compiler design.
• For example, a symbolic equation solver that takes an equation as input must parse the given input equation.
• Most of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.
Major Parts of Compilers
• There are two major parts of a compiler: analysis and synthesis.
• In the analysis phase, an intermediate representation is created from the given source program.
• The Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this phase.
• In the synthesis phase, the equivalent target program is created from this intermediate representation.
Structure of a Compiler
• The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.
Phases of A Compiler

Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program
Lexical Analyzer
• The Lexical Analyzer reads the source program character by character and returns the tokens of the source program.
• A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
Ex: newval := oldval + 12  =>  tokens:
  newval   identifier
  :=       assignment operator
  oldval   identifier
  +        add operator
  12       a number
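To make the character-by-character reading concrete, here is a minimal scanner sketch in C for this example. The scan() routine and its printed output are illustrative assumptions only; a real lexical analyzer would hand each token to the parser instead of printing it.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical input; a real lexer would read from a file or buffer. */
static const char *p = "newval := oldval + 12";

static void scan(void) {
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; }          /* skip whitespace */
        else if (isalpha((unsigned char)*p)) {            /* identifier */
            printf("identifier: ");
            while (isalnum((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else if (isdigit((unsigned char)*p)) {          /* number */
            printf("number: ");
            while (isdigit((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else if (p[0] == ':' && p[1] == '=') {          /* := */
            printf("assignment operator: :=\n");
            p += 2;
        } else if (*p == '+') {
            printf("add operator: +\n");
            p++;
        } else {
            printf("unknown character: %c\n", *p++);
        }
    }
}

int main(void) { scan(); return 0; }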
Phases of Compiler - Lexical Analysis
• It is also called scanning.
• This phase scans the source code as a stream of characters and converts it into meaningful lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>.
• It passes the tokens on to the subsequent phase, syntax analysis.
Lexical Analysis
• Lexical analysis breaks up a program into tokens.
• It groups characters into inseparable units (tokens).
• It changes a stream of characters into a stream of tokens.
Token, Pattern and Lexeme
• Token: a token is a sequence of characters that can be treated as a single logical entity. Typical tokens are: identifiers, keywords, operators, special symbols, and constants.
• Pattern: a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
• Lexeme: a lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Token, Pattern and Lexeme
Example 1

int main() {
    // printf() sends the string inside the quotation marks to
    // the standard output (the display)
    printf("Welcome to Compiler Design!");
    return 0;
}

Tokens: 'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to Compiler Design!"', ')', ';', 'return', '0', ';', '}'
Symbol Table
• Not part of the final code, but used as a reference by all phases of a compiler.
• Typical information stored there includes the name, type, size, and relative offset of variables.
• Generally created by the lexical analyzer and syntax analyzer.
• Good data structures are needed to minimize searching time.
• The data structure may be flat or hierarchical.
Syntax Analysis
• A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given program.
• A syntax analyzer is also called a parser.
• A parse tree describes the syntactic structure of the program.
• It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree).
• In this phase, token arrangements are checked against the source code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Syntax Analyzer (CFG)
• The syntax of a language is specified by a context free grammar (CFG).
• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not.
• If so, the syntax analyzer creates a parse tree for the given program.
• Ex: We use BNF (Backus-Naur Form) to specify a CFG:
assgstmt -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression
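For instance, these rules give the earlier statement newval := oldval + 12 a parse tree along the following lines (sketched here as indented text; since the expression rule as written is ambiguous, this is one of the possible trees):

assgstmt
├── identifier (newval)
├── :=
└── expression
    ├── expression ── identifier (oldval)
    ├── +
    └── expression ── number (12)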
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
• Top-Down Parsing,
• Bottom-Up Parsing
• Top-Down Parsing:
• Construction of the parse tree starts at the root and proceeds towards the leaves.
• Efficient top-down parsers can be easily constructed by hand.
• Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing) – a small recursive parser is sketched after this list.
• Bottom-Up Parsing:
• Construction of the parse tree starts at the leaves and proceeds towards the root.
• Normally, efficient bottom-up parsers are created with the help of software tools.
• Bottom-up parsing is also known as shift-reduce parsing.
• Operator-Precedence Parsing – simple, restrictive, easy to implement.
• LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR.
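As a concrete illustration of recursive predictive parsing, here is a minimal hand-written parser in C for the assgstmt grammar shown earlier. It is only a sketch under our own assumptions: the left-recursive rule expression -> expression + expression is first rewritten into the equivalent expression -> primary { + primary } form, since a top-down parser cannot handle left recursion directly, and the helper names are invented for the example.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Input for the demonstration; a real parser would consume tokens
   produced by the lexical analyzer rather than raw characters. */
static const char *p = "newval := oldval + 12";

static void skip(void) { while (isspace((unsigned char)*p)) p++; }

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s at \"%s\"\n", msg, p);
    exit(1);
}

/* expression -> identifier | number (the non-recursive alternatives) */
static void primary(void) {
    skip();
    if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) p++;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
    } else {
        error("expected identifier or number");
    }
}

/* expression -> primary { + primary }  (left recursion eliminated) */
static void expression(void) {
    primary();
    skip();
    while (*p == '+') { p++; primary(); skip(); }
}

/* assgstmt -> identifier := expression */
static void assgstmt(void) {
    skip();
    if (!isalpha((unsigned char)*p)) error("expected identifier");
    while (isalnum((unsigned char)*p)) p++;
    skip();
    if (p[0] == ':' && p[1] == '=') p += 2; else error("expected :=");
    expression();
}

int main(void) {
    assgstmt();
    printf("parsed OK\n");
    return 0;
}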
Syntax Analyzer versus Lexical Analyzer
• Which constructs of a program should be recognized by the lexical
analyzer, and which ones by the syntax analyzer?
• Both of them do similar things; but the lexical analyzer deals with simple non-recursive constructs of the language.
• The syntax analyzer deals with recursive constructs of the language.
• The lexical analyzer simplifies the job of the syntax analyzer.
• The lexical analyzer recognizes the smallest meaningful units (tokens) in a source program.
• The syntax analyzer works on the smallest meaningful units (tokens) in a source program to recognize
meaningful structures in our programming language.
Phases of Compiler - Semantic Analysis
• Semantic analysis checks whether the constructed parse tree follows the rules of the language.
• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking.
Phases of Compiler - Semantic Analysis
• Suppose that position, initial, and rate have been declared to be floating-point numbers and that the lexeme 60 by itself forms an integer.
• The type checker in the semantic analyzer discovers that the operator * is applied to a floating-point number rate and an integer 60.
• In this case, the integer may be converted into a floating-point number: the semantic analyzer inserts an inttofloat conversion into the syntax tree.
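The following minimal C sketch mimics that check. The Type/Node representation and check_mul() are invented for the illustration (the slides do not prescribe one); it reports the inttofloat coercion a real semantic analyzer would insert into the syntax tree.

#include <stdio.h>

typedef enum { T_INT, T_FLOAT } Type;
typedef struct { const char *lexeme; Type type; } Node;

/* Type-check a multiplication: if one operand is int and the other is
   float, report the coercion that would be inserted and yield float. */
static Type check_mul(const Node *lhs, const Node *rhs) {
    if (lhs->type != rhs->type) {
        const Node *intop = (lhs->type == T_INT) ? lhs : rhs;
        printf("coercion: inttofloat(%s)\n", intop->lexeme);
        return T_FLOAT;
    }
    return lhs->type;
}

int main(void) {
    Node rate  = { "rate", T_FLOAT };
    Node sixty = { "60",   T_INT   };
    check_mul(&rate, &sixty);            /* prints: coercion: inttofloat(60) */
    return 0;
}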
Cousins of Compiler - Language Processing System
Preprocessor
• Pre-processors produce input to compilers.
• The functions performed are:
• Macro processing – allows the user to define macros.
• File inclusion – includes header files into the program.
• Rational pre-processors – augment older languages with more modern flow-of-control and data-structuring facilities.
• Language extension – attempts to add capabilities to the language by what amounts to built-in macros (e.g., embedding queries in C).
Assembler
• Assembly code is a mnemonic version of machine code, in which names are used instead of binary codes for operations:

MOV a,R1
ADD #2,R1
MOV R1,b

• Some compilers produce assembly code, which is passed to an assembler for further processing.
• Other compilers perform the job of the assembler, producing relocatable machine code that is passed directly to the loader/link editor.
Two-Pass Assembler
• This is the simplest form of assembler.
• In the first pass, all the identifiers that denote storage locations are found and stored in a symbol table.
• In the second pass, the assembler scans the input again, translating operation codes into machine code and identifiers into the addresses recorded in the symbol table.
Consider b = a + 2:

Identifier   Address
a            0
b            4
Loader/Link Editor
• Loading – loads the relocatable machine code into memory at the proper location.
• A link editor allows us to make a single program from several files of relocatable machine code.
Compiler Construction Tool
Role of a Lexical Analyzer
Why separate Lexical Analysis and Parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
The role of the lexical analyzer

source program → Lexical Analyzer --token--> Parser → to semantic analysis
(the parser requests each token via getNextToken; both components consult the symbol table)
Lexical Analyzer
• The Lexical Analyzer reads the source program character by character to produce tokens.
• Normally a lexical analyzer doesn't return a list of tokens in one shot; it returns a token each time the parser asks for one.
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token (a minimal sketch follows below).
Other possible recovery actions:
• Delete one character from the remaining input.
• Insert a missing character into the remaining input.
• Replace a character by another character.
• Transpose two adjacent characters.
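A minimal sketch of the panic-mode skipping mentioned above, assuming (our own choice, for illustration) that letters, digits and a few operator characters can begin a well-formed token:

#include <stdio.h>
#include <string.h>

/* Characters that, by our assumption here, can start a token. */
static int can_start_token(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || strchr("+-*/:=<>", c) != NULL;
}

/* Panic mode: ignore successive characters until one that can
   start a well-formed token is found. */
static const char *panic_skip(const char *p) {
    while (*p && !can_start_token(*p))
        p++;
    return p;
}

int main(void) {
    const char *rest = panic_skip("@#$ count");
    printf("resuming at: \"%s\"\n", rest);   /* resuming at: "count" */
    return 0;
}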
Token
• A token represents a set of strings described by a pattern.
• For example, an identifier represents the set of strings that start with a letter and continue with letters and digits. The actual string (e.g. newval) is called a lexeme.
• Tokens: identifier, number, addop, delimiter, …
• Since a token can represent more than one lexeme, additional information should be held for each specific lexeme. This additional information is called an attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information for that token.
• For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token.
Tokens, Patterns and Lexemes
• A token is a pair of a token name and an optional token value.
• A pattern is a description of the form that the lexemes of a token may take.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Input buffering
• Sometimes the lexical analyzer needs to look ahead several symbols to decide which token to return.
• In the C language: we need to look beyond -, =, or < to decide what token to return.
• In Fortran: DO 5 I = 1.25 (an assignment to the variable DO5I) cannot be distinguished from the loop header DO 5 I = 1,25 until the . or , is seen.
• We need to introduce a two-buffer scheme to handle large look-aheads safely.

E = M * C * * 2 eof
Sentinels

E = M eof * C * * 2 eof                eof

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters
}
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are a means for specifying regular languages.
• Example: letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings.
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]
• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_ (letter_ | digit)*
Recognition of tokens
• The starting point is the language grammar, to understand the tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | Ɛ
expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
ws -> (blank | tab | newline)+
Operations on Languages
• Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation: L0 = {Ɛ}, L1 = L, L2 = LL, …
• Kleene Closure: L* = ∪ Li for i ≥ 0 (= L0 ∪ L1 ∪ L2 ∪ …)
• Positive Closure: L+ = ∪ Li for i ≥ 1
• Example: if L1 = {a,b,c,d} and L2 = {1,2}, then
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• Identities: (r)+ = (r)(r)* and (r)? = (r) | Ɛ
• Ex: ∑ = {0,1}
• 0|1 => {0,1}
• (0|1)(0|1) => {00,01,10,11}
• 0* => {Ɛ,0,00,000,0000,…}
• (0|1)* => all strings of 0s and 1s, including the empty string
Transition diagrams
• Transition diagram for relop
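Since the relop diagram itself is a lecture figure, here is a direct transcription of it into C as a sketch. The get_char()/retract() pair stands in for the lexer's forward pointer, the state numbers in the comments follow the usual textbook numbering, and all names are illustrative assumptions.

#include <stdio.h>

/* Hypothetical single-string input; get_char/retract play the role
   of advancing and retracting the lexer's forward pointer. */
static const char *input = "<=";
static int pos = 0, advanced = 0;

static int get_char(void) {
    if (input[pos]) { advanced = 1; return input[pos++]; }
    advanced = 0;
    return EOF;
}
static void retract(void) { if (advanced) pos--; advanced = 0; }

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

static Relop relop(void) {
    int c = get_char();                  /* state 0 */
    if (c == '<') {                      /* state 1 */
        c = get_char();
        if (c == '=') return LE;         /* state 2: <= */
        if (c == '>') return NE;         /* state 3: <> */
        retract(); return LT;            /* state 4: '<' then other */
    }
    if (c == '=') return EQ;             /* state 5: = */
    if (c == '>') {                      /* state 6 */
        c = get_char();
        if (c == '=') return GE;         /* state 7: >= */
        retract(); return GT;            /* state 8: '>' then other */
    }
    retract();
    return NOT_RELOP;
}

int main(void) {
    printf("relop(\"<=\") = %d (LE = %d)\n", relop(), LE);
    return 0;
}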
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
Design of a Lexical Analyzer (LEX)
Design of a Lexical Analyzer
• LEX is a software tool that automatically constructs a lexical analyzer from a program (a specification of token patterns and their actions).
• The Lex specification is of the form:

p1 {action 1}
p2 {action 2}
…
Example
Consider the following patterns and their actions:

a    {action A1 for pattern p1}
abb  {action A2 for pattern p2}
a*b+ {action A3 for pattern p3}

When several patterns match, Lex takes the longest matching lexeme; if several patterns match the same longest lexeme, the one listed first wins.
LEX in use
• An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated.
• The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
• The latter file is compiled by the C compiler into a file called a.out.
• The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
Structure of LEX Program
%{
C declarations (part of the definition section)
%}
definition section
%%
rules section
%%
user subroutines section
Consider the following LEX program, which counts the vowels and consonants in its input:
%{
#include <stdio.h>
int vow = 0, con = 0;            /* counters, declared in the definition section */
%}
%%
[ \t\n]+ ;
[aeiouAEIOU]+ {vow += yyleng;}   /* yyleng = length of the matched lexeme */
[^aeiouAEIOU] {con++;}
%%
int main()
{
    printf("Enter some input string:\n");
    yylex();
    printf("Number of vowels=%d\n", vow);
    printf("Number of consonants=%d\n", con);
    return 0;
}
int yywrap() { return 1; }       /* so the program links without the lex library */
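To try it (assuming, as an illustrative name, the specification is saved as count.l): lex count.l generates lex.yy.c, and cc lex.yy.c produces a.out, the pipeline described in the "LEX in use" slide above; since yywrap() is defined here, the lex library (-ll, or -lfl for flex) is not needed at link time.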
Lexical Analyzer Generator - Lex

Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• Recognizer: a recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
Finite Automata
• A transition s1 --a--> s2 is read: in state s1, on input "a", go to state s2.
• At the end of input: if in an accepting state => accept; otherwise => reject.
• If no transition is possible => reject.
Finite Automata State Graphs
• A state is drawn as a circle.
• An accepting state is drawn as a double circle.
• A transition on input a is drawn as an arrow labeled "a" between two states.
Finite Automata
• A recognizer for a language is a program that takes a string x, and answers "yes" if x is a sentence of that language, and "no" otherwise.
• We call the recognizer of the tokens a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
• deterministic – faster recognizer, but it may take more space
• non-deterministic – slower, but it may take less space
• Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
• Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
• Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
• Ɛ-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)
• One transition per input per state
• No Ɛ-moves
• Nondeterministic Finite Automata (NFA)
• Can have multiple transitions for one input in a given state
• Can have Ɛ-moves
• Finite automata have finite memory
• Need only to encode the current state
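Because only the current state must be remembered, a DFA is naturally implemented as a transition table plus a single integer. The sketch below, in C, simulates a DFA for (a|b)*abb (the running example of the later slides); the state numbering and the accepts() interface are our own choices.

#include <stdio.h>

/* Transition table for a DFA recognizing (a|b)*abb.
   States 0..3; state 3 is the accepting state. */
static const int delta[4][2] = {
    /*         a  b */
    /* 0 */  { 1, 0 },
    /* 1 */  { 1, 2 },
    /* 2 */  { 1, 3 },
    /* 3 */  { 1, 0 },
};

static int accepts(const char *s) {
    int state = 0;                              /* all the memory we need */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;   /* symbol outside the alphabet */
        state = delta[state][*s - 'a'];
    }
    return state == 3;
}

int main(void) {
    printf("abb  -> %s\n", accepts("abb")  ? "accept" : "reject");  /* accept */
    printf("aabb -> %s\n", accepts("aabb") ? "accept" : "reject");  /* accept */
    printf("ab   -> %s\n", accepts("ab")   ? "accept" : "reject");  /* reject */
    return 0;
}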
A Simple Example
• A finite automaton that accepts only “1”
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
NFA

Transition Table
Converting a Regular Expression into an NFA (Thompson's Construction)
• This is one way to convert a regular expression into an NFA.
• There can be other (more efficient) ways to do the conversion.
• Thompson's Construction is a simple and systematic method. It guarantees that the resulting NFA will have exactly one final state and one start state.
• Construction starts from the simplest parts (alphabet symbols).
• To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.
(figures: the NFA for r1 | r2, built by placing N(r1) and N(r2) in parallel between a new start state i and a new final state f; the NFA for r*, built around N(r); the NFAs for (a|b)* and (a|b)*a; and the resulting automaton with states S0, S1, S2)
Minimization of DFA
• DFA minimization merges states that no input string can distinguish: starting from the two-group partition {accepting states, non-accepting states}, a group is split whenever its members move to different groups on some input symbol, until no further splits are possible.

Example - Minimization of DFA
(worked example given in the lecture figures)
Regular Expression to DFA (Direct Method)
• The DFA is constructed directly from an augmented regular expression, without building an NFA first.

Regular Expression to DFA (Direct Method) - Example
• Regular Expression: (a|b)*abb
• Augmented Regular Expression: (a|b)*abb#  (written with explicit concatenation: (a|b)*·a·b·b·#)

Computation of Nullable, Firstpos, Lastpos:
• For each node n of the syntax tree of the augmented expression, nullable(n), firstpos(n) and lastpos(n) are computed bottom-up; from these, followpos(i) is computed for each position i. The DFA states are sets of positions, with firstpos of the root as the start state.

Example:
(worked example given in the lecture figures)
Direct Method
(construction steps given in the lecture figures)