Module 1
Preliminaries Required
• Basic knowledge of programming languages.
• Basic knowledge of FSA (finite-state automata) and CFG (context-free grammars).
• Knowledge of a high-level programming language for the programming assignments.
Textbook:
Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman,
“Compilers: Principles, Techniques, and Tools”
Addison-Wesley, 1986.
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
• Context Free Grammars
• Top-Down Parsing, LL Parsing
• Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
• Attribute Definitions
• Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Run-Time Organization
• Intermediate Code Generation
• Code Optimization
• Code Generation
Compiler - Introduction
• A compiler is a program that can read a program in one language - the source language - and
translate it into an equivalent program in another language - the target language.
• A compiler acts as a translator, transforming human-oriented programming languages into
computer-oriented machine languages.
• It hides machine-dependent details from the programmer.
COMPILERS
• A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.

source program (normally a program written in a high-level programming language)
        |
        v
    COMPILER ──> error messages
        |
        v
target program (normally the equivalent program in machine code – a relocatable object file)
Compiler vs Interpreter
• An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user.
Other Applications
• In addition to compiler development itself, the techniques used in compiler design are applicable to many other problems in computer science.
• Techniques used in a lexical analyzer can be used in text editors, information retrieval systems, and pattern recognition programs.
• Techniques used in a parser can be used in query processing systems such as SQL.
• Many software tools that have a complex front end need techniques used in compiler design.
• For example, a symbolic equation solver that takes an equation as input must parse the given input equation.
• Most of the techniques used in compiler design can be used in Natural Language Processing (NLP) systems.
Major Parts of Compilers
• There are two major parts of a compiler: analysis and synthesis.
• In the analysis phase, an intermediate representation is created from the given source program.
• The Lexical Analyzer, Syntax Analyzer and Semantic Analyzer are the parts of this phase.
• In the synthesis phase, the equivalent target program is created from this intermediate representation.
Structure of a Compiler
• The analysis part breaks up the source program into constituent pieces and imposes a grammatical structure on them.
Phases of A Compiler

Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program
Lexical Analyzer
• The Lexical Analyzer reads the source program character by character and returns the tokens of the source program.
• A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
Ex: newval := oldval + 12  =>  tokens:
  newval   identifier
  :=       assignment operator
  oldval   identifier
  +        add operator
  12       a number
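To make the character-by-character reading concrete, here is a minimal scanner sketch in C for this example. The scan() routine and its printed output are illustrative assumptions only; a real lexical analyzer would hand each token to the parser instead of printing it.

#include <ctype.h>
#include <stdio.h>

/* Hypothetical input; a real lexer would read from a file or buffer. */
static const char *p = "newval := oldval + 12";

static void scan(void) {
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; }          /* skip whitespace */
        else if (isalpha((unsigned char)*p)) {            /* identifier */
            printf("identifier: ");
            while (isalnum((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else if (isdigit((unsigned char)*p)) {          /* number */
            printf("number: ");
            while (isdigit((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else if (p[0] == ':' && p[1] == '=') {          /* := */
            printf("assignment operator: :=\n");
            p += 2;
        } else if (*p == '+') {
            printf("add operator: +\n");
            p++;
        } else {
            printf("unknown character: %c\n", *p++);
        }
    }
}

int main(void) { scan(); return 0; }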
Phases of Compiler - Lexical Analysis
• It is also called scanning.
• This phase scans the source code as a stream of characters and converts it into meaningful lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>.
• It passes the tokens on to the subsequent phase, syntax analysis.
Lexical Analysis
• Lexical analysis breaks up a program into tokens.
• It groups characters into inseparable units (tokens).
• It changes a stream of characters into a stream of tokens.
Token, Pattern and Lexeme
• Token: a token is a sequence of characters that can be treated as a single logical entity. Typical tokens are: identifiers, keywords, operators, special symbols, and constants.
• Pattern: a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern associated with the token.
• Lexeme: a lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Token, Pattern and Lexeme
Example 1

int main() {
    // printf() sends the string inside the quotation marks to
    // the standard output (the display)
    printf("Welcome to Compiler Design!");
    return 0;
}

Tokens: 'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to Compiler Design!"', ')', ';', 'return', '0', ';', '}'
Symbol Table
• Not part of the final code, but used as a reference by all phases of a compiler.
• Typical information stored there includes the name, type, size, and relative offset of variables.
• Generally created by the lexical analyzer and syntax analyzer.
• Good data structures are needed to minimize searching time.
• The data structure may be flat or hierarchical.
Syntax Analysis
• A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the given program.
• A syntax analyzer is also called a parser.
• A parse tree describes the syntactic structure of the program.
• It takes the tokens produced by lexical analysis as input and generates a parse tree (or syntax tree).
• In this phase, token arrangements are checked against the source code grammar, i.e. the parser checks whether the expression made by the tokens is syntactically correct.
Syntax Analyzer (CFG)
• The syntax of a language is specified by a context free grammar (CFG).
• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not.
• If so, the syntax analyzer creates a parse tree for the given program.
• Ex: We use BNF (Backus-Naur Form) to specify a CFG:
assgstmt -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression
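For instance, these rules give the earlier statement newval := oldval + 12 a parse tree along the following lines (sketched here as indented text; since the expression rule as written is ambiguous, this is one of the possible trees):

assgstmt
├── identifier (newval)
├── :=
└── expression
    ├── expression ── identifier (oldval)
    ├── +
    └── expression ── number (12)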
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups:
• Top-Down Parsing,
• Bottom-Up Parsing
• Top-Down Parsing:
• Construction of the parse tree starts at the root and proceeds towards the leaves.
• Efficient top-down parsers can be easily constructed by hand.
• Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing) – a small recursive parser is sketched after this list.
• Bottom-Up Parsing:
• Construction of the parse tree starts at the leaves and proceeds towards the root.
• Normally, efficient bottom-up parsers are created with the help of software tools.
• Bottom-up parsing is also known as shift-reduce parsing.
• Operator-Precedence Parsing – simple, restrictive, easy to implement.
• LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR.
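As a concrete illustration of recursive predictive parsing, here is a minimal hand-written parser in C for the assgstmt grammar shown earlier. It is only a sketch under our own assumptions: the left-recursive rule expression -> expression + expression is first rewritten into the equivalent expression -> primary { + primary } form, since a top-down parser cannot handle left recursion directly, and the helper names are invented for the example.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Input for the demonstration; a real parser would consume tokens
   produced by the lexical analyzer rather than raw characters. */
static const char *p = "newval := oldval + 12";

static void skip(void) { while (isspace((unsigned char)*p)) p++; }

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s at \"%s\"\n", msg, p);
    exit(1);
}

/* expression -> identifier | number (the non-recursive alternatives) */
static void primary(void) {
    skip();
    if (isalpha((unsigned char)*p)) {
        while (isalnum((unsigned char)*p)) p++;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p)) p++;
    } else {
        error("expected identifier or number");
    }
}

/* expression -> primary { + primary }  (left recursion eliminated) */
static void expression(void) {
    primary();
    skip();
    while (*p == '+') { p++; primary(); skip(); }
}

/* assgstmt -> identifier := expression */
static void assgstmt(void) {
    skip();
    if (!isalpha((unsigned char)*p)) error("expected identifier");
    while (isalnum((unsigned char)*p)) p++;
    skip();
    if (p[0] == ':' && p[1] == '=') p += 2; else error("expected :=");
    expression();
}

int main(void) {
    assgstmt();
    printf("parsed OK\n");
    return 0;
}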
Syntax Analyzer versus Lexical Analyzer
• Which constructs of a program should be recognized by the lexical
analyzer, and which ones by the syntax analyzer?
• Both of them do similar things; but the lexical analyzer deals with simple non-recursive constructs of the language.
• The syntax analyzer deals with recursive constructs of the language.
• The lexical analyzer simplifies the job of the syntax analyzer.
• The lexical analyzer recognizes the smallest meaningful units (tokens) in a source program.
• The syntax analyzer works on the smallest meaningful units (tokens) in a source program to recognize
meaningful structures in our programming language.
Phases of Compiler - Semantic Analysis
• Semantic analysis checks whether the constructed parse tree follows the rules of the language.
• The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition.
• It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking.
Phases of Compiler - Semantic Analysis
• Suppose that position, initial, and rate have been declared to be floating-point numbers and that the lexeme 60 by itself forms an integer.
• The type checker in the semantic analyzer discovers that the operator * is applied to a floating-point number rate and an integer 60.
• In this case, the integer may be converted into a floating-point number: the semantic analyzer inserts an inttofloat conversion into the syntax tree.
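The following minimal C sketch mimics that check. The Type/Node representation and check_mul() are invented for the illustration (the slides do not prescribe one); it reports the inttofloat coercion a real semantic analyzer would insert into the syntax tree.

#include <stdio.h>

typedef enum { T_INT, T_FLOAT } Type;
typedef struct { const char *lexeme; Type type; } Node;

/* Type-check a multiplication: if one operand is int and the other is
   float, report the coercion that would be inserted and yield float. */
static Type check_mul(const Node *lhs, const Node *rhs) {
    if (lhs->type != rhs->type) {
        const Node *intop = (lhs->type == T_INT) ? lhs : rhs;
        printf("coercion: inttofloat(%s)\n", intop->lexeme);
        return T_FLOAT;
    }
    return lhs->type;
}

int main(void) {
    Node rate  = { "rate", T_FLOAT };
    Node sixty = { "60",   T_INT   };
    check_mul(&rate, &sixty);            /* prints: coercion: inttofloat(60) */
    return 0;
}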
Cousins of Compiler - Language Processing System
Preprocessor
• Pre-processors produce input to compilers.
• The functions performed are:
• Macro processing – allows the user to define macros.
• File inclusion – includes header files into the program.
• Rational pre-processors – augment older languages with more modern flow-of-control and data-structuring facilities.
• Language extension – attempts to add capabilities to the language by what amounts to built-in macros (e.g., embedding queries in C).
Assembler
• Assembly code is a mnemonic version of machine code, in which names are used instead of binary codes for operations:

MOV a,R1
ADD #2,R1
MOV R1,b

• Some compilers produce assembly code, which is passed to an assembler for further processing.
• Other compilers perform the job of the assembler, producing relocatable machine code that is passed directly to the loader/link editor.
Two-Pass Assembler
• This is the simplest form of assembler.
• In the first pass, all the identifiers that denote storage locations are found and stored in a symbol table.
• In the second pass, the assembler scans the input again, translating operation codes into machine code and identifiers into the addresses recorded in the symbol table.
Consider b = a + 2:

Identifier   Address
a            0
b            4
Loader/Link Editor
• Loading – loads the relocatable machine code into memory at the proper location.
• A link editor allows us to make a single program from several files of relocatable machine code.
Compiler Construction Tool
Role of a Lexical Analyzer
Why separate Lexical Analysis and Parsing?
1. Simplicity of design
2. Improving compiler efficiency
3. Enhancing compiler portability
The role of the lexical analyzer

source program → Lexical Analyzer --token--> Parser → to semantic analysis
(the parser requests each token via getNextToken; both components consult the symbol table)
Lexical Analyzer
• The Lexical Analyzer reads the source program character by character to produce tokens.
• Normally a lexical analyzer doesn't return a list of tokens in one shot; it returns a token each time the parser asks for one.
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token (a minimal sketch follows below).
Other possible recovery actions:
• Delete one character from the remaining input.
• Insert a missing character into the remaining input.
• Replace a character by another character.
• Transpose two adjacent characters.
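A minimal sketch of the panic-mode skipping mentioned above, assuming (our own choice, for illustration) that letters, digits and a few operator characters can begin a well-formed token:

#include <stdio.h>
#include <string.h>

/* Characters that, by our assumption here, can start a token. */
static int can_start_token(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || strchr("+-*/:=<>", c) != NULL;
}

/* Panic mode: ignore successive characters until one that can
   start a well-formed token is found. */
static const char *panic_skip(const char *p) {
    while (*p && !can_start_token(*p))
        p++;
    return p;
}

int main(void) {
    const char *rest = panic_skip("@#$ count");
    printf("resuming at: \"%s\"\n", rest);   /* resuming at: "count" */
    return 0;
}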
Token
• A token represents a set of strings described by a pattern.
• For example, an identifier represents the set of strings that start with a letter and continue with letters and digits. The actual string (e.g. newval) is called a lexeme.
• Tokens: identifier, number, addop, delimiter, …
• Since a token can represent more than one lexeme, additional information should be held for each specific lexeme. This additional information is called an attribute of the token.
• For simplicity, a token may have a single attribute which holds the required information for that token.
• For identifiers, this attribute is a pointer to the symbol table, and the symbol table holds the actual attributes for that token.
Tokens, Patterns and Lexemes
• A token is a pair of a token name and an optional token value.
• A pattern is a description of the form that the lexemes of a token may take.
• A lexeme is a sequence of characters in the source program that matches the pattern for a token.
Input buffering
• Sometimes the lexical analyzer needs to look ahead several symbols to decide which token to return.
• In the C language: we need to look beyond -, =, or < to decide what token to return.
• In Fortran: DO 5 I = 1.25 (an assignment to the variable DO5I) cannot be distinguished from the loop header DO 5 I = 1,25 until the . or , is seen.
• We need to introduce a two-buffer scheme to handle large look-aheads safely.

E = M * C * * 2 eof
Sentinels

E = M eof * C * * 2 eof                eof

switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
cases for the other characters
}
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens.
• Regular expressions are a means for specifying regular languages.
• Example: letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings.
Regular expressions
• Ɛ is a regular expression, L(Ɛ) = {Ɛ}
• If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn
• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Extensions
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]
• Example:
• letter_ -> [A-Za-z_]
• digit -> [0-9]
• id -> letter_ (letter_ | digit)*
Recognition of tokens
• The starting point is the language grammar, to understand the tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | Ɛ
expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
If -> if
Then -> then
Else -> else
Relop -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
ws -> (blank | tab | newline)+
Operations on Languages
• Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
• Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
• Exponentiation: L0 = {Ɛ}, L1 = L, L2 = LL, …
• Kleene Closure: L* = ∪ Li for i ≥ 0 (= L0 ∪ L1 ∪ L2 ∪ …)
• Positive Closure: L+ = ∪ Li for i ≥ 1
• Example: if L1 = {a,b,c,d} and L2 = {1,2}, then
• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
• L1 ∪ L2 = {a,b,c,d,1,2}
• Identities: (r)+ = (r)(r)* and (r)? = (r) | Ɛ
• Ex: ∑ = {0,1}
• 0|1 => {0,1}
• (0|1)(0|1) => {00,01,10,11}
• 0* => {Ɛ,0,00,000,0000,…}
• (0|1)* => all strings of 0s and 1s, including the empty string
Transition diagrams
• Transition diagram for relop
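Since the relop diagram itself is a lecture figure, here is a direct transcription of it into C as a sketch. The get_char()/retract() pair stands in for the lexer's forward pointer, the state numbers in the comments follow the usual textbook numbering, and all names are illustrative assumptions.

#include <stdio.h>

/* Hypothetical single-string input; get_char/retract play the role
   of advancing and retracting the lexer's forward pointer. */
static const char *input = "<=";
static int pos = 0, advanced = 0;

static int get_char(void) {
    if (input[pos]) { advanced = 1; return input[pos++]; }
    advanced = 0;
    return EOF;
}
static void retract(void) { if (advanced) pos--; advanced = 0; }

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

static Relop relop(void) {
    int c = get_char();                  /* state 0 */
    if (c == '<') {                      /* state 1 */
        c = get_char();
        if (c == '=') return LE;         /* state 2: <= */
        if (c == '>') return NE;         /* state 3: <> */
        retract(); return LT;            /* state 4: '<' then other */
    }
    if (c == '=') return EQ;             /* state 5: = */
    if (c == '>') {                      /* state 6 */
        c = get_char();
        if (c == '=') return GE;         /* state 7: >= */
        retract(); return GT;            /* state 8: '>' then other */
    }
    retract();
    return NOT_RELOP;
}

int main(void) {
    printf("relop(\"<=\") = %d (LE = %d)\n", relop(), LE);
    return 0;
}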
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
Design of a Lexical Analyzer (LEX)
Design of a Lexical Analyzer
• LEX is a software tool that automatically constructs a lexical analyzer from a program (a specification of token patterns and their actions).
• The Lex specification is of the form:

p1 {action 1}
p2 {action 2}
…
Example
Consider the following patterns and their actions:

a    {action A1 for pattern p1}
abb  {action A2 for pattern p2}
a*b+ {action A3 for pattern p3}

When several patterns match, Lex takes the longest matching lexeme; if several patterns match the same longest lexeme, the one listed first wins.
LEX in use
• An input file, which we call lex.l, is written in the Lex language and describes the lexical analyzer to be generated.
• The Lex compiler transforms lex.l to a C program, in a file that is always named lex.yy.c.
• The latter file is compiled by the C compiler into a file called a.out.
• The C-compiler output is a working lexical analyzer that can take a stream of input characters and produce a stream of tokens.
Structure of LEX Program
%{
C declarations (part of the definition section)
%}
definition section
%%
rules section
%%
user subroutines section
Consider the following LEX program, which counts the vowels and consonants in its input:
%{
#include <stdio.h>
int vow = 0, con = 0;            /* counters, declared in the definition section */
%}
%%
[ \t\n]+ ;
[aeiouAEIOU]+ {vow += yyleng;}   /* yyleng = length of the matched lexeme */
[^aeiouAEIOU] {con++;}
%%
int main()
{
    printf("Enter some input string:\n");
    yylex();
    printf("Number of vowels=%d\n", vow);
    printf("Number of consonants=%d\n", con);
    return 0;
}
int yywrap() { return 1; }       /* so the program links without the lex library */
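To try it (assuming, as an illustrative name, the specification is saved as count.l): lex count.l generates lex.yy.c, and cc lex.yy.c produces a.out, the pipeline described in the "LEX in use" slide above; since yywrap() is defined here, the lex library (-ll, or -lfl for flex) is not needed at link time.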
Lexical Analyzer Generator - Lex

Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
input stream → a.out → sequence of tokens
Finite Automata
• Regular expressions = specification
• Finite automata = implementation
• Recognizer: a recognizer for a language is a program that takes as input a string x and answers "yes" if x is a sentence of the language and "no" otherwise.
Finite Automata
• A transition s1 --a--> s2 is read: in state s1, on input "a", go to state s2.
• At the end of input: if in an accepting state => accept; otherwise => reject.
• If no transition is possible => reject.
Finite Automata State Graphs
• A state is drawn as a circle.
• An accepting state is drawn as a double circle.
• A transition on input a is drawn as an arrow labeled "a" between two states.
Finite Automata
• A recognizer for a language is a program that takes a string x, and answers "yes" if x is a sentence of that language, and "no" otherwise.
• We call the recognizer of the tokens a finite automaton.
• A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
• This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
• Both deterministic and non-deterministic finite automata recognize regular sets.
• Which one?
• deterministic – faster recognizer, but it may take more space
• non-deterministic – slower, but it may take less space
• Deterministic automata are widely used in lexical analyzers.
• First, we define regular expressions for tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
• Algorithm 1: Regular Expression → NFA → DFA (two steps: first to NFA, then to DFA)
• Algorithm 2: Regular Expression → DFA (directly convert a regular expression into a DFA)
• Ɛ-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)
• One transition per input per state
• No Ɛ-moves
• Nondeterministic Finite Automata (NFA)
• Can have multiple transitions for one input in a given state
• Can have Ɛ-moves
• Finite automata have finite memory
• Need only to encode the current state
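Because only the current state must be remembered, a DFA is naturally implemented as a transition table plus a single integer. The sketch below, in C, simulates a DFA for (a|b)*abb (the running example of the later slides); the state numbering and the accepts() interface are our own choices.

#include <stdio.h>

/* Transition table for a DFA recognizing (a|b)*abb.
   States 0..3; state 3 is the accepting state. */
static const int delta[4][2] = {
    /*         a  b */
    /* 0 */  { 1, 0 },
    /* 1 */  { 1, 2 },
    /* 2 */  { 1, 3 },
    /* 3 */  { 1, 0 },
};

static int accepts(const char *s) {
    int state = 0;                              /* all the memory we need */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;   /* symbol outside the alphabet */
        state = delta[state][*s - 'a'];
    }
    return state == 3;
}

int main(void) {
    printf("abb  -> %s\n", accepts("abb")  ? "accept" : "reject");  /* accept */
    printf("aabb -> %s\n", accepts("aabb") ? "accept" : "reject");  /* accept */
    printf("ab   -> %s\n", accepts("ab")   ? "accept" : "reject");  /* reject */
    return 0;
}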
A Simple Example
• A finite automaton that accepts only “1”
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
NFA

Transition Table
Converting a Regular Expression into an NFA (Thompson's Construction)
• This is one way to convert a regular expression into an NFA.
• There can be other (more efficient) ways to do the conversion.
• Thompson's Construction is a simple and systematic method. It guarantees that the resulting NFA will have exactly one final state and one start state.
• Construction starts from the simplest parts (alphabet symbols).
• To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined to create its NFA.
(figures: the NFA for r1 | r2, built by placing N(r1) and N(r2) in parallel between a new start state i and a new final state f; the NFA for r*, built around N(r); the NFAs for (a|b)* and (a|b)*a; and the resulting automaton with states S0, S1, S2)
Minimization of DFA
• DFA minimization merges states that no input string can distinguish: starting from the two-group partition {accepting states, non-accepting states}, a group is split whenever its members move to different groups on some input symbol, until no further splits are possible.

Example - Minimization of DFA
(worked example given in the lecture figures)
Regular Expression to DFA (Direct Method)
• The DFA is constructed directly from an augmented regular expression, without building an NFA first.

Regular Expression to DFA (Direct Method) - Example
• Regular Expression: (a|b)*abb
• Augmented Regular Expression: (a|b)*abb#  (written with explicit concatenation: (a|b)*·a·b·b·#)

Computation of Nullable, Firstpos, Lastpos:
• For each node n of the syntax tree of the augmented expression, nullable(n), firstpos(n) and lastpos(n) are computed bottom-up; from these, followpos(i) is computed for each position i. The DFA states are sets of positions, with firstpos of the root as the start state.

Example:
(worked example given in the lecture figures)
Direct Method
(construction steps given in the lecture figures)