Unit 1 Lexical Analyzer

The document outlines a course on compiler design. It discusses topics that will be covered like lexical analysis, syntax analysis using context free grammars, syntax directed translation, semantic analysis, code generation and optimization. It also defines compilers and interpreters, and discusses the major parts and phases of a compiler like the lexical analyzer, syntax analyzer, intermediate code generation, code optimization and code generation.

Compiler Design : TY Comp. Sem VI

Prof. Reshma Pise


Comp Engg. Dept
Vishwakarma University
Course Outline
• Introduction to Compiling
• Lexical Analysis
• Syntax Analysis
– Context Free Grammars
– Top-Down Parsing, LL Parsing
– Bottom-Up Parsing, LR Parsing
• Syntax-Directed Translation
– Attribute Definitions
– Evaluation of Attribute Definitions
• Semantic Analysis, Type Checking
• Intermediate Code Generation
• Code Optimization
• Code Generation
• Recent topics in compilers



Translators

• Programs written in high-level languages need to be translated into low-level
(machine code) form for processing and execution by the CPU. This is
done by a translator program.

• There are three types of translator program:
– Interpreters
– Compilers
– Assemblers


COMPILERS
• A compiler is a program that takes a program written in a source language
and translates it into an equivalent program in a target language.

source program → COMPILER → target / object program
                     ↓
               error messages

( The source is normally a program written in a high-level programming language;
the target is normally the equivalent program in machine code – a relocatable object file. )


Compilers
• Compilers translate from a source language (typically a high-level
language) to a functionally equivalent target language (typically the
machine code of a particular machine or a machine-independent
virtual machine).
• Compilers for high-level programming languages are among the
larger and more complex pieces of software.
– The original languages included Fortran and Cobol.
• Often multi-pass compilers (to facilitate memory reuse)
– Compiler development helped in better programming language design.
• Early development focused on syntactic analysis and optimization.
– Commercially, compilers are developed by very large software groups.
• The current focus is on optimization and smart use of resources for modern RISC
(reduced instruction set computer) architectures.
Why Study Compilers?
• General background information for a good software engineer
– Increases understanding of language semantics
– Seeing the machine code generated for language constructs
helps in understanding performance issues for languages
– Teaches good language design
– New devices may need device-specific languages
– New business fields may need domain-specific languages
Applications of Compiler Technology & Tools

• Processing XML and other formats to generate documents, code, etc.
• Processing domain-specific and device-specific languages
• Implementing a server that uses a protocol such as HTTP or IMAP
• Natural language processing, for example, spam filters, search,
document comprehension, summary generation
• Translating from a hardware description language to the
schematic of a circuit
• Automatic graph layout (graphviz, for example)
• Extending an existing programming language
• Program analysis and improvement tools
Other Applications
• In addition to the development of a compiler, the techniques used in
compiler design can be applicable to many problems in computer
science.
– Techniques used in a lexical analyzer can be used in text editors, information
retrieval systems, and pattern recognition programs.
– Techniques used in a parser can be used in a query processing system such as
SQL.
– A symbolic equation solver takes an equation as input; such a program must
parse the given input equation.
– Most of the techniques used in compiler design can be used in Natural Language
Processing (NLP) systems.


Compilers
A compiler translates the whole program into a machine-code
version that can be run without the compiler being present.

Advantage: the program runs fast, as it is
already in machine code; the translator
program is only needed at the time of
compiling.

Disadvantage: compilation is slow, as the
whole program must be translated.


Interpreters
An interpreter translates HLL
code into machine code one line at a
time.

Advantage: it is easy to find errors; better
for learners.

Disadvantage: the program runs slowly, as it
has to be continually interpreted, and the
interpreter program must always be in memory
to interpret the program.


Major Parts of Compilers
• There are two major parts of a compiler: Analysis and Synthesis

• In the analysis phase, an intermediate representation is created from the
given source program.

• In the synthesis phase, the equivalent target program is created from this
intermediate representation.


Phases of A Compiler

Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer →
Intermediate Code Generator → Code Optimizer → Code Generator → Target Program

• Each phase transforms the source program from one representation
into another representation.
• The phases communicate with error handlers.
• The phases communicate with the symbol table.

• Analysis : Front end


1. Lexical Analyzer (Scanner)
• The Lexical Analyzer reads the source program character by character and
returns the tokens of the source program.
• A token describes a pattern of characters having the same meaning in the
source program (such as identifiers, operators, keywords, numbers,
delimiters and so on).
Ex: newval := oldval + 12  =>  tokens:
    newval   identifier
    :=       assignment operator
    oldval   identifier
    +        add operator
    12       a number

• Puts information about identifiers into the symbol table.
• Regular expressions are used to describe tokens (lexical constructs).
• A (Deterministic) Finite State Automaton can be used in the
implementation of a lexical analyzer.
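The character-by-character scanning described above can be sketched as a tiny hand-written scanner for the example statement. A minimal sketch, assuming a toy language: the token names (`TK_ID`, `TK_ASSIGN`, ...) and the `next_token` interface are illustrative, not part of any real compiler.

```c
#include <ctype.h>

typedef enum { TK_ID, TK_ASSIGN, TK_PLUS, TK_NUM, TK_EOF } TokenKind;

/* Scan one token starting at src[*pos]; advance *pos past the lexeme. */
TokenKind next_token(const char *src, int *pos) {
    while (src[*pos] == ' ') (*pos)++;            /* skip white space */
    char c = src[*pos];
    if (c == '\0') return TK_EOF;
    if (isalpha((unsigned char)c)) {              /* identifier: letter (letter|digit)* */
        while (isalnum((unsigned char)src[*pos])) (*pos)++;
        return TK_ID;
    }
    if (isdigit((unsigned char)c)) {              /* number: digit+ */
        while (isdigit((unsigned char)src[*pos])) (*pos)++;
        return TK_NUM;
    }
    if (c == ':' && src[*pos + 1] == '=') { *pos += 2; return TK_ASSIGN; }
    if (c == '+') { (*pos)++; return TK_PLUS; }
    (*pos)++;                                     /* unknown character: skip (error handling omitted) */
    return TK_EOF;
}
```

Running it over `newval := oldval + 12` yields exactly the token sequence shown on the slide: identifier, assignment operator, identifier, add operator, number.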
2. Syntax Analyzer (Parser)
• A Syntax Analyzer creates the syntactic structure (generally a parse
tree) of the given program.
• A syntax analyzer is also called a parser.
• A parse tree describes a syntactic structure.
• In a parse tree, all terminals are at the leaves.
• All inner nodes are non-terminals of a context free grammar.

Parse tree for newval := oldval + 12 :

            assgstmt
          /     |      \
 identifier    :=    expression
     |              /     |     \
  newval    expression    +    expression
                 |                  |
             identifier          number
                 |                  |
              oldval               12


Syntax Analyzer
• The syntax of a language is specified by a context free grammar
(CFG).
• The rules in a CFG are mostly recursive.
• A syntax analyzer checks whether a given program satisfies the rules
implied by a CFG or not.
– If it does, the syntax analyzer creates a parse tree for the given program.

• Ex: We use BNF (Backus Naur Form) to specify a CFG.
Production rules:
assgstmt   -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression
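One way to recognize strings of this grammar is a small recursive-descent recognizer. This is only a sketch under two stated assumptions: tokens are pre-encoded as single characters (`i` identifier, `n` number, `=` standing for `:=`), and the left-recursive last production is rewritten as a loop so the recognizer terminates.

```c
/* Token encoding (an assumption for this sketch):
   'i' = identifier, 'n' = number, '+' = plus, '=' = the := operator. */
static const char *tp;   /* cursor over the token string */

/* expression -> (identifier | number) ( + (identifier | number) )*  */
static int expression(void) {
    if (*tp == 'i' || *tp == 'n') tp++; else return 0;
    while (*tp == '+') {
        tp++;
        if (*tp == 'i' || *tp == 'n') tp++; else return 0;
    }
    return 1;
}

/* assgstmt -> identifier := expression ; returns 1 iff the whole string parses. */
int assgstmt(const char *tokens) {
    tp = tokens;
    if (*tp != 'i') return 0;
    tp++;
    if (*tp != '=') return 0;
    tp++;
    return expression() && *tp == '\0';
}
```

For example, `newval := oldval + 12` encodes as `"i=i+n"` and is accepted, while a malformed `newval := + 12` (`"i=+n"`) is rejected.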



Syntax Analyzer versus Lexical Analyzer
• Which constructs of a program should be recognized by the lexical
analyzer, and which ones by the syntax analyzer?
– Both of them do similar things, but the lexical analyzer deals with the simple non-
recursive constructs of the language.
– The syntax analyzer deals with the recursive constructs of the language.
– The lexical analyzer simplifies the job of the syntax analyzer.
– The lexical analyzer recognizes the smallest meaningful units (tokens) in a source
program.
– The syntax analyzer works on these smallest meaningful units (tokens)
to recognize meaningful structures in the programming language.
Parsing Techniques
• Depending on how the parse tree is created, there are different parsing
techniques.
• These parsing techniques are categorized into two groups:
– Top-Down Parsing,
– Bottom-Up Parsing
• Top-Down Parsing:
– Construction of the parse tree starts at the root, and proceeds towards the leaves.
– Efficient top-down parsers can be easily constructed by hand.
– Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
– Construction of the parse tree starts at the leaves, and proceeds towards the root.
– Normally efficient bottom-up parsers are created with the help of some software
tools.
– Bottom-up parsing is also known as shift-reduce parsing.
– Operator-Precedence Parsing – simple, restrictive, easy to implement
– LR Parsing – a more general form of shift-reduce parsing: LR, SLR, LALR
3. Semantic Analyzer
• A semantic analyzer checks the source program for semantic errors and
collects the type information for the code generation.
• Type-checking is an important part of the semantic analyzer.
• Scope checking is another.
• Context-free grammars used in the syntax analysis are integrated with
attributes (semantic rules):
– the result is a syntax-directed translation,
– Attribute grammars
• Ex:
newval := oldval + 12
The type of the identifier newval must match the type of the expression
(oldval + 12).
4. Intermediate Code Generation
• A compiler may produce explicit intermediate code representing
the source program.
• This intermediate code is generally machine (architecture)
independent, but its level is closer to the level of machine code.
• Ex:
newval := oldval * fact + 1

id1 := id2 * id3 + 1

temp1 := id2 * id3      Intermediate Code (three address code)
temp2 := temp1 + 1
id1 := temp2


Intermediate Code Generation
• Properties of an IR:
– Easy to produce
– Easy to translate into the target program
• An IR can be in the following forms:
– Syntax trees
– Postfix notation
– Three address statements
• Properties of three address statements:
– At most one operator in addition to an assignment operator
– Temporary names must be generated to hold the value computed at each
instruction
– Some instructions have fewer than three operands
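A minimal emitter for such three-address statements might look like the following sketch. The `emit` helper, the `tK` temporary-naming scheme, and the hard-coded translation of `id1 := id2 * id3 + 1` are all illustrative assumptions, not a real code generator.

```c
#include <stdio.h>
#include <string.h>

static char code[256];   /* accumulated three-address code */
static int  ntemp = 0;   /* counter for generated temporary names */

/* Emit "tK := a op b" and return the index K of the new temporary. */
static int emit(const char *a, char op, const char *b) {
    char line[64];
    sprintf(line, "t%d := %s %c %s\n", ++ntemp, a, op, b);
    strcat(code, line);
    return ntemp;
}

/* Translate  id1 := id2 * id3 + 1  into three-address code. */
const char *translate(void) {
    code[0] = '\0'; ntemp = 0;
    int t1 = emit("id2", '*', "id3");        /* temp1 := id2 * id3 */
    char t1name[8];
    sprintf(t1name, "t%d", t1);
    int t2 = emit(t1name, '+', "1");         /* temp2 := temp1 + 1 */
    char last[32];
    sprintf(last, "id1 := t%d\n", t2);       /* id1 := temp2 */
    strcat(code, last);
    return code;
}
```

Note how each emitted instruction respects the properties above: at most one operator besides `:=`, and a fresh temporary holds each intermediate value.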
5. Code Optimizer (for Intermediate Code Generator)
• The code optimizer optimizes the code produced by the intermediate
code generator in terms of time and space.

• Ex (transformations):
temp1 := id2 * id3
id1 := temp1 * 1    =>   id1 := temp1
a = a*2             =>   a = a+a

Loop-invariant code motion (a = b + c does not change inside the loop):
for(i=1; i<100; i++)          a = b + c;
{ a = b + c;            =>    for(i=1; i<100; i++)
  z++;                        { z++;
  x = y + z;                    x = y + z;
}                             }

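The first two transformations (x * 1 becomes a copy, x * 2 becomes x + x) can be sketched as a peephole pass over a toy instruction representation. `Instr` and `peephole` are hypothetical names for this sketch; a real optimizer works on full IR instructions.

```c
/* One IR instruction of the form  dst := src OP operand
   (only the parts the peephole rules inspect are modeled). */
typedef struct {
    char op;        /* '*', '+', or '=' for a plain copy */
    int  operand;   /* constant right operand */
} Instr;

/* Two classic local transformations:
   x * 1  =>  x        (algebraic identity)
   x * 2  =>  x + x    (strength reduction) */
void peephole(Instr *i) {
    if (i->op == '*' && i->operand == 1) {
        i->op = '=';            /* multiplication by 1 becomes a copy */
    } else if (i->op == '*' && i->operand == 2) {
        i->op = '+';            /* replace the multiply by adding src to itself */
        i->operand = 0;         /* operand now unused: dst := src + src */
    }
}
```

Both rewrites preserve the value computed while replacing a multiply with a cheaper copy or add, which is exactly the time/space trade the slide describes.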


6. Code Generator
• Produces the target language for a specific architecture.
• The target program is normally a relocatable object file containing the
machine code or assembly code.
• Intermediate instructions are each translated into a sequence of machine
instructions that perform the same task.
• Ex:
( assume an architecture in which at least one operand of each instruction is
a machine register )

MOVE id2,R1
MULT id3,R1
ADD  #1,R1
MOVE R1,id1
The Structure of a Compiler

Scanner [Lexical Analyzer]
    ↓  Tokens
Parser [Syntax Analyzer]
    ↓  Parse tree
Semantic Process [Semantic Analyzer]
    ↓  Abstract Syntax Tree w/ Attributes
Intermediate Code Generator
    ↓  Non-optimized Intermediate Code
Code Optimizer
    ↓  Optimized Intermediate Code
Code Generator
    ↓  Target machine code:
         MOV   R1, id3
         MOV   R2, 60.0
         MUL   R1, R2
         ADD   R1, id2
         MOVEM R1, id1
The input program as you see it.

main ()
{
int i,sum;
float f;
sum = 0;
for (i=1; i<=10; i++);
sum = sum + i;
printf("%d\n",sum);
}
• S -> D SL
• D -> Type idlist ;
• SL -> A | IF | While
• Type -> int | float | …
• …
• OR
• S -> D SL
• D -> Type idlist ;
• SL -> S | A | IF | While
• Type -> int | float

Animal : token
Dog, cat, rat : lexemes

Role of the Lexical Analyzer

whilea=10       LA: lexeme : whilea (one identifier, not the keyword while)

Lookahead
whileabc=100    LA : whileabc
                LA : =


Tasks of Lexical Analyzer
• Reads the source text and detects the tokens
• Strips out comments, white space, tab and newline characters
• Correlates error messages from the compiler to the source program

Approaches to implementation
• Use assembly language – most efficient, but most difficult to implement
• Use a high level language like C – efficient, but difficult to implement
• Use tools like lex, flex – easy to implement, but not as efficient as the first
two approaches
Lexical Analyzer in Perspective
LEXICAL ANALYZER
– Scan Input
– Remove WS, NL, …
– Identify Tokens
– Create Symbol Table
– Insert Tokens into ST
– Generate Errors
– Send Tokens to Parser

PARSER
– Perform Syntax Analysis
– Actions Dictated by Token Order
– Update Symbol Table Entries
– Create Abstract Rep. of Source
– Generate Errors
– And More…. (We'll see later)


What Factors Have Influenced the Functional Division of Labor?
• Separation of Lexical Analysis From Parsing Presents a Simpler
Conceptual Model
– A parser embodying the conventions for comments and white space is significantly
more complex than one that can assume comments and white space have already
been removed by the lexical analyzer.

• Separation Increases Compiler Efficiency
– Specialized buffering techniques for reading input characters and processing
tokens…

• Separation Promotes Portability
– Input alphabet peculiarities and other device-specific anomalies can be restricted to
the lexical analyzer.
Handling Lexical Errors
In what situations do errors occur?
– The lexical analyzer is unable to proceed because none of the patterns for
tokens matches a prefix of the remaining input.
For example:
fi ( a == f(x) )….

Recovery strategies:
1) Panic mode recovery – deleting successive characters until a well-
formed token is found.
2) Inserting a missing character
3) Replacing an incorrect character by a correct character
4) Transposing two adjacent characters
5) Deleting an extraneous character
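Panic-mode recovery (strategy 1) can be sketched as skipping characters until one that could begin a token. The function name and the set of token-starting characters are assumptions about a toy language, chosen only for illustration.

```c
#include <ctype.h>
#include <string.h>

/* Panic-mode recovery sketch: starting at pos, delete (skip) characters
   until one that can begin a token of this toy language
   (a letter, a digit, or one of the listed operator characters). */
int skip_to_token_start(const char *s, int pos) {
    while (s[pos] != '\0'
           && !isalnum((unsigned char)s[pos])
           && strchr("+-*/=<>()", s[pos]) == NULL)
        pos++;
    return pos;    /* index of the first plausible token start (or of '\0') */
}
```

For input like `@#x = 1` the scanner would discard `@#` and resume at `x`, which is the "deleting successive characters" behavior described above.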



Direct Simulation of an NFA

DFA simulation:
s := s0
c := nextchar;
while c ≠ eof do
    s := move(s, c);
    c := nextchar;
end;
if s is in F then return "yes"
else return "no"

NFA simulation:
S := ε-closure({s0})
c := nextchar;
while c ≠ eof do
    S := ε-closure(move(S, c));
    c := nextchar;
end;
if S ∩ F ≠ ∅ then return "yes"
else return "no"
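The DFA simulation loop above can be written out in C for a single concrete pattern. This sketch hand-constructs the DFA for a*b+ (state 0: seen only a's; state 1: inside the run of b's, accepting; -1: dead state); the state numbering and `move`/`matches` names are assumptions for illustration.

```c
/* Hand-built DFA transition function for the regular expression a*b+ . */
int move(int s, char c) {
    if (s == 0) return c == 'a' ? 0 : (c == 'b' ? 1 : -1);
    if (s == 1) return c == 'b' ? 1 : -1;
    return -1;                       /* dead state absorbs everything */
}

/* The while-loop from the slide: run the DFA over w, then test s in F. */
int matches(const char *w) {
    int s = 0;                       /* s := s0 */
    for (; *w != '\0' && s != -1; w++)
        s = move(s, *w);             /* s := move(s, c) */
    return s == 1;                   /* F = {1} */
}
```

So "aab" and "bbb" are accepted, while "aa" (no b) and "aba" (an a after the b-run) are rejected.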
Pattern Matching Based on NFA (1)
3 patterns:
P1 : a    {action}
P2 : abb  {action}
P3 : a*b+ {action}

NFA's :
P1:  1 --a--> (2)
P2:  3 --a--> 4 --b--> 5 --b--> (6)
P3:  7 --a--> 7,  7 --b--> (8),  (8) --b--> (8)


Pattern Matching Based on NFA continued (2)
Combined NFA : a new start state 0 with ε-transitions to states 1, 3 and 7.

Examples:
Input a a b a :
{0,1,3,7} → {2,4,7} → {7} → {8} → death
pattern matched:    -     P1    -    P3    -

Input a b b :
{0,1,3,7} → {2,4,7} → {5,8} → {6,8}
pattern matched:    -     P1     P3     P2,P3 ⇒ break tie in favor of P2
DFA for Lexical Analyzers
Alternatively, construct a DFA:
keep track of the correspondence between patterns and the new accepting states

                Input Symbol
STATE        a          b          Pattern
{0,1,3,7}    {2,4,7}    {8}        none
{2,4,7}      {7}        {5,8}      P1
{8}          -          {8}        P3
{7}          {7}        {8}        none
{5,8}        -          {6,8}      P3
{6,8}        -          {8}        P2   (break tie in favor of P2)
Example
Input: aaba
{0,1,3,7} → {2,4,7} → {7} → {8} → death; last pattern matched: P3
Input: aba
{0,1,3,7} → {2,4,7} → {5,8} ⇒ P3
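The transition table above can be driven directly in code. This sketch assumes the input alphabet is just a and b, encodes the six table states as indices 0 to 5, and reports the pattern of the last accepting state reached (a simplification of full longest-match bookkeeping, which would also remember the position of that match).

```c
/* States 0..5 stand for {0,1,3,7},{2,4,7},{8},{7},{5,8},{6,8}; -1 is dead. */
static const int delta[6][2] = {      /* columns: 'a', 'b' */
    { 1,  2},   /* {0,1,3,7} */
    { 3,  4},   /* {2,4,7}   */
    {-1,  2},   /* {8}       */
    { 3,  2},   /* {7}       */
    {-1,  5},   /* {5,8}     */
    {-1,  2},   /* {6,8}     */
};
/* Pattern announced in each state: 0 = none; tie {6,8} already resolved to P2. */
static const int pattern[6] = {0, 1, 3, 0, 3, 2};

/* Run the DFA over w; return the number k of the last pattern Pk matched (0 if none). */
int last_match(const char *w) {
    int s = 0, best = 0;
    for (; *w != '\0' && s != -1; w++) {
        s = delta[s][*w == 'a' ? 0 : 1];
        if (s != -1 && pattern[s] != 0) best = pattern[s];
    }
    return best;
}
```

This reproduces the traces on the slide: `aaba` dies after {8} with P3 as the last match, `aba` ends in {5,8} matching P3, and `abb` ends in {6,8} where the tie is broken in favor of P2.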
Example – TD Based Lexical Analyzers

<= :  start --'<'--> 6 --'='--> 7   RTN(LE)
                     |
                   other
                     ↓
                    8*  RTN(LT)

We've accepted "<" and have read an other char that must be
unread ( * marks a state that retracts the lookahead character ).
DRAW a TD for { <=, <>, =, >=, > }.
Example : All RELOPs

start state 0:
  on '<' go to 1:
      on '='    → 2   return(relop, LE)
      on '>'    → 3   return(relop, NE)
      on other  → 4*  return(relop, LT)
  on '=' go to 5:     return(relop, EQ)
  on '>' go to 6:
      on '='    → 7   return(relop, GE)
      on other  → 8*  return(relop, GT)

( * = retract the lookahead character )
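The whole RELOP transition diagram collapses to a few lookahead tests in code. A minimal sketch: the `Relop` enum, the function name, and the `len` out-parameter (playing the role of retract by simply not consuming the lookahead character) are illustrative assumptions.

```c
typedef enum { LT, LE, EQ, NE, GT, GE, NONE } Relop;

/* Scan a relational operator at the start of s.
   *len receives the number of characters consumed (0 if no relop found);
   not consuming the lookahead character plays the role of retract(). */
Relop scan_relop(const char *s, int *len) {
    *len = 1;
    switch (s[0]) {
    case '<':                                    /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }  /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }  /* state 3 */
        return LT;                                 /* state 4*, retract */
    case '=':
        return EQ;                                 /* state 5 */
    case '>':                                    /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }  /* state 7 */
        return GT;                                 /* state 8*, retract */
    default:
        *len = 0; return NONE;
    }
}
```

For example, on input `<a` the scanner consumes only `<` and returns LT, exactly the "accept, then retract the extra character" behavior of states 4 and 8.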



IDENTIFIER

              letter or digit (loop on 10)
9 --letter--> 10 --other--> 11*



INTEGER

              digit (loop on 26)
25 --digit--> 26 --other--> 27*



Implementing Transition Diagrams

int state = 0, start = 0;
lexeme_beginning = forward;
token nexttoken()
{ while(1) {                  /* repeat until a "return" or failure occurs */
    switch (state) {
    case 0: c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
    case 1: .....
    ...
    case 8: .......
    }
  }
}


Implementing Transition Diagrams

int state = 0, start = 0;
lexeme_beginning = forward;
token nexttoken()
{ while(1) {                  /* repeat until a "return" or failure occurs */
    switch (state) {
    case 0: c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
    case 1: .....
    ...
    case 8: retract();
            retToken.attribute = GT;
            return(RELOP);
    }
  }
}
Implementing Transition Diagrams

A sequence of transition diagrams can be converted into a
program to look for the tokens specified by the grammar.

Each state gets a segment of code.

FUNCTIONS USED
• nextchar()
• retract()
• install_num()
• install_id()
• gettoken()
• isdigit()
• isletter()
• recover()


Implementing Transition Diagrams, I

Input : ab12

int state = 0, start = 0;
lexeme_beginning = forward;
token nexttoken()
{ while(1) {                  /* repeat until a "return" occurs */
    switch (state) {
    case 0: c = nextchar();   /* c is the lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;
                /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
    … /* cases 1-8 here */
Implementing Transition Diagrams, II

.............
case 9:  c = nextchar();
         if (isletter(c)) state = 10;
         else state = fail();
         break;
case 10: c = nextchar();
         if (isletter(c)) state = 10;
         else if (isdigit(c)) state = 10;
         else state = 11;
         break;
case 11: retract(1); lexical_value = install_id();
         return ( gettoken(lexical_value) );
.............

gettoken() reads the token name from the symbol table (ST).
Implementing Transition Diagrams, III

.............
case 25: c = nextchar();      /* nextchar() advances forward */
         if (isdigit(c)) state = 26;
         else state = fail();
         break;
case 26: c = nextchar();
         if (isdigit(c)) state = 26;
         else state = 27;
         break;
case 27: retract(1); lexical_value = install_num();
         return ( NUM );
.............

Case numbers correspond to transition diagram states.
install_num() looks at the region lexeme_beginning ... forward;
retract(1) retracts forward.
When Failures Occur:

int fail()
{
    forward = lexeme_beginning;
    switch (start) {          /* switch to the next transition diagram */
        case 0:  start = 9;  break;
        case 9:  start = 12; break;
        case 12: start = 20; break;
        case 20: start = 25; break;
        case 25: recover(); break;
        default: /* lex error */
    }
    return start;
}



Buffer Pairs
• The lexical analyzer needs to look ahead several characters beyond the
lexeme for a pattern before a match can be announced.
• A function such as ungetc can push lookahead characters back into the
input stream, but a large amount of time can be consumed moving
characters one at a time.

Special Buffering Technique
✓ Use a buffer divided into two N-character halves
✓ N = number of characters in one disk block
✓ One system read command fills N characters
✓ Fewer than N characters read => eof
Lookahead example : Fortran

DO10I=1.25    // assignment:      DO10I = 1.25
DO10I=1,25    // loop statement:  DO 10 I = 1 , 25

Because Fortran ignores blanks, the lexical analyzer cannot tell whether
DO is a keyword or part of the identifier DO10I until it sees the ',' or '.'.


Buffer Pairs (2)
✓ Two pointers to the input buffer are maintained.
✓ The string of characters between the pointers is the current lexeme.
✓ Once the next lexeme is determined, the forward pointer is set to the character at
its right end.

E = M * C * * 2 eof
^               ^
lexeme_beginning   forward (scans ahead to find a pattern match)

Comments and white space can be treated as patterns that
yield no token.
Code to advance forward pointer

if forward at end of first half then begin
    reload second half ;
    forward := forward + 1 ;
end
else if forward at end of second half then begin
    reload first half ;
    move forward to beginning of first half
end
else forward := forward + 1 ;


Algorithm: Buffered I/O with Sentinels

E = M * eof C * * 2 eof         eof
^           ^
lexeme_beginning   forward (scans ahead to find a pattern match)

forward := forward + 1 ;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half ;              /* block I/O */
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half ;               /* block I/O */
        move forward to beginning of first half
    end
    else  /* eof within a buffer signifies end of input */
        terminate lexical analysis
end

The algorithm performs I/O only at the block level; we can still have
getchar & ungetchar, but now they work on real memory buffers.
A second eof ⇒ no more input!
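The two-halves-with-sentinels scheme above can be sketched in C. Two stated assumptions keep the sketch self-contained: `N` is tiny (a real scanner uses a disk-block size), and a string stands in for the input file; the names `init_buffer`, `reload` and `advance` are illustrative.

```c
enum { N = 4 };                 /* half size; a real scanner uses a disk-block size */
static char buf[2 * N + 2];     /* two N-char halves, each followed by a '\0' sentinel */
static int  fwd;                /* the forward pointer, as an index into buf */
static const char *src;         /* stands in for the input file */

/* Fill one half from the input and terminate it with the sentinel. */
static void reload(int half) {
    char *dst = buf + half * (N + 1);
    int n = 0;
    while (n < N && *src != '\0') dst[n++] = *src++;
    dst[n] = '\0';
}

void init_buffer(const char *text) { src = text; reload(0); fwd = 0; }

/* Advance forward and return the next character; '\0' means real end of input.
   Only one sentinel test per character, as in the algorithm above. */
char advance(void) {
    char c = buf[fwd++];
    if (c == '\0') {
        if (fwd - 1 == N) {                 /* sentinel at end of first half */
            reload(1); fwd = N + 1; c = buf[fwd++];
        } else if (fwd - 1 == 2 * N + 1) {  /* sentinel at end of second half */
            reload(0); fwd = 0; c = buf[fwd++];
        }
        /* otherwise: eof within a half -> end of input, '\0' is returned */
    }
    return c;
}
```

The key property is visible in `advance`: the common path is a single array read and one comparison against the sentinel, and the two reload branches run only once per N characters.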
Thanks

