
Compiler Design

5th SEM CSE


Language Processors
• A translator takes a source program as input and converts it into an object (target) program.
• Source program is written in a source language
• Object program belongs to an object language

• A translator can be an assembler, a compiler, or an interpreter.

Assembler: translates a source program (in assembly language) into an object program (in machine language).

Overview of Compilers

- Compiler: translates a source program written in a High-Level Language (HLL) such as Pascal or C++ into the computer's machine language (Low-Level Language (LLL)).
* The time of conversion from source program into object program is called compile time.
* The object program is executed at run time.

- Interpreter: processes an internal form of the source program and the data at the same time (at run time); no object program is generated.

A Language Processing System
The Structure of a Compiler

• A compiler can be treated as a single box that maps a source program into a semantically equivalent target program.
• The analysis part (Front End) breaks up the source program into constituent pieces and imposes a grammatical structure on them.
• The synthesis part (Back End) constructs the desired target program from the intermediate representation and the information in the symbol table.
Lexical Analysis (scanner): The first phase of a
compiler

• The lexical analyzer reads the stream of characters making up the source program and groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces a token of the form
(token-name, attribute-value)
that it passes on to the subsequent phase, syntax analysis.
• Token-name: an abstract symbol used during syntax analysis.
• Attribute-value: points to an entry in the symbol table for this token.

Example: position = initial + rate * 60
Output: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>

1.”position” is a lexeme mapped into a token <id, 1>, where id is an abstract symbol standing
for identifier and 1 points to the symbol table entry for position. The symbol-table entry for
an identifier holds information about the identifier, such as its name and type .

2. = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component. For notational convenience, the lexeme itself is used as the name of the abstract symbol.

3. “initial” is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-
table entry for initial.

4. + is a lexeme that is mapped into the token <+>.

5. “rate” is a lexeme mapped into the token <id, 3>, where 3 points to the symbol-table entry
for rate.

6. * is a lexeme that is mapped into the token <*> .

7. 60 is a lexeme that is mapped into the token <60>

Blanks separating the lexemes would be discarded by the lexical analyzer .
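To make the (token-name, attribute-value) pairs concrete, here is a minimal sketch in C of how the token stream for this statement might be represented; the enum values, struct fields, and symbol-table indices are illustrative assumptions, not part of any particular compiler.

/* Illustrative token representation (assumed names). */
enum TokenName { ID, ASSIGN, PLUS, TIMES, NUMBER };

struct Token {
    enum TokenName name;   /* abstract symbol passed to the parser                */
    int attribute;         /* symbol-table index for ID, literal value for NUMBER */
};

/* Token stream for: position = initial + rate * 60
   (1, 2, 3 are the symbol-table entries for position, initial, rate) */
struct Token stream[] = {
    { ID, 1 }, { ASSIGN, 0 }, { ID, 2 }, { PLUS, 0 },
    { ID, 3 }, { TIMES, 0 }, { NUMBER, 60 }
};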

Syntax Analysis (parser) : The second phase of the
compiler
• The parser uses the first components of the tokens produced by the lexical
analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the arguments
of the operation

Semantic Analysis: Third phase of the compiler
• The semantic analyzer uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency with
the language definition.
• It gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.

• An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.
• The language specification may permit some type conversions called
coercions. For example, a binary arithmetic operator may be applied to
either a pair of integers or to a pair of floating-point numbers. If the
operator is applied to a floating-point number and an integer, the
compiler may convert or coerce the integer into a floating-point number.

Intermediate Code Generation: three-address
code
After syntax and semantic analysis of the source program, many compilers
generate an explicit low-level or machine-like intermediate representation (a
program for an abstract machine). This intermediate representation should
have two important properties:
• It should be easy to produce and
• It should be easy to translate into the target machine.
The intermediate form considered here is called three-address code, which consists of a sequence of assembly-like instructions with three operands per instruction. Each operand can act like a register.
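As an illustration, continuing the running example position = initial + rate * 60, the generated three-address code might look like the sketch below; the temporary names t1-t3 and the inttofloat operation are assumptions about the intermediate language.

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3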

Code Optimization: To generate better target code
• The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code will result.
• Usually better means:
• faster, shorter code, or target code that consumes less power.
• The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the int-to-float operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once, so its assignment can be folded into the assignment to id1.

• There are simple optimizations that significantly improve the running time of
the target program without slowing down compilation too much .
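Under those assumptions about the intermediate code, the optimizer could shorten it to something like:

t1 = id3 * 60.0
id1 = id2 + t1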

Code Generation: takes as input an intermediate representation
of the source program and maps it into the target language
• If the target language is machine code, registers or memory
locations are selected for each of the variables used by the
program.
• Then, the intermediate instructions are translated into sequences of
machine instructions that perform the same task.
• A crucial aspect of code generation is the judicious assignment of registers to hold variables.
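For example, using hypothetical floating-point registers R1 and R2 and illustrative load/multiply/add/store mnemonics (not a real instruction set), the optimized intermediate code might be translated into:

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1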

Symbol-Table Management:
• The symbol table is a data structure containing a
record for each variable name, with fields for the
attributes of the name.
• The data structure should be designed to allow the
compiler to find the record for each name quickly and
to store or retrieve data from that record quickly

• These attributes may provide information about the storage allocated for a name, its type, its scope and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned.

The Grouping of Phases into Passes
• Deals with the logical organization of a compiler
• For example, the front-end phases of lexical analysis,
syntax analysis, semantic analysis, and intermediate
code generation can be grouped together into one pass.
• Code optimization might be an optional pass.
• Then there could be a back-end pass consisting of code
generation for a particular target machine.
Compiler Construction Tools
• Parser generators that automatically produce syntax
analyzers from a grammatical description of a
programming language.
• Scanner generators that produce lexical analyzers from
a regular-expression description of the tokens of a
language.
• Syntax-directed translation engines that produce
collections of routines for walking a parse tree and
generating intermediate code.
Compiler Construction Tools
• Code generators that translate each operation of the intermediate language into the machine language for a target machine.
• Data flow analysis engines that facilitate the gathering
of information about how values are transmitted from
one part of a program to each other part. Data-flow
analysis is a key part of code optimization.
• Compiler Construction Toolkits that provide an
integrated set of routines for constructing various
phases of a compiler.
The Evolution of Programming Languages
The Move to Higher-level Languages
• First-generation - machine languages.
• Second-generation - assembly languages.
• Third-generation - higher-level languages like Fortran,
Cobol, Lisp, C, C++, C#, and Java.
• Fourth-generation languages are languages designed for
specific applications like SQL for database queries and
Postscript for text formatting.
• Fifth-generation language has been applied to logic and
constraint-based languages like Prolog
The Evolution of Programming
Languages
• Imperative languages specify how a computation is to be done.
• Languages such as C, C++, C#, and Java are imperative languages.
• In imperative languages there is a notion of program state and
statements that change the state.
• Functional languages such as ML and Haskell and constraint logic
languages such as Prolog are often considered to be declarative
languages (what computation is to be done).
The Evolution of Programming
Languages
• Scripting languages are interpreted languages with high-level
operators designed for "gluing together" computations.
• Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are popular examples of scripting languages.
Impact on Compilers
• Develop new algorithms and translations to support new programming language features.
• New translation algorithms must take advantage of the new hardware
capabilities.
• An input program may contain millions of lines of code.
• A compiler must translate correctly the potentially infinite set of
programs that could be written in the source language.
• The problem of generating the optimal target code from a source
program is undecidable.
The Science of Building a Compiler
Modeling in Compiler Design and Implementation
• The study of compilers is about how we design the
right mathematical models and choose the right
algorithms.
• Some of the fundamental models are finite-state machines, regular expressions, and context-free languages.
The Science of Code Optimization
• The term "optimization" in compiler design refers to
the attempts that a compiler makes to produce code
that is more efficient than the obvious code.
• Compiler optimizations must meet the following
design objectives:
• Optimization must be correct, that is, it must preserve the meaning of the compiled program.
• Optimization must improve the performance of programs.
• The compilation time must be kept reasonable, and
• the engineering effort required must be manageable.
The Science of Code Optimization
• Optimizations speed up execution time and also conserve power.
• Compilation time should be short to support a
rapid development and debugging cycle.
• A compiler is a complex system; we must keep the system simple to assure that the engineering and maintenance costs of the compiler are manageable.
Applications of Compiler Technology
Implementation of High-Level Programming Languages
• Higher-level programming languages are easier to program in, but are less
efficient, that is, the target programs generated run more slowly.
• Programmers using a low-level language have more control over a
computation and can produce more efficient code.
• The register keyword in the C programming language is an early example of
the interaction between compiler technology and language evolution.
Applications of Compiler Technology

• Different programming languages support different levels of abstraction.
• Data-flow optimizations have been developed to analyze the flow of data through the program and remove redundancies across these constructs.
• The key ideas behind object orientation are
• Data abstraction.
• Inheritance of properties
• Both of which have been found to make programs
more modular and easier to maintain.
Applications of Compiler Technology

• Compiler optimizations have been developed to reduce this overhead, e.g., by eliminating unnecessary range checks and unreachable objects.
• Effective algorithms have been developed to
minimize the overhead of garbage collection.
Optimizations for Computer Architectures
• All high-performance systems take advantage of
the same two basic techniques: parallelism and
memory hierarchies.
• Instruction-level parallelism can also appear
explicitly in the instruction set.
• VLIW (Very Long Instruction Word) machines
have instructions that can issue multiple
operations in parallel. The Intel IA64 is a well
known example of such an architecture.
Optimizations for Computer
Architectures
 Programmers can write multithreaded code for
multiprocessors or parallel code can be
automatically generated by a compiler.
 The compiler hides from the programmer the details of finding parallelism in a program.
 It distributes the computation across the machine and minimizes synchronization and communication among the processors.
 Compilers focus on optimizing the processor
execution by making the memory hierarchy more
effective.
Design of New Computer Architectures
• In the early days of computer architecture
design, compilers were developed after the
machines were built.
• That has changed.
• Since programs are almost always written in high-level languages, the performance of a computer system is also determined by how well compilers can exploit its features.
Design of New Computer Architectures

• Compilers influenced the design of the computer architecture called RISC (Reduced Instruction Set Computer).
• Prior to RISC, the CISC (Complex Instruction Set Computer) architecture, which makes assembly programming easier, was used.
• Compiler optimizations often can reduce these instructions to a small
number of simpler operations by eliminating the redundancies across
complex instructions.
• RISC – Power PC, SPARC.
• CISC – x86.
Specialized Architectures
• They include data flow machines
• VLIW (Very Long Instruction Word) machines.
• SIMD (Single Instruction, Multiple Data) arrays of
processors
• Multiprocessors with shared memory.
• Multiprocessors with distributed memory.
• Compiler technology is needed not only to support
programming for these architectures, but also to
evaluate proposed architectural designs.
Program Translations
• Binary Translation - Compiler technology can be used
to translate the binary code for one machine to that of
another.
• Binary translation can also be used to provide
backward compatibility.
• Hardware Synthesis - hardware designs are mostly
described in high-level hardware description languages
like Verilog and VHDL.
• Hardware Synthesis tools translate RTL descriptions
automatically into gates, which are then mapped to
transistors and eventually to a physical layout.
Program Translations
• Database Query Interpreters – Database queries consist of predicates
containing relational and boolean operators. They can be interpreted or
compiled into commands to search a database for records satisfying that
predicate.
• Compiled Simulation - Simulations can be very expensive. Instead of writing a simulator that interprets the design, it is faster to compile the design to produce machine code that simulates that particular design natively.
• Compiled simulation can run orders of magnitude faster than an
interpreter-based approach.
Software Productivity Tools
• Type Checking – Type checking is an effective and well-established
technique to catch inconsistencies in programs. It can be used to catch
errors.
• Bounds Checking - It is easier to make mistakes when programming in a
lower-level language than a higher level one.
• Memory Management Tools - Garbage collection is another excellent
example. Purify is a widely used tool that dynamically catches memory
management errors as they occur.
Lexical Analysis
• As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program and group them into lexemes.
• It produces as output a sequence of tokens, one for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
The Role of the Lexical Analyzer

• Stripping out comments and whitespace
• Correlating error messages with the source program
• Keeping track of line numbers
• Expansion of macros
Lexical Analysis

Sometimes, lexical analyzers are divided into a cascade of two processes:
◦ Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
◦ Lexical analysis proper is the more complex portion, which produces the sequence of tokens from the output of the scanner.
Lexical Analysis Versus Parsing
The separation of lexical and syntactic analysis often
allows us to simplify at least one of these tasks.
• Simplicity of design is the most important
consideration.
• Compiler efficiency is improved.
- A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not
the job of parsing.
- In addition, specialized buffering techniques for reading
input characters can speed up the compiler significantly.
• Compiler portability is enhanced.
Tokens, Patterns and Lexemes

• A token is a pair consisting of a token name and an optional attribute value.
• The token name is an abstract symbol representing a kind of lexical unit,
• e.g., a particular keyword, or a sequence of input characters denoting an identifier.
Tokens, Patterns and Lexemes

• A pattern is a description of the form that the lexemes of a token may take.
• In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
Tokens, Patterns and Lexemes

• A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
Examples of tokens

printf ("Total = %d", score ) ;


token name & attribute values for the fortran
statement E = M * C ** 2

• The token names and associated attribute values for the Fortran
statement.
• <id, pointer to symbol-table entry for E>
• <assign_op>
• <id, pointer to symbol-table entry for M>
• <mult_op>
• <id, pointer to symbol-table entry for C>
• <exp_op>
• <number, integer value 2>
Lexical Errors

• f i ( a == f (x) ) . . .
• A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
• Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler handle the error.
Lexical Errors
• Suppose the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
• The simplest recovery strategy is "panic mode" recovery. Delete
successive characters from the remaining input, until the lexical
analyzer can find a well-formed token at the beginning of what
input is left.(iffff)
• Other possible error-recovery actions are:
1. Delete one character from the remaining input. (iff)
2. Insert a missing character into the remaining input.(pritf)
3. Replace a character by another character.(ef)
4. Transpose two adjacent characters.
Input Buffering – increases the speed of reading the source program
• Two-buffer scheme that handles large lookaheads safely called Buffer
Pairs.
• Specialized buffering techniques have been developed to reduce the
amount of overhead required to process a single input character.
Buffer Pairs

- Each buffer is of the same size N, typically the size of a disk block, e.g., 4096 bytes.
- Using one system read command, N characters can be read into a buffer.
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
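A minimal sketch of this buffer-pair setup in C is shown below; the buffer size, variable names, and the reload helper are assumptions for illustration.

#define N 4096                  /* buffer size = disk block size (assumed) */

char buffer[2 * N + 2];         /* two halves, each with room for a sentinel  */
char *lexemeBegin = buffer;     /* marks the beginning of the current lexeme  */
char *forward     = buffer;     /* scans ahead until a pattern match is found */

/* Hypothetical helper: read up to N characters into the given half with one
   system read and place the eof sentinel after the last character read. */
void reload(char *half);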
Sentinels
• Thus, for each character read, we make two tests: one for the
end of the buffer and one to determine what character is read.
• Each buffer can hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the
source program and a natural choice is the character eof.
Sentinels at the end of each buffer
Lookahead code with sentinels
switch ( *forward++ ) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    }
    else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    }
    else /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    break;
/* cases for the other characters */
}
Specification of Tokens
• Regular expressions are an important notation for specifying
lexeme patterns.
• While they cannot express all possible patterns, they are very
effective in specifying those types of patterns that we actually
need for tokens.
• Strings and Languages
Strings and Languages

• An alphabet is any finite set of symbols such as letters, digits, and punctuation.
• Examples:
• The set {0,1} is the binary alphabet.
• ASCII is used in many software systems.
• Unicode
Strings and Languages
• A String over an alphabet is a finite sequence of symbols
drawn from that alphabet
• In language theory, the terms "sentence" and "word" are
often used as synonyms for "string."
• |s| represents the length of a string s,
Ex: banana is a string of length 6
• The empty string, denoted ɛ, is the string of length zero.
• If x and y are strings, then the concatenation of x and y is also a string, denoted xy. For example, if x = usp and y = cd, then xy = uspcd.
• The empty string is the identity under concatenation; that is, for any string s, ɛs = sɛ = s.
Strings and Languages
• A language is any countable set of strings over some
fixed alphabet

• Let L = {A, . . . , Z}; then {"A", "B", "C", . . . , "BE", . . . , "ABZ", . . .} is a language defined over L.
• Abstract languages like ∅, the empty set, or {ɛ}, the set containing only the empty string, are also languages under this definition.
Terms for Parts of Strings
Operations on Languages
Operations on Languages
Example:
Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D be the set of digits {0, 1, . . . , 9}.
L and D are, respectively, the alphabets of uppercase and lowercase letters and of digits.
Other languages can be constructed from L and D, using the operators illustrated above.

1. L U D is the set of letters and digits - strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit.
Ex: A1, a1, B0, etc.
Operations on Languages

3. L4 is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including ɛ, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Regular Expressions
• Regular Expressions are an important notation for describing
all the languages that can be built from the operators applied
to the symbols of some alphabet

• For example, we were able to describe identifiers by giving names to sets of letters and digits and using the language operators union, concatenation, and closure.

• In this notation, if letter_ stands for any letter or the underscore and digit stands for any digit, then we could describe the language of C identifiers by:
letter_ ( letter_ | digit )*
• REs are built recursively out of smaller regular expressions, using the rules described below.
• Each RE r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions.
• Here are the rules that define the regular expressions over an alphabet ∑ and the languages that those expressions denote.
Basis: two rules form the basis:
• ɛ is a regular expression, and L(ɛ) is {ɛ}, that is, the language containing only the empty string.
• If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, the language with only one string, consisting of the single symbol a.

• Induction:
- Larger regular expressions are built from smaller ones.
Let r and s be regular expressions denoting languages L(r) and L(s), respectively.
1. (r) | (s) is a regular expression denoting the language L(r) U
L(s).
2. (r) (s) is a regular expression denoting the language L(r) L(s)
3. (r) * is a regular expression denoting (L (r)) * .
4. (r) is a regular expression denoting L(r).

This last rule says that we can add additional pairs of
parentheses around expressions without changing
the language they denote.
• for example, we may replace the regular expression
(a) | ((b) * (c)) by a| b*c.

Regular Expressions
Example: Let ∑ = {a, b}
• The regular expression a|b denotes the language {a, b}.
• (a|b) (a|b) denotes {aa, ab, ba, bb} the language of all strings of
length two over the alphabet ∑.
• a* denotes the language consisting of all strings of zero or more
a's, that is, {Є, a, aa, aaa, ... }
• (a|b)* denotes the set of all strings consisting of zero or more
instances of a or b, that is, all strings of a's and b's: {Є, a, b, aa,
ab, ba, bb, aaa, ... }.
• a|a*b denotes the language {a, b, ab, aab, aaab, ... }, that is, the string a and all strings consisting of zero or more a's and ending in b.
Regular Set

• A language that can be defined by a regular expression is called a regular set.
• If two REs r and s denote the same regular set, we say they are equivalent and write r = s.
• For instance: (a|b) = (b|a)
Algebraic laws for Regular Expressions
Each law asserts that expressions of two different forms are equivalent.
Regular Definitions (give names to RE)
• If ∑ is an alphabet of basic symbols, then a regular definition is a
sequence of definitions of the form:
d1 → r1
d2 → r2
……
dn → rn
• Each di is a new symbol, not in ∑ and not the same as
any other of the d's.
• Each ri is a regular expression over the alphabet ∑ ∪
{d1 , d2 , . . . ,di-1}
Regular Definitions for C Identifiers
Regular Definition for Unsigned
Numbers (Integer or Floating Point)
• Example strings: 6, 5280, 0.01234, 6.336E4, or 1.89E-4
Extensions of RE (enhance their ability to specify string patterns)
1) One or more instances
The unary operator + denotes the positive closure of an RE and its language.
2) Zero or one instance
The unary operator ?
ex: r? = r | ɛ
3) Character classes
The RE a1|a2|…|an can be replaced by the shorthand [a1a2…an].
Moreover, when a1, a2, …, an form a logical sequence, we can replace them by [a1-an].
Examples for Extensions of RE

Regular definition for C identifiers:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ ( letter_ | digit )*
Examples for Extensions of RE

Regular definition for unsigned numbers:
digit -> [0-9]
digits -> digit+
number -> digits ( . digits )? ( E [+-]? digits )?
RECOGNITION OF TOKENS – how to take the patterns for all the needed tokens and build a piece of code
• Given the grammar of a branching statement:

• The terminals of the grammar, which are if, then,
else, relop, id, and number, are the names of tokens
as used by the lexical analyzer.
• The lexical analyzer also has the job of stripping out whitespace, by recognizing the "token" ws (defined as one or more blank, tab, or newline characters).
• Token ws is not returned to the parser; instead, the lexical analyzer restarts from the character that follows the whitespace.
The patterns for the given tokens are described using regular definitions.
• The goal of the lexical analyzer is, for each lexeme or family of lexemes, to determine which token name is returned to the parser and what attribute value is returned.
• This is summarized in the table below.
Tokens, their patterns, and attribute values

Compiler Construction
Recognition of Tokens
• To recognize tokens there are two steps:
1. Design of transition diagrams
2. Implementation of transition diagrams
Transition Diagrams (convert patterns into flowcharts for the construction of a lexical analyzer)

• A transition diagram is similar to a flowchart for (a part of) the lexer.
• We draw one for each possible token. It shows the decisions
that must be made based on the input seen.
• Transition diagrams have a collection of nodes or circles, called
States.
• Each state represents a condition that could occur during the
process of scanning the input looking for a lexeme that matches
one of several patterns.
• Edges are directed from one state of the transition diagram to another; edges are labeled by input symbols.
Recognition of Tokens: Transition Diagram
Ex: RELOP = < | <= | = | <> | > | >=

Transition diagram for relop (states 0-8; * indicates input retraction):
- State 0 (start): on < go to state 1; on = go to state 5; on > go to state 6.
- State 1: on = go to state 2 and return (relop, LE); on > go to state 3 and return (relop, NE); on any other character go to state 4* and return (relop, LT).
- State 5: return (relop, EQ).
- State 6: on = go to state 7 and return (relop, GE); on any other character go to state 8* and return (relop, GT).
Recognition of Reserved Words and Identifiers

• Ex 2: ID = letter ( letter | digit )*

Transition diagram (states 9-11; * indicates input retraction):
- State 9 (start): on a letter go to state 10.
- State 10: on a letter or digit stay in state 10; on any other character go to state 11*.
- State 11: return ( getToken(), installID() ).
Two ways of handling reserved words that look like identifiers
• Two questions remain.
1. How do we distinguish between identifiers and keywords such as if,
then and else, which also match the pattern in the transition diagram?
2. What is (getToken(), installID())?
Two ways of handling reserved words that look like identifiers
1) Install the reserved words in the symbol table initially.
installID() checks whether the lexeme is already in the symbol table. If it is not present, the lexeme is installed (placed in the symbol table) as an id token. In either case a pointer to the symbol-table entry is returned.
getToken() examines the symbol-table entry for the lexeme found and returns the token name.
2) Create a separate transition diagram for each keyword.
The transition diagram for token number (unsigned numbers)
It has multiple accepting states:
- one accepting an integer, e.g. 12
- one accepting a float with a fraction, e.g. 12.31
- one accepting a float with an exponent, e.g. 12.31E4 or 12.31E+45
Architecture of a Transition-Diagram-Based Lexical Analyzer
• There are several ways that a collection of transition diagrams can be used to build a lexical analyzer.
• Each state is represented by a piece of code.
• An implementation of the relop transition diagram is shown below.
TOKEN getRelop() {
    TOKEN retToken = new(RELOP);
    while (1) {                       /* repeat character processing until a   */
        switch (state) {              /* return or a failure occurs            */
        case 0:
            c = nextchar();
            if (c is white space) state = 0;   /* stay in state 0 */
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail();              /* lexeme is not a relop */
            break;
        case 1: ...
        ...
        case 8:
            retract(1);               /* push back the lookahead character */
            retToken.attribute = GT;
            return (retToken);
        }
    }
}
Different ways of implementing transition diagrams to generate the entire LA
1. Try the transition diagram for each token sequentially; the function fail() resets the forward pointer and starts the next transition diagram each time it is called.
2. Run the various transition diagrams "in parallel," feeding the next input character to all of them and allowing each one to make whatever transitions it requires.
3. The preferred approach is to combine all the transition diagrams into one.

Parameter Passing Mechanisms

• Actual parameter
• Formal Parameter

1. Call-by-Value
2. Call-by-Reference
3. Call-by-Name:
The actual parameter is substituted literally for the formal parameter (as if it were a macro for the actual parameter) in the code of the callee.
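A small C sketch contrasting call-by-value and call-by-reference is given below; the function names are illustrative only, and since C itself passes arguments by value, call-by-reference is simulated with a pointer.

#include <stdio.h>

/* Call-by-value: the callee gets a copy; the caller's variable is unchanged. */
void inc_by_value(int x)      { x = x + 1; }

/* Call-by-reference (simulated with a pointer): the callee updates the
   caller's variable through its address. */
void inc_by_reference(int *x) { *x = *x + 1; }

int main(void) {
    int a = 10;
    inc_by_value(a);        /* a is still 10 */
    inc_by_reference(&a);   /* a is now 11   */
    printf("%d\n", a);
    return 0;
}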
Aliasing

• Two formal parameters can refer to the same location; such variables are called aliases of one another.
Ex: a is an array in procedure p.
p calls another procedure q(x, y) with the call q(a, a).
Now x and y become aliases of each other:
an assignment x[10] = 2 also makes y[10] = 2.
Module-3 Syntax Analysis

• Programming languages have precise rules that give the syntactic structure of programs.
• Ex: in C, a program is made up of functions.
• The syntax can be specified by a context-free grammar (CFG).
• Grammars benefit both language designers and compiler writers.
Grammar benefits

• Gives a precise syntactic specification of a programming language.
• A parser can be constructed that determines the syntactic structure.
• Useful for translating source programs into correct object code and for detecting errors.
• A grammar allows a language to evolve or be developed iteratively.
(Position of the parser: the lexical analyzer supplies tokens to the syntax analyzer via getNextToken; the syntax analyzer builds a parse tree and passes it to the rest of the front end, which produces an intermediate representation; both phases consult the symbol table.)
• Verifies that the tokens can be generated by the grammar
• Reports syntax errors
• Recovers from commonly occurring errors
• Constructs a parse tree and passes it to the rest of the compiler
Three types of parser
• We categorize the parsers into three groups:
1. Universal parsers
Can parse any grammar but are too inefficient to use in production compilers.
2. Top-Down Parsers
The parse tree is created top to bottom, starting from the root.
3. Bottom-Up Parsers
The parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers
scan the input from left to right
(one symbol at a time).
• Efficient top-down and bottom-up parsers can be
implemented only for sub-classes of context-free
grammars.
– LL for top-down parsing
– LR for bottom-up parsing
Syntax Error Handling
• Common Programming errors can occur at many different levels.
1. Lexical errors: include misspelling of identifiers, keywords, or
operators.
2.Syntactic errors : include misplaced semicolons or extra or
missing braces.
3.Semantic errors: include type mismatches between operators and
operands.
4. Logical errors: incorrect reasoning, such as the use of = instead of ==.
Goals of error handler in a Parser

• Report the presence of errors clearly and accurately.
• Recover from each error quickly enough to detect subsequent errors.
• Add minimal overhead to the processing of correct programs.
Error-Recovery Strategies

• Panic-Mode Recovery
• Phrase-Level Recovery
• Error Productions
• Global Correction
Panic-Mode Recovery
• On discovering an error, the parser discards input symbols
one at a time until one of a designated set of Synchronizing
tokens is found.
• Synchronizing tokens are usually delimiters.
• Ex: semicolon or } whose role in the source program is clear
and unambiguous.
• It often skips a considerable amount of input without
checking it for additional errors.
Advantages:
• Simplicity
• Is guaranteed not to go into an infinite loop
Phrase-Level Recovery
• A parser may perform local correction on the remaining input, i.e., it may replace a prefix of the remaining input by some string that allows the parser to continue.
Ex: replace a comma by a semicolon, insert a missing semicolon.
• Local correction is left to the compiler designer.
• It is used in several error-repairing compilers, as it can correct any input string.
Error Productions
• We can augment the grammar for the language at hand with productions that generate the erroneous constructs.
• Then we can use the grammar augmented by these error productions to construct a parser.
• If an error production is used by the parser, we can generate appropriate error diagnostics to indicate the erroneous construct that has been recognized in the input.
Global Correction
• We use algorithms that perform minimal sequence of changes to
obtain a globally least cost correction
• Given an incorrect input string x and grammar G, these algorithms
will find a parse tree for a related string y.
• Such that the number of insertions, deletions and changes of
tokens required to transform x into y is as small as possible.
• It is too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.
Writing a Grammar

• Context-free grammars
• Eliminating ambiguity in the grammar
• Eliminating left recursion
• Left factoring
CFG:

• Consider the following grammar:
E → E + E | E * E | ( E ) | - E | id
• Derivation:
• Rightmost derivation
• Leftmost derivation
• What is ambiguity?
• Derive a parse tree for the string id + id * id
Eliminating left-recursion

E → E + T | T
T → T * F | F
F → ( E ) | id
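Applying the standard left-recursion elimination (A → Aα | β becomes A → βA', A' → αA' | ɛ) to this grammar gives the equivalent right-recursive grammar used in the predictive-parsing examples later in these slides:

E → T E'
E' → + T E' | ɛ
T → F T'
T' → * F T' | ɛ
F → ( E ) | id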
Left Factoring
Example:

S → iEtS | iEtSeS | a
E → b
Sol:
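Since the slide leaves the solution blank, here is the standard left-factored form of this grammar:

S → iEtSS' | a
S' → eS | ɛ
E → b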
Top-Down Parsing

Types:
1. Recursive descent parsing
2. Predictive parsing
Recursive descent parsing

• Top-down parsers are implemented as a set of recursive functions that descend through a parse tree for a string.
• Also known as LL(k) parsing, where the first L stands for left-to-right, the second L stands for leftmost derivation, and k indicates k-symbol lookahead.

Predictive parsing
Algorithm to compute FIRST
Algorithm to compute Follow
(Skeleton of the predictive parsing table: rows for the nonterminals E, E', T, T', F and columns for the terminals id, +, *, (, ), $.)
Non recursive Predictive parsing
Predictive parsing algorithm
Error Recovery in Predictive Parsing
Choice of synchronizing set
Example
Phrase Level Recovery

• It is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines.
• These routines may change, insert, or delete symbols on the input and issue appropriate error messages.
Top-down parsing

A top-down parser builds the parse tree from the top down, starting with the start nonterminal.
A predictive parser is a special case of a recursive descent parser, where no backtracking is required.
By suitably preparing a grammar, i.e., eliminating left recursion and left factoring it, the resulting grammar can be parsed by a recursive descent parser without backtracking.
• Recursive descent parsing is one of the top-down
parsing techniques that uses a set of recursive
procedures to scan its input.
• This parsing method may involve backtracking, that
is, making repeated scans of the input.
• G : S → cAd
• A → ab | a
• and the input string w=cad.
What is backtracking?
S → cAd
A → ab | a, with input w = cad
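A minimal C sketch of a backtracking recursive descent parser for this grammar is given below; the variable names and helper structure are assumptions for illustration. Parsing cad, procedure A first tries A → ab, fails on d, backtracks, and then succeeds with A → a.

#include <stdio.h>

const char *input = "cad";
int pos = 0;                       /* current input position */

int match(char c) {                /* consume c if it is the next input symbol */
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

int A(void) {                      /* A -> ab | a */
    int save = pos;
    if (match('a') && match('b')) return 1;   /* try A -> ab */
    pos = save;                               /* backtrack   */
    if (match('a')) return 1;                 /* try A -> a  */
    pos = save;
    return 0;
}

int S(void) {                      /* S -> cAd */
    int save = pos;
    if (match('c') && A() && match('d')) return 1;
    pos = save;
    return 0;
}

int main(void) {
    if (S() && input[pos] == '\0') printf("accepted\n");
    else                           printf("rejected\n");
    return 0;
}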
First and follow

• Rules for FIRST( ):
• 1. If X is a terminal, then FIRST(X) is {X}.
• 2. If X → ε is a production, then add ε to FIRST(X).
• 3. If X is a non-terminal and X → aα is a production, then add a to FIRST(X).
• 4. If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1 … Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).
• Rules for FOLLOW( ):
• 1. If S is the start symbol, then FOLLOW(S) contains $.
• 2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
• 3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
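As a worked example, assume the transformed expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id that is used in the parse trace later in these slides. The rules above give:

FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E') = { +, ε }
FIRST(T') = { *, ε }
FOLLOW(E) = FOLLOW(E') = { ), $ }
FOLLOW(T) = FOLLOW(T') = { +, ), $ }
FOLLOW(F) = { +, *, ), $ }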
• Algorithm for construction of predictive parsing table:
• Input : Grammar G
• Output : Parsing table M
• Method :
• 1. For each production A → α of the grammar, do steps 2 and
3.
• 2. For each terminal a in FIRST(α), add A → α to M[A, a].
• 3. If ε is in FIRST(α), add A → α to M[A, b] for each
terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in
FOLLOW(A) , add A → α to M[A, $].
• 4. Make each undefined entry of M be error.
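For the same expression grammar, this construction yields the following parsing table (all blank entries are errors); the entries below are the standard ones for that grammar, reproduced here because the slide's table did not survive:

M[E, id] = E → TE'          M[E, (] = E → TE'
M[E', +] = E' → +TE'        M[E', )] = E' → ε        M[E', $] = E' → ε
M[T, id] = T → FT'          M[T, (] = T → FT'
M[T', +] = T' → ε           M[T', *] = T' → *FT'     M[T', )] = T' → ε     M[T', $] = T' → ε
M[F, id] = F → id           M[F, (] = F → ( E )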
Non-recursive predictive parsing:
Algorithm for non-recursive predictive parsing:
• Input: A string w and a parsing table M for grammar G.
• Output: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
• Method: Initially, the parser has $S on the stack, with S, the start symbol of G, on top, and w$ in the input buffer. The program that utilizes the predictive parsing table M to produce a parse for the input is as follows:
set ip to point to the first symbol of w$;
set X to the top-of-stack symbol;
while ( X != $ ) {                      /* stack is not empty */
    let a be the input symbol pointed to by ip;
    if ( X == a ) pop the stack and advance ip;
    else if ( X is a terminal ) error();
    else if ( M[X, a] is an error entry ) error();
    else if ( M[X, a] = X → Y1 Y2 … Yk ) {
        pop X from the stack;
        push Yk, Yk-1, … , Y1 onto the stack, with Y1 on top;
    }
    set X to the top-of-stack symbol;
}
MATCHED      STACK       INPUT         ACTION

             E$          id+id*id$
             TE'$        id+id*id$     E -> TE'
             FT'E'$      id+id*id$     T -> FT'
             idT'E'$     id+id*id$     F -> id
id           T'E'$       +id*id$       Match id
id           E'$         +id*id$       T' -> Є
id           +TE'$       +id*id$       E' -> +TE'
id+          TE'$        id*id$        Match +
id+          FT'E'$      id*id$        T -> FT'
id+          idT'E'$     id*id$        F -> id
id+id        T'E'$       *id$          Match id
id+id        *FT'E'$     *id$          T' -> *FT'
id+id*       FT'E'$      id$           Match *
id+id*       idT'E'$     id$           F -> id
id+id*id     T'E'$       $             Match id
id+id*id     E'$         $             T' -> Є
id+id*id     $           $             E' -> Є
Bottom-Up Parsing

• Reduction
• Bottom-up parsing reduces a string w to the start symbol S.
• At each reduction step, a chosen substring that is the rhs (or body) of a production is replaced by the lhs (or head) nonterminal.
• Handle Pruning
• A handle is a substring that matches the body of a production.
• Reducing the handle is one step in the reverse of the rightmost derivation.
Shift-Reduce Parsing

• A type of bottom-up parsing with two primary actions, shift and reduce.
• Other obvious actions are accept and error.
• The input string (i.e., the string being parsed) consists of two parts:
• the left part is a string of terminals and nonterminals, and is stored in the stack;
• the right part is a string of terminals read from an input buffer.
• The bottom of the stack and the end of the input are represented by $.
Shift-Reduce Actions

• Shift: shift the next input symbol from the right string onto the top of the stack.
• Reduce: identify a string on top of the stack that is the body of a production, and replace the body with the head.
• Accept: the string is accepted by the grammar.
• Reject: the input string cannot be parsed.

Shift-Reduce Conflict
Stack: … if Expr then Stmt        Input: else … $
(The parser cannot decide whether to shift else or to reduce the if-statement already on the stack.)
Reduce-Reduce Conflict
Grammar: M → R + R | R + c,   R → c        Input: c + c

Stack    Input    Action                    Stack    Input    Action
$        c+c$     Shift                     $        c+c$     Shift
$c       +c$      Reduce by R → c           $c       +c$      Reduce by R → c
$R       +c$      Shift                     $R       +c$      Shift
$R+      c$       Shift                     $R+      c$       Shift
$R+c     $        Reduce by M → R + c       $R+c     $        Reduce by R → c
$M       $                                  $R+R     $        Reduce by M → R + R
                                            $M       $

(With $R+c on the stack and the input exhausted, the parser cannot decide whether to reduce by M → R + c or by R → c: a reduce-reduce conflict.)
TOP-DOWN PARSING

Types of top-down parsing:
• 1. Recursive descent parsing
• 2. Predictive parsing
void A()
{
    /* Choose an A-production, A -> X1 X2 ... Xk */
    for ( i = 1 to k ) {
        if ( Xi is a nonterminal )
            call procedure Xi();
        else if ( Xi equals the current input symbol a )
            advance the input pointer to the next symbol;
        else
            error();   /* an error has occurred */
    }
}
