Unit 1

This document provides an overview of compiler design and the various phases involved in compiling a program from source code to executable code. It discusses the structure of a compiler as having two main parts - analysis and synthesis. The analysis part includes lexical analysis, syntax analysis, and semantic analysis. Lexical analysis involves reading the source code and generating tokens. Syntax analysis uses these tokens to build a syntax tree. Semantic analysis performs type checking. The synthesis part generates intermediate code and target code. Lexical analysis is the first phase and involves grouping characters into meaningful tokens. Input buffering and token specification are also discussed.


Compiler Design

R.Rajakumari
Assistant Professor (Sr. Gr.)
Department of Computer Science and Engineering
National Engineering College, Kovilpatti
Overview
• Structure of a Compiler
• Lexical Analysis
• Role of Lexical Analysis
• Input Buffering
• Specification of Tokens
• Recognition of Tokens
• Lex
Introduction
• Programming Languages
• Compilers
• Reads a program in one language (the source language) and translates it into an equivalent program in another language (the target language)
• Reports any error in the source program
• Interpreter
• Directly executes the operations specified in the source program on inputs supplied
by the user
• Line by line execution
• The machine-language target program produced by a compiler is usually much faster than an interpreter
• An interpreter, however, usually gives better error diagnostics, since it executes the source program statement by statement
Contd…
• Assembler
• Translates an assembly-language program into relocatable machine code
• Linker
• Links relocatable object files and library files
• Loader
• Puts all the executable object files into memory for execution
Structure of a Compiler
• Viewed as a single box, a compiler maps a source program into a semantically equivalent target program
• Two parts - Analysis and Synthesis
• Analysis
• Breaks the source program into constituent pieces
• Imposes a grammatical structure on the pieces
• Creates an intermediate representation of the source program
• If the source program is syntactically ill-formed or semantically unsound, gives informative messages so the user can take corrective action
• Collects information about the source program and stores it in a data structure called the symbol table
Contd…
• Synthesis
• Constructs the desired target program from the intermediate representation
and the information in the symbol table
• Analysis part - Front End
• Synthesis part - Back End
Phases of a Compiler
• Lexical Analysis
• Syntax Analysis
• Semantic Analysis
• Intermediate Code Generation
• Machine-Independent Code Optimization
• Code Generation
• Machine-Dependent Code Optimization
Lexical Analysis
• First phase
• Lexical Analysis or Scanning
• Reads the stream of characters and groups the characters into
meaningful sequences called lexemes
• Outputs <token_name, attribute_value>
• token_name is an abstract symbol used during syntax analysis
• attribute_value points to an entry in the symbol table for this token
• The symbol-table entry is needed during semantic analysis and code generation
Example
• position = initial + rate * 60
• position is a lexeme
• Token <id,1>
• Assignment Symbol = is a lexeme
• Token <=>
• initial is a lexeme
• Token < id,2>
• Addition operator + is a lexeme
• Token <+>
Example
• rate is a lexeme
• Token <id,3>
• Multiplication operator * is a lexeme
• Token <*>
• 60 is a lexeme
• Token <60>
• <id,1> <=> <id,2> <+> <id,3> <*> <60>
• Blanks separating the lexemes are discarded by the lexical analyzer
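
A minimal C sketch of this representation (the names TokenName, Token, and the enum constants are illustrative assumptions, not fixed by the slides):

typedef enum { ID, ASSIGN, PLUS, STAR, NUM } TokenName;

typedef struct {
    TokenName name; /* abstract symbol used during syntax analysis */
    int attr;       /* e.g. symbol-table index for an id, value for a number */
} Token;

/* Token stream for: position = initial + rate * 60 */
Token stream[] = {
    { ID, 1 }, { ASSIGN, 0 }, { ID, 2 }, { PLUS, 0 },
    { ID, 3 }, { STAR, 0 }, { NUM, 60 }
};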
Syntax Analysis
• Second Phase
• Syntax Analysis or Parsing
• Tokens are used to produce a tree-like intermediate representation
• Syntax Tree
• Each interior node represents an operation
• The children of the node represent the arguments of the operation
• The tree shows the order in which the operations are to be performed
• A context-free grammar is used to specify the grammatical structure of the programming language
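
For the running statement position = initial + rate * 60, the syntax tree (with <id,k> referring to the symbol-table entries above) is:

        =
       / \
  <id,1>   +
          / \
     <id,2>  *
            / \
       <id,3>  60

The * node is applied before +, which is applied before the assignment.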
Semantic Analysis
• Checks for semantic consistency
• Gathers type information
• Performs type checking: the compiler checks that each operator has matching operands
• Example:
• Integer as array index
• Coercion – Type conversion
• Example:
• inttofloat
Intermediate Code Generation
• Intermediate representation
• Syntax trees are a form of intermediate representation, used during
syntax and semantic analysis
• After syntax analysis and semantic analysis, compilers generate an
explicit low-level or machine-like intermediate representation
• Two properties of intermediate representation
• Easy to produce
• Easy to translate into the target machine
Three-address Code
• An intermediate form
• A sequence of assembly-like instructions
• At most three operands per instruction
• Each operand can act like a register
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Contd…
• At most one operator on the right side of each instruction, so the order in which operations are to be done is fixed
• A temporary name is generated to hold the value computed by a three-address instruction
• Some three-address instructions have fewer than three operands (a representation sketch follows)
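
A common way to store a three-address instruction is as a quadruple; a minimal C sketch (field names and sizes are illustrative assumptions):

/* result = arg1 op arg2; unused fields are left empty */
typedef struct {
    char op[12];    /* operator, e.g. "+", "*", "inttofloat" */
    char arg1[8];   /* first operand */
    char arg2[8];   /* second operand, may be empty */
    char result[8]; /* destination, often a temporary t1, t2, ... */
} Quad;

/* t2 = id3 * t1 is stored as: */
Quad q = { "*", "id3", "t1", "t2" };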
Code Optimization
• Machine-independent code-optimization improves the intermediate code
• Results in better target code
• Better means faster or shorter code, or target code that consumes less power
• The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time
• So the inttofloat operation can be eliminated by replacing the integer 60 with the floating-point number 60.0
• Moreover, t3 is used only once, to transmit its value to id1, so it can be eliminated as well, yielding:
t1 = id3 * 60.0
id1 = id2 + t1
Code Generation
• Input is the intermediate code and output is the target code
• Registers or memory locations are selected for each of the variables
• The intermediate instructions are translated into sequences of machine instructions that perform the same task
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
Contd…
• First operand of each instruction specifies a destination
• F tells that the instruction deals with floating point numbers
• Storage allocation decisions are made either during intermediate
code generation or during code generation
Symbol Table Management
• Compiler records the variable names used in the source program and
collects information about various attributes.
• Provides information about the storage allocated for a name, its type, and its scope
• For procedure names: the number and types of the arguments, the method of passing each argument, and the type returned

• Data structure containing a record for each variable name, with fields
for the attributes of the name
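
A minimal C sketch of such a record (field names and sizes are illustrative assumptions):

typedef struct SymEntry {
    char name[32];         /* the lexeme, e.g. "position" */
    char type[16];         /* e.g. "float" */
    int offset;            /* storage allocated for the name */
    int scope;             /* scope in which the name is valid */
    struct SymEntry *next; /* chaining within a hash-table bucket */
} SymEntry;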
Grouping of Phases into Passes
• Several activities can be grouped together into a pass
• Front end phases such as lexical analysis, syntax analysis, semantic
analysis and intermediate code generation can be grouped together into
one pass
• Code optimization may be an optional pass
• Back end pass consists of code generation for a particular target machine
• Different front ends can be combined with the back end for a particular target machine
• One front end can be combined with back ends for different target machines
Compiler Construction Tools
• Scanner Generator
• Parser Generator
• Syntax-directed translation engines
• Code-generator generators
• Data-flow analysis engines
• Compiler-construction toolkits
Lexical Analysis
• A diagram or other description of the lexemes of each token
• Code to identify each occurrence of each lexeme in the input
• Code to return information about the token identified
• Role of lexical analyser
• Read the input characters of the source program, group them into lexemes
and produce a sequence of tokens as output
• Stream of tokens is sent to the parser
• Identifier – Enter the lexeme into the symbol table
• Information regarding the kind of identifier may be read from the symbol
table
Contd…
• The parser calls the lexical analyser via a getNextToken command (a sketch follows this list)
• The lexical analyser reads characters until it can identify the next lexeme, and returns the corresponding token to the parser
• Also strips out comments and whitespace
• Correlates error messages generated by the compiler with the source program
• Often divided into two processes: scanning (simple tasks such as deleting comments and compressing whitespace) and lexical analysis proper (producing tokens)
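
A sketch of this interface in C (Token is as assumed earlier; getNextToken and EOF_TOKEN are assumed names, not a fixed API):

Token tok = getNextToken();    /* lexical analyser produces the next token */
while (tok.name != EOF_TOKEN) {
    /* ... parser consumes tok and imposes grammatical structure ... */
    tok = getNextToken();      /* parser demands the next token */
}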
Lexical Analysis Vs Parsing
• Simplicity of Design
• Compiler efficiency is improved
• Compiler portability is enhanced
Tokens, Patterns and Lexemes
• Token – Pair containing a token name and an optional attribute value
• Token name is an abstract symbol representing a kind of lexical unit
• Pattern is a description of the form that the lexemes of the token take
• Lexeme is a sequence of characters that match the pattern of a token
Contd…
• One token for each keyword
• Tokens for the operators
• One token representing all the identifiers
• One or more tokens representing the constants
• One token for each punctuation symbol
Example
• Find the number of tokens
1. main()
{ printf("cd");
// print the message
}
2. while (i > 0)
{ printf( i );
i++;
}
Contd…
int main()
{
int a = 10, b = 20;
printf("sum is :%d", a+b);
return 0;
}
Attributes for Tokens
• Attribute value describes the lexeme represented by the token
• Example:
• Token id – lexeme, its type, its location
• Pointer to the symbol table entry for that identifier
Lexical Errors
• fi ( a == f(x)) …
• The lexical analyser cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier
• fi is a valid lexeme for token id, hence returns token id to the parser
• Parser has to handle the error due to transposition of letters
• If none of the patterns matches any prefix of the remaining input, use "panic mode" recovery
• Delete successive characters from the remaining input until a well-formed token appears at its beginning
Other error-recovery actions
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent letters

• A simple strategy: see whether a single transformation of one of these kinds repairs the error
• In general: find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes
Input Buffering
• The lexical analyser may need to look at one or more characters beyond the next lexeme
• At least one additional character of lookahead is needed
• Single-character operators < , > , - , = could be the beginning of the two-character operators <=, >=, ==, ->
• The end of an identifier is recognized only on seeing a character that is not a letter or a digit
• Two-buffer scheme
• Sentinels
Buffer Pairs
• Two buffers, alternately reloaded
• Each buffer is of size N
• N is usually the size of a disk block, e.g. 4096 bytes
• eof marks the end of the source file
• Two pointers
• lexemeBegin – beginning of the current lexeme
• forward – scans ahead until a pattern match is found
Sentinels
• Each time forward is advanced, we must check that we have not moved off one of the buffers; if we have, the other buffer must be reloaded
• Thus, for each character read, two tests are performed:
• One for the end of the buffer
• One to determine what character was read
• The tests can be combined if each buffer holds a sentinel character at its end
• The sentinel is a special character that cannot be part of the source program – eof is the natural choice (a code sketch follows)
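
A C sketch of the sentinel test when the forward pointer is advanced (buf1, buf2, reload, and EOF_CHAR are assumed names; N is the buffer size):

switch (*forward++) {
case EOF_CHAR:
    if (forward == buf1 + N + 1) {        /* ran off the end of the first buffer */
        reload(buf2);                     /* assumed helper that refills a buffer */
        forward = buf2;
    } else if (forward == buf2 + N + 1) { /* ran off the end of the second buffer */
        reload(buf1);
        forward = buf1;
    } else {
        /* genuine end of input: terminate lexical analysis */
    }
    break;
default:
    /* ordinary character: this single test sufficed */
    break;
}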
Specification of Tokens
• Regular expression
• Alphabet – a finite set of symbols
• String over an alphabet – finite sequence of symbols drawn from that
alphabet
• Length of a string s, denoted as |s| - number of occurrences of
symbols in s
• Empty string, ɛ
• Language is any countable set of strings over some fixed alphabet
Operation on Languages
• Union
• Concatenation
• Closure
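
These three operations are defined as follows, writing L^i for L concatenated with itself i times (L^0 = {ɛ}):

Union: L U M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = union of L^i for all i >= 0
Positive closure: L+ = union of L^i for all i >= 1

For example, if L = {a} and M = {b}, then L U M = {a, b}, LM = {ab}, and L* = {ɛ, a, aa, aaa, …}.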
Regular Expressions
• Regular expressions describe all the languages that can be built by applying union, concatenation, and closure to the symbols of some alphabet
• C identifiers are described as letter_(letter_|digit)*
• letter_ - any letter or underscore
• digit – for any digit
• | means union
• * means “zero or more occurrences of”
• Each regular expression r denotes a language L(r)
Contd…
• Basis:
• ɛ is a regular expression denoting {ɛ}; for each symbol a in ∑, a is a regular expression denoting {a}
• Induction: suppose r and s are regular expressions denoting languages L(r) and L(s)
• (r) | (s) is a regular expression denoting the language L(r) U L(s)
• (r) (s) is a regular expression denoting the language L(r) L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Precedence and Associativity
• Parentheses can be avoided if conventions are followed
• The unary operator * has highest precedence and is left associative
• Concatenation has second-highest precedence and is left associative
• | has lowest precedence and is left associative
• Under these conventions, (a) | ((b)* (c)) may therefore be written as a | b* c
Examples
• ∑ = {a , b}
• Regular expression a|b denotes the language {a , b}
• (a| b) (a| b) denotes {aa, ab, ba, bb}
• a* denotes {ɛ, a, aa, aaa …..}
• (a | b)* denotes {ɛ, a, b, ab, ba, aa, bb, aaa, ….}
• a | a*b denotes {a, b, ab, aab, aaab, ….}

• The language defined by a regular expression is called a regular set


Algebraic Laws for Regular Expressions
• r|s = s|r (| is commutative)
• r|(s|t) = (r|s)|t (| is associative)
• r(st) = (rs)t (concatenation is associative)
• r(s|t) = rs|rt and (s|t)r = sr|tr (concatenation distributes over |)
• ɛr = rɛ = r (ɛ is the identity for concatenation)
• r* = (r|ɛ)* (ɛ is guaranteed in a closure)
• r** = r* (* is idempotent)
Regular Definitions
• Regular definition is a sequence of definitions of the form
• d1 -> r1 , d2 -> r2 ….. dn -> rn
• Each di is a new symbol, not in ∑ and distinct from the other d's
• Each ri is a regular expression over ∑ U {d1, d2, …. di-1}
• Regular definition of C language identifiers:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ ( letter_ | digit )*
Extensions of Regular Expressions
• One or more instances – Unary postfix operator + represents the
positive closure of a regular expression
• Zero or one instance – Unary postfix operator ?
• Character classes – a regular expression a1|a2|…|an, where the ai are symbols of the alphabet, can be replaced by the shorthand [a1a2…an]; when the symbols form a consecutive sequence, by [a1-an]
• Definition of identifiers using these extensions:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_ ( letter_ | digit )*
Recognition of Tokens
• Examine the input string and find a prefix that is a lexeme matching one of the patterns
Contd…
• Strip whitespace by recognizing the token ws, defined by ws -> ( blank | tab | newline )+
• blank, tab, and newline are abstract symbols standing for the corresponding ASCII characters
• The token ws is not returned to the parser
Tokens, Patterns and Attribute Values
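A representative sample of such a table, assembled from the patterns used in this unit (the relop row is an illustrative assumption, not from the slides):

Token    Pattern                        Attribute value
if       if                             none
id       letter_ ( letter_ | digit )*   pointer to symbol-table entry
number   digit+                         numeric value of the lexeme
relop    < | <= | == | > | >= | !=      the particular comparison operator
ws       ( blank | tab | newline )+     none (token not returned to the parser)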
Transition Diagram
• An intermediate step in the construction of a lexical analyser
• Each pattern is converted into a stylized flowchart called a transition diagram
• A collection of nodes or circles, called states
• Edges are directed from one state to another
• Each edge is labelled by a symbol or set of symbols
• All the transition diagrams used here are deterministic
• Accepting or final states are indicated by a double circle
• One state is designated the start state or initial state (a code sketch follows)
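
As a sketch, the diagram for recognizing < versus <= can be coded one state at a time (getRelop, nextChar, retract, makeToken, fail, LT, LE are all assumed names):

Token getRelop(void) {
    int state = 0;                    /* start state */
    while (1) {
        char c = nextChar();          /* advance the forward pointer */
        switch (state) {
        case 0:
            if (c == '<') state = 1;  /* edge labelled '<' */
            else fail();              /* no relop here; assumed not to return */
            break;
        case 1:                       /* have seen '<' */
            if (c == '=')
                return makeToken(LE); /* accepting state for "<=" */
            retract(1);               /* c belongs to the next lexeme */
            return makeToken(LT);     /* accepting state for "<" */
        }
    }
}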
Thank You
