Unit 1
R.Rajakumari
Assistant Professor(Sr. Gr.)
Department of Computer Science and Engineering
National Engineering College- Kovilpatti
Overview
• Structure of a Compiler
• Lexical Analysis
• Role of Lexical Analysis
• Input Buffering
• Specification of Tokens
• Recognition of Tokens
• Lex
Introduction
• Programming Languages
• Compilers
• Reads a program in one language (the source language) and translates it into an equivalent
program in another language (the target language)
• Reports any error in the source program
• Interpreter
• Directly executes the operations specified in the source program on inputs supplied
by the user
• Line by line execution
• The target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs
• Interpreter is better at error diagnostics
Contd…
• Assembler
• Assembly language to relocatable machine code
• Linker
• Links relocatable object files and library files
• Loader
• Puts all the executable object files into memory for execution
Structure of a Compiler
• Viewed as a single box, a compiler maps the source program into a semantically
equivalent target program
• Two parts - Analysis and Synthesis
• Analysis
• Breaks the source program into constituent pieces
• Imposes grammatical structure
• Intermediate representation of the source program
• If the source program is syntactically ill formed or semantically unsound, gives
informative messages for corrective action
• Collects information and stores in a data structure called Symbol Table
Contd…
• Synthesis
• Constructs the desired target program from the intermediate representation
and the information in the symbol table
• Analysis part - Front End
• Synthesis part - Back End
Phases of a Compiler
• Lexical Analysis
• Syntax Analysis
• Semantic Analysis
• Intermediate Code Generation
• Machine Independent Code Optimization
• Code Generation
• Machine Dependent Code Optimization
Lexical Analysis
• First phase
• Lexical Analysis or Scanning
• Reads the stream of characters and groups the characters into
meaningful sequences called lexemes
• Outputs <token_name, attribute_value>
• Token_name is an abstract symbol used during syntax analysis
• Attribute_value points to entry in symbol table for this token
• Symbol Table entry is needed for semantic analysis and code
generator
Example
• position = initial + rate * 60
• position is a lexeme
• Token <id,1>
• Assignment Symbol = is a lexeme
• Token <=>
• initial is a lexeme
• Token < id,2>
• Addition operator + is a lexeme
• Token <+>
Example
• rate is a lexeme
• Token <id,3>
• Multiplication operator * is a lexeme
• Token <*>
• 60 is a lexeme
• Token <60>
• <id,1> <=> <id,2> <+> <id,3> <*> <60>
• Blanks separating the lexemes are discarded by the lexical
analyzer
Syntax Analysis
• Second Phase
• Syntax Analysis or Parsing
• Tokens are used to produce a tree-like intermediate representation
• Syntax Tree
• Interior nodes represent operations
• Children of the node represent the arguments of the operation
• Tree shows the order in which the operations are executed
• Context Free Grammar is used to specify the grammatical structure of
the programming language
Semantic Analysis
• Checks for semantic consistency
• Gathers type information
• Performs type checking: the compiler checks whether each operator has
matching operands
• Example:
• Integer as array index
• Coercion – Type conversion
• Example:
• inttofloat
Intermediate Code Generation
• Intermediate representation
• Syntax trees are a form of intermediate representation, used during
syntax and semantic analysis
• After syntax analysis and semantic analysis, compilers generate an
explicit low-level or machine-like intermediate representation
• Two properties of intermediate representation
• Easy to produce
• Easy to translate into the target machine
Three-address Code
• An intermediate form
• Sequence of assembly-like instructions
• Three operands per instruction
• Each operand can act like a register
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Contd…
• At most one operator on the right side
• Fix the order in which operations are to be done
• Temporary name is generated to hold the value computed by a three-
address instruction
• Some three-address instructions have fewer than three operands
Code Optimization
• Machine-independent code-optimization improves the intermediate code
• Results in better target code
• Better may mean faster code, shorter code, or code that consumes less power
• Optimizer can deduce that conversion of 60 from integer to floating point
can be done once
• inttofloat operation can be eliminated by replacing the integer 60 by the
floating point number 60.0
• Moreover, t3 is used only once, to transmit its value to id1, so it can be eliminated
t1 = id3 * 60.0
id1 = id2 + t1
Code Generation
• Input is the intermediate code and output is the target code
• Registers or memory locations are selected for each of the variables
• The intermediate instructions are translated into sequences of machine
instructions that perform the same task
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
Contd…
• First operand of each instruction specifies a destination
• The suffix F tells that the instruction deals with floating-point numbers
• Storage allocation decisions are made either during intermediate
code generation or during code generation
Symbol Table Management
• Compiler records the variable names used in the source program and
collects information about various attributes.
• Provides information about the storage allocated for a name, its type,
its scope
• Procedure names
• Number and type of its arguments, method of passing each argument, type
returned
• Data structure containing a record for each variable name, with fields
for the attributes of the name
Grouping of Phases into Passes
• Several activities can be grouped together into a pass
• Front end phases such as lexical analysis, syntax analysis, semantic
analysis and intermediate code generation can be grouped together into
one pass
• Code optimization may be an optional pass
• Back end pass consists of code generation for a particular target machine
• Combine different front end with back end of a particular target machine
• Combine front end with back ends for different target machines
Compiler Construction Tools
• Scanner Generator
• Parser Generator
• Syntax-directed translation engines
• Code-generator generators
• Data-flow analysis engines
• Compiler-construction toolkits
Lexical Analysis
• Diagram or description for lexemes of each token
• Code to identify each occurrence of each lexeme on the input
• Return information about the token identified
• Role of lexical analyser
• Read the input characters of the source program, group them into lexemes
and produce a sequence of tokens as output
• Stream of tokens is sent to the parser
• Identifier – Enter the lexeme into the symbol table
• Information regarding the kind of identifier may be read from the symbol
table
Contd…
• Parser calls the lexical analyser
• getNextToken
• Token is returned to the parser
• Strips out comments and whitespaces
• Correlate error messages with source
• Two processes
• Scanning
• Lexical analysis
Lexical Analysis vs. Parsing
• Simplicity of Design
• Compiler efficiency is improved
• Compiler portability is enhanced
Tokens, Patterns and Lexemes
• Token – Pair containing a token name and an optional attribute value
• Token name is an abstract symbol representing a kind of lexical unit
• Pattern is a description of the form that the lexemes of the token take
• Lexeme is a sequence of characters that match the pattern of a token
Contd…
• One token for each keyword
• Tokens for the operators
• One token representing all the identifiers
• One or more tokens representing the constants
• One token for each punctuation symbol
Example
• Find the number of tokens
1. main()
{ printf(“cd”);
// print the message
}
2. while (i > 0)
{ printf( i );
i++;
}
Contd…
int main()
{
int a = 10, b = 20;
printf(“sum is :%d”, a+b);
return 0;
}
Attributes for Tokens
• Attribute value describes the lexeme represented by the token
• Example:
• Token id – lexeme, its type, its location
• Pointer to the symbol table entry for that identifier
Lexical Errors
• fi ( a == f(x)) …
• Lexical analyser cannot tell whether fi is a misspelling of the keyword if or an undeclared
function identifier
• fi is a valid lexeme for the token id, hence the lexical analyser returns the token id to the parser
• Parser has to handle the error due to the transposition of the letters
• If none of the patterns match any prefix of the remaining input then
“panic mode” recovery
• Delete successive characters from the remaining input until a well-
formed token appears at the beginning
Other error-recovery actions
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent letters
• Usually a single transformation suffices
• In principle one could search for the smallest number of transformations, but this is too expensive in practice
Input Buffering
• Look at one or more characters beyond the next lexeme
• At least one additional character of lookahead is needed
• Single-character operators < , > , - , = could be the beginning of the two-
character operators <= , >= , == , ->
• The end of an identifier is recognized only on seeing a character that is not a letter or a digit
• Two buffer scheme
• Sentinels
Buffer Pairs
• Two buffers are alternately reloaded
• Each buffer is of size N
• N is the size of a disk block
• eof marks the end of the source file
• Two pointers
• lexemeBegin – beginning of the current lexeme
• forward – scans ahead until a pattern match is found
Sentinels
• Each time forward is advanced, check that it has not moved off one of
the buffers; if it has, reload the other buffer
• Without sentinels, two tests are performed for each character read
• One for the end of the buffer
• One to determine what character was read
• Instead, each buffer holds a sentinel character at its end
• The sentinel is a special character that cannot be part of the source program – eof
Specification of Tokens
• Regular expression
• Alphabet – a finite set of symbols
• String over an alphabet – finite sequence of symbols drawn from that
alphabet
• Length of a string s, denoted as |s| - number of occurrences of
symbols in s
• Empty string, ɛ
• Language is any countable set of strings over some fixed alphabet
Operations on Languages
• Union: L ∪ M = { s | s is in L or s is in M }
• Concatenation: LM = { st | s is in L and t is in M }
• Closure: L* = L⁰ ∪ L¹ ∪ L² ∪ … , the strings formed by concatenating zero or more strings from L
Regular Expressions
• Describe all the languages that can be built from these operators
applied to the symbols of some alphabet
• C identifiers are described as letter_(letter_|digit)*
• letter_ - any letter or underscore
• digit – for any digit
• | means union
• * means “zero or more occurrences of”
• Each regular expression r denotes a language L(r)
Contd…
• Basis:
• ɛ is a regular expression denoting {ɛ}; for each symbol a in the alphabet, a is a regular expression denoting {a}
• Induction:
• If r and s are regular expressions denoting languages L(r) and L(s):
• (r) | (s) is a regular expression denoting the language L(r) U L(s)
• (r) (s) is a regular expression denoting the language L(r) L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Precedence and Associativity
• Parentheses can be avoided if conventions are followed
• The unary operator * has the highest precedence and is left associative
• Concatenation has the second-highest precedence and is left associative
• | has the lowest precedence and is left associative
• (a) | ((b)* (c)) can therefore be written as a | b* c
Examples
• ∑ = {a , b}
• Regular expression a|b denotes the language {a , b}
• (a| b) (a| b) denotes {aa, ab, ba, bb}
• a* denotes {ɛ, a, aa, aaa, …}
• (a | b)* denotes {ɛ, a, b, ab, ba, aa, bb, aaa, …}
• a | a*b denotes {a, b, ab, aab, aaab, …}