CD 1
Analysis of the source program - Analysis and synthesis phases, Phases of a compiler. Compiler writing
tools. Bootstrapping. Lexical Analysis - Role of Lexical Analyzer, Input Buffering, Specification of Tokens,
Recognition of tokens
INTRODUCTION TO COMPILERS
A compiler is a program that can read a program in one language (the source language) and
translate it into an equivalent program in another language (the target language).
An important role of the compiler is to report any errors in the source program that it detects
during the translation process.
Fig: A compiler, which maps a source program to a target program and reports error messages to the user
Analysis of the source program consists of three phases:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
Lexical Analysis
In a compiler, linear analysis is called lexical analysis or scanning. The lexical analysis phase reads
the characters in the source program and groups them into tokens: sequences of characters
having a collective meaning.
EXAMPLE
position := initial + rate * 60
Blanks separating characters of these tokens are normally eliminated during lexical analysis.
Syntax Analysis
Hierarchical Analysis is called parsing or syntax analysis.
It involves grouping the tokens of the source program into grammatical phrases that are used
by the compiler to synthesize output. These phrases are represented using a syntax tree.
A syntax tree is the tree generated as a result of syntax analysis, in which the interior nodes
are the operators and the leaf nodes are the operands. This phase reports an error when
the syntax is incorrect.
Semantic Analysis
This phase checks the source program for semantic errors and gathers type
information for the subsequent code-generation phase.
An important component of semantic analysis is type checking.
Here the compiler checks that each operator has operands that are permitted by the
source language specification.
PHASES OF A COMPILER
The phases include:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Target Code Generation
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning.
The lexical analyzer reads the stream of characters making up the source program and groups the
characters into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces as output a token of the form
<token-name, attribute-value>
that it passes on to the subsequent phase, syntax analysis.
For example, consider the assignment statement position = initial + rate * 60.
The characters in this assignment could be grouped into the following lexemes and mapped into the
following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract
symbol standing for identifier and 1 points to the symbol table entry for position. The
symbol- table entry for an identifier holds information about the identifier, such as its
name and type.
2. The assignment symbol = is a lexeme that is mapped into the token < = >. Since this token
needs no attribute-value, we have omitted the second component.
3. initial is a lexeme that is mapped into the token < id, 2> , where 2 points to the symbol-
table entry for initial .
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that is mapped into the token < id, 3 >, where 3 points to the symbol-
table entry for rate.
6. * is a lexeme that is mapped into the token <* > .
7. 60 is a lexeme that is mapped into the token <60>
Blanks separating the lexemes would be discarded by the lexical analyzer. The representation of the
assignment statement position = initial + rate * 60 after lexical analysis is the sequence of tokens:
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Token : Token is a sequence of characters that can be treated as a single logical entity. Typical tokens are,
Identifiers
keywords
operators
special symbols
constants
Pattern : A set of strings in the input for which the same token is produced as output. This set of
strings is described by a rule called a pattern associated with the token.
Lexeme : A lexeme is a sequence of characters in the source program that is matched by the pattern
for a token.
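The following is a minimal, illustrative sketch (in Python) of how a scanner could group the characters of position = initial + rate * 60 into lexemes and map them to tokens. The pattern list, the token names and the way symbol-table indices are assigned are assumptions made for this sketch, not a fixed design.

import re

# Each token name is paired with the regular-expression pattern for its lexemes.
TOKEN_SPEC = [
    ("num", r"\d+"),
    ("id",  r"[A-Za-z][A-Za-z0-9]*"),
    ("op",  r"[+\-*/=]"),
    ("ws",  r"[ \t\n]+"),            # blanks: matched but never returned as tokens
]

def tokenize(source):
    symbol_table = {}                # lexeme -> symbol-table index
    tokens = []
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group(0)
                if name == "id":
                    index = symbol_table.setdefault(lexeme, len(symbol_table) + 1)
                    tokens.append(("id", index))
                elif name != "ws":   # whitespace separating lexemes is discarded
                    tokens.append((lexeme,))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError("lexical error at position %d" % pos)
    return tokens, symbol_table

print(tokenize("position = initial + rate * 60")[0])
# [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('60',)]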
Syntax Analysis
The second phase of the compiler is syntax analysis or parsing.
The parser uses the first components of the tokens produced by the lexical analyzer to create a
tree-like intermediate representation that depicts the grammatical structure of the token
stream.
A typical representation is a syntax tree in which each interior node represents an
operation and the children of the node represent the arguments of the operation.
The syntax tree for the above token stream has an interior node labeled * with <id, 3> as its left
child and the integer 60 as its right child.
The node (id, 3) represents the identifier rate.
The node labeled * makes it explicit that we must first multiply the value of rate by 60.
The node labeled + indicates that we must add the result of this multiplication to the value
of initial.
The root of the tree, labeled =, indicates that we must store the result of this addition into
the location for the identifier position.
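As a rough sketch, this syntax tree can be represented with nested tuples; the leaf labels "id" and "num" and the post-order walk are illustrative choices made here, not fixed terminology.

# The syntax tree for <id,1> = <id,2> + <id,3> * 60, as nested tuples of the
# form (operator, left-child, right-child); leaves are ("id", n) or ("num", value).
tree = ("=",
        ("id", 1),                     # position
        ("+",
         ("id", 2),                    # initial
         ("*",
          ("id", 3),                   # rate
          ("num", 60))))               # the integer constant 60

def postorder(node):
    # Visit children first, then the operator: the order later used to emit code.
    if node[0] in ("id", "num"):
        return [node]
    op, left, right = node
    return postorder(left) + postorder(right) + [op]

print(postorder(tree))
# [('id', 1), ('id', 2), ('id', 3), ('num', 60), '*', '+', '=']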
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the symbol table,
for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that
each operator has matching operands.
For example, many programming language definitions require an array index to be an
integer; the compiler must report an error if a floating-point number is used to index an
array.
The semantic analyzer may also perform certain type conversions, called coercions.
For example, if an arithmetic operator is applied to a floating-point number and an integer, the
compiler may convert the integer into a floating-point number.
In our example, suppose that position, initial, and rate have been declared to be floating-
point numbers, and that the lexeme 60 by itself forms an integer.
The semantic analyzer discovers that the operator * is applied to a floating-point number
rate and an integer 60.
In this case, the integer may be converted into a floating-point number.
In the following figure, notice that the output of the semantic analyzer has an extra node
for the operator inttofloat, which explicitly converts its integer argument into a floating-
point number.
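The sketch below illustrates this coercion step, assuming a toy type system with only "int" and "float". The node shapes, the symtab_types mapping and the inttofloat wrapper node are assumptions carried over from the earlier tree sketch.

# The semantic analyzer wraps an integer operand of a mixed-mode arithmetic
# operator in an explicit inttofloat node.
def type_of(node, symtab_types):
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "id":
        return symtab_types[node[1]]           # type recorded in the symbol table
    if kind == "inttofloat":
        return "float"
    # arithmetic operator node: the result is float if either operand is float
    lt, rt = type_of(node[1], symtab_types), type_of(node[2], symtab_types)
    return "float" if "float" in (lt, rt) else "int"

def coerce(node, symtab_types):
    kind = node[0]
    if kind in ("id", "num"):
        return node
    op, left, right = node
    left, right = coerce(left, symtab_types), coerce(right, symtab_types)
    lt, rt = type_of(left, symtab_types), type_of(right, symtab_types)
    if lt == "float" and rt == "int":
        right = ("inttofloat", right)
    elif lt == "int" and rt == "float":
        left = ("inttofloat", left)
    return (op, left, right)

symtab_types = {1: "float", 2: "float", 3: "float"}    # position, initial, rate
print(coerce(("*", ("id", 3), ("num", 60)), symtab_types))
# ('*', ('id', 3), ('inttofloat', ('num', 60)))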
Intermediate Code Generation
In the process of translating a source program into target code, a compiler may construct one
or more intermediate representations, which can have a variety of forms.
Syntax trees are a form of intermediate representation; they are commonly used during syntax
and semantic analysis.
After syntax and semantic analysis of the source program, many compilers generate an
explicit low-level or machine-like intermediate representation, which we can think of as a
program for an abstract machine.
This intermediate representation should have two important properties:
It should be simple and easy to produce.
It should be easy to translate into the target machine.
In our example, the intermediate representation used is three-address code, which consists
of a sequence of assembly-like instructions with three operands per instruction.
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
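One possible way to produce this three-address code is a post-order walk of the annotated syntax tree, creating a fresh temporary for each operator node. The sketch below reuses the illustrative tuple form of the earlier examples and is only one of many ways to generate intermediate code.

# Post-order walk of the (coerced) syntax tree, emitting one three-address
# instruction per operator node and a fresh temporary t1, t2, ... for each result.
def gen_tac(tree):
    code, n = [], 0
    def new_temp():
        nonlocal n
        n += 1
        return "t%d" % n
    def walk(node):
        kind = node[0]
        if kind == "id":
            return "id%d" % node[1]
        if kind == "num":
            return str(node[1])
        if kind == "inttofloat":
            arg = walk(node[1])
            temp = new_temp()
            code.append("%s = inttofloat(%s)" % (temp, arg))
            return temp
        op, left, right = node
        l, r = walk(left), walk(right)
        if op == "=":
            code.append("%s = %s" % (l, r))
            return l
        temp = new_temp()
        code.append("%s = %s %s %s" % (temp, l, op, r))
        return temp
    walk(tree)
    return code

tree = ("=", ("id", 1),
        ("+", ("id", 2),
         ("*", ("id", 3), ("inttofloat", ("num", 60)))))
print("\n".join(gen_tac(tree)))
# t1 = inttofloat(60)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3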
Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate
code so that better target code will result.
The objectives for performing optimization are: faster execution, shorter code, or target code that
consumes less power.
In our example, the optimized code is:
t1 = id3 * 60.0
id1 = id2 + t1
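As a toy illustration, the sketch below performs two of the transformations used in this example: folding inttofloat applied to a constant, and writing the final sum directly into id1 instead of going through the copy id1 = t3. The four-tuple instruction format is an assumption of the sketch, and the temporaries are not renumbered as in the text above.

# Instructions are modelled as (result, op, arg1, arg2) tuples; "copy" means a
# simple assignment.  These two transformations are only a small sample of the
# optimizations a real compiler performs.
def optimize(code):
    subst, out = {}, []
    for result, op, a1, a2 in code:
        a1, a2 = subst.get(a1, a1), subst.get(a2, a2)
        if op == "inttofloat" and isinstance(a1, (int, float)):
            subst[result] = float(a1)          # fold: later uses of t1 become 60.0
            continue
        if op == "copy" and out and out[-1][0] == a1:
            # the copied value was computed by the previous instruction,
            # so write it directly into the destination instead
            _, prev_op, x, y = out[-1]
            out[-1] = (result, prev_op, x, y)
            continue
        out.append((result, op, a1, a2))
    return out

code = [("t1", "inttofloat", 60, None),
        ("t2", "*", "id3", "t1"),
        ("t3", "+", "id2", "t2"),
        ("id1", "copy", "t3", None)]
for instr in optimize(code):
    print(instr)
# ('t2', '*', 'id3', 60.0)
# ('id1', '+', 'id2', 't2')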
Code Generator
The code generator takes as input an intermediate representation of the source program and
maps it into the target language.
If the target language is machine code, registers or memory locations are selected for each
of the variables used by the program.
Then, the intermediate instructions are translated into sequences of machine instructions that
perform the same task.
A crucial aspect of code generation is the judicious assignment of registers to hold
variables.
If the target language is assembly language, this phase generates the assembly code as its
output.
In our example, the code generated is:
LDF R2, id3
MULF R2, #60.0
LDF R1, id2
ADDF R1, R2
STF id1, R1
The first operand of each instruction specifies a destination.
The F in each instruction tells us that it deals with floating-point numbers.
The above code loads the contents of address id3 into register R2, then multiplies it with
floating-point constant 60.0.
The # signifies that 60.0 is to be treated as an immediate constant.
The third instruction moves id2 into register R1 and the fourth adds to it the value previously
computed in register R2.
Finally, the value in register R1 is stored into the address of id1 , so the code correctly
implements the assignment statement
position = initial + rate * 60.
Symbol Table
An essential function of a compiler is to record the variable names used in the source
program and collect information about various attributes of each name.
These attributes may provide information about the storage allocated for a name, its type, its
scope (where in the program its value may be used), and in the case of procedure names,
such things as the number and types of its arguments, the method of passing each argument (for
example, by value or by reference), and the type returned.
The symbol table is a data structure containing a record for each variable name, with fields for the
attributes of the name.
The data structure should be designed to allow the compiler to find the record for each name quickly and
to store or retrieve data from that record quickly.
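A minimal symbol-table sketch is shown below: each name maps to a record of attributes stored in a hash table, so records can be found, stored and retrieved quickly. The particular attribute fields (type, scope) are illustrative assumptions, not a required layout.

# One record per name, kept in a hash table (Python dict) for fast access.
class SymbolTable:
    def __init__(self):
        self._records = {}            # name -> attribute record

    def insert(self, name, **attributes):
        # Create (or update) the record for `name`; returns the record.
        record = self._records.setdefault(name, {"name": name})
        record.update(attributes)
        return record

    def lookup(self, name):
        # Return the record for `name`, or None if it was never declared.
        return self._records.get(name)

table = SymbolTable()
table.insert("position", type="float", scope="global")
table.insert("rate", type="float", scope="global")
print(table.lookup("rate"))   # {'name': 'rate', 'type': 'float', 'scope': 'global'}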
Each phase can encounter errors in the source program. However, after detecting an error, a phase
must somehow deal with that error, so that compilation can proceed, allowing further errors in the
source program to be detected.
A compiler that stops when it finds the first error is not a helpful one.
Fig: Translation of the statement position = initial + rate * 60 through the phases of a compiler:
LEXICAL ANALYZER: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
SYNTAX ANALYZER: syntax tree for the assignment
SEMANTIC ANALYZER: syntax tree with the inttofloat node inserted
INTERMEDIATE CODE GENERATOR:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
CODE OPTIMIZER:
t1 = id3 * 60.0
id1 = id2 + t1
CODE GENERATOR:
LDF R2, id3
MULF R2, #60.0
LDF R1, id2
ADDF R1, R2
STF id1, R1
Analysis Phase
The Analysis Phase performs four actions, namely:
1. Lexical analysis
2. Syntax Analysis
3. Semantic analysis
4. Intermediate Code Generation
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them.
It then uses this structure to create an intermediate representation of the source program.
If the analysis part detects that the source program is either syntactically ill formed or
semantically unsound, then it must provide informative messages, so the user can take corrective
action.
The analysis part also collects information about the source program and stores it in a data
structure called a symbol table, which is passed along with the intermediate representation to the
synthesis part.
Synthesis Phase
The Synthesis Phase performs two actions, namely:
1. Code Optimization
2. Code Generation
The synthesis part constructs the desired target program from the intermediate representation
and the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back
end.
COMPILER WRITING TOOLS
Compiler writers use software development tools and more specialized tools for implementing
various phases of a compiler. Some commonly used compiler construction tools include the
following.
Parser Generators
Scanner Generators
Syntax-directed translation engine
Automatic code generators
Data-flow analysis Engines
Compiler Construction toolkits
Parser Generators.
Input : Grammatical description of a programming language
Output : Syntax analyzers.
o These produce syntax analyzers, normally from input that is based on a context-free grammar.
o In early compilers, syntax analysis consumed not only a large fraction of the running time of a
compiler, but a large fraction of the intellectual effort of writing a compiler.
o Now, with parser generators, this phase is considered one of the easiest to implement.
Scanner Generators
Input : Regular expression description of the tokens of a language
Output : Lexical analyzers.
o These automatically generate lexical analyzers, normally from a specification based on regular
expressions.
o The basic organization of the resulting lexical analyzer is in effect a finite automaton.
BOOTSTRAPPING
1. Create a compiler for a subset, S, of the desired language, L, written in language A, which
runs on machine A. (Language A may be assembly language.)
The process illustrated by the T-diagrams is called bootstrapping and can be summarized by the
equation:
LEXICAL ANALYSIS
ROLE OF THE LEXICAL ANALYZER
• As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output a sequence
of tokens for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• Commonly, the interaction is implemented by having the parser call the lexical analyzer.
• The call, suggested by the getNextToken command, causes the lexical analyzer to read characters
from its input until it can identify the next lexeme and produce for it the next token, which it
returns to the parser.
There are several reasons why the lexical analyzer is separated from the parser:
Simplicity Of Design
The separation of lexical analysis from syntactic analysis often allows us to simplify at least one of
these tasks. The syntax analyzer can be smaller and cleaner if the low-level details of
lexical analysis are removed from it.
Efficiency
Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized
techniques that serve only the lexical task, not the job of parsing. In addition, specialized
buffering techniques for reading input characters can speed up the compiler significantly.
Portability
Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the
lexical analyzer.
Lexical Errors
A character sequence that can’t be scanned into any valid token is a lexical error.
Suppose a situation arises in which the lexical analyzer is unable to proceed because none
of the patterns for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery.
We delete successive characters from the remaining input, until the lexical analyzer can
find a well-formed token at the beginning of what input is left.
This recovery technique may confuse the parser, but in an interactive computing
environment it may be quite adequate.
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
• Transformations like these may be tried in an attempt to repair the input.
• The simplest such strategy is to see whether a prefix of the remaining input can be
transformed into a valid lexeme by a single transformation.
• A more general correction strategy is to find the smallest number of transformations needed
to convert the source program into one that consists only of valid lexemes, but this approach
is considered too expensive in practice to be worth the effort.
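The following is a small sketch of the panic-mode strategy described above: characters are deleted from the remaining input until some token pattern matches again. The pattern list here is an illustrative assumption.

import re

PATTERNS = [re.compile(p) for p in (r"[A-Za-z][A-Za-z0-9]*", r"\d+", r"[+\-*/=()]")]

def next_token(text, pos):
    while pos < len(text):
        for pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m:
                return m.group(0), m.end()
        # lexical error: discard one character and try again (panic mode)
        print("lexical error: discarding %r" % text[pos])
        pos += 1
    return None, pos

print(next_token("@#rate", 0))   # discards '@' and '#', then returns ('rate', 6)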
INPUT BUFFERING
• To ensure that the right lexeme is found, the lexical analyzer may have to look ahead one or more
characters beyond the next lexeme.
• Hence a two-buffer scheme is introduced to handle large lookaheads safely.
• Techniques for speeding up the process of lexical analyzer such as the use of sentinels to mark
the buffer end have been adopted.
• There are three general approaches for the implementation of a lexical analyzer:
a. By using a lexical-analyzer generator, such as lex compiler to produce the lexical analyzer
from a regular expression based specification. In this, the generator provides routines for
reading and buffering the input.
b. By writing the lexical analyzer in a conventional systems-programming language, using
I/O facilities of that language to read the input.
c. By writing the lexical analyzer in assembly language and explicitly managing the reading
of input.
• The three choices are listed in order of increasing difficulty for the implementer
Buffer Pairs
• Because a large amount of time can be consumed moving characters, specialized buffering
techniques have been developed to reduce the amount of overhead required to process an
input character.
• Fig shows the buffer pairs which are used to hold the input data.
Scheme
• Consists of two buffers, each of size N characters, which are reloaded alternately.
• N is the number of characters in one disk block, e.g., 4096.
• N characters are read from the input file into a buffer half using one system read command.
• A special character, eof, is inserted at the end if the number of characters read is less than N.
Pointers
Two pointers to the input are maintained:
1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are
attempting to determine.
2. Pointer forward scans ahead until a pattern match is found.
Sentinels
A sentinel is a special character (eof) placed at the end of each buffer half, so that the test for the
end of a buffer can be combined with the test for the current character.
Most of the time, the lexical analyzer performs only one test: whether the forward pointer
points to an eof.
Only when the forward pointer reaches the end of a buffer half or the end of the input does it
perform more tests.
Since N input characters are encountered between eofs, the average number of tests per input
character is very close to 1.
The lookahead code with sentinels is:
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
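The following is a runnable sketch of the buffer-pair scheme with sentinels, mirroring the pseudocode above. Only the forward pointer is modelled (lexemeBegin is omitted), indices play the role of pointers, and the buffer size N and the file-like source argument are assumptions of the sketch.

import io

EOF = "\0"                                   # the sentinel character
N = 16                                       # small here for illustration; normally one disk block, e.g., 4096

class BufferPair:
    def __init__(self, source):
        self.source = source                 # any object with a read(n) method
        self.buf = [EOF] * (2 * N + 2)       # two halves, each followed by a sentinel
        self._reload(0)                      # fill the first half
        self.forward = -1

    def _reload(self, start):
        # Read up to N characters into the half starting at `start`;
        # the slot after the data always holds the sentinel.
        data = self.source.read(N)
        self.buf[start:start + N + 1] = list(data) + [EOF] * (N + 1 - len(data))

    def next_char(self):
        # Return the next input character, or None at end of input.
        self.forward += 1
        if self.buf[self.forward] == EOF:                # usually the only test made
            if self.forward == N:                        # sentinel ending the first half
                self._reload(N + 1)                      # reload the second half
                self.forward += 1
            elif self.forward == 2 * N + 1:              # sentinel ending the second half
                self._reload(0)                          # reload the first half
                self.forward = 0
            if self.buf[self.forward] == EOF:            # eof within a buffer: end of input
                return None
        return self.buf[self.forward]

bp = BufferPair(io.StringIO("position = initial + rate * 60"))
chars, c = [], bp.next_char()
while c is not None:
    chars.append(c)
    c = bp.next_char()
print("".join(chars))                        # the original input, read through the buffers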
SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
Strings
Languages
Regular expressions
Strings and Languages
An alphabet or character class is a finite set of symbols
A string over an alphabet is a finite sequence of symbols drawn from that
alphabet.
A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as
synonyms for "string." The length of a string s, usually written |s|, is the
number of occurrences of symbols in s. For example, banana is a string of
length six. The empty string, denoted ε, is the string of length zero.
Operations On Strings
The following string-related terms are commonly used:
A prefix of string s is any string obtained by removing zero or more symbols from the end of s.
A suffix of s is any string obtained by removing zero or more symbols from the beginning of s.
A substring of s is obtained by deleting any prefix and any suffix from s.
The proper prefixes, suffixes, and substrings of s are those that are neither ε nor s itself.
The concatenation of strings x and y, written xy, is the string formed by appending y to x; for any
string s, εs = sε = s.
Operations On Languages:
Union: L ∪ M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* is the set of strings formed by concatenating zero or more strings of L.
Positive closure: L+ is the set of strings formed by concatenating one or more strings of L.
Regular Expressions
Regular expressions allow us to define precisely the sets of strings that form tokens.
For example, letter ( letter | digit )* defines a Pascal identifier: an identifier is formed by a letter
followed by zero or more letters or digits.
A regular expression is built up out of simpler regular expressions using a set of defining
rules.
Each regular expression r denotes a language L( r ).
Algebraic properties of regular expressions
r | s = s | r ( | is commutative)
r | ( s | t ) = ( r | s ) | t ( | is associative)
( r s ) t = r ( s t ) (concatenation is associative)
r ( s | t ) = r s | r t and ( s | t ) r = s r | t r (concatenation distributes over |)
ε r = r ε = r (ε is the identity for concatenation)
r* = ( r | ε )*
r** = r* ( * is idempotent)
Regular Definition
If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of
the form
d1 → r1
d2 → r2
…
dn → rn
– Where each di is a distinct name, and each ri is a regular expression over the symbols in
Σ U {d1, d2, … , di-1}, i.e., the basic symbols and the previously defined names.
Example: Identifiers form the set of strings of letters and digits beginning with a letter. A regular
definition for this set is:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
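The regular definition above can be written directly as Python regular expressions, as a quick check of what the pattern accepts; the variable names letter, digit and ident are illustrative.

import re

letter = r"[A-Za-z]"
digit  = r"[0-9]"
ident  = letter + "(" + letter + "|" + digit + ")*"   # id -> letter ( letter | digit )*

print(bool(re.fullmatch(ident, "rate60")))    # True
print(bool(re.fullmatch(ident, "60rate")))    # False: an id must begin with a letter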
Notational Shorthand
Certain constructs occur so frequently in regular expressions that it is convenient to introduce
notational short hands for them.
1. One or more instances ( + )
The unary postfix operator + means "one or more instances of".
If r is a regular expression, then ( r )+ is a regular expression that denotes the language ( L(r) )+.
2. Zero or one instance ( ? )
The unary postfix operator ? means "zero or one instance of". The
notation r? is a shorthand for r | ε.
If r is a regular expression, then ( r )? is a regular expression that denotes the language L(r) ∪ {ε}.
3. Character Classes
The notation [abc], where a, b and c are alphabet symbols, denotes the regular
expression a | b | c.
A character class such as [a-z] denotes the regular expression a | b | c | d | … | z.
Using character classes, we can describe identifiers as the strings generated by the regular
expression [A-Za-z][A-Za-z0-9]*
Non-regular Set
A language which cannot be described by any regular expression is a non-regular set. Example:
The set of all strings of balanced parentheses and repeating strings cannot be described by a
regular expression. This set can be specified by a context-free grammar.
RECOGNITION OF TOKENS
The question is how to recognize the tokens?
EXAMPLE
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | num
The patterns for the tokens are:
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
delim → blank | tab | newline
ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser.
Transition Diagram
As an intermediate step in the construction of a lexical analyzer, we first produce a flowchart, called
a transition diagram.
Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to
get the next token.
The transition diagram keeps track of information about the characters that are seen as the forward
pointer scans the input.
It does so by moving from position to position in the diagram as characters are read.
1. One state is labelled the start state; it is the initial state of the transition diagram,
where control resides when we begin to recognize a token.
2. Positions in a transition diagram are drawn as circles and are called states.
3. The states are connected by arrows called edges. Labels on the edges indicate the input
characters that may appear after that state.
4. Certain states are accepting, or final, states; they indicate that a lexeme has been found and
are conventionally drawn as double circles.
5. If it is necessary to retract the forward pointer by one position (because the character that
ended the lexeme is not part of it), a * is attached to that accepting state.
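As an illustration, the transition diagram for id can be simulated by a small state-driven routine; the sketch below numbers the states 0 (start), 1 (inside the identifier) and 2 (accepting, marked *), which is one possible encoding rather than a fixed convention.

# States: 0 = start, 1 = inside the identifier, 2 = accepting state marked *.
def is_letter(c): return c.isalpha()
def is_digit(c):  return c.isdigit()

def get_id(text, forward):
    # Try to recognize an id starting at `forward`; return (lexeme, new forward) or None.
    state, start = 0, forward
    while True:
        c = text[forward] if forward < len(text) else "\0"   # sentinel past the end
        if state == 0:                         # start state
            if is_letter(c):
                state, forward = 1, forward + 1
            else:
                return None                    # this diagram fails; try another token's diagram
        elif state == 1:                       # letters or digits keep us in the identifier
            if is_letter(c) or is_digit(c):
                forward += 1
            else:
                state, forward = 2, forward + 1   # read the "other" character, go to the accepting state
        else:                                  # state 2: accepting state marked * => retract one character
            forward -= 1
            return text[start:forward], forward

print(get_id("rate * 60", 0))                  # ('rate', 4): forward retracted to the blank after "rate"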
**********