MODULE 1
Introduction to compilers – Analysis of the source
program, Phases of a compiler, Grouping of phases,
compiler writing tools – bootstrapping
Lexical Analysis:
The role of Lexical Analyzer, Input Buffering,
Specification of Tokens using Regular Expressions,
Review of Finite Automata, Recognition of Tokens.
INTRODUCTION TO COMPILERS
A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language). An important role of the compiler is to report any errors in the source program that it detects during the translation process.
Fig: Compiler (source program in, target program out, with error messages reported during translation)
ANALYSIS OF THE SOURCE PROGRAM
Analysis consists of three phases: Lexical Analysis, Syntax Analysis and Semantic Analysis.
EXAMPLE
In lexical analysis, the characters of the assignment statement position = initial + rate * 60 would be grouped into the following tokens:
1. The identifier position
2. The assignment symbol =
3. The identifier initial
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The number 60
Blanks separating characters of these tokens are normally eliminated during lexical
analysis.
Syntax Analysis
It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. They are represented using a syntax tree.
A syntax tree is the tree generated as a result of syntax analysis in which the interior
nodes are the operators and the exterior nodes are the operands. This analysis shows
an error when the syntax is incorrect.
Semantic Analysis
This phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase.
Here the compiler checks that each operator has operands that are permitted by the
source language specification.
PHASES OF A COMPILER
A compiler operates as a sequence of phases:
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Target Code Generation
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning.
The lexical analyzer reads the stream of characters making up the source program and
groups the characters into meaningful sequences called lexemes.
For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value>.
In the token, the first component token- name is an abstract symbol that is used during
syntax analysis, and the second component attribute-value points to an entry in the
symbol table for this token.
Information from the symbol-table entry is needed for semantic analysis and code generation.
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an
abstract symbol standing for identifier and 1 points to the symbol table entry for
position. The symbol- table entry for an identifier holds information about the
identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token < = >. Since
this token needs no attribute-value, we have omitted the second component.
3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for initial. Similarly, rate is mapped into <id, 3>, while the operators +, * and the number 60 form the remaining tokens.
Thus the assignment statement position = initial + rate * 60 is represented after lexical analysis as the token sequence
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
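To make this concrete, here is a minimal sketch in Python of how a scanner could produce this token stream. The pattern names, the list-based symbol table, and the tuple shape of the tokens are assumptions made for illustration, not the actual implementation.

import re

# Hypothetical token patterns, combined into one master regular expression.
TOKEN_SPEC = [
    ("id",  r"[A-Za-z][A-Za-z0-9]*"),   # identifiers
    ("num", r"\d+"),                    # unsigned integers
    ("op",  r"[=+\-*/]"),               # single-character operators
    ("ws",  r"\s+"),                    # white space, discarded below
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    symtab = []                          # symbol table: identifier names, 1-based
    tokens = []
    for m in MASTER.finditer(src):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                     # blanks are eliminated during lexical analysis
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)    # enter the identifier into the symbol table
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif kind == "num":
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((lexeme,))     # operators need no attribute value
    return tokens, symtab

print(tokenize("position = initial + rate * 60")[0])
# [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('num', 60)]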
Token : Token is a sequence of characters that can be treated as a single logical entity. Typical
tokens are,
Identifiers
keywords
operators
special symbols
constants
Pattern : A set of strings in the input for which the same token is produced as output. This
set of strings is described by a rule called a pattern associated with the token.
Lexeme : A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token.
Syntax Analysis
The second phase of the compiler is syntax analysis or parsing.
The parser uses the first components of the tokens produced by the lexical analyzer to
create a tree-like intermediate representation that depicts the grammatical structure of
the token stream.
The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right child.
The node labeled * makes it explicit that we must first multiply the value of rate by 60.
The node labeled + indicates that we must add the result of this multiplication to the
value of initial.
The root of the tree, labeled =, indicates that we must store the result of this addition
into the location for the identifier position.
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition.
It also gathers type information and saves it in either the syntax tree or the symbol
table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks
that each operator has matching operands.
For example, if the operator is applied to a floating point number and an integer, the
compiler may convert the integer into a floating point number.
In our example, suppose that position, initial, and rate have been declared to be
floating- point numbers, and that the lexeme 60 by itself forms an integer.
Notice that the output of the semantic analyzer has an extra node for the operator inttofloat, which explicitly converts its integer argument into a floating-point number.
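The coercion step can be sketched as follows. This is a minimal Python illustration; the tuple-based tree shape, the check function, and the two-type system (int/float) are assumptions for the example, not a real semantic analyzer.

def check(node, types):
    # Returns (possibly rewritten node, its type); inserts an inttofloat
    # node wherever an integer operand meets a floating-point operand.
    if isinstance(node, int):
        return node, "int"
    if isinstance(node, str):                     # identifier: look up its type
        return node, types[node]
    op, left, right = node
    left, lt = check(left, types)
    right, rt = check(right, types)
    if lt != rt:                                  # mixed int/float operands
        if lt == "int":
            left = ("inttofloat", left)
        else:
            right = ("inttofloat", right)
        lt = "float"
    return (op, left, right), lt

tree = ("=", "position", ("+", "initial", ("*", "rate", 60)))
typed, _ = check(tree, {"position": "float", "initial": "float", "rate": "float"})
print(typed)
# ('=', 'position', ('+', 'initial', ('*', 'rate', ('inttofloat', 60))))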
Intermediate Code Generation
Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis.
After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine. For our example, the intermediate code generator produces the following three-address code:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
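A post-order walk over the syntax tree is one simple way to emit such code. The sketch below (in Python, with an assumed tuple-based tree and a temporary counter passed in a list) reproduces the four instructions above; it is illustrative only.

def gen(node, code, tmp):
    if isinstance(node, (str, int)):
        return str(node)                          # leaf: identifier or constant
    op, *kids = node
    args = [gen(k, code, tmp) for k in kids]      # emit code for children first
    if op == "=":
        code.append(f"{args[0]} = {args[1]}")
        return args[0]
    tmp[0] += 1
    t = f"t{tmp[0]}"                              # fresh temporary for this node
    if op == "inttofloat":
        code.append(f"{t} = inttofloat({args[0]})")
    else:
        code.append(f"{t} = {args[0]} {op} {args[1]}")
    return t

code = []
gen(("=", "id1", ("+", "id2", ("*", "id3", ("inttofloat", 60)))), code, [0])
print("\n".join(code))
# t1 = inttofloat(60)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3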
Code Optimization
The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code will result.
The objectives of optimization are faster execution, shorter code, or target code that consumes less power. In our example, the optimizer can perform the inttofloat conversion of 60 once at compile time and eliminate the copy through t3, leaving:
t1 = id3 * 60.0
id1 = id2 + t1
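One of the transformations involved, folding the conversion of the constant 60 at compile time, can be sketched as follows. This is a minimal Python illustration over the tuple tree used earlier; the copy through t3 would be removed by a separate copy-propagation step.

def fold(node):
    if not isinstance(node, tuple):
        return node
    op, *kids = node
    kids = [fold(k) for k in kids]                # fold the children first
    if op == "inttofloat" and isinstance(kids[0], int):
        return float(kids[0])                     # do the conversion once, now
    return (op, *kids)

print(fold(("*", "id3", ("inttofloat", 60))))     # ('*', 'id3', 60.0)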
Code Generator
The code generator takes as input an intermediate representation of the source
program and maps it into the target language.
If the target language is machine code, registers or memory locations are selected for
each of the variables used by the program.
If the target language is assembly language, this phase generates the assembly code as its output. For our example, the code generator might produce code such as:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
This code loads the contents of address id3 into register R2, then multiplies it by the floating-point constant 60.0 (# marks an immediate constant).
The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2.
Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement position = initial + rate * 60.
Symbol Table
The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name.
These attributes may provide information about the storage allocated for a name, its
type, its scope (where in the program its value may be used), and in the case of
procedure names, such things as the number and types of its arguments, the method
of passing each argument (for example, by value or by reference), and the type
returned.
The data structure should be designed to allow the compiler to find the record for each
name quickly and to store or retrieve data from that record quickly.
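A hash table keyed by name satisfies both requirements. The sketch below is a minimal Python illustration; the particular fields (type, scope, offset) and the 8-byte slot size are assumptions for the example.

symtab = {}                                       # name -> record of attributes

def declare(name, type_, scope="global"):
    symtab[name] = {"type": type_, "scope": scope,
                    "offset": 8 * len(symtab)}    # assumed 8-byte storage slots

def lookup(name):
    return symtab.get(name)                       # average O(1) per lookup

declare("position", "float")
declare("rate", "float")
print(lookup("rate"))
# {'type': 'float', 'scope': 'global', 'offset': 8}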
Each phase of the compiler can encounter errors. However, after detecting an error, a phase must somehow deal with it, so that compilation can proceed, allowing further errors in the source program to be detected.
A compiler that stops when it finds the first error is not a helpful one.
Fig: Translation of the assignment statement position = initial + rate * 60 through the phases
LEXICAL ANALYZER
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
SYNTAX ANALYZER
(syntax tree for the token stream)
SEMANTIC ANALYZER
(syntax tree with the inttofloat node inserted)
INTERMEDIATE CODE GENERATOR
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
CODE OPTIMIZER
t1 = id3 * 60.0
id1 = id2 + t1
CODE GENERATOR
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
GROUPING OF PHASES
The phases of a compiler are grouped into two parts: the Analysis Phase and the Synthesis Phase.
Analysis Phase
Analysis Phase performs 4 actions namely:
a. Lexical analysis
b. Syntax Analysis
c. Semantic analysis
d. Intermediate Code Generation
The analysis part breaks up the source program into constituent pieces and imposes a
grammatical structure on them.
If the analysis part detects that the source program is either syntactically ill formed or
semantically unsound, then it must provide informative messages, so the user can take
corrective action.
The analysis part also collects information about the source program and stores it in a
data structure called a symbol table, which is passed along with the intermediate
representation to the synthesis part.
Synthesis Phase
Synthesis Phase performs 2 actions namely:
a. Code Optimization
b. Code Generation
The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the
back end.
COMPILER WRITING TOOLS
Some commonly used compiler-construction tools are:
1. Parser Generators
2. Scanner Generators
3. Syntax-directed Translation Engines
4. Automatic Code Generators
5. Data-flow Analysis Engines
6. Compiler-construction Toolkits
Parser Generators
Input : Grammatical description of a programming language
Output : Syntax analyzer
These produce syntax analyzers, normally from input that is based on a context-free
grammar.
In early compilers, syntax analysis consumed not only a large fraction of the running
time of a compiler, but a large fraction of the intellectual effort of writing a compiler.
Scanner Generators
Input : Regular expression description of the tokens of a language
Output : Lexical analyzer
The basic organization of the resulting lexical analyzer is in effect a finite automaton.
Syntax-directed Translation Engines
These produce collections of routines that walk the parse tree, generating intermediate code.
The basic idea is that one or more "translations" are associated with each node of the
parse tree, and each translation is defined in terms of translations at its neighbour
nodes in the tree.
Automatic Code Generators
Such a tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine.
The rules must include sufficient detail that we can handle the different possible access
methods for data.
1.1.4.1 BOOTSTRAPPING
Bootstrapping is widely used in compiler design. It is a process in which a simple language is used to translate a more complicated program, which in turn can handle a still more complicated program, and so on.
A bootstrap compiler is used to compile the compiler itself; the compiled compiler can then compile everything else, as well as future versions of itself.
A compiler is characterized by three languages:
1. Source Language (S)
2. Target Language (T)
3. Implementation Language (I)
Notation: S I T represents a compiler for source S and target T, implemented in language I. The same compiler is also depicted as a T-diagram.
For example, take a compiler from L to N written in language S (the first T) and a compiler for S that runs on machine M (the second T). Applying the second T to the first T compiles the first T so that it runs on machine M. The result is thus a compiler from L to N running on machine M.
Cross Compiler:
A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler itself runs. For example, a cross compiler running on a Windows PC can produce object code that runs on an Android smartphone or on Mac OS.
THE ROLE OF THE LEXICAL ANALYZER
Fig: The lexical analyser reads the source program and produces a sequence of tokens.
In some cases, information regarding the kind of identifier may be read from the
symbol table by the lexical analyzer to assist it in determining the proper token it must
pass to the parser.
Commonly, the interaction is implemented by having the parser call the lexical
analyzer.
The call, suggested by the getNextToken command, causes the lexical analyzer to read
characters from its input until it can identify the next lexeme and produce for it the
next token, which it returns to the parser.
Fig: The parser requests the next token from the lexical analyzer (getNextToken), and both components read and update the symbol table.
Besides identifying lexemes, the lexical analyzer performs certain other tasks:
1. Stripping out comments and white space (blank, newline, tab) from the source program.
2. Correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message.
3. If the source program uses a macro-preprocessor, the expansion of macros may also be performed by the lexical analyzer.
There are several reasons for separating the analysis portion of a compiler into lexical analysis and syntax analysis:
Simplicity of Design
The separation of lexical analysis and syntactic analysis often allows us to simplify at least one of these tasks. The syntax analyzer can be smaller and cleaner by removing the low-level details of lexical analysis.
Efficiency
Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized
techniques that serve only the lexical task, not the job of parsing. In addition, specialized
buffering techniques for reading input characters can speed up the compiler significantly.
Portability
Compiler portability is enhanced, since input-device-specific peculiarities can be restricted to the lexical analyzer.
Attributes for Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide additional information about the particular lexeme that matched. This information is carried in the token's attribute value.
The most important example is the token id, where we need to associate with the token
a great deal of information.
Normally, information about an identifier - e.g., its lexeme, its type, and the location
at which it is first found (in case an error message about that identifier must be issued)
- is kept in the symbol table.
Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table
entry for that identifier.
Lexical Errors
A character sequence that can’t be scanned into any valid token is a lexical error.
Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is panic mode: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
This recovery technique may confuse the parser, but in an interactive computing
environment it may be quite adequate.
Other possible error-recovery actions transform the remaining input: deleting one character, inserting a missing character, replacing an incorrect character by a correct one, or transposing two adjacent characters. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation.
Connected with Lexical analysis, there are three important terms with similar
meanings. They are Lexeme, Token and Pattern.
Lexeme: These are the smallest logical units (words) of the program, such as A, B,
1.2, true, if, else, <, = …….
Tokens: They are classes of similar lexemes, such as identifiers, constants, operators, etc. Hence a token is the category to which a lexeme belongs.
Pattern: A rule describing the set of lexemes that can represent a particular token in the source program.
INPUT BUFFERING
The lexical analyzer scans the input using two pointers: a begin pointer (bp), which marks the start of the current lexeme, and a forward pointer (fp), which scans ahead in search of its end.
Initially both pointers point to the first character of the input string.
The forward pointer moves ahead in search of the end of the lexeme; as soon as a blank space is encountered, it indicates the end of the lexeme. For example, while scanning a declaration beginning with int, the moment fp encounters the blank space after it, the lexeme "int" is identified.
When fp encounters white space it skips over it, and then both the begin pointer (bp) and the forward pointer (fp) are set to the start of the next token.
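The two-pointer scan can be sketched in a few lines of Python (indices stand in for bp and fp; treating a blank as the only delimiter is a simplifying assumption):

def next_lexeme(buf, bp):
    while bp < len(buf) and buf[bp] == " ":       # skip leading white space
        bp += 1
    fp = bp
    while fp < len(buf) and buf[fp] != " ":       # fp searches for end of lexeme
        fp += 1
    return buf[bp:fp], fp                         # lexeme, and new scan position

lexeme, pos = next_lexeme("int i = 10;", 0)
print(lexeme)                                     # int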
The input characters are read from secondary storage, but reading character by character from secondary storage is costly. Hence a buffering technique is used: a block of data is first read into a buffer, and then scanned by the lexical analyzer. Two methods are used in this context: the One Buffer Scheme and the Two Buffer Scheme.
One Buffer Scheme
In this scheme, only one buffer is used to store the input string. The problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the lexeme.
Two Buffer Scheme
To overcome the problem of the one buffer scheme, in this method two buffers are used to store the input string. The remaining limitation is that if the length of a single lexeme exceeds the length of a buffer, the input cannot be scanned completely.
Initially both bp and fp point to the first character of the first buffer. Then fp moves towards the right in search of the end of a lexeme. As soon as a blank character is recognized, the string between bp and fp is identified as the corresponding token. To identify the boundary of the first buffer, an end-of-buffer (eof) character is placed at the end of the first buffer.
Similarly, the end of the second buffer is recognized by the end-of-buffer mark present at its end. When fp encounters the first eof, it recognizes the end of the first buffer, and filling of the second buffer begins.
In the same way, when the second eof is encountered it indicates the end of the second buffer. The two buffers are filled alternately in this way until the end of the input program, and the stream of tokens is identified. The eof character introduced at the end of each buffer is called a sentinel, and it is used to identify the end of the buffer.
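The scheme can be sketched as follows. This Python illustration uses indices for the forward pointer, a NUL character as the sentinel, and a small buffer size; all of these are assumptions for the example, with bp omitted for brevity.

N = 16                                             # buffer size (small, for illustration)
EOF = "\0"                                         # the sentinel character

class DoubleBuffer:
    def __init__(self, f):
        self.f = f
        self.buf = [EOF] * (2 * N + 2)             # buffer 1, sentinel, buffer 2, sentinel
        self._load(0)                              # fill buffer 1 to begin
        self.fwd = 0                               # the forward pointer

    def _load(self, half):                         # fill one half, plant its sentinel
        data = self.f.read(N)
        start = half * (N + 1)
        self.buf[start:start + len(data)] = list(data)
        self.buf[start + len(data)] = EOF

    def advance(self):                             # return next char, move fwd ahead
        c = self.buf[self.fwd]
        self.fwd += 1
        if self.buf[self.fwd] == EOF:
            if self.fwd == N:                      # sentinel of buffer 1: reload buffer 2
                self._load(1)
                self.fwd = N + 1
            elif self.fwd == 2 * N + 1:            # sentinel of buffer 2: reload buffer 1
                self._load(0)
                self.fwd = 0
            # otherwise the sentinel marks the real end of the input
        return c                                   # caller treats a returned EOF as end of input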
SPECIFICATION OF TOKENS
Alphabet: An alphabet is a finite nonempty set of symbols. Symbols can be letters or other characters, including digits and special characters.
Eg: Σ = {0,1} is a binary alphabet
Σ = {a, b, c, ….., z} is an alphabet containing the lower-case letters.
The empty sequence of letters is denoted by ε and is called the empty string. The length of the empty string is zero: |ε| = 0.
Language: A language is a set of strings generated from an alphabet; that is, a collection of strings.
Let Σ={a,b}
L1 = set of all strings of length two
= {aa, ab, ba, bb}. So L1 is a finite language.
L2= set of all strings of length 3
= {aaa, aab, aba, abb, baa, bab, bba, bbb}. So L2 is a finite language.
L3= set of all strings where each string starts with a
= {a, aa, ab, aaa, aab, aba, abb, ……..}. So L3 is infinite language.
Therefore a language may be finite or infinite.
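Both finite examples are easy to check mechanically; a quick Python illustration (itertools.product enumerates all strings of a given length over the alphabet):

from itertools import product

sigma = "ab"
L1 = {"".join(p) for p in product(sigma, repeat=2)}   # all strings of length 2
L2 = {"".join(p) for p in product(sigma, repeat=3)}   # all strings of length 3
print(sorted(L1))                                     # ['aa', 'ab', 'ba', 'bb']
print(len(L2))                                        # 8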
REGULAR EXPRESSION
Regular expressions are a notation for specifying the patterns of tokens. Each regular expression r denotes a language L(r). The basic operations are choice, concatenation and repetition:
a) Choice among alternatives: Indicated by the metacharacter | (vertical bar). Let r and s be two regular expressions. Then r|s is a regular expression. In terms of languages, r|s represents the union of the languages represented by r and s.
Eg: Consider L(r) = {a}, L(s) = {b} and L(t) = {c}. What do the following regular expressions represent?
1) r|s
It represents the language containing the symbol 'a' or 'b'.
ie, L(r|s) = L(r) U L(s) = {a} U {b} = {a, b}
2) r|t
It represents the language containing the symbol 'a' or 'c'.
ie, L(r|t) = L(r) U L(t) = {a} U {c} = {a, c}
b) Concatenation: Indicated by writing one regular expression after another. If r and s are regular expressions, then rs is a regular expression, and L(rs) = L(r)L(s), the set of strings formed by concatenating a string of L(r) with a string of L(s).
3) (r|s)t
L((r|s)t) = L(r|s)L(t) = {a, b}{c} = {ac, bc}
c) Repetition: The repetitive operation of a regular expression is called Kleene closure. It is represented by r*.
Eg: Consider L(r) = {a}, L(s) = {b}. What do the following regular expressions represent?
1) r*
It represents the language containing zero or more occurrences of symbols from L(r), ie, L(r*) = {ε, a, aa, aaa, ……}
2) (rs)*
It represents the language containing zero or more occurrences of the string ab, ie, L((rs)*) = {ε, ab, abab, ababab, ……}
3) (r|ss)*
It represents the language containing zero or more occurrences of a or bb, ie, L((r|ss)*) = {ε, a, bb, aa, abb, bba, bbbb, ……}
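Python's re module uses the same | and * metacharacters, so the three operations can be tried directly (fullmatch tests whether a whole string belongs to the language; the patterns below are built for this example):

import re

choice = re.compile(r"a|b")        # L(r|s) = {a, b}
concat = re.compile(r"(a|b)c")     # L((r|s)t) = {ac, bc}
star   = re.compile(r"(ab)*")      # L((rs)*) = {ε, ab, abab, ...}

print(bool(choice.fullmatch("b")))      # True
print(bool(concat.fullmatch("ac")))     # True
print(bool(star.fullmatch("")))         # True: zero occurrences
print(bool(star.fullmatch("abab")))     # True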
Finite Automata
A finite automaton is a recognizer for a language: it takes an input string and answers "yes" if the string is a sentence of the language and "no" otherwise.
RECOGNITION OF TOKENS
EXAMPLE
Consider the following regular definitions for the tokens of a small language:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digits optional-fraction optional-exponent
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and num.
To simplify matters, we make the common assumption that keywords are also reserved words: that is, they cannot be used as identifiers.
num represents the unsigned integer and real numbers of Pascal.
Our lexical analyzer will strip out white space. It will do so by comparing a string
against the regular definition ws, below.
delim → blank | tab | newline
ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser.
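The regular definitions above translate almost directly into a working scanner. The following Python sketch is illustrative only; the pattern set, the keyword check, and the generator interface are assumptions, not a generated lexer.

import re

KEYWORDS = {"if", "then", "else"}
SPEC = [
    ("num",   r"\d+(\.\d+)?(E[+-]?\d+)?"),  # digits optional-fraction optional-exponent
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),     # letter ( letter | digit )*
    ("relop", r"<=|<>|>=|<|>|="),           # longest alternatives first
    ("ws",    r"[ \t\n]+"),                 # delim+ : blank, tab, newline
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

def scan(src):
    for m in MASTER.finditer(src):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                         # ws matched: no token is returned
        if kind == "id" and lexeme in KEYWORDS:
            kind = lexeme                    # keywords are reserved words
        yield kind, lexeme

print(list(scan("if x <= 60 then y = 1.5E2")))
# [('if', 'if'), ('id', 'x'), ('relop', '<='), ('num', '60'),
#  ('then', 'then'), ('id', 'y'), ('relop', '='), ('num', '1.5E2')]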
Transition Diagram
As an intermediate step in the construction of a lexical analyzer, we first produce a flowchart, called a transition diagram.
Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.
The transition diagram keeps track of information about the characters that are seen as the forward pointer scans the input. It does so by moving from position to position in the diagram as characters are read.
1. One state is labelled the start state; it is the initial state of the transition diagram, where control resides when we begin to recognize a token.
2. Positions in a transition diagram are drawn as circles and are called states.
3. The states are connected by arrows called edges. Labels on edges indicate the input characters.
4. Certain states are said to be accepting, or final, states; they indicate that a lexeme has been found.
5. If it is necessary to retract the forward pointer by one character, we place a * near the state to indicate this input retraction.
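As an illustration, the transition diagram for relop can be coded directly as a small function. The state structure below mirrors the diagram; the token attribute names (LT, LE, ...) and the return convention are assumptions for this sketch.

def relop(s, i):
    # Recognize a relational operator starting at s[i].
    # Returns ((token, attribute), index after the lexeme) or None.
    c = s[i] if i < len(s) else ""
    nxt = s[i + 1] if i + 1 < len(s) else ""
    if c == "<":
        if nxt == "=": return ("relop", "LE"), i + 2
        if nxt == ">": return ("relop", "NE"), i + 2
        return ("relop", "LT"), i + 1        # *-state: retract one character
    if c == "=":
        return ("relop", "EQ"), i + 1
    if c == ">":
        if nxt == "=": return ("relop", "GE"), i + 2
        return ("relop", "GT"), i + 1        # *-state: retract one character
    return None                              # fail: try another diagram

print(relop("a<=b", 1))                      # (('relop', 'LE'), 3)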