CST 302 – COMPILER DESIGN
Syllabus
MODULE - I
INTRODUCTION TO COMPILERS
A compiler is a program that can read a program in one language (the
source language) and translate it into an equivalent program in
another language (the target language).
Compilers are sometimes classified as:
Single-pass,
Multi-pass,
Load-and-go,
Debugging, or
Optimizing,
depending on how they have been constructed or on what function they are supposed to perform.
The analysis of the source program consists of three phases:
Lexical Analysis
Syntax Analysis
Semantic Analysis
Lexical Analysis
In a compiler, linear analysis is called lexical analysis or scanning.
The lexical analysis phase reads the characters in the source program and groups them into tokens: sequences of characters having a collective meaning.
EXAMPLE
position = initial + rate * 60
In this statement, lexical analysis groups the characters into the lexemes position, =, initial, +, rate, *, and 60. (Later, the compiler checks that each operator has operands that are permitted by the source-language specification.)
PHASES OF A COMPILER
The phases include:
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Target Code Generation
Lexical Analysis
The first phase of a compiler is called lexical analysis or scanning.
The lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called
lexemes.
For each lexeme, the lexical analyzer produces as output a token of the form
<token-name, attribute-value>
In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token.
Information from the symbol-table entry is needed for semantic
analysis and code generation.
For the statement position = initial + rate * 60:
1. position is a lexeme that is mapped into the token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.
2. The assignment symbol = is a lexeme that is mapped into the token <=>. Since this token needs no attribute value, we have omitted the second component.
3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
After lexical analysis, the statement is passed on as the token stream
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
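A minimal C sketch of this mapping: it tokenizes exactly the statement above and prints the token stream shown, with the symbol table reduced to a fixed array (install_id and the table layout are illustrative assumptions, not the textbook's design).

    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>

    static char symtab[16][32];          /* toy symbol table; index = attribute value */
    static int nsyms = 1;                /* entry 0 unused, so identifiers start at 1 */

    static int install_id(const char *name) {   /* return existing or new table index */
        for (int i = 1; i < nsyms; i++)
            if (strcmp(symtab[i], name) == 0) return i;
        strcpy(symtab[nsyms], name);
        return nsyms++;
    }

    int main(void) {
        const char *p = "position = initial + rate * 60";
        char buf[32];
        int n;
        while (*p != '\0') {
            if (isspace((unsigned char)*p)) { p++; continue; }   /* strip white space */
            if (isalpha((unsigned char)*p)) {                    /* id: letter(letter|digit)* */
                for (n = 0; isalnum((unsigned char)*p); p++) buf[n++] = *p;
                buf[n] = '\0';
                printf("<id, %d> ", install_id(buf));
            } else if (isdigit((unsigned char)*p)) {             /* num: digit+ */
                for (n = 0; isdigit((unsigned char)*p); p++) buf[n++] = *p;
                buf[n] = '\0';
                printf("<%s> ", buf);
            } else {
                printf("<%c> ", *p++);                           /* single-character operator */
            }
        }
        printf("\n");
        return 0;
    }

Compiling and running this prints the token stream shown above.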
Token
A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
Identifiers
Keywords
Operators
Special symbols
Constants
Pattern:
A set of strings in the input for which the same token is produced as
output.
This set of strings is described by a rule called a pattern associated
with the token.
Lexeme:
A lexeme is a sequence of characters in the source program that is
matched by the pattern for a token.
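For example, using patterns that appear later in this module:

    Token    Pattern                                          Sample lexemes
    id       letter ( letter | digit )*                       position, initial, rate
    relop    < | <= | = | <> | > | >=                         <, <=
    num      digit+ ( . digit+ )? ( E ( + | - )? digit+ )?    60, 3.14, 6.02E23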
Syntax Analysis
The second phase of the compiler is syntax analysis or parsing.
The parser uses the first components of the tokens produced by the
lexical analyzer to create a tree-like intermediate representation that
depicts the grammatical structure of the token stream.
A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the
arguments of the operation.
The syntax tree for the above token stream is:
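(rendered here in text form)

              =
             / \
       <id,1>   +
               / \
         <id,2>   *
                 / \
           <id,3>   60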
The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right child.
The node labeled * makes it explicit that we must first multiply the
value of rate by 60.
The node labeled + indicates that we must add the result of this
multiplication to the value of initial.
The root of the tree, labeled =, indicates that we must store the result
of this addition into the location for the identifier position.
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in
the symbol table to check the source program for semantic
consistency with the language definition.
It also gathers type information and saves it in either the syntax tree
or the symbol table, for subsequent use during intermediate-code
generation.
An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands.
For example, many programming language definitions require an
array index to be an integer; the compiler must report an error if a
floating-point number is used to index an array.
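A small C illustration of this rule (C likewise requires an integer subscript; the variable names are arbitrary):

    int a[10];
    float f = 2.5f;
    a[f] = 0;   /* rejected at compile time: array subscript is not an integer */
    a[2] = 0;   /* accepted: the index is an integer */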
Some type conversion is also done by the semantic analyzer.
In our example, suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an integer.
The type checker then discovers that the operator * is applied to a floating-point number rate and an integer 60; the integer may be converted into a floating-point number, which the compiler makes explicit by applying an inttofloat operator to 60.
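The effect can be written in three-address intermediate code, using the conventional textbook operator inttofloat to make the conversion explicit (id1, id2, id3 refer to the symbol-table entries for position, initial, and rate):

    t1 = inttofloat(60)
    t2 = id3 * t1
    t3 = id2 + t2
    id1 = t3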
Symbol-Table Management
The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name.
The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly.
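A minimal sketch of such a data structure in C: a chained hash table keyed on the name. The function names lookup and insert and the fixed table size are illustrative assumptions, not a prescribed design.

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 211                   /* prime number of hash chains */

    struct entry {                         /* one record per name */
        char *name;
        int   type;                        /* example attribute: a type code */
        struct entry *next;                /* chain of names that collide */
    };

    static struct entry *bucket[NBUCKETS];

    static unsigned hash(const char *s) {  /* simple multiplicative string hash */
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Find the record for a name quickly, or return NULL if absent. */
    struct entry *lookup(const char *name) {
        for (struct entry *e = bucket[hash(name)]; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0) return e;
        return NULL;
    }

    /* Create and store a record for a new name; returns the record. */
    struct entry *insert(const char *name, int type) {
        unsigned h = hash(name);
        struct entry *e = malloc(sizeof *e);
        e->name = malloc(strlen(name) + 1);
        strcpy(e->name, name);
        e->type = type;
        e->next = bucket[h];
        bucket[h] = e;
        return e;
    }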
Error Detection And Reporting
Each phase can encounter errors.
A compiler that stops when it finds the first error is not a helpful one.
For example, when the parser finds a syntax error such as a missing semicolon, it should report the error with its position and then recover, so that it can continue checking the rest of the program.
GROUPING OF PHASES
The process of compilation is split into the following two parts:
Analysis Phase
Synthesis Phase
Analysis Phase
1. Lexical analysis
2. Syntax Analysis
3. Semantic analysis
Synthesis Phase
1. Code Optimization
2. Code Generation
Analysis Phase
The analysis part breaks up the source program into constituent
pieces and imposes a grammatical structure on them.
The analysis part also collects information about the source program
and stores it in a data structure called a symbol table, which is
passed along with the intermediate representation to the synthesis
part.
Synthesis Phase
The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol table.
The analysis part is often called the front end of the compiler; the synthesis part is the back end.
COMPILER CONSTRUCTION TOOLS
Some commonly used compiler-construction tools include:
1. Parser Generators
2. Scanner Generators
3. Syntax-directed translation engine
4. Automatic code generators
5. Data-flow analysis Engines
6. Compiler Construction toolkits
Parser Generators
Input : Grammatical description of a programming language
Output : Syntax analyzers
Scanner Generators
Input : Regular-expression description of the tokens of a language
Output : Lexical analyzers
Syntax-directed Translation Engines
Input : Parse tree
Output : Intermediate code
Automatic Code Generators
Input : Intermediate language
Output : Machine language
Such a tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine.
The rules must include sufficient detail that we can handle the different possible access methods for data.
Data-flow Analysis Engines
A data-flow analysis engine gathers information about how values are transmitted from one part of a program to each of the other parts.
BOOTSTRAPPING
A bootstrap compiler is used to compile the compiler itself; the compiled compiler can then be used to compile everything else, as well as future versions of itself.
A compiler is characterized by three languages:
Source Language
Target Language
Implementation Language
Example: a PASCAL translator that produces C takes Pascal code as input and emits C code as output; Pascal is the source language and C is the target language.
THE ROLE OF THE LEXICAL ANALYZER
As the first phase of a compiler, the main task of the lexical analyzer
is to read the input characters of the source program, group them
into lexemes, and produce as output a sequence of tokens for each
lexeme in the source program.
There are several reasons for separating lexical analysis from parsing:
Efficiency
Compiler efficiency is improved.
A separate lexical analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of parsing.
In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.
Portability
Compiler portability is enhanced. Input-device-specific peculiarities
can be restricted to the lexical analyzer.
Attributes For Tokens
Sometimes a token needs to be associated with several pieces of information.
INPUT BUFFERING
A pair of buffers is used to hold the input data (sketched in code below).
Scheme
Consists of two buffers, each of N characters, which are reloaded alternately.
N characters are read from the input file into a buffer using one system read command.
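A C sketch of the scheme with sentinels, one placed after each half, so that most pointer advances need only a single test. We assume '\0' does not occur in source text and use it as the sentinel; the textbook uses a special eof character instead. load, next_char, and init are illustrative names.

    #include <stdio.h>

    #define N 4096                         /* buffer half size = one disk block */

    static char buf[2 * N + 2];            /* two N-byte halves, a sentinel after each */
    static char *forward;                  /* scanning pointer */
    static FILE *src;                      /* source file */

    static void load(char *half) {         /* one system read fills one half */
        size_t n = fread(half, 1, N, src);
        half[n] = '\0';                    /* sentinel marks end of valid data */
    }

    void init(FILE *f) { src = f; load(buf); forward = buf; }

    /* Return the next input character, reloading a half when its sentinel is reached. */
    int next_char(void) {
        char c = *forward++;
        if (c != '\0') return (unsigned char)c;     /* common case: one test only */
        if (forward == buf + N + 1) {               /* sentinel ending the first half */
            load(buf + N + 1);
            forward = buf + N + 1;
            return next_char();
        }
        if (forward == buf + 2 * N + 2) {           /* sentinel ending the second half */
            load(buf);
            forward = buf;
            return next_char();
        }
        return EOF;                                 /* sentinel inside a half: end of input */
    }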
SPECIFICATION OF TOKENS
Language : a set of strings over some fixed alphabet.
Regular expression : a notation for specifying such sets of strings precisely.
Concatenation : LM denotes the set of strings formed by appending a string of M to a string of L.
Kleene closure : L* denotes zero or more concatenations of L.
Positive closure : L+ denotes one or more concatenations of L.
Regular Expressions
Regular expressions allow us to define precisely the sets of strings that form tokens.
E.g., identifiers can be described by letter ( letter | digit )*
1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty string.
2. If a is a symbol in the alphabet Σ, then a is a regular expression that denotes {a}, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
(r) | (s) is a regular expression denoting L(r) ∪ L(s),
(r) (s) is a regular expression denoting L(r) L(s),
(r)* is a regular expression denoting ( L(r) )*, and
(r) is a regular expression denoting L(r).
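For example, over the alphabet Σ = {a, b}:
a | b denotes the set {a, b},
(a | b)(a | b) denotes {aa, ab, ba, bb},
a* denotes {ε, a, aa, aaa, ...}, and
(a | b)* denotes the set of all strings of a's and b's, including the empty string.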
EXAMPLE
Assume the following grammar fragment to generate a specific language:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num
where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions:
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
where letter and digit are as defined previously.
For this language, the lexical analyzer will recognize the keywords if
, then, and else, as well as lexemes that match the patterns for
relop, id, and number.
To simplify matters, we make the common assumption that keywords
are also reserved words: that is they cannot be used as identifiers.
num represents the unsigned integers and real numbers of Pascal.
In addition, we assume lexemes are separated by white space,
consisting of non-null sequences of blanks, tabs, and newlines.
Our lexical analyzer will strip out white space.
It will do so by comparing a string against the regular definition ws, below:
delim → blank | tab | newline
ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the parser; rather, it proceeds to find a token following the white space and returns that to the parser.
Transition Diagram
As an intermediate step in the construction of a lexical analyzer, we first produce a stylized flowchart, called a transition diagram.
Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.
The transition diagram keeps track of information about characters that are seen as the forward pointer scans the input.
It does this by moving from position to position in the diagram as characters are read.
COMPONENTS OF TRANSITION DIAGRAM
States, drawn as circles, represent conditions that may occur while scanning the input.
Edges, drawn as labeled arrows, show the transition taken from one state to another on the indicated input character.
One state is marked as the start state, where control resides when scanning begins.
Accepting (final) states, drawn as double circles, indicate that a lexeme has been found.
A * attached to an accepting state indicates that the forward pointer must be retracted by one position.
Transition diagram for identifiers:
(9) --letter--> (10)    (10) --letter or digit--> (10)    (10) --other--> (11)*
where 9 is the start state and 11 is the accepting state, marked * because the character that ended the identifier must be given back.
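A C sketch that simulates this diagram directly, using the state numbering above (match_id is an illustrative name):

    #include <ctype.h>

    /* Simulate the identifier transition diagram on the string s.
       Returns the length of the identifier lexeme, or 0 if s does
       not begin with an identifier. */
    int match_id(const char *s) {
        const char *forward = s;
        int state = 9;                          /* start state of the diagram */
        for (;;) {
            switch (state) {
            case 9:                             /* expect a letter */
                if (isalpha((unsigned char)*forward)) { forward++; state = 10; }
                else return 0;                  /* no identifier begins here */
                break;
            case 10:                            /* loop on letter or digit */
                if (isalnum((unsigned char)*forward)) forward++;
                else state = 11;                /* "other": the diagram consumes it and
                                                   the * retracts it - net no movement */
                break;
            case 11:                            /* accepting state */
                return (int)(forward - s);      /* length of the identifier lexeme */
            }
        }
    }

For the input rate * 60, match_id returns 4, the length of the lexeme rate.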
Transition diagram for unsigned numbers in Pascal
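This diagram can also be coded directly. The sketch below recognizes the pattern digit+ ( . digit+ )? ( E ( + | - )? digit+ )? and, like the diagram, backs off when an optional part fails to match (match_num is an illustrative name):

    #include <ctype.h>

    /* Returns the length of the longest unsigned number at the start of s,
       or 0 if s does not begin with a digit. */
    int match_num(const char *s) {
        const char *p = s;
        if (!isdigit((unsigned char)*p)) return 0;
        while (isdigit((unsigned char)*p)) p++;            /* digit+ */
        if (p[0] == '.' && isdigit((unsigned char)p[1])) { /* optional fraction */
            p += 2;
            while (isdigit((unsigned char)*p)) p++;
        }
        if (p[0] == 'E') {                                 /* optional exponent */
            const char *q = p + 1;
            if (*q == '+' || *q == '-') q++;
            if (isdigit((unsigned char)*q)) {
                while (isdigit((unsigned char)*q)) q++;
                p = q;                                     /* exponent matched */
            }                                              /* else retract: p stays at 'E' */
        }
        return (int)(p - s);
    }

For example, match_num("6.02E23") returns 7, while match_num("12E+") returns 2, giving the trailing E+ back to the input.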