
CST 302 – COMPILER DESIGN

Syllabus

MODULE - I
INTRODUCTION TO COMPILERS
 A compiler is a program that can read a program in one language (the
source language) and translate it into an equivalent program in
another language (the target language).

 An important role of the compiler is to report any errors in the source program that it detects during the translation process.
 Compilers are sometimes classified as
 Single pass,

 Multi-pass,

 Load-and-go,

 Debugging, or

 Optimizing,

depending on how they have been constructed or on what function they are supposed to perform.
ANALYSIS OF THE SOURCE PROGRAM

 In compiling, analysis consists of three phases:

 Lexical Analysis
 Syntax Analysis
 Semantic Analysis
Lexical Analysis
 In a compiler linear analysis is called lexical analysis or scanning.

 The lexical analysis phase reads the characters in the source program and groups them into tokens: sequences of characters having a collective meaning.
EXAMPLE
position = initial + rate * 60

 This can be grouped into the following tokens:

1. The identifier position
2. The assignment symbol =
3. The identifier initial
4. The plus sign +
5. The identifier rate
6. The multiplication sign *
7. The number 60

 Blanks separating characters of these tokens are normally eliminated during lexical analysis.
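As an illustration (not part of the original slides), a minimal Python sketch of this grouping step might look like the following; the token names are chosen for this example only.

import re

# Hypothetical token patterns for the example statement; a real scanner
# would cover the whole language, not just these token classes.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("TIMES",  r"\*"),
    ("SKIP",   r"\s+"),          # blanks are eliminated, as noted above
]

def tokenize(source):
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    for m in re.finditer(pattern, source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("position = initial + rate * 60")))
# [('ID', 'position'), ('ASSIGN', '='), ('ID', 'initial'), ('PLUS', '+'),
#  ('ID', 'rate'), ('TIMES', '*'), ('NUMBER', '60')]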
Syntax Analysis
 Hierarchical Analysis is called parsing or syntax analysis.

 It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output.

 They are represented using a syntax tree.

 A syntax tree is the tree generated as a result of syntax analysis, in which the interior nodes are the operators and the exterior nodes are the operands.

 This analysis reports an error when the syntax is incorrect.

position := initial + rate * 60
Semantic Analysis
 This phase checks the source program for semantic errors and
gathers type information for subsequent code generation phase.

 An important component of semantic analysis is type checking.

 Here the compiler checks that each operator has operands that are
permitted by the source language specification.
PHASES OF A COMPILER
 The phases include:

 Lexical Analysis
 Syntax Analysis
 Semantic Analysis
 Intermediate Code Generation
 Code Optimization
 Target Code Generation
Lexical Analysis
 The first phase of a compiler is called lexical analysis or scanning.

 The lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called
lexemes.

 For each lexeme, the lexical analyzer produces as output a token of the form

<token-name, attribute-value>

that it passes on to the subsequent phase, syntax analysis.

In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token.
 Information from the symbol-table entry is needed for semantic
analysis and code generation.

 For example, suppose a source program contains the assignment statement

position = initial + rate * 60

 The characters in this assignment could be grouped into the following lexemes and mapped into the following tokens, which are passed on to the syntax analyzer:

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>

position = initial + rate * 60

1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position.

 The symbol-table entry for an identifier holds information about the identifier, such as its name and type.

2. The assignment symbol = is a lexeme that is mapped into the token <=>.

 Since this token needs no attribute-value, we have omitted the second component.

3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to the symbol-table entry for initial.

4. + is a lexeme that is mapped into the token <+>.

5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the symbol-table entry for rate.

6. * is a lexeme that is mapped into the token <*>.

7. 60 is a lexeme that is mapped into the token <60>.

 Blanks separating the lexemes would be discarded by the lexical analyzer.

 The representation of the assignment statement position = initial + rate * 60 after lexical analysis is the sequence of tokens:

<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Token
 Token is a sequence of characters that can be treated as a single
logical entity.

 Typical tokens are

 Identifiers
 keywords
 operators
 special symbols
 constants
Pattern:
 A set of strings in the input for which the same token is produced as
output.
 This set of strings is described by a rule called a pattern associated
with the token.

Lexeme:
 A lexeme is a sequence of characters in the source program that is
matched by the pattern for a token.
Syntax Analysis
 The second phase of the compiler is syntax analysis or parsing.
 The parser uses the first components of the tokens produced by the
lexical analyzer to create a tree-like intermediate representation that
depicts the grammatical structure of the token stream.
 A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the
arguments of the operation.
 The syntax tree for the above token stream is:
 The tree has an interior node labeled * with <id, 3> as its left child and the integer 60 as its right child.

 The node <id, 3> represents the identifier rate.

 The node labeled * makes it explicit that we must first multiply the
value of rate by 60.

 The node labeled + indicates that we must add the result of this
multiplication to the value of initial.

 The root of the tree, labeled =, indicates that we must store the result
of this addition into the location for the identifier position.
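As a rough sketch (an assumed representation, not the slides' own code), the tree just described can be written in Python as nested nodes:

class Node:
    def __init__(self, label, *children):
        self.label = label            # operator, identifier entry, or constant
        self.children = children

# Syntax tree for: position = initial + rate * 60
tree = Node("=",
            Node("id1"),              # position
            Node("+",
                 Node("id2"),         # initial
                 Node("*",
                      Node("id3"),    # rate
                      Node("60"))))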
Semantic Analysis
 The semantic analyzer uses the syntax tree and the information in
the symbol table to check the source program for semantic
consistency with the language definition.
 It also gathers type information and saves it in either the syntax tree
or the symbol table, for subsequent use during intermediate-code
generation.
 An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands.
 For example, many programming language definitions require an
array index to be an integer; the compiler must report an error if a
floating-point number is used to index an array.
 Some sort of type conversion is also done by the semantic analyzer.

 For example, if an operator is applied to a floating-point number and an integer, the compiler may convert the integer into a floating-point number.

 In our example, suppose that position, initial, and rate have been
declared to be floating- point numbers, and that the lexeme 60 by
itself forms an integer.

 The semantic analyzer discovers that the operator * is applied to a floating-point number rate and an integer 60.

 In this case, the integer may be converted into a floating-point number.
 In the following figure, notice that the output of the semantic analyzer has an extra node for the operator inttofloat, which explicitly converts its integer argument into a floating-point number.
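A minimal sketch of this check and conversion, reusing the Node class sketched earlier (the symbol-table layout used here is an assumption):

def check_types(node, types):
    """Toy type checker: returns the type of a subtree and splices in an
    inttofloat node where an integer operand meets a floating-point one."""
    if not node.children:                          # leaf: identifier or constant
        return "int" if node.label.isdigit() else types[node.label]
    left, right = node.children
    lt, rt = check_types(left, types), check_types(right, types)
    if lt != rt:                                   # e.g. float * int
        if rt == "int":
            node.children = (left, Node("inttofloat", right))
        else:
            node.children = (Node("inttofloat", left), right)
        return "float"
    return lt

check_types(tree, {"id1": "float", "id2": "float", "id3": "float"})
# The * node now has children id3 and inttofloat(60).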
Intermediate Code Generation
 In the process of translating a source program into target code, a
compiler may construct one or more intermediate representations,
which can have a variety of forms.

 Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis.

 After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine.

 This intermediate representation should have two important properties:
 It should be simple and easy to produce
 It should be easy to translate into the target machine
 In our example, the intermediate representation used is three-address code, which consists of a sequence of assembly-like instructions with three operands per instruction.
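A simplified Python sketch of such a generator, walking the (type-checked) tree from the earlier sketches; the naming of temporaries t1, t2, ... is an assumption in the spirit of the textbook's running example:

from itertools import count

def gen_code(node, code, new_temp=None):
    """Emit three-address instructions for an expression tree and return the
    name that holds the node's value (a toy sketch, not a real back end)."""
    if new_temp is None:
        new_temp = count(1)
    if not node.children:                          # leaf: identifier or constant
        return node.label
    if node.label == "=":                          # assignment: store into target
        target, expr = node.children
        code.append(f"{target.label} = {gen_code(expr, code, new_temp)}")
        return target.label
    args = [gen_code(c, code, new_temp) for c in node.children]
    temp = f"t{next(new_temp)}"
    if node.label == "inttofloat":                 # unary conversion operator
        code.append(f"{temp} = inttofloat({args[0]})")
    else:                                          # binary operator: +, *, ...
        code.append(f"{temp} = {args[0]} {node.label} {args[1]}")
    return temp

code = []
gen_code(tree, code)          # tree from the earlier sketches, after check_types
# code is now: ['t1 = inttofloat(60)', 't2 = id3 * t1',
#               't3 = id2 + t2', 'id1 = t3']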
Code Optimization
 The machine-independent code-optimization phase attempts to
improve the intermediate code so that better target code will result.

 The objectives for performing optimization are: faster execution, shorter code, or target code that consumes less power.

 In our example, the optimized code is:
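The figure is not reproduced here; in the textbook's running example the optimizer typically shortens the three-address code to:

t1 = id3 * 60.0
id1 = id2 + t1

Here the conversion of 60 to the floating-point 60.0 is done once at compile time, and the temporary that held the sum is eliminated.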


Code Generator
 The code generator takes as input an intermediate representation of
the source program and maps it into the target language.

 If the target language is machine code, registers or memory locations are selected for each of the variables used by the program.

 Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task.

 A crucial aspect of code generation is the judicious assignment of registers to hold variables.

 If the target language is assembly language, this phase generates the assembly code as its output.
 In our example, the code generated is:
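The figure is not reproduced here; in the textbook's running example the generated code is typically the following assembly-like sequence, which matches the description in the bullets below:

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1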

 The first operand of each instruction specifies a destination.


 The F in each instruction tells us that it deals with floating-point
numbers.
 The above code loads the contents of address id3 into register R2,
then multiplies it with floating-point constant 60.0.
 The # signifies that 60.0 is to be treated as an immediate constant.
 The third instruction moves id2 into register R1 and the fourth adds to
it the value previously computed in register R2.

 Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement position = initial + rate * 60.
Symbol Table
 An essential function of a compiler is to record the variable names
used in the source program and collect information about various
attributes of each name.

 These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used), and, in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned.

 The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name.

 The data structure should be designed to allow the compiler to find the
record for each name quickly and to store or retrieve data from that
record quickly.
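A minimal sketch of such a structure in Python (the attribute names are illustrative assumptions); a hash table gives the fast lookup the slide asks for:

class SymbolTable:
    """One record per name; insertion and lookup are O(1) on average."""
    def __init__(self):
        self._records = {}                    # name -> dictionary of attributes

    def insert(self, name, **attributes):
        self._records.setdefault(name, {}).update(attributes)

    def lookup(self, name):
        return self._records.get(name)        # None if the name is unknown

symtab = SymbolTable()
symtab.insert("position", type="float", scope="global")
print(symtab.lookup("position"))              # {'type': 'float', 'scope': 'global'}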
Error Detection And Reporting
 Each phase can encounter errors.

 However, after detecting an error, a phase must somehow deal with that error, so that compilation can proceed, allowing further errors in the source program to be detected.

 A compiler that stops when it finds the first error is not a helpful one.
Example
GROUPING OF PHASES
 The process of compilation is split up into following phases:

 Analysis Phase

 Synthesis phase
 Analysis Phase
1. Lexical analysis

2. Syntax Analysis

3. Semantic analysis

4. Intermediate Code Generation

 Synthesis phase
1. Code Optimization

2. Code Generation
Analysis Phase
 The analysis part breaks up the source program into constituent
pieces and imposes a grammatical structure on them.

 It then uses this structure to create an intermediate representation of the source program.

 If the analysis part detects that the source program is either syntactically ill formed or semantically unsound, then it must provide informative messages, so the user can take corrective action.

 The analysis part also collects information about the source program
and stores it in a data structure called a symbol table, which is
passed along with the intermediate representation to the synthesis
part.
Synthesis Phase
 The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol table.

 The analysis part is often called the front end of the compiler;

 The synthesis part is the back end.


COMPILER WRITING TOOLS
 Compiler writers use software development tools and more
specialized tools for implementing various phases of a compiler.

 Some commonly used compiler construction tools include the following:

1. Parser Generators
2. Scanner Generators
3. Syntax-directed Translation Engines
4. Automatic Code Generators
5. Data-flow Analysis Engines
6. Compiler Construction Toolkits
Parser Generators
 Input : Grammatical description of a programming language
 Output : Syntax analyzers

 These produce syntax analyzers, normally from input that is based on a context-free grammar.

 In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but a large fraction of the intellectual effort of writing a compiler.

 With a parser generator, this phase is now one of the easiest to implement.


Scanner Generators
 Input : Regular expression description of the tokens of a language
 Output : Lexical analyzers.

 These automatically generate lexical analyzers, normally from a specification based on regular expressions.
 The basic organization of the resulting lexical analyzer is in effect a
finite automaton.
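Lex itself is not shown here; as a rough Python analogue (an assumption, not Lex syntax), a generator can turn a list of (token name, regular expression) pairs into a working scanner:

import re

def make_scanner(spec):
    """Build a scanning function from (token-name, regex) pairs; a toy
    stand-in for what a scanner generator produces."""
    master = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in spec))
    def scan(text):
        return [(m.lastgroup, m.group())
                for m in master.finditer(text) if m.lastgroup != "WS"]
    return scan

scan = make_scanner([("NUM", r"\d+"), ("ID", r"[A-Za-z]\w*"), ("WS", r"\s+")])
print(scan("rate 60"))        # [('ID', 'rate'), ('NUM', '60')]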
Syntax-directed Translation Engines
 Input : Parse tree.
 Output : Intermediate code.

 These produce collections of routines that walk the parse tree, generating intermediate code.
 The basic idea is that one or more "translations" are associated with
each node of the parse tree, and each translation is defined in terms
of translations at its neighbor nodes in the tree.
Automatic Code Generators
 Input : Intermediate language.
 Output : Machine language.

 Such a tool takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine.

 The rules must include sufficient detail that we can handle the
different possible access methods for data.
Data-flow Analysis Engines
 A data-flow analysis engine gathers information about how values are transmitted from one part of a program to each other part.

 Data-flow analysis is a key part of code optimization.


BOOTSTRAPPING
 Bootstrapping is widely used in compiler development.

 Bootstrapping is used to produce a self-hosting compiler.

 A self-hosting compiler is a compiler that can compile its own source code.

 A bootstrap compiler is used to compile the compiler; the compiled compiler can then be used to compile everything else, as well as future versions of itself.

 A compiler is characterized by three languages:


 Source Language

 Target Language

 Implementation Language
 PASCAL TRANSLATOR – C Language
 Pascal code – input
 C – Output

Create Pascal translator in C++ ?


LEXICAL ANALYSIS
 ROLE OF LEXICAL ANALYSIS

 As the first phase of a compiler, the main task of the lexical analyzer
is to read the input characters of the source program, group them
into lexemes, and produce as output a sequence of tokens for each
lexeme in the source program.

 The stream of tokens is sent to the parser for syntax analysis.


 Lexical Analyzer also interacts with the symbol table.

 When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

 In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.

 These interactions are shown in the figure.


 Commonly, the interaction is implemented by having the parser call
the lexical analyzer.

 The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
Other tasks of Lexical Analyzer
1. Stripping out comments and whitespace (blank, newline, tab, and
perhaps other characters that are used to separate tokens in the
input).

2. Correlating error messages generated by the compiler with the source program.
 For instance, the lexical analyzer may keep track of the number
of newline characters seen, so it can associate a line number with
each error message.

3. If the source program uses a macro preprocessor, the expansion of macros may also be performed by the lexical analyzer.
Reasons why lexical analysis is separated
from syntax analysis
Simplicity Of Design
 The separation of lexical analysis and syntactic analysis often allows
us to simplify at least one of these tasks.
 The syntax analyzer can be made smaller and cleaner by removing the low-level details of lexical analysis.

Efficiency
 Compiler efficiency is improved.
 A separate lexical analyzer allows us to apply specialized techniques
that serve only the lexical task, not the job of parsing.
 In addition, specialized buffering techniques for reading input
characters can speed up the compiler significantly.

Portability
 Compiler portability is enhanced. Input-device-specific peculiarities
can be restricted to the lexical analyzer.
Attributes For Tokens
 Sometimes a token needs to be associated with several pieces of information.

 The most important example is the token id, where we need to associate with the token a great deal of information.

 Normally, information about an identifier (its lexeme, its type, and the location at which it is first found, in case an error message about that identifier must be issued) is kept in the symbol table.

 Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.
Question
1. COBOL TRANSLATOR – C++ Language
C – Output

Create COBOL translator in Java?

2. Identify the tokens in the expression a = b + c - 10 and draw the outputs after syntax and semantic analysis.
Lexical Errors
 A character sequence that can’t be scanned into any valid token is a
lexical error.

 Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.

 The simplest recovery strategy is "panic mode" recovery.

 We delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.

 This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.
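A minimal sketch of panic-mode recovery in Python (the token pattern here is an assumption for illustration):

import re

def panic_mode_skip(text, pos, token_pattern):
    """Discard characters from the remaining input until some token
    pattern matches at the current position (panic-mode recovery)."""
    while pos < len(text) and not token_pattern.match(text, pos):
        pos += 1                      # delete one offending character
    return pos

tokens = re.compile(r"[A-Za-z]\w*|\d+|[+*=-]")
print(panic_mode_skip("@# rate", 0, tokens))    # 3: scanning resumes at 'rate'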
 Other possible error-recovery actions are:

1. Delete one character from the remaining input.


2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.

 Transformations like these may be tried in an attempt to repair the input.

 The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation.

 A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes.
INPUT BUFFERING
 To ensure that a right lexeme is found, one or more characters have to be
looked up beyond the next lexeme.
 Hence a two-buffer scheme is introduced to handle large lookaheads safely.
 Techniques for speeding up the process of lexical analyzer such as the use
of sentinels to mark the buffer end have been adopted.

 There are three general approaches to the implementation of a lexical analyzer:

 By using a lexical-analyzer generator, such as the Lex compiler, to produce the lexical analyzer from a regular-expression-based specification.
 In this case, the generator provides routines for reading and buffering the input.

 By writing the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.

 By writing the lexical analyzer in assembly language and explicitly managing the reading of input.
Buffer Pairs
 There are times when the lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced. Some lexical analyzers use a function ungetc to push lookahead characters back into the input stream. Because of the large amount of time spent moving characters, specialized buffering techniques have been developed to reduce the overhead required to process an input character.

 The figure shows the buffer pair used to hold the input data.
Scheme
 The scheme consists of two buffers, each of N characters, which are reloaded alternately.

 N is the number of characters in one disk block.

 N characters are read from the input file to the buffer using one
system read command.

 eof is inserted at the end if the number of characters is less than N.


Pointers
 Two pointers lexemeBegin and forward are maintained.

 lexemeBegin points to the beginning of the current lexeme, which is yet to be found.

 forward scans ahead until a match for a pattern is found.

 Once a lexeme is found, forward is set to the character at its right end, and lexemeBegin is then set to the character immediately after the lexeme just found.
 The current lexeme is the set of characters between the two pointers.
 If the forward pointer is about to move past the halfway mark, the
right half is filled with N new input characters. If the forward pointer
is about to move past the right end of the buffer, the left half is filled
with N new characters and the forward pointer wraps around to the
beginning of the buffer.
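A Python sketch of the scheme (the buffer size, sentinel value, and method names are assumptions; lexemeBegin handling is omitted for brevity):

import io

EOF = "\0"                     # sentinel assumed not to occur in source text
N = 4096                       # characters per reload, e.g. one disk block

class BufferPair:
    """Two N-character halves, each followed by a sentinel (eof) slot."""
    def __init__(self, stream):
        self.stream = stream
        self.buf = [EOF] * (2 * N + 2)
        self._reload(0)                          # fill the first half to start
        self.forward = 0

    def _reload(self, start):
        data = self.stream.read(N)
        self.buf[start:start + N] = list(data) + [EOF] * (N - len(data))
        self.buf[start + N] = EOF                # sentinel at the end of the half

    def next_char(self):
        c = self.buf[self.forward]
        if c == EOF:
            if self.forward == N:                # end of first half: refill second
                self._reload(N + 1)
                self.forward = N + 1
                return self.next_char()
            if self.forward == 2 * N + 1:        # end of second half: wrap around
                self._reload(0)
                self.forward = 0
                return self.next_char()
            return None                          # eof inside a half: input is over
        self.forward += 1
        return c

src = BufferPair(io.StringIO("position = initial + rate * 60"))
print(src.next_char())         # 'p'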
Disadvantages Of This Scheme
 This scheme works well most of the time, but the amount of
lookahead is limited.

 This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.

 e.g. DECLARE (ARG1, ARG2, . . . , ARGn) in a PL/1 program;

 It cannot determine whether DECLARE is a keyword or an array name until it sees the character that follows the right parenthesis.
SPECIFICATION OF TOKENS
 There are 3 specifications of tokens:
 Strings

 Language

 Regular expression

Strings and Languages


 An alphabet or character class is a finite set of symbols
 A string over an alphabet is a finite sequence of symbols drawn from that
alphabet.
 A language is any countable set of strings over some fixed alphabet.
 In language theory, the terms "sentence" and "word" are often used as
synonyms for "string."
 The length of a string s, usually written |s|, is the number of occurrences of
symbols in s.
 For example, banana is a string of length six.
 The empty string, denoted ε, is the string of length zero.
Operations On Strings
 The following string-related terms are commonly used:
Operations On Languages
 The following are the operations that can be applied to languages:
 Union

 Concatenation

 Kleene closure

 Positive closure
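These operations are defined as follows (standard definitions, given here since the slide's table is not reproduced):

Union: L U M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = the union of L^i for all i >= 0 (zero or more concatenations of L)
Positive closure: L+ = the union of L^i for all i >= 1 (one or more concatenations of L)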
Regular Expressions
 Regular expressions allow us to define precisely the sets of strings that form tokens.
 E.g. letter ( letter | digit )*
 This defines a Pascal identifier: a letter followed by zero or more letters or digits.

 A regular expression is built up out of simpler regular expressions using a set of defining rules.

 Each regular expression r denotes a language L( r ).


The Rules That Define Regular Expressions Over an Alphabet

1. ε is a regular expression that denotes {ε}, i.e. the set containing the empty
string.

2. If a is a symbol in Σ, then a is a regular expression that denotes {a}, i.e. the set containing the string a.

3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then

a) (r) | (s) is a regular expression denoting the language L(r) U L(s).

b) (r)(s) is a regular expression denoting the language L(r)L(s).

c) (r)* is a regular expression denoting the language (L(r))*.

d) (r)+ is a regular expression denoting the language (L(r))+.

 A language denoted by a regular expression is said to be a regular set.


Algebraic properties of regular expressions
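The slide's table is not reproduced; the standard laws are:

r | s = s | r                              ( | is commutative)
r | (s | t) = (r | s) | t                  ( | is associative)
r(st) = (rs)t                              (concatenation is associative)
r(s | t) = rs | rt  and  (s | t)r = sr | tr   (concatenation distributes over |)
εr = rε = r                                (ε is the identity for concatenation)
r* = (r | ε)*                              (ε is guaranteed in a closure)
r** = r*                                   (* is idempotent)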
Regular Definition
 If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
...
dn → rn

 where each di is a distinct name, and each ri is a regular expression over the symbols in Σ U {d1, d2, … , di-1},
 i.e., the basic symbols and the previously defined names.
Example
 The set of identifiers is the set of strings of letters and digits beginning with a letter.
 A regular definition for this set is:
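(Reconstructed here since the slide's figure is not reproduced; this is the standard form.)

letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | ... | 9
id     → letter ( letter | digit )*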
RECOGNITION OF TOKENS
 The question is how to recognize the tokens?

EXAMPLE
 Assume the following grammar fragment generates a specific language,
 where the terminals if, then, else, relop, id and num generate sets of strings given by the following regular definitions,
 where letter and digit are defined as before.
 For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number.
 To simplify matters, we make the common assumption that keywords are also reserved words: that is, they cannot be used as identifiers.
 The num represents the unsigned integer and real numbers of Pascal.
 In addition, we assume lexemes are separated by white space,
consisting of non-null sequences of blanks, tabs, and newlines.
 Our lexical analyzer will strip out white space.
 It will do so by comparing a string against the regular definition ws,
below.

 If a match for ws is found, the lexical analyzer does not return a token
to the parser.
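One common way to implement the reserved-word assumption is to look each letter-initiated lexeme up in a keyword table before creating an id token; a minimal sketch (the token shapes are assumptions):

KEYWORDS = {"if", "then", "else"}         # the reserved words of this fragment

def word_token(lexeme):
    """Return a keyword token for a reserved word, otherwise an id token."""
    if lexeme in KEYWORDS:
        return (lexeme, None)             # e.g. ('then', None)
    return ("id", lexeme)                 # e.g. ('id', 'rate')

print(word_token("then"))                 # ('then', None)
print(word_token("rate"))                 # ('id', 'rate')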
Transition Diagram
 As an intermediate step in the construction of a lexical analyzer, we first produce a flowchart-like diagram called a transition diagram.

 Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.
 A transition diagram keeps track of information about characters that are seen as the forward pointer scans the input.
 It does this by moving from position to position in the diagram as characters are read.
COMPONENTS OF TRANSITION DIAGRAM
Transition diagram for Identifiers
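The diagram itself is not reproduced here; read as code, it amounts to the following sketch: a start state that requires a letter, a looping state that accepts letters or digits, and an accepting state reached on any other character.

def recognize_identifier(text, begin):
    """Simulate the identifier transition diagram starting at position begin.
    Returns (lexeme, position after the lexeme), or (None, begin) on failure.
    The retraction marked on the accepting state is implicit here because the
    next character is only peeked at, never consumed."""
    forward = begin
    if forward < len(text) and text[forward].isalpha():      # start -> letter
        forward += 1
        while forward < len(text) and text[forward].isalnum():
            forward += 1                                       # loop on letter/digit
        return text[begin:forward], forward                    # accept
    return None, begin

print(recognize_identifier("rate * 60", 0))    # ('rate', 4)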
Transition diagram for unsigned numbers
in Pascal
Thanks…
