UNIT-1
[Fig. 1.1: A Compiler — source program → compiler → target program, with an error report produced when compilation fails]
Major functions done by compiler:
A compiler is used to convert one form of a program to another.
A compiler should convert the source program to target machine code in such a way
that the generated target code is easy to understand.
A compiler should preserve the meaning of the source code.
A compiler should report errors that occur during the compilation process.
The compilation must be done efficiently.
[Fig.: Syntax tree for the statement a = a + b * c * 2]
Semantic analysis:
The semantic analyzer determines the meaning of a source string.
For example: matching of parentheses in an expression, matching of an if..else
statement, checking that arithmetic operations are performed on type-compatible
operands, or checking the scope of an operation.
[Fig.: Annotated syntax tree for a = a + b * c * 2, with an int-to-float conversion inserted for the integer constant 2]
Synthesis phase: the synthesis part is divided into three sub-parts:
I. Intermediate code generation
II. Code optimization
III. Code generation
Intermediate code generation:
The intermediate representation should have two important properties: it should be
easy to produce and easy to translate into the target program.
Dixita Kagathara, CE Department | 170701 – Compiler Design
We consider an intermediate form called "three-address code".
Three-address code consists of a sequence of instructions, each of which has at most
three operands.
The source program might appear in three-address code as:
t1 = inttoreal(2)
t2 = id3 * t1
t3 = t2 * id2
t4 = t3 + id1
id1 = t4
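The translation above can be sketched as a post-order walk of the syntax tree: each interior node emits one instruction with at most three operands. The tuple encoding of the tree and the temporary-naming scheme below are illustrative assumptions, and the int-to-real conversion step is omitted for brevity.

```python
import itertools

def gen_three_address(node, code, temps):
    """Post-order walk of the expression tree: translate both children
    first, then emit one instruction (at most three operands) per node."""
    if isinstance(node, str):              # leaf: an identifier or constant
        return node
    op, left, right = node
    l = gen_three_address(left, code, temps)
    r = gen_three_address(right, code, temps)
    temp = f"t{next(temps)}"               # fresh temporary for the result
    code.append(f"{temp} = {l} {op} {r}")
    return temp

# hypothetical tuple encoding of the tree for id1 + id2 * id3 * 2
tree = ("+", "id1", ("*", "id2", ("*", "id3", "2")))
code = []
result = gen_three_address(tree, code, itertools.count(1))
code.append(f"id1 = {result}")             # store back into the target
for line in code:
    print(line)
```

The walk visits the innermost product first, so the deepest subexpression gets the lowest-numbered temporary, matching the ordering of the sequence in the text.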
Code optimization:
The code optimization phase attempts to improve the intermediate code.
This is necessary to obtain faster-executing code or code that consumes less memory.
Thus, by optimizing the code, the overall running time of the target program can be
improved.
t1 = id3 * 2.0
t2 = id2 * t1
id1 = id1 + t2
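One way this improvement happens is constant folding plus copy propagation: evaluate inttoreal(2) at compile time and substitute the folded value into later uses, so the extra instruction disappears. The sketch below is a toy pass over a hypothetical (destination, expression) list, not a full optimizer; it does not renumber the surviving temporaries the way the hand-optimized sequence above does.

```python
def fold_and_copy_propagate(code):
    """Fold inttoreal(c) constants and substitute them into later uses."""
    consts, out = {}, []
    for dest, expr in code:
        # constant folding: do the int-to-real conversion at compile time
        if expr.startswith("inttoreal(") and expr[10:-1].isdigit():
            consts[dest] = f"{float(expr[10:-1])}"
            continue                       # the instruction is removed entirely
        # propagation: replace any use of a folded temporary by its value
        parts = [consts.get(p, p) for p in expr.split()]
        out.append((dest, " ".join(parts)))
    return out

code = [("t1", "inttoreal(2)"),
        ("t2", "id3 * t1"),
        ("t3", "t2 * id2"),
        ("t4", "t3 + id1"),
        ("id1", "t4")]
for dest, expr in fold_and_copy_propagate(code):
    print(f"{dest} = {expr}")
```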
Code generation:
In the code generation phase the target code gets generated. The intermediate code
instructions are translated into a sequence of machine instructions.
MOV id3, R1
MUL #2.0, R1
MOV id2, R2
MUL R2, R1
MOV id1, R2
ADD R2, R1
MOV R1, id1
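A toy code generator for this step might walk the optimized three-address code and emit two-address MOV/MUL/ADD instructions of the style shown above. This is a sketch, not a real back end: register allocation is naive (every loaded operand takes a fresh register), so the registers chosen differ from the hand-written sequence in the text, but the load/operate/store pattern is the same.

```python
def generate(three_address_code):
    """Translate (dest, (op, a, b)) triples into two-address instructions,
    where 'OP src, dst' computes dst = dst OP src."""
    regs, out, next_reg = {}, [], 1
    for dest, (op, a, b) in three_address_code:
        if a not in regs:                          # load the left operand
            regs[a] = f"R{next_reg}"
            out.append(f"MOV {a}, {regs[a]}")
            next_reg += 1
        # immediates get a '#' prefix; values already in registers are reused
        src = regs.get(b, f"#{b}" if b[0].isdigit() else b)
        out.append(f"{'MUL' if op == '*' else 'ADD'} {src}, {regs[a]}")
        regs[dest] = regs[a]                       # result stays in place
    out.append(f"MOV {regs[dest]}, {dest}")        # store the final result
    return out

code = [("t1", ("*", "id3", "2.0")),
        ("t2", ("*", "id2", "t1")),
        ("id1", ("+", "id1", "t2"))]
for instr in generate(code):
    print(instr)
```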
Symbol Table
A symbol table is a data structure used by a language translator such as a compiler or
interpreter.
It is used to store names encountered in the source program, along with the relevant
attributes for those names.
Information about the following entities is stored in the symbol table:
Variable/Identifier
Procedure/function
Keyword
Constant
Class name
Label name
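A minimal symbol table can be sketched as a dictionary mapping each name to a dictionary of its attributes. The attribute names used here (kind, type, returns) are illustrative, not a fixed schema.

```python
class SymbolTable:
    """A toy symbol table: name -> {attribute: value} mappings."""

    def __init__(self):
        self._table = {}

    def insert(self, name, **attributes):
        """Enter a name with its attributes; later inserts update them."""
        self._table.setdefault(name, {}).update(attributes)

    def lookup(self, name):
        """Return the attributes recorded for name, or None if absent."""
        return self._table.get(name)

symtab = SymbolTable()
symtab.insert("rate", kind="identifier", type="float")   # a variable
symtab.insert("main", kind="function", returns="int")    # a procedure
print(symtab.lookup("rate"))
```

Real compilers layer such tables per scope (one table per block, chained to the enclosing scope), but a single flat table is enough to show the insert/lookup interface.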
[Fig.: Phases of a compiler — Source program → Lexical Analysis → Syntax Analysis → Semantic Analysis → Intermediate Code → Code Optimization → Code Generation → Target Program, with the Symbol Table and error detection and recovery connected to every phase]
[Fig.: Language processing system — Skeletal source → Preprocessor → Source program → Compiler → Target assembly → Assembler → Linker / Loader]
LEXICAL ANALYSIS:
→ As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output tokens for each
lexeme in the source program. This stream of tokens is sent to the parser for syntax analysis. It is
common for the lexical analyzer to interact with the symbol table as well.
→When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table. This process is shown in the following figure.
→When the lexical analyzer identifies the first token it sends it to the parser; the parser
receives the token and calls the lexical analyzer for the next token by issuing the
getNextToken() command. This process continues until the lexical analyzer has identified
all the tokens. During this process the lexical analyzer discards white space and
comment lines.
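The hand-off described above can be sketched with a generator standing in for getNextToken(): the parser pulls one (token, lexeme) pair at a time, and the lexer silently drops whitespace and comments. The token categories and patterns below are illustrative assumptions, not a fixed language definition.

```python
import re

# ordered token specification: SKIP must come first so that "//" comments
# are discarded before the "/" operator pattern can claim the first slash
TOKEN_SPEC = [
    ("SKIP",   r"\s+|//[^\n]*"),     # whitespace and // comments: discarded
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def get_next_token(source):
    """Generator form of getNextToken(): yields (token-name, lexeme) pairs."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":          # neglect whitespace and comments
            yield (m.lastgroup, m.group())

tokens = list(get_next_token("rate = initial + 60  // comment"))
print(tokens)
```

Each call to the generator resumes scanning where the last token ended, which is exactly the pull-based interaction the parser relies on.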
LEXICAL ANALYSIS Vs PARSING:
There are a number of reasons why the analysis portion of a compiler is normally separated into
lexical analysis and parsing (syntax analysis) phases.
1. Simplicity of design is the most important consideration. The separation of lexical
and syntactic analysis often allows us to simplify at least one of these tasks. For
example, a parser that had to deal with comments and whitespace as syntactic units
would be considerably more complex than one that can assume comments and
whitespace have already been removed by the lexical analyzer.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.
3. Compiler portability is enhanced: Input-device-specific peculiarities can be
restricted to the lexical analyzer.
INPUT BUFFERING:
This section examines some ways in which the simple but important task of reading the source program can be sped up.
This task is made difficult by the fact that we often have to look one or more characters beyond
the next lexeme before we can be sure we have the right lexeme. There are many situations
where we need to look at least one additional character ahead. For instance, we cannot be sure
we've seen the end of an identifier until we see a character that is not a letter or digit, and
therefore is not part of the lexeme for id. In C, single-character operators like -, =, or <
could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall
introduce a two-buffer scheme that handles large look aheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs
Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters in to a buffer, rather than
using one system call per character. If fewer than N characters remain in the input file, then a
special character, represented by eof, marks the end of the source file and is different from any
possible character of the source program.
Two pointers to the input are maintained:
1. The pointer lexemeBegin marks the beginning of the current lexeme, whose extent
we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found; the exact strategy
whereby this determination is made will be covered in the balance of this chapter.
→Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin
is set to the character immediately after the lexeme just found. In the figure, we see forward has passed
the end of the next lexeme, ** (the FORTRAN exponentiation operator), and must be retracted
one position to its left.
Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater
than N, we shall never overwrite the lexeme in its buffer before determining it.
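The buffer-pair scheme with sentinels can be simulated in a few lines. This is a sketch under simplifying assumptions: a small N, a NUL byte standing in for the special eof character (so the input is assumed not to contain NUL), and a scan that simply copies characters instead of matching lexemes. The point it illustrates is that the hot loop needs only one test per character; the "which buffer half am I in?" question is asked only when a sentinel is actually hit.

```python
EOF = "\0"          # stands in for the special eof sentinel character
N = 8               # buffer-half size (a real lexer would use e.g. 4096)

def load(source, start):
    """Fill one buffer half from the input and append the sentinel."""
    chunk = source[start:start + N]
    return list(chunk) + [EOF], start + len(chunk)

def scan(source):
    """Advance forward through the input, reloading halves on demand."""
    buf, pos = load(source, 0)
    forward, out = 0, []
    while True:
        c = buf[forward]                   # the only per-character test
        forward += 1
        if c == EOF:                       # sentinel: end of half, or of input?
            if pos >= len(source):
                return "".join(out)        # true end of input
            buf, pos = load(source, pos)   # reload the other buffer half
            forward = 0
        else:
            out.append(c)

print(scan("position = initial + rate * 60"))
```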
SYNTAX ANALYSIS (PARSING):
→The parser constructs a parse tree and passes it to the rest of the compiler for further processing.
→During the process of parsing it may encounter errors and present the error information back
to the user.
→Syntactic errors include misplaced semicolons or extra or missing braces, that is,
"{" or "}". As another example, in C or Java, the appearance of a case statement without
an enclosing switch is a syntactic error (however, this situation is usually allowed by the
parser and caught later in the processing, as the compiler attempts to generate code).
→Based on the order in which the parse tree is constructed, parsing is classified into the
following two types:
1. Top-down parsing: parse tree construction starts at the root node and moves towards
the children nodes (i.e., top-down order).
2. Bottom-up parsing: parse tree construction begins from the leaf nodes and proceeds
towards the root node (the bottom-up order).
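Top-down parsing can be illustrated with a tiny recursive-descent sketch for the hypothetical grammar expr → term ('+' term)*, term → ID. The parser starts from the start symbol (the root) and expands downward toward the leaves; a bottom-up parser would instead shift leaves onto a stack and reduce them into subtrees first.

```python
def parse_expr(tokens, i=0):
    """expr -> term ('+' term)* ; returns (tree, next-position)."""
    node, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == "+":
        right, i = parse_term(tokens, i + 1)
        node = ("+", node, right)          # left-associated subtree
    return node, i

def parse_term(tokens, i):
    """term -> ID ; a leaf of the parse tree."""
    return tokens[i], i + 1

tree, _ = parse_expr(["a", "+", "b", "+", "c"])
print(tree)
```

Each grammar rule becomes one function, which is why this style is called recursive descent; the call structure mirrors the shape of the parse tree being built.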
→In addition to these translators, programs like interpreters, text formatters, etc., may be used in
a language processing system. To translate a program in a high-level language into an
executable one, the compiler performs the compile and linking functions by default.
→Normally the steps in a language processing system include preprocessing the skeletal source
program, which produces an extended or expanded source program (a ready-to-compile unit),
followed by compiling the result, then linking/loading, until finally its equivalent executable
code is produced.
→As noted earlier, not all these steps are mandatory. In some cases, the compiler performs the
linking and loading functions implicitly.