Compiler Design Note Unit 1
Introduction to Compiling:
Unit-I
1.1 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Preprocessor
A preprocessor produces input to compilers. It may perform the following
functions.
Macro processing: A preprocessor may allow a user to define macros that are
shorthands for longer constructs.
File inclusion: A preprocessor may include header files into the program text.
Rational preprocessor: These preprocessors augment older languages with
more modern flow-of-control and data-structuring facilities.
Language extensions: These preprocessors attempt to add capabilities to the
language by means of built-in macros.
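As a small illustration (a sketch, not from these notes), the C preprocessor
performs both macro processing and file inclusion:

#include <stdio.h>               /* file inclusion: the header text is pasted in */
#define SQUARE(x) ((x) * (x))    /* macro: a shorthand expanded before compilation */

int main(void) {
    printf("%d\n", SQUARE(5));   /* expands to ((5) * (5)) before the compiler sees it */
    return 0;
}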
COMPILER
A compiler is a translator program that takes a source program written in a
high-level language (HLL) and translates it into an equivalent target program
in a low-level language or an intermediate language. It also reports errors if
they exist in the program.
Structure of Compiler
ASSEMBLER
Programmers found it difficult to write or read programs in machine language.
They began to use a mnemonic (symbol) for each machine instruction, which
they would subsequently translate into machine language. Such a mnemonic
machine language is now called an assembly language. Programs known as
assemblers were written to automate the translation of assembly language into
machine language. The input to an assembler is called the source program;
the output is a machine language translation (object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it
were machine language
Advantages:
Modification of the user program can easily be made and implemented as
execution proceeds. The type of object that a variable denotes may change
dynamically. Debugging a program and finding errors is a simplified task for a
program used with interpretation. The interpreter for the language makes the
program machine independent.
Disadvantages:
The execution of the program is slower.
LOADER AND LINK-EDITOR:
Once the assembler produces an object program, that program must be
placed into memory and executed. The assembler could place the object
program directly in memory and transfer control to it, thereby causing the
machine language program to be executed. However, this would waste core by
leaving the assembler in memory while the user's program was being executed.
Also, the programmer would have to re-translate his program with each
execution, thus wasting translation time. To overcome these problems of
wasted translation time and memory, system programmers developed another
component called a loader. A loader is a program that places programs into
memory and prepares them for execution. It would be more efficient if
subroutines could be translated into an object form that the loader could
"relocate" directly behind the user's program. The task of adjusting programs
so they may be placed in arbitrary core locations is called relocation.
Relocating loaders perform four functions (classically: allocation, linking,
relocation, and loading).
1.2 TRANSLATOR
A translator is a program that takes as input a program written in one language
and produces as output a program in another language. Besides program
translation, the translator performs another very important role: error
detection. Any violation of the HLL specification is detected and reported to
the programmer. The important roles of a translator are:
1 Translating the HLL program input into an equivalent machine language (ML)
program.
2 Providing diagnostic messages wherever the programmer violates a
specification of the HLL.
1.3 LIST OF COMPILERS
Ada compilers
ALGOL compilers
BASIC compilers
C# compilers
C compilers
C++ compilers
COBOL compilers
Common Lisp compilers
ECMAScript interpreters
Fortran compilers
Java compilers
Pascal compilers
PL/I compilers
Python compilers
Smalltalk compilers
Phases of Compiler
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this
phase expressions, statements, declarations, etc. are identified by using the
results of lexical analysis. Syntax analysis is aided by using techniques based
on the formal grammar of the programming language.
Intermediate Code Generation:-
a. Local Optimization:-
There are local transformations that can be applied to a program to make an
improvement. For example,
If A > B goto L2
Goto L3
L2 :
This can be replaced by a single statement
If A <= B goto L3
Another important local optimization is the elimination of common
sub-expressions:
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
This takes advantage of the common sub-expression B + C.
b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of
loops. A typical loop improvement is to move a computation that produces the
same result each time around the loop to a point in the program just before
the loop is entered.
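A minimal C sketch of this transformation (the names and functions are
illustrative, not from the notes):

/* before: x * y is recomputed on every iteration */
void before(int a[], int n, int x, int y) {
    for (int i = 0; i < n; i++)
        a[i] = x * y + i;
}

/* after: the loop-invariant computation is hoisted before the loop */
void after(int a[], int n, int x, int y) {
    int t = x * y;
    for (int i = 0; i < n; i++)
        a[i] = t + i;
}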
Code Generator:-
The code generator produces the object code by deciding on the memory
locations for data, selecting code to access each datum, and selecting the
registers in which each computation is to be done. Many computers have only a
few high-speed registers in which computations can be performed quickly. A good
code generator would attempt to utilize registers as efficiently as possible.
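For instance, for the statement A := B + C a code generator might emit
something like the following (an illustrative assembly-like sketch; the
instruction names are assumptions, not a real instruction set):

MOV B, R0      ; load B into register R0
ADD C, R0      ; R0 := R0 + C
MOV R0, A      ; store the result into A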
Table Management OR Book-keeping :-
A compiler needs to collect information about all the data objects that appear
in the source program. The information about data objects is collected by the
early phases of the compiler (the lexical and syntactic analyzers). The data
structure used to record this information is called a symbol table.
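A minimal sketch of one symbol-table entry in C (the struct and field names are
assumptions for illustration):

/* one symbol-table entry; the fields are illustrative */
struct symbol {
    char name[32];   /* the lexeme of the identifier */
    int  type;       /* data type recorded by the analyzers */
    int  offset;     /* storage location assigned to the datum */
};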
Error Handing :-
One of the most important functions of a compiler is the detection and
reporting of errors in the source program. The error message should allow the
programmer to determine exactly where the errors have occurred. Errors may
occur in any of the phases of a compiler. Whenever a phase of the compiler
discovers an error, it must report the error to the error handler, which
issues an appropriate diagnostic message. Both the table-management and
error-handling routines interact with all phases of the compiler.
2. Pass of Compiler:
During compilation the program passes through many phases; each phase takes a
specific input and produces a specific output.
LEXICAL ANALYSIS
• reads and converts the input into a stream of tokens to be analyzed by the
parser.
• lexeme: a sequence of characters which comprises a single token.
• Lexical Analyzer → Lexeme / Token → Parser
Removal of White Space and Comments
• Removes white space (blank, tab, newline, etc.) and comments.
Constants
A Lexical Analyzer
c = getchar(); ungetc(c, stdin);
• token representation
o #define NUM 256
• Function lexan()
e.g.) input string: 76 + a
input    output (returned value)
76       NUM, tokenval = 76 (integer)
+        '+' (the character itself)
a        id, tokenval = "a"
• The way that the parser handles the token NUM returned by lexan()
o consider a translation scheme
factor → ( expr )
| num { print(num.value) }
#define NUM 256
...
factor() {
    if (lookahead == '(') {
        match('('); expr(); match(')');
    } else if (lookahead == NUM) {
        printf(" %d ", tokenval); match(NUM);
    } else error();
}
• The implementation of function lexan()
#include <stdio.h>
#include <ctype.h>

#define NUM  256   /* token code for numbers */
#define NONE -1    /* "no attribute value"; -1 is an assumed definition */

int lineno = 1;
int tokenval = NONE;

int lexan() {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                      /* strip out blanks and tabs */
        else if (t == '\n')
            lineno += 1;
        else if (isdigit(t)) {
            /* collect consecutive digits into an integer attribute */
            tokenval = t - '0';
            t = getchar();
            while (isdigit(t)) {
                tokenval = tokenval * 10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);      /* push back the first non-digit */
            return NUM;
        } else {
            tokenval = NONE;
            return t;              /* any other character is its own token */
        }
    }
}
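A tiny driver (an assumed test harness, not from the notes) can be appended to
the listing above to exercise lexan():

int main(void) {
    int t;
    while ((t = lexan()) != EOF) {       /* getchar() yields EOF at end of input */
        if (t == NUM)
            printf("NUM(%d)\n", tokenval);
        else
            printf("token '%c'\n", t);
    }
    return 0;
}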
Upon receiving a 'get next token' command from the parser, the lexical
analyzer reads input characters until it can identify the next token. The LA
returns to the parser a representation for the token it has found. The
representation will be an integer code if the token is a simple construct such
as a parenthesis, comma or colon. The LA may also perform certain secondary
tasks at the user interface. One such task is stripping out from the source
program comments and white space in the form of blank, tab and newline
characters. Another is correlating error messages from the compiler with the
source program.
Pattern: A set of strings in the input for which the same token is produced
as output. This set of strings is described by a rule called a pattern
associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is
matched by the pattern for a token. For example, for the token id the pattern
is letter (letter | digit)* and a matching lexeme might be count1.
Operator = + | - | * | / | mod | div
Keyword = if | while | do | then
letter = a | b | c | ... | x | y | z | A | B | C | ... | X | Y | Z
digit = 0 | 1 | 2 | ... | 9
Identifier = letter (letter | digit)*
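A minimal C sketch that checks the Identifier pattern above (the function name
is an assumption):

#include <ctype.h>

/* returns 1 if s matches letter (letter | digit)*, else 0 */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s))
        return 0;                          /* must begin with a letter */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s))
            return 0;                      /* rest must be letters or digits */
    return 1;
}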
Here are the rules that define the regular expressions over an alphabet Σ:
• ε is a regular expression denoting { ε }, that is, the language containing
only the empty string.
• For each 'a' in Σ, 'a' is a regular expression denoting { a }, the language
with only one string consisting of the single symbol 'a'.
• If R and S are regular expressions, then
(R) | (S) denotes L(R) U L(S)
R.S denotes L(R).L(S)
R* denotes (L(R))*
3.6. REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular
expressions and to define regular expressions using these names as if they
were symbols. Identifiers are the set of strings of letters and digits
beginning with a letter. The following regular definition provides a precise
specification for this class of strings.
Example-1,
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | 2 | ... | 9
id → letter (letter | digit)*
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we
must study how to take the patterns for all the needed tokens and build a
piece of code that examines the input string and finds a prefix that is a
lexeme matching one of the patterns.
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | number
For relop, we use the comparison operators of languages like Pascal or SQL,
where = is "equals" and <> is "not equals", because it presents an
interesting structure of lexemes. The terminals of the grammar, which are if,
then, else, relop, id and number, are the names of tokens as far as the
lexical analyzer is concerned; the patterns for the tokens are described
using regular definitions.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
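A minimal C sketch of recognizing the relop lexemes above (the token codes and
the function name are assumptions for illustration):

#include <stdio.h>

enum { LT = 300, LE, EQ, NE, GT, GE };   /* assumed token codes for relop */

/* read one relational operator from stdin and return its token code,
   or -1 if the next character does not begin a relop */
int relop(void) {
    int c = getchar();
    if (c == '<') {
        c = getchar();
        if (c == '=') return LE;         /* <= */
        if (c == '>') return NE;         /* <> */
        ungetc(c, stdin); return LT;     /* <  */
    }
    if (c == '=') return EQ;             /* =  */
    if (c == '>') {
        c = getchar();
        if (c == '=') return GE;         /* >= */
        ungetc(c, stdin); return GT;     /* >  */
    }
    ungetc(c, stdin);
    return -1;
}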
In addition, we assign the lexical analyzer the job of stripping out white
space, by recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the
ASCII characters of the same names. Token ws is different from the other
tokens in that, when we recognize it, we do not return it to the parser, but
rather restart the lexical analysis from the character that follows the white
space.