Compiler Design Note Unit -1

Compiler Design (Dr. A.P.J. Abdul Kalam Technical University)


Subject Name: Compiler Design    Subject Code: KIT 052

Introduction to Compiling
Unit - I
1.1 INTRODUCTION TO THE LANGUAGE PROCESSING SYSTEM

Language Processing System

Preprocessor
A preprocessor produces input to the compiler. It may perform the following functions (a small example follows the list):
Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
File inclusion: A preprocessor may include header files into the program text.
Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
Language extensions: These preprocessors attempt to add capabilities to the language in the form of built-in macros.
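For instance, a small C fragment illustrating the first two functions (the macro name SQUARE is only an illustrative choice):

#include <stdio.h>              /* file inclusion: the preprocessor splices in the header text */

#define SQUARE(x) ((x) * (x))   /* macro processing: a shorthand for a longer construct        */

int main(void) {
    printf("%d\n", SQUARE(5));  /* expands to ((5) * (5)) before compilation proper            */
    return 0;
}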
COMPILER
A compiler is a translator program that translates a source program written in a high-level language (HLL) into an equivalent program in a low-level target language or an intermediate language. It also reports errors if any exist in the source program.


Structure of Compiler

Executing a program written in an HLL programming language basically consists of two parts: the source program must first be compiled (translated) into an object program; then the resulting object program is loaded into memory and executed.

Execution process of source program in Compiler

ASSEMBLER
Programmers found it difficult to write or read programs in machine language. They began to use mnemonics (symbols) for each machine instruction, which they would subsequently translate into machine language. Such a mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language. The input to an assembler program is called the source program; the output is a machine-language translation (object program).
INTERPRETER
An interpreter is a program that appears to execute a source program as if it were machine language.
Languages such as BASIC, SNOBOL and LISP can be translated using interpreters. JAVA also uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis


4. Direct Execution
Advantages:
Modifications to the user program can easily be made and applied as execution proceeds. The type of an object that a variable denotes may change dynamically. Debugging a program and finding errors is a simplified task for a program used for interpretation. The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
LOADER AND LINK-EDITOR:
Once the assembler produces an object program, that program must be placed into memory and executed. The assembler could place the object program directly in memory and transfer control to it, thereby causing the machine-language program to be executed. However, this would waste core by leaving the assembler in memory while the user's program was being executed. Also, the programmer would have to re-translate his program with each execution, thus wasting translation time. To overcome these problems of wasted translation time and memory, system programmers developed another component called a loader.
"A loader is a program that places programs into memory and prepares them for execution." It would be more efficient if subroutines could be translated into object form which the loader could "relocate" directly behind the user's program. The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
1.2 TRANSLATOR
A translator is a program that takes as input a program written in one language and produces as output a program in another language. Besides program translation, the translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer. The important roles of a translator are:
1. Translating the HLL program input into an equivalent machine-language (ML) program.
2. Providing diagnostic messages wherever the programmer violates the specification of the HLL.
1.3 LIST OF COMPILERS
 Ada compilers
 ALGOL compilers
 BASIC compilers
 C# compilers
 C compilers
 C++ compilers
 COBOL compilers
 Common Lisp compilers
 ECMAScript interpreters
 Fortran compilers
 Java compilers
 Pascal compilers
 PL/I compilers
 Python compilers
 Smalltalk compilers


1.4 STRUCTURE OF THE COMPILER DESIGN


Phases of a compiler: A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below. There are two parts of compilation:
a. Analysis (machine independent / language dependent)
b. Synthesis (machine dependent / language independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:-
The lexical analyzer (LA), or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.

Phases of Compiler
Syntax Analysis:-
The second stage of translation is called syntax analysis or parsing. In this phase, expressions, statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is aided by using techniques based on the formal grammar of the programming language.
Intermediate Code Generation:-


An intermediate representation of the final machine-language code is produced. This phase bridges the analysis and synthesis phases of translation.
Code Optimization:-
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space.
Code Generation:-
The last phase of translation is code generation. A number of optimizations to reduce the length of the machine-language program are carried out during this phase. The output of the code generator is the machine-language program for the specified computer.
Table Management (or) Book-keeping:- This portion of the compiler keeps the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.
Error Handlers:-
This is invoked when a flaw (error) in the source program is detected.
The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser. The syntax analyzer (SA) groups the tokens together into syntactic structures called expressions. Expressions may further be combined to form statements. The syntactic structure can be regarded as a tree whose leaves are the tokens; such trees are called parse trees.
The parser has two functions. It checks whether the tokens from the lexical analyzer occur in patterns that are permitted by the specification of the source language. It also imposes on the tokens a tree-like structure that is used by the subsequent phases of the compiler.
For example, if a program contains the expression A+/B, after lexical analysis this expression might appear to the syntax analyzer as the token sequence id +/ id. On seeing the /, the syntax analyzer should detect an error situation, because the presence of these two adjacent binary operators violates the formation rules of an expression. Syntax analysis also makes explicit the hierarchical structure of the incoming token stream by identifying which parts of the token stream should be grouped.
For example, A/B*C has two possible interpretations:
1. divide A by B and then multiply by C, or
2. multiply B by C and then use the result to divide A.
Each of these two interpretations can be represented in terms of a parse tree.
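As a small illustration (a sketch, not drawn in the original notes), the two trees can be written in prefix form as:
1. (A/B)*C corresponds to the tree *( /(A, B), C )
2. A/(B*C) corresponds to the tree /( A, *(B, C) )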
Intermediate Code Generation:-
The intermediate code generator uses the structure produced by the syntax analyzer to create a stream of simple instructions. Many styles of intermediate code are possible. One common style uses instructions with one operator and a small number of operands. The output of the syntax analyzer is some representation of a parse tree; the intermediate code generation phase transforms this parse tree into an intermediate-language representation of the source program.
Code Optimization
This is an optional phase designed to improve the intermediate code so that the output runs faster and takes less space. Its output is another intermediate-code program that does the same job as the original, but in a way that saves time and/or space.


a. Local Optimization:-
These are local transformations that can be applied to a program to make an improvement. For example, the sequence
If A > B goto L2
Goto L3
L2 :
can be replaced by the single statement
If A <= B goto L3
Another important local optimization is the elimination of common sub-expressions:
A := B + C + D
E := B + C + F
might be evaluated as
T1 := B + C
A := T1 + D
E := T1 + F
taking advantage of the common sub-expression B + C.
b. Loop Optimization:-
Another important source of optimization concerns increasing the speed of loops. A typical loop improvement is to move a computation that produces the same result each time around the loop to a point in the program just before the loop is entered, as sketched below.
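A small sketch of this in C (the function and variable names are illustrative assumptions, not taken from the notes):

/* before: x * y is loop-invariant but recomputed on every iteration */
void fill_before(int *a, int n, int x, int y) {
    for (int i = 0; i < n; i++)
        a[i] = x * y + i;
}

/* after loop optimization: the invariant computation is moved ahead of the loop */
void fill_after(int *a, int n, int x, int y) {
    int t = x * y;
    for (int i = 0; i < n; i++)
        a[i] = t + i;
}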
Code Generator:-
The code generator produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting the registers in which each computation is to be done. Many computers have only a few high-speed registers in which computations can be performed quickly. A good code generator would attempt to utilize registers as efficiently as possible.
Table Management OR Book-keeping:-
A compiler needs to collect information about all the data objects that appear in the source program. The information about data objects is collected by the early phases of the compiler (the lexical and syntactic analyzers). The data structure used to record this information is called a Symbol Table.
Error Handling:-
One of the most important functions of a compiler is the detection and reporting of errors in the source program. The error messages should allow the programmer to determine exactly where the errors have occurred. Errors may occur in any of the phases of a compiler. Whenever a phase of the compiler discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Both the table-management and error-handling routines interact with all phases of the compiler.
Example:


Compilation Process of a source code through phases
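To make the phases concrete, here is a hedged sketch (the statement, the intermediate form and the target instructions are illustrative assumptions, not reproduced from the figure) of how a simple assignment might move through the phases:

    position = initial + rate * 60

Lexical analysis   : <id,1> <=, > <id,2> <+, > <id,3> <*, > <num,60>
Syntax analysis    : assignment tree  =( id1, +( id2, *( id3, 60 ) ) )
Intermediate code  : t1 = id3 * 60
                     t2 = id2 + t1
                     id1 = t2
Code optimization  : t1 = id3 * 60
                     id1 = id2 + t1        (the redundant temporary t2 is removed)
Code generation    : MOV  id3, R1
                     MUL  #60, R1
                     ADD  id2, R1
                     MOV  R1, id1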

2. Pass of Compiler:
During the compilation of a program the source passes through many phases; each phase takes a specific input and produces a specific output.


INTRODUCTION: In computer programming, a one-pass compiler is a compiler that passes through the parts of each compilation unit only once, immediately translating each part into its final machine code. This is in
contrast to a multi-pass compiler which converts the program into one or
more intermediate representations in steps between source code and
machine code, and which reprocesses the entire compilation unit in each
sequential pass.
2.1 OVERVIEW
Language Definition
Appearance of a programming language:
Vocabulary : regular expressions
Syntax : Backus-Naur Form (BNF) or Context-Free Grammar (CFG)
Semantics : informal language or some examples

Structure of our compiler front end


SYNTAX DEFINITION
• To specify the syntax of a language : CFG and BNF
Example : an if-else statement in C has the form
stmt → if ( expr ) stmt else stmt
• An alphabet of a language is a set of symbols.
Examples : {0,1} for a binary number system; language = {0, 1, 100, 101, ...}
{a,b,c} for language = {a, b, c, ac, abcc, ...}
{if, (, ), else, ...} for if-statements, e.g. {if(a==1)goto10, ...}
• A string over an alphabet is a sequence of zero or more symbols from the alphabet.
Examples : 0, 1, 10, 00, 11, 111, ... are strings over the alphabet {0,1}
The null string (ε) is a string which does not have any symbol of the alphabet.
• A language is a subset of the set of all strings over a given alphabet.
Alphabets Ai          Languages Li for Ai
A0 = {0,1}            L0 = {0, 1, 100, 101, ...}
A1 = {a,b,c}          L1 = {a, b, c, ac, abcc, ...}
A2 = {all C tokens}   L2 = {all sentences of a C program}
• Example 2.1. A grammar for expressions consisting of digits and plus and minus signs.
Language of expressions L = {9-5+2, 3-1, ...}
The productions of the grammar for this language L are:
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
list, digit : grammar variables (nonterminals), grammar symbols
0,1,2,3,4,5,6,7,8,9,-,+ : tokens, terminal symbols
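For example (a worked derivation consistent with this grammar), the sentence 9-5+2 has the leftmost derivation:
list ⇒ list + digit ⇒ list - digit + digit ⇒ digit - digit + digit ⇒ 9 - digit + digit ⇒ 9 - 5 + digit ⇒ 9 - 5 + 2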


• Convention for specifying a grammar
Terminal symbols : boldface strings, e.g. if, num, id
Nonterminal symbols (grammar variables) : italicized names, e.g. list, digit, A, B
• Grammar G=(N,T,P,S)
N : a set of nonterminal symbols
T : a set of terminal symbols, tokens
P : a set of production rules
S : a start symbol, S∈N

• Grammar G for a language L={9-5+2, 3-1, ...}


G=(N,T,P,S)
N={list,digit}
T={0,1,2,3,4,5,6,7,8,9,-,+}
P : list -> list + digit
list -> list - digit
list -> digit
digit -> 0|1|2|3|4|5|6|7|8|9
S=list
• Some definitions for a language L and its grammar G
• Derivation :
A sequence of replacements S ⇒ α1 ⇒ α2 ⇒ … ⇒ αn is a derivation of αn.
Example : a derivation of 1+9 from the grammar G
• left most derivation
list ⇒ list + digit ⇒ digit + digit ⇒ 1 + digit ⇒ 1 + 9
• right most derivation
list ⇒ list + digit ⇒ list + 9 ⇒ digit + 9 ⇒ 1 + 9
• Language of a grammar, L(G)
L(G) is the set of sentences that can be generated from the grammar G.
L(G) = {x | S ⇒* x}, where x is a sequence of terminal symbols
• Example: Consider a grammar G = (N,T,P,S):
N = {S}   T = {a,b}
S = S     P = {S → aSb | ε}    Is aabb a sentence of L(G)?
(derivation of the string aabb)
S ⇒ aSb ⇒ aaSbb ⇒ aaεbb ⇒ aabb (i.e. S ⇒* aabb), so aabb ∈ L(G).
There is no derivation for aa, so aa ∉ L(G).
Note: L(G) = {aⁿbⁿ | n ≥ 0}, where aⁿbⁿ means n a's followed by n b's.

LEXICAL ANALYSIS
• reads and converts the input into a stream of tokens to be analyzed by the parser.
• lexeme : a sequence of characters which comprises a single token.
• Lexical Analyzer → Lexeme / Token → Parser
Removal of White Space and Comments
• Remove white space (blank, tab, newline, etc.) and comments
Constants


• Constants : for now, consider only integers
• eg) for input 31 + 28, what is the output (token representation)?
input : 31 + 28
output : <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attributes, the values (or lexemes) of the integer token num
Recognizing Identifiers, Keywords and Operators
• Identifiers
o Identifiers are names of variables, arrays, functions, ...
o A grammar treats an identifier as a token.
o eg) input : count = count + increment;
output : <id,1> <=, > <id,1> <+, > <id,2>;
• Keywords are reserved, i.e., they cannot be used as identifiers.
A character string therefore forms an identifier only if it is not a keyword.
• Punctuation symbols
o operators : + - * / := < > …
Interface to lexical analyzer

Inserting a lexical analyzer between the input and the parser

A Lexical Analyzer

c = getchar();  ungetc(c, stdin);
• token representation
o #define NUM 256
• Function lexan()
eg) for the input string 76 + a
input      output (returned value)
76         NUM, tokenval = 76 (integer)
+          '+' (the operator character itself)
a          id, tokenval = "a"
• How the parser handles the token NUM returned by lexan()
o consider the translation scheme
factor → ( expr )
       | num { print(num.value) }
#define NUM 256
...
factor() {
    if (lookahead == '(') {
        match('('); expr(); match(')');
    } else if (lookahead == NUM) {
        printf(" %d ", tokenval); match(NUM);
    } else error();
}
• The implementation of function lexan()
#include <stdio.h>
#include <ctype.h>

#define NUM  256          /* token code for numbers                   */
#define NONE -1           /* means: no attribute value for this token */

int lineno   = 1;         /* current line number in the input         */
int tokenval = NONE;      /* attribute value of the last token read   */

int lexan() {
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                               /* strip blanks and tabs             */
        else if (t == '\n')
            lineno += 1;                    /* count lines                       */
        else if (isdigit(t)) {
            tokenval = t - '0';             /* accumulate the integer value      */
            t = getchar();
            while (isdigit(t)) {
                tokenval = tokenval * 10 + t - '0';
                t = getchar();
            }
            ungetc(t, stdin);               /* push back the lookahead character */
            return NUM;
        } else {
            tokenval = NONE;
            return t;                       /* any other character is its own token */
        }
    }
}
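A minimal driver is sketched below (an illustrative assumption, not part of the original notes): it repeatedly calls lexan() and prints each token until end of input.

int main(void) {
    int t;
    while ((t = lexan()) != EOF) {
        if (t == NUM)
            printf("<num, %d>\n", tokenval);   /* number token with its value */
        else
            printf("<'%c'>\n", t);             /* single-character token      */
    }
    return 0;
}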

ROLE OF LEXICAL ANALYZER


The LA is the first phase of a compiler. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.


Role of Lexical analyzer

Upon receiving a 'get next token' command from the parser, the lexical analyzer reads input characters until it can identify the next token. The LA then returns to the parser a representation of the token it has found. The representation will be an integer code if the token is a simple construct such as a parenthesis, comma or colon. The LA may also perform certain secondary tasks at the user interface. One such task is stripping out from the source program comments and white space in the form of blank, tab and newline characters. Another is correlating error messages from the compiler with the source program.

Lexical Analyzer Generator

3.3 TOKEN, LEXEME, PATTERN:


Token: A token is a sequence of characters that can be treated as a single logical entity. Typical tokens are:
1) identifiers  2) keywords  3) operators  4) special symbols  5) constants


Pattern: A set of strings in the input for which the same token is produced
as output. This set of strings is described by a rule called a pattern
associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is
matched by the pattern for a token.

Example of Token, Lexeme and Pattern
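For example (illustrative entries; the specific lexemes are assumptions, not taken from the original table):

Token     Lexeme            Pattern
if        if                the characters i, f
relop     <, <=, =, <>      < | <= | = | <> | > | >=
id        count, rate       letter followed by letters and digits
num       60, 3.14          any numeric constant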


LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which means that there is no way to recognise a lexeme as a valid token for the lexer. Syntax errors, on the other hand, are thrown by the parser when a given sequence of already recognised valid tokens does not match any of the right-hand sides of the grammar rules. A simple panic-mode error-handling scheme requires that we return to a high-level parsing function when a parsing or lexical error is detected.
Possible error-recovery actions are (a sketch of the first one follows the list):
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
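As a rough sketch of the first action, deleting characters until scanning can resume (the token set and helper below are assumptions for illustration, not part of the notes):

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Hypothetical test: can this character begin a token of our toy language? */
static int can_start_token(int c) {
    return c == EOF || isalnum(c) || strchr("+-*/()=<>;", c) != NULL;
}

/* Panic-mode recovery: delete characters from the remaining input until one
   that can begin a token (or EOF) is seen, then resume normal scanning.     */
static void skip_bad_input(void) {
    int c;
    while ((c = getchar()) != EOF && !can_start_token(c))
        fprintf(stderr, "lexical error: discarding '%c'\n", c);
    if (c != EOF)
        ungetc(c, stdin);       /* push back the first usable character */
}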
3.5. REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings.
R*        zero or more occurrences of R
R+        one or more occurrences of R
R1 R2     an R1 followed by an R2
R1 | R2   either an R1 or an R2
A token is either a single string or one of a collection of strings of a certain type. If we view the set of strings in each token class as a language, we can use the regular-expression notation to describe tokens.
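For instance, over the alphabet {a, b} (a worked illustration of the operators above):
a | b           denotes {a, b}
(a | b)(a | b)  denotes {aa, ab, ba, bb}
a*              denotes {ε, a, aa, aaa, ...}
(a | b)*        denotes the set of all strings of a's and b's, including the empty string ε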


Consider an identifier, which is defined to be a letter followed by zero or more letters or digits. In regular-expression notation we would write:

operator   = + | - | * | / | mod | div
keyword    = if | while | do | then
letter     = a | b | c | ...... | x | y | z | A | B | C | ....... | X | Y | Z
digit      = 0 | 1 | 2 | .......... | 9
identifier = letter (letter | digit)*
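A small C sketch of this definition (the function name is only an illustrative choice; it checks a whole string against letter (letter | digit)*):

#include <ctype.h>

int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0]))          /* must begin with a letter    */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))      /* then only letters or digits */
            return 0;
    return 1;
}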

Here are the rules that define the regular expressions over an alphabet Σ:
• ε is a regular expression denoting { ε }, that is, the language containing only the empty string.
• For each 'a' in Σ, a is a regular expression denoting { a }, the language with only one string, consisting of the single symbol 'a'.
• If R and S are regular expressions, then
(R) | (S) denotes L(R) U L(S)
R.S denotes L(R).L(S)
R* denotes (L(R))*
3.6. REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to define regular expressions using these names as if they were symbols. Identifiers are the set of strings of letters and digits beginning with a letter. The following regular definitions provide a precise specification for this class of strings.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier:
letter → A | B | …… | Z | a | b | …… | z
digit  → 0 | 1 | 2 | …. | 9
id     → letter (letter | digit)*
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt | if expr then stmt else stmt | ε
expr → term relop term | term
term → id | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals", because this presents an interesting structure of lexemes. The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens as far as the lexical analyzer is concerned; the patterns for the tokens are described using the following regular definitions:
digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | <= | = | <> | > | >=
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the white space.
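As a sketch of how the relop pattern above might be recognized by hand-written code (the token codes and function name are assumptions for illustration only):

#include <stdio.h>

enum { LT = 300, LE, EQ, NE, GT, GE };   /* hypothetical token codes for relop */

/* Recognize <, <=, <>, =, >, >= using one character of lookahead. */
int relop_token(void) {
    int c = getchar();
    if (c == '=') return EQ;
    if (c == '<') {
        int d = getchar();
        if (d == '=') return LE;
        if (d == '>') return NE;
        ungetc(d, stdin);                /* push back the lookahead */
        return LT;
    }
    if (c == '>') {
        int d = getchar();
        if (d == '=') return GE;
        ungetc(d, stdin);
        return GT;
    }
    ungetc(c, stdin);                    /* not a relational operator */
    return -1;
}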

