CD Unit-I
UNIT – 1
Java language processors use both compilation and interpretation, as shown in Figure 1.4. A Java
program is first compiled to produce an intermediate form called bytecode. The bytecode is then
interpreted by a virtual machine. The main advantage of this arrangement is that it supports cross-
platform execution. To achieve faster processing of inputs to outputs, a JIT (Just-In-Time)
compiler is used. It translates bytecode into machine code immediately before running the
intermediate program to process the input.
Assembler: If the source program is in assembly language and the target language is machine language,
then the translator is called an assembler.
Fig 1.6: Language Processing System
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked
together with other relocatable object files and library files into the code that actually runs on the
machine. The linker resolves external memory addresses, where the code in one file may refer to a
location in another file. The loader then places all the executable object files into memory
for execution.
Synthesis phase: - The synthesis part constructs the desired target program from the intermediate
representation and the information in the symbol table. Code Generator and Code Optimizer are the
parts of this phase. The synthesis part is the back end of the compiler. The backend deals with
machine-specific details like allocation of registers, number of allowable operators and so on.
P1.C → Front End → Intermediate Code → Back End → Target Code
The above figure shows the two-stage design approach of a compiler, using a C source file as input.
The main advantages of this two-stage design are as follows:
i. The compiler can be extended to support an additional processor by adding the required back
end of the compiler. The existing front end is completely reused in this case. This is shown in the
figure below.
ii. The compiler can be easily extended to support an additional input source language by adding
required front end. In this case, the back end is completely re-used. This is shown in fig below.
Fig.1.10: Supporting an additional language by adding a front end.
Code Optimization: - The code-optimization phase is used to produce efficient target code. The target
code generated should execute faster and consume less power.
Code optimization can also be performed on intermediate code.
Optimization performed on intermediate code is called machine-independent code optimization.
Optimization performed on target code is called machine-dependent code optimization.
Example:
t1 = id3 * 60.0
id1 = id2 + t1
Code Generator: - The code generator takes the intermediate representation of the source program and
converts it into the target code. If the target language is machine code, registers or memory locations are
selected for each of the variables used by the program. Then, the intermediate instructions are
translated into sequences of machine instructions that perform the same task.
Symbol Table: -The symbol table is a data structure containing a record for each variable name, with
fields for the attributes of the name. The data structure should be designed to allow the compiler to
find the record for each name quickly and to store or retrieve data from that record quickly.
Error Handler: - The error handler should report the presence of an error. It must report the place in the
source program where the error is detected. Common programming errors can occur at many
different levels.
Lexical errors include misspellings of identifiers, keywords, or operators.
Syntax errors include misplaced semicolons or extra or missing braces.
Semantic errors include type mismatches between operators and operands.
Logical errors can be anything from incorrect reasoning on the part of the programmer to the
use in a C program of the assignment operator = instead of the comparison operator ==.
The main goals of the error handler are:
1. Report the presence of errors clearly and accurately.
2. Recover from each error quickly enough to detect subsequent errors.
3. Add minimal overhead to the processing of correct programs.
Fig.1.13: - Phases of a compiler
Example: - Compile the statement position = initial + rate * 60
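Following the standard textbook treatment of this statement (with id1, id2, id3 denoting the
symbol-table entries for position, initial, and rate), the output of each phase can be sketched as follows.
Lexical analysis produces the token stream
<id,1> <=> <id,2> <+> <id,3> <*> <60>
Syntax analysis builds a syntax tree with = at the root, id1 as its left child, and the expression
id2 + (id3 * 60) as its right subtree.
Semantic analysis inserts a conversion of the integer 60 to floating point: id2 + id3 * inttofloat(60).
Intermediate code generation emits three-address code:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
Code optimization reduces this to:
t1 = id3 * 60.0
id1 = id2 + t1
Code generation then produces target code such as:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1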
In an implementation, activities from several phases may be grouped together into a pass. For example,
the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code
generation might be grouped together into one pass. Code optimization might be an optional
pass. Then there could be a back-end pass consisting of code generation for a particular target
machine.
Multi-pass compilation requires more memory, since the output of each phase must be stored in its
entirety. A multi-pass compiler also takes longer to compile, since it involves reading the input
in different forms (tokens, parse tree, etc.) multiple times.
In practice, compilers are designed to keep the number of passes to a minimum. The number of passes
required to process an input source program depends on the structure of the programming
language. C compilers can be implemented in a single pass, while ALGOL-68 compilers cannot.
a) Call by Value
In this parameter passing mechanism, the changes made to formal parameters are not
reflected on the actual parameters, because actual and formal parameters have
separate storage locations.
b) Call by reference
In this parameter passing mechanism, the changes made to formal parameters are
reflected on the actual parameters, because the formal parameters refer to the storage
locations of the actual parameters. Hence, any change made to a formal parameter is
reflected on the corresponding actual parameter (see the C sketch after this list).
c) Call by Name
This technique was used in early programming languages such as Algol. In this technique,
the symbolic "name" of a variable is passed, which allows it both to be accessed and updated. It
requires that the callee execute as if the actual parameter were substituted literally for the
formal parameter in the code of the callee.
Consider the example below:
procedure double(x);
real x;
begin
x:=x*2
end;
In general, the effect of pass-by-name is to substitute the argument expression in a
procedure call for the corresponding parameters in the body of the procedure, e.g.
double(c[j]) is interpreted as c[j]:=c[j]*2.
vii) Aliasing
It is possible that two formal parameters can refer to the same location; such variables are
said to be aliases of one another. Suppose a is an array belonging to a procedure p, and p
calls another procedure q(x,y) with a call q(a,a). Now, x and y have become aliases of each
other.
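A minimal C sketch of these ideas is given below (call by reference is simulated with a pointer,
since C itself passes parameters by value; the function and variable names are illustrative only):

#include <stdio.h>

/* Call by value: x is a copy, so the change is lost in the caller. */
void doubleByValue(double x) { x = x * 2; }

/* Call by reference (via a pointer): the change is visible to the caller. */
void doubleByReference(double *x) { *x = *x * 2; }

/* Aliasing: when called as q(a, a), x and y refer to the same array. */
void q(double x[], double y[])
{
    x[0] = 1.0;
    printf("y[0] = %.1f\n", y[0]);   /* prints 1.0 because y aliases x */
}

int main(void)
{
    double c = 5.0;
    double a[2] = {0.0, 0.0};

    doubleByValue(c);
    printf("after call by value: c = %.1f\n", c);      /* still 5.0 */

    doubleByReference(&c);
    printf("after call by reference: c = %.1f\n", c);  /* now 10.0 */

    q(a, a);   /* x and y become aliases of each other */
    return 0;
}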
Lexical Analysis
1. Introduction:
In this chapter we will discuss how to construct a lexical analyzer.
There are three ways to implement a lexical analyzer:
i) Use state transition diagrams to recognize the various tokens.
ii) Write code that identifies each occurrence of each lexeme on the input and returns
information about the token identified.
iii) Produce a lexical analyzer automatically by specifying the lexeme patterns to a
lexical analyzer generator and compiling those patterns into code that functions as a lexical
analyzer. This approach makes it easier to modify a lexical analyzer, since we have only to
rewrite the affected patterns, not the entire program. One such lexical analyzer generator is called LEX.
Lexical Analysis: - The first phase of a compiler is called lexical analysis or scanning. The lexical
analyzer reads the source program and groups the characters into meaningful sequences called
lexemes. It identifies the category (i.e., the token) to which each lexeme belongs. For each lexeme, the
lexical analyzer produces output in the form
<token-name, attribute-value>
This output is passed to the subsequent phase, i.e., syntax analysis.
2. Role of the Lexical Analyzer
Lexical analyzer is the first phase of a compiler. The main task of the lexical analyzer is to read the
input characters of the source program, group them into lexemes, and produce tokens for each
lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis.
When the lexical analyzer discovers a lexeme constituting an identifier, it interacts with the symbol
table to enter that lexeme into the symbol table. Commonly, the interaction is implemented by
having the parser call the lexical analyzer. The getNextToken command given by the parser
causes the lexical analyzer to read characters from its input until it can identify the next lexeme
and produce the next token, which it returns to the parser.
Sometimes lexical analyzers are divided into a cascade of two processes:
a. Scanning consists of simple processes such as deletion of comments and compaction of
consecutive whitespace characters into one.
b. Lexical analysis proper is the more complex portion, which produces the sequence of tokens as
output.
3. Lexical Analysis Vs. Parsing
There are a number of reasons for separating Lexical Analysis and Parsing.
i) To simplify the overall design of the compiler.
ii) Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized
techniques that serve only the lexical task, not the job of parsing. In addition, specialized
buffering techniques for reading input characters can speed up the compiler significantly.
iii) Compiler portability is enhanced.
4. Token, patterns and Lexemes: -
A token is a sequence of characters having a collective meaning.
A token is a pair consisting of a token name and an optional attribute value. The token name
is the category of lexical unit, e.g., a particular keyword, or a sequence of input characters
denoting an identifier etc.
A pattern is a description specifying the rules that the lexemes must follow in order to
belong to a particular token.
A lexeme is a sequence of characters in the source program that matches the pattern for a
token and is identified by the lexical analyzer as an instance of that token.
Example : printf(“total = %d\n”, score);
printf and score are lexemes matching the pattern of token ID.
In many programming languages, the following classes cover most or all of the tokens:
i) One token for each keyword. The pattern for a keyword is the same as the keyword itself.
ii) Tokens for the operators, either individually or in classes such as the token comparison
mentioned.
iii) One token representing all identifiers.
iv) One or more tokens representing constants, such as numbers and literal strings.
v) Tokens for each punctuation symbol, such as left and right parentheses, comma, and
semicolon.
5. Input Buffering
To recognize the right lexeme we have to look one or more characters beyond the next lexeme. For
example we cannot be sure that we have seen the end of identifier until we see the character that is
not a letter or digit and therefore not part of the lexeme for id. The input characters are read from
secondary storage, but reading from secondary storage one character at a time is costly. To reduce the
input processing time, a two-buffer scheme is introduced. This scheme has two buffers that are
alternately reloaded, as shown in the figure below.
Buffer Pairs (or) Two Buffer Scheme
In this method two buffers are used to store the input string. Each buffer is of the same size N, and N
is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N
characters into a buffer, rather than using one system call per character. If fewer than N characters
remain in the input file, then the eof character is used to mark the end of the file.
Two pointers to input are maintained.
i. lexemeBegin pointer marks the beginning of the current lexeme.
ii. Forward pointer scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right end. Then, after the
lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to
the character immediately after the lexeme just found. In Fig 1.2.2, forward is positioned at the
character immediately after the lexeme just found.
The first buffer and second buffer are scanned alternately. When end of current buffer is
reached the other buffer is filled. Advancing forward requires that we first test whether we have
reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer. By using this scheme we must check each
time we advance forward, that we have not moved off one of the buffers: if we do, then we must also
reload the other buffer. Thus, for each character read, we make two tests: one for the end of the
buffer, and one to determine what character is read (the latter may be a multiway branch). We can
combine the buffer-end test with the test for the current character if we extend each buffer to hold a
sentinel character at the end. The sentinel is a special character that cannot be part of the source
program, eof character is considered as sentinel. The usage of sentinels is shown in figure below.
Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than
at the end of a buffer means that the input is at an end.
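A C sketch of this lookahead code, using the sentinel, is given below (N, buf1, buf2, forward, and
fillBuffer are illustrative names; the null character '\0' stands in for the eof sentinel and is
assumed not to occur in the source program):

#include <stdio.h>

#define N 4096           /* buffer size, typically one disk block */
#define SENTINEL '\0'    /* stands in for eof; assumed absent from the source */

static char buf1[N + 1], buf2[N + 1];   /* one extra slot for the sentinel */
static char *forward;

/* Read up to N characters with one system call and write the sentinel
   just after the last character read. */
static char *fillBuffer(char *buf)
{
    size_t n = fread(buf, 1, N, stdin);
    buf[n] = SENTINEL;
    return buf;
}

/* Advance forward by one character, switching buffers at a sentinel.
   A sentinel that is not at the end of a buffer marks the real end of input. */
static char advance(void)
{
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf1 + N + 1)
            forward = fillBuffer(buf2);   /* reload second buffer */
        else if (forward == buf2 + N + 1)
            forward = fillBuffer(buf1);   /* reload first buffer */
        else
            return SENTINEL;              /* end of the entire input */
        c = *forward++;
    }
    return c;
}

int main(void)
{
    forward = fillBuffer(buf1);
    for (char c = advance(); c != SENTINEL; c = advance())
        putchar(c);   /* here: simply echo the input */
    return 0;
}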
. matches any character except a new line.
^ matches the start of the line.
$ matches end – of- the – line.
[ ] a character class; matches any single character within the brackets. [0123456789] matches 0 or 1 or 2,
etc.
[^abcd] ^ inside the square brackets matches any character except the ones in the
brackets.
| matches either the preceding regular expression or the succeeding regular expression.
( ) used for grouping regular expressions.
*, +, ? specify repetitions in regular expressions:
* zero or more occurrences
+ one or more occurrences
? zero or one occurrence
{ } indicates how many times the previous pattern is matched. E.g., da{1,3}d matches one to three
occurrences of 'a' between two d's; the strings matched are 'dad', 'daad', 'daaad'.
Regular expressions are an important notation for specifying lexeme patterns. To describe the set of
valid C identifiers we use this notation. If letter_ is established to stand for any letter or the
underscore, and digit is established to stand for any digit, then the identifiers of the C language are
defined by
letter_ ( letter_ | digit )*
The vertical bar above means union, the parentheses are used to group subexpressions, and the star
means "zero or more occurrences of". The letter_ at the beginning indicates that an identifier must
begin with a letter or an underscore.
The regular expressions are built recursively out of smaller regular expressions, using the rules
described below. Each regular expression r denotes a language L(r), which is also defined recursively
from the languages denoted by r's subexpressions. Here are the rules that define the regular
expressions over some alphabet ∑ and the languages that those expressions denote.
BASIS: There are two rules that form the basis:
1. ∊ (epsilon) is a regular expression, and L(∊) is {∊}, that is, the language whose sole member is the
empty string.
2. If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the language with one
string, of length one, with a in its one position.
INDUCTION: There are four parts to the induction whereby larger regular expressions are built from
smaller ones. Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) U L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around expressions without
changing the language they denote.
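For example, over the alphabet Σ = {a, b}: the expression a|b denotes the language {a, b};
(a|b)(a|b) denotes {aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, . . .}; (a|b)* denotes the set of all
strings of a's and b's, including the empty string; and a|a*b denotes the string a together with all
strings of zero or more a's followed by a b.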
As defined, regular expressions often contain unnecessary pairs of parentheses. We may drop certain
pairs of parentheses if we adopt the conventions that:
a) The unary operator * has highest precedence and is left associative.
b) Concatenation has second highest precedence and is left associative.
c) | has lowest precedence and is left associative.
There are a number of algebraic laws for regular expressions; each law asserts that expressions of
two different forms are equivalent. Figure below shows some of the algebraic laws that hold for
arbitrary regular expressions r, s, and t.
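The laws usually listed in such a figure (for arbitrary regular expressions r, s, and t) are:
r | s = s | r                               | is commutative
r | (s | t) = (r | s) | t                   | is associative
r (s t) = (r s) t                           concatenation is associative
r (s | t) = r s | r t ;  (s | t) r = s r | t r    concatenation distributes over |
ε r = r ε = r                               ε is the identity for concatenation
r* = (r | ε)*                               ε is guaranteed in a closure
r** = r*                                    * is idempotent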
8. Regular definitions
Regular Definitions are names given to certain regular expressions and those names can be used in
subsequent expressions as symbols of the language. If Σ is an alphabet of basic symbols, then a
regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
. . .
dn → rn
where
1. Each di is a new symbol, not in Σ and not the same as any other of the di's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, . . . , di-1}.
Example 1 : C identifiers are strings of letters, digits, and underscores. Write a regular definition for
the language of C identifiers.
Example 2 : Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4,
or 1.89E-4. Write a regular definition for unsigned numbers in C language.
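Sketches of the expected definitions, in the standard notation (where letter_ stands for any letter or
the underscore), are as follows.
For Example 1 (C identifiers):
letter_ → A | B | . . . | Z | a | b | . . . | z | _
digit → 0 | 1 | . . . | 9
id → letter_ ( letter_ | digit )*
For Example 2 (unsigned numbers):
digit → 0 | 1 | . . . | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent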
9. Recognition of Tokens
In the previous section we learned how to express patterns using regular expressions. Now, we study
how to take the patterns for all the needed tokens and build a piece of code that examines the input
string and finds a lexeme matching one of the patterns.
The example below describes a simple form of branching statements and conditional expressions.
This syntax is similar to that of the language Pascal, in that then appears explicitly after conditions.
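The grammar for this fragment, as given in the standard textbook treatment, is:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number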
The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of
tokens. The patterns for these tokens are described using regular definitions.
For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as
lexemes that match the patterns for relop, id, and number. To simplify matters, we make the common
assumption that keywords are also reserved words: that is, they are not identifiers, even though their
lexemes match the pattern for identifiers.
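The regular definitions for these tokens are along the following lines (following the standard
textbook treatment):
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>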
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the
"token" ws defined by:
ws → ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the
same names. Token ws is different from the other tokens in that, when we recognize it, we do not
return it to the parser, but rather restart the lexical analysis from the character that follows the
whitespace. The table shows, for each lexeme or family of lexemes, which token name is returned to
the parser and what attribute value is returned.
Transition Diagrams
Compiler converts regular-expression patterns to transition diagrams. Transition diagrams have a
collection of nodes or circles, called states. Each state represents a condition that could occur during
the process of scanning the input looking for a lexeme that matches one of several patterns. Edges are
directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set
of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s
labeled by a (and perhaps by other symbols, as well). If we find such an edge, we enter the state of the
transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
i) Certain states are said to be accepting, or final. These states indicate that a lexeme has been found,
although the actual lexeme may not consist of all positions between the lexemeBegin and
forward pointers.
ii) In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not
include the symbol that got us to the accepting state), then we shall additionally place a * near
that accepting state.
iii) One state is designated the start state, or initial state; it is indicated by an edge, labeled "start,"
entering from nowhere. The transition diagram always begins in the start state before any input
symbols have been read.
The transition diagram below recognizes the lexemes matching the token relop. We begin in state 0,
the start state. If we see < as the first input symbol, then among the lexemes that match the pattern
for relop we can only be looking at <, <>, or <=. We therefore go to state 1, and look at
the next character. If it is =, then we recognize lexeme <=, enter state 2, and return the token relop
with attribute LE, the symbolic constant representing this particular comparison operator. If in state
1 the next character is >, then instead we have lexeme <>, and enter state 3 to return an indication
that the not-equals operator has been found. On any other character, the lexeme is <, and we enter
state 4 to return that information. Note, however, that state 4 has a * to indicate that we must retract
the input one position. On the other hand, if in state 0 the first character we see is =, then this one
character must be the lexeme. We immediately return that fact from state 5. The remaining
possibility is that the first character is >. Then, we must enter state 6 and decide, on the basis of the
next character, whether the lexeme is >= (if we next see the = sign), or just > (on any other character).
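One way this transition diagram can be coded is sketched below in C (nextChar(), retract(), and the
attribute constants are illustrative names; a real scanner would operate on the input buffer described
earlier):

#include <stdio.h>

enum relop_attr { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

static const char *in;                         /* stands in for the forward pointer */
static char nextChar(void) { return *in++; }
static void retract(void)  { in--; }           /* corresponds to the * on a state */

static enum relop_attr getRelop(void)
{
    int state = 0;
    char c;
    while (1) {
        switch (state) {
        case 0:   /* start state */
            c = nextChar();
            if (c == '<') state = 1;
            else if (c == '=') return EQ;                  /* state 5 */
            else if (c == '>') state = 6;
            else { retract(); return NOT_RELOP; }          /* lexeme is not a relop */
            break;
        case 1:   /* seen '<' */
            c = nextChar();
            if (c == '=') return LE;                       /* state 2 */
            else if (c == '>') return NE;                  /* state 3 */
            else { retract(); return LT; }                 /* state 4, retract one position */
            break;
        case 6:   /* seen '>' */
            c = nextChar();
            if (c == '=') return GE;                       /* state 7 */
            else { retract(); return GT; }                 /* state 8, retract one position */
            break;
        }
    }
}

int main(void)
{
    in = "<= ";
    printf("attribute = %d\n", getRelop());   /* prints 1, i.e. LE */
    return 0;
}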
There are two ways to handle reserved words that look like identifiers:
i) Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates
that these strings are never ordinary identifiers and tells which token they represent. Any identifier
not initially placed in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function
getToken examines the symbol table entry for the lexeme found, and returns whatever token
name the symbol table says this lexeme represents — either id or one of the keyword tokens that
was initially installed in the table.
ii) Create separate transition diagrams for each keyword; an example for the keyword then is shown
in Fig.
Structure of LEX Program:
The LEX program consists of the following sections.
Declaration Section: Consists of regular definitions that can be used in translation rules.
Example: letter [a-zA-Z]
Apart from the regular definitions, the declaration section usually contains the # defines, C prototype
declaration of functions used in translation rules and some # include statements for C library
functions used in translation rules. All these statements are placed between the special brackets %{
and %}.
Example: %{
# define WORD 1
%}
These statements are copied into lex.yy.c.
Translation Rules Section: consists of statements in the following form
Pattern 1 { Action 1 }
Pattern 2 { Action 2 }
….
Pattern N { Action N }
Each pattern is a regular expression, which may use the regular definitions of the declaration section.
Here, Pattern 1, Pattern 2, …, Pattern N are regular expressions, and Action 1, Action 2, …, Action N
are program segments describing the actions to be taken when the corresponding pattern matches.
Auxiliary Functions section: usually contains the definition of the C functions used in the action
statements. The whole section is copied as is into lex.yy.c. These functions can be compiled
separately and loaded with the lexical analyzer.
The lexical analyzer created by Lex behaves in concert with the parser as follows. When called by the
parser, the lexical analyzer begins reading its remaining input, one character at a time, until it finds
the longest prefix of the input that matches one of the patterns Pi. It then executes the associated
action Ai. Typically, Ai will return to the parser, but if it does not (e.g., because Pi describes
whitespace or comments), then the lexical analyzer proceeds to find additional lexemes, until one of
the corresponding actions causes a return to the parser. The lexical analyzer returns a single value,
the token name, to the parser, but uses the shared, integer variable yylval to pass additional
information about the lexeme found, if needed.
The actions taken when id is matched are listed below:
1. Function installID() is called to place the lexeme found in the symbol table.
2. This function returns a pointer to the symbol-table entry, which is placed in the global variable yylval,
where it can be used by the parser or a later component of the compiler. Note that installID() has
available to it two variables that are set automatically by the lexical analyzer:
(a) yytext is a pointer to the beginning of the lexeme.
(b) yyleng is the length of the lexeme found.
3. The token name ID is returned to the parser.
The action taken when a lexeme matches the pattern number is similar, using the auxiliary
function installNum().
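A sketch of the corresponding translation rules, following the textbook example (installID() and
installNum() are user-written auxiliary functions placed in the last section of the LEX program), is:

{id}      { yylval = (int) installID();  return(ID); }
{number}  { yylval = (int) installNum(); return(NUMBER); }

int installID()  { /* place the lexeme, whose first character is pointed to by yytext
                      and whose length is yyleng, into the symbol table and return
                      a pointer to the entry */ }
int installNum() { /* similar, but puts numerical constants into a separate table */ }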
Example: Write a LEX Program to recognize tokens in the given arithmetic expression.
a=b+c*10;
%{
#include<stdio.h>
%}
letter [a-zA-Z]
digit [0-9]
id {letter}({letter}|{digit})*
num {digit}+(\.{digit}+)?
%%
{id} {printf("%s is an Identifier\n", yytext);}
{num} {printf("%s is a Number\n", yytext);}
"+" {printf("%s is an Arithmetic Operator\n", yytext);}
"-" {printf("%s is an Arithmetic Operator\n", yytext);}
"*" {printf("%s is an Arithmetic Operator\n", yytext);}
"/" {printf("%s is an Arithmetic Operator\n", yytext);}
"=" {printf("%s is an Assignment Operator\n", yytext);}
";" {printf("%s is a Punctuation\n", yytext);}
%%
int main()
{
yylex(); /* invoke the lexical analyzer */
return 0;
}
int yywrap()
{
return 1; /* return 1 when the end of input is reached */
}
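Assuming the program above is saved in a file named, say, expr.l, it can typically be built and run as:
lex expr.l
cc lex.yy.c -ll      (with flex, the library may instead be -lfl)
./a.out
The expression a=b+c*10; is then typed as input, and each recognized lexeme is reported with its category.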