Module 1-1
B.Tech, CSE, S6
MODULE - I
INTRODUCTION TO COMPILERS AND LEXICAL ANALYSIS
1. INTRODUCTION
All software is written in some programming language. But before a program can
be run, it must be translated into a form in which it can be executed by a computer.
This translation is done by a software system known as a compiler.
Language Processors
A compiler is a program that can read a program in one language (the source
language) and translate it into an equivalent program in another language
(the target language).
An important role of the compiler is to report any errors in the source
program that it detects during the translation process.
The figure below shows a language-processing system (Fig 2.1). There we can see
that, in addition to a compiler, we need several other programs to create an
executable target program.
Step 1: Preprocessor
The source program acts as the input to the preprocessor. A source program may
be divided into modules stored in separate files. The preprocessor modifies
the source program by replacing the header files with their contents.
This is known as file inclusion. It also performs macro processing, that is,
expansion of macros into source-language statements. This modified source
program is then fed to the compiler.
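For instance, the effect of these two tasks on a small C fragment might look as
follows (the file name circle.h and the macro PI are invented for this illustration):

    /* area.c, before preprocessing */
    #include "circle.h"        /* file inclusion: this line is replaced by
                                  the full contents of circle.h            */
    #define PI 3.14159         /* macro definition                         */

    double area(double r) {
        return PI * r * r;     /* macro use                                */
    }

    /* After preprocessing, the #include and #define lines are gone, the
       declarations from circle.h appear in their place, and the macro has
       been expanded so the body reads:  return 3.14159 * r * r;          */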
Step 2: Compiler
The compiler translates the modified source program, written in a high-level
language, into the target program. If the target program is in machine language,
it can be executed directly. If the target program is in assembly language, it
is fed to the assembler. A compiler may produce an assembly-language
program as its output, because assembly language is easier to produce as
output and is easier to debug.
Step 3: Assembler
The assembler translates the assembly-language code into relocatable
machine code.
Step 4: Linker/Loader
Large programs are often compiled in pieces, so the relocatable machine code
may have to be linked together with other relocatable object files and library
files into the code that actually runs on the machine. This task is performed by
the Linker. The Loader loads this integrated code into memory for execution.
The output of the Linker/Loader is the equivalent machine language code of
the source code.
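As a rough illustration, a program split across two C modules could be built like
this (the cc commands sketch a typical Unix toolchain; exact command names vary
by system):

    /* square.c : one separately compiled module */
    int square(int x) { return x * x; }

    /* main.c : another module that refers to square() */
    extern int square(int);
    int main(void) { return square(6); }

    /*  Typical build steps:
          cc -c square.c           -> square.o  (relocatable machine code)
          cc -c main.c             -> main.o    (relocatable machine code)
          cc square.o main.o -o a.out           (the linker resolves the
                                                 reference to square and
                                                 produces the executable;
                                                 the loader then brings
                                                 a.out into memory to run) */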
A compiler contains two parts for the conversion of the source program into the
target code:
Analysis part (front end)
Synthesis part (back end)
Analysis part:
Breaks the source program into constituent pieces and imposes a grammatical
structure on them.
This structure is then used to create an intermediate representation of the
source program.
If any errors (syntactic or semantic) are found during this transformation, the
analysis part provides informative messages to the user about them.
It also collects information about the source program and stores it in a data
structure called the Symbol Table, which is passed along with the intermediate
representation to the synthesis part.
Synthesis part:
Synthesis part constructs the target code from the intermediate code and
information from the symbol table
Lexical Analysis:
The first phase of the compiler, also called scanning, reads the stream of
characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output
a token of the form <token-name, attribute-value>, which it passes on to the next
phase. For example, suppose the source program contains the assignment statement
position = initial + rate * 60
The characters in this statement are grouped into the following lexemes and tokens:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is
an abstract symbol standing for identifier and 1 points to the symbol-table
entry for position. The symbol-table entry for an identifier holds information
about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token <=>.
Since this token needs no attribute value, we have omitted the second
component.
3. initial is a lexeme that is mapped into the token <id, 2>, where 2 points to
the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token <+>.
5. rate is a lexeme that is mapped into the token <id, 3>, where 3 points to the
symbol-table entry for rate.
6. * is a lexeme that is mapped into the token <*>.
7. 60 is a lexeme that is mapped into the token <60>.
After lexical analysis, the assignment statement is thus represented by the token
sequence:
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
Syntax Analysis:
The second phase of the compiler, also called parsing, uses the tokens produced by
the lexical analyzer to create a tree-like intermediate representation, typically a
syntax tree, that depicts the grammatical structure of the token stream. In a syntax
tree, each interior node represents an operation and the children of the node
represent the arguments of the operation.
Semantic Analysis:
It checks whether the parse tree constructed in the syntax analysis phase
follows the rules of the language.
An important part of semantic analysis is type checking, where the compiler
checks that each operator has matching operands.
The language specification may permit some type conversions called
coercions. For example, a binary arithmetic operator may be applied to either
a pair of integers or to a pair of floating point numbers. If the operator is
applied to a floating point number and an integer, the compiler may convert
or coerce the integer into a floating point number.
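A small C fragment illustrating such a coercion (the variable names here are
invented for the example):

    int main(void) {
        float rate = 2.5f;
        float pay  = rate * 60;  /* 60 is an integer constant: the compiler
                                    coerces it to floating point, as if we
                                    had written rate * (float)60           */
        return (int)pay;         /* an explicit conversion the other way   */
    }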
Suppose that in our example position, initial and rate have been declared as
floating-point numbers. Then the type checker will discover that the operator * is
applied to a floating-point number rate and an integer 60. In that case, the
integer 60 will be converted to a floating-point number by applying the
inttofloat operator.
Finally, the semantic analyzer produces an annotated syntax tree as its
output.
Intermediate Code Generation:
While translating the source program into the target code, a compiler may construct
one or more intermediate representations. Syntax trees are a form of intermediate
representation, produced during syntax analysis and semantic analysis.
After syntax and semantic analysis, many compilers generate an explicit low-level or
machine-like intermediate representation. This intermediate representation has two
properties:
It should be easy to produce
It should be easy to translate into the target machine code
Here we consider an intermediate representation called three-address code, which
consists of a sequence of assembly-like instructions with three operands per
instruction. The three-address code output for our previous example is:
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
In three-address code,
1. Each three-address assignment instruction has at most one operator on the
right side. Thus, these instructions fix the order in which operations are to be
done; for example, the multiplication precedes the addition in our source program.
2. The compiler must generate a temporary name to hold the value computed by
each three-address instruction.
3. Some three-address instructions (the first and last instructions in our example)
have fewer than three operands.
Code Optimization
The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code results: usually faster code, but
possibly also shorter code or code that consumes less power. For our example, the
optimizer can deduce that the conversion of 60 from integer to floating point can
be done once and for all at compile time, so the inttofloat operation can be
eliminated by replacing the integer 60 with the floating-point number 60.0.
Moreover, t3 is used only once, so the optimized code becomes:
t1 = id3 * 60.0
id1 = id2 + t1
Code Generation
The code generator takes the intermediate representation of the source program as
input and maps it into the target language. If the target language is machine code,
registers or memory locations are selected for each of the variables used by the
program, and the intermediate instructions are translated into sequences of machine
instructions. For example, using registers R1 and R2, the optimized intermediate
code above might get translated into the machine code:
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
The first operand of each instruction specifies a destination. The F in each instruction
tells us that it deals with floating-point numbers.
Symbol Table Management
The symbol table is a data structure containing a record for each variable name used
in the source program, along with its attributes. These attributes may provide
information about the storage allocated for a name, its type, its scope, and in case of
procedure names, things such as the number and types of its arguments, the method
of passing each argument, and the return type. The symbol table should be designed
to allow the compiler to find the record for each name quickly, and to store and
retrieve data from that record quickly.
Symbol table entries are created and used during the analysis phase by the lexical
analyzer, the parser, and the semantic analyzer. In some cases, a lexical analyzer can
create a symbol-table entry as soon as it sees the characters that make up a lexeme.
More often, the lexical analyzer returns a token, say id, to the parser. Only the parser
can decide whether to use a previously created symbol-table entry or create a new
one for this identifier.
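As an illustration of the requirements above, the following is a minimal C sketch
of a symbol table organized as a chained hash table; the field names and sizes are
assumptions for this example, not a prescribed design:

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 211                  /* a small prime for the hash table */

    struct symbol {
        char          *name;              /* the lexeme                       */
        char          *type;              /* e.g. "float"                     */
        int            scope;             /* nesting depth where declared     */
        struct symbol *next;              /* chaining resolves collisions     */
    };

    static struct symbol *bucket[NBUCKETS];

    static unsigned hash(const char *s) {
        unsigned h = 0;
        while (*s) h = h * 31 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    /* Find a name, or create a fresh entry for it: this gives the fast
       store/retrieve access the text asks of a symbol table. */
    struct symbol *lookup_or_insert(const char *name) {
        unsigned h = hash(name);
        for (struct symbol *q = bucket[h]; q; q = q->next)
            if (strcmp(q->name, name) == 0)
                return q;                  /* previously created entry */
        struct symbol *p = malloc(sizeof *p);
        p->name  = strdup(name);           /* strdup is POSIX          */
        p->type  = NULL;
        p->scope = 0;
        p->next  = bucket[h];
        bucket[h] = p;
        return p;                          /* newly created entry      */
    }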
Error Handling
Each phase of the compiler may encounter errors. However, after detecting an error,
a phase must somehow deal with it so that compilation can proceed, allowing
further errors in the source program to be detected. The tasks of the
error-handling process are to detect each error, report it to the user, and then
devise and implement a recovery strategy to handle it. This whole process should
not noticeably slow down compilation. Error handling thus involves three stages:
Error Detection
Error Reporting
Error Recovery
4. COMPILER CONSTRUCTION TOOLS
Compiler writers use general software-development tools and more specialized tools
for implementing various phases of a compiler. Some commonly used compiler
construction tools include the following:
1. Parser generators, which automatically produce syntax analyzers from a
grammatical description of a programming language.
2. Scanner generators, which produce lexical analyzers from a regular-expression
description of the tokens of a language.
3. Syntax-directed translation engines, which produce collections of routines for
walking a parse tree and generating intermediate code.
4. Code-generator generators, which produce a code generator from a collection of
rules for translating each operation of the intermediate language into the machine
language for a target machine.
5. Data-flow analysis engines, which facilitate the gathering of information about
how values are transmitted from one part of a program to each other part. Data-flow
analysis is a key part of code optimization.
6. Compiler-construction toolkits, which provide an integrated set of routines for
constructing various phases of a compiler.
5. BOOTSTRAPPING
Bootstrapping is the technique of using an existing compiler or translator to
build a new compiler or translator, often for the same source language.
Example:
Suppose we have a Pascal translator written in the C language that takes Pascal
code as input and produces C code as output. Now we want to create the same
Pascal translator written in C++.
Step 3: Finally, we compile the first compiler using the second compiler
Step 4: Thus we have created a compiler written in C++ that converts Pascal code
to C code.
Cross Compiler: A cross compiler is a compiler that runs on one machine but
produces output for another machine; that is, a compiler capable of creating
executable code for a platform other than the one on which the compiler itself
is running.
6. LEXICAL ANALYSIS
The main task of the lexical analyzer is to read the input characters of the
source program, group them into lexemes, and produce as output a token for
each lexeme in the source program. The stream of tokens is then sent to the
parser for syntax analysis.
The lexical analyzer interacts with the symbol table. When the lexical
analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table.
These interactions are shown in the figure below. They are implemented by
having the parser call the lexical analyzer: the getNextToken call causes
the lexical analyzer to read characters from its input until it can identify
the next lexeme and produce the token for it.
Fig 6.1: Interactions between the lexical analyzer and the parser
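In code, this interaction could be sketched as follows; the token codes and the
struct layout are assumptions made for illustration, only the getNextToken call
itself comes from the figure:

    /* A token as handed from the lexical analyzer to the parser. */
    enum { TOK_EOF, TOK_ID, TOK_NUMBER, TOK_PLUS };   /* illustrative codes */

    struct token {
        int name;        /* one of the token codes above                   */
        int attribute;   /* e.g. index of the symbol-table entry for an id */
    };

    /* Implemented by the lexical analyzer: reads characters until the
       next lexeme is identified, then returns its token. */
    struct token getNextToken(void);

    /* The parser drives the scanner, pulling one token at a time. */
    void parse(void) {
        for (struct token t = getNextToken();
             t.name != TOK_EOF;
             t = getNextToken()) {
            /* ... grammar actions that use t ... */
        }
    }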
Besides the identification of tokens, the other tasks of a lexical analyzer are:
Stripping out comments and whitespace (blanks, newlines, tabs, etc.)
Correlating error messages with the source program, that is, associating each
error message with its place of occurrence by keeping track of line numbers
Expanding macros found in the source program
The reasons why the analysis portion of a compiler is normally separated into
lexical analysis (scanning) and syntax analysis (parsing) are:
1. Simplicity of design: separating the two tasks allows us to simplify at least
one of them.
2. Improved compiler efficiency: a separate lexical analyzer lets us apply
specialized techniques, such as buffering, that speed up the reading of input.
3. Enhanced compiler portability: input-device-specific peculiarities can be
restricted to the lexical analyzer.
Tokens, Patterns and Lexemes
Tokens are basically sequences of characters that are treated as a single unit,
as they cannot be further broken down. Tokens include keywords (int, float,
goto, continue, break, etc.), identifiers (user-defined names), operators
(+, -, /, *), and delimiters or punctuation such as the comma (,), semicolon (;),
braces ({ }), etc.
A pattern is a set of rules that the scanner or the lexical analyzer follows to
create a token. For example, in case of keywords, the pattern is just the
sequence of characters that form the keyword.
A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of
that token.
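For example, for the C statement float rate = 5.0 ; the lexemes, their tokens,
and the matched patterns can be tabulated as follows (an illustrative breakdown):

Lexeme      Token          Pattern
float       keyword        the character sequence f, l, o, a, t
rate        id             letter followed by letters or digits
=           assign_op      the character =
5.0         number         any numeric constant
;           punctuation    the character ;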
Attributes of Tokens
When more than one lexeme can match a pattern, the lexical analyzer must provide
additional information about the particular lexeme that matched. For example, the
pattern for token number matches both 0 and 1. Thus, in many cases the lexical
analyzer returns to the parser not only the token name, but also an attribute value
that describes the lexeme represented by the token. The most important example is
the token id, where we need to associate with the token a great deal of
information. This information may include its lexeme, its type, and the location
at which it is first found.
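For example, the token names and associated attribute values for the Fortran
statement
E = M * C ** 2
can be written as the following sequence of pairs:
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>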
Lexical Errors
Without the help of other components, it is hard for the lexical analyzer to tell
that there is a source-code error. For instance, if the string fi is encountered
for the first time in a C program in the context:
fi ( a == f(x) ) …
a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or
an undeclared function identifier. Since fi is a valid lexeme for the token id,
the lexical analyzer must return the token id to the parser and let some other
phase of the compiler, probably the parser in this case, handle the error.
Suppose that a situation arises in which the lexical analyzer is unable to proceed
because none of the patterns for tokens matches any prefix of the remaining input.
Then it must perform some error-recovery strategy. The simplest strategy is
panic-mode recovery: delete successive characters from the remaining input until
the lexical analyzer can find a well-formed token at the beginning of what input
is left. Other possible recovery actions are:
1. Delete one character from the remaining input
2. Insert a missing character into the remaining input
3. Replace a character by another character
4. Transpose two adjacent characters
Input Buffering
The task of reading the source program can be sped up using input buffering.
There are many situations where we need to look at least one additional character
ahead before we can be sure we have the right lexeme. For instance, we cannot be
sure we have seen the end of an identifier until we see a character that is not a
letter or a digit. Similarly, in C, single-character operators like -, =, <, > can
also be the beginning of two-character operators like ==, -=, <=, >=. Thus, a
two-buffer scheme is introduced to handle such lookaheads safely.
Both the buffers are of the same size N, and N is usually the size of a disk block.
Using one system read command we can read N characters into a buffer, rather
than using one system call per character. If fewer than N characters remain in the
input file, then a special character, represented by eof, marks the end of the source
file.
Two pointers to the input are maintained:
1. The pointer lexemeBegin marks the beginning of the current lexeme, whose
extent we are attempting to determine.
2. The pointer forward scans ahead until a pattern match is found.
Once the next lexeme is determined, forward is set to the character at its right
end. Then, after the lexeme is recorded as a token, lexemeBegin is set to the
character immediately after the lexeme just found.
Advancing forward requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and
move forward to the beginning of the newly loaded buffer.
Sentinels
In the above scheme, each time we advance forward, we must check that we
have not moved off one of the buffers; if we have, then we must also reload the
other buffer. Thus, for each character read, we make two tests: one for the end
of the buffer, and one to determine what character is read. We can combine the
buffer-end test with the test for the current character if we extend each buffer
to hold a special character, known as the sentinel character, at the end. Usually
we use eof as the sentinel character.
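The lookahead code can then be organized as below; this is a minimal C sketch of
the sentinel scheme, assuming '\0' as the sentinel and fread as the system read
(initialization and the lexemeBegin bookkeeping are omitted):

    #include <stdio.h>

    #define N 4096           /* buffer size; typically one disk block     */
    #define SENTINEL '\0'    /* stands in for the eof sentinel; assumes
                                the source contains no NUL bytes          */

    static char  buf[2][N + 1];  /* two buffers, one extra slot each for
                                    the sentinel                          */
    static char *forward;        /* the scanning pointer                  */
    static FILE *src;            /* the source file                       */

    /* Read up to N characters into buffer b with one system read,
       and plant the sentinel immediately after them. */
    static void reload(int b) {
        size_t n = fread(buf[b], 1, N, src);
        buf[b][n] = SENTINEL;    /* if n < N this marks real end of file  */
        forward = buf[b];
    }

    /* Advance forward and return the next character.  Only one test is
       made per character; buffer ends are detected via the sentinel. */
    static int next_char(void) {
        char c = *forward++;
        if (c != SENTINEL)
            return (unsigned char)c;
        if (forward == buf[0] + N + 1)       /* sentinel ended buffer 0   */
            reload(1);
        else if (forward == buf[1] + N + 1)  /* sentinel ended buffer 1   */
            reload(0);
        else
            return EOF;                      /* sentinel inside a buffer:
                                                real end of input         */
        return next_char();                  /* retry in the fresh buffer */
    }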
Specification of Tokens
Regular expressions are the most important notation for specifying tokens.
If x and y are strings, then the concatenation of x and y, denoted xy, is the string
formed by appending y to x. For example, if x = dog and y = house, then xy =
doghouse. The empty string ε is the identity under concatenation; that is, for any
string s, εs = sε = s.
If we think of concatenation as a product, we can define the "exponentiation" of
strings as follows. Define s^0 to be ε, and for all i > 0, define s^i to be
s^(i-1)s. Since εs = s, it follows that s^1 = s. Then s^2 = ss, s^3 = sss, and
so on.
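For example, if s = ab, then s^0 = ε, s^1 = ab, s^2 = abab, and s^3 = ababab.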
Operations on Languages
The most important operations on languages are:
Union
Concatenation
Kleene closure
Positive closure
Union is the familiar operation on sets. The concatenation of languages is all strings
formed by taking a string from the first language and a string from the second
language, in all possible ways, and concatenating them. The Kleene closure of a
language L, denoted L*, is the set of strings you get by concatenating L zero or more
times. Note that L^0, the "concatenation of L zero times," is defined to be {ε}.
Finally, the positive closure, denoted L+, is the same as the Kleene closure, but
without the term L^0. That is, ε will not be in L+ unless it is in L itself.
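For example, taking L = {a, b} and M = {0, 1}:
L ∪ M = {a, b, 0, 1}
LM = {a0, a1, b0, b1}
L* = {ε, a, b, aa, ab, ba, bb, aaa, …}
L+ = L* − {ε}, since ε is not in L itself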
Note: Sometimes the empty string ε is also represented using lambda (λ).
Regular Expression
Rules that define the regular expressions over some alphabet Σ and the languages
that those expressions denote are:
1. ε is a regular expression denoting the set {ε}; i.e., if R = ε, then L(R) = {ε}
2. ∅ is a regular expression denoting the empty set; i.e., if R = ∅, then L(R) = { }
3. If a is a symbol in Σ, then a is a regular expression, and L(R) = {a}
The above three rules are also known as primitive rules (i.e., the minimal or basis
rules). In the remaining rules, let's assume that r and s are regular expressions
denoting the languages L(r) and L(s) respectively.
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s); i.e., the
union of two regular expressions is also regular.
2. (r)(s) is a regular expression denoting the language L(r)L(s); i.e., the
concatenation of two regular expressions is also regular.
3. (r)* is a regular expression denoting (L(r))*; i.e., the Kleene closure of a
regular expression is also regular.
4. (r) is a regular expression denoting L(r); i.e., we can add additional pairs
of parentheses around expressions without changing the language they denote.
Some of the algebraic laws that hold for arbitrary regular expressions r, s and t
are given below:

LAW                                DESCRIPTION
r|s = s|r                          | is commutative
r|(s|t) = (r|s)|t                  | is associative
r(st) = (rs)t                      concatenation is associative
r(s|t) = rs|rt ; (s|t)r = sr|tr    concatenation distributes over |
εr = rε = r                        ε is the identity for concatenation
r* = (r|ε)*                        ε is guaranteed in a closure
r** = r*                           * is idempotent
Regular Definitions
For notational convenience, we give names to certain regular expressions and use
those names in subsequent expressions, as if the names were themselves symbols. If
Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form:
d1 → r1
d2 → r2
…
dn → rn
where:
1. Each di is a new symbol, not in Σ and not the same as any other of the d's, and
2. Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}
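For example, C-style identifiers, a letter or underscore followed by letters,
digits, or underscores, can be described by the regular definition:
letter_ → A | B | … | Z | a | b | … | z | _
digit → 0 | 1 | … | 9
id → letter_ ( letter_ | digit )*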
Extensions of Regular Expressions
Several notational extensions enhance the ability of regular expressions to
specify string patterns:
1. One or more instances. The unary postfix operator + represents the positive
closure of a regular expression and its language; that is, (r)+ denotes (L(r))+.
2. Zero or one instance. The unary postfix operator ? means "zero or one
occurrence"; that is, r? is equivalent to r|ε.
3. Character classes. A regular expression a1|a2|…|an, where the ai's are each
symbols of the alphabet, can be replaced by the shorthand [a1a2…an]. More
importantly, if a1, a2, …, an form a logical sequence, e.g., consecutive uppercase
letters, lowercase letters, or digits, we can replace them by a1-an, that is, just
the first and last symbols separated by a hyphen. Thus [abc] is shorthand for
a|b|c, and [a-z] is shorthand for a|b|…|z.