Compiler Design
Module- I:
Compiler Structure: Model of compilation:
Programming languages are notations for describing computations to people and to machines.
The world as we know it depends on programming languages, because all the software running
on all the computers was written in some programming language. But, before a program can be
run, it first must be translated into a form in which it can be executed by a computer.
The software systems that do this translation are called compilers.
Language Processors
A compiler is a program that can read a program in one language — the source language —
and translate it into an equivalent program in another language — the target language; see Fig.
1.1. An important role of the compiler is to report any errors in the source program that it detects
during the translation process.
Fig. 1.1: A compiler.
If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs.
The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
A preprocessor may first collect the source program and expand macros; the modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be
linked together with other relocatable object files and library files into the code that actually runs
on the machine. The linker resolves external memory addresses, where the code in one file may
refer to a location in another file. The loader then puts together all of the executable object files
into memory for execution.
Some compilers have a machine-independent optimization phase between the front end and the
back end. The purpose of this optimization phase is to perform transformations on the
intermediate representation, so that the back end can produce a better target program than it
would have otherwise produced from an unoptimized intermediate representation.
Lexical analysis:
The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the
stream of characters making up the source program and groups the characters into meaningful
sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form <token-name, attribute-value> that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement
position = initial + rate * 60 (1.1)
The characters in this assignment could be grouped into the following lexemes and mapped into
the following tokens passed on to the syntax analyzer:
1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs no attribute-value, we have omitted the second component. We could have used any abstract symbol such as assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Blanks separating the lexemes would be discarded by the lexical analyzer.
After lexical analysis, the assignment (1.1) is represented as the sequence of tokens
<id,1> <=> <id,2> <+> <id,3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for
the assignment, addition, and multiplication operators, respectively.
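As a rough sketch (an illustration added here, with the Token structure and field names chosen only for this example), the token stream (1.2) could be represented and printed in C++ as follows:

#include <iostream>
#include <string>
#include <vector>

// A token is a pair: an abstract token name and an optional attribute value
// (here, the symbol-table index for identifiers, or -1 when unused).
struct Token {
    std::string name;      // e.g. "id", "=", "+", "*", "60"
    int symtabIndex;       // points to the symbol-table entry, -1 if unused
};

int main() {
    // Token sequence produced for: position = initial + rate * 60
    std::vector<Token> tokens = {
        {"id", 1}, {"=", -1}, {"id", 2}, {"+", -1}, {"id", 3}, {"*", -1}, {"60", -1}
    };
    for (const Token& t : tokens) {
        std::cout << "<" << t.name;
        if (t.symtabIndex != -1) std::cout << "," << t.symtabIndex;
        std::cout << "> ";
    }
    std::cout << "\n";     // prints: <id,1> <=> <id,2> <+> <id,3> <*> <60>
    return 0;
}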
Interface with input parser and symbol table:
Symbol Table is an important data structure created and maintained by the compiler in order
to keep track of semantics of variables i.e. it stores information about the scope and binding
information about names, information about instances of various entities such as variable and
function names, classes, objects, etc.
It is built in the lexical and syntax analysis phases.
The information is collected by the analysis phases of the compiler and is used by the
synthesis phases of the compiler to generate code.
It is used by the compiler to achieve compile-time efficiency.
It is used by various phases of the compiler as follows:
Lexical Analysis: Creates new entries in the table, for example entries for tokens such as
identifiers.
Syntax Analysis: Adds information regarding attribute type, scope, dimension, line of
reference, use, etc in the table.
Semantic Analysis: Uses the available information in the table to check semantics, i.e.
to verify that expressions and assignments are semantically correct (type checking), and
updates it accordingly.
Intermediate Code generation: Refers to the symbol table to know how much and what
type of run-time storage is allocated; the table also helps in adding temporary variable information.
Code Optimization: Uses information present in the symbol table for machine-
dependent optimization.
Target Code generation: Generates code by using address information of identifier
present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that
support the compiler in different phases.
Items stored in Symbol table:
Variable names and constants
Procedure and function names
Literal constants and strings
Compiler generated temporaries
Labels in source languages
Information used by the compiler from Symbol table:
Data type and name
Declaring procedures
Offset in storage
For a structure or record, a pointer to the structure table.
For parameters, whether parameter passing by value or by reference
Number and type of arguments passed to function
Base Address
Operations of Symbol table – The basic operations defined on a symbol table include insert(), which adds a new name together with its attributes, and lookup(), which searches for a name and returns its entry (if any).
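A minimal C++ sketch of these two operations follows; the entry layout, field names and class name are illustrative assumptions, not a layout prescribed by these notes:

#include <iostream>
#include <string>
#include <unordered_map>

// One symbol-table entry holding a few of the attributes listed above.
struct SymbolEntry {
    std::string name;
    std::string type;      // e.g. "int", "float"
    int offset;            // offset in storage
};

class SymbolTable {
    std::unordered_map<std::string, SymbolEntry> table;
public:
    // insert(): add a new name together with its attributes
    void insert(const SymbolEntry& e) { table[e.name] = e; }

    // lookup(): find the entry for a name; returns nullptr if the name is absent
    const SymbolEntry* lookup(const std::string& name) const {
        auto it = table.find(name);
        return it == table.end() ? nullptr : &it->second;
    }
};

int main() {
    SymbolTable st;
    st.insert({"position", "float", 0});
    if (const SymbolEntry* e = st.lookup("position"))
        std::cout << e->name << " : " << e->type << " at offset " << e->offset << "\n";
    return 0;
}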
Token:
It is basically a sequence of characters that are treated as a unit as it cannot be further broken
down. In programming languages like C language- keywords (int, char, float, const, goto,
continue, etc.) identifiers (user-defined names), operators (+, -, *, /), delimiters/punctuators
like comma (,), semicolon (;), braces ({ }), etc., and strings can be considered as tokens. This phase
recognizes three types of tokens: Terminal Symbols (TRM)- Keywords and Operators,
Literals (LIT), and Identifiers (IDN).
Let’s understand now how to count the tokens in C source code.
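As an illustrative example, the comments below count the lexemes/tokens in each statement of a small C program (the program itself is chosen only for this illustration):

#include <cstdio>
int main() {
    int a = 5, b = 10;     // tokens: int, a, =, 5, ',', b, =, 10, ;   -> 9 tokens
    int max = a + b;       // tokens: int, max, =, a, +, b, ;          -> 7 tokens
    printf("%d", max);     // tokens: printf, (, "%d", ',', max, ), ;  -> 7 tokens
    return 0;              // tokens: return, 0, ;                     -> 3 tokens
}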
Lexeme:
A lexeme is a sequence of characters in the source code that is matched by the predefined language rules (patterns) so that it can be identified as a valid token.
Example: in the statement int x = 5;, the lexemes are int, x, =, 5 and ;.
Pattern:
A pattern is the rule that the characters of a lexeme must follow for the lexeme to be a valid token.
For a keyword to be identified as a valid token, the pattern is the sequence of characters that
make the keyword.
For an identifier to be identified as a valid token, the pattern is the predefined rule that it must
start with an alphabet (letter), followed by alphabets or digits.
Criteria | Token | Lexeme | Pattern
Interpretation of type Identifier | name of a variable, function, etc. | main, a | it must start with the alphabet, followed by the alphabet or a digit
Interpretation of type Operator | all the operators are considered tokens | +, = | +, =
Interpretation of type Literal | a grammar rule or boolean literal | "Welcome to GeeksforGeeks!" | any string of characters (except ' ') between " and "
Example:
z = x + y;
This statement has the below form for syntax analyzer
<id> = <id> + <id>; //<id>- identifier (token)
The lexical analyzer not only provides a series of tokens but also builds a Symbol Table that
records the names appearing in the source code; whitespace and comments are discarded.
Lexical analysis is the process of producing tokens from the source program. It has the
following issues:
• Lookahead
• Ambiguities
1. Lookahead:
Lookahead is required to decide where one token ends and the next token begins. Simple examples that raise lookahead issues are distinguishing i from if, and = from ==. Therefore a way to describe the lexemes of each token is required.
2. Ambiguities:
The lexical analysis programs written with lex accept ambiguous specifications and choose the
longest match possible at each input point. When more than one expression can match the
current input, lex chooses as follows:
• The longest match is preferred.
• Among rules which matched the same number of characters, the rule given first is preferred.
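The following small C++ sketch illustrates the longest-match (maximal munch) idea for the = vs. == and i vs. if cases; it is a toy scanner written for this illustration, not lex itself, and the token names are assumptions:

#include <cctype>
#include <iostream>
#include <string>

// Return the next token starting at position pos, preferring the longest match.
std::string nextToken(const std::string& src, size_t& pos) {
    // "==" needs one character of lookahead beyond "="
    if (src.compare(pos, 2, "==") == 0) { pos += 2; return "EQ"; }
    if (src[pos] == '=')               { pos += 1; return "ASSIGN"; }
    // Identifiers and keywords: consume the longest run of letters/digits,
    // then decide whether the lexeme is the keyword "if".
    if (std::isalpha(static_cast<unsigned char>(src[pos]))) {
        size_t start = pos;
        while (pos < src.size() &&
               std::isalnum(static_cast<unsigned char>(src[pos]))) ++pos;
        std::string lexeme = src.substr(start, pos - start);
        return lexeme == "if" ? std::string("KEYWORD_if") : "ID(" + lexeme + ")";
    }
    ++pos;
    return "OTHER";
}

int main() {
    std::string src = "if i==j";
    size_t pos = 0;
    while (pos < src.size()) {
        if (src[pos] == ' ') { ++pos; continue; }   // skip blanks
        std::cout << nextToken(src, pos) << " ";
    }
    std::cout << "\n";   // prints: KEYWORD_if ID(i) EQ ID(j)
    return 0;
}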
Error recovery strategies for a lexical analyzer
A character sequence that cannot be scanned into any valid token is a lexical error. Misspelling
of identifiers, keyword, or operators are considered as lexical errors. Usually, a lexical error is
caused by the appearance of some illegal character, mostly at the beginning of a token.
The following are the error-recovery strategies in lexical analysis:
1) Panic Mode Recovery: Once an error is found, successive characters are ignored until we
reach a well-formed token such as end or a semicolon.
2) Deleting an extraneous character.
3) Inserting a missing character.
4) Replacing an incorrect character by a correct character.
5) Transposing two adjacent characters.
Lexical Error:
When the token pattern does not match the prefix of the remaining input, the lexical analyzer
gets stuck and has to recover from this state to analyze the remaining input. In simple words,
a lexical error occurs when a sequence of characters does not match the pattern of any token.
It is detected at compile time, during the lexical analysis phase.
Types of Lexical Error:
Types of lexical error that can occur in a lexical analyzer are as follows:
1. Exceeding length of identifier or numeric constants.
Example:
#include <iostream>
int main() {
    int x = 123456789012345678901234567890; // numeric constant too long (illustrative completion)
    return 0;
}
2. Appearance of illegal characters.
Example:
#include <iostream>
using namespace std;
int main() {
printf("Geeksforgeeks");$
return 0;
}
This is a lexical error since an illegal character $ appears at the end of the statement.
3. Unmatched string
Example:
#include <iostream>
using namespace std;
int main() {
/* comment
cout<<"GFG!";
return 0;
}
This is a lexical error since the ending of comment “*/” is not present but the beginning is
present.
4. Spelling Error.
#include <iostream>
using namespace std;
int main() {
int 3num= 1234; /* spelling error as identifier
cannot start with a number*/
return 0;
}
5. Transposition of two characters.
#include <iostream>
using namespace std;
int mian()
{
/* the spelling of main here ("mian") would be treated as a lexical
error and won't be considered as an identifier,
transposition of characters 'i' and 'a'*/
cout << "GFG!";
return 0;
}
Error Recovery Technique
A situation may arise in which the lexical analyzer is unable to proceed because none of the
patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is
"panic mode" recovery: we delete successive characters from the remaining input until the
lexical analyzer can identify a well-formed token at the beginning of what input is left.
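A rough C++ sketch of panic-mode recovery; the choice of synchronizing characters (a blank or a semicolon) and the input string are assumptions made for this illustration:

#include <iostream>
#include <string>

// Skip characters until a likely token boundary (';' or a blank) is found,
// so scanning can resume on what is hopefully a well-formed token.
size_t panicModeRecover(const std::string& src, size_t pos) {
    while (pos < src.size() && src[pos] != ';' && src[pos] != ' ')
        ++pos;
    return pos;
}

int main() {
    std::string src = "x = @#$ ; y = 2;";
    size_t pos = 4;                        // suppose the scanner is stuck at the illegal '@'
    pos = panicModeRecover(src, pos);      // the characters '@', '#', '$' are discarded
    std::cout << "scanning resumes at position " << pos << "\n";   // prints 7 (the blank before ';')
    return 0;
}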
Input buffering:
The lexical analyzer scans the input from left to right one character at a time. It uses two
pointers, begin ptr (bp) and forward ptr (fp), to keep track of the portion of the input scanned.
Initially both pointers point to the first character of the input string.
The forward ptr moves ahead to search for the end of the lexeme. As soon as a blank space is
encountered, it indicates the end of the lexeme; for instance, as soon as fp encounters a blank
space after the characters i, n, t, the lexeme "int" is identified. When fp encounters white space,
it ignores it and moves ahead; then both the begin ptr (bp) and forward ptr (fp) are set to the
start of the next token. The input characters are thus read from secondary storage, but reading
this way from secondary storage is costly, hence a buffering technique is used. A block of data
is first read into a buffer and then scanned by the lexical analyzer. There are two methods used
in this context: the One Buffer Scheme and the Two Buffer Scheme. These are explained below.
1. One Buffer Scheme: In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that if a lexeme is very long, it crosses the buffer boundary; to
scan the rest of the lexeme the buffer has to be refilled, which overwrites the first part of the
lexeme.
2. Two Buffer Scheme: To overcome the problem of the one buffer scheme, in this method two
buffers are used to store the input string. The first and second buffers are scanned alternately;
when the end of the current buffer is reached, the other buffer is filled. The only remaining
problem is that if the length of a lexeme is longer than the length of a buffer, the input cannot
be scanned completely. Initially both bp and fp point to the first character of the first buffer.
Then fp moves towards the right in search of the end of the lexeme; as soon as a blank character
is recognized, the string between bp and fp is identified as the corresponding token. To identify
the boundary of the first buffer, an end-of-buffer character is placed at the end of the first
buffer. Similarly, the end of the second buffer is recognized by the end-of-buffer mark placed at
its end. When fp encounters the first eof, the end of the first buffer is recognized and filling of
the second buffer begins. In the same way, when the second eof is reached, it indicates the end
of the second buffer. Alternately, both buffers can be refilled until the end of the input program
is reached and the whole stream of tokens is identified. This eof character introduced at the end
is called a sentinel, and it is used to identify the end of the buffer.
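A simplified C++ sketch of the two-buffer scheme with sentinels; the buffer size, the refill logic and the use of an in-memory string as the "input file" are assumptions made for this illustration:

#include <cstring>
#include <iostream>

const int N = 16;                  // size of each buffer half
char buf[2 * N + 2];               // two buffers, each followed by a sentinel slot

int main() {
    const char* input = "int position = initial + rate * 60;";
    std::strncpy(buf, input, N);   // fill the first buffer
    buf[N] = '\0';                 // sentinel at the end of the first buffer
    char* forward = buf;           // the forward pointer fp
    while (true) {
        if (*forward == '\0') {
            if (forward == buf + N) {
                // sentinel of the first buffer reached: refill the second buffer
                std::strncpy(buf + N + 1, input + N, N);
                buf[2 * N + 1] = '\0';         // sentinel at the end of the second buffer
                forward = buf + N + 1;
            } else {
                // sentinel of the second buffer (or real end of input);
                // wrapping around to refill the first buffer is omitted here
                break;
            }
        } else {
            std::cout << *forward;             // "scan" one character
            ++forward;
        }
    }
    std::cout << "\n";
    return 0;
}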
Specification of tokens:
In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For
example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings: the most important operations are concatenation (the concatenation of x and y is written xy), exponentiation (s^0 = ε and s^i = s^(i-1)s), and forming prefixes, suffixes, substrings, and subsequences of a string.
Operations on languages:
The following are the operations that can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
The following example shows the operations on languages: let L = {0, 1} and S = {a, b, c}. Then
L ∪ S = {0, 1, a, b, c}
LS = {0a, 0b, 0c, 1a, 1b, 1c}
L* = the set of all strings of 0's and 1's, including the empty string ε
L+ = the set of all strings of one or more 0's and 1's
Regular Expressions
· Each regular expression r denotes a language L(r).
· Here are the rules that define the regular expressions over some alphabet Σ
and the languages that those expressions denote
1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the empty string.
2. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with ‘a’ in its one position.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a) (r)|(s) is a regular expression denoting the language L(r) U L(s).
b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).
4. The unary operator * has highest precedence and is left associative.
5. Concatenation has second highest precedence and is left associative.
6. | has lowest precedence and is left associative.
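For example, over the alphabet Σ = {a, b}: the expression a|b denotes the language {a, b}; (a|b)(a|b) denotes {aa, ab, ba, bb}; a* denotes {ε, a, aa, aaa, ...}; and (a|b)* denotes the set of all strings of a's and b's, including the empty string.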
Regular set
A language that can be defined by a regular expression is called a regular set. If two
regular expressions r and s denote the same regular set, we say they are equivalent and
write r = s.
There are a number of algebraic laws for regular expressions that can be used to
manipulate regular expressions into equivalent forms.
For instance, r|s = s|r is commutative; r|(s|t)=(r|s)|t is associative.
Regular Definitions
For notational convenience, we may give names to certain regular expressions and use those names in subsequent expressions. For example:
letter → A | B | ... | Z | a | b | ... | z
digit → 0 | 1 | ... | 9
id → letter ( letter | digit )*
Shorthands
Common shorthands include: one or more instances (r+ denotes the language L(r)+), zero or one instance (r? denotes L(r) ∪ {ε}), and character classes ([a-z] denotes a | b | ... | z).
Non-regular Set
A language that cannot be described by any regular expression is a non-regular set. For example, the set of strings of balanced parentheses cannot be described by a regular expression, because regular expressions cannot count.
Regular Languages are the most restricted types of languages and are accepted by finite
automata.
Regular Expressions.
Regular Expressions are used to denote regular languages. An expression is regular if:
ɸ is a regular expression for regular language ɸ.
ɛ is a regular expression for regular language {ɛ}.
If a ∈ Σ (Σ represents the input alphabet), a is a regular expression with language {a}.
If a and b are regular expressions, a + b is also a regular expression with language {a, b}.
If a and b are regular expressions, ab (concatenation of a and b) is also regular.
If a is a regular expression, a* (0 or more occurrences of a) is also regular.
Regular Grammar : A grammar is regular if it has rules of form A -> a or A -> aB or A -> ɛ
where ɛ is a special symbol called NULL.
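For example, the grammar with rules S -> aS, S -> bS and S -> ɛ is regular and generates all strings over {a, b}.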
Union : If L1 and If L2 are two regular languages, their union L1 ∪ L2 will also be regular.
For example, L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1 ∪ L2 = {a^n ∪ b^n | n ≥ 0} is also regular.
Intersection : If L1 and If L2 are two regular languages, their intersection L1 ∩ L2 will also
be regular. For example,
L1 = {a^m b^n | n ≥ 0 and m ≥ 0} and L2 = {a^m b^n ∪ b^n a^m | n ≥ 0 and m ≥ 0}
L3 = L1 ∩ L2 = {a^m b^n | n ≥ 0 and m ≥ 0} is also regular.
Concatenation : If L1 and If L2 are two regular languages, their concatenation L1.L2 will also
be regular. For example,
L1 = {a^n | n ≥ 0} and L2 = {b^n | n ≥ 0}
L3 = L1.L2 = {a^m b^n | m ≥ 0 and n ≥ 0} is also regular.
Kleene Closure : If L1 is a regular language, its Kleene closure L1* will also be regular. For
example,
L1 = (a ∪ b)
L1* = (a ∪ b)*.
Complement : If L(G) is regular language, its complement L’(G) will also be regular.
Complement of a language can be found by subtracting strings which are in L(G) from all
possible strings. For example,
L(G) = {a^n | n > 3}
L’(G) = {a^n | n ≤ 3}
Note : Two regular expressions are equivalent if languages generated by them are same. For
example, (a+b*)* and (a+b)* generate same language. Every string which is generated by
(a+b*)* is also generated by (a+b)* and vice versa.
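One way to see this: each repetition inside (a+b*)* contributes either an a or a (possibly empty) run of b's, so any string over {a, b} can be assembled and nothing outside {a, b}* can be produced; since (a+b)* also denotes exactly the set of all strings over {a, b}, the two expressions are equivalent.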
Question 1: Which one of the following languages over the alphabet {0,1} is described by the
regular expression (0+1)*0(0+1)*0(0+1)* ?
(A) The set of all strings containing the substring 00.
(B) The set of all strings containing at most two 0’s.
(C) The set of all strings containing at least two 0’s.
(D) The set of all strings that begin and end with either 0 or 1.
Solution: Option A says that every string must have the substring 00. But 10101 is also part of
the language and it does not contain 00 as a substring, so this option is not correct.
Option B says that a string can have at most two 0’s, but 00000 is also part of the language, so
this option is not correct.
Option C says that a string must contain at least two 0’s. In the regular expression, two 0’s are
present, so this is the correct option.
Option D says that the language is the set of all strings that begin and end with either 0 or 1.
But the expression can also generate strings which start with 0 and end with 1, or vice versa, so
it is not correct.
Solution: Two regular expressions are equivalent if the languages generated by them are the same.
Option (A) can generate all strings generated by 0*(10*)*, so they are equivalent.
Option (B): the null string cannot be generated by the given language, but 0*(10*)* can generate
it, so they are not equivalent.
Option (C) will always have 10 as a substring, but 0*(10*)* may or may not, so they are not equivalent.
Question 4 : The regular expression for the language having input alphabets a and b, in which
two a’s do not come together:
(A) (b + ab)* + (b +ab)*a
(B) a(b + ba)* + (b + ba)*
(C) both options (A) and (B)
(D) none of the above
Solution:
Option (C), stating that both options (A) and (B) are correct, is the right answer for the stated
question.
The language in the question can be expressed as
L = {ε, a, b, bb, ab, aba, ba, bab, baba, abab, …}.
In option (A), ‘ab’ is considered the building block for finding the required regular
expression. (b + ab)* covers all cases of strings ending with ‘b’, and (b + ab)*a covers all
cases of strings ending with ‘a’.
Applying similar logic to option (B), we can see that the regular expression is derived
considering ‘ba’ as the building block, and it covers all cases of strings starting with ‘a’ and
strings starting with ‘b’.