UNIT-I Compiler Design (R22)
Overview of language processing: preprocessors, compilers, assemblers, linkers & loaders, difference between compiler and interpreter, structure of a compiler, phases of a compiler.
Lexical Analysis: Role of lexical analysis, lexical analysis vs. parsing, tokens, patterns and lexemes, lexical errors, regular expressions, regular definitions for language constructs, strings, sequences, comments, transition diagrams for recognition of tokens, reserved words and identifiers, examples.
Translator: A translator is a system that converts a program from one language to another. Compilers and interpreters are examples of translators.
Basic functions of a translator:
• It should convert a program from one form to another.
• It should not change the meaning of the source program while converting it.
• It should convert the source program into a form the computer can easily understand.
• It should perform the conversion at a speed that matches the computer's speed.
• It should not only translate, but also locate, repair (to some extent), and report errors to the programmer.
Compiler: A compiler is a program that reads a program written in one language (source language) and
translates it into an equivalent program in another language (target language).
Interpreter: An interpreter is similar to a compiler, but it does not translate the entire program at once. An interpreter is also a program that reads the source program and executes it line by line.
Example: Java language processors combine compilation and interpretation. A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted on another machine, perhaps across a network.
Compiler                                             Interpreter
• Compilers are larger in size.                      • Interpreters are smaller in size.
• A compiler requires more processing.               • An interpreter requires less processing.
• It is less secure.                                 • It is more secure.
• It is faster.                                      • It is slower.
• Compilers are not portable.                        • Interpreters are portable.
• Compilers are not flexible.                        • Interpreters are relatively flexible.
• Compilers are more efficient.                      • Interpreters are less efficient.
• Compilers can be implemented easily in             • Interpreter implementation is not an
  any programming language.                            easy task.
• Programming languages that use compilers           • Programming languages that use interpreters
  include C, C++, C#, etc.                             include Python, Ruby, Perl, MATLAB, etc.
LANGUAGE PROCESSORS:
An integrated software development environment includes many different kinds of language processors, such as compilers, interpreters, assemblers, loaders, and linkers. In addition to a compiler, several other programs may be required to create an executable target program.
A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source-language statements.
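As a small illustration, the C preprocessor expands a macro shorthand before the compiler proper sees the program. This is a minimal sketch; the macro name AREA is made up for this example.

#include <stdio.h>

#define AREA(r) (3.14159 * (r) * (r))   /* a named shorthand (macro) */

int main(void) {
    /* The preprocessor rewrites AREA(5.0) to (3.14159 * (5.0) * (5.0))
       before the compiler proper ever sees this line. */
    printf("%f\n", AREA(5.0));
    return 0;
}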
The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and easier to debug. The assembly-language program is then processed by a program called an assembler, which produces relocatable machine code as its output.
Assembler functions:
• Converts symbolic code into machine instructions
• Builds machine instructions
• Converts data constants into their internal representation
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory addresses, where the code in one file may refer to a location in another file.
Linker functions:
• Finds the library files referenced by the source program
• Determines the memory locations
• Resolves the external memory addresses
The loader then puts together all of the executable object files into memory for execution.
Loader functions:
• Creates a new address space
• Copies instructions and data into the address space
• Pushes arguments onto the stack
A debugger is a program that can be used to locate errors in a compiled program during its execution. A profiler is a program used to collect statistics on the behavior of an object program during execution.
STRUCTURE OF A COMPILER OR PHASES OF COMPILATION:
A compiler takes a source program as input and produces an equivalent sequence of machine instructions as output. Because this process of compilation is highly complex, it is split into a sequence of subprocesses called phases. Compilation is divided into an analysis part and a synthesis part.
• Analysis of the source program: This is carried out by the front end of the compiler. It determines the meaning of the source string.
• Synthesis of the target program: This is carried out by the back end of the compiler. It constructs an equivalent target string from the source string, using the semantic actions produced by the front end.
In compilation, analysis is split into linear analysis, hierarchical analysis, and semantic analysis. These are represented by the lexical analysis (scanning), syntax analysis (parsing), and semantic analysis phases of the compiler. The analysis part also detects lexical, syntax, and semantic errors in the source program. Similarly, the synthesis part is split into intermediate code generation, code optimization, and code generation. In addition to these phases, there are two components that interact with every phase: the symbol table and the error handler.
Source program
      ↓
Lexical Analyzer
      ↓
Syntax Analyzer
      ↓
Semantic Analyzer
      ↓
Intermediate Code Generator
      ↓
Code Optimizer
      ↓
Code Generator
      ↓
Target program

(The symbol table and the error handler interact with all of the phases.)
Lexical analysis: The first phase of the compiler is lexical analysis. It is also called linear analysis or scanning. This phase performs a linear analysis of the source program: it reads the characters of the source program from left to right and groups them into tokens. A token is defined as a sequence of characters that has a collective meaning, e.g., identifiers, constants, reserved words.
Example: position := initial + rate * 60
<id1> <:=> <id2> <+> <id3> <*> <60>
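The following minimal sketch suggests how a scanner might group the characters of this statement into tokens. It is illustrative only; the token classes and the next_token interface are names made up for this example, not taken from a real compiler.

#include <ctype.h>
#include <stdio.h>

static const char *input = "position := initial + rate * 60";
static int pos = 0;

/* Groups the next run of characters into one token and prints it. */
int next_token(void) {
    while (isspace((unsigned char)input[pos])) pos++;      /* skip blanks */
    if (input[pos] == '\0') return 0;                      /* end of input */
    if (isalpha((unsigned char)input[pos])) {              /* identifier */
        int start = pos;
        while (isalnum((unsigned char)input[pos])) pos++;
        printf("<id, %.*s>\n", pos - start, input + start);
    } else if (isdigit((unsigned char)input[pos])) {       /* number */
        int start = pos;
        while (isdigit((unsigned char)input[pos])) pos++;
        printf("<num, %.*s>\n", pos - start, input + start);
    } else if (input[pos] == ':' && input[pos + 1] == '=') {
        pos += 2;
        printf("<assign>\n");
    } else {                                               /* single-char operator */
        printf("<op, %c>\n", input[pos]);
        pos++;
    }
    return 1;
}

int main(void) { while (next_token()) ; return 0; }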
Syntax analysis: The second phase of the compiler is syntax analysis. It is also called hierarchical analysis or parsing. The parser takes the tokens from the lexical analyzer, checks their syntax with the help of a grammar (G), and creates a syntax tree or parse tree. In a syntax tree, each internal node represents an operator and the children of that node represent its operands.
            :=
           /  \
        id1    +
              /  \
           id2    *
                 /  \
              id3    60
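A parser might represent such a tree with a simple node structure, as in this hedged sketch; the Node type and the mk helper are hypothetical names used only for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One node of the syntax tree: an operator with two children,
   or a leaf holding a lexeme such as "id1" or "60". */
typedef struct Node {
    char label[8];
    struct Node *left, *right;
} Node;

Node *mk(const char *label, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    strcpy(n->label, label);            /* assumes a short label */
    n->left = l;
    n->right = r;
    return n;
}

int main(void) {
    /* The tree for  position := initial + rate * 60  shown above. */
    Node *t = mk(":=", mk("id1", NULL, NULL),
                       mk("+", mk("id2", NULL, NULL),
                               mk("*", mk("id3", NULL, NULL),
                                       mk("60", NULL, NULL))));
    printf("root operator: %s\n", t->label);
    return 0;
}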
Semantic analysis: Semantic analysis checks whether the parse tree that was constructed follows the rules of the language. This phase identifies the operators and operands of expressions and also gathers type information. An important component of semantic analysis is type checking. The syntax tree after semantic analysis is called an annotated tree.
            :=
           /  \
        id1    +
              /  \
           id2    *
                 /  \
              id3    inttoreal
                         |
                         60
For example, suppose all the identifiers are real and the number 60 is an integer. The general approach is to convert the integer into a real, which is done by creating an extra node for the operator inttoreal that explicitly converts an integer into a real.
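A hedged sketch of this rule follows; the Expr type and the coerce and int_to_real helpers are hypothetical names used only to show where the conversion node is inserted.

#include <stdio.h>
#include <stdlib.h>

enum Type { T_INT, T_REAL };

typedef struct Expr {
    enum Type type;
    double value;                 /* simplified: constants only */
} Expr;

/* Wraps an integer expression in an explicit conversion to real. */
Expr *int_to_real(Expr *e) {
    Expr *n = malloc(sizeof *n);
    n->type = T_REAL;
    n->value = e->value;          /* 60 becomes 60.0 */
    return n;
}

/* If one operand is real and the other is an integer, insert the
   inttoreal conversion node on the integer side. */
Expr *coerce(Expr *e, enum Type other) {
    return (e->type == T_INT && other == T_REAL) ? int_to_real(e) : e;
}

int main(void) {
    Expr sixty = { T_INT, 60 };
    Expr *r = coerce(&sixty, T_REAL);
    printf("60 coerced: type=%s, value=%.1f\n",
           r->type == T_REAL ? "real" : "int", r->value);
    return 0;
}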
Intermediate code generation: The compiler converts the source program into the target program. This could be done directly, but it is not always practical, so the compiler generates an easy-to-translate form of the source program called intermediate code. After semantic analysis the compiler generates an intermediate-code representation of the source program for the target machine. It lies between the high-level language and machine language, it is easy to translate into the target language, and it must preserve the precedence ordering of the source program. One popular form of intermediate code is three-address code (TAC). An example of TAC for the syntax tree above:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
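One common way to store such instructions is as quadruples (op, arg1, arg2, result). The following is a minimal sketch under that assumption; the Quad structure and its field names are illustrative, not a fixed standard.

#include <stdio.h>

/* A quadruple: an operator, up to two arguments, and a result. */
typedef struct {
    const char *op, *arg1, *arg2, *result;
} Quad;

int main(void) {
    /* The three-address code shown above, stored as quadruples. */
    Quad code[] = {
        { "inttoreal", "60",    "",      "temp1" },
        { "*",         "id3",   "temp1", "temp2" },
        { "+",         "id2",   "temp2", "temp3" },
        { ":=",        "temp3", "",      "id1"   },
    };
    for (int i = 0; i < 4; i++)
        printf("(%s, %s, %s, %s)\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}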
Code optimization: The main aim of this phase is to improve the intermediate code, so as to generate code that runs faster and/or occupies less space. It establishes a trade-off between compilation speed and execution speed.
temp1 := id3 * 60.0
id1 := id2 + temp1
The compiler knows that the conversion of 60 from integer to real representation can be done once at compile time, so the inttoreal operation can be eliminated.
Code generation: Code generation is the final phase of the compilation process. It takes the optimized intermediate code as input and produces target machine code. Memory locations are selected for each of the variables used by the program.
MOVF id3,R2
MULF #60.0,R2
MOVF id2,R1
ADDF R2,R1
MOVF R1,id1
The ‘F’ in each instruction tells us that the instruction deals with floating-point numbers. The # sign indicates that 60.0 is to be treated as a constant.
Symbol table: A symbol table is a data structure that contains a record for each identifier, along with its attributes. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. When an identifier is detected by the lexical analyzer, the identifier is entered into the symbol table. The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
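A minimal sketch of such a data structure, assuming a chained hash table; the table size, hash function, and the lookup_or_insert name are all choices made for this example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 211                     /* a small prime; a common choice */

typedef struct Entry {
    char *name;                         /* the identifier's lexeme */
    const char *type;                   /* one attribute; real tables hold many */
    struct Entry *next;                 /* chaining resolves hash collisions */
} Entry;

static Entry *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Finds the record for an identifier, inserting it on first sight. */
Entry *lookup_or_insert(const char *name) {
    unsigned h = hash(name);
    for (Entry *e = table[h]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    Entry *e = malloc(sizeof *e);
    e->name = malloc(strlen(name) + 1);
    strcpy(e->name, name);
    e->type = NULL;                     /* later phases fill in attributes */
    e->next = table[h];
    table[h] = e;
    return e;
}

int main(void) {
    lookup_or_insert("position")->type = "real";   /* a later phase adds the type */
    printf("position: %s\n", lookup_or_insert("position")->type);
    return 0;
}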
Error detection and reporting: Each phase can encounter errors. After detecting an error, the phase must deal with it so that compilation can proceed. Another important function of the compiler is detecting and reporting errors in the source program.
Example 2: Consider the following fragment of C code:
float i, j;
i = i * 70 + j + 2;
Write the output at all phases of the compiler for the above C code.
Note: Refer Class Notes
____________________________________________________________________________________
LEXICAL ANALYSIS:
The first phase of the compiler is lexical analysis. It is also called linear analysis or scanning. This phase performs a linear analysis of the source program: it reads the characters of the source program from left to right and groups them into tokens. A token is defined as a sequence of characters that has a collective meaning, e.g., identifiers, constants, reserved words.
The role of the lexical analyzer: The main purpose of a lexical analyzer is to read input characters and produce a sequence of tokens as output.
After receiving a ‘getNextToken’ command from the parser, the lexical analyzer reads input characters until it can identify the next token, which it returns to the parser. Lexical analysis not only recognizes the token but also produces the code and value for that token.
The scanner of a lexical analyzer is responsible for the following basic functions:
• Identifying tokens such as constants, identifiers, and keywords.
• Removing all blank spaces and comment lines.
• Reporting errors.
• Creating storage for identifiers in the symbol table.
Issues in lexical analysis: There are several reasons for separating the analysis part of compiling into lexical analysis and parsing.
• The separation of lexical analysis from syntax analysis improves the efficiency of the compiler.
• Separating lexical analysis and syntax analysis into two phases makes the design of the compiler much simpler. If the two phases were put together, all the functions performed by the lexical analyzer, such as removing comments, whitespace, and newline characters from the input, would have to be performed by the syntax analyzer. Syntax analysis is already a complicated phase that performs syntax checks with the help of a grammar; burdening such a phase with reading the input string and removing spaces would be absurd. Hence, dividing the work into two phases simplifies the design of each phase.
• The division reduces compilation time.
• The division also improves the portability of the compiler.
Tokens, patterns and lexemes:
A token is a sequence of characters that has a collective meaning, e.g., identifiers, constants, keywords.
A pattern is a rule that describes a token. For example, the pattern for an identifier describes a string starting with a letter, followed by letters or digits repeated any number of times.
A lexeme is a sequence of input characters that is matched by the pattern for a token, for example if, else, etc.
The following procedure contains the different kinds of tokens recognized by a lexical analyzer.
void swap(int x, int y)
{
    int temp;
    temp = x;
    x = y;
    y = temp;
}
Token              Lexeme
Keyword            int
Identifier         swap, x, y, temp
Operator           =
Special symbols    ( ) ; { }
Attributes for tokens: When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent phases of the compiler with additional information about the particular lexeme that matched. The lexical analyzer therefore returns to the parser not only a token name, but also an attribute value for that token.
Example: E = M * C ** 2
<id , pointer to symbol_table entry for E>
<assign_op >
<id , pointer to symbol_table entry for M>
<mult_op >
<id , pointer to symbol_table entry for C>
<exp_op>
<number, integer value 2>
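A hedged sketch of how a token name and its attribute value might be paired in one structure; the TokenName constants and the SymEntry record are illustrative names, not from a real compiler.

#include <stdio.h>

enum TokenName { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER };

struct SymEntry { const char *lexeme; };   /* simplified symbol-table record */

typedef struct {
    enum TokenName name;
    union {
        struct SymEntry *entry;   /* <id, pointer to entry for E> */
        int value;                /* <number, integer value 2>    */
    } attr;
} Token;

int main(void) {
    struct SymEntry e = { "E" };
    Token t1 = { .name = ID, .attr.entry = &e };
    Token t2 = { .name = NUMBER, .attr.value = 2 };
    printf("<id, %s> <number, %d>\n", t1.attr.entry->lexeme, t2.attr.value);
    return 0;
}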
Lexical errors:
Few errors can be identified at the lexical level alone, because a lexical analyzer has a very localized view of the source program. Consider the statement fi (a == f(x)). Here ‘fi’ is a misspelled keyword, but a lexical analyzer cannot tell whether ‘fi’ is a misspelling of the keyword if. The lexical analyzer takes ‘fi’ as an identifier, and the error is then detected in a later phase of compilation.
However, suppose a situation arises in which the lexical analyzer is unable to proceed because the remaining input does not match the pattern of any token. The simplest recovery strategy is "panic mode" recovery: we delete successive characters from the remaining input until the lexical analyzer can find a well-formed token.
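A minimal sketch of panic-mode recovery for a toy language in which letters start identifiers and digits start numbers; the panic_skip name and the notion of "token-starting" characters are assumptions made for this example.

#include <ctype.h>
#include <stdio.h>

/* Deletes successive characters that cannot begin any token of the
   toy language, reporting each deleted character. */
const char *panic_skip(const char *p) {
    while (*p && !isalnum((unsigned char)*p)) {
        fprintf(stderr, "lexical error: deleting '%c'\n", *p);
        p++;
    }
    return p;        /* positioned where a well-formed token can start */
}

int main(void) {
    const char *rest = panic_skip("@#$count");
    printf("scanning resumes at: %s\n", rest);
    return 0;
}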
Other possible error- recovery actions are:
• Delete extraneous characters.
• Insert a missing character.
• Replace an incorrect character with a correct character.
• Transpose two adjacent characters.
Minimum-distance error recovery is used to correct errors in a lexeme. The minimum-distance error correction is the minimum number of corrections that must be made to convert an invalid lexeme into a valid lexeme.
SPECIFICATION OF TOKENS:
There are three specifications of tokens: 1) Strings 2) Languages 3) Regular expressions
Strings:
An alphabet is a finite, non-empty set of symbols. It is denoted by Σ.
For example:
• Σ = {0,1}, the binary alphabet.
• Σ = {a,b,c,...,z}, the set of all lowercase letters.
• Σ = {A,B,C,...,Z}, the set of all uppercase letters.
• Σ = {+,&,%,...}, a set of special characters.
A string is a finite sequence of symbols chosen from some alphabet.
For example, 00011001 is a string over the binary alphabet Σ = {0,1}, and aabbccdd is a string over the alphabet Σ = {a,b,c,d}.
The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example,
banana is a string of length six. The empty string, denoted ε, is the string of length zero.
Operations on strings:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end of string s.
For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s.
For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For example, nan is a substring
of banana.
A language is a set of strings from some alphabet.
Examples:
• Σ = {0, 1}
  L = {x | x ∈ Σ* and x contains an even number of 0’s}
Operations on languages:
The following example shows the operations on languages. Let L = {0,1} and S = {a,b,c}.
1. Union: L U S = {0, 1, a, b, c}
2. Concatenation: L.S = {0a, 1a, 0b, 1b, 0c, 1c}
3. Kleene closure: L* = {ε, 0, 1, 00, 01, 10, 11, ...}
4. Positive closure: L+ = {0, 1, 00, 01, 10, 11, ...}
Regular expressions: A regular expression is a special notation for describing a set of strings. Each regular expression r denotes a regular language L(r).
Regular expressions are defined by the following rules.
1. Φ is a regular expression denoting the empty set { }.
2. ε is a regular expression denoting the set containing only the empty string, { ε }.
3. If ‘a’ is a symbol in Σ, then ‘a’ is a regular expression denoting the set {a}.
4. If r and s are two regular expressions denoting the languages L(r) and L(s) respectively, then
   (i) (r|s) is a regular expression denoting L(r) U L(s) (union),
   (ii) rs is a regular expression denoting L(r)L(s) (concatenation),
   (iii) r* is a regular expression denoting (L(r))* (this operation is called closure).
Properties of regular expressions:
• r | s = s | r (union is commutative)
• r | (s | t) = (r | s) | t (union is associative)
• r(st) = (rs)t (concatenation is associative)
• r(s | t) = rs | rt and (s | t)r = sr | tr (concatenation distributes over union)
• εr = rε = r (ε is the identity for concatenation)
• (r*)* = r* (closure is idempotent)
Extensions of Regular Expressions:
The following extensions can be used for writing regular expressions.
1. One or more instances (r+): The unary operator ‘+’ represents the positive closure of the regular
expression.
2. Zero or one instance(r?): The unary postfix operator ? is used to represent zero or one occurrence
in the regular expression.
3. Character classes: If a1, a2, ..., an are symbols in the alphabet, then [a1a2...an] denotes a1 | a2 | ... | an. In the special case where all of the symbols are consecutive, the notation can be simplified further to just [a1-an].
A regular definition is a sequence of regular expressions in which each regular expression is given a name for notational convenience:
d1 → r1
d2 → r2
 ⋮
dn → rn
where each di is a distinct name and each ri is a regular expression.
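For instance, a standard regular definition for identifiers:
letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | ... | 9
id     → letter ( letter | digit )*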
Examples: Refer class notes
Transition diagrams: We first convert patterns into flowcharts, called "transition diagrams." A transition diagram has a collection of nodes or circles, called states, with edges directed from one state to another; each edge is labeled by a symbol or set of symbols. The initial state is marked with an arrow, and the transition diagram always begins in the start state, before any input symbols have been read. An accepting or final state is denoted by a double circle. In addition, if it is necessary to retract the forward pointer one position, we add a * near that accepting state.
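The diagram for identifiers, letter (letter | digit)*, might be coded directly as a loop over states, as in this hedged sketch; the state numbering and the recognize_id name are choices made for this example, and the final (*forward)-- models the * retraction on the accepting state.

#include <ctype.h>
#include <stdio.h>

/* States: 0 = start, 1 = inside the identifier, 2 = accepting (marked *). */
int recognize_id(const char *input, int *forward) {
    int state = 0;
    for (;;) {
        char c = input[*forward];
        switch (state) {
        case 0:                               /* start state */
            if (!isalpha((unsigned char)c)) return 0;
            state = 1;
            (*forward)++;
            break;
        case 1:                               /* letter (letter | digit)* */
            if (isalnum((unsigned char)c)) { (*forward)++; }
            else { state = 2; (*forward)++; } /* consume the delimiter */
            break;
        case 2:                               /* accepting state, marked * */
            (*forward)--;                     /* retract the forward pointer */
            return 1;
        }
    }
}

int main(void) {
    const char *src = "rate60+x";
    int fwd = 0;
    if (recognize_id(src, &fwd))
        printf("identifier of length %d; next char '%c'\n", fwd, src[fwd]);
    return 0;
}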
Examples: Refer Class Notes
Prepared by: B. Yugandhar