CD Unit-1 (Part-1)


STRUCTURE OF COMPILER or PHASES OF COMPILER


There are six phases in a compiler. Each of these phases helps in converting the
high-level language program into machine code. The phases of a compiler are:

Lexical Analysis:
Lexical Analysis is the first phase. In this phase the compiler scans the source code
and splits the characters into tokens.
Example:
x = y + 10
Tokens:
<x , Identifier>
<= , Assignment operator>
<y , Identifier>
<+ , Addition operator>
<10 , Number>

Syntax Analyzer:
Syntax analysis is the second phase of the compilation process. It takes tokens as input
and generates a parse tree as output. In this phase, the parser checks whether the
expression formed by the tokens is syntactically correct.
Example:
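A rough text sketch of the parse tree for x = y + 10 (the original diagram is a figure; this shows only its shape):

        =
       / \
      x   +
         / \
        y   10

The root is the assignment operator; its right subtree shows that y + 10 is grouped as one expression before being assigned to x.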

Semantic Analyzer:
The Semantic Analyzer checks for type mismatches, incompatible operands, a
function called with improper arguments, an undeclared variable, etc.
Example:
float x = 20;
In the above code, the semantic analyzer will typecast the integer 20 to the float 20.0
before assignment.
Example:
int a = 5.5; Assigning a float value to an integer variable is not compatible, so the
compiler reports an error while compiling the program.
Intermediate Code Generation:
After the semantic analysis phase is over, the compiler generates intermediate code
in the next phase.
Intermediate code lies between the high-level and the machine-level language. This
intermediate code should be designed in a way that makes it easy to translate into
target machine code.
Example:
For total = count + rate * 5, the intermediate code is:
t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3

Code Optimizer:
This phase removes unnecessary code lines and rearranges the sequence of statements to speed up
the execution of the program without wasting resources.
Example:
Consider the following code:
a = int_to_float(10)
b = c * a
d = e + b
f = d
can become
b = c * 10.0
f = e + b

Code Generator:
This is the last stage of the compiler. It takes optimized code as input and generates the target
code for the machine.
Example:
Consider the following code:
b = c * 10.0
f = e + b
can become a sequence of machine instructions such as the following (register names and mnemonics are illustrative):
LOAD  R1, c
MUL   R1, #10.0     ; R1 = c * 10.0, i.e. b
LOAD  R2, e
ADD   R2, R1        ; R2 = e + b, i.e. f
STORE R2, f

Symbol table: It is a data structure created and maintained by the compiler; it stores all
the identifiers' names along with their types. It helps the compiler function smoothly by
finding identifiers quickly.
Consider the following statement:
x = a + b * 50
The symbol table for the above example is given below. The symbol table clearly records
each variable name and its type.

S.No.   Variable name   Variable type
1       x               float
2       a               float
3       b               float

Error handler: An error may be encountered in any of the above phases. After finding an
error, the phase must deal with it so that compilation can continue. Errors are reported
to the error handler, which handles them so that the compilation process can proceed.
Generally, errors are reported in the form of messages.

In the compiler design process, errors may occur in any of the phases given below:

• Lexical analyzer: wrongly spelled tokens
• Syntax Analyzer: missing parentheses, semicolons
• Semantic Analyzer: mismatched operands for an operator (e.g. 5 % 2.5)
• Code Optimizer: unreachable statements
• Code Generator: unreachable statements
• Symbol table: multiply declared identifiers

THE ROLE OF LEXICAL ANALYZER


Lexical analysis reads the source program as a sequence of characters and converts
it into a sequence of tokens. A program that converts the source program into tokens
is also called a lexical analyzer (lexer), tokenizer or scanner. The sequence of tokens
produced by the lexical analyzer helps the parser analyze the syntax of the
programming language.
Block diagram of the lexical analyzer: source program → lexical analyzer → token stream → parser.

Lexical analysis consists of two stages, which are as follows:

Scanning: reads the input characters and removes white spaces and comments.
Lexical analysis or tokenization: converts the source code into tokens and gives
the tokens as output.
Valid tokens are:
i. keywords,
ii. constants,
iii. identifiers,
iv. numbers,
v. operators, etc.
A few more jobs that the lexical analyzer does:
i. it writes the identified identifier tokens into the symbol table.
ii. it removes white spaces and comments from the source program.
iii. it expands macros if they are found in the source program.
It can detect the following errors:
i. identifiers exceeding the maximum length
ii. illegal characters
iii. unterminated strings, etc.
Examples:
Find the number of tokens in the following C statements.
Example 1: printf("i = %d, j = %d", i, j);
Answer: 9 tokens in total, namely
[printf] [(] ["i = %d, j = %d"] [,] [i] [,] [j] [)] [;]

Example 2:
Input: int p=0, d=1, c=2;
Answer: total number of tokens = 13, namely
[int] [p] [=] [0] [,] [d] [=] [1] [,] [c] [=] [2] [;]

Example 3:
void main()
{
i/*nt*/a=10;
return;
}
Answer: 13 tokens in total, namely: void main ( ) { i a = 10 ; return ; }
(The comment /*nt*/ is replaced by white space, so i and a become two separate identifier tokens.)

INPUT BUFFERING
The lexical analyzer scans the input string one character at a time, from left to right.
It uses two pointers, the begin pointer (bp) and the forward pointer (fp), to identify
a token.
Initially both pointers point to the first character of the input string, as shown
below.

The bp remains at the beginning of the string being read, and the fp moves forward
to search for the end of the lexeme. When a blank space is encountered, it indicates
the end of the lexeme.

In the above example, as soon as fp reaches a blank space the lexeme "int" is identified.
Then both pointers move to the first letter of the next lexeme, as shown below.

One Buffer Scheme:

In this scheme, only one buffer is used to store the input string. The problem
with this scheme is that if a lexeme is very long, it crosses the buffer boundary;
to scan the rest of the lexeme the buffer has to be refilled, which overwrites the
first part of the lexeme.

Two Buffer Scheme:

To overcome the problem of the one buffer scheme, this method uses two buffers
to store the input string. The first buffer and the second buffer are filled and
scanned alternately.
Every buffer has a sentinel (eof) at its end. When fp reaches the eof of the first
buffer, the first buffer is exhausted; the second buffer is loaded and fp moves to
the first location of the second buffer. When fp reaches the eof of the second
buffer, the first buffer is reloaded and fp moves to the first location of the
first buffer.
This process continues until the entire program is scanned.
Algorithm:
if fp reaches eof of first buffer then
{
    reload second buffer;
    fp = fp + 1;   /* the buffers are adjacent, so this lands on the second buffer */
}
else if fp reaches eof of second buffer then
{
    reload first buffer;
    fp moves to beginning of first buffer;
}
else
{
    fp = fp + 1;
}
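The same algorithm can be sketched in C. This is a minimal illustration only: the buffer size N, the load() helper and the use of EOF as the sentinel byte are assumptions made for the example (a real scanner must also keep bp in step with fp).

#include <stdio.h>

#define N 4096                        /* size of each buffer half */

static char buf[2 * (N + 1)];         /* two halves, each ending in a sentinel */
static char *fp;                      /* the forward pointer from the text */

/* Fill one half with up to N characters and plant the sentinel after them. */
static void load(char *half, FILE *src)
{
    size_t n = fread(half, 1, N, src);
    half[n] = (char)EOF;
}

/* Advance fp by one character, swapping halves when a sentinel is hit.
   Assumes the sentinel byte never occurs inside the source text itself. */
static int next_char(FILE *src)
{
    char c = *fp++;
    if (c == (char)EOF) {
        if (fp == buf + N + 1) {              /* eof of first buffer */
            load(buf + N + 1, src);           /* reload second buffer */
            /* fp already points to the first location of the second buffer */
        } else if (fp == buf + 2 * (N + 1)) { /* eof of second buffer */
            load(buf, src);                   /* reload first buffer */
            fp = buf;                         /* fp moves to its beginning */
        } else {
            return EOF;                       /* real end of input */
        }
        c = *fp++;
        if (c == (char)EOF)
            return EOF;                       /* reloaded half was empty */
    }
    return (unsigned char)c;
}

/* Before scanning, initialize with: load(buf, src); fp = buf; */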

RECOGNITION OF TOKENS
Tokens can be recognized by finite automata. A finite automaton is a simple machine
computation model with a very small amount of memory. It gives no output except an
indication of acceptance or rejection of a string. There are two notations for
representing finite automata:
Transition Diagram
Transition Table
A transition diagram is a directed labelled graph containing nodes and edges; nodes
represent states and edges represent transitions between states.
Every transition diagram has exactly one initial state, marked by an arrow (-->),
and zero or more final states, represented by double circles.
Example:
Finite Automata for recognizing identifiers
regular expression:
letter -> [A-Za-z_]
id -> letter (letter|digit)*
transition diagram:

Finite Automata for recognizing keywords


regular expression:
if -> if
transition diagram:

Finite Automata for recognizing numbers


Regular expression:
digit -> [0-9]
digits -> digit+
Transition diagram:

Finite Automata for relational operators


Regular Expression:
Relop -> < | > | <= | >= | = | <>
Transition diagram:
Finite Automata for recognizing white spaces
Regular expression:
ws -> (blank | tab | newline)+
Transition diagram:

LEXICAL ANALYZER GENERATOR - LEX:

Lex is a tool/program that produces a lexical analyzer. A lexical analyzer is a program
that converts the input stream into a series of tokens. The figure below shows the
creation of a lexical analyzer with Lex.

An input file, which we call lex.l, is written in the Lex language and describes the
lexical analyzer to be generated. The Lex compiler processes the lex.l program and
produces a C program lex.yy.c.
Finally, the C compiler compiles lex.yy.c and produces an object program a.out.
a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.
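On a typical Unix system the whole pipeline is just three commands (a sketch; the library flag is -ll for classic lex and -lfl for flex, and input.txt is a stand-in for any input file):

lex lex.l                    # Lex compiler: lex.l -> lex.yy.c
cc lex.yy.c -o a.out -ll     # C compiler: lex.yy.c -> a.out
./a.out < input.txt          # a.out: input stream -> sequence of tokens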

A Lex program is separated into three sections by %% delimiters. The format of a Lex
source file is as follows:
{ declarations }
%%
{ rules }
%%
{ user subroutines }
Declarations: includes declarations of constants, variables and regular definitions.
Rules: contains the rules; the syntax is
pattern    action

User subroutines: this section contains C statements and additional functions.

Example-1: Lex program to identify identifiers, numbers and keywords

%{
#include <stdio.h>
%}

%%
if|else|while|int|switch|for|char { printf("\n keyword"); }
[a-zA-Z_]([a-z]|[A-Z]|[0-9]|_)* { printf("\n identifier"); }
[0-9]+ { printf("\n number"); }
. { printf("\n invalid"); }
%%
int main()
{
    yylex();
    return 0;
}
int yywrap()
{
    return 1;
}

Example-2: Lex program to count vowels and consonants

%{
#include <stdio.h>
int v = 0;
int c = 0;
%}
%%
[aeiouAEIOU] { v++; }
[a-zA-Z] { c++; }
%%
int yywrap() { return 1; }
int main()
{
    printf("Enter a string of vowels and consonants: ");
    yylex();
    printf("Number of vowels: %d\n", v);
    printf("Number of consonants: %d\n", c);
    return 0;
}
DESIGN OF LEXICAL ANALYZER
A lexical analyzer can be generated from either an NFA or a DFA; a DFA is preferable
in the implementation of Lex.
The architecture of a lexical analyzer generated by Lex is given in the figure.

Lex contains an automaton to recognize each token.


Example: Automaton for identifiers
Regular Expression:
id = letter(letter|digit)*

Transition table:

State    letter    digit    other
-->S1     S2        -        -
   S2     S2        S2       S3
  *S3     -         -        -

Automata for above regular expression:
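The table translates almost mechanically into code. A minimal C sketch of this identifier automaton (state names follow the table; counting '_' as a letter carries over the assumption in the regular definition above):

#include <ctype.h>
#include <stdio.h>

enum state { S1, S2, S3 };   /* states from the transition table */

/* Returns the length of the identifier at the start of s, or 0 if none.
   Implements id -> letter (letter | digit)*. */
static int match_id(const char *s)
{
    enum state st = S1;
    int i = 0;
    while (st != S3) {
        unsigned char c = (unsigned char)s[i];
        if (st == S1) {
            if (isalpha(c) || c == '_') { st = S2; i++; }
            else return 0;            /* first symbol is not a letter */
        } else {                      /* st == S2 */
            if (isalnum(c) || c == '_') i++;
            else st = S3;             /* 'other': accept and stop (retract) */
        }
    }
    return i;                         /* length of the recognized lexeme */
}

int main(void)
{
    printf("%d\n", match_id("count1 = 0"));   /* prints 6 */
    return 0;
}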

All the NFAs are combined into a single NFA with a new common initial state, as shown below.

The finite automaton recognizes the tokens of the input stream and the action module
takes the appropriate action.
SPECIFICATION OF TOKENS:
There are 3 specifications of tokens:
1) Strings
2) Language
3) Regular expression
1. Alphabets: Any finite set of symbols is an alphabet.
{0,1} is the set of binary alphabet symbols,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the set of hexadecimal alphabet symbols,
{a-z, A-Z} is the set of English language alphabet symbols.

2. Strings: Any finite sequence of alphabet symbols is called a string.

Operations (for alphabets and strings):
Length:
The number of symbols in a given string is called the length of the string. For a
string w, its length is denoted by |w|.
Example:
Σ = {a, b}
abbab is a string over Σ with length 5.
Null string:
The string with length zero is called the null string. It is represented by the symbol
epsilon (ε) or lambda (Λ).
Prefix:
Any sequence of leading (beginning) symbols of a string is called a prefix. The null
string ε is a prefix of every string.
Example:
Σ = {a, b, c, . . . , y, z}
ayc is a string over Σ
prefixes of ayc: ε, a, ay, ayc
Proper prefix: for a string, any prefix other than the string itself is called a
proper prefix.
Example:
Σ = {a, b, c, . . . , y, z}
ayc is a string over Σ
proper prefixes of ayc: ε, a, ay
Suffix: any sequence of trailing (ending) symbols of a string is called a suffix of
that string. The null string is a suffix of every string.
Example:
Σ = {a, b, c, . . . , y, z}
ayc is a string over Σ
suffixes of ayc: ε, c, yc, ayc
Proper suffix: any suffix of a string other than the string itself is called a proper
suffix.
Example:
Σ = {a, b, c, . . . , y, z}
ayc is a string over Σ
proper suffixes of ayc: ε, c, yc
Concatenation: concatenation is the process of combining two strings to form a new
string.
Example:
Σ = {a, b, c, . . . , y, z}
String 1: w1 = ab
String 2: w2 = xy
w1.w2 = abxy
Concatenation with the null string ε gives the string itself, i.e. w.ε = w
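To make the operation concrete, here is a tiny C illustration of concatenating two strings (just a demonstration of w1.w2; formal string concatenation is a mathematical operation, not a particular library call):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char w1[16] = "ab";        /* string 1 */
    const char *w2 = "xy";     /* string 2 */
    strcat(w1, w2);            /* w1.w2 */
    printf("%s\n", w1);        /* prints abxy, length |w1| + |w2| = 4 */
    return 0;
}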
3) Language:
Any subset of Σ* is called a language. In other words, if Σ is an alphabet and L ⊆ Σ*,
then L is a language.
Example:
Let Σ = {0, 1}
i. the set of strings consisting of n 0's followed by n 1's is a language:
{ ε, 01, 0011, 000111, . . . }
ii. the set of strings with an equal number of 0's and 1's is a language:
{ ε, 01, 1001, 101001, . . . }
iii. L = { } is the empty language; it is denoted by Φ
iv. L = {ε} is also a language; it contains only the empty string.
Note: Φ ≠ {ε}
A language can be infinite, but the underlying alphabet Σ on which the language is
built is always finite.
The empty language is the minimal language over any given alphabet:
L = { } = Φ is minimal.
Operations on languages:
Union: given two languages L and M, the union operation is defined as:
L ∪ M = { w : w ∈ L or w ∈ M }
Example:
L = { ab, aba, abbc, . . . }
M = { xyz, xz, x, y, . . . }
L ∪ M = { ab, aba, abbc, . . . , xyz, xz, x, y, . . . }
Concatenation: given two languages L and M, the concatenation operation is defined
as:
L . M = { w : w = xy, x ∈ L and y ∈ M }
Example:
L = { ab, aba, abbc, . . . }
M = { xyz, xz, x, y, . . . }
L . M = { abxyz, abxz, abx, aby, abaxyz, abaxz, abax, abay, . . . }
4) Regular Expressions
Regular expressions are used to describe a language. Let Σ be a non-empty alphabet.
Then:
Examples of regular expressions:
i. ε is a regular expression for L = {ε}
ii. Φ is a regular expression for L = { }
iii. 'a' is a regular expression for L = {a}
iv. a+b is a regular expression for L = {a, b}
v. if R1 and R2 are regular expressions, then R1 ∪ R2 is a regular expression.
vi. if R1 and R2 are regular expressions, then R1 . R2 is a regular expression.
vii. if R is a regular expression, then R* is a regular expression.

Example:
i. the language { w : w contains exactly two 0's } is described by the expression
1*01*01*.
Some strings of this language are { 00, 100, 1010, 10101, 001, 010, . . . }
ii. the language { w : the length of w is even } is described by the expression
((0 ∪ 1)(0 ∪ 1))*.

For every regular expression there is a regular language, and for every regular
language there is a regular expression.
Every regular expression or regular language can be represented by a finite
automaton, and every finite automaton can be converted to a regular expression.
Regular definitions give names to sub-expressions, for example:
Let <digit> = 0|1|2|3|...|8|9
Integer = <digit><digit>*
TRANSLATION PROCESS
A translator is used to translate a high-level language program into efficient
machine code.
It divides the translation process into a series of phases.
Each phase manages some particular aspect of the translation.
Phases of translation:

Lexical Analysis:
Lexical analysis is the first phase. In this phase the compiler scans the source code
and groups the characters into tokens.
Syntax Analysis:
Syntax analysis is the second phase of the compilation process. It takes tokens as
input and generates a parse tree as output.
Semantic Analyzer:
The semantic analyzer checks for type mismatches, incompatible operands, a
function called with improper arguments, an undeclared variable, etc.
Intermediate Code Generation:
Once the semantic analysis phase is over, the compiler generates intermediate code
for the target machine.
Code Optimizer:
This phase removes unnecessary code lines and rearranges the sequence of statements to speed up
the execution of the program without wasting resources.
Code Generator:
This is the last stage of the compiler. It takes optimized code as input and generates the target
code for the machine.
MAJOR DATA STRUCTURES IN A COMPILER:
The symbol table is an important data structure created and maintained by
compilers in order to store information about the occurrence of various entities
such as variable names, function names, objects, classes, interfaces, etc.
Implementation: a symbol table can be implemented using any of the following data
structures; a small C sketch of the hash-table variant is given below the list.
• Linear list
• Binary search tree
• Hash table
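Here is a minimal C sketch of the hash-table variant. The table size, hash function and entry fields are illustrative assumptions; a real compiler would also record scope, duplicate declarations and other attributes.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                 /* a small prime; illustrative choice */

struct entry {                         /* one symbol-table record */
    char *name;
    char *type;
    struct entry *next;                /* chaining handles collisions */
};

static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insert a name/type pair (no duplicate check in this sketch). */
static void insert(const char *name, const char *type)
{
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = strdup(name);
    e->type = strdup(type);
    e->next = table[h];
    table[h] = e;
}

/* Find a name; returns NULL if it is not in the table. */
static struct entry *lookup(const char *name)
{
    struct entry *e;
    for (e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

int main(void)
{
    insert("x", "float");
    insert("a", "float");
    if (lookup("x"))
        printf("x : %s\n", lookup("x")->type);   /* prints x : float */
    return 0;
}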
BOOTSTRAPPING AND PORTING
Bootstrapping is a process used to create a new compiler; it is also used to create
cross compilers.
For example, if programming language L2 is used to write a compiler for language L1,
this is represented by the T-diagram shown below.

Similarly, if programming language L3 is used to write a compiler for language L2,
it is represented by the T-diagram shown below.

This process is called bootstrapping. The process above can be represented using
T-diagrams as shown below.
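Since the original figures are not reproduced, here is a rough text rendering of the conventional T-diagram (MC stands for machine code; the source language sits on the left arm, the target language on the right arm, and the implementation language in the stem):

    +------------------+        +------------------+
    |  L1          MC  |        |  L2          MC  |
    +-----+      +-----+        +-----+      +-----+
          |  L2  |                    |  L3  |
          +------+                    +------+

The left diagram is a compiler for L1 written in L2; the right one is a compiler for L2 written in L3. Bootstrapping composes such diagrams: the L2-written compiler is translated by an existing L2 implementation until a working compiler for L1 is obtained.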
