COMPILER DESIGN

UNIT-1
Language Processors
PHASES OF COMPILER DESIGN
All these phases transform the source code step by step, dividing it
into tokens, building parse trees, and optimizing the code as it
passes from phase to phase.
The expression (a+b)*c is used as a running example.
Working of Compiler Phases with Example
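A sketch of how (a+b)*c moves through the phases (the temporaries t1
and t2 are illustrative names):

Lexical analysis     →  token stream: id(a) + id(b) * id(c)
Syntax analysis      →  parse tree with * at the root and the
                        subtree for a+b as its left child
Semantic analysis    →  checks that a, b and c have types valid
                        for the + and * operators
Intermediate code    →  t1 = a + b
                        t2 = t1 * c
Code optimization    →  removes redundant computations (none are
                        needed here)
Code generation      →  target machine instructions computing t1
                        and t2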
Front End and Back end of Compiler
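• The front end consists of the machine-independent phases: lexical
analysis, syntax analysis, semantic analysis, and intermediate code
generation.
• The back end consists of the machine-dependent phases: code
optimization and target code generation.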
Lexical Analyzer
Introduction
• Lexical analysis is the first phase of the compiler. It receives
the modified source code, written in the form of sentences, from
the language preprocessor.
• The lexical analyzer is responsible for breaking this text into a
series of tokens, removing whitespace in the source code.
• If the lexical analyzer encounters an invalid token, it generates
an error.
• It reads the stream of characters, identifies the legal tokens,
and passes the data to the syntax analyzer when asked for it.
Roles and Responsibility of Lexical
Analyzer
The lexical analyzer performs the following tasks −
• Removing the white spaces and comments from the source program.
• Correlating error messages with the source program (for example,
by keeping track of line numbers).
• Helping to identify the tokens.
• Reading the input characters from the source code.
Example
Count the number of tokens in:
int max(int i);
• The lexical analyzer first reads int, finds it valid, and accepts
it as a token.
• It then reads max, which is recognized as a valid function name
after ( is read.
• int is also a token, then i is another token, and finally ;.
• Answer:
Total number of tokens: 7
int, max, (, int, i, ), ;
• We can represent this in the form of lexemes and tokens as shown
below.
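A possible classification (token class names vary by textbook):

Lexeme    Token
int       keyword
max       identifier
(         operator
int       keyword
i         identifier
)         operator
;         separator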
Input Buffering

• The lexical analyzer would have to access secondary memory each
time it identifies a token, which is time-consuming and costly. So
the input string is stored in a buffer and then scanned by the
lexical analyzer.
• The lexical analyzer scans the input string from left to right,
one character at a time, to identify tokens.
• It uses two pointers to scan tokens −
• Begin Pointer (bptr) − It points to the beginning of the token
being read.
• Look Ahead Pointer (lptr) − It moves ahead to search for the end
of the token.
Example − For the statement int a, b;
• Both pointers start at the beginning of the string, which is
stored in the buffer.
• The look-ahead pointer scans the buffer until the token is found.
• The character beyond the token (the blank space after "int") has
to be examined before the token ("int") can be determined.
• After processing the token ("int"), both pointers are set to the
start of the next token ('a'), and this process is repeated for the
whole program.
• A buffer can be divided into two halves. When the look-ahead
pointer crosses the halfway mark of the first half, the second half
is filled with new characters to be read. When the look-ahead
pointer moves past the right end of the second half, the first half
is refilled with new characters, and so on.
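A minimal C sketch of the two-pointer scan described above (the
buffer contents and token rules are illustrative assumptions, not a
full lexer):

#include <stdio.h>
#include <ctype.h>

int main(void) {
    char buf[] = "int a, b;";  /* input already loaded into the buffer */
    char *bptr = buf;          /* begin pointer: start of the current token */
    char *lptr = buf;          /* look-ahead pointer: finds the token's end */

    while (*bptr != '\0') {
        if (isspace((unsigned char)*bptr)) {  /* skip whitespace between tokens */
            bptr++;
            lptr = bptr;
            continue;
        }
        if (isalnum((unsigned char)*lptr)) {  /* extend over identifiers/keywords */
            while (isalnum((unsigned char)*lptr))
                lptr++;
        } else {
            lptr++;                           /* single-character token: , or ; */
        }
        printf("token: %.*s\n", (int)(lptr - bptr), bptr);
        bptr = lptr;           /* both pointers move on to the next token */
    }
    return 0;
}

This prints the tokens int, a, ",", b and ";" in order, mirroring how
bptr and lptr move through the buffer.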
Specification of Tokens
• Specification of tokens depends on the pattern of the lexeme. Here
we use regular expressions to specify the different types of
patterns that can form tokens.
• Although regular expressions cannot specify all of the patterns
that form tokens, they can express almost every type of pattern
that occurs in practice.
• There are 3 specifications of tokens:
1. String
2. Language
3. Regular Expression
1. String
• An alphabet or character class is a finite set of symbols.
• A string over an alphabet is a finite sequence of symbols drawn
from that alphabet.
• In language theory, the term "word" is often used as a synonym
for "string".
• The length of a string s, usually written |s|, is the number of
occurrences of symbols in s. For example, "banana" is a string of
length six.
• The empty string, denoted ε, is the string of length zero.
1. Prefix of String
• A prefix of a string is obtained by removing zero or more symbols
from the end of the string; the prefixes include ε and the string
itself.
For example: s = abcd
Prefixes of the string abcd: ε, a, ab, abc, abcd
2. Suffix of String
• A suffix of a string is obtained by removing zero or more symbols
from the beginning of the string; the suffixes include ε and the
string itself.
For example: s = abcd
Suffixes of the string abcd: ε, d, cd, bcd, abcd
3. Proper Prefix of String
• The proper prefixes of a string are all of its prefixes excluding
ε and the string itself.
Proper prefixes of the string abcd: a, ab, abc
4. Proper Suffix of String
• The proper suffixes of a string are all of its suffixes excluding
ε and the string itself.
Proper suffixes of the string abcd: d, cd, bcd
5. Substring of String
• A substring of a string s is obtained by deleting any prefix and
any suffix from s.
Substrings of the string abcd: ε, abcd, bcd, abc, …
6. Subsequence of String
• A subsequence of a string is obtained by eliminating zero or more
(not necessarily consecutive) symbols from the string.
Subsequences of the string abcd: abd, bcd, bd, …
7. Concatenation of String
• If s and t are two strings, then st denotes their concatenation.
For example: s = abc, t = def
Concatenation of strings s and t, i.e. st = abcdef

3. Regular Expression
• A regular expression is a sequence of symbols used to specify
lexeme patterns.
• A regular expression describes the languages that can be built by
applying operators such as union, concatenation, and closure to the
symbols of an alphabet.
• The grammar defined by regular expressions is known as regular
grammar. The language defined by a regular grammar is known as a
regular language.
Notations
If r and s are regular expressions denoting the
languages L(r) and L(s), then
• Union : (r)|(s) is a regular expression denoting
L(r) U L(s)
• Concatenation : (r)(s) is a regular expression
denoting L(r)L(s)
• Kleene closure : (r)* is a regular expression
denoting (L(r))*
• (r) is a regular expression denoting L(r)
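For example, over the alphabet {a, b}:
• a|b denotes the language {a, b}
• (a|b)(a|b) denotes {aa, ab, ba, bb}
• a* denotes {ε, a, aa, aaa, …}
• (a|b)* denotes the set of all strings of a's and b's, including ε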
Precedence and Associativity
• *, concatenation (.), and | (pipe sign) are all left associative.
• * has the highest precedence.
• Concatenation (.) has the second highest precedence.
• | (pipe sign) has the lowest precedence of all.
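For example, under these conventions the regular expression
(a)|((b)*(c)) can be written as a|b*c: it denotes either a single a,
or zero or more b's followed by a c.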
Representing valid tokens of a language in regular expression
If x is a regular expression, then:
• x* means zero or more occurrences of x,
i.e., it can generate { ε, x, xx, xxx, xxxx, … }
• x+ means one or more occurrences of x,
i.e., it can generate { x, xx, xxx, xxxx, … }; equivalently, x+ = x.x*
• x? means at most one occurrence of x,
i.e., it can generate either {x} or {ε}.
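For example, an identifier that begins with a letter followed by
letters or digits can be written as letter (letter | digit)*, and an
unsigned integer as digit+.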
Recognition of Tokens
• Tokens obtained during lexical analysis are recognized by finite
automata.
• A Finite Automaton (FA) is a simple idealized machine that can be
used to recognize patterns within input taken from a character set
or alphabet (usually denoted Σ). The primary task of an FA is to
accept or reject an input based on whether the defined pattern
occurs within the input.
• EXAMPLE
Assume the following grammar fragment generates a specific language,
where the terminals if, then, else, relop, id and num generate sets
of strings given by the regular definitions below.
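The grammar fragment and regular definitions are not reproduced in
this text; a sketch following the standard textbook treatment is:

stmt  → if expr then stmt
      | if expr then stmt else stmt
      | ε
expr  → term relop term | term
term  → id | num

digit  → [0-9]
digits → digit+
num    → digits (. digits)? (E (+|-)? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>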
Lexical-Analyzer Generator
• LEX is a tool that generates a lexical analyzer program from a
given specification. The generated analyzer processes an input
string/file and transforms it into tokens. It is used together with
YACC to generate complete parsers.
• There are many versions of LEX available, but the most popular is
flex, which is readily available on Linux systems as part of the
compiler package.
The function of Lex is as follows:
• First, a lexical analyzer specification is written as a program
lex.l in the Lex language. The Lex compiler then runs on lex.l and
produces a C program lex.yy.c.
• Next, the C compiler compiles lex.yy.c and produces an object
program a.out.
• a.out is the lexical analyzer that transforms an input stream into
a sequence of tokens.
Parts of the LEX program
The layout of a LEX source program is:
• Definitions
• Rules
• Auxiliary routines
A double percent sign (%%) separates each section.
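A minimal sketch of a LEX source file showing the three sections
(the token names and actions are illustrative assumptions):

%{
/* Definitions section: C declarations used by the rules */
#include <stdio.h>
%}

%%
[0-9]+                  { printf("NUMBER: %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+                { /* Rules section: skip whitespace */ }
.                       { printf("SYMBOL: %s\n", yytext); }
%%

/* Auxiliary routines section */
int yywrap(void) { return 1; }

int main(void) {
    yylex();   /* run the generated scanner on standard input */
    return 0;
}

Running lex scanner.l (or flex scanner.l) produces lex.yy.c, and
compiling it with cc lex.yy.c -o a.out yields the lexical analyzer
described above.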
