Module 2
AUTOMATA THEORY
AND COMPILER
DESIGN- 21CS51
MODULE-2
Regular Expressions and Languages: Regular Expressions, Finite Automata and Regular
Expressions, Proving Languages Not to Be Regular
Lexical Analysis Phase of compiler Design: Role of Lexical Analyzer, Input Buffering,
Specification of Token, Recognition of Token.
Definition: Let M = (Q, ∑, δ, q0, A) be a DFA. A language L is regular if there exists a machine
M such that L = L(M).
Definition: A regular expression is recursively defined as follows.
1. Ø is a regular expression denoting the empty language.
2. Ɛ (epsilon) is a regular expression denoting the language containing only the empty string.
3. a is a regular expression denoting the language containing only {a}.
4. If R is a regular expression denoting the language L(R) and S is a regular expression denoting
the language L(S), then
   R + S is a regular expression denoting the language L(R) ∪ L(S),
   RS is a regular expression denoting the language L(R)L(S), and
   R* is a regular expression denoting the language (L(R))*.
Regular expression      Meaning
a*                      Strings consisting of any number of a's (0 or more).
a+                      Strings consisting of at least one a (1 or more).
(a+b)                   Strings consisting of either one a or one b.
(a+b)*                  Strings of a's and b's of any length, including the empty string.
(a+b)*abb               Strings of a's and b's ending with the string abb.
ab(a+b)*                Strings of a's and b's starting with the string ab.
(a+b)*aa(a+b)*          Strings of a's and b's having aa as a substring.
a*b*c*                  Any number of a's (possibly none), followed by any number of b's (possibly none), followed by any number of c's (possibly none).
a+b+c+                  At least one a, followed by at least one b, followed by at least one c.
aa*bb*cc*               At least one a, followed by at least one b, followed by at least one c (same as a+b+c+).
(a+b)*(a+bb)            Strings of a's and b's ending with either a or bb.
(aa)*(bb)*b             An even number of a's followed by an odd number of b's.
(1+Ɛ)(00*1)*0*          Strings of 0's and 1's without any consecutive 1's.
(0+1)*000               Strings of 0's and 1's ending with three consecutive zeros.
(11)*                   Strings consisting of an even number of 1's.
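The + in these expressions denotes union; most programming-language regex engines write union as | instead. As a rough illustration (my own sketch, not part of the notes), the short Python program below checks a few of the patterns from the table against sample strings using the standard re module:

import re

# In the notes, + denotes union; Python's re module uses | instead.
# Kleene star (*) and concatenation carry over directly.
patterns = {
    "(a+b)*abb": r"(a|b)*abb",       # strings of a's and b's ending in abb
    "ab(a+b)*":  r"ab(a|b)*",        # strings of a's and b's starting with ab
    "(0+1)*000": r"(0|1)*000",       # strings of 0's and 1's ending in 000
}

for name, pat in patterns.items():
    for s in ["abb", "aabb", "ababb", "ab", "000", "10100"]:
        ok = re.fullmatch(pat, s) is not None
        print(f"{name!r:14} {s!r:10} -> {ok}")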
Theorem: For every regular expression R there exists a finite automaton M = (Q, ∑, δ, q0, A) such that L(M) = L(R).
Proof: By definition, Ø, Ɛ and a are regular expressions. The corresponding machines to recognize
these expressions are shown in the figure below.
[Figure: elementary automata — for Ø, states q0 and qf with no transition between them; for Ɛ, an Ɛ-transition from q0 to qf; for a, a transition on a from q0 to qf]
The schematic representation of a machine M accepting the language L(R) of a regular expression R is shown in
the figure below, where q is the start state and f is the final state of machine M.
[Figure: machine M with start state q and final state f, accepting L(R)]
Case 1: R = R1 + R2. We can construct an NFA that accepts either L(R1) or L(R2), i.e. L(R1 + R2), as shown in figure 3.3.
[Figure 3.3: NFA to accept L(R1 + R2) — a new start state q0 with Ɛ-transitions to the start states q1 and q2 of machines M1 and M2, and Ɛ-transitions from their final states f1 and f2 to a new final state qf]
It is clear from the figure that the machine can accept either L(R1) or L(R2). Here, q0 is the start state of
the combined machine and qf is the final state of the combined machine M.
Case 2: R = R1 . R2. We can construct an NFA that accepts L(R1) followed by L(R2), i.e. L(R1 . R2), as shown in figure 3.4.
[Figure 3.4: NFA to accept L(R1 . R2) — machine M1 (states q1 to f1) followed by an Ɛ-transition into machine M2 (states q2 to f2)]
It is clear from the figure that after accepting L(R1) the machine moves from state q1 to f1. Since there is
an Ɛ-transition, without any input there is a transition from state f1 to state q2. In state q2, upon
accepting L(R2), the machine moves to f2, which is the final state. Thus q1, the start state of
machine M1, becomes the start state of the combined machine M, and f2, the final state of
machine M2, becomes the final state of M, which accepts the language L(R1 . R2).
Case 3: R = (R1)*. We can construct an NFA that accepts (L(R1))*, as shown in figure 3.5(a); it
can also be represented as shown in figure 3.5(b).
[Figure 3.5: NFA to accept L(R1)* — a new start state q0 and a new final state qf, with Ɛ-transitions from q0 to q1 and to qf, from f1 to qf, and from f1 back to q1 to allow repetition]
It is clear from the figure that the machine can accept either Ɛ or any number of repetitions of strings from L(R1), thus accepting
the language L(R1)*. Here, q0 is the start state and qf is the final state.
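Taken together, the three cases above are Thompson's construction. The following sketch (my own illustration in Python; the NFA representation, the function names and the helper accepts are assumptions, not from the notes) builds Ɛ-NFAs for union, concatenation and star exactly as in the figures, and tests the machine for (a+b)*abb:

from dataclasses import dataclass, field

EPS = None  # label used for epsilon-transitions

@dataclass
class NFA:
    start: int
    final: int
    trans: dict = field(default_factory=dict)   # state -> list of (label, next_state)

    def add(self, s, label, t):
        self.trans.setdefault(s, []).append((label, t))

_count = [0]
def new_state():
    _count[0] += 1
    return _count[0]

def symbol(a):                      # elementary machine for a single symbol a
    q, f = new_state(), new_state()
    m = NFA(q, f)
    m.add(q, a, f)
    return m

def union(m1, m2):                  # Case 1: R = R1 + R2
    q, f = new_state(), new_state()
    m = NFA(q, f, {**m1.trans, **m2.trans})
    m.add(q, EPS, m1.start); m.add(q, EPS, m2.start)
    m.add(m1.final, EPS, f); m.add(m2.final, EPS, f)
    return m

def concat(m1, m2):                 # Case 2: R = R1 . R2
    m = NFA(m1.start, m2.final, {**m1.trans, **m2.trans})
    m.add(m1.final, EPS, m2.start)  # epsilon-transition from f1 to q2
    return m

def star(m1):                       # Case 3: R = (R1)*
    q, f = new_state(), new_state()
    m = NFA(q, f, dict(m1.trans))
    m.add(q, EPS, m1.start); m.add(m1.final, EPS, f)
    m.add(q, EPS, f)                # accept the empty string
    m.add(m1.final, EPS, m1.start)  # allow any number of repetitions
    return m

def eps_closure(m, states):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in m.trans.get(s, []):
            if label is EPS and t not in seen:
                seen.add(t); stack.append(t)
    return seen

def accepts(m, w):
    cur = eps_closure(m, {m.start})
    for c in w:
        nxt = {t for s in cur for label, t in m.trans.get(s, []) if label == c}
        cur = eps_closure(m, nxt)
    return m.final in cur

# NFA for (a + b)* abb : strings of a's and b's ending in abb
m = concat(star(union(symbol('a'), symbol('b'))),
           concat(symbol('a'), concat(symbol('b'), symbol('b'))))
print(accepts(m, "aababb"), accepts(m, "ab"))   # True False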
Example
Obtain an NFA which accepts strings of a’s and b’s starting with the string ab.
[Figure: step-by-step Thompson construction of the NFA for ab(a+b)* — elementary machines for a and b, their concatenation giving ab, the machine for (a+b)*, and finally the combined NFA with states 0 through 9]
Obtaining a regular expression from an FA: consider the generalized transition graph shown in figure 3.9.
[Figure 3.9: generalized transition graph — start state q0 with self-loop r1 and an edge r2 to the final state q1; q1 has a self-loop r4 and an edge r3 back to q0]
Here r1, r2, r3 and r4 are regular expressions that correspond to the labels of the edges. The
regular expression for this graph can take the form:
r = r1* r2 (r4 + r3 r1* r2)*                                       (3.1)
Note:
1. Any graph can be reduced to the graph shown in figure 3.9. Then substitute the regular
expressions appropriately into equation (3.1) to obtain the final regular expression.
2. If r3 is not present in figure 3.9, the regular expression takes the form
   r = r1* r2 r4*                                                   (3.2)
3. If q0 and q1 are both final states, then the regular expression takes the form
   r = r1* + r1* r2 r4*                                             (3.3)
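As a quick sanity check of equation (3.1) (my own sketch, not part of the notes), the following Python code takes the two-state graph of figure 3.9 with the edge labels specialised to single symbols r1 = a, r2 = b, r3 = c, r4 = d, and compares it against the regular expression a*b(d + ca*b)* obtained from the formula:

import re
from itertools import product

# Transitions of the concrete graph: q0 --a--> q0, q0 --b--> q1,
# q1 --d--> q1, q1 --c--> q0.  q0 is the start state, q1 the only final state.
delta = {('q0', 'a'): 'q0', ('q0', 'b'): 'q1',
         ('q1', 'd'): 'q1', ('q1', 'c'): 'q0'}

def fa_accepts(w):
    state = 'q0'
    for ch in w:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state == 'q1'

# Equation (3.1): r = r1* r2 (r4 + r3 r1* r2)*  ->  a* b (d | c a* b)*
pattern = re.compile(r"a*b(d|ca*b)*")

# Compare the FA and the regular expression on all strings up to length 6.
for n in range(7):
    for tup in product("abcd", repeat=n):
        w = "".join(tup)
        assert fa_accepts(w) == bool(pattern.fullmatch(w)), w
print("equation (3.1) agrees with the FA on all strings of length <= 6")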
Example
Obtain a regular expression for the FA shown below:
[Figure: DFA over {0,1} with start/final state q0 and states q1, q2, q3 (q3 a trap state looping on 0,1); the machine accepts (01 + 10)*]
It is clear from the figure that the machine returns to the final state q0 only after reading 01 or 10, so it
accepts strings made up of 01's and 10's of any length, and the regular expression is
(01 + 10)*
Example
Obtain a regular expression for the FA shown below:
[Figure: FA over {0,1} with final states q0 (start) and q1, and a dead state q2; q0 loops on 0 and goes to q1 on 1; q1 loops on 1 and goes to q2 on 0; q2 loops on 0,1]
Since state q2 is a dead state, it can be removed and the following FA is obtained.
[Figure: FA with final states q0 (start) and q1; q0 loops on 0 and goes to q1 on 1; q1 loops on 1]
The state q0 is a final state, and at this point the machine can accept any number of 0's, which is
represented using the notation 0*.
q1 is also a final state. To reach q1 one inputs any number of 0's followed by a 1 and then
any number of 1's, which is represented as
0*11*
So the final regular expression is obtained by adding 0* and 0*11*:
R.E. = 0* + 0*11*
     = 0* (Ɛ + 11*)
     = 0* (Ɛ + 1+)
     = 0* (1*)          (since Ɛ + 1+ = 1*)
     = 0*1*
It is clear from the regular expression that the language consists of any number of 0's (possibly Ɛ) followed
by any number of 1's (possibly Ɛ).
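A tiny brute-force check (my own, not part of the notes) of the simplification just performed: the expressions 0* + 0*11* and 0*1* describe the same language, tested here on all binary strings up to length 7.

import re
from itertools import product

lhs = re.compile(r"0*|0*11*")      # 0* + 0*11*  (union written with | )
rhs = re.compile(r"0*1*")
for n in range(8):
    for bits in product("01", repeat=n):
        w = "".join(bits)
        assert bool(lhs.fullmatch(w)) == bool(rhs.fullmatch(w)), w
print("0* + 0*11* and 0*1* agree on all strings of length <= 7")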
Regular languages
In theoretical computer science and formal language theory, a regular language is a formal
language that can be expressed using a regular expression. Note that the "regular
expression" features provided with many programming languages are augmented with
features that make them capable of recognizing languages that cannot be expressed by the
formal regular expressions. In the Chomsky hierarchy, regular languages are defined to be
the languages generated by Type-3 grammars (regular grammars). Regular languages are very
useful in input parsing and programming language design.
Theorem (Pumping Lemma): Let M = (Q, ∑, δ, q0, A) be an FA with n states and let L be the regular language
accepted by M. Then for every string w in L with |w| ≥ n, w can be broken into three substrings w = xyz such that
1. |y| > 0
2. |xy| ≤ n
3. For all k ≥ 0, the string x y^k z is also in L.
PROOF:
Let L be a regular language accepted by an FA with n states, and let w = a1 a2 a3 ... am be a string in L with |w| = m ≥ n.
While reading w the machine passes through m + 1 states (including the start state), so by the pigeonhole principle
at least two of the first n + 1 of these states must be equal; say the state reached after a1 ... ai is the same as the
state reached after a1 ... aj, with i < j ≤ n. Take x = a1 ... ai, y = a(i+1) ... aj and z = a(j+1) ... am. Then |y| > 0
and |xy| ≤ n, and since y takes the machine around a loop back to the same state, the machine also accepts x y^k z
for every k ≥ 0.
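The proof is constructive: running the FA on the first n symbols of w and watching for the first repeated state yields the decomposition x, y, z. The following small Python sketch (my own illustration; the DFA encoding and the function name are assumptions, not from the notes) shows the idea on a two-state DFA:

def pump_split(delta, start, w, n):
    """delta: dict (state, symbol) -> state; n: number of states of the DFA."""
    state = start
    seen = {state: 0}                 # state -> position at which it was first seen
    for i, ch in enumerate(w[:n], start=1):
        state = delta[(state, ch)]
        if state in seen:             # pigeonhole: a state repeats within n+1 visits
            i0 = seen[state]
            return w[:i0], w[i0:i], w[i:]   # x, y (labels a loop), z
        seen[state] = i
    raise ValueError("no repeated state found (|w| must be >= n)")

# Example: 2-state DFA for strings over {0,1} with an even number of 1's.
delta = {('e', '0'): 'e', ('e', '1'): 'o', ('o', '0'): 'o', ('o', '1'): 'e'}
x, y, z = pump_split(delta, 'e', "0110", n=2)
print(repr(x), repr(y), repr(z))      # y labels a loop, so x y^k z stays in the language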
Uses of the Pumping Lemma: The pumping lemma is used to show that certain languages are not regular. It
should never be used to show that a language is regular. To show that a language is
regular, give a regular expression, DFA or NFA for it.
Example 1: To prove that L = {a^n b^n | n ≥ 0} is not regular.
Proof:
Assume L is regular and let n be the constant of the pumping lemma. Let w = a^n b^n, so that |w| = 2n. Since 2n > n
and L is regular, w must satisfy the pumping lemma: w = xyz with |xy| ≤ n and |y| > 0, so y consists only of a's.
Taking k = 0 (or k = 2) changes the number of a's but not the number of b's, so x y^k z is not in L, a contradiction.
Hence L is not regular.
Example 2: To prove that L = {w | w is a palindrome over {a, b}*} is not regular, i.e. L = {aabaa,
aba, abbbba, ...}.
Proof:
Assume L is regular and let n be the constant of the pumping lemma. Consider the word w =
a^n b a^n in L, so that |w| = 2n + 1. Since 2n + 1 > n and L is regular, w must satisfy the pumping lemma:
w = xyz with |xy| ≤ n and |y| > 0, so y consists only of a's from the first block. Pumping y changes the number of
leading a's without changing the trailing a's, so the resulting string is not a palindrome and is not in L, a
contradiction. Hence L is not regular.
Example 3: To prove that L = {all strings of 1's whose length is prime} is not regular, i.e.
L = {1^2, 1^3, 1^5, 1^7, 1^11, ...}.
Example 4: To prove that L = {0^(i^2) | i is an integer and i > 0} is not regular, i.e. L = {0, 0^4, 0^9, 0^16,
0^25, ...}.
Proof: Assume L is regular and let n be the constant of the pumping lemma. Let w = 0^(n^2), so that |w| = n^2 ≥ n.
By the pumping lemma, w = xyz with |y| > 0, |xy| ≤ n, and x y^k z ∈ L for all k = 0, 1, ...
Select k = 2. Then
|x y^2 z| = |xyz| + |y| = n^2 + |y|, where 1 ≤ |y| ≤ n.
Therefore n^2 < |x y^2 z| ≤ n^2 + n < n^2 + 2n + 1 = (n + 1)^2.
So |x y^2 z| lies strictly between the consecutive perfect squares n^2 and (n + 1)^2, and hence is not a perfect
square. Thus x y^2 z ∉ L, a contradiction, and L is not regular.
ROLE OF LEXICAL ANALYZER
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input
characters until it can identify the next token. The lexical analyzer (LA) returns to the parser a representation of the
token it has found. The representation is an integer code if the token is a simple construct
such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank, tab
and newline characters. Another is correlating error messages from the compiler with the source
program.
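A minimal sketch of this "get next token" interface (hypothetical names, in Python, not taken from the notes): the parser repeatedly calls get_next_token and receives a (token, attribute) pair, with simple constructs returned as single-character codes.

from collections import namedtuple

Token = namedtuple("Token", ["name", "attribute"])

class Lexer:
    def __init__(self, source):
        self.chars = iter(source.split())   # toy tokenization on white space

    def get_next_token(self):
        lexeme = next(self.chars, None)
        if lexeme is None:
            return Token("eof", None)
        if lexeme.isdigit():
            return Token("number", int(lexeme))
        if lexeme in "(),:;":
            return Token(lexeme, None)      # simple construct: the code is the token itself
        return Token("id", lexeme)

lex = Lexer("( sum , 10 ) :")
tok = lex.get_next_token()
while tok.name != "eof":
    print(tok)                              # the parser would consume these one by one
    tok = lex.get_next_token()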
Scanner: the scanner (lexical analyzer, or "lexer") simply turns an input string (say a file) into a
list of tokens. These tokens represent things like identifiers, parentheses, operators etc. It reads
individual symbols from the source code file and groups them into tokens; from there, the parser
proper turns those tokens into sentences of the grammar.
Parser: a parser converts this list of tokens into a tree-like object that represents how the tokens
fit together to form a cohesive whole (sometimes referred to as a sentence). A parser does not give
the nodes any meaning beyond structural cohesion; the next thing to do is to extract meaning from
this structure (sometimes called contextual analysis).
A pattern is a rule describing the set of lexemes that can represent a particular token in the source
program.
LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which means that
there is no way to recognize a lexeme as a valid token.
These errors are detected during the lexical analysis phase. Typical lexical errors are:
Exceeding the length limit of identifiers or numeric constants.
The appearance of illegal characters.
Unmatched strings.
INPUT BUFFERING
The lexical analyzer has to access secondary memory each time to identify tokens, which is time-consuming
and costly. So the input characters are stored in a buffer and then scanned by the lexical analyzer. The
lexical analyzer scans the input from left to right one character at a time. It uses two pointers, the lexeme
begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the input scanned. Initially both
pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered,
it indicates the end of the lexeme; for example, while scanning the keyword int, as soon as fp encounters a blank
space the lexeme "int" is identified.
When fp encounters white space it ignores it and moves ahead; then both the begin pointer (bp) and the forward
pointer (fp) are set to the start of the next token.
The input characters are thus read from secondary storage, but reading character by character from secondary
storage is costly, hence a buffering technique is used: a block of data is first read into a buffer and
then scanned by the lexical analyzer. Two methods are used in this context, the one-buffer scheme and the
two-buffer scheme, explained below.
Specialized buffering techniques are used to reduce the amount of overhead required to
process an input character.
Buffer pairs
The scheme consists of two buffers, each of N characters, which are reloaded alternately.
Two pointers, lexemeBegin and forward, are maintained.
lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
forward scans ahead until a match for a pattern is found.
Once the lexeme is found, lexemeBegin is set to the character immediately after the lexeme
just found and forward is set to the character at its right end.
The current lexeme is the set of characters between the two pointers.
What are Sentinels?
A sentinel is used to make a check: each time the forward pointer is moved, a check must be made to
ensure that one half of the buffer has not been moved off; if it has, the other half must be
reloaded.
Therefore the ends of the buffer halves require two tests for each advance of the forward
pointer. Test 1: for the end of the buffer. Test 2: to determine what character has been read. The use of a sentinel
reduces these two tests to one by extending each buffer half to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program (the eof character is used as the
sentinel).
Disadvantages: the amount of lookahead is limited to the size of a buffer half, so this scheme cannot handle tokens whose lexemes (or required lookahead) are longer than one buffer half.
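To make the two-buffer scheme concrete, here is a rough Python sketch (my own; the buffer size, class and method names are assumptions, not from the notes) in which each half ends with an eof sentinel and the forward pointer reloads the other half whenever it reads a sentinel at a half boundary:

import io

N = 4096                       # size of each buffer half
EOF = '\0'                     # sentinel character (cannot appear in the source)

class TwoBufferInput:
    def __init__(self, f):
        self.f = f
        self.buf = [EOF] * (2 * N + 2)     # two halves, one sentinel slot after each
        self._load(0)                      # fill the first half
        self.forward = 0

    def _load(self, start):
        data = self.f.read(N)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        self.buf[start + len(data)] = EOF  # sentinel marks end of half (or end of input)

    def next_char(self):
        ch = self.buf[self.forward]
        if ch != EOF:
            self.forward += 1
            return ch
        if self.forward == N:              # sentinel at end of first half: reload second
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:      # sentinel at end of second half: reload first
            self._load(0)
            self.forward = 0
            return self.next_char()
        return EOF                         # a sentinel inside a half means real end of input

src = TwoBufferInput(io.StringIO("int i = 10;"))
print(''.join(iter(src.next_char, EOF)))   # prints the source back, character by character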
SPECIFICATION OF TOKENS
1. In theory of compilation regular expressions are used to formalize the specification of tokens
2. Regular expressions are means for specifying regular languages
3. Example: letter_ (letter_ | digit)*
4. Each regular expression is a pattern specifying the form of strings
REGULAR EXPRESSIONS
1. Ɛ is a regular expression, L(Ɛ) = {Ɛ}
2. If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
3. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
4. (r)(s) is a regular expression denoting the language L(r)L(s)
5. (r)* is a regular expression denoting (L(r))*
REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and
to define regular expressions using these names as if they were symbols.
Identifiers are strings of letters and digits beginning with a letter.
The following regular definition provides a precise specification for this class of strings.
Example-1,
a b* | c d? is equivalent to (a(b*)) | (c(d?))
Pascal identifier
letter --> A | B | ... | Z | a | b | ... | z
digit  --> 0 | 1 | 2 | ... | 9
id     --> letter (letter | digit)*
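As a quick illustration (my own sketch, not part of the notes), this regular definition translates directly into a pattern a scanner could use; in Python regex syntax the hand-written alternatives become character classes:

import re

# letter --> [A-Za-z], digit --> [0-9], id --> letter (letter | digit)*
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for lexeme in ["count", "x1", "Total2", "2bad", "_tmp"]:
    print(lexeme, "->", bool(identifier.fullmatch(lexeme)))
# count, x1 and Total2 match; 2bad and _tmp do not (they do not start with a letter)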
RECOGNITION OF TOKENS:
We have learnt how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and finds a
prefix that is a lexeme matching one of the patterns. The running example uses the following regular definitions:
digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the
ASCII characters of the same names. Token ws is different from the other tokens in
that, when we recognize it, we do not return it to the parser, but rather restart the
lexical analysis from the character that follows the white space. It is the following
token that gets returned to the parser.
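Putting these regular definitions together, a hand-written scanner can be sketched as follows (a rough Python illustration, not part of the notes; the table of patterns mirrors the definitions of ws, relop, number, id and the keywords, and ws is recognized but never returned to the parser):

import re

# Each (token name, pattern) pair mirrors one regular definition above.
# Order matters: relop alternatives are longest-first, and ws is skipped entirely.
token_spec = [
    ("ws",     re.compile(r"[ \t\n]+")),
    ("relop",  re.compile(r"<=|>=|<>|<|>|=")),
    ("number", re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")),
    ("id",     re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]
keywords = {"if", "then", "else"}

def tokens(source):
    pos = 0
    while pos < len(source):
        for name, pattern in token_spec:
            m = pattern.match(source, pos)
            if m:
                lexeme = m.group()
                pos = m.end()
                if name == "ws":                 # white space: restart, return nothing
                    break
                if name == "id" and lexeme in keywords:
                    yield (lexeme, lexeme)       # keyword token
                else:
                    yield (name, lexeme)
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}: {source[pos]!r}")

print(list(tokens("if x1 <= 42 then y else z")))
# [('if','if'), ('id','x1'), ('relop','<='), ('number','42'),
#  ('then','then'), ('id','y'), ('else','else'), ('id','z')]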
TRANSITION DIAGRAM:
Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input
while looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled
by a symbol or a set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out
of state s labeled by a. If we find such an edge, we advance the forward pointer
and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are
1. Certain states are said to be accepting, or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all the
positions between the lexemeBegin and forward pointers. We always indicate an accepting
state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then
we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge
labeled "start" entering from nowhere. The transition diagram always begins in
the start state before any input symbols have been read.
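For instance, the transition diagram for relop given by the regular definition earlier can be coded state by state. The sketch below is my own illustration in Python, not from the notes; retraction (the * convention above) is modelled simply by not consuming the lookahead character.

# States of the relop transition diagram, coded by hand (illustrative sketch).
# In the * states the last character read does not belong to the lexeme, so the
# forward pointer stays on it (it is simply not consumed here).
def relop(source, pos):
    """Return (token, new_pos) if a relop starts at pos, else None."""
    def peek(i):
        return source[i] if i < len(source) else ""

    c = peek(pos)
    if c == "<":
        c2 = peek(pos + 1)
        if c2 == "=":
            return ("LE", pos + 2)      # state for <=
        if c2 == ">":
            return ("NE", pos + 2)      # state for <>
        return ("LT", pos + 1)          # * state: lexeme is just '<'
    if c == "=":
        return ("EQ", pos + 1)
    if c == ">":
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)      # state for >=
        return ("GT", pos + 1)          # * state: lexeme is just '>'
    return None                          # not a relational operator

for text in ["<=", "<>", "<5", ">=", ">x", "=", "a"]:
    print(text, "->", relop(text, 0))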