Module 2
AUTOMATA THEORY
AND COMPILER
DESIGN- 21CS51
MODULE-2
Regular Expressions and Languages: Regular Expressions, Finite Automata and Regular
Expressions, Proving Languages Not to Be Regular
Lexical Analysis Phase of compiler Design: Role of Lexical Analyzer, Input Buffering,
Specification of Token, Recognition of Token.
Definition: Let M = (Q, ∑, δ, q0, A) be a DFA. A language L is regular if there exists a machine
M such that L = L(M).
Definition: A regular expression is recursively defined as follows.
1. Ø is a regular expression denoting the empty language.
2. Ɛ (epsilon) is a regular expression denoting the language containing only the empty string.
3. a is a regular expression denoting the language containing only {a}.
4. If R is a regular expression denoting the language L(R) and S is a regular expression denoting
the language L(S), then
   R + S is a regular expression denoting the language L(R) ∪ L(S),
   RS is a regular expression denoting the language L(R)L(S), and
   R* is a regular expression denoting the language (L(R))*.
Regular expression      Meaning
a*                      Strings consisting of any number of a's (0 or more).
a+                      Strings consisting of at least one a (1 or more).
(a+b)                   Strings consisting of either one a or one b.
(a+b)*                  Strings of a's and b's of any length, including the empty string.
(a+b)*abb               Strings of a's and b's ending with the string abb.
ab(a+b)*                Strings of a's and b's starting with the string ab.
(a+b)*aa(a+b)*          Strings of a's and b's having aa as a substring.
a*b*c*                  Any number of a's (possibly none), followed by any number of b's (possibly none), followed by any number of c's (possibly none).
a+b+c+                  At least one a, followed by at least one b, followed by at least one c.
aa*bb*cc*               At least one a, followed by at least one b, followed by at least one c (same as a+b+c+).
(a+b)*(a+bb)            Strings of a's and b's ending with either a or bb.
(aa)*(bb)*b             An even number of a's followed by an odd number of b's.
(1+Ɛ)(00*1)*0*          Strings of 0's and 1's without any consecutive 1's.
(0+1)*000               Strings of 0's and 1's ending with three consecutive zeros.
(11)*                   Strings consisting of an even number of 1's.
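The + in these expressions denotes union; most programming-language regex engines write union as | instead. As a rough illustration (my own sketch, not part of the notes), the short Python program below checks a few of the patterns from the table against sample strings using the standard re module:

import re

# In the notes, + denotes union; Python's re module uses | instead.
# Kleene star (*) and concatenation carry over directly.
patterns = {
    "(a+b)*abb": r"(a|b)*abb",       # strings of a's and b's ending in abb
    "ab(a+b)*":  r"ab(a|b)*",        # strings of a's and b's starting with ab
    "(0+1)*000": r"(0|1)*000",       # strings of 0's and 1's ending in 000
}

for name, pat in patterns.items():
    for s in ["abb", "aabb", "ababb", "ab", "000", "10100"]:
        ok = re.fullmatch(pat, s) is not None
        print(f"{name!r:14} {s!r:10} -> {ok}")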
Theorem: For every regular expression R there exists a finite automaton M = (Q, ∑, δ, q0, A) such that L(M) = L(R).
Proof: By definition, Ø, Ɛ and a are regular expressions. The corresponding machines to recognize
these expressions are shown in the figure below.
[Figure: elementary automata — for Ø, states q0 and qf with no transition between them; for Ɛ, an Ɛ-transition from q0 to qf; for a, a transition on a from q0 to qf]
The schematic representation of a machine M accepting the language L(R) of a regular expression R is shown in
the figure below, where q is the start state and f is the final state of machine M.
[Figure: machine M with start state q and final state f, accepting L(R)]
Case 1: R = R1 + R2. We can construct an NFA that accepts either L(R1) or L(R2), i.e. L(R1 + R2), as shown in figure 3.3.
[Figure 3.3: NFA to accept L(R1 + R2) — a new start state q0 with Ɛ-transitions to the start states q1 and q2 of machines M1 and M2, and Ɛ-transitions from their final states f1 and f2 to a new final state qf]
It is clear from the figure that the machine can accept either L(R1) or L(R2). Here, q0 is the start state of
the combined machine and qf is the final state of the combined machine M.
Case 2: R = R1 . R2. We can construct an NFA that accepts L(R1) followed by L(R2), i.e. L(R1 . R2), as shown in figure 3.4.
[Figure 3.4: NFA to accept L(R1 . R2) — machine M1 (states q1 to f1) followed by an Ɛ-transition into machine M2 (states q2 to f2)]
It is clear from the figure that after accepting L(R1) the machine moves from state q1 to f1. Since there is
an Ɛ-transition, without any input there is a transition from state f1 to state q2. In state q2, upon
accepting L(R2), the machine moves to f2, which is the final state. Thus q1, the start state of
machine M1, becomes the start state of the combined machine M, and f2, the final state of
machine M2, becomes the final state of M, which accepts the language L(R1 . R2).
Case 3: R = (R1)*. We can construct an NFA that accepts (L(R1))*, as shown in figure 3.5(a); it
can also be represented as shown in figure 3.5(b).
[Figure 3.5: NFA to accept L(R1)* — a new start state q0 and a new final state qf, with Ɛ-transitions from q0 to q1 and to qf, from f1 to qf, and from f1 back to q1 to allow repetition]
It is clear from the figure that the machine can accept either Ɛ or any number of repetitions of strings from L(R1), thus accepting
the language L(R1)*. Here, q0 is the start state and qf is the final state.
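Taken together, the three cases above are Thompson's construction. The following sketch (my own illustration in Python; the NFA representation, the function names and the helper accepts are assumptions, not from the notes) builds Ɛ-NFAs for union, concatenation and star exactly as in the figures, and tests the machine for (a+b)*abb:

from dataclasses import dataclass, field

EPS = None  # label used for epsilon-transitions

@dataclass
class NFA:
    start: int
    final: int
    trans: dict = field(default_factory=dict)   # state -> list of (label, next_state)

    def add(self, s, label, t):
        self.trans.setdefault(s, []).append((label, t))

_count = [0]
def new_state():
    _count[0] += 1
    return _count[0]

def symbol(a):                      # elementary machine for a single symbol a
    q, f = new_state(), new_state()
    m = NFA(q, f)
    m.add(q, a, f)
    return m

def union(m1, m2):                  # Case 1: R = R1 + R2
    q, f = new_state(), new_state()
    m = NFA(q, f, {**m1.trans, **m2.trans})
    m.add(q, EPS, m1.start); m.add(q, EPS, m2.start)
    m.add(m1.final, EPS, f); m.add(m2.final, EPS, f)
    return m

def concat(m1, m2):                 # Case 2: R = R1 . R2
    m = NFA(m1.start, m2.final, {**m1.trans, **m2.trans})
    m.add(m1.final, EPS, m2.start)  # epsilon-transition from f1 to q2
    return m

def star(m1):                       # Case 3: R = (R1)*
    q, f = new_state(), new_state()
    m = NFA(q, f, dict(m1.trans))
    m.add(q, EPS, m1.start); m.add(m1.final, EPS, f)
    m.add(q, EPS, f)                # accept the empty string
    m.add(m1.final, EPS, m1.start)  # allow any number of repetitions
    return m

def eps_closure(m, states):
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in m.trans.get(s, []):
            if label is EPS and t not in seen:
                seen.add(t); stack.append(t)
    return seen

def accepts(m, w):
    cur = eps_closure(m, {m.start})
    for c in w:
        nxt = {t for s in cur for label, t in m.trans.get(s, []) if label == c}
        cur = eps_closure(m, nxt)
    return m.final in cur

# NFA for (a + b)* abb : strings of a's and b's ending in abb
m = concat(star(union(symbol('a'), symbol('b'))),
           concat(symbol('a'), concat(symbol('b'), symbol('b'))))
print(accepts(m, "aababb"), accepts(m, "ab"))   # True False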
Example
Obtain an NFA which accepts strings of a’s and b’s starting with the string ab.
[Figure: step-by-step Thompson construction of the NFA for ab(a+b)* — elementary machines for a and b, their concatenation giving ab, the machine for (a+b)*, and finally the combined NFA with states 0 through 9]
Obtaining a regular expression from an FA: consider the generalized transition graph shown in figure 3.9.
[Figure 3.9: generalized transition graph — start state q0 with self-loop r1 and an edge r2 to the final state q1; q1 has a self-loop r4 and an edge r3 back to q0]
Here r1, r2, r3 and r4 are regular expressions that correspond to the labels of the edges. The
regular expression for this graph can take the form:
r = r1* r2 (r4 + r3 r1* r2)*                                       (3.1)
Note:
1. Any graph can be reduced to the graph shown in figure 3.9. Then substitute the regular
expressions appropriately into equation (3.1) to obtain the final regular expression.
2. If r3 is not present in figure 3.9, the regular expression takes the form
   r = r1* r2 r4*                                                   (3.2)
3. If q0 and q1 are both final states, then the regular expression takes the form
   r = r1* + r1* r2 r4*                                             (3.3)
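As a quick sanity check of equation (3.1) (my own sketch, not part of the notes), the following Python code takes the two-state graph of figure 3.9 with the edge labels specialised to single symbols r1 = a, r2 = b, r3 = c, r4 = d, and compares it against the regular expression a*b(d + ca*b)* obtained from the formula:

import re
from itertools import product

# Transitions of the concrete graph: q0 --a--> q0, q0 --b--> q1,
# q1 --d--> q1, q1 --c--> q0.  q0 is the start state, q1 the only final state.
delta = {('q0', 'a'): 'q0', ('q0', 'b'): 'q1',
         ('q1', 'd'): 'q1', ('q1', 'c'): 'q0'}

def fa_accepts(w):
    state = 'q0'
    for ch in w:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state == 'q1'

# Equation (3.1): r = r1* r2 (r4 + r3 r1* r2)*  ->  a* b (d | c a* b)*
pattern = re.compile(r"a*b(d|ca*b)*")

# Compare the FA and the regular expression on all strings up to length 6.
for n in range(7):
    for tup in product("abcd", repeat=n):
        w = "".join(tup)
        assert fa_accepts(w) == bool(pattern.fullmatch(w)), w
print("equation (3.1) agrees with the FA on all strings of length <= 6")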
Example
Obtain a regular expression for the FA shown below:
[Figure: DFA over {0,1} with start/final state q0 and states q1, q2, q3 (q3 a trap state looping on 0,1); the machine accepts (01 + 10)*]
It is clear from the figure that the machine returns to the final state q0 only after reading 01 or 10, so it
accepts strings made up of 01's and 10's of any length, and the regular expression is
(01 + 10)*
Example
Obtain a regular expression for the FA shown below:
[Figure: FA over {0,1} with final states q0 (start) and q1, and a dead state q2; q0 loops on 0 and goes to q1 on 1; q1 loops on 1 and goes to q2 on 0; q2 loops on 0,1]
Since state q2 is a dead state, it can be removed and the following FA is obtained.
[Figure: FA with final states q0 (start) and q1; q0 loops on 0 and goes to q1 on 1; q1 loops on 1]
The state q0 is a final state, and at this point the machine can accept any number of 0's, which is
represented using the notation 0*.
q1 is also a final state. To reach q1 one inputs any number of 0's followed by a 1 and then
any number of 1's, which is represented as
0*11*
So the final regular expression is obtained by adding 0* and 0*11*:
R.E. = 0* + 0*11*
     = 0* (Ɛ + 11*)
     = 0* (Ɛ + 1+)
     = 0* (1*)          (since Ɛ + 1+ = 1*)
     = 0*1*
It is clear from the regular expression that the language consists of any number of 0's (possibly Ɛ) followed
by any number of 1's (possibly Ɛ).
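A tiny brute-force check (my own, not part of the notes) of the simplification just performed: the expressions 0* + 0*11* and 0*1* describe the same language, tested here on all binary strings up to length 7.

import re
from itertools import product

lhs = re.compile(r"0*|0*11*")      # 0* + 0*11*  (union written with | )
rhs = re.compile(r"0*1*")
for n in range(8):
    for bits in product("01", repeat=n):
        w = "".join(bits)
        assert bool(lhs.fullmatch(w)) == bool(rhs.fullmatch(w)), w
print("0* + 0*11* and 0*1* agree on all strings of length <= 7")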
Regular languages
In theoretical computer science and formal language theory, a regular language is a formal
language that can be expressed using a regular expression. Note that the "regular
expression" features provided with many programming languages are augmented with
features that make them capable of recognizing languages that cannot be expressed by the
formal regular expressions. In the Chomsky hierarchy, regular languages are defined to be
the languages generated by Type-3 grammars (regular grammars). Regular languages are very
useful in input parsing and programming language design.
Theorem (Pumping Lemma): Let M = (Q, ∑, δ, q0, A) be an FA with n states and let L be the regular language
accepted by M. Then for every string w in L with |w| ≥ n, w can be broken into three substrings w = xyz such that
1. |y| > 0
2. |xy| ≤ n
3. For all k ≥ 0, the string x y^k z is also in L.
PROOF:
Let L be a regular language accepted by an FA with n states, and let w = a1 a2 a3 ... am be a string in L with |w| = m ≥ n.
While reading w the machine passes through m + 1 states (including the start state), so by the pigeonhole principle
at least two of the first n + 1 of these states must be equal; say the state reached after a1 ... ai is the same as the
state reached after a1 ... aj, with i < j ≤ n. Take x = a1 ... ai, y = a(i+1) ... aj and z = a(j+1) ... am. Then |y| > 0
and |xy| ≤ n, and since y takes the machine around a loop back to the same state, the machine also accepts x y^k z
for every k ≥ 0.
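The proof is constructive: running the FA on the first n symbols of w and watching for the first repeated state yields the decomposition x, y, z. The following small Python sketch (my own illustration; the DFA encoding and the function name are assumptions, not from the notes) shows the idea on a two-state DFA:

def pump_split(delta, start, w, n):
    """delta: dict (state, symbol) -> state; n: number of states of the DFA."""
    state = start
    seen = {state: 0}                 # state -> position at which it was first seen
    for i, ch in enumerate(w[:n], start=1):
        state = delta[(state, ch)]
        if state in seen:             # pigeonhole: a state repeats within n+1 visits
            i0 = seen[state]
            return w[:i0], w[i0:i], w[i:]   # x, y (labels a loop), z
        seen[state] = i
    raise ValueError("no repeated state found (|w| must be >= n)")

# Example: 2-state DFA for strings over {0,1} with an even number of 1's.
delta = {('e', '0'): 'e', ('e', '1'): 'o', ('o', '0'): 'o', ('o', '1'): 'e'}
x, y, z = pump_split(delta, 'e', "0110", n=2)
print(repr(x), repr(y), repr(z))      # y labels a loop, so x y^k z stays in the language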
Uses of the Pumping Lemma: The pumping lemma is used to show that certain languages are not regular. It
should never be used to show that a language is regular. To show that a language is
regular, give a regular expression, DFA or NFA for it.
Example 1: To prove that L = {a^n b^n | n ≥ 0} is not regular.
Proof:
Assume L is regular and let n be the constant of the pumping lemma. Let w = a^n b^n, so that |w| = 2n. Since 2n > n
and L is regular, w must satisfy the pumping lemma: w = xyz with |xy| ≤ n and |y| > 0, so y consists only of a's.
Taking k = 0 (or k = 2) changes the number of a's but not the number of b's, so x y^k z is not in L, a contradiction.
Hence L is not regular.
Example 2: To prove that L = {w | w is a palindrome over {a, b}*} is not regular, i.e. L = {aabaa,
aba, abbbba, ...}.
Proof:
Assume L is regular and let n be the constant of the pumping lemma. Consider the word w =
a^n b a^n in L, so that |w| = 2n + 1. Since 2n + 1 > n and L is regular, w must satisfy the pumping lemma:
w = xyz with |xy| ≤ n and |y| > 0, so y consists only of a's from the first block. Pumping y changes the number of
leading a's without changing the trailing a's, so the resulting string is not a palindrome and is not in L, a
contradiction. Hence L is not regular.
Example 3: To prove that L = {all strings of 1's whose length is prime} is not regular, i.e.
L = {1^2, 1^3, 1^5, 1^7, 1^11, ...}.
Example 4: To prove that L = {0^(i^2) | i is an integer and i > 0} is not regular, i.e. L = {0, 0^4, 0^9, 0^16,
0^25, ...}.
Proof: Assume L is regular and let n be the constant of the pumping lemma. Let w = 0^(n^2), so that |w| = n^2 ≥ n.
By the pumping lemma, w = xyz with |y| > 0, |xy| ≤ n, and x y^k z ∈ L for all k = 0, 1, ...
Select k = 2. Then
|x y^2 z| = |xyz| + |y| = n^2 + |y|, where 1 ≤ |y| ≤ n.
Therefore n^2 < |x y^2 z| ≤ n^2 + n < n^2 + 2n + 1 = (n + 1)^2.
So |x y^2 z| lies strictly between the consecutive perfect squares n^2 and (n + 1)^2, and hence is not a perfect
square. Thus x y^2 z ∉ L, a contradiction, and L is not regular.
ROLE OF LEXICAL ANALYZER
Upon receiving a "get next token" command from the parser, the lexical analyzer reads input
characters until it can identify the next token. The lexical analyzer (LA) returns to the parser a representation of the
token it has found. The representation is an integer code if the token is a simple construct
such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program comments and white space in the form of blank, tab
and newline characters. Another is correlating error messages from the compiler with the source
program.
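A minimal sketch of this "get next token" interface (hypothetical names, in Python, not taken from the notes): the parser repeatedly calls get_next_token and receives a (token, attribute) pair, with simple constructs returned as single-character codes.

from collections import namedtuple

Token = namedtuple("Token", ["name", "attribute"])

class Lexer:
    def __init__(self, source):
        self.chars = iter(source.split())   # toy tokenization on white space

    def get_next_token(self):
        lexeme = next(self.chars, None)
        if lexeme is None:
            return Token("eof", None)
        if lexeme.isdigit():
            return Token("number", int(lexeme))
        if lexeme in "(),:;":
            return Token(lexeme, None)      # simple construct: the code is the token itself
        return Token("id", lexeme)

lex = Lexer("( sum , 10 ) :")
tok = lex.get_next_token()
while tok.name != "eof":
    print(tok)                              # the parser would consume these one by one
    tok = lex.get_next_token()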
Scanner: the scanner (lexical analyzer, or "lexer") simply turns an input string (say a file) into a
list of tokens. These tokens represent things like identifiers, parentheses, operators etc. It reads
individual symbols from the source code file and groups them into tokens; from there, the parser
proper turns those tokens into sentences of the grammar.
Parser: a parser converts this list of tokens into a tree-like object that represents how the tokens
fit together to form a cohesive whole (sometimes referred to as a sentence). A parser does not give
the nodes any meaning beyond structural cohesion; the next thing to do is to extract meaning from
this structure (sometimes called contextual analysis).
A pattern is a rule describing the set of lexemes that can represent a particular token in the source
program.
LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which means that
there is no way to recognize a lexeme as a valid token.
These errors are detected during the lexical analysis phase. Typical lexical errors are:
Exceeding the length limit of identifiers or numeric constants.
The appearance of illegal characters.
Unmatched strings.
INPUT BUFFERING
The lexical analyzer has to access secondary memory each time to identify tokens, which is time-consuming
and costly. So the input characters are stored in a buffer and then scanned by the lexical analyzer. The
lexical analyzer scans the input from left to right one character at a time. It uses two pointers, the lexeme
begin pointer (bp) and the forward pointer (fp), to keep track of the portion of the input scanned. Initially both
pointers point to the first character of the input string.
The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered,
it indicates the end of the lexeme; for example, while scanning the keyword int, as soon as fp encounters a blank
space the lexeme "int" is identified.
When fp encounters white space it ignores it and moves ahead; then both the begin pointer (bp) and the forward
pointer (fp) are set to the start of the next token.
The input characters are thus read from secondary storage, but reading character by character from secondary
storage is costly, hence a buffering technique is used: a block of data is first read into a buffer and
then scanned by the lexical analyzer. Two methods are used in this context, the one-buffer scheme and the
two-buffer scheme, explained below.
Specialized buffering techniques are used to reduce the amount of overhead required to
process an input character.
Buffer pairs
The scheme consists of two buffers, each of N characters, which are reloaded alternately.
Two pointers, lexemeBegin and forward, are maintained.
lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
forward scans ahead until a match for a pattern is found.
Once the lexeme is found, lexemeBegin is set to the character immediately after the lexeme
just found and forward is set to the character at its right end.
The current lexeme is the set of characters between the two pointers.
What are Sentinels?
A sentinel is used to make a check: each time the forward pointer is moved, a check must be made to
ensure that one half of the buffer has not been moved off; if it has, the other half must be
reloaded.
Therefore the ends of the buffer halves require two tests for each advance of the forward
pointer. Test 1: for the end of the buffer. Test 2: to determine what character has been read. The use of a sentinel
reduces these two tests to one by extending each buffer half to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program (the eof character is used as the
sentinel).
Disadvantages: the amount of lookahead is limited to the size of a buffer half, so this scheme cannot handle tokens whose lexemes (or required lookahead) are longer than one buffer half.
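To make the two-buffer scheme concrete, here is a rough Python sketch (my own; the buffer size, class and method names are assumptions, not from the notes) in which each half ends with an eof sentinel and the forward pointer reloads the other half whenever it reads a sentinel at a half boundary:

import io

N = 4096                       # size of each buffer half
EOF = '\0'                     # sentinel character (cannot appear in the source)

class TwoBufferInput:
    def __init__(self, f):
        self.f = f
        self.buf = [EOF] * (2 * N + 2)     # two halves, one sentinel slot after each
        self._load(0)                      # fill the first half
        self.forward = 0

    def _load(self, start):
        data = self.f.read(N)
        for i, ch in enumerate(data):
            self.buf[start + i] = ch
        self.buf[start + len(data)] = EOF  # sentinel marks end of half (or end of input)

    def next_char(self):
        ch = self.buf[self.forward]
        if ch != EOF:
            self.forward += 1
            return ch
        if self.forward == N:              # sentinel at end of first half: reload second
            self._load(N + 1)
            self.forward = N + 1
            return self.next_char()
        if self.forward == 2 * N + 1:      # sentinel at end of second half: reload first
            self._load(0)
            self.forward = 0
            return self.next_char()
        return EOF                         # a sentinel inside a half means real end of input

src = TwoBufferInput(io.StringIO("int i = 10;"))
print(''.join(iter(src.next_char, EOF)))   # prints the source back, character by character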
SPECIFICATION OF TOKENS
1. In theory of compilation regular expressions are used to formalize the specification of tokens
2. Regular expressions are means for specifying regular languages
3. Example: letter_ (letter_ | digit)*
4. Each regular expression is a pattern specifying the form of strings
REGULAR EXPRESSIONS
1. Ɛ is a regular expression, L(Ɛ) = {Ɛ}
2. If a is a symbol in ∑, then a is a regular expression, L(a) = {a}
3. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
4. (r)(s) is a regular expression denoting the language L(r)L(s)
5. (r)* is a regular expression denoting (L(r))*
REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and
to define regular expressions using these names as if they were symbols.
Identifiers are strings of letters and digits beginning with a letter.
The following regular definition provides a precise specification for this class of strings.
Example-1,
a b* | c d? is equivalent to (a(b*)) | (c(d?))
Pascal identifier
letter --> A | B | ... | Z | a | b | ... | z
digit  --> 0 | 1 | 2 | ... | 9
id     --> letter (letter | digit)*
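As a quick illustration (my own sketch, not part of the notes), this regular definition translates directly into a pattern a scanner could use; in Python regex syntax the hand-written alternatives become character classes:

import re

# letter --> [A-Za-z], digit --> [0-9], id --> letter (letter | digit)*
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for lexeme in ["count", "x1", "Total2", "2bad", "_tmp"]:
    print(lexeme, "->", bool(identifier.fullmatch(lexeme)))
# count, x1 and Total2 match; 2bad and _tmp do not (they do not start with a letter)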
RECOGNITION OF TOKENS:
We have learnt how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and finds a
prefix that is a lexeme matching one of the patterns. The running example uses the following regular definitions:
digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the
ASCII characters of the same names. Token ws is different from the other tokens in
that, when we recognize it, we do not return it to the parser, but rather restart the
lexical analysis from the character that follows the white space. It is the following
token that gets returned to the parser.
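Putting these regular definitions together, a hand-written scanner can be sketched as follows (a rough Python illustration, not part of the notes; the table of patterns mirrors the definitions of ws, relop, number, id and the keywords, and ws is recognized but never returned to the parser):

import re

# Each (token name, pattern) pair mirrors one regular definition above.
# Order matters: relop alternatives are longest-first, and ws is skipped entirely.
token_spec = [
    ("ws",     re.compile(r"[ \t\n]+")),
    ("relop",  re.compile(r"<=|>=|<>|<|>|=")),
    ("number", re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")),
    ("id",     re.compile(r"[A-Za-z][A-Za-z0-9]*")),
]
keywords = {"if", "then", "else"}

def tokens(source):
    pos = 0
    while pos < len(source):
        for name, pattern in token_spec:
            m = pattern.match(source, pos)
            if m:
                lexeme = m.group()
                pos = m.end()
                if name == "ws":                 # white space: restart, return nothing
                    break
                if name == "id" and lexeme in keywords:
                    yield (lexeme, lexeme)       # keyword token
                else:
                    yield (name, lexeme)
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}: {source[pos]!r}")

print(list(tokens("if x1 <= 42 then y else z")))
# [('if','if'), ('id','x1'), ('relop','<='), ('number','42'),
#  ('then','then'), ('id','y'), ('else','else'), ('id','z')]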
TRANSITION DIAGRAM:
Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input
while looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled
by a symbol or a set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out
of state s labeled by a. If we find such an edge, we advance the forward pointer
and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are
1. Certain states are said to be accepting, or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all the
positions between the lexemeBegin and forward pointers. We always indicate an accepting
state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then
we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge
labeled "start" entering from nowhere. The transition diagram always begins in
the start state before any input symbols have been read.
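For instance, the transition diagram for relop given by the regular definition earlier can be coded state by state. The sketch below is my own illustration in Python, not from the notes; retraction (the * convention above) is modelled simply by not consuming the lookahead character.

# States of the relop transition diagram, coded by hand (illustrative sketch).
# In the * states the last character read does not belong to the lexeme, so the
# forward pointer stays on it (it is simply not consumed here).
def relop(source, pos):
    """Return (token, new_pos) if a relop starts at pos, else None."""
    def peek(i):
        return source[i] if i < len(source) else ""

    c = peek(pos)
    if c == "<":
        c2 = peek(pos + 1)
        if c2 == "=":
            return ("LE", pos + 2)      # state for <=
        if c2 == ">":
            return ("NE", pos + 2)      # state for <>
        return ("LT", pos + 1)          # * state: lexeme is just '<'
    if c == "=":
        return ("EQ", pos + 1)
    if c == ">":
        if peek(pos + 1) == "=":
            return ("GE", pos + 2)      # state for >=
        return ("GT", pos + 1)          # * state: lexeme is just '>'
    return None                          # not a relational operator

for text in ["<=", "<>", "<5", ">=", ">x", "=", "a"]:
    print(text, "->", relop(text, 0))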