CHAPTER 3

Finite Automata
and Regular Languages

3.1 Introduction
3.1.1 States and Automata
A finite-state machine or finite automaton (the noun comes from the Greek;
the singular is automaton, the Greek-derived plural is automata,
although automatons is considered acceptable in modern English) is a
limited, mechanistic model of computation. Its main focus is the notion
of state. This is a notion with which we are all familiar from interaction
with many different controllers, such as elevators, ovens, stereo systems,
and so on. All of these systems (but most obviously one like the elevator)
can be in one of a fixed number of states. For instance, the elevator can be
on any one of the floors, with doors open or closed, or it can be moving
between floors; in addition, it may have pending requests to move to certain
floors, generated from inside (by passengers) or from outside (by would-be
passengers). The current state of the system entirely dictates what the system
does next, something we can easily observe on very simple systems such as
single elevators or microwave ovens. To a degree, of course, every machine
ever made by man is a finite-state system; however, when the number of
states grows large, the finite-state model ceases to be appropriate, simply
because it defies comprehension by its users, namely humans. In particular,
while a computer is certainly a finite-state system (its memory and registers
can store either a 1 or a 0 in each of the bits, giving rise to a fixed number
of states), the number of states is so large (a machine with 32 Mbytes of
memory has on the order of $10^{81{,}000{,}000}$ states; a mathematician from the
intuitionist school would flatly deny that this is a finite number!) that it is
altogether unreasonable to consider it to be a finite-state machine. However,
the finite-state model works well for logic circuit design (arithmetic and
logic units, buffers, I/O handlers, etc.) and for certain programming utilities
(such well-known Unix tools as lex, grep, awk, and others, including the
pattern-matching tools of editors, are directly based on finite automata),
where the number of states remains small.
Informally, a finite automaton is characterized by a finite set of states and
a transition function that dictates how the automaton moves from one state
to another. At this level of characterization, we can introduce a graphical
representation of the finite automaton, in which states are represented as
disks and transitions between states as arcs between the disks. The starting
state (the state in which the automaton begins processing) is identified by
a tail-less arc pointing to it; in Figure 3.1(a), this state is $q_1$. The input
can be regarded as a string that is processed symbol by symbol from
left to right, each symbol inducing a transition before being discarded.
Graphically, we label each transition with the symbol or symbols that cause
it to happen. Figure 3.1(b) shows an automaton with input alphabet {0, 1}.
The automaton stops when the input string has been completely processed;
thus on an input string of n symbols, the automaton goes through exactly
n transitions before stopping.
More formally, a finite automaton is a four-tuple, made of an alphabet,
a set of states, a distinguished starting state, and a transition function. In
the example of Figure 3.1(b), the alphabet is $\Sigma = \{0, 1\}$; the set of states is
$Q = \{q_1, q_2, q_3\}$; the start state is $q_1$; and the transition function $\delta$, which uses
the current state and current input symbol to determine the next state, is
given by the table of Figure 3.2. Note that $\delta$ is not defined for every possible
input pair: if the machine is in state $q_2$ and the current input symbol is 1,
then the machine stops in error.

[Figure 3.1: Informal finite automata. (a) An informal finite automaton with states q1, q2, q3; (b) a finite automaton with state transitions.]

  state | input 0 | input 1
  ------+---------+-------------
  q1    | q2      | q2
  q2    | q3      | (undefined)
  q3    | q3      | q2

Figure 3.2 The transition function for the automaton of Figure 3.1(b).
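
The table of Figure 3.2 can be written down directly as a lookup structure. The following Python sketch is only an illustration: the names `DELTA` and `run` are introduced here, and the placement of the missing entry at (q2, 1) follows the remark in the text that the machine stops in error in that configuration.

```python
# Transition table of Figure 3.2; the pair ("q2", "1") is deliberately absent.
DELTA = {
    ("q1", "0"): "q2", ("q1", "1"): "q2",
    ("q2", "0"): "q3",
    ("q3", "0"): "q3", ("q3", "1"): "q2",
}


def run(word, start="q1"):
    """Return the state reached after reading word, or None if the
    machine stops in error on an undefined transition."""
    state = start
    for symbol in word:
        if (state, symbol) not in DELTA:
            return None  # no transition defined for this pair
        state = DELTA[(state, symbol)]
    return state


print(run("100"))  # q1 -1-> q2 -0-> q3 -0-> q3
print(run("11"))   # q1 -1-> q2, then no transition on 1
```

On an input of n symbols the loop performs exactly n lookups, one per transition, unless it hits the undefined pair first.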

As defined, a finite automaton processes an input string but does
not produce anything. We could define an automaton that produces a
symbol from some output alphabet at each transition or in each state,
thus producing a transducer, an automaton that transforms an input string
on the input alphabet into an output string on the output alphabet. Such
transducers are called sequential machines by computer engineers (or, more
specifically, Moore machines when the output is produced in each state
and Mealy machines when the output is produced at each transition)
and are used extensively in designing logic circuits. Similar
transducers are implemented in software for various string-handling tasks
(lex, grep, and sed, to name but a few, are all utilities based on finite-state
transducers). We shall instead remain at the simpler level of language
membership, where the transducers compute maps from $\Sigma^*$ to $\{0, 1\}$ rather
than to $\Delta^*$ for some output alphabet $\Delta$. The results we shall obtain in this
simpler framework are easier to derive yet extend easily to the more general
framework.

3.1.2 Finite Automata as Language Acceptors


Finite automata can be used to recognize languages, i.e., to implement
functions $f : \Sigma^* \to \{0, 1\}$. The finite automaton decides whether the string
is in the language with the help of a label (the value of the function)
assigned to each of its states: when the finite automaton stops in some state
q, the label of q gives the value of the function. In the case of language
acceptance, there are only two labels: 0 and 1, or reject and accept.
Thus we can view the set of states of a finite automaton used for language
recognition as partitioned into two subsets, the rejecting states and the
accepting states. Graphically, we distinguish the accepting states by double
circles, as shown in Figure 3.3. This finite automaton has two states, one
accepting and one rejecting; its input alphabet is {0, 1}; it can easily be seen
to accept every string with an even (possibly zero) number of 1s. Since the
initial state is accepting, this automaton accepts the empty string. As further
examples, the automaton of Figure 3.4(a) accepts only the empty string,

[Figure 3.3: An automaton that accepts strings with an even number of 1s.]

[Figure 3.4: Some simple finite automata. (a) A finite automaton that accepts $\{\varepsilon\}$; (b) a finite automaton that accepts $\{0, 1\}^+$.]

while that of Figure 3.4(b) accepts everything except the empty string. This
last construction may suggest that, in order to accept the complement of a
language, it suffices to flip the labels assigned to states, turning rejecting
states into accepting ones and vice versa.
Exercise 3.1 Decide whether this idea works in all cases.
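
As a concrete point of reference, the two-state automaton of Figure 3.3 can be simulated in a few lines of Python. This is a minimal sketch: the state names `even` and `odd` and the helper `make_acceptor` are assumptions of the sketch, not notation from the text. It also builds the machine obtained by exchanging the two labels, for this particular automaton only, whose table has an entry for every state and symbol; it makes no claim about the general question posed in Exercise 3.1.

```python
def make_acceptor(delta, start, accepting):
    """Build a membership test from a total transition table."""
    def accepts(word):
        state = start
        for symbol in word:
            state = delta[(state, symbol)]
        # The label of the final state is the value of the function.
        return state in accepting
    return accepts


# Two-state automaton of Figure 3.3 (state names are ours).
STATES = {"even", "odd"}
DELTA = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

even_ones = make_acceptor(DELTA, "even", {"even"})
# Same machine with the accepting and rejecting labels exchanged.
flipped = make_acceptor(DELTA, "even", STATES - {"even"})

for w in ["", "0", "1", "0110", "111"]:
    print(repr(w), even_ones(w), flipped(w))
```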

A more complex example of a finite automaton is illustrated in Figure 3.5.
It accepts all strings with an equal number of 0s and 1s such that, in any
prefix of an accepted string, the number of 0s and the number of 1s differ
by at most one. The bottom right-hand state is a trap: once the automaton

[Figure 3.5: A more complex finite automaton.]

has entered this state, it cannot leave it. This particular trap is a rejecting
state; the automaton of Figure 3.4(b) had an accepting trap.
We are now ready to give a formal definition of a finite automaton.
Definition 3.1 A deterministic finite automaton is a five-tuple, $(\Sigma, Q, q_0, F, \delta)$,
where $\Sigma$ is the input alphabet, $Q$ the set of states, $q_0 \in Q$ the start state,
$F \subseteq Q$ the final states, and $\delta : Q \times \Sigma \to Q$ the transition function.
Our choice of the formalism for the transition function actually makes
the automaton deterministic, conforming to the examples seen so far.
Nondeterministic automata can also be defined; we shall look at this
distinction shortly.
Moving from a finite automaton to a description of the language that it
accepts is not always easy, but it is always possible. The reverse direction
is more complex because there are many languages that a finite automaton
cannot recognize. Later we shall see a formal proof of the fact, along with
an exact characterization of those languages that can be accepted by a finite
automaton; for now, let us just look at some simple examples.
Consider first the language of all strings that end with 0. In designing
this automaton, we can think of its having two states: when it starts or
after it has seen a 1, it has made no progress towards acceptance; on the
other hand, after seeing a 0 it is ready to accept. The result is depicted in
Figure 3.6.
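
The five components of Definition 3.1 map naturally onto a small record type. The sketch below is one possible encoding, not a canonical one: the class `DFA`, its field names, and the state names `no` and `yes` are all assumptions made for illustration. It instantiates the two-state automaton just described for strings ending with 0.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: frozenset   # Sigma
    states: frozenset     # Q
    start: str            # q0, a member of Q
    final: frozenset      # F, a subset of Q
    delta: dict           # maps (state, symbol) pairs to a state

    def is_well_formed(self):
        """Check that delta is a total function from Q x Sigma to Q."""
        return (self.start in self.states
                and self.final <= self.states
                and all(self.delta.get((q, a)) in self.states
                        for q in self.states for a in self.alphabet))

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.final


# "no": no progress toward acceptance; "yes": the last symbol read was a 0.
ends_in_0 = DFA(
    alphabet=frozenset("01"),
    states=frozenset({"no", "yes"}),
    start="no",
    final=frozenset({"yes"}),
    delta={("no", "0"): "yes", ("no", "1"): "no",
           ("yes", "0"): "yes", ("yes", "1"): "no"},
)

print(ends_in_0.is_well_formed(), ends_in_0.accepts("0110"))
```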
Consider now the set of all strings that, viewed as natural numbers in
unsigned binary notation, represent numbers divisible by 5. The key here is
to realize that division in binary is a very simple operation with only two
possible results (1 or 0); our automaton will mimic the longhand division
by 5 (101 in binary), using its states to denote the current value of the
remainder. Leading 0s are irrelevant and eliminated in the start state (call it
A); since this state corresponds to a remainder of 0 (i.e., an exact division by
5), it is an accepting state. Then consider the next bit, a 1 by assumption. If
the input stopped at this point, we would have an input value and thus also
a remainder of 1; call the state corresponding to a remainder of 1 state B, a
rejecting state. Now, if the next bit is a 1, the input (and also remainder)

[Figure 3.6: An automaton that accepts all strings ending with a 0.]

[Figure 3.7: An automaton that accepts multiples of 5.]

so far is 11, so we move to a state (call it C) corresponding to a remainder
of 3; if the next bit is a 0, the input (and also remainder) is 10, so we move
to a state (call it D) corresponding to a remainder of 2. From state D, an
input of 0 gives us a current remainder of 100, so we move to a state (call
it E) corresponding to a remainder of 4; an input of 1, on the other hand,
gives us a remainder of 101, which is the same as no remainder at all, so
we move back to state A. Moves from states C and E are handled similarly.
The resulting finite automaton is depicted in Figure 3.7.
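
The moves described above all follow one rule: reading a bit doubles the value seen so far and adds the bit, so a remainder r becomes (2r + bit) mod 5. The Python sketch below uses the state letters from the text with the remainders assigned there (A = 0, B = 1, D = 2, C = 3, E = 4); the function names are introduced here for illustration.

```python
# State names follow the text; each letter stands for a remainder modulo 5.
REMAINDER = {"A": 0, "B": 1, "D": 2, "C": 3, "E": 4}
STATE = {r: s for s, r in REMAINDER.items()}


def step(state, bit):
    """Appending a bit doubles the value read so far and adds the bit."""
    return STATE[(2 * REMAINDER[state] + int(bit)) % 5]


def divisible_by_5(bits):
    """Run the automaton from start state A; A is the only accepting state."""
    state = "A"
    for bit in bits:
        state = step(state, bit)
    return state == "A"


print(step("B", "1"), step("B", "0"), step("D", "0"), step("D", "1"))
print([n for n in range(40) if divisible_by_5(format(n, "b"))])
```

The same rule supplies the moves out of C and E that the text leaves to the reader.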

3.1.3 Determinism and Nondeterminism

In all of the fully worked examples of finite automata given earlier, there
was exactly one transition out of each state for each possible input symbol.
That such must be the case is implied in our formal definition: the transition
function is well defined. However, in our first example of transitions
(Figure 3.2), we looked at an automaton where the transition function
remained undefined for one combination of current state and current input,
that is, where the transition function did not map every element of
its domain. Such transition functions are occasionally useful; when the
automaton reaches a configuration in which no transition is defined, the
standard convention is to assume that the automaton aborts its operation
and rejects its input string. (In particular, a rejecting trap has no defined
transitions at all.) In a more confusing vein, what if, in some state, there
had been two or more different transitions for the same input symbol?
Again, our formal definition precludes this possibility, since $\delta(q_i, a)$ can
have only one value in Q; however, once again, such an extension to our
mechanism often proves useful. The presence of multiple valid transitions
leads to a certain amount of uncertainty as to what the finite automaton
will do and thus, potentially, as to what it will accept. We define a finite
automaton to be deterministic if and only if, for each combination of state
and input symbol, it has at most one transition. A finite automaton that
allows multiple transitions for the same combination of state and input
symbol will be termed nondeterministic.
Nondeterminism is a common occurrence in the worlds of particle
physics and of computers. It is a standard consequence of concurrency:
when multiple systems interact, the timing vagaries at each site create an
inherent unpredictability regarding the interactions among these systems.
While the operating system designer regards such nondeterminism as
both a boon (extra flexibility) and a bane (it cannot be allowed to
lead to different outcomes, a catastrophe known in computer science
as indeterminacy, and so must be suitably controlled), the theoretician
is simply concerned with suitably defining under what circumstances a
nondeterministic machine can be termed to have accepted its input. The
key to understanding the convention adopted by theoreticians regarding
nondeterministic finite automata (and other nondeterministic machines) is
to realize that nondeterminism induces a tree of possible computations for
each input string, rather than the single line of computation observed in a
deterministic machine. The branching of the tree corresponds to the several
possible transitions available to the machine at that stage of computation.
Each of the possible computations eventually terminates (after exactly n
transitions, as observed earlier) at a leaf of the computation tree. A stylized
computation tree is illustrated in Figure 3.8. In some of these computations,
the machine may accept its input; in others, it may reject it, even though it is
the same input. We can easily dispose of computation trees where all leaves
correspond to accepting states: the input can be defined as accepted; we
can equally easily dispose of computation trees where all leaves correspond
to rejecting states: the input can be defined as rejected. What we need to
address is those computation trees where some computation paths lead
to acceptance and others to rejection; the convention adopted by the

[Figure 3.8: A stylized computation tree, showing a branching point and a leaf.]

(evidently optimistic) theory community is that such mixed trees also result
in acceptance of the input. This convention leads us to define a general
finite automaton.
Definition 3.2 A nondeterministic finite automaton is a five-tuple, $(\Sigma, Q, q_0, F, \delta)$,
where $\Sigma$ is the input alphabet, $Q$ the set of states, $q_0 \in Q$ the start
state, $F \subseteq Q$ the final states, and $\delta : Q \times \Sigma \to 2^Q$ the transition function.
Note the change from our definition of a deterministic finite automaton:
the transition function now maps $Q \times \Sigma$ to $2^Q$, the set of all subsets of $Q$,
rather than just into $Q$ itself. This change allows transition functions that
map state/character pairs to zero, one, or more next states. We say that
a finite automaton is deterministic whenever we have $|\delta(q, a)| \le 1$ for all
$q \in Q$ and $a \in \Sigma$.
Using our new definition, we say that a nondeterministic machine
accepts its input whenever there is a sequence of choices in its transitions
that will allow it to do so. We can also think of there being a separate
deterministic machine for each path in the computation tree, in which
case there need be only one deterministic machine that accepts a string for
the nondeterministic machine to accept that string. Finally, we can also
view a nondeterministic machine as a perfect guesser: whenever faced with
a choice of transitions, it always chooses one that will allow it to accept the
input, assuming any such transition is available; if such is not the case, it
chooses any of the transitions, since all will lead to rejection.
Consider the nondeterministic finite automaton of Figure 3.9, which
accepts all strings that contain one of three possible substrings: 000, 111,
or 1100. The computation tree on the input string 01011000 is depicted
in Figure 3.10. (The paths marked with an asterisk denote paths where
the automaton is stuck in a state because it had no transition available.)
There are two accepting paths out of ten, corresponding to the detection
of the substrings 000 and 1100. The nondeterministic finite automaton thus
accepts 01011000 because there is at least one way (here two) for it to do

[Figure 3.9: An example of the use of nondeterminism.]

[Figure 3.10: The computation tree for the automaton of Figure 3.9 on input string 01011000.]

so. For instance, it can decide to stay in state A when reading the first three
symbols, then guess that the next 1 is the start of a substring 1100 or 111
and thus move to state D. In that state, it guesses that the next 1 indicates
the substring 1100 rather than 111 and thus moves to state B rather than E.
From state B, it has no choice left to make and correctly ends in accepting
state F when all of the input has been processed. We can view its behavior
as checking the sequence of guesses (left, left, left, right, left, -, -, -) in the
computation tree. (That the tree nodes have at most two children each is
peculiar to this automaton; in general, a node in the tree can have up to |Q|
children, one for each possible choice of next state.)
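
The computation tree can be enumerated mechanically. The Python sketch below rests on an assumption that should be stated loudly: the transition table `DELTA` is reconstructed from the prose description of Figure 3.9 (states A through F, detection of 000, 111, and 1100), not copied from the figure itself. The function names are likewise introduced here. Stuck paths are reported as `*`, following the convention of Figure 3.10.

```python
# Transitions reconstructed from the prose description of Figure 3.9
# (an assumption of this sketch; the figure itself is not available here).
DELTA = {
    ("A", "0"): {"A", "B"}, ("A", "1"): {"A", "D"},
    ("B", "0"): {"C"},
    ("C", "0"): {"F"},
    ("D", "1"): {"B", "E"},
    ("E", "1"): {"F"},
    ("F", "0"): {"F"}, ("F", "1"): {"F"},
}
START, FINAL = "A", {"F"}


def leaves(state, word):
    """List the leaves of the computation tree: the last state of each
    complete path, or '*' for a path that gets stuck."""
    if not word:
        return [state]
    successors = DELTA.get((state, word[0]), set())
    if not successors:
        return ["*"]
    result = []
    for nxt in sorted(successors):
        result.extend(leaves(nxt, word[1:]))
    return result


def accepts(word):
    """Accept if at least one path ends in a final state."""
    return any(leaf in FINAL for leaf in leaves(START, word))


tree = leaves(START, "01011000")
print(len(tree), tree.count("F"), accepts("01011000"))
```

With this reconstructed table the enumeration yields ten paths for 01011000, two of them accepting, which agrees with the counts given in the text.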
When exploiting nondeterminism, we should consider the idea of
choice. The strength of a nondeterministic finite automaton resides in its
ability to choose with perfect accuracy under the rules of nondeterminism.
For example, consider the set of all strings that end in either 100 or in
001. The deterministic automaton has to consider both types of strings
and so uses states to keep track of the possibilities that arise from either
suffix or various substrings thereof. The nondeterministic automaton can
simply guess which ending the string will have and proceed to verify the
guess; since there are two possible guesses, there are two verification paths.
The nondeterministic automaton just gobbles up symbols until it guesses
that there are only three symbols left, at which point it also guesses which
ending the string will have and proceeds to verify that guess, as shown in
Figure 3.11. Of course, with all these choices, there are many guesses that


[Figure 3.11: Checking guesses with nondeterminism.]

lead to a rejecting state (guess that there are three remaining symbols when
there are more, or fewer, left, or guess the wrong ending), but the string
will be accepted as long as there is one accepting path for it.
However, this accurate guessing must obey the rules of nondeterminism:
the machine cannot simply guess that it should accept the string or guess that
it should reject it, something that would lead to the automaton illustrated
in Figure 3.12. In fact, this automaton accepts $\Sigma^*$, because it is possible for
it to accept any string and thus, in view of the rules of nondeterminism, it
must then do so.

[Figure 3.12: A nondeterministic finite automaton that simply guesses whether to accept or reject.]

3.1.4 Checking vs. Computing

A better way to view nondeterminism is to realize that the nondeterministic
automaton need only verify a simple guess to establish that the string is in
the language, whereas the deterministic automaton must painstakingly process the string, keeping information about the various pieces that contribute
to membership. This guessing model makes it clear that nondeterminism
allows a machine to make efficient decisions whenever a series of guesses
leads rapidly to a conclusion. As we shall see later (when talking about
complexity), this aspect is very important. Consider the simple example of
deciding whether a string has a specific character occurring 10 positions
from the end. A nondeterministic automaton can simply guess which is
the tenth position from the end of the string and check that (i) the desired
character occurs there and (ii) there are indeed exactly 9 more characters
left in the string. In contrast, a deterministic automaton must keep track in
its finite-state control of a window of 9 consecutive input characters,
a requirement that leads to a very large number of states and a complex
transition function. The simple guess of a position within the input string
changes the scope of the task drastically: verifying the guess is quite easy,
whereas a direct computation of the answer is quite tedious.
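
The contrast can be made concrete with a small Python sketch, under several assumptions of ours: the alphabet is {0, 1}, the specific character is 1, and `K` names the position from the end. The first function simulates the guessing automaton by tracking the set of states it could be in (K + 1 states in all); the second is a deterministic check that remembers the last K symbols explicitly, that is, the candidate symbol together with the K - 1 symbols that follow it.

```python
K = 10  # position from the end that must hold the symbol "1"


def nfa_accepts(word, k=K):
    """Simulate the guessing automaton. State 0 loops on every symbol;
    states 1..k count the symbols read since the guess was made."""
    current = {0}
    for symbol in word:
        nxt = set()
        for s in current:
            if s == 0:
                nxt.add(0)          # keep waiting
                if symbol == "1":
                    nxt.add(1)      # guess: this symbol is k-th from the end
            elif s < k:
                nxt.add(s + 1)      # verify that k - 1 more symbols follow
            # s == k has no transition: one symbol too many, the path dies
        current = nxt
    return k in current


def window_accepts(word, k=K):
    """Deterministic version: remember the last k symbols explicitly."""
    window = ""
    for symbol in word:
        window = (window + symbol)[-k:]
    return len(window) == k and window[0] == "1"


# Number of guessing states versus number of distinct full windows.
print(K + 1, 2 ** K)
print(nfa_accepts("1" + "0" * 9), window_accepts("1" + "0" * 9))
```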
In other words, nondeterminism is about guessing and checking: the
machine guesses both the answer and the path that will lead to it, then
follows that path, verifying its guess in the process. In contrast, determinism
is just straightforward computing: no shortcut is available, so the machine
simply crunches through whatever has to be done to derive an answer.
Hence the question (which we tackle for finite automata in the next section)
of whether or not nondeterministic machines are more powerful than
deterministic ones is really a question of whether verifying answers is easier
than computing them. In the context of mathematics, the (correct) guess
is the proof itself! We thus gain a new perspective on Hilbert's program:
we can indeed write a proof-checking machine, but any such machine will
efficiently verify certain types of proofs and not others. Many problems
have easily verifiable proofs (for instance, it is easy to check a proof that
a Boolean formula is satisfiable if the proof is a purported satisfying truth
assignment), but many others do not appear to have any concise or easily
checkable proof. Consider for instance the question of whether or not
White, at chess, has a forced win (a question for which we do not know the
answer). What would it take for someone to convince you that the answer
is "yes"? Basically, it would appear that verifying the answer, in this case,
is just as hard as deriving it.
Thus, depending on the context (such as the type of machines involved
or the resource bounds specified), verifying may be easier than or just as
hard as solving; often, we do not know which is the correct statement. The
most famous (and arguably the most important) open question in computer
science, "Is P equal to NP?" (about which we shall have a great deal to
say in Chapters 6 and beyond), is one such question. We shall soon see
that nondeterminism does not add power to finite automata: whatever a
nondeterministic automaton can do can also be done by a (generally much
larger) deterministic finite automaton; the attraction of nondeterministic
finite automata resides in their relative simplicity.

3.2 Properties of Finite Automata


3.2.1 Equivalence of Finite Automata

We see from their definition that nondeterministic finite automata include
deterministic ones as a special case: the case where the number of transitions
defined for each pair of current state and current input symbol never
exceeds one. Thus any language that can be accepted by a deterministic
finite automaton can be accepted by a nondeterministic one: the same machine.
What about the converse? Are nondeterministic finite automata more
powerful than deterministic ones? Clearly there are problems for which a
nondeterministic automaton will require fewer states than a deterministic
one, but that is a question of resources, not an absolute question of
potential.
We settle the question in the negative: nondeterministic finite automata
are no more powerful than deterministic ones. Our proof is a simulation:
given an arbitrary nondeterministic finite automaton, we construct a
deterministic one that mimics the behavior of the nondeterministic machine.
In particular, the deterministic machine uses its state to keep track of all of
the possible states in which the nondeterministic machine could find itself
after reading the same string.
Theorem 3.1 For every nondeterministic finite automaton, there exists an
equivalent deterministic finite automaton (i.e., one that accepts the same
language).
Proof. Let the nondeterministic finite automaton be given by the five-tuple
$(\Sigma, Q, q_0, F, \delta)$. We construct an equivalent deterministic automaton
$(\Sigma', Q', q_0', F', \delta')$ as follows:

$\Sigma' = \Sigma$
$Q' = 2^Q$
$F' = \{s \in Q' \mid s \cap F \neq \emptyset\}$
$q_0' = \{q_0\}$

The key idea is to define one state of the deterministic machine for each
possible combination of states of the nondeterministic one, hence the $2^{|Q|}$
possible states of the equivalent deterministic machine. In that way, there
is a unique state for the deterministic machine, no matter how many
computation paths exist at the same step for the nondeterministic machine. In
order to define $\delta'$, we recall that the purpose of the simulation is to keep
track, in the state of the deterministic machine, of all computation paths of
the nondeterministic one. Let the machines be at some step in their
computation where the next input symbol is $a$. If the nondeterministic machine can
be in any of states $q_{i_1}, q_{i_2}, \ldots, q_{i_k}$ at that step, so that the corresponding
deterministic machine is then in state $\{q_{i_1}, q_{i_2}, \ldots, q_{i_k}\}$, then it can move to
any of the states contained in the sets $\delta(q_{i_1}, a), \delta(q_{i_2}, a), \ldots, \delta(q_{i_k}, a)$, so
that the corresponding deterministic machine moves to state

$$\delta'(\{q_{i_1}, q_{i_2}, \ldots, q_{i_k}\}, a) = \bigcup_{j=1}^{k} \delta(q_{i_j}, a)$$

Since the nondeterministic machine accepts if any computation path leads
to acceptance, the deterministic machine must accept if it ends in a state
that includes any of the final states of the nondeterministic machine,
hence our definition of $F'$. It is clear that our constructed deterministic
finite automaton accepts exactly the same strings as those accepted by the
given nondeterministic finite automaton.
Q.E.D.
Example 3.1 Consider the nondeterministic finite automaton given by
$\Sigma = \{0, 1\}$, $Q = \{a, b\}$, $F = \{a\}$, $q_0 = a$,

$\delta(a, 0) = \{a, b\}$    $\delta(a, 1) = \{b\}$
$\delta(b, 0) = \{b\}$       $\delta(b, 1) = \{a\}$

and illustrated in Figure 3.13(a). The corresponding deterministic finite
automaton is given by

[Figure 3.13: A nondeterministic automaton and an equivalent deterministic finite automaton. (a) The nondeterministic finite automaton; (b) the equivalent deterministic finite automaton.]

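$\Sigma' = \{0, 1\}$, $Q' = \{\emptyset, \{a\}, \{b\}, \{a, b\}\}$, $F' = \{\{a\}, \{a, b\}\}$, $q_0' = \{a\}$,

$\delta'(\emptyset, 0) = \emptyset$            $\delta'(\emptyset, 1) = \emptyset$
$\delta'(\{a\}, 0) = \{a, b\}$       $\delta'(\{a\}, 1) = \{b\}$
$\delta'(\{b\}, 0) = \{b\}$          $\delta'(\{b\}, 1) = \{a\}$
$\delta'(\{a, b\}, 0) = \{a, b\}$    $\delta'(\{a, b\}, 1) = \{a, b\}$

and illustrated in Figure 3.13(b) (note that state $\emptyset$ is unreachable).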
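
The construction in the proof of Theorem 3.1 is mechanical enough to run. The following Python sketch builds every subset of Q and takes the union of the nondeterministic moves, using the data of Example 3.1; the function names `powerset`, `determinize`, and `show` are introduced here for illustration.

```python
from itertools import combinations

SIGMA = ["0", "1"]
Q = ["a", "b"]
DELTA = {("a", "0"): {"a", "b"}, ("a", "1"): {"b"},
         ("b", "0"): {"b"},      ("b", "1"): {"a"}}
FINAL, START = {"a"}, "a"


def powerset(items):
    """All subsets of items, as frozensets."""
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def determinize(sigma, states, delta, final, start):
    """Subset construction over all of 2^Q, as in the proof."""
    new_states = powerset(states)
    new_delta = {
        # Union of the moves available from each member of the subset.
        (s, a): frozenset().union(*(delta.get((q, a), set()) for q in s))
        for s in new_states for a in sigma
    }
    new_final = [s for s in new_states if s & final]
    return new_states, new_delta, new_final, frozenset({start})


def show(s):
    return "{" + ",".join(sorted(s)) + "}"


states, table, final, start = determinize(SIGMA, Q, DELTA, FINAL, START)
for s in states:
    print(show(s), [show(table[(s, a)]) for a in SIGMA])
```

The printed rows can be compared entry by entry with the table of Example 3.1.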

Thus the conversion of a nondeterministic automaton to a deterministic
one creates a machine, the states of which are all the subsets of the
set of states of the nondeterministic automaton. The conversion takes a
nondeterministic automaton with $n$ states and creates a deterministic
automaton with $2^n$ states, an exponential increase. However, as we saw briefly,
many of these states may be useless, because they are unreachable from the
start state; in particular, the empty state is unreachable whenever every
state of the nondeterministic automaton has at least one transition on each
input symbol. In general,
the conversion may create any number of unreachable states, as shown in
Figure 3.14, where five of the eight states are unreachable. When generating
a deterministic automaton from a given nondeterministic one, we can
avoid generating unreachable states by using an iterative approach based
on reachability: begin with the initial state of the nondeterministic
automaton and proceed outward to those states reachable by the nondeterministic
automaton. This process will generate only useful states (states reachable
from the start state) and so may be considerably more efficient than the
brute-force generation of all subsets.
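
One way to realize this iterative approach is a breadth-first search over subsets, starting from the set containing only the start state. The sketch below is a minimal version under that assumption; the function name `reachable_subsets` is ours, and the data are again those of Example 3.1.

```python
from collections import deque

DELTA = {("a", "0"): {"a", "b"}, ("a", "1"): {"b"},
         ("b", "0"): {"b"},      ("b", "1"): {"a"}}


def reachable_subsets(sigma, delta, start):
    """Subset construction restricted to states reachable from {start}."""
    initial = frozenset({start})
    table = {}
    seen = {initial}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        for a in sigma:
            target = frozenset().union(
                *(delta.get((q, a), set()) for q in current))
            table[(current, a)] = target
            if target not in seen:   # first visit: schedule for expansion
                seen.add(target)
                queue.append(target)
    return seen, table


seen, table = reachable_subsets(["0", "1"], DELTA, "a")
print(sorted("".join(sorted(s)) for s in seen))
```

Only subsets that are actually encountered get a row in the table, so the work is proportional to the number of reachable states rather than to $2^n$.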

[Figure 3.14: A conversion that creates many unreachable states.]

3.2.2 $\varepsilon$ Transitions
An $\varepsilon$ transition is a transition that does not use any input, a spontaneous
transition: the automaton simply decides to change states without reading any symbol.
Such a transition makes sense only in a nondeterministic automaton: in
a deterministic automaton, an $\varepsilon$ transition from state A to state B would
have to be the single transition out of A (any other transition would induce
a nondeterministic choice), so that we could merge state A and state B,
simply redirecting all transitions into A to go to B, and thus eliminating
the $\varepsilon$ transition. Thus an $\varepsilon$ transition is essentially nondeterministic.
Example 3.2 Given two finite automata, $M_1$ and $M_2$, design a new finite
automaton that accepts all strings accepted by either machine. The new
machine guesses which machine will accept the current string, then sends
the whole string to that machine through an $\varepsilon$ transition.
The obvious question at this point is: Do $\varepsilon$ transitions add power to
finite automata? As in the case of nondeterminism, our answer will be
"no."
Assume that we are given a finite automaton with $\varepsilon$ transitions; let its
transition function be $\delta$. Let us define $\delta'(q, a)$ to be the set of all states that
can be reached by
1. zero or more $\varepsilon$ transitions; followed by
2. one transition on $a$; followed by
3. zero or more $\varepsilon$ transitions.
This is the set of all states reachable from state $q$ in our machine while
reading the single input symbol $a$; we call $\delta'$ the $\varepsilon$-closure of $\delta$.
In Figure 3.15, for instance, the states reachable from state $q$ through
the three steps are:
1. {q, 1, 2, 3}
2. {4, 6, 8}
3. {4, 5, 6, 7, 8, 9, 10}
so that we get $\delta'(q, a) = \{4, 5, 6, 7, 8, 9, 10\}$.
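
The three steps translate directly into code. The Python sketch below uses a small made-up automaton, not the one of Figure 3.15, whose arcs are not recoverable here; the empty string `EPS` is this sketch's own convention for labelling $\varepsilon$ transitions, and the function names are introduced for illustration.

```python
EPS = ""  # label used in this sketch for epsilon transitions

# A small hypothetical automaton (not the one of Figure 3.15).
DELTA = {
    ("q", EPS): {"r"},
    ("r", "a"): {"s"},
    ("s", EPS): {"t"},
    ("t", EPS): {"u"},
    ("q", "b"): {"q"},
}


def eps_closure(states, delta):
    """All states reachable from `states` by zero or more epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for nxt in delta.get((q, EPS), set()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def delta_prime(q, a, delta):
    before = eps_closure({q}, delta)       # step 1: epsilon moves
    moved = set()
    for p in before:                       # step 2: one move on a
        moved |= delta.get((p, a), set())
    return eps_closure(moved, delta)       # step 3: epsilon moves


print(sorted(delta_prime("q", "a", DELTA)))
print(sorted(delta_prime("q", "b", DELTA)))
```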
Theorem 3.2 For every finite automaton with $\varepsilon$ transitions, there exists an
equivalent finite automaton without $\varepsilon$ transitions.
We do not specify whether the finite automaton is deterministic or nondeterministic, since we have already proved that the two have equivalent
power.

[Figure 3.15: Moving through $\varepsilon$ transitions.]

Proof. Assume that we have been given a finite automaton with $\varepsilon$
transitions and with transition function $\delta$. We construct $\delta'$ as defined earlier.
Our new automaton has the same set of states, the same alphabet, the same
starting state, and (with one possible exception) the same set of accepting
states, but its transition function is now $\delta'$ rather than $\delta$ and so does not
include any $\varepsilon$ moves. Finally, if the original automaton had any (chain
of) $\varepsilon$ transitions from its start state to an accepting state, we make that
start state in our new automaton an accepting state. We claim that the
two machines recognize the same language; more specifically, we claim that
the set of states reachable under some input string $x \neq \varepsilon$ in the original
machine is the same as the set of states reachable under the same input
string in our $\varepsilon$-free machine and that the two machines both accept or both
reject the empty string. The latter is ensured by our correction for the start
state. For the former, our proof proceeds by induction on the length of
strings. The two machines can reach exactly the same states from any given
state (in particular from the start state) on an input string of length 1, by
construction of $\delta'$. Assume that, after processing $i$ input characters, the two
machines have the same reachable set of states. From each of the states
that could have been reached after $i$ input characters, the two machines can
reach the same set of states by reading one more character, by construction
of $\delta'$. Thus the set of all states reachable after reading $i + 1$ characters is the
union of identical sets over an identical index and thus the two machines
can reach the same set of states after $i + 1$ steps. Hence one machine can
accept whatever string the other can.
Q.E.D.
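
The construction used in the proof can be checked on a small case. The Python sketch below is an illustration under our own assumptions: the two-state automaton (a loop on a, an $\varepsilon$ move, then a loop on b) is hypothetical, `EPS` is this sketch's label for $\varepsilon$ transitions, and all function names are introduced here. It builds the $\varepsilon$-free machine, applies the correction for the start state, and compares the two machines on a few strings.

```python
EPS = ""  # marker for epsilon transitions in this sketch


def closure(states, delta):
    """States reachable by zero or more epsilon moves."""
    out, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in delta.get((q, EPS), set()):
            if r not in out:
                out.add(r)
                stack.append(r)
    return out


def accepts_with_eps(word, delta, start, final):
    """Reference simulation of the original automaton."""
    current = closure({start}, delta)
    for a in word:
        moved = set()
        for q in current:
            moved |= delta.get((q, a), set())
        current = closure(moved, delta)
    return bool(current & final)


def remove_eps(states, sigma, delta, start, final):
    """Build the transition function and final states of the proof."""
    new_delta = {}
    for q in states:
        for a in sigma:
            moved = set()
            for p in closure({q}, delta):
                moved |= delta.get((p, a), set())
            new_delta[(q, a)] = closure(moved, delta)
    new_final = set(final)
    if closure({start}, delta) & final:
        new_final.add(start)  # correction for the empty string
    return new_delta, new_final


def accepts_plain(word, delta, start, final):
    """Simulation of an automaton without epsilon transitions."""
    current = {start}
    for a in word:
        nxt = set()
        for q in current:
            nxt |= delta.get((q, a), set())
        current = nxt
    return bool(current & final)


# Hypothetical example: state 0 loops on a, moves silently to 1, which loops on b.
D = {(0, "a"): {0}, (0, EPS): {1}, (1, "b"): {1}}
D2, F2 = remove_eps({0, 1}, "ab", D, 0, {1})
for w in ["", "aab", "ba", "abb"]:
    print(repr(w), accepts_with_eps(w, D, 0, {1}), accepts_plain(w, D2, 0, F2))
```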
Thus a finite automaton is well defined in terms of its power to recognize
languages; we do not need to be more specific about its characteristics,
since all versions (deterministic or not, with or without $\varepsilon$ transitions) have
equivalent power. We call the set of all languages recognized by finite
automata the regular languages.
Not every language is regular: some languages cannot be accepted by
any finite automaton. These include all languages that can be accepted only
through some unbounded count, such as $\{1, 101, 101001, 1010010001, \ldots\}$
or $\{\varepsilon, 01, 0011, 000111, \ldots\}$. A finite automaton has no dynamic memory:
its only memory is its set of states, through which it can count only to
a fixed constant, so that counting to arbitrary values, as is required in the
two languages just given, is impossible. We shall prove this statement and
obtain an exact characterization later.

3.3 Regular Expressions


3.3.1 Definitions and Examples
Regular expressions were designed by mathematicians to denote regular
languages with a mathematical tool, a tool built from a set of primitives
(generators in mathematical parlance) and operations.
For instance, arithmetic (on nonnegative integers) is a language built
from one generator (zero, the one fundamental number), one basic
operation (successor, which generates the next number; it is simply an
incrementation), and optional operations (such as addition, multiplication,
etc.), each defined inductively (recursively) from existing operations. Compare
the ease with which we can prove statements about nonnegative integers
with the incredible lengths to which we have to go to prove even a small
piece of code to be correct. The mechanical models (automata, programs,
etc.) all suffer from their basic premise, namely the notion of state. States
make formal proofs extremely cumbersome, mostly because they offer no
natural mechanism for induction.
Another problem of finite automata is their nonlinear format: they are
best represented graphically (not a convenient data entry mechanism), since
they otherwise require elaborate conventions for encoding the transition
table. No one would long tolerate having to define finite automata for
pattern-matching tasks in searching and editing text. Regular expressions,
on the other hand, are simple strings much like arithmetic expressions,
with a simple and familiar syntax; they are well suited for use by humans
in describing patterns for string processing. Indeed, they form the basis for
the pattern-matching commands of editors and text processors.


Definition 3.3 A regular expression on some alphabet Σ is defined inductively as follows:

• ∅, ε, and a (for any a ∈ Σ) are regular expressions.
• If P and Q are regular expressions, P + Q is a regular expression (union).
• If P and Q are regular expressions, PQ is a regular expression (concatenation).
• If P is a regular expression, P* is a regular expression (Kleene closure).
• Nothing else is a regular expression.
The three operations are chosen to produce larger sets from smaller
ones, which is why we picked union but not intersection. For the sake
of avoiding large numbers of parentheses, we let Kleene closure have
highest precedence, concatenation intermediate precedence, and union
lowest precedence.
This definition sets up an abstract universe of expressions, much like
arithmetic expressions. Examples of regular expressions on the alphabet
{0, 1} include ε, 0, 1, ε + 1, 1*, (0 + 1)*, 10*(ε + 1)1*, etc. However, these
expressions are not as yet associated with languages: we have defined the
syntax of the regular expressions but not their semantics. We now rectify
this omission:

• ∅ is a regular expression denoting the empty set.
• ε is a regular expression denoting the set {ε}.
• a ∈ Σ is a regular expression denoting the set {a}.
• If P and Q are regular expressions, PQ is a regular expression
  denoting the set {xy | x ∈ P and y ∈ Q}.
• If P and Q are regular expressions, P + Q is a regular expression
  denoting the set {x | x ∈ P or x ∈ Q}.
• If P is a regular expression, P* is a regular expression denoting the
  set {ε} ∪ {xw | x ∈ P and w ∈ P*}.

This last definition is recursive: we define P* in terms of itself. Put in English,
the Kleene closure of a set S is the infinite union of the sets obtained by
concatenating zero or more copies of S. For instance, the Kleene closure
of {1} is simply the set of all strings composed of zero or more 1s, i.e.,
1* = {ε, 1, 11, 111, 1111, ...}; the Kleene closure of the set {0, 11} is the
set {ε, 0, 11, 00, 011, 110, 1111, ...}; and the Kleene closure of the set Σ
(the alphabet) is Σ* (yes, that is the same notation!), the set of all possible
strings over the alphabet. For convenience, we shall define P⁺ = PP*; that
is, P⁺ differs from P* in that it must contain at least one copy of an element
of P.
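Since a Kleene closure is usually infinite, a program can list only a finite part of it. The following sketch follows the recursive definition directly, prepending one element of the base set at a time and keeping only strings up to a chosen length. The function name kleene_upto and the length bound are assumptions made for this illustration.

```python
def kleene_upto(base, max_len):
    """Strings of length <= max_len in the Kleene closure of the finite set `base`."""
    closure = {""}      # zero copies of an element: the empty string
    frontier = {""}     # strings found in the previous round
    while frontier:
        fresh = set()
        for w in frontier:          # w is already known to be in the closure
            for x in base:          # x is one more copy of an element of base
                s = x + w
                if len(s) <= max_len and s not in closure:
                    fresh.add(s)
        closure |= fresh
        frontier = fresh
    return closure


# The two closures discussed in the text, cut off at a small length.
ones = kleene_upto({"1"}, 4)
mixed = kleene_upto({"0", "11"}, 3)
```

The loop stops because every round must produce a new string of bounded length, and there are only finitely many of those.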

Let us go through some further examples of regular expressions. Assume
the alphabet Σ = {0, 1}; then the following are regular expressions over Σ:

• ∅, representing the empty set
• 0, representing the set {0}
• 1, representing the set {1}
• 11, representing the set {11}
• 0 + 1, representing the set {0, 1}
• (0 + 1)1, representing the set {01, 11}
• (0 + 1)1*, representing the infinite set {1, 11, 111, 1111, ..., 0, 01, 011,
  0111, ...}
• (0 + 1)* = ε + (0 + 1) + (0 + 1)(0 + 1) + ... = Σ*
• (0 + 1)⁺ = (0 + 1)(0 + 1)* = Σ⁺ = Σ* - {ε}

The same set can be denoted by a variety of regular expressions; indeed,
when given a complex regular expression, it often pays to simplify it before
attempting to understand the language it defines. Consider, for instance, the
regular expression ((0 + 1)10*(0 + 1*))*. The subexpression 10*(0 + 1*) can
be expanded to 10*0 + 10*1*, which, using the + notation, can be rewritten
as 10⁺ + 10*1*. We see that the second term includes all strings denoted
by the first term, so that the first term can be dropped. (In set union, if A
contains B, then we have A ∪ B = A.) Thus our expression can be written
in the simpler form ((0 + 1)10*1*)* and means in English: zero or more
repetitions of strings chosen from the set of strings made up of a 0 or a 1
followed by a 1 followed by zero or more 0s followed by zero or more 1s.
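A simplification of this kind can be sanity-checked by brute force. The sketch below uses Python's re module, which writes union as | rather than +, and compares the two expressions on every string over {0, 1} up to a chosen length. The helper name same_language_upto and the length bound are assumptions; agreement on short strings is evidence, not a proof.

```python
import re
from itertools import product

# The expression from the text and its simplified form, in Python syntax.
original = re.compile(r"((0|1)10*(0|1*))*")
simplified = re.compile(r"((0|1)10*1*)*")


def same_language_upto(p, q, alphabet, max_len):
    """True if p and q accept exactly the same strings of length <= max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if (p.fullmatch(w) is None) != (q.fullmatch(w) is None):
                return False
    return True


agree = same_language_upto(original, simplified, "01", 10)
```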

3.3.2 Regular Expressions and Finite Automata


Regular expressions, being a mathematical tool (as opposed to a mechanical
tool like finite automata), lend themselves to formal manipulations of
the type used in proofs and so provide an attractive alternative to finite
automata when reasoning about regular languages. But we must first prove
that regular expressions and finite automata are equivalent, i.e., that they
denote the same set of languages.
Our proof consists of showing that (i) for every regular expression,
there is a (nondeterministic) finite automaton with ε transitions and (ii) for
every deterministic finite automaton, there is a regular expression. We have
previously seen how to construct a deterministic finite automaton from a
nondeterministic one and how to remove ε transitions. Hence, once the
proof has been made, it will be possible to go from any form of finite
automaton to a regular expression and vice versa. We use a deterministic
finite automaton in part (ii) because it is an easier machine to simulate with
regular expressions; conversely, we use nondeterministic finite automata
with ε transitions for part (i) because they are a more expressive (though
not more powerful) model in which to translate regular expressions.
Theorem 3.3 For every regular expression there is an equivalent finite
automaton.
Proof. The proof hinges on the fact that regular expressions are defined
recursively, so that, once the basic steps are shown for constructing finite
automata for the primitive elements of regular expressions, finite automata
for regular expressions of arbitrary complexity can be constructed by
showing how to combine component finite automata to simulate the basic
operations. For convenience, we shall construct finite automata with a
unique accepting state. (Any nondeterministic finite automaton with ε
moves can easily be transformed into one with a unique accepting state
by adding such a state, setting up an ε transition to this new state from
every original accepting state, and then turning all original accepting states
into rejecting ones.)
For the regular expression ∅ denoting the empty set, the corresponding
finite automaton is

[Diagram not reproduced.]

For the regular expression ε denoting the set {ε}, the corresponding finite
automaton is

[Diagram not reproduced.]

For the regular expression a denoting the set {a}, the corresponding finite
automaton is

[Diagram not reproduced; its one surviving label is the transition symbol a.]

If P and Q are regular expressions with corresponding finite automata $M_P$
and $M_Q$, then we can construct a finite automaton denoting P + Q in the
following manner:

[Diagram not reproduced; it combines the submachines $M_P$ and $M_Q$.]

The ε transitions at the end are needed to maintain a unique accepting state.

If P and Q are regular expressions with corresponding finite automata
$M_P$ and $M_Q$, then we can construct a finite automaton denoting PQ in the
following manner:

[Diagram not reproduced; it combines the submachines $M_P$ and $M_Q$.]

Finally, if P is a regular expression with corresponding finite automaton
$M_P$, then we can construct a finite automaton denoting P* in the following
manner:

[Diagram not reproduced; it is built around the submachine $M_P$.]

Again, the extra ε transitions are here to maintain a unique accepting state.
It is clear that each finite automaton described above accepts exactly the
set of strings described by the corresponding regular expression (assuming
inductively that the submachines used in the construction accept exactly
the set of strings described by their corresponding regular expressions).
Since, for each constructor of regular expressions, we have a corresponding
constructor of finite automata, the induction step is proved and our proof
is complete.
Q.E.D.
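The inductive construction translates almost line for line into a program. The sketch below is one possible wiring, in the style usually attributed to Thompson, and may differ in detail from the diagrams of the proof. Its assumptions: a regular expression is given as a nested tuple such as ("union", P, Q), the class name Builder and the helper names are hypothetical, and the label None stands for an ε transition. Every fragment has one start state and one accepting state, as in the proof.

```python
from itertools import count


class Builder:
    def __init__(self):
        self.fresh = count()   # supplies new state numbers
        self.edges = {}        # (state, symbol or None) -> set of states

    def edge(self, p, label, q):
        self.edges.setdefault((p, label), set()).add(q)

    def build(self, rx):
        """Return (start, accept) of a fragment for the expression tree rx."""
        kind = rx[0]
        s, f = next(self.fresh), next(self.fresh)
        if kind == "empty":
            pass                                   # no path from s to f
        elif kind == "eps":
            self.edge(s, None, f)
        elif kind == "sym":
            self.edge(s, rx[1], f)
        elif kind in ("union", "concat"):
            s1, f1 = self.build(rx[1])
            s2, f2 = self.build(rx[2])
            if kind == "union":                    # the two submachines in parallel
                links = ((s, s1), (s, s2), (f1, f), (f2, f))
            else:                                  # the two submachines in series
                links = ((s, s1), (f1, s2), (f2, f))
            for a, b in links:
                self.edge(a, None, b)
        elif kind == "star":
            s1, f1 = self.build(rx[1])
            # enter, leave, skip entirely, or go around again
            for a, b in ((s, s1), (f1, f), (s, f), (f1, s1)):
                self.edge(a, None, b)
        else:
            raise ValueError(kind)
        return s, f


def closure(edges, states):
    """States reachable from `states` through epsilon transitions only."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in edges.get((q, None), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def accepts(edges, start, accept, word):
    current = closure(edges, {start})
    for a in word:
        moved = set()
        for q in current:
            moved |= edges.get((q, a), set())
        current = closure(edges, moved)
    return accept in current


# Hypothetical example: the expression (0 + 1)*1.
b = Builder()
tree = ("concat", ("star", ("union", ("sym", "0"), ("sym", "1"))), ("sym", "1"))
start, accept = b.build(tree)
```

As the text warns, the resulting machine is far larger than a hand-built one; the point is only that each constructor of regular expressions has a matching constructor of automata.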
We have proved that for every regular expression, there exists an equivalent
nondeterministic finite automaton with ε transitions. In the proof, we
chose the type of finite automaton with which it is easiest to proceed:
the nondeterministic finite automaton. The proof was by constructive
induction. The finite automata for the basic pieces of regular expressions (∅,
ε, and individual symbols) were used as the basis of the proof. By converting
the legal operations that can be performed on these basic pieces into finite
automata, we showed that these pieces can be inductively built into larger
and larger finite automata that correspond to the larger and larger pieces of
the regular expression as it is built up. Our construction made no attempt
to be efficient: it typically produces cumbersome and redundant machines.
For an efficient conversion of regular expressions to finite automata, it
is generally better to understand what the expression is conveying, and
then design an ad hoc finite automaton that accomplishes the same thing.
However, the mechanical construction used in the proof was needed to
prove that any regular expression can be converted to a finite automaton.


3.3.3 Regular Expressions from Deterministic Finite Automata

In order to show the equivalence of finite automata to regular expressions,
it is necessary to show both that there is a finite automaton for every
regular expression and that there is a regular expression for every finite
automaton. The first part has just been proved. We shall now demonstrate
the second part: given a finite automaton, we can always construct a
regular expression that denotes the same language. As before, we are free
to choose the type of automaton that is easiest to work with, since all finite
automata are equivalent. In this case the most restricted finite automaton,
the deterministic finite automaton, best serves our purpose. Our proof is
again an inductive, mechanical construction, which generally produces an
unnecessarily cumbersome, though infallibly correct, regular expression.
In finding an approach to this proof, we need a general way to talk about
and to build up paths, with the aim of describing all accepting paths through
the automaton with a regular expression. However, due to the presence of
loops, paths can be arbitrarily large; thus most machines have an infinite
number of accepting paths. Inducting on the length or number of paths,
therefore, is not feasible. The number of states in the machine, however, is a
constant; no matter how long a path is, it cannot pass through more distinct
states than are contained in the machine. Therefore we should be able to
induct on some ordering related to the number of distinct states present in a
path. The length of the path is unrelated to the number of distinct states seen
on the path and so remains (correctly) unaffected by the inductive ordering.
For a deterministic finite automaton with n states, which are numbered
from 1 to n, consider the paths from node (state) i to node j. In building up
an expression for these paths, we proceed inductively on the index of the
highest-numbered intermediate state used in getting from i to j. Define $R^k_{ij}$
as the set of all paths from state i to state j that do not pass through any
intermediate state numbered higher than k. We will develop the capability
to talk about the universe of all paths through the machine by inducting
on k from 0 to n (the number of states in the machine), for all pairs of nodes
i and j in the machine.
On these paths, the intermediate states (those states numbered no higher
than k through which the paths can pass) can be used repeatedly; in
contrast, states i and j (unless they are also numbered no higher than k)
can be only left (i) or entered (j). Put another way, passing through a
node means both entering and leaving the node; simply entering or leaving
the node, as happens with nodes i and j, does not matter in figuring k.
This approach, due to Kleene, is in effect a dynamic programming
technique, identical in structure to Floyd's algorithm for generating all
shortest paths in a graph. The construction is entirely artificial and meant
only to yield an ordering for induction. In particular, the specific ordering
of the states (which state is labeled 1, which is labeled 2, and so forth) is
irrelevant: for each possible labeling, the construction proceeds in the
same way.

The Base Case


The base case for the proof is the set of paths described by $R^0_{ij}$ for all pairs
of nodes i and j in the deterministic finite automaton. For a specific pair
of nodes i and j, these are the paths that go directly from node i to node j
without passing through any intermediate states. These paths are described
by the following regular expressions:

• ε if we have i = j (ε is the path of length 0); and/or
• a if we have δ(q_i, a) = q_j (including the case i = j with a self-loop).

If neither case applies, the set of paths is empty and the expression is ∅.
Consider for example the deterministic finite automaton of Figure 3.16.
Some of the base cases for a few pairs of nodes are given in Figure 3.17.

The Inductive Step


We now devise an inductive step and then proceed to build up regular
expressions inductively from the base cases.
Figure 3.16 A simple deterministic finite automaton. [Diagram not reproduced.]

Path Sets                                Regular Expression
$R^0_{11} = \{\varepsilon\}$             ε
$R^0_{12} = \{0\}$                       0
$R^0_{13} = \{1\}$                       1
$R^0_{21} = \{\,\}$                      ∅
$R^0_{22} = \{\varepsilon, 1\}$          ε + 1
...                                      ...
$R^0_{33} = \{\varepsilon, 1\}$          ε + 1

Figure 3.17 Some base cases in constructing a regular expression for the
automaton of Figure 3.16.

The inductive step must define $R^k_{ij}$ in terms of lower values of k (in
terms of k - 1, for instance). In other words, we want to be able to talk
about how to get from i to j without going through states higher than k
in terms of what is already known about how to get from i to j without
going through states higher than k - 1. The set $R^k_{ij}$ can be thought of as the
union of two sets: paths that do pass through state k (but no higher) and
paths that do not pass through state k (or any other state higher than k).
The second set can easily be recursively described by $R^{k-1}_{ij}$. The first set
presents a bit of a problem because we must talk about paths that pass
through state k without passing through any state higher than k - 1, even
though k is higher than k - 1. We can circumvent this difficulty by breaking
any path through state k every time it reaches state k, effectively splitting
the set of paths from i to j through k into three separate components, none
of which passes through any state higher than k - 1. These components
are:
• $R^{k-1}_{ik}$, the paths that go from i to k without passing through a state
  higher than k - 1 (remember that entering the state at the end of the
  path does not count as passing through the state);
• $R^{k-1}_{kk}$, one iteration of any loop from k to k, without passing through
  a state higher than k - 1 (the paths exit k at the beginning and enter
  k at the end, but never pass through k); and
• $R^{k-1}_{kj}$, the paths that go from state k to state j without passing through
  a state higher than k - 1.
The expression $R^{k-1}_{kk}$ describes one iteration of a loop, but this loop could
occur any number of times, including none, in any of the paths in $R^k_{ij}$.
The expression corresponding to any number of iterations of this loop
therefore must be $(R^{k-1}_{kk})^*$. We now have all the pieces we need to build up
the inductive step from k - 1 to k:

$$R^k_{ij} = R^{k-1}_{ij} + R^{k-1}_{ik} \, (R^{k-1}_{kk})^* \, R^{k-1}_{kj}$$

Figure 3.18 illustrates the second term, $R^{k-1}_{ik} (R^{k-1}_{kk})^* R^{k-1}_{kj}$.
With this inductive step, we can proceed to build all possible paths in
the machine (i.e., all the paths between every pair of nodes i and j for each
k from 1 to n) from the expressions for the base cases. Since the $R^k$s are built
from the regular expressions for the various $R^{k-1}$s using only operations
that are closed for regular expressions (union, concatenation, and Kleene
closure; note that we need all three operations!), the $R^k$s are also regular
expressions. Thus we can state that $R^k_{ij}$ is a regular expression for any
value of i, j, and k, with $1 \le i, j, k \le n$, and that this expression denotes
all paths (or, equivalently, strings that cause the automaton to follow these
paths) that lead from state i to state j while not passing through any state
numbered higher than k.

Figure 3.18 Adding node k to paths from i to j. [Diagram not reproduced.]

Completing the Proof


The language of the deterministic finite automaton is precisely the set of
all paths through the machine that go from the start state to an accepting
state. These paths are denoted by the regular expressions $R^n_{1j}$, where j is
some accepting state. (Note that, in the final expressions, we have k = n;
that is, the paths are allowed to pass through any state in the machine.)
The language of the whole machine is then described by the union of
these expressions, the regular expression $\sum_{j \in F} R^n_{1j}$. Our proof is now
complete: we have shown that, for any deterministic finite automaton, we
can construct a regular expression that defines the same language. As before,
the technique is mechanical and results in cumbersome and redundant
expressions: it is not an efficient procedure to use for designing regular
expressions from finite automata. However, since it is mechanical, it works
in all cases to derive correct expressions and thus serves to establish the
theorem that a regular expression can be constructed for any deterministic
finite automaton.
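The whole construction (base case, inductive step, and final union) fits in a short program. The sketch below makes several assumptions that are not part of the proof: the function names are hypothetical, expressions are produced in the syntax of Python's re module (so | plays the role of +, the empty pattern plays the role of ε, and None stands for ∅), and a few obvious simplifications are applied to keep the output smaller. The example machine, which accepts strings with an even number of 0s, is also hypothetical and is not the automaton of Figure 3.16.

```python
import re


def union(p, q):
    if p is None:
        return q
    if q is None or p == q:
        return p
    return f"(?:{p}|{q})"


def concat(p, q):
    if p is None or q is None:    # concatenating with the empty set gives the empty set
        return None
    return p + q


def star(p):
    if p is None or p == "":      # the closure of the empty set or of {epsilon} is {epsilon}
        return ""
    return f"(?:{p})*"


def dfa_to_regex(n, delta, start, accepting):
    """States are numbered 1..n; delta maps (state, symbol) -> state."""
    states = range(1, n + 1)
    # Base case (k = 0): epsilon when i = j, plus every direct transition.
    R = {}
    for i in states:
        for j in states:
            expr = "" if i == j else None
            for (q, a), r in sorted(delta.items()):
                if q == i and r == j:
                    expr = union(expr, re.escape(a))
            R[i, j] = expr
    # Inductive step: additionally allow state k as an intermediate state.
    for k in states:
        loop = star(R[k, k])
        R = {(i, j): union(R[i, j], concat(concat(R[i, k], loop), R[k, j]))
             for i in states for j in states}
    # Final union over the accepting states.
    result = None
    for j in sorted(accepting):
        result = union(result, R[start, j])
    return result


# Hypothetical two-state machine: state 1 = even number of 0s seen so far.
even_zeros = {(1, "0"): 2, (1, "1"): 1, (2, "0"): 1, (2, "1"): 2}
expression = dfa_to_regex(2, even_zeros, 1, {1})
```

Even for this two-state machine the expression is long and redundant, which matches the remark above that the technique is mechanical rather than efficient.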


In the larger picture, this proof completes the proof of the equivalence
of regular expressions and finite automata.

Reviewing the Construction of Regular Expressions from Finite Automata

Because regular expressions are defined inductively, we need to proceed
inductively in our proof. Unfortunately, finite automata are not defined
inductively, nor do they offer any obvious ordering for induction. Since we
are not so much interested in the automata as in the languages they accept,
we can look at the set of strings accepted by a finite automaton. Every
such string leads the automaton from the start state to an accepting state
through a series of transitions. We could conceivably attempt an induction
on the length of the strings accepted by the automaton, but this length has
little relationship to either the automaton (a very short path through the
automaton can easily produce an arbitrarily long string; think of a loop on
the start state) or the regular expressions describing the language (a simple
expression can easily denote an infinite collection of strings).
What we need is an induction that allows us to build regular expressions
describing strings (i.e., sequences of transitions through the automaton) in
a progressive fashion; terminates easily; and has simple base cases. The
simplest sequence of transitions through an automaton is a single transition
(or no transition at all). While that seems to lead us right back to induction
on the number of transitions (on the length of strings), such need not be
the case. We can view a single transition as one that does not pass through
any other state and thus as the base case of an induction that will allow a
larger and larger collection of intermediate states to be used in fabricating
paths (and thus regular expressions).
Hence our preliminary idea about induction can be stated as follows: we
will start with paths (strings) that allow no intermediate state, then proceed
with paths that allow one intermediate state, then a set of two intermediate
states, and so forth. This ordering is not yet sufficient, however: which
intermediate state(s) should we allow? If we allow any single intermediate
state, then any two, then any three, and so on, the ordering is not strict:
there are many different subsets of k intermediate states out of the n states
of the machine and none is comparable to any other. It would be much
better to have a single subset of allowable intermediate states at each step
of the induction.
We now get to our final idea about induction: we shall number the states
of the finite automaton and use an induction based on that numbering.
The induction will start with paths that allow no intermediate state, then
proceed to paths that can pass (arbitrarily often) through state 1, then to
paths that can pass through states 1 and 2, and so on. This process looks
good until we remember that we want paths from the start state to an
accepting state: we may not be able to find such a path that also obeys our
requirements. Thus we should look not just at paths from the start state to
an accepting state, but at paths from any vertex to any other. Once we have
regular expressions for all source/target pairs, it will be simple enough to
keep those that describe paths from the start state to an accepting state.
Now we can formalize our induction: at step k of the induction, we
shall compute, for each pair (i, j) of vertices, all paths that go from vertex i
to vertex j and that are allowed to pass through any of the vertices
numbered from 1 to k. If the starting vertex for these paths, vertex i, is
among the first k vertices, then we allow paths that loop through vertex i;
otherwise we allow the path only to leave vertex i but not see it again on
its way to vertex j. Similarly, if vertex j is among the first k vertices, the
path may go through it any number of times; otherwise the path can only
reach it and stop.
In effect, at each step of the induction, we define a new, somewhat larger
finite automaton composed of the first k states of the original automaton,
together with all transitions among these k states, plus any transition from
state i to any of these states that is not already included, plus any transition
to state j from any of these states that is not already included, plus any
transition from state i to state j , if not already included. Think of these
states and transitions as being highlighted in red, while the rest of the
automaton is blue; we can play only with the red automaton at any step of
the induction. However, from one step to the next, another blue state gets
colored red along with any transitions between it and the red states and
any transition to it from state i and any transition from it to state j . When
the induction is complete, k equals n, the number of states of the original
machine, and all states have been colored red, so we are playing with the
original machine.
To describe with regular expressions what is happening, we begin
by describing paths from i to j that use no intermediate state (no state
numbered higher than 0). That is simple, since such transitions occur either
under ε (when i = j) or under a single symbol, in which case we just look
up the transition table of the automaton. The induction step simply colors
one more blue node in red. Hence we can add to all existing paths from
i to j those paths that now go through the new node; these paths can go
through the new node several times (they can include a loop that takes
them back to the new node over and over again) before reaching node j .
Since only the portion that touches the new node is new, we simply break
any such paths into segments, each of which leaves or enters the new node
but does not pass through it. Every such segment goes through only old red
nodes and so can be described recursively, completing the induction.

3.4 The Pumping Lemma and Closure Properties


3.4.1 The Pumping Lemma

We saw earlier that a language is regular if we can construct a finite
automaton that accepts exactly the strings in that language or a regular expression
that represents that language. However, so far we have no tool to prove
that a language is not regular.
The pumping lemma is such a tool. It establishes a necessary (but
not sufficient) condition for a language to be regular. We cannot use the
pumping lemma to establish that a language is regular, but we can use it to
prove that a language is not regular, by showing that the language does not
obey the lemma.
The pumping lemma is based on the idea that all regular languages
must exhibit some form of regularity (pun intended; that is the origin of
the name regular languages). Put differently, all strings of arbitrary length
(i.e., all sufficiently long strings) belonging to a regular language must
have some repeating pattern(s). (The short strings can each be accepted in
a unique way, each through its own unique path through the machine. In
particular, any finite language has no string of arbitrary length and so has
only short strings and need not exhibit any regularity.)
Consider a finite automaton with n states, and let z be a string of
length at least n that is accepted by this automaton. In order to accept
z, the automaton makes a transition for each input symbol and thus
moves through at least n + 1 states, one more than exist in the automaton.
Therefore the automaton will go through at least one loop in accepting the
string. Let the string be $z = x_1 x_2 x_3 \dots x_{|z|}$; then Figure 3.19 illustrates the
accepting path for z. In view of our preceding remarks, we can divide the
path through the automaton into three parts: an initial part that does not
contain any loop, the first loop encountered, and a final part that may or
may not contain additional loops. Figure 3.20 illustrates this partition.

Figure 3.19 An accepting path for z. [Diagram not reproduced: states $q_0, q_1, \dots, q_k$ joined by transitions labeled $x_1, x_2, \dots, x_{|z|}$.]

Figure 3.20 The three parts of an accepting path, showing potential
looping. [Diagram not reproduced: a part x with no loop, a loop y split into y′ and y″, and a tail t.]

We used x, y, and t to denote the three parts and further broke the loop into
two parts, y′ and y″, writing y = y′y″y′, so that the entire string becomes
xy′y″y′t. Now we can go through the loop as often as we want, from zero
times (yielding xy′t) to twice (yielding xy′y″y′y″y′t) to any number of times
(yielding a string of the form xy′(y″y′)*t); all of these strings must be in the
language. This is the spirit of the pumping lemma: you can pump some
string of unknown, but nonzero length, here y″y′, as many times as you
want and always obtain another string in the language, no matter what
the starting string z was (as long, that is, as it was long enough). In our case
the string can be viewed as being of the form uvw, where we have u = xy′,
v = y″y′, and w = t. We are then saying that any string of the form $uv^i w$ is
also in the language. We have (somewhat informally) proved the pumping
lemma for regular languages.
Theorem 3.4 For every regular language L, there exists some constant n
(the size of the smallest automaton that accepts L) such that, for every string
$z \in L$ with $|z| \ge n$, there exist $u, v, w \in \Sigma^*$ with $z = uvw$, $|v| \ge 1$, $|uv| \le n$,
and, for all $i \in \mathbb{N}$, $uv^i w \in L$.
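When a deterministic automaton is actually available, the decomposition promised by the lemma can be read off the accepting path by stopping at the first repeated state. The sketch below illustrates this; the function names and the example machine (strings with an even number of 0s) are assumptions made for the illustration.

```python
def pump_decomposition(delta, start, z):
    """Split z as (u, v, w), where v labels the first loop on the path of z.

    Returns None when no state repeats, which can happen only if z is
    shorter than the number of states.
    """
    seen = {start: 0}     # state -> number of symbols read when first visited
    q = start
    for pos, a in enumerate(z, start=1):
        q = delta[q, a]
        if q in seen:     # the first repeated state closes the first loop
            first = seen[q]
            return z[:first], z[first:pos], z[pos:]
        seen[q] = pos
    return None


def run(delta, start, accepting, word):
    """Run the deterministic automaton and report acceptance."""
    q = start
    for a in word:
        q = delta[q, a]
    return q in accepting


# Hypothetical two-state machine accepting strings with an even number of 0s.
parity = {("even", "0"): "odd", ("even", "1"): "even",
          ("odd", "0"): "even", ("odd", "1"): "odd"}
u, v, w = pump_decomposition(parity, "even", "0110")
pumped = [u + v * i + w for i in range(4)]
```

Because the loop closes within the first n transitions, the pieces satisfy |uv| ≤ n and |v| ≥ 1, and every string in pumped is again accepted.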
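Writing this statement succinctly, we obtain

$$L \text{ is regular} \;\Rightarrow\; (\exists n\, \forall z,\ |z| \ge n,\ \exists u, v, w,\ |uv| \le n,\ |v| \ge 1,\ \forall i,\ uv^i w \in L)$$

so that the contrapositive is

$$(\forall n\, \exists z,\ |z| \ge n,\ \forall u, v, w,\ |uv| \le n,\ |v| \ge 1,\ \exists i,\ uv^i w \notin L) \;\Rightarrow\; L \text{ is not regular}$$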


Thus to show that a language is not regular, all we need to do is find a string
z that contradicts the lemma. We can think of playing the adversary in a
game where our opponent is attempting to convince us that the language
is regular and where we are intent on providing a counterexample. If our
opponent claims that the language is regular, then he must be able to provide
a finite automaton for the language. Yet no matter what that automaton is,
our counterexample must work, so we cannot pick n, the number of states
of the claimed automaton, but must keep it as a parameter in order for our
construction to work for any number of states. On the other hand, we get to
choose a specific string, z, in the language and give it to our opponent. Our
opponent, who (claims that he) knows a finite automaton for the language,
then tells us where the first loop used by his machine lies and how long
it is (something we have no way of knowing since we do not have the
automaton). Thus we cannot choose the decomposition of z into u, v, and
w, but, on the contrary, must be prepared for any decomposition given to
us by our opponent. Thus for each possible decomposition into u, v, and w
(that obeys the constraints), we must prepare our counterexample, that is, a
pumping number i (which can vary from decomposition to decomposition)
such that the string $uv^i w$ is not in the language.
To summarize, the steps needed to prove that a language L is not regular
are:

1. Assume that the language is regular.
2. Let some parameter n be the constant of the pumping lemma.
3. Pick a suitable string z in L with $|z| \ge n$.
4. Show that, for every legal decomposition of z into uvw (i.e., obeying
   $|v| \ge 1$ and $|uv| \le n$), there exists $i \ge 0$ such that $uv^i w$ does not belong
   to L.
5. Conclude that assumption (1) was false.

Failure to proceed through these steps invalidates the potential proof that L
is not regular but does not prove that L is regular! If the language is finite,
the pumping lemma is useless, as it has to be, since all finite languages are
regular: in a finite language, the automaton's accepting paths all have length
less than the number of states in the machine, so that the pumping lemma
holds vacuously.
Consider the language $L_1 = \{0^i 1^i \mid i \ge 0\}$. Let n be the constant of the
pumping lemma (that is, n is the number of states in the corresponding
deterministic finite automaton, should one exist). Pick the string $z = 0^n 1^n$;
it satisfies $|z| \ge n$. Figure 3.21 shows how we might decompose z = uvw to
ensure $|uv| \le n$ and $|v| \ge 1$. The uv must be a string of 0s, so pumping v
will give more 0s than 1s. It follows that the pumped string is not in $L_1$,
which would contradict the pumping lemma if the language were regular.
Therefore the language is not regular.

Figure 3.21 Decomposing the string z into possible choices for u, v, and w. [Diagram not reproduced.]
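The adversary game can be played mechanically for a fixed n, which is a useful way to test a choice of z before writing the proof. In the sketch below, the names refutes_pumping and in_L1 and the bound max_i on the pumping number are assumptions; success for a few values of n is only evidence, since the proof itself must keep n as a parameter.

```python
def refutes_pumping(in_lang, z, n, max_i=4):
    """True if every legal split z = uvw has some i <= max_i with u v^i w outside the language."""
    if not in_lang(z) or len(z) < n:
        raise ValueError("z must be in the language and have length at least n")
    for uv_len in range(1, min(n, len(z)) + 1):   # |uv| <= n
        for u_len in range(uv_len):               # guarantees |v| >= 1
            u, v, w = z[:u_len], z[u_len:uv_len], z[uv_len:]
            if all(in_lang(u + v * i + w) for i in range(max_i + 1)):
                return False    # the opponent survives with this decomposition
    return True


def in_L1(s):
    """Membership in L1: some number of 0s followed by the same number of 1s."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s == "0" * half + "1" * half


results = [refutes_pumping(in_L1, "0" * n + "1" * n, n) for n in range(1, 7)]
```

For a regular language the opponent always has a surviving decomposition once n is large enough, so the same function returns False there; for instance, with the language of strings having an even number of 0s, z = 00 and n = 2, the split with v = 00 survives every pumping number.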
As another example, let $L_2$ be the set of all strings, the length of
which is a perfect square. (The alphabet does not matter.) Let n be the
constant of the lemma. Choose any z of length $n^2$ and write z = uvw with
$|v| \ge 1$ and $|uv| \le n$; in particular, we have $1 \le |v| \le n$. It follows from the
pumping lemma that, if the language is regular, then the string $z' = uv^2 w$
must be in the language. But we have $|z'| = |z| + |v| = n^2 + |v|$ and, since we
assumed $1 \le |v| \le n$, we conclude $n^2 < n^2 + 1 \le n^2 + |v| \le n^2 + n < (n + 1)^2$,
or $n^2 < |z'| < (n + 1)^2$, so that $|z'|$ is not a perfect square and thus $z'$ is not
in the language. Hence the language is not regular.
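The arithmetic at the heart of this argument is easy to confirm numerically for small values. The sketch below searches for a pair (n, |v|) with 1 ≤ |v| ≤ n for which n² + |v| is a perfect square; the inequality above says that none exists. The function names and the search limit are assumptions made for this check.

```python
from math import isqrt


def is_square(m):
    """True if the nonnegative integer m is a perfect square."""
    return isqrt(m) ** 2 == m


def squares_counterexample(limit):
    """Return a pair (n, v_len) with n*n + v_len a perfect square, or None."""
    for n in range(1, limit + 1):
        for v_len in range(1, n + 1):
            if is_square(n * n + v_len):
                return (n, v_len)
    return None


found = squares_counterexample(200)
```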
As a third example, consider the language $L_3 = \{a^i b^j c^k \mid 0 \le i < j < k\}$.
Let n be the constant of the pumping lemma. Pick $z = a^n b^{n+1} c^{n+2}$, which
clearly obeys $|z| \ge n$ as well as the inequalities on the exponents, but is
as close to failing these last as possible. Write z = uvw, with $|uv| \le n$
and $|v| \ge 1$. Then uv is a string of a's, so that $z' = uv^2 w$ is the string
$a^{n+|v|} b^{n+1} c^{n+2}$; since we assumed $|v| \ge 1$, the number of a's is now at least
equal to the number of b's, not less, so that $z'$ is not in the language. Hence
$L_3$ is not regular.
As a fourth example, consider the set $L_4$ of all strings $x$ over $\{0, 1\}$ such
that, in at least one prefix of $x$, there are four more 1s than 0s. Let $n$ be the
constant of the pumping lemma and choose $z = 0^n 1^{n+4}$; $z$ is in the language,
because $z$ itself has four more 1s than 0s (although no other prefix of $z$ does:
once again, our string $z$ is on the edge of failing membership). Let $z = uvw$;
since we assumed $|uv| \le n$, it follows that $uv$ is a string of 0s and that, in
particular, $v$ is a string of one or more 0s. Hence the string $z' = uv^2 w$, which
must be in the language if the language is regular, is of the form $0^{n+|v|} 1^{n+4}$;
but this string does not have any prefix with four more 1s than 0s and so is
not in the language. Hence the language is not regular.
As a final example, let us tackle the more complex language $L_5 =
\{a^i b^j c^k \mid i \ne j \text{ or } j \ne k\}$. Let $n$ be the constant of the pumping lemma and
choose $z = a^n b^{n!+n} c^{n!+n}$; the reason for this mysterious choice will become
clear in a few lines. (Part of the choice is the now familiar "edge" position:
this string already has the second and third groups of equal size, so it
suffices to bring the first group to the same size to cause it to fail entirely.)
Let $z = uvw$; since we assumed $|uv| \le n$, we see that $uv$ is a string of $a$s and
thus, in particular, $v$ is a string of one or more $a$s. Thus the string $z' = uv^i w$,
which must be in the language for all values of $i \ge 0$ if the language is regular,
is of the form $a^{n+(i-1)|v|} b^{n!+n} c^{n!+n}$. Choose $i$ to be $(n!/|v|) + 1$; this value is
a natural number, because $|v|$ is between 1 and $n$, and because $n!$ is divisible
by any number between 1 and $n$ (this is why we chose this particular value
$n! + n$). Then we get the string $a^{n!+n} b^{n!+n} c^{n!+n}$, which is not in the language.
Hence the language is not regular.
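The choice of exponent for $L_5$ is easy to verify numerically. The sketch below is illustrative only; the function names `pumping_exponent` and `pumped_a_count` are our own and do not come from the text.

```python
import math


def pumping_exponent(n: int, v_len: int) -> int:
    """The exponent i = n!/|v| + 1 chosen in the text (the division is exact)."""
    assert 1 <= v_len <= n
    return math.factorial(n) // v_len + 1


def pumped_a_count(n: int, v_len: int) -> int:
    """Number of a's in u v^i w when z = a^n b^(n!+n) c^(n!+n) and |v| = v_len."""
    i = pumping_exponent(n, v_len)
    return n + (i - 1) * v_len
```

For instance, with $n = 4$ and $|v| = 3$ the exponent is $i = 24/3 + 1 = 9$, which yields $4 + 8 \cdot 3 = 28 = 4! + 4$ occurrences of $a$, exactly the size of the other two groups.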
Consider applying the pumping lemma to the language $L_6 = \{a^i b^j c^k \mid
i > j > k \ge 0\}$. $L_6$ is extremely similar to $L_3$, yet the same application of
the pumping lemma used for $L_3$ fails for $L_6$: it is no use to pump more
$a$s, since that will not contradict the inequality, but reinforce it. In a
similar vein, consider the language $L_7 = \{0^i 1^j 0^j \mid i, j > 0\}$; this language is
similar to the language $L_1$, which we already proved not regular through a
straightforward application of the pumping lemma. Yet the same technique
will fail with $L_7$, because we cannot ensure that we are not just pumping
initial 0s, something that would not prevent membership in $L_7$.
In the first case, there is a simple way out: instead of pumping up, pump
down by one. From $uvw$, we obtain $uw$, which must also be in the language
if the language is regular. If we choose for $L_6$ the string $z = a^{n+2} b^{n+1}$, then
$uv$ is a string of $a$s and pumping down will remove at least one $a$, thereby
invalidating the inequality. We can do a detailed case analysis for $L_7$, which
will work. Pick $z = 01^n 0^n$; then $uv$ is $01^k$ for some $k \ge 0$. If $k$ equals 0, then
$uv$ is just 0, so $u$ is $\varepsilon$ and $v$ is 0, and pumping down once creates the string
$1^n 0^n$, which is not in the language, as desired. If $k$ is at least 1, then either
$u$ is $\varepsilon$, in which case pumping up once produces the string $01^k 01^n 0^n$, which
is not in the language; or $u$ has length at least 1, in which case $v$ is a string
of 1s and pumping up once produces the string $01^{n+|v|} 0^n$, which is not in
the language either. Thus in all three cases we can pump the string so as to
produce another string not in the language, showing that the language is
not regular. But contrast this laborious procedure with the proof obtained
from the extended pumping lemma described below.
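The case analysis for $L_7$ can be replayed by brute force. The following sketch is an illustration with names of our own choosing (`in_L7`, `breaking_exponent`, `case_analysis`); it tries pumping down ($i = 0$) before pumping up ($i = 2$), so in cases where both work it reports 0 even though the text pumps up.

```python
import re

_SHAPE = re.compile(r"0+(1+)(0+)")


def in_L7(s: str) -> bool:
    """Membership test for L7 = { 0^i 1^j 0^j : i, j > 0 }."""
    m = _SHAPE.fullmatch(s)
    return m is not None and len(m.group(1)) == len(m.group(2))


def breaking_exponent(u: str, v: str, w: str):
    """Return an i in (0, 2) with u v^i w outside L7, or None if neither works."""
    for i in (0, 2):
        if not in_L7(u + v * i + w):
            return i
    return None


def case_analysis(n: int) -> dict:
    """Map every legal decomposition of z = 0 1^n 0^n to a breaking exponent."""
    z = "0" + "1" * n + "0" * n
    result = {}
    for end in range(1, n + 1):       # end = |uv| <= n
        for start in range(end):      # start = |u|, so |v| = end - start >= 1
            u, v, w = z[:start], z[start:end], z[end:]
            result[(u, v)] = breaking_exponent(u, v, w)
    return result
```

When $u = \varepsilon$ and $v = 0$, pumping up keeps the string in $L_7$, which is why that case must pump down.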


What we really need is a way to shift the position of the $uv$ substring
within the entire string; having it restricted to the front of $z$ is too limiting.
Fortunately our statement (and proof) of the pumping lemma does not
really depend on the location of the n characters within the string. We
started at the beginning because that was the simplest approach and we
used n (the number of states in the smallest automaton accepting the
language) rather than some larger constant because we could capture in
that manner the first loop along an accepting path. However, there may
be many different loops along any given path. Indeed, in any stretch of
n characters, n + 1 states are visited and so, by the pigeonhole principle,
a loop must occur. These observations allow us to rephrase the pumping
lemma slightly.
Lemma 3.1 For any regular language $L$ there exists some constant $n > 0$
such that, for any three strings $z_1$, $z_2$, and $z_3$ with $z = z_1 z_2 z_3 \in L$ and $|z_2| = n$,
there exist strings $u, v, w \in \Sigma^*$ with $z_2 = uvw$, $|v| \ge 1$, and, for all $i \in \mathbb{N}$,
$z_1 u v^i w z_3 \in L$.
This restatement does not alter any of the conditions of the original
pumping lemma (note that $|z_2| = n$ implies $|uv| \le n$, which is why the latter
inequality was not stated explicitly); however, it does allow us to move our
focus of attention anywhere within a long string. For instance, consider
again the language $L_7$: we shall pick $z_1 = 0^n$, $z_2 = 1^n$, and $z_3 = 0^n$; clearly,
$z = z_1 z_2 z_3 = 0^n 1^n 0^n$ is in $L_7$. Since $z_2$ consists only of 1s, so does $v$; therefore
the string $z_1 uv^2 w z_3$ is $0^n 1^{n+|v|} 0^n$ and is not in $L_7$, so that $L_7$ is not regular.
The new statement of the pumping lemma allowed us to move our focus
of attention to the 1s in the middle of the string, making for an easy proof.
Although $L_6$ does not need it, the same technique is also advantageously
applied: if $n$ is the constant of the pumping lemma, pick $z_1 = a^{n+1}$, $z_2 = b^n$,
and $z_3 = \varepsilon$. Now write $z_2 = uvw$: it follows that $v$ is a string of one or more
$b$s, so that the string $z_1 uv^2 w z_3$ is $a^{n+1} b^{n+|v|}$, which is not in the language,
since we have $n + |v| \ge n + 1$. Table 3.1 summarizes the use of (our extended
version of) the pumping lemma.
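With the window $z_2$ placed on the block of 1s, every decomposition of $z_2$ behaves the same way, and no case analysis is needed. The sketch below checks this for small $n$; it is an illustration only, and the names `in_L7` and `window_pump_breaks` are our own.

```python
def in_L7(s: str) -> bool:
    """Membership test for L7 = { 0^i 1^j 0^j : i, j > 0 }."""
    i = len(s) - len(s.lstrip("0"))          # number of leading 0s
    rest = s[i:]
    j = len(rest) - len(rest.lstrip("1"))    # number of 1s that follow
    return i > 0 and j > 0 and rest[j:] == "0" * j


def window_pump_breaks(n: int) -> bool:
    """True if z1 u v^2 w z3 leaves L7 for every split z2 = uvw with |v| >= 1."""
    z1, z2, z3 = "0" * n, "1" * n, "0" * n
    if not in_L7(z1 + z2 + z3):
        return False
    for end in range(1, n + 1):
        for start in range(end):
            u, v, w = z2[:start], z2[start:end], z2[end:]
            if in_L7(z1 + u + v * 2 + w + z3):
                return False
    return True
```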
Exercise 3.2 Develop a pumping lemma for strings that are not in the
language. In a deterministic finite automaton where all transitions are
specified, arbitrarily long strings that get rejected must be rejected through
a path that includes one or more loops, so that a lemma similar to the
pumping lemma can be proved. What do you think the use of such a
lemma would be?

Table 3.1 How to use the pumping lemma to prove nonregularity.

• Assume that the language is regular.
• Let $n$ be the constant of the pumping lemma; it will be used to parameterize the
  construction.
• Pick a suitable string $z$ in the language that has length at least $n$. (In many cases,
  pick $z$ "at the edge" of membership, that is, as close as possible to failing some
  membership criterion.)
• Decompose $z$ into three substrings, $z = z_1 z_2 z_3$, such that $z_2$ has length exactly $n$.
  You can pick the boundaries as you please.
• Write $z_2$ as the concatenation of three strings, $z_2 = uvw$; note that the boundaries
  delimiting $u$, $v$, and $w$ are not known: all that can be assumed is that $v$ has
  nonzero length.
• Verify that, for any choice of boundaries, i.e., any choice of $u$, $v$, and $w$ with
  $z_2 = uvw$ and where $v$ has nonzero length, there exists an index $i$ such that the
  string $z_1 u v^i w z_3$ is not in the language.
• Conclude that the language is not regular.

3.4.2 Closure Properties of Regular Languages

By now we have established the existence of an interesting family of sets, the
regular sets. We know how to prove that a set is regular (exhibit a suitable
finite automaton or regular expression) and how to prove that a set is not
regular (use the pumping lemma). At this point, we should ask ourselves
what other properties these regular sets may possess; in particular, how do
they behave under certain basic operations? The simplest question about
any operator applied to elements of a set is "Is it closed?" or, put negatively,
"Can an expression in terms of elements of the set evaluate to an element
not in the set?" For instance, the natural numbers are closed under addition
and multiplication but not under division (the result is a rational number);
the reals are closed under the four operations (excluding division by 0) but
not under square root (the square root of a negative number is not a real
number); and the complex numbers are closed under the four operations
and under any polynomial root-finding.
From our earlier work, we know that the regular sets must be closed
under concatenation, union, and Kleene closure, since these three operations were defined on regular expressions (regular sets) and produce more
regular expressions. We alluded briefly to the fact that they must be closed
under intersection and complement, but let us revisit these two results.

The complement of a language $L \subseteq \Sigma^*$ is the language $\overline{L} = \Sigma^* - L$.
Given a deterministic finite automaton for $L$ in which every transition is
defined (if some transitions are not specified, add a new rejecting trap state
and define every undefined transition to move to the new trap state), we
can build a deterministic finite automaton for $\overline{L}$ by the simple expedient of
turning every rejecting state into an accepting state and vice versa. Since
regular languages are closed under union and complementation, they are
also closed under intersection by De Morgan's law. To see directly that
regular languages are closed under intersection, consider regular languages $L_1$ and $L_2$ with associated
automata $M_1$ and $M_2$. We construct the new machine $M$ for the language
$L_1 \cap L_2$ as follows. The set of states of $M$ is the Cartesian product of the
sets of states of $M_1$ and $M_2$; the start state of $M$ is the pair formed by the
start states of $M_1$ and $M_2$; if $M_1$ has transition $\delta_1(q_i, a) = q_j$ and $M_2$
has transition $\delta_2(q_k, a) = q_l$, then $M$ has transition $\delta((q_i, q_k), a) = (q_j, q_l)$;
finally, $(q', q'')$ is an accepting state of $M$ if $q'$ is an accepting state of $M_1$
and $q''$ is an accepting state of $M_2$.
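Both constructions translate almost line for line into code. The sketch below is a minimal illustration, assuming complete deterministic automata stored as dictionaries; the `DFA` class, the functions `complement` and `intersection`, and the two example machines are our own and are not taken from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class DFA:
    states: frozenset
    alphabet: frozenset
    delta: dict            # (state, symbol) -> state, defined for every pair
    start: object
    accepting: frozenset

    def accepts(self, word: str) -> bool:
        q = self.start
        for ch in word:
            q = self.delta[(q, ch)]
        return q in self.accepting


def complement(m: DFA) -> DFA:
    """Swap accepting and rejecting states of a complete DFA."""
    return DFA(m.states, m.alphabet, m.delta, m.start, m.states - m.accepting)


def intersection(m1: DFA, m2: DFA) -> DFA:
    """Product construction: run m1 and m2 in lockstep on the same input."""
    states = frozenset(product(m1.states, m2.states))
    delta = {((p, q), a): (m1.delta[(p, a)], m2.delta[(q, a)])
             for (p, q) in states for a in m1.alphabet}
    accepting = frozenset((p, q) for (p, q) in states
                          if p in m1.accepting and q in m2.accepting)
    return DFA(states, m1.alphabet, delta, (m1.start, m2.start), accepting)


# Example machines over {0, 1}: an even number of 0s; strings ending in 1.
EVEN_ZEROS = DFA(frozenset({0, 1}), frozenset("01"),
                 {(s, c): (1 - s if c == "0" else s) for s in (0, 1) for c in "01"},
                 0, frozenset({0}))
ENDS_IN_ONE = DFA(frozenset("ny"), frozenset("01"),
                  {(s, c): ("y" if c == "1" else "n") for s in "ny" for c in "01"},
                  "n", frozenset("y"))
```

The product machine has $|Q_1| \cdot |Q_2|$ states (four in this example), whereas the route through De Morgan's law requires three complementations and a union.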
Closure under various operations can simplify proofs. For instance,
consider the language $L_8 = \{a^i b^j \mid i \ne j\}$; this language is closely related to
our standard language $\{a^i b^i \mid i \in \mathbb{N}\}$ and is clearly not regular. However,
a direct proof through the pumping lemma is somewhat challenging; a
much simpler proof can be obtained through closure. Since regular sets
are closed under complement and intersection and since the set $a^* b^*$ is
regular (denoted by a regular expression), then, if $L_8$ is regular, so must
be the language $\overline{L_8} \cap a^* b^*$. However, the latter is our familiar language
$\{a^i b^i \mid i \in \mathbb{N}\}$ and so is not regular, showing that $L_8$ is not regular either.
A much more impressive closure is closure under substitution. A
substitution from alphabet $\Sigma$ to alphabet $\Delta$ (not necessarily distinct) is
a mapping from $\Sigma$ to $2^{\Delta^*} - \{\emptyset\}$ that maps each character of $\Sigma$ onto a
(nonempty) regular language over $\Delta$. The substitution is extended from a
character to a string by using concatenation as in a regular expression: if
we have the string $ab$ over $\Sigma$, then its image is $f(ab)$, the language over $\Delta$
composed of all strings constructed of a first part chosen from the set $f(a)$
concatenated with a second part chosen from the set $f(b)$. Formally, if $w$
is $ax$, then $f(w)$ is $f(a) f(x)$, the concatenation of the two sets. Finally the
substitution is extended to a language in the obvious way:
$$f(L) = \bigcup_{w \in L} f(w)$$
To see that regular sets are closed under this operation, we shall use regular
expressions. Since each regular set can be written as a regular expression,
each of the $f(a)$ for $a \in \Sigma$ can be written as a regular expression. The
language $L$ is regular and so has a regular expression $E$. Simply substitute
for each character $a \in \Sigma$ appearing in $E$ the regular (sub)expression for
$f(a)$; the result is clearly a (typically much larger) regular expression.
(The alternate mechanism, which uses our extension to strings and then
to languages, would require a new result. Clearly, concatenation of sets
corresponds exactly to concatenation of regular expressions and union
of sets corresponds exactly to union of regular expressions. However,
$f(L) = \bigcup_{w \in L} f(w)$ involves a countably infinite union, not just a finite one,
and we do not yet know whether or not regular expressions are closed
under infinite union.)
A special case of substitution is homomorphism. A homomorphism
from a language $L$ over alphabet $\Sigma$ to a new language $f(L)$ over alphabet
$\Delta$ is defined by a mapping $f : \Sigma \to \Delta^*$; in words, the basic function maps
each symbol of the original alphabet to a single string over the new alphabet.
This is clearly a special case of substitution, one where the regular languages
to which each symbol can be mapped consist of exactly one string each.
Substitution and even homomorphism can alter a language significantly.
Consider, for instance, the language $L = (a + b)^*$ over the alphabet $\{a, b\}$;
this is just the language of all possible strings over this alphabet. Now
consider the very simple homomorphism from $\{a, b\}$ to subsets of $\{0, 1\}^*$
defined by $f(a) = 01$ and $f(b) = 1$; then $f(L) = (01 + 1)^*$ is the language
of all strings over $\{0, 1\}$ that do not contain a pair of adjacent 0s and (if not equal
to $\varepsilon$) end with a 1, a rather different beast. This ability to modify
languages considerably without affecting their regularity makes substitution a
powerful tool in proving languages to be regular or not regular.
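The description of $f(L)$ for the homomorphism $f(a) = 01$, $f(b) = 1$ can be checked by enumeration. The sketch below is illustrative; the names `F`, `image`, `fits_description`, and `images_up_to` are our own, and the check covers only short strings.

```python
from itertools import product

F = {"a": "01", "b": "1"}    # the homomorphism f(a) = 01, f(b) = 1


def image(word: str) -> str:
    """Apply the homomorphism symbol by symbol and concatenate the results."""
    return "".join(F[ch] for ch in word)


def fits_description(s: str) -> bool:
    """No two adjacent 0s, and a nonempty string must end with a 1."""
    return "00" not in s and (s == "" or s.endswith("1"))


def images_up_to(max_len: int) -> set:
    """Images of all words over {a, b} of length at most max_len."""
    return {image("".join(t))
            for n in range(max_len + 1)
            for t in product("ab", repeat=n)}
```

Every image satisfies the description, and conversely every binary string of length at most 6 that satisfies it is the image of some word of length at most 6, since each symbol contributes at least one character.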
To prove a new language $L$ regular, start with a known regular language
$L_0$ and define a substitution that maps $L_0$ to $L$. To prove a new language
$L$ not regular, define a substitution that maps $L$ to a new language $L_1$
known not to be regular. Formally speaking, these techniques are known as
reductions; we shall revisit reductions in detail throughout the remaining
chapters of this book.
We add one more operation to our list: the quotient of two languages.
Given languages $L_1$ and $L_2$, the quotient of $L_1$ by $L_2$, denoted $L_1/L_2$, is the
language $\{x \mid \exists y \in L_2,\ xy \in L_1\}$.
Theorem 3.5 If $R$ is regular, then so is $R/L$ for any language $L$.

The proof is interesting because it is nonconstructive, unlike all other proofs
we have used so far with regular languages and finite automata. (It has to be
nonconstructive, since we know nothing whatsoever about $L$; in particular,
it is possible that no procedure exists to decide membership in $L$ or to
enumerate the members of $L$.)

Proof. Let $M$ be a finite automaton for $R$. We define the new finite
automaton $M'$ to accept $R/L$ as follows. $M'$ is an exact copy of $M$, with
one exception: we define the accepting states of $M'$ differently; thus $M'$
has the same states, transitions, and start state as $M$, but possibly different
accepting states. A state $q$ of $M$ is an accepting state of $M'$ if and only if
there exists a string $y$ in $L$ that takes $M$ from state $q$ to one of its accepting
states.
Q.E.D.
$M'$, including its accepting states, is well defined; however, we may be
unable to construct $M'$, because the definition of accepting state may not be
computable if we have no easy way of listing the strings of $L$. (Naturally, if
$L$ is also regular, we can turn the existence proof into a constructive proof.)
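When $L$ is itself regular, the accepting states of the quotient automaton can be computed by a search in the product of the two machines. The sketch below is one possible way to do this, under the assumption that both languages are given by complete deterministic automata stored as dictionaries; the function `quotient_accepting` and the three example transition tables are our own.

```python
def quotient_accepting(states_r, delta_r, accept_r,
                       delta_l, start_l, accept_l, alphabet):
    """States q of the automaton for R from which some y in L leads to acceptance.

    Explores pairs (state of R's automaton, state of L's automaton) reachable
    from (q, start_l) by reading the same string y on both machines.
    """
    new_accepting = set()
    for q in states_r:
        seen = {(q, start_l)}
        stack = [(q, start_l)]
        while stack:
            p, s = stack.pop()
            if p in accept_r and s in accept_l:   # some y in L ends in accept_r
                new_accepting.add(q)
                break
            for a in alphabet:
                nxt = (delta_r[(p, a)], delta_l[(s, a)])
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return new_accepting


# R = 0*10*: A = no 1 seen yet, B = exactly one 1 seen (accepting), C = trap.
STATES_R = {"A", "B", "C"}
DELTA_R = {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "B",
           ("B", "1"): "C", ("C", "0"): "C", ("C", "1"): "C"}
ACCEPT_R = {"B"}

# L = 0*1: s = only 0s so far, t = just read the final 1 (accepting), d = trap.
DELTA_0S_THEN_1 = {("s", "0"): "s", ("s", "1"): "t", ("t", "0"): "d",
                   ("t", "1"): "d", ("d", "0"): "d", ("d", "1"): "d"}

# L = 0*: e = only 0s so far (accepting), x = trap.
DELTA_0S = {("e", "0"): "e", ("e", "1"): "x", ("x", "0"): "x", ("x", "1"): "x"}
```

For $0^*10^*/0^*1$ the only accepting state of the new machine is `A`, so the machine accepts exactly $0^*$; for $0^*10^*/0^*$ the accepting state remains `B`, and the language is unchanged.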
Example 3.3 We list some quotients of regular expressions:
$0^*10^*/0^* = 0^*10^*$
$0^*10^*/0^*1 = 0^*$
$101/101 = \varepsilon$
$0^*10^*/10^* = 0^*$
$0^*10^+/0^*1 = \emptyset$
$(1^* + 10^+)/(0^+ + 11) = 1^* + 10^*$
Exercise 3.3 Prove the following closure properties of the quotient:
• If $L_2$ includes $\varepsilon$, then, for any language $L$, $L/L_2$ includes all of $L$.
• If $L$ is not empty, then we have $\Sigma^*/L = \Sigma^*$.
• The quotient of any language $L$ by $\Sigma^*$ is the language composed of
  all prefixes of strings in $L$.
If $L_1$ is not regular, then we cannot say much about the quotient $L_1/L_2$,
even when $L_2$ is regular. For instance, let $L_1 = \{0^n 1^n \mid n \in \mathbb{N}\}$, which we
know is not regular. Now contrast these two quotients:
• $L_1/1^+ = \{0^n 1^m \mid n > m \in \mathbb{N}\}$, which is not regular, and
• $L_1/0^+1^+ = 0^*$, which is regular.
Table 3.2 summarizes the main closure properties of regular languages.

Table 3.2 Closure properties of regular languages.

• concatenation and Kleene closure
• complementation, union, and intersection
• homomorphism and substitution
• quotient by any language

3.4.3 Ad Hoc Closure Properties

In addition to the operators just shown, numerous other operators are
closed on the regular languages. Proofs of closure for these are often ad
hoc, constructing a (typically nondeterministic) finite automaton for the
new language from the existing automata for the argument languages. We
now give several examples, in increasing order of difficulty.
Example 3.4 Define the language swap(L) to be
$$\{a_2 a_1 \ldots a_{2n} a_{2n-1} \mid a_1 a_2 \ldots a_{2n-1} a_{2n} \in L\}$$
We claim that swap(L) is regular if $L$ is regular.
Let $M$ be a deterministic finite automaton for $L$. We construct a
(deterministic) automaton $M'$ for swap(L) that mimics what $M$ does when
it reads pairs of symbols in reverse. Since an automaton cannot read a pair
of symbols at once, our new machine, in some state corresponding to a state
of $M$ (call it $q$), will read the odd-indexed symbol (call it $a$) and "memorize"
it, that is, use a new state (call it $[q, a]$) to denote what it has read. It then
reads the even-indexed symbol (call it $b$), at which point it has available a
pair of symbols and makes a transition to whatever state machine $M$ would
move to from $q$ on having read the symbols $b$ and $a$ in that order.
As a specific example, consider the automaton of Figure 3.22(a). After
grouping the symbols in pairs, we obtain the automaton of Figure 3.22(b).

Figure 3.22 A finite automaton used for the swap language: (a) the original automaton; (b) the automaton after grouping symbols in pairs.

Figure 3.23 The substitute block of states for the swap language.

Our automaton for swap(L) will have a four-state block for each state of
the pair-grouped automaton for $L$, as illustrated in Figure 3.23. We can
formalize this construction as follows, albeit at some additional cost in
the number of states of the resulting machine. Our new machine $M'$ has
state set $Q \cup (Q \times \Sigma)$, where $Q$ is the state set of $M$; it has transitions of
the type $\delta'(q, a) = [q, a]$ for all $q \in Q$ and $a \in \Sigma$ and transitions of the type
$\delta'([q, a], b) = \delta(\delta(q, b), a)$ for all $q \in Q$ and $a, b \in \Sigma$; its start state is $q_0$, the
start state of $M$; and its accepting states are the accepting states of $M$.
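The formal construction for swap(L) can be written out directly. The sketch below is an illustration, assuming a complete deterministic automaton stored as a dictionary and representing the memory state $[q, a]$ as the Python tuple `(q, a)`; the function names and the example automaton for $(ab)^*$ are our own.

```python
def swap_dfa(states, alphabet, delta, start, accepting):
    """Build the transition table of M' from the table of M.

    Original states must not themselves be tuples, so that the memory
    states (q, a) cannot collide with them.
    """
    new_delta = {}
    for q in states:
        for a in alphabet:
            new_delta[(q, a)] = (q, a)                 # memorize a in state [q, a]
            for b in alphabet:
                # from [q, a] on b: do what M does on b followed by a
                new_delta[((q, a), b)] = delta[(delta[(q, b)], a)]
    return new_delta, start, set(accepting)


def run(delta, start, accepting, word):
    """Run a deterministic automaton given as a transition dictionary."""
    q = start
    for ch in word:
        q = delta[(q, ch)]
    return q in accepting


def swap_pairs(word):
    """Swap the two symbols of each consecutive pair (even length only)."""
    return "".join(word[i + 1] + word[i] for i in range(0, len(word), 2))


# Example: M accepts (ab)*; s = start/accepting, t = after an a, d = trap.
STATES, ALPHABET = {"s", "t", "d"}, "ab"
DELTA = {("s", "a"): "t", ("s", "b"): "d", ("t", "a"): "d",
         ("t", "b"): "s", ("d", "a"): "d", ("d", "b"): "d"}
START, ACCEPTING = "s", {"s"}
```

A word of odd length leaves $M'$ in a memory state $[q, a]$, which is never accepting, so only even-length words can be accepted, as the definition of swap(L) requires.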
Example 3.5 The approach used in the previous example works when
trying to build a machine that reads strings of the same length as those
read by $M$; however, when building a machine that reads strings shorter
than those read by $M$, nondeterministic transitions must be used to guess
the missing symbols.
Define the language odd(L) to be
$$\{a_1 a_3 a_5 \ldots a_{2n-1} \mid \exists a_2, a_4, \ldots, a_{2n},\ a_1 a_2 \ldots a_{2n-1} a_{2n} \in L\}$$
When machine $M'$ for odd(L) attempts to simulate what $M$ would do, it
gets only the odd-indexed symbols and so must guess which even-indexed
symbols would cause $M$ to accept the full string. So $M'$ in some state $q$
corresponding to a state of $M$ reads a symbol $a$ and moves to some new
state not in $M$ (call it $[q, a]$); then $M'$ makes an $\varepsilon$ transition that amounts to
guessing what the even-indexed symbol could be. The replacement block
of states that results from this construction is illustrated in Figure 3.24.
Thus we have $q'' \in \delta'([q, a], \varepsilon)$ for all states $q''$ with $q'' = \delta(\delta(q, a), b)$ for any
choice of $b$; formally, we write
$$\delta'([q, a], \varepsilon) = \{\delta(\delta(q, a), b) \mid b \in \Sigma\}$$

Figure 3.24 The substitute block of states for the odd language.

In this way, $M'$ makes two transitions for each symbol read, enabling it to
simulate the action of $M$ on the twice-longer string that $M$ needs to verify
acceptance.
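Rather than building the nondeterministic machine explicitly, we can simulate it by tracking the set of states it could be in. The sketch below is an illustration under our own naming (`odd_accepts`, `odd_bruteforce`, and the example automaton for $(00 + 11)^*$ are assumptions made for this example); the brute-force function applies the definition of odd(L) directly and serves only as a cross-check.

```python
from itertools import product


def run(delta, start, accepting, word):
    """Run a deterministic automaton given as a transition dictionary."""
    q = start
    for ch in word:
        q = delta[(q, ch)]
    return q in accepting


def odd_accepts(alphabet, delta, start, accepting, word):
    """Simulate M' by tracking every state of M it could currently be in."""
    current = {start}
    for a in word:
        # read the real, odd-indexed symbol a: q -> [q, a]
        memo = {(q, a) for q in current}
        # epsilon move: guess the even-indexed symbol b
        current = {delta[(delta[(q, sym)], b)]
                   for (q, sym) in memo for b in alphabet}
    return bool(current & set(accepting))


def odd_bruteforce(alphabet, delta, start, accepting, word):
    """Definition of odd(L): try every choice of the even-indexed symbols."""
    return any(run(delta, start, accepting,
                   "".join(a + b for a, b in zip(word, evens)))
               for evens in product(alphabet, repeat=len(word)))


# Example: M accepts (00 + 11)*; s = start/accepting, z = after 0, o = after 1.
ALPHABET = "01"
DELTA = {("s", "0"): "z", ("s", "1"): "o", ("z", "0"): "s", ("z", "1"): "d",
         ("o", "1"): "s", ("o", "0"): "d", ("d", "0"): "d", ("d", "1"): "d"}
START, ACCEPTING = "s", {"s"}
```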
As a specific example, consider the language $L = (00 + 11)^*$, recognized
by the automaton of Figure 3.25(a). For this choice of $L$, odd(L) is just
$\Sigma^*$. After grouping the input symbols in pairs, we get the automaton of
Figure 3.25(b). Now our new nondeterministic automaton has a block
of three states for each state of the pair-grouped automaton and so six
states in all, as shown in Figure 3.26. Our automaton moves from the start
state to one of the two accepting states while reading a character from the
input (corresponding to an odd-indexed character in the string accepted
by $M$) and makes an $\varepsilon$ transition on the next move, effectively guessing
the even-indexed symbol in the string accepted by $M$. If the guess is good
(corresponding to a 0 following a 0 or to a 1 following a 1), our automaton
returns to the start state to read the next character; if the guess is bad, it
moves to a rejecting trap state (a block of three states). As must be the
case, our automaton accepts $\Sigma^*$, albeit in an unnecessarily complicated
way.

Figure 3.25 The automaton used in the odd language: (a) the original automaton; (b) the automaton after grouping symbols in pairs.

Figure 3.26 The nondeterministic automaton for the odd language.
Example 3.6 As a final example, let us consider the language
$$\{x \mid \exists u, v, w \in \Sigma^*,\ |u| = |v| = |w| = |x| \text{ and } uvxw \in L\}$$
In other words, given $L$, our new language is composed of the third quarter
of each string of $L$ that has length a multiple of 4. Let $M$ be a (deterministic)
finite automaton for $L$ with state set $Q$, start state $q_0$, accepting states $F$,
and transition function $\delta$. As in the odd language, we have to guess a large
number of absent inputs to feed to $M$. Since the input is the string $x$, the
processing of the guessed strings $u$, $v$, and $w$ must take place while we
process $x$ itself. Thus our machine for the new language will be composed,
in effect, of four separate machines, each a copy of $M$; each copy will
process its quarter of $uvxw$, with three copies processing guesses and one
copy processing the real input. The key to a solution is tying together these
four machines: for instance, the machine processing $x$ should start from
the state reached by the machine processing $v$ once $v$ has been completely
processed.
This problem at first appears daunting: not only is $v$ guessed, but it is
not even processed when the processing of $x$ starts. The answer is to use yet
more nondeterminism and to guess what should be the starting state of each
component machine. Since we have four of them, we need a guess for the
starting states of the second, third, and fourth machines (the first naturally
starts in state $q_0$). Then we need to verify these guesses by checking, when
the input has been processed, that the first machine has reached the state
guessed as the start of the second, that the second machine has reached
the state guessed as the start of the third, and that the third machine has
reached the state guessed as the start of the fourth. In addition, of course,
we must also check that the fourth machine ended in some state in $F$. In
order to check initial guesses, these initial guesses must be retained; but
each machine will move from its starting state, so that we must encode in
the state of our new machine both the current state of each machine and
the initial guess about its starting state.
This chain of reasoning leads us to define a state of the new machine
as a seven-tuple, say $(q_i, q_j, q_k, q_l, q_m, q_n, q_o)$, where $q_i$ is the current state
of the first machine (no guess is needed for this machine), $q_j$ is the guessed
starting state for the second machine and $q_k$ its current state, $q_l$ is the
guessed starting state for the third machine and $q_m$ its current state, and $q_n$
is the guessed starting state for the fourth machine and $q_o$ its current state;
and where all $q_x$ are states of $M$.
The initial state of each machine is the same as the guess, that is, our
new machine can start from any state of the form $(q_0, q_j, q_j, q_l, q_l, q_n, q_n)$,
for any choice of $j$, $l$, and $n$. In order to make it possible, we add one
more state to our new machine (call it $S'$), designate it as the unique
starting state, and add $\varepsilon$ transitions from it to the $|Q|^3$ states of the
form $(q_0, q_j, q_j, q_l, q_l, q_n, q_n)$. When the input has been processed, it will
be accepted if the state reached by each machine matches the start state
used by the next machine and if the state reached by the fourth machine
is a state in $F$, that is, if the state of our new machine is of the form
$(q_j, q_j, q_l, q_l, q_n, q_n, q_f)$, with $q_f \in F$ and for any choices of $j$, $l$, and $n$.
Finally, from some state $(q_i, q_j, q_k, q_l, q_m, q_n, q_o)$, our new machine can
move to a new state $(q_i', q_j, q_k', q_l, q_m', q_n, q_o')$ when reading character $c$
from the input string $x$ whenever the following four conditions are met:
• there exists $a \in \Sigma$ with $\delta(q_i, a) = q_i'$
• there exists $a \in \Sigma$ with $\delta(q_k, a) = q_k'$
• $\delta(q_m, c) = q_m'$
• there exists $a \in \Sigma$ with $\delta(q_o, a) = q_o'$
Overall, our new machine, which is highly nondeterministic, has
$|Q|^7 + 1$ states. While the machine is large, its construction is rather
straightforward; indeed, the principle generalizes easily to more complex
situations, as explored in Exercises 3.31 and 3.32.
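The seven-tuple machine can be simulated by keeping the set of tuples it could currently occupy. The sketch below is an illustration, not an efficient implementation: the function names and the example automaton for $(00 + 11)^*$ are our own, the extra start state is modeled simply by starting from all $|Q|^3$ guessed tuples, and the brute-force function (which tries every $u$, $v$, $w$) is included only as a cross-check on very short inputs.

```python
from itertools import product


def run(delta, start, accepting, word):
    """Run a deterministic automaton given as a transition dictionary."""
    q = start
    for ch in word:
        q = delta[(q, ch)]
    return q in accepting


def third_quarter_accepts(states, alphabet, delta, start, accepting, x):
    """Simulate the seven-tuple machine on input x."""
    # epsilon moves out of the extra start state: guess q_j, q_l, q_n
    current = {(start, qj, qj, ql, ql, qn, qn)
               for qj, ql, qn in product(states, repeat=3)}
    # states reachable in one step on SOME symbol ("there exists a" moves)
    step = {q: {delta[(q, a)] for a in alphabet} for q in states}
    for c in x:
        nxt = set()
        for (qi, qj, qk, ql, qm, qn, qo) in current:
            qm2 = delta[(qm, c)]          # third machine reads the real input
            for qi2 in step[qi]:          # first machine: guessed symbol of u
                for qk2 in step[qk]:      # second machine: guessed symbol of v
                    for qo2 in step[qo]:  # fourth machine: guessed symbol of w
                        nxt.add((qi2, qj, qk2, ql, qm2, qn, qo2))
        current = nxt
    # accept if each machine ended where the next one was assumed to start
    return any(qi == qj and qk == ql and qm == qn and qo in accepting
               for (qi, qj, qk, ql, qm, qn, qo) in current)


def third_quarter_bruteforce(alphabet, delta, start, accepting, x):
    """Definition of the language: try every u, v, w of the same length as x."""
    blocks = ["".join(t) for t in product(alphabet, repeat=len(x))]
    return any(run(delta, start, accepting, u + v + x + w)
               for u in blocks for v in blocks for w in blocks)


# Example: M accepts (00 + 11)*; s = start/accepting, z = after 0, o = after 1.
STATES, ALPHABET = {"s", "z", "o", "d"}, "01"
DELTA = {("s", "0"): "z", ("s", "1"): "o", ("z", "0"): "s", ("z", "1"): "d",
         ("o", "1"): "s", ("o", "0"): "d", ("d", "0"): "d", ("d", "1"): "d"}
START, ACCEPTING = "s", {"s"}
```

Because the three guessed start states never change along a computation, an implementation concerned with efficiency could run one simulation per guess instead of storing them in every tuple; the version above follows the tuple layout of the text.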

These examples illustrate the conceptual power of viewing a state of the
new machine as a tuple, where, typically, members of the tuple are states
from the known machine or alphabet characters. State transitions of the
new machine are then defined on the tuples by defining their effect on each
member of the tuple, where the state transitions of the known machine can be
used to good effect. When the new language includes various substrings of
the known regular language, the tuple notation can be used to record
starting and current states in the exploration of each substring. Initial
state(s) and accepting states can then be set up so as to ensure that the
substrings, which are processed sequentially in the known machine but
concurrently in the new machine, have to match in the new machine as
they automatically did in the known machine.

3.5 Conclusion
Finite automata and regular languages (and regular grammars, an equivalent mechanism based on generation that we did not discuss, but that is
similar in spirit to the grammars used in describing legal syntax in programming languages) present an interesting model, with enough structure
to possess nontrivial properties, yet simple enough that most questions
about them are decidable. (We shall soon see that most questions about
universal models of computation are undecidable.) Finite automata find
most of their applications in the design of logical circuits (by definition,
any chip is a finite-state machine, the difference from our model being
simply that, whereas our finite-state automata have no output function,
finite-state machines do), but computer scientists see them most often in
parsers for regular expressions. For instance, the expression language used
to specify search strings in Unix is a type of regular expression, so that the
Unix tools built for searching and matching are essentially finite-state automata. As another example, tokens in programming languages (reserved
words, variable names, etc.) can easily be described by regular expressions
and so their parsing reduces to running a simple finite-state automaton
(e.g., lex).
However, finite automata cannot be used for problem-solving; as we
have seen, they cannot even count, much less search for optimal solutions.
Thus if we want to study what can be computed, we need a much more
powerful model; such a model forms the topic of Chapter 4.


3.6 Exercises
Exercise 3.4 Give deterministic finite automata accepting the following
languages over the alphabet $\Sigma = \{0, 1\}$:
1. The set of all strings that contain the substring 010.
2. The set of all strings that do not contain the substring 000.
3. The set of all strings such that every substring of length 4 contains at
least three 1s.
4. The set of all strings that contain either an even number of 0s or at
most three 0s (that is, if the number of 0s is even, the string is in
the language, but if the number of 0s is odd, then the string is in the
language only if that number does not exceed 3).
5. The set of all strings such that every other symbol is a 1 (starting at the
first symbol for odd-length strings and at the second for even-length
strings; for instance, both 10111 and 0101 are in the language).
This last problem is harder than the previous four since this automaton
has no way to tell in advance whether the input string has odd or even
length. Design a solution that keeps track of everything needed for
both cases until it reaches the end of the string.
Exercise 3.5 Design finite automata for the following languages over {0, 1}:
1. The set of all strings where no pair of adjacent 0s appears in the last
four characters.
2. The set of all strings where pairs of adjacent 0s must be separated by
at least one 1, except in the last four characters.
Exercise 3.6 In less than 10 seconds for each part, verify that each of the
following languages is regular:
1. The set of all C programs written in North America in 1997.
2. The set of all first names given to children born in New Zealand in
1996.
3. The set of numbers that can be displayed on your hand-held calculator.
Exercise 3.7 Describe in English the languages (over $\{0, 1\}$) accepted by the
following deterministic finite automata. (The initial state is identified by a
short unlabeled arrow; the final state is identified by a double circle; these
deterministic finite automata have only one final state each.)
[Figure: transition diagrams of automata 1, 2, and 3.]

Exercise 3.8 Prove or disprove each of the following assertions:


1. Every nonempty language contains a nonempty regular language.
2. Every language with nonempty complement is contained in a regular
language with nonempty complement.
Exercise 3.9 Give both deterministic and nondeterministic finite automata
accepting the following languages over the alphabet $\Sigma = \{0, 1\}$; then prove
lower bounds on the size of any deterministic finite automaton for each
language:
1. The set of all strings such that, at some place in the string, there are
two 0s separated by an even number of symbols.
2. The set of all strings such that the fifth symbol from the end of the
string is a 1.
3. The set of all strings over the alphabet {a, b, c, d} such that one of the
three symbols a, b, or c appears at least four times in all.
Exercise 3.10 Devise a general procedure that, given some finite automaton
$M$, produces the new finite automaton $M'$ such that $M'$ rejects $\varepsilon$, but
otherwise accepts all strings that $M$ accepts.
Exercise 3.11 Devise a general procedure that, given a deterministic finite
automaton $M$, produces an equivalent deterministic finite automaton $M'$
(i.e., an automaton that defines the same language as $M$) in which the start
state, once left, cannot be re-entered.
Exercise 3.12 Give a nondeterministic finite automaton to recognize the
set of all strings over the alphabet {a, b, c} such that the string, interpreted
as an expression to be evaluated, evaluates to the same value left-to-right
as it does right-to-left, under the following nonassociative operation:
      a  b  c
  a   a  b  c
  b   b  c  a
  c   b  a  b

Then give a deterministic finite automaton for the same language and
attempt to prove a nontrivial lower bound on the size of any deterministic
finite automaton for this problem.
Exercise 3.13 Prove that every regular language is accepted by a planar
nondeterministic finite automaton. A finite automaton is planar if its
transition diagram can be embedded in the plane without any crossings.
Exercise 3.14 In contrast to the previous exercise, prove that there exist
regular languages that cannot be accepted by any planar deterministic finite
automaton. (Hint: Exercise 2.21 indicates that the average degree of a
node in a planar graph is always less than six, so that every planar graph
must have at least one vertex of degree less than six. Thus a planar finite
automaton must have at least one state with no more than five transitions
leading into or out of that state.)
Exercise 3.15 Write regular expressions for the following languages over
{0, 1}:
1. The language of Exercise 3.5(1).
2. The language of Exercise 3.5(2).
3. The set of all strings with at most one triple of adjacent 0s.
4. The set of all strings not containing the substring 110.
5. The set of all strings with at most one pair of consecutive 0s and at
most one pair of consecutive 1s.
6. The set of all strings in which every pair of adjacent 0s appears before
any pair of adjacent 1s.

Exercise 3.16 Let P and Q be regular expressions. Which of the following
equalities is true? For those that are true, prove it by induction; for the
others, give a counterexample.
1. $(P^*)^* = P^*$
2. $(P + Q)^* = (P^* Q^*)^*$
3. $(P + Q)^* = P^* + Q^*$
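Before attempting a proof or hunting for a counterexample, a candidate identity between regular expressions can be sanity-checked by comparing the two languages on all strings up to some length. The Python sketch below does this for the identity $(P + Q)^* = (P^* Q^*)^*$ with the sample choice P = 0 and Q = 1. The helper names concat and star, the length bound, and the sample languages are all choices made here for illustration. Agreement up to a bound is evidence, not a proof.

```python
def concat(A, B, n):
    """Strings of the concatenation AB that have length at most n."""
    return {a + b for a in A for b in B if len(a) + len(b) <= n}

def star(A, n):
    """Strings of A* that have length at most n."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more factor from A.
        frontier = concat(frontier, A, n) - result
        result |= frontier
    return result

n = 6                      # length bound (arbitrary choice)
P, Q = {"0"}, {"1"}        # sample languages standing in for P and Q

lhs = star(P | Q, n)                                # (P + Q)*
rhs = star(concat(star(P, n), star(Q, n), n), n)    # (P* Q*)*
print(lhs == rhs)
```

The same two helpers, together with set union for +, are enough to test the other equalities on small instances.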
Exercise 3.17 For each of the following languages, give a proof that it is
or is not regular.
1. $\{x \in \{0, 1\}^* \mid x \ne x^R\}$
2. $\{x \in \{0, 1, 2\}^* \mid x = w2w, \text{ with } w \in \{0, 1\}^*\}$
3. $\{x \in \{0, 1\}^* \mid x = w^R w y, \text{ with } w, y \in \{0, 1\}^+\}$
4. $\{x \in \{0, 1\}^* \mid x \notin \{01, 10\}^*\}$
5. The set of all strings (over {0, 1}) that have equal numbers of 0s and
1s and such that the number of 0s and the number of 1s in any prefix
of the string never differ by more than two.
6. $\{0^l 1^m 0^n \mid n \le l \text{ or } n \le m,\ l, m, n \in \mathbb{N}\}$
7. $\{0^{\lfloor \sqrt{n} \rfloor} \mid n \in \mathbb{N}\}$
8. The set of all strings x (over {0, 1}) such that, in at least one substring
of x, there are four more 1s than 0s.
9. The set of all strings over {0, 1} that have the same number of
occurrences of the substring 01 as of the substring 10. (For instance,
we have $101 \in L$ and $1010 \notin L$.)
10. $\{0^i 1^j \mid \gcd(i, j) = 1\}$ (that is, i and j are relatively prime)
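For the language defined by equal numbers of occurrences of the substrings 01 and 10, a direct membership test helps in exploring small cases before deciding on regularity. The Python sketch below counts occurrences position by position; the function names are hypothetical, and the test is only an executable restatement of the definition, not an automaton.

```python
def count_occurrences(s, pattern):
    """Number of positions of s at which pattern occurs."""
    return sum(
        1
        for i in range(len(s) - len(pattern) + 1)
        if s.startswith(pattern, i)
    )

def same_01_and_10(s):
    """True if s has as many occurrences of 01 as of 10."""
    return count_occurrences(s, "01") == count_occurrences(s, "10")
```

On the two strings quoted in the exercise, same_01_and_10("101") holds and same_01_and_10("1010") does not.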

Exercise 3.18 Let L be $\{0^n 1^n \mid n \in \mathbb{N}\}$, our familiar nonregular language.
Give two different proofs that the complement of L (with respect to $\{0, 1\}^*$)
is not regular.
Exercise 3.19 Let $\Sigma$ be composed of all two-component vectors with
entries of 0 and 1; that is, $\Sigma$ has four characters in it: $\binom{0}{0}$, $\binom{0}{1}$, $\binom{1}{0}$, and
$\binom{1}{1}$. Decide whether each of the following languages over $\Sigma$ is regular:

1. The set of all strings such that the top row is the reverse of the
bottom row. For instance, we have $\binom{0}{0}\binom{1}{0}\binom{0}{1}\binom{0}{0} \in L$ and $\binom{0}{0}\binom{1}{1}\binom{0}{1}\binom{0}{0} \notin L$.
2. The set of all strings such that the top row is the complement of the
bottom row (that is, where the top row has a 1, the bottom row
has a 0 and vice versa).
3. The set of all strings such that the top row has the same number of
1s as the bottom row.
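The three conditions of this exercise are easy to test on concrete strings once a representation is fixed. The Python sketch below models a string over the vector alphabet as a list of (top, bottom) pairs, a representation chosen here for illustration; the function names are likewise hypothetical. The two sample strings are the ones quoted for the first language, read with each vector's top entry first.

```python
def top_row(w):
    """Sequence of top entries of a string of (top, bottom) pairs."""
    return [t for t, _ in w]

def bottom_row(w):
    """Sequence of bottom entries of a string of (top, bottom) pairs."""
    return [b for _, b in w]

def top_is_reverse_of_bottom(w):
    return top_row(w) == bottom_row(w)[::-1]

def top_is_complement_of_bottom(w):
    return all(t != b for t, b in w)

def same_number_of_ones(w):
    return sum(top_row(w)) == sum(bottom_row(w))

w1 = [(0, 0), (1, 0), (0, 1), (0, 0)]   # first sample string
w2 = [(0, 0), (1, 1), (0, 1), (0, 0)]   # second sample string
```

With this reading, top_is_reverse_of_bottom(w1) holds and top_is_reverse_of_bottom(w2) does not, in agreement with the examples in the text.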
Exercise 3.20 Let $\Sigma$ be composed of all three-component vectors with
entries of 0 and 1; thus $\Sigma$ has eight characters in it. Decide whether each of
the following languages over $\Sigma$ is regular:
1. The set of all strings such that the sum of the first row and second
row equals the third row, where each row is read left-to-right as
an unsigned binary integer.
2. The set of all strings such that the product of the first row and
second row equals the third row, where each row is read left-to-right
as an unsigned binary integer.
Exercise 3.21 Recall that Roman numerals are written by stringing together
symbols from the alphabet $\Sigma = \{I, V, X, L, C, D, M\}$, always using
the largest symbol that will fit next, with one exception: the last digit is
obtained by subtraction from the previous one, so that 4 is IV, 9 is IX,
40 is XL, 90 is XC, 400 is CD, and 900 is CM. For example, the number
4999 is written MMMMCMXCIX while the number 1678 is written
MDCLXXVIII. Is the set of Roman numerals regular?
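The writing rule can be made concrete with a short Python sketch. It treats the six subtractive pairs named in the exercise as additional two-letter symbols and then applies the rule of always using the largest symbol that fits; this is one convenient way to model the rule, and the names VALUES and to_roman are chosen here for illustration.

```python
# Symbols and the six subtractive pairs from the exercise, largest first.
VALUES = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def to_roman(n):
    """Write the positive integer n as a Roman numeral."""
    if n <= 0:
        raise ValueError("Roman numerals denote positive integers")
    pieces = []
    for value, symbol in VALUES:
        # Use the current symbol as often as it fits, then move on.
        count, n = divmod(n, value)
        pieces.append(symbol * count)
    return "".join(pieces)
```

The sketch reproduces the two numerals given in the exercise: to_roman(4999) yields MMMMCMXCIX and to_roman(1678) yields MDCLXXVIII.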
Exercise 3.22 Let L be the language over $\{0, 1, +, \cdot\}$ that consists of
all legal (nonempty) regular expressions written without parentheses and
without Kleene closure (the symbol $\cdot$ stands for concatenation). Is L regular?
Exercise 3.23 Given a string x over the alphabet {a, b, c}, define $\|x\|$ to
be the value of the string x according to the evaluation procedure defined in
Exercise 3.12. Is the language $\{xy \mid \|x\| = \|y\|\}$ regular?
Exercise 3.24 A unitary language is a nonempty regular language that is
accepted by a deterministic finite automaton with a single accepting state.
Prove that, if L is a regular language, then it is unitary if and only if,
whenever strings u, uv, and w belong to L, then so does string wv.
Exercise 3.25 Prove or disprove each of the following assertions.
1. If $L^*$ is regular, then $L$ is regular.
2. If $L = L_1 L_2$ is regular and $L_2$ is finite, then $L_1$ is regular.
3. If $L = L_1 + L_2$ is regular and $L_2$ is finite, then $L_1$ is regular.
4. If $L = L_1 / L_2$ is regular and $L_2$ is regular, then $L_1$ is regular.
Exercise 3.26 Let L be a language and define the language
$\mathrm{SUB}(L) = \{x \mid \exists w \in L,\ x \text{ is a subsequence of } w\}$. In words, SUB(L) is the set of all
subsequences of strings of L. Prove that, if L is regular, then so is SUB(L).
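For a finite language the definition of SUB can be computed directly, which is a convenient way to get a feel for the operation. The Python sketch below enumerates subsequences by choosing index sets; the function names subsequences and sub are illustrative, and the code covers only finite sets of strings, not the general regular case that the exercise asks about.

```python
from itertools import combinations

def subsequences(w):
    """Set of all subsequences of the string w (including the empty string and w)."""
    return {
        "".join(w[i] for i in positions)
        for k in range(len(w) + 1)
        for positions in combinations(range(len(w)), k)
    }

def sub(L):
    """SUB(L) for a finite set L of strings."""
    result = set()
    for w in L:
        result |= subsequences(w)
    return result
```

For example, sub({"ab", "c"}) is the set containing the empty string, a, b, ab, and c.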
Exercise 3.27 Let L be a language and define the language
$\mathrm{CIRC}(L) = \{w \mid w = xy \text{ and } yx \in L\}$. If L is regular, does it follow that CIRC(L) is also
regular?
Exercise 3.28 Let L be a language and define the language
$\mathrm{NPR}(L) = \{x \in L \mid x = yz \text{ and } z \ne \varepsilon \Rightarrow y \notin L\}$; that is, NPR(L) is composed of exactly those
strings of L that are prefix-free (the proper prefixes of which are not also
in L). Prove that, if L is regular, then so is NPR(L).
Exercise 3.29 Let L be a language and define the language
$\mathrm{PAL}(L) = \{x \mid x x^R \in L\}$, where $x^R$ is the reverse of string x; that is, PAL(L) is composed of the
first half of whatever palindromes happen to belong to L. Prove that, if L
is regular, then so is PAL(L).
Exercise 3.30 Let L be any regular language and define the language
$\mathrm{FL}(L) = \{xz \mid \exists y,\ |x| = |y| = |z| \text{ and } xyz \in L\}$; that is, FL(L) is composed
of the first and last thirds of strings of L that happen to have length 3k for
some k. Is FL(L) always regular?
Exercise 3.31 Let L be a language and define the language FRAC(i, j)(L)
to be the set of strings x such that there exist strings $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_j$
with $x_1 \cdots x_{i-1}\, x\, x_{i+1} \cdots x_j \in L$ and $|x_1| = \cdots = |x_{i-1}| = |x_{i+1}| = \cdots = |x_j| = |x|$.
That is, FRAC(i, j)(L) is composed of the ith of j pieces of equal length
of strings of L that happen to have length divisible by j. In particular,
FRAC(1, 2)(L) is made of the first halves of even-length strings of L and
FRAC(3, 4)(L) is the language used in Example 3.6. Prove that, if L is
regular, then so is FRAC(i, j)(L).
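On a finite language the definition of FRAC(i, j) amounts to slicing, which the following Python sketch makes explicit. The function name frac and its argument order are choices made here; the code handles only finite sets of strings and is meant as a restatement of the definition, not as a step toward the proof.

```python
def frac(i, j, L):
    """FRAC(i, j)(L) for a finite set L of strings, with 1 <= i <= j."""
    result = set()
    for w in L:
        if len(w) % j == 0:
            k = len(w) // j                      # common length of the j pieces
            result.add(w[(i - 1) * k : i * k])   # the ith piece
    return result
```

For instance, frac(1, 2, {"abcd", "abc"}) is {"ab"}, since abc has odd length, and frac(3, 4, {"abcdefgh"}) is {"ef"}.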
Exercise 3.32 Let L be a language and define the language
$f(L) = \{x \mid \exists y, z,\ |y| = 2|x| = 4|z| \text{ and } xyxz \in L\}$. Prove that, if L is regular, then so is
f(L).
Exercise 3.33 Prove that the language SUB(L) (see Exercise 3.26) is
regular for any choice of language L; in particular, L need not be regular.
Hint: observe that the set of subsequences of a fixed string is finite and
thus regular, so that the set of subsequences of a finite collection of strings
is also finite and regular. Let S be any set of strings. We say that a string x
is a minimal element of S if x has no proper subsequence in S. Let M(L)
be the set of minimal elements of the complement of SUB(L). Prove that
M(L) is finite by showing that no element of M(L) is a subsequence of any
other element of M(L) and that any set of strings with that property must
be finite. Conclude that the complement of SUB(L), and hence SUB(L) itself, is regular.

3.7 Bibliography
The first published discussion of finite-state machines was that of McCulloch and Pitts [1943], who presented a version of neural nets. Kleene [1956]
formalized the notion of a finite automaton and also introduced regular expressions, proving the equivalence of the two models (Theorem 3.3 and Section 3.3.3). At about the same time, three independent authors, Huffman
[1954], Mealy [1955], and Moore [1956], also discussed the finite-state
model at some length, all from an applied point of view: all were working on the problem of designing switching circuits with feedback loops,
or sequential machines, and proposed various design and minimization
methods. The nondeterministic finite automaton was introduced by Rabin
and Scott [1959], who proved its equivalence to the deterministic version
(Theorem 3.1). Regular expressions were further developed by Brzozowski
[1962, 1964]. The pumping lemma (Theorem 3.4) is due to Bar-Hillel et
al. [1961], who also investigated several closure operations for regular languages. Closure under quotient (Theorem 3.5) was shown by Ginsburg
and Spanier [1963]. Several of these results use a grammatical formalism
instead of regular expressions or automata; this formalism was created in
a celebrated paper by Chomsky [1956]. Exercises 3.31 and 3.32 are examples of proportional removal operations; Seiferas and McNaughton [1976]
characterized which operations of this type preserve regularity. The interested reader should consult the classic text of Hopcroft and Ullman [1979]
for a lucid and detailed presentation of formal languages and their relation
to automata; the texts of Harrison [1978] and Salomaa [1973] provide
additional coverage.
