Chapter 6: Formal Language Theory
6.1 Languages
What is a language? Formally, a language L is defined as a set (possibly
infinite) of strings over some finite alphabet.
Definition 7 (Language) A language L is a possibly infinite set of strings
over a finite alphabet Σ.
We define Σ∗ as the set of all possible strings over some alphabet Σ. Thus
L ⊆ Σ∗. The set of all possible languages over some alphabet Σ is the set of
all possible subsets of Σ∗, i.e. 2^{Σ∗} or ℘(Σ∗). This may seem rather simple,
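These definitions are easy to make concrete. Here is a small sketch in Python (the alphabet and the sample language are illustrative choices of mine, not anything fixed by the definitions): it enumerates a finite slice of Σ∗ and carves out a language as a subset of it.

```python
from itertools import product

SIGMA = ("a", "b")  # a finite alphabet Σ (illustrative choice)

def sigma_star(max_len):
    """All strings over Σ of length <= max_len: a finite slice of Σ*."""
    return ["".join(p) for n in range(max_len + 1)
            for p in product(SIGMA, repeat=n)]

# A language L ⊆ Σ*: here, strings with an odd number of b's (a sample choice).
def in_L(s):
    return s.count("b") % 2 == 1

L = [s for s in sigma_star(3) if in_L(s)]
```

Note that Σ∗ itself is infinite; the length bound is only there so the enumeration terminates.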
6.2 Grammars
A grammar is a way to characterize a language L, a way to list out which
strings of Σ∗ are in L and which are not. If L is finite, we could simply list
the strings, but languages by definition need not be finite. In fact, all of the
languages we are interested in are infinite. This is, as we showed in chapter 2,
also true of human language.
Relating the material of this chapter to that of the preceding two, we
can view a grammar as a logical system by which we can prove things. For
example, we can view the strings of a language as WFFs. If we can prove
some string u with respect to some language L, then we would conclude that
u is in L, i.e. u ∈ L.
Another way to view a grammar as a logical system is as a set of formal
statements we can use to prove that some particular string u follows from
some initial assumption. This, in fact, is precisely how we presented the
syntax of sentential logic in chapter 4. For example, we can think of the
symbol WFF as the initial assumption or symbol of any derivational tree of
a well-formed formula of sentential logic. We then follow the rules for atomic
statements (page 47) and WFFs (page 47).
Our notion of grammar will be more specific, of course. The grammar
includes a set of rules from which we can derive strings. These rules are
effectively statements of logical equivalence of the form: ψ → ω, where ψ
and ω are strings.[1]
Consider again the WFFs of sentential logic. We know a formula like
(p∧q ′ ) is well-formed because we can progress upward from atomic statements
to WFFs showing how each fits the rules cited above. For example, we
know that p is an atomic statement and q is an atomic statement. We also
know that if q is an atomic statement, then so is q ′ . We also know that
any atomic statement is a WFF. Finally, we know that two WFFs can be
assembled together into a WFF with parentheses around the whole thing and
a conjunction ∧ in the middle.
We can represent all these steps in the form ψ → ω if we add some
additional symbols. Let’s adopt W for a WFF and A for an atomic statement.
If we know that p and q can be atomic statements, then this is equivalent to
A → p and A → q. Likewise, we know that any atomic statement followed
by a prime is also an atomic statement: A → A′ . We know that any atomic
statement is a WFF: W → A. Last, we know that any two WFFs can be
[1] These statements seem to go in only one direction, yet they are not bound by the
restriction we saw in first-order logic, where a substitution based on logical consequence
can only apply to an entire formula. It's probably best to understand these statements
as more like biconditionals than conditionals, even though the traditional symbol
here is the same as for a logical conditional.
conjoined: W → (W ∧ W ).
Each of these rules is part of the grammar of the syntax of WFFs. If
every part of a formula follows one of the rules of the grammar of the syntax
of WFFs, then we say that the formula is indeed a WFF.
Returning to the example (p ∧ q ′ ), we can show that every part of the
formula follows one of these rules by constructing a tree.
(6.1) [W ( [W [A p]] ∧ [W [A [A q] ′]] )]  (the tree is shown here as a labeled bracketing)
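The rule system can also be run mechanically. Here is a small recognizer (a Python sketch; the function names and the use of the ASCII apostrophe for the prime are my own choices) that checks strings directly against the rule schemata A → p, A → q, A → A′, W → A, and W → (W ∧ W).

```python
# A -> p | q ; A -> A' ; W -> A ; W -> ( W ∧ W )
def is_atomic(s):
    if s in ("p", "q"):
        return True
    # A -> A' : an atomic statement followed by a prime
    return s.endswith("'") and is_atomic(s[:-1])

def is_wff(s):
    if is_atomic(s):        # W -> A
        return True
    # W -> ( W ∧ W ): split at the conjunction at parenthesis depth 1
    # (a well-formed formula has at most one such conjunction)
    if len(s) >= 5 and s.startswith("(") and s.endswith(")"):
        depth = 0
        for i, ch in enumerate(s):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "∧" and depth == 1:
                return is_wff(s[1:i]) and is_wff(s[i + 1:-1])
    return False
```

Running `is_wff("(p∧q')")` reproduces the derivation pictured in (6.1): p and q′ are atomic, hence WFFs, and the parenthesized conjunction of two WFFs is a WFF.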
Looking more closely at R, we will require that the left side of a rule
contain at least one non-terminal element and any number of other elements.
We define Σ as VT ∪ VN, all of the terminals and non-terminals together. R
is a finite set of ordered pairs from Σ∗ VN Σ∗ × Σ∗. Thus ψ → ω is equivalent
to ⟨ψ, ω⟩.
We will see that these three types of grammars characterize successively
more restrictive classes of languages and can be paired with specific types of
abstract models of computers. We will also see that the formal properties of the most
restrictive grammar types are quite well understood, and that as we move up
the hierarchy, the systems become less and less well understood, or, if you
like, more and more interesting.
Let’s look at a few examples. For all of these, assume the alphabet is
Σ = {a, b, c}.
How might we define a grammar for the language that includes all strings
composed of one instance of b preceded by any number of instances of a:
{b, ab, aab, aaab, . . .}? We must first decide what sort of grammar to write
among the three types we’ve discussed. In general, context-free grammars
are the easiest and most intuitive to write. In this case, we might have
something like this:
(6.3) S → A b
A→ǫ
A→Aa
(6.4) [S [A ǫ] b]   [S [A [A ǫ] a] b]   [S [A [A [A ǫ] a] a] b]
(6.5) VT = {a, b}
VN = {S, A}
S = S
R = { S → A b, A → ǫ, A → A a }
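The grammar in (6.5) can be put to work directly. The following Python sketch (the dictionary encoding of R is my own) enumerates the strings the grammar derives, up to a length bound, by always rewriting the leftmost non-terminal:

```python
from collections import deque

# CFG (6.5): S -> A b ; A -> ǫ ; A -> A a   (ǫ is the empty right side [])
RULES = {"S": [["A", "b"]], "A": [[], ["A", "a"]]}

def generate(max_len):
    """Terminal strings derivable from S, up to length max_len, by a
    breadth-first search over sentential forms."""
    out, seen = set(), set()
    queue = deque([("S",)])
    while queue:
        form = queue.popleft()
        for i, sym in enumerate(form):
            if sym in RULES:  # leftmost non-terminal: rewrite it every way
                for rhs in RULES[sym]:
                    new = form[:i] + tuple(rhs) + form[i + 1:]
                    # one unit of slack: the ǫ-rule can still shrink the form
                    if len(new) <= max_len + 1 and new not in seen:
                        seen.add(new)
                        queue.append(new)
                break
        else:  # no non-terminals left: a terminal string of the language
            if len(form) <= max_len:
                out.add("".join(form))
    return sorted(out, key=len)
```

Calling `generate(3)` yields b, ab, and aab, the three strings whose trees are drawn in (6.4). The length bound is sound for this particular grammar because each sentential form contains at most one A, which can shrink the form by at most one symbol.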
Other grammars are possible for this language too. For example:
(6.6) S → b
S→Ab
A→a
A→Aa
This grammar is context-free, but also qualifies as context-sensitive: we no
longer have ǫ on the right side of any rule, and a single non-terminal on the left
qualifies as a string of terminals and non-terminals. This grammar produces
the following trees for the same three strings.
(6.7) [S b]   [S [A a] b]   [S [A [A a] a] b]
We can also write a right-linear grammar that characterizes this language.
(6.8) S → b
S→aS
This produces trees as follows for our three examples.
(6.9) [S b]   [S a [S b]]   [S a [S a [S b]]]
Let’s consider a somewhat harder case: a language where strings begin
with an a, end with a b, with any number of intervening instances of c, e.g.
{ab, acb, accb, . . .}. This can be described using all three grammar types.
First, a context-free grammar:
(6.10) S → a C b
C→C c
C→ǫ
This grammar is neither right-linear nor context-sensitive. It produces trees
like these:
(6.11) [S a [C ǫ] b]   [S a [C [C ǫ] c] b]   [S a [C [C [C ǫ] c] c] b]
Here is a right-linear grammar that generates the same strings:
(6.12) S → a C
C→cC
C→b
(6.13) [S a [C b]]   [S a [C c [C b]]]   [S a [C c [C c [C b]]]]
Finally, here is a grammar for this language that qualifies as context-sensitive:
(6.14) S → a b
S→aC b
C→C c
C→c
(6.15) [S a b]   [S a [C c] b]   [S a [C [C c] c] b]
We will see that the sets of languages that can be described by the three
types of grammar are not the same. Right-linear grammars can only accommodate
a subset of the languages that can be treated with context-free and
context-sensitive grammars. If we set aside the null string ǫ, context-free
grammars can only handle a subset of the languages that context-sensitive
grammars can treat.
(6.16) [DFA diagram: states q0 and q1; an arc labeled a looping on each
state; an arc labeled b from q0 to q1 and another labeled b from q1 back to q0]
There are two states q0 and q1 . The first state, q0 , is the designated start
state and the second state, q1 , is a designated final state. This is indicated
with a dark circle for the start state and a double circle for any final state.
The alphabet Σ is defined as {a, b}.
This automaton describes the language where all strings contain an odd
number of the symbol b, for it is only with an input string that satisfies that
restriction that the automaton will end up in state q1 . For example, let’s
go through what happens when the machine reads the string bab. It starts
in state q0 and reads the first symbol b. It then follows the arc labeled b to
state q1 . It then reads the symbol a and follows the arc from q1 back to q1 .
Finally, it reads the last symbol b and follows the arc back to q0 . Since q0 is
not a designated final state, the string is not accepted.
Consider now a string abbb. The machine starts in state q0 and reads the
symbol a. It then follows the arc back to q0 . It reads the first b and follows
the arc to q1 . It reads the second b and follows the arc labeled b back to q0 .
Finally, it reads the last b and follows the arc from q0 back to q1 . Since q1 is
a designated final state, the string is accepted.
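The two walk-throughs above can be replayed mechanically. Here is a Python sketch of the DFA in (6.16) as a transition table plus a driver loop (the table encoding is my own):

```python
# Transition table for the DFA of (6.16): a-loops on both states,
# b-arcs back and forth between q0 and q1.
DELTA = {
    ("q0", "a"): "q0", ("q0", "b"): "q1",
    ("q1", "a"): "q1", ("q1", "b"): "q0",
}

def accepts(s, start="q0", finals=frozenset({"q1"})):
    """Run the DFA on s; accept iff it ends in a final state."""
    state = start
    for sym in s:
        state = DELTA[(state, sym)]
    return state in finals
```

As in the text, `accepts("bab")` is false (two b's, ending in q0) and `accepts("abbb")` is true (three b's, ending in q1).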
We can define a DFA more formally as follows:
For example, in the DFA in figure 6.16, K is {q0 , q1 }, Σ is {a, b}, q0 is the
designated start state and F = {q1 }. The function δ has the following domain
and range:
The steps of the derivation are encoded with the turnstile symbol ⊢. The
entire derivation is given below:
(6.20) [NFA diagram: states q0 (the start state) and q1 (a final state),
with arcs labeled a and b; more than one b-arc leaves a single state,
which is what makes the automaton non-deterministic]
Given that there are multiple paths through an NFA for any particular
string, how do we assess whether a string is accepted by the automaton? To
see if some string is accepted by an NFA, we check whether there is at least
one path through the automaton that terminates in a state of F.
Consider the automaton above and the string bab. There are three possible
paths through the automaton for this string. The first, (6.22a), doesn't terminate.
The second terminates, but only in a non-final state. The third, (6.22c), terminates
in a final state. Hence, since there is at least one path that terminates
in a final state, the string is accepted.
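The "at least one path" test does not require enumerating paths one by one: we can track the whole set of states reachable on the input read so far. A Python sketch (the particular NFA below is my own illustrative example, accepting strings that end in b; it is not the chapter's (6.20)):

```python
def nfa_accepts(s, delta, start, finals):
    """Accept iff at least one path ends in a final state. We carry the set
    of all states reachable on the input so far (no ǫ-arcs handled here)."""
    current = {start}
    for sym in s:
        current = {p for q in current for p in delta.get((q, sym), ())}
    return bool(current & finals)

# Illustrative NFA (mine): on b, q0 can either loop or guess that this
# is the last symbol and jump to the final state q1.
DELTA = {("q0", "a"): {"q0"}, ("q0", "b"): {"q0", "q1"}}
```

A path that gets stuck simply drops out of the set, which is exactly how the non-terminating path (6.22a) is discarded.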
It’s a little trickier when the NFA contains arcs labeled with ǫ. For
example:
(6.23) [NFA diagram: states q0 and q1; two arcs labeled a from q0; an
ǫ-arc from q1 to q0; further arcs labeled a and b]
Here we have the usual sort of non-determinism with two arcs labeled with
a from q0 . We also have an arc labeled ǫ from q1 to q0 . This latter sort of
arc can be followed at any time without consuming a symbol. Let’s consider
how a string like aba might be parsed by this machine. The following chart
shows all possible paths.
Let’s show this. DFAs are obviously a subcase of NFAs; hence any language
generated by a DFA is trivially generated by an NFA.
[3] This can arise if we have cycles involving ǫ.
δ′([q1, q2, . . . , qn], a) = [p1, p2, . . . , pn], where
∆({q1, q2, . . . , qn}, a) = {p1, p2, . . . , pn}.
The latter means that we apply ∆ to every state in the first list of states and
union together the resulting states.
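The construction can be stated compactly in code. This Python sketch (the dictionary encoding of ∆ is my own) builds only the subset-states actually reachable from the new start state, so unreachable states never arise:

```python
from itertools import chain

def nfa_to_dfa(delta, start, finals, alphabet):
    """Subset construction for an NFA without ǫ-arcs: each DFA state is a
    frozenset of NFA states; only reachable subsets are constructed."""
    start_set = frozenset([start])
    dfa_delta, seen, todo = {}, {start_set}, [start_set]
    while todo:
        S = todo.pop()
        for a in alphabet:
            # apply ∆ to every state in S and union the results
            T = frozenset(chain.from_iterable(delta.get((q, a), ()) for q in S))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    dfa_finals = {S for S in seen if S & set(finals)}
    return dfa_delta, start_set, dfa_finals

# An illustrative NFA (my own example, accepting strings that end in b):
DELTA = {("q0", "a"): {"q0"}, ("q0", "b"): {"q0", "q1"}}
dfa, dstart, dfinals = nfa_to_dfa(DELTA, "q0", {"q1"}, "ab")
```

Any subset containing an original final state becomes final, exactly as in the text's chart; the empty set, when it arises, serves as the dead state [∅].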
Applying this to the NFA in (6.20), we get this chart for the new DFA.
The initial start state was q0 , so the new start state is [q0 ]. Any set containing
a possible final state from the initial automaton is a final state in the new
automaton: [q1 ] and [q0 , q1 ]. The new automaton is given below.
[4] Recall that there will be 2^{|K|} of these.
(6.26) [DFA diagram: states [∅], [q0], [q1], and [q0, q1]; start state [q0];
final states [q1] and [q0, q1]; arcs labeled a and b]
This automaton accepts exactly the same language as the previous one.
If we can always construct a DFA from an NFA that accepts exactly the
same language, it follows that there is no language accepted by an NFA that
cannot be accepted by a DFA. □
Notice two things about the resulting DFA in (6.26). First, there is a
state that cannot be reached: [q1 ]. Such states can safely be pruned. The
following automaton is equivalent to (6.26).
(6.27) [the automaton of (6.26) with the unreachable state [q1] removed:
states [∅], [q0], and [q0, q1]]
Second, notice that the derived DFA can, in principle, be massively bigger
than the original NFA. In the worst case, if the original NFA has n states,
the new automaton can have as many as 2^n states.[5]
In the following, since NFAs and DFAs are equivalent, I will refer to the
general class as Finite State Automata (FSAs).
[5] There are algorithms for minimizing the number of states in a DFA, but they are
beyond the scope of this introduction. See Hopcroft and Ullman (1979). Even minimized,
it is generally true that an NFA will be smaller than its equivalent DFA.
(6.28) [FSA: start state q0, final state q1, and a single arc labeled a from
q0 to q1]
Second, we have concatenation of two regular languages by taking two
automata and connecting them with an arc labeled with ǫ.
(6.29) [two FSAs in sequence, with an ǫ-arc connecting FSA 1 to FSA 2]
We connect all final states of the first automaton with the start state of the
second automaton with ǫ-arcs. The final states of the first automaton are
made non-final. The start state of the second automaton is made a non-start
state.
Union is straightforward as well. We simply create a new start state
and then create arcs from that state to the former start states of the two
automata labeled with ǫ. We create a new final state as well, with ǫ-arcs
from the former final states of the two automata.
(6.30) [a new start state q0 with ǫ-arcs to FSA 1 and FSA 2, and ǫ-arcs
from both automata to a new final state q1]
Finally, we can get Kleene star by creating a new start state (which is
also a final state), a new final state, and an ǫ-loop between them.
(6.31) [a new start state q0, an ǫ-arc into FSA 1, an ǫ-arc from FSA 1 to
a new final state q1, and an ǫ-arc from q1 back to q0]
If we can construct an automaton for every step in the construction of a
regular language, it should follow that any regular language can be accepted
by some automaton.[8]
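The constructions just given can be written out directly. In this Python sketch (the representation and helper names are my own) an automaton is a triple (∆, start, finals), the empty label "" stands for ǫ, and acceptance uses the usual ǫ-closure; the sketch assumes each sub-automaton is used at most once, so arc sets are never shared.

```python
import itertools

_ids = itertools.count()

def new_state():
    return f"s{next(_ids)}"

def symbol(a):
    """FSA for the one-string language {a}, as in (6.28)."""
    s, f = new_state(), new_state()
    return {(s, a): {f}}, s, {f}

def concat(m1, m2):
    """(6.29): ǫ-arcs from the finals of m1 to the start of m2."""
    (d1, s1, f1), (d2, s2, f2) = m1, m2
    delta = {**d1, **d2}
    for q in f1:
        delta.setdefault((q, ""), set()).add(s2)
    return delta, s1, f2

def union(m1, m2):
    """(6.30): new start and final states, linked to both automata by ǫ-arcs."""
    (d1, s1, f1), (d2, s2, f2) = m1, m2
    s, f = new_state(), new_state()
    delta = {**d1, **d2, (s, ""): {s1, s2}}
    for q in f1 | f2:
        delta.setdefault((q, ""), set()).add(f)
    return delta, s, {f}

def star(m):
    """(6.31): new start and final states with an ǫ-loop between them."""
    d, s0, f0 = m
    s, f = new_state(), new_state()
    delta = {**d, (s, ""): {s0, f}, (f, ""): {s}}
    for q in f0:
        delta.setdefault((q, ""), set()).add(f)
    return delta, s, {f}

def _eclose(states, delta):
    out, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for p in delta.get((q, ""), ()):
            if p not in out:
                out.add(p)
                todo.append(p)
    return out

def accepts(m, s):
    delta, start, finals = m
    cur = _eclose({start}, delta)
    for a in s:
        cur = _eclose({p for q in cur for p in delta.get((q, a), ())}, delta)
    return bool(cur & finals)
```

For example, `concat(symbol("a"), star(symbol("b")))` accepts exactly the language ab∗.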
(6.32) a. S → aA
b. A → aA
c. A → aB
d. B → b
This generates the language where all strings are composed of two or more
instances of a, followed by exactly one b.
If we follow the construction of the FSA above, we get this:
(6.33) [FSA: states S, A, B, and final state F; arcs S --a--> A,
A --a--> A, A --a--> B, and B --b--> F]
[8] A rigorous proof would require that we go through this in the other direction as well,
from automaton to regular language.
This FSA accepts the same language generated by the right-linear grammar
in (6.32).
Notice now that if FSAs and right-linear grammars generate the same
set of languages and FSAs generate regular languages, then it follows that
right-linear grammars generate regular languages. Thus we have a three-way
equivalence between regular languages, right-linear grammars, and FSAs.
(6.34) [the DFA of (6.16) with final and non-final states exchanged: q0 is
now the final state and q1 is not]
This now generates the complement language: a∗(ba∗ba∗)∗. Every legal string
has an even number of instances of b (including zero), and any number of
instances of a.
With complement so defined, and DeMorgan’s Law (the set-theoretic ver-
sion), it follows that the regular languages are closed under intersection as
well. Recall the following equivalences from chapter 3.
(6.35) (X ∪ Y)′ = X′ ∩ Y′
(X ∩ Y)′ = X′ ∪ Y′
Therefore, since the regular languages are closed under union and under
complement, it follows that they are closed under intersection. Thus if we
want to intersect the languages L1 and L2, we union their complements and
take the complement of the result, i.e. L1 ∩ L2 = (L1′ ∪ L2′)′.
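Both ingredients are simple on complete DFAs. The text's route goes through complement and union; the sketch below (Python; my own encoding) shows the complement step directly, plus an equivalent direct route to intersection, the product construction, which runs two DFAs in parallel and accepts when both accept.

```python
def complement(states, delta, start, finals):
    """Complement a *complete* DFA: swap final and non-final states."""
    return states, delta, start, states - finals

def product(states1, d1, s1, f1, states2, d2, s2, f2, alphabet):
    """Product DFA: run both machines in lock step; accepting when both
    accept yields intersection, equivalently (L1' ∪ L2')' by DeMorgan."""
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for p in states1 for q in states2 for a in alphabet}
    states = {(p, q) for p in states1 for q in states2}
    return states, delta, (s1, s2), {(p, q) for p in f1 for q in f2}

def run(delta, start, finals, s):
    state = start
    for a in s:
        state = delta[(state, a)]
    return state in finals

# The odd-b DFA of (6.16); its complement is the even-b DFA of (6.34).
STATES = {"q0", "q1"}
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "a"): "q1", ("q1", "b"): "q0"}
```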
Consider the language a^n b^n, where some number of instances of a is
followed by exactly the same number of instances of b. This language cannot
be characterized with a right-linear grammar, but a context-free grammar for
it is simple:
(6.36) S → a S b
S→ǫ
Here are some sample trees produced by this grammar.
(6.37) [S ǫ]   [S a [S ǫ] b]   [S a [S a [S ǫ] b] b]
Another language type that cannot be treated with a right-linear grammar
is xx^R, where a string x is followed by its mirror-image x^R, including
strings like aa, bb, abba, baaaaaab, etc. This can be treated with a context-free
grammar like this:
(6.38) S → a S a
S→bS b
S→ǫ
This produces trees like this one:
(6.39) [S b [S a [S a [S a [S ǫ] a] a] a] b]  (the tree for baaaaaab)
The problem is that both sorts of language require that we keep track
of a potentially unbounded amount of information across the string. Context-free
grammars do this by allowing the two edges of the right side of a rule to depend on
each other (with other non-terminals in between). This sort of dependency
is, of course, not possible with a right-linear grammar.
For example, if the symbols a, b, and c are put on the stack in that order,
they can only be retrieved from the stack in the opposite order: c, b, and
then a. This is the intended sense of the term pushdown.[9]
Thus, at each step of the PDA, we need to know what state we are in,
what symbol is on the tape, and what symbol is on top of the stack. We can
then move to a different state, reading the next symbol on the tape, adding
or removing the topmost symbol of the stack, or leaving it intact. A string
is accepted by a PDA if the following hold:
[9] A stack is also referred to as "last in, first out" (LIFO) memory.
The PDA puts the symbol c on the stack every time it reads the symbol a
on the tape. As soon as it reads the symbol b, it removes the topmost c from
the stack and moves to state q1, where it removes a c from the stack for
every b that it reads on the tape. If the same number of instances of a and
b are read, then the stack will be empty when there are no more symbols on
the tape.
To see this more clearly, let us define a situation for a PDA as follows.
This is just like the situation of an FSA, except that it includes a specification
of the state of the stack in z.
Consider now the sequence of situations which shows the operation of the
previous PDA on the string aaabbb.
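That sequence of situations can be generated by a short simulation. Here is a Python sketch of the PDA just described; I am assuming acceptance means ending in the final state q1 with an empty stack and the input exhausted, and the machine, as described, handles n ≥ 1.

```python
# Deterministic PDA for a^n b^n (n >= 1): q0 pushes a c for each a; the
# first b pops a c and moves to q1; q1 pops one c for each further b.
def pda_accepts(s):
    state, stack = "q0", []
    for sym in s:
        if state == "q0" and sym == "a":
            stack.append("c")
        elif state == "q0" and sym == "b" and stack:
            stack.pop()
            state = "q1"
        elif state == "q1" and sym == "b" and stack:
            stack.pop()
        else:
            return False  # no applicable transition: this path is stuck
    return state == "q1" and not stack
```

On aaabbb the stack grows to three c's and shrinks back to empty, so the string is accepted; on aabbb or aaabb the stack and input fall out of step and the string is rejected.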
Notice that this PDA is deterministic in the sense that there is no more
than one arc from any state on the same symbol.[10] This PDA still qualifies
as non-deterministic under Definition 14, since deterministic automata are
always a subset of non-deterministic automata.
The context-free languages cannot all be treated with deterministic PDAs,
however. Consider the language xx^R, where a string is followed by its mirror
image, e.g. aa, abba, bbaabb, etc. We've already seen that this is trivially
generated using context-free rules. Here is a non-deterministic PDA that
generates the same language.
[10] Notice that the PDA is not complete, as there is no arc on a from state q1.
Here is the sequence of situations for abba that results in the string being
accepted.
Notice that at any point where two identical symbols occur in a row, the PDA
can guess wrong and presume the reversal has occurred or that it has not.
In the case of abba, the second b does signal the beginning of the reversal,
but in abbaabba, the second b does not signal the beginning of the reversal.
With a string of all identical symbols, like aaaaaa, there are many ways to
go wrong.
This PDA is necessarily non-deterministic. There is no way to know,
locally, when the reversal begins. It then follows that the set of languages
accepted by deterministic PDAs is not equivalent to the set of
languages accepted by non-deterministic PDAs. For example, either kind of
PDA can accept a^n b^n, but only a non-deterministic PDA will accept xx^R.
We've said that non-deterministic PDAs accept the set of languages generated
by context-free grammars.
This is actually rather complex to show, but we will show how to construct
a non-deterministic PDA from a context-free grammar. Given a CFG G =
⟨VN, VT, S, R⟩, we construct a non-deterministic PDA as follows.
1. K = {q0 , q1 }
2. s = q0
3. F = {q1 }
4. Σ = VT
5. Γ = VN ∪ VT
There are only two states, one being the start state and the other the sole
final state. The input alphabet is identical to the set of terminal elements
allowed by the CFG and the stack alphabet is identical to the set of terminal
plus non-terminal elements.
The transition relation ∆ is constructed as follows:
1. (q0 , ǫ, ǫ) → (q1 , S) is in ∆.
(6.44) S → NP VP
VP → V NP
NP → N
N → John
N → Mary
V → loves
Let’s now look at how this PDA treats an input sentence like Mary loves
John.
This is not a proof that CFGs and PDAs are equivalent, but it shows the
basic logic of that proof.
Notice that if the grammar does not allow center-embedding beyond some
specific number of clauses n, then the grammar can easily be regular. For
example, imagine the upper bound on the a^n b^n pattern is n ≤ 3; this constitutes
a finite set of sentences and can just be listed.
Is natural language syntax context-free or context-sensitive? Shieber
(1985) argues that natural language syntax must be context-sensitive based
on data from Swiss German. Examples like the following are grammatical.
(Assume the sentence begins with the phrase Jan säit das ‘John said that’.)
This is equivalent to the language xx, e.g. aa, abab, abaaba, etc., which is
known not to be context-free.[13]
If the Swiss German pattern is correct, then it means that any formal
account of natural language syntax requires more than a PDA and that a
formalism based on context-free grammar is inadequate. Notice, however,
that, once again, it is essential that the center-embedding be unbounded.
must entertain the hypothesis that the reversal begins at any point between
1 and n. This entails that we must consider lots of paths for a long string.[14]
What this means, in concrete terms, is that if some phenomenon can
be treated in finite-state terms or in context-free terms, and efficiency is a
concern, go with the finite-state treatment.
Here # is used to mark initially unused portions of the input tape. The logic
of δ is that it maps from particular state/symbol combinations to a new
state, simultaneously (potentially) writing a symbol to the tape and moving
left or right.
TMs can describe far more complex languages than are thought to be
appropriate for human language. For example, a^n b^{n!} can be treated with a
TM. Likewise, a^n, where n is prime, can be handled with a TM. Hence, while
there is a lot of work in computer science on the properties of TMs, there
has not been a lot of work in grammatical theory using them.
Another machine type that we have not treated so far is the finite state
transducer (FST). The basic idea is that we start with an FSA, but label the
arcs with pairs of symbols. The machine can be thought of as reading two
tapes in parallel. If the pairs of symbols match what is on the two tapes—and
the machine finishes in a designated final state—then the pair of strings is
[14] It might be thought that we must consider an infinite number of paths, but this is
not necessarily so. Any non-deterministic PDA with an infinite number of paths for some
finite string can be converted into a non-deterministic PDA with only a finite number of
paths for that string. See Hopcroft and Ullman (1979).
accepted. Another way to think of these, however, is that the machine reads
one tape and spits out some symbol every time it transits an arc (perhaps
writing those latter symbols to a new blank tape).
Formally, we can define an FST as follows:
The relation ∆ moves from state to state pairing symbols of Σ with symbols
of Γ. The instances of ǫ in ∆ allow it to insert or remove symbols, thus
matching strings of different lengths.
For example, here is an FST that operates with the alphabet Σ = {a, b, c},
where anytime a b is confronted on one tape, the machine spits out c.
(6.50) [FST with a single state q0, with looping arcs labeled a:a, b:c,
and c:c]
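Read as a device that consumes one tape and writes the other, the transducer in (6.50) is easy to simulate. A Python sketch (my own encoding; I am assuming q0 is both the start state and a final state, and the arc pairs are exactly those in the diagram):

```python
# FST of (6.50): copy a and c unchanged, rewrite every b as c.
FST = {("q0", "a"): ("q0", "a"),
       ("q0", "b"): ("q0", "c"),
       ("q0", "c"): ("q0", "c")}

def transduce(s, start="q0", finals=frozenset({"q0"})):
    """Read s symbol by symbol, emitting the paired output symbol;
    return None if the machine does not end in a final state."""
    state, out = start, []
    for sym in s:
        state, o = FST[(state, sym)]
        out.append(o)
    return "".join(out) if state in finals else None
```

For example, `transduce("abcb")` yields "accc": every b on the input side is paired with a c on the output side.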
It should be apparent that the regular relations are closed under the
usual regular operations.[15] The regular relations are closed under a number
of further operations as well, e.g. reversal, inverse, and composition.
They are not closed under intersection and complementation, however.
For example, imagine we have
R1 = {⟨a^n, b^n c∗⟩ | n ≥ 0}
and
R2 = {⟨a^n, b∗ c^n⟩ | n ≥ 0}
The intersection is clearly not regular. Each of these relations is, however,
regular: we can see R1 as (a:b)∗ (ǫ:c)∗ and R2 as (ǫ:b)∗ (a:c)∗. It follows, of
course, that the regular relations are not closed under complementation (by
DeMorgan's Law).[16]
6.7 Summary
The chapter began with a presentation of the formal definition of language
and grammar.
[15] Note that the regular relations are equivalent to the "rational relations" of algebra,
for the mathematically inclined.
[16] Same-length regular relations are closed under intersection.
6.8 Exercises
1. Write a right-linear grammar that generates the language where strings
must have exactly one instance of a and any number of instances of the
other symbols.
5. Write a DFA that generates the language where strings must have
exactly one instance of a and any number of instances of the other
symbols.
11. Formalize this language as a regular language: all strings contain pre-
cisely three symbols.
12. Formalize this language as a regular language: all strings contain more
instances of a than of b, in any order, with no instances of c.