The Complexity of Some Regex Crossword Problems: November 2014

This document summarizes the complexity and decidability of regular expression crossword puzzles. Some key results discussed are: 1. Determining if a solution exists to a regex crossword puzzle is NP-complete. 2. Restricting the problem such that all row expressions are equal and all column expressions are equal still results in an NP-hard problem. 3. Allowing the crossword size to be unbounded results in an undecidable problem, equivalent to the halting problem. This can be used to encode the computation of a Turing machine.

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/268525625

The complexity of some regex crossword problems

Article · November 2014


Source: arXiv


1 author:

Stephen A. Fenner
University of South Carolina

All content following this page was uploaded by Stephen A. Fenner on 30 May 2016.



The complexity of some regex crossword problems
Stephen A. Fenner
University of South Carolina∗
[email protected]

January 27, 2015


arXiv:1411.5437v2 [cs.CC] 28 Nov 2014

Abstract
In a typical regular expression (regex) crossword puzzle, you are given two nonempty lists
R1 , . . . , Rm and C1 , . . . , Cn of regular expressions over some alphabet, and your goal is to fill
in an m × n grid with letters from that alphabet so that the string formed by the ith row is in
L(Ri ), and the string formed by the jth column is in L(Cj ), for all 1 ≤ i ≤ m and 1 ≤ j ≤ n.
Such a grid is a solution to the puzzle. It is known that determining whether a solution exists is
NP-complete. We consider a number of restrictions and variants to this problem where all the
Ri are equal to some regular expression R, and all the Cj are equal to some regular expression C.
We call the solution to such a puzzle an (R, C)-crossword. Our main results are the following:
1. There exists a fixed regular expression C over the alphabet {0, 1} such that the following
problem is NP-complete: “Given a regular expression R over {0, 1} and positive integers
m and n given in unary, does an m × n (R, C)-crossword exist?” This improves the result
mentioned above.
2. The following problem is NP-hard: “Given a regular expression E over {0, 1} and positive
integers m and n given in unary, does an m × n (E, E)-crossword exist?”
3. There exists a fixed regular expression C over {0, 1} such that the following problem is
undecidable (equivalent to the Halting Problem): “Given a regular expression R over
{0, 1}, does an (R, C)-crossword exist (of any size)?”
4. The following problem is undecidable (equivalent to the Halting Problem): “Given a reg-
ular expression E over {0, 1}, does an (E, E)-crossword exist (of any size)?”

Keywords: complexity, decidability, undecidability, regular expression, regex crossword, NP-complete,
two-dimensional language, picture language

1 Introduction
Regular expression crossword puzzles (regex crosswords, for short) share some traits
with traditional crossword puzzles and with sudoku. One is typically given two lists R1 , . . . , Rm
and C1 , . . . , Cn of regular expressions labeling the rows and columns, respectively, of an m × n
grid of blank squares. The object is to fill in the squares with letters so that each row, read left to
right as a string, matches (i.e., is in the language denoted by) the corresponding regular expression,
and similarly for each column, read top to bottom. The solution itself may have some additional

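[Figure 1 (image not reproduced): a 3 × 5 grid whose rows and columns are labeled with the regular expressions (FY|F|RG)+, (YE|OT)K, [NODE]+, (FI|A)+, (.)[IF]+, (Y|F)(.)\2[DAF]\1, (U|O|I)*T[FRO]+, and [KANE]*[GIN]*.]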

Figure 1: A regex crossword. The alphabet is {A, B, . . . , Z}. The expression syntax includes:
character classes [· · · ], which are matched by any single letter between the brackets; the period “.”,
which matches any single letter; back references \1 and \2, which are matched by whatever string
matched the first (respectively second) parenthesized subexpression.

property, e.g., spelling out a phrase or sentence in row major order. Figure 1 shows a 3 × 5 regex
crossword (“Royal Dinner”) from the website regexcrossword.com, with a unique solution [roy].
Regex crosswords have enjoyed some recent popularity, having been discussed in several popular
media sources [mik13, Bla13], and thanks to some websites where people can solve the puzzles online
[rc, rcs]. There are variants of the basic puzzle, including having two regular expressions for each
row and column, one to match each of the two opposite directions [rcs]. Another variant is a
hexagonal grid made up of hexagonal cells, with regular expressions for each of three separate
directions, created by Dan Gulotta from an idea by Palmer Mebane (see [Bla13]), that showed up
as part of the 2013 MIT Mystery Hunt [MIT, Bla13, mik13].
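
The rules just described can be sketched in a few lines of Python. This is an illustration only: the helper name `is_solution` is hypothetical, and the split of the Figure 1 expressions into rows and columns, their order, and the filled grid are assumptions reconstructed so that the check passes (the figure's layout is not recoverable from the text). Python's `re` module supports the character classes, the period, and the back references that the figure uses.

```python
import re

def is_solution(grid, row_patterns, col_patterns):
    """Return True iff every row and every column fully matches its pattern."""
    rows = ["".join(r) for r in grid]
    cols = ["".join(c) for c in zip(*grid)]   # read columns top to bottom
    if len(rows) != len(row_patterns) or len(cols) != len(col_patterns):
        return False
    return (all(re.fullmatch(p, s) for p, s in zip(row_patterns, rows))
            and all(re.fullmatch(p, s) for p, s in zip(col_patterns, cols)))

# Assumed assignment of the eight Figure 1 expressions (3 rows, 5 columns).
ROWS = [r"(Y|F)(.)\2[DAF]\1", r"(U|O|I)*T[FRO]+", r"[KANE]*[GIN]*"]
COLS = [r"(FI|A)+", r"(YE|OT)K", r"(.)[IF]+", r"[NODE]+", r"(FY|F|RG)+"]

# Assumed filled grid, chosen to satisfy the patterns above.
GRID = ["FOODF", "ITFOR", "AKING"]

print(is_solution(GRID, ROWS, COLS))
```

Using `re.fullmatch` rather than `re.match` matters here: a row or column must be in the language of its expression as a whole string, not merely have a matching prefix.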
A natural complexity theoretic question to ask is: How hard is it to solve a regex crossword in
general? In September 2014, Glen Takahashi asked on StackExchange whether this task is NP-hard
[Tak14]. (The same question has been asked by other people.) A positive answer to his query, along
with a proof, was posted by FrankW about a half hour later (see Appendix A, which includes a
different proof found previously and independently by the author). Another post observed the next
day that the solution existence problem is in NP.
In this paper, we determine the complexity and decidability of several problems related to
restricted regex crosswords. In Section 6, we show that two restrictions of the problem above remain
NP-hard: (1) when all the row expressions are equal to each other and all the column expressions
are equal to a fixed expression, independent of the input; and (2) when all the expressions (both
row and column) are equal. In both problems above, the dimensions of the crossword are given
in unary. But first, in Section 3, we consider the problem of existence of crosswords of any size.
We show that this question and related ones are equivalent to the Halting Problem and are thus
undecidable. (We also show that if only one of the two dimensions is unbounded, then the problem
becomes decidable, in fact, in PSPACE.) Thus, in the spirit of the Post Correspondence Problem
and questions about context-free languages, we have another simple yet undecidable problem in

∗Computer Science and Engineering Department, Columbia, SC 29208 USA.
automata theory, one accessible to any undergraduate theory student. The proofs given here are
just as accessible, and their ideas carry over to the complexity theoretic setting of Section 6.
The undecidability results we prove in Section 3 all follow from Lemma 3, which uses the cross-
word to encode a halting computation of a one-tape Turing machine as a two-dimensional tableau:
one dimension for space, the other for time. The row regular expression enforces consistency within
each configuration, and the row and column regular expressions together enforce the legality of the
machine’s transitions. In some sense, these are age-old techniques (albeit with a few new twists);
Lemma 3 could have been proved half a century ago. They bear some similarity to results in cel-
lular automata, to the Cook-Levin theorem, and to results of Berger from the 1960s showing the
undecidability of tiling the plane with Wang tiles (the so-called “domino problem” [Ber66], which
was the first proof that there exist finite tile sets that tile the whole plane but only aperiodically).
Berger’s construction is quite complicated, and although it may be possible to harness his result to
prove our Lemma 3, our proof is direct enough to stand on its own.
The results of Section 3 perhaps have their closest connection to the theory of two-dimensional
languages (picture languages) in formal language theory [GR97]. In fact, one can show that
the recognizable picture languages coincide exactly with the letter-to-letter projections of (R, C)-
crosswords [GR97, Theorem 8.6] (except that the empty picture may also be included in the lan-
guage). Recognizable picture languages can be defined in terms of finite objects known as tiling
systems [GR92] (cf. [GR97, Definition 7.2]), and given a tiling system T , it is not hard to show that
one can effectively find two regular expressions R and C (over some alphabet) and a projection π
that defines the same picture language as T . The existence problem for recognizable picture lan-
guages (“Given a tiling system, does it define a nonempty language?”) is known to be undecidable
([GR97, Theorem 9.1]), and so, putting these results together, we get that the existence problem
for (R, C)-crosswords is undecidable as well. This essentially proves most of the results we give in
Section 3, below, or at least weaker versions of them. However, the proof we give for Lemma 3
gives a much more direct reduction from the halting problem to (R, C)-crossword existence than
what can be put together using the results in [GR97]. In particular, we can fix the alphabet and
even the expression C, independent of the input. We will also need details of the proof later in the
paper.
In the undecidability results of Section 3 just described, the alphabet we use depends on the
Turing machine being simulated and may be quite large if the machine recognizes the Halting
Problem (e.g., a universal machine). In Section 4, which is the heart of the paper, we show that
restricting to a binary alphabet does not reduce the complexity of any of our problems. We do this
by giving a polynomial-time function mapping a regular expression over any alphabet to one over a
binary alphabet in a way that preserves the existence (and nonexistence) of crosswords (Lemma 9).
This turns out to be the most difficult result of the paper, using a surprisingly delicate construction
that was arrived at only after many failed attempts.
We give a number of open problems in Section 7.

2 Preliminaries
For any integers a and b with b > 0, we define a mod b to be the unique integer r such that 0 ≤ r < b
and a ≡ r (mod b).
Our notation for regular expressions is standard (see [Sip13]), except that in addition to the
three usual operations—union (∪), concatenation (juxtaposition), and Kleene-closure (∗)—we also

allow intersection (∩), so that L(r ∩ s) = L(r) ∩ L(s) for any regular expressions r and s. We treat
the ∩ operation only as syntactic sugar and not part of the formal definition of a regular expression;
it can be effectively removed from a regular expression to obtain an equivalent one by standard
techniques [HMU07, Sip13]. We also use the unary “+” and “?” operators as syntactic sugar: “E+”
is shorthand for EE∗, and “E?” is shorthand for E ∪ ε, where ε is matched by the empty string
only. We will also identify any finite set of strings with its corresponding regular expression. We
say a string w matches a regular expression R to mean that w ∈ L(R).

Definition 1. Let Σ be an alphabet. A Σ-crossword is a nonempty two-dimensional array X of
symbols from Σ (say, m × n, where m, n ≥ 1). The symbol in the ith row and jth column is usually
denoted x_{i,j} for 1 ≤ i ≤ m and 1 ≤ j ≤ n. Let R and C be any regular expressions over Σ. X is
an (R, C)-crossword iff each row of X, read left to right, matches R, and each column of X, read
top to bottom, matches C; that is, for all 1 ≤ i ≤ m, the string x_{i,1} x_{i,2} · · · x_{i,n} is in L(R), and for
all 1 ≤ j ≤ n, the string x_{1,j} x_{2,j} · · · x_{m,j} is in L(C).
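
Definition 1 translates directly into a brute-force search, sketched below under some assumptions: the function name `rc_crosswords` is hypothetical, the expressions are written in Python `re` syntax (so union is `|` rather than ∪), and the search is exponential in the grid size, so it is only meant for very small m and n.

```python
import re
from itertools import product

def rc_crosswords(R, C, m, n, alphabet="01"):
    """Yield every m x n (R, C)-crossword over `alphabet` as a tuple of row strings."""
    row_re, col_re = re.compile(R), re.compile(C)
    # Candidate rows: all length-n strings in L(R).
    rows = ["".join(t) for t in product(alphabet, repeat=n)
            if row_re.fullmatch("".join(t))]
    for grid in product(rows, repeat=m):
        # Keep the grid only if every column, read top to bottom, is in L(C).
        if all(col_re.fullmatch("".join(col)) for col in zip(*grid)):
            yield grid

# Example: R = C = 0*10* ("exactly one 1"), so 3 x 3 solutions are permutation matrices.
solutions = list(rc_crosswords("0*10*", "0*10*", 3, 3))
print(len(solutions))
```

With R = C = 0∗10∗ no 2 × 3 crossword exists, since two rows supply only two 1s for three columns.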

We may call a Σ-crossword simply a crossword if Σ is not relevant or is clear from the context.
The next definition is for purely technical reasons. Removing these restrictions does not affect
our complexity results.

Definition 2. We say that a regular expression is positive iff it is not matched by the empty
string. A pair (R, C) of regular expressions is plural iff both R and C are positive and every
(R, C)-crossword has at least two rows and at least two columns.

Given two regular expressions R and C, one can decide in polynomial time whether or not R is
positive and whether or not (R, C) is plural.
We abbreviate “computably enumerable” (a.k.a. recursively enumerable) by “c.e.” Our notions
of m-reduction (mapping reduction) and polynomial reduction come from Sipser [Sip13].

3 Undecidability
In this section, we prove that, given a regular expression R, it is undecidable whether an (R, C)-
crossword exists, for some fixed regular expression C. We also show that it is undecidable whether
an (R, R)-crossword exists. As mentioned in the introduction, it is already known that, given
both regular expressions R and C, determining whether an (R, C)-crossword exists is undecidable
[GR97].
We reduce from the Halting Problem. Our computational model—a slight modification of that
found in many textbooks, e.g., [Sip13]—is that of a deterministic Turing machine with a unique
halting state (distinct from the start state) and a single two-way infinite tape whose initial contents
is an input string w of nonblank symbols, surrounded on both sides with blank tape. In each step,
the tape head must move either left or right by one cell. The crossword to be filled in encodes the
tableau of a halting computation. Each symbol in the crossword represents the contents of a tape
cell at a certain time in the computation, possibly with some extra information about the state of
the machine and the position of the head. The expression R ensures that the whole configuration of
the TM is legitimate at each time step, and C ensures that the contents of each tape cell is correct
over time. We view the tableau with the initial configuration on the top row and time moving
downward.

One might think that, in order to handle transitions correctly, a crossword symbol should
represent a “window” in the tableau, spanning perhaps two or three adjacent tape cells at two
adjacent time steps, and that these windows should overlap consistently. It is possible to do this,
but it turns out to be unnecessary; we use a trick whereby the machine’s transition information is
passed in two directions—first horizontally (checked by R), then vertically (checked by C). (This
idea is somewhat analogous to the characterization of recognizable picture languages via domino
systems and hv-local languages [LS97].)
Both results of this section use the following lemma:

Lemma 3. Let M be a Turing machine (as described above). There exists an alphabet Σ and a
regular expression C := C(M ) over Σ (both depending on M ), and for any input string w there
exists a regular expression R := R(M, w) over Σ (depending on M and w) such that (R, C) is
plural, and M halts on input w if and only if an (R, C)-crossword exists, and if this is the case,
then the (R, C)-crossword is unique. Furthermore, R is computable from M and w in polynomial
time, and C is computable from M .

Proof. Let M = (Q, Γ, δ, q0 , qhalt , B), where

• Q is the (finite) state set,

• q0 ∈ Q is the start state,

• qhalt ∈ Q is the halting state, different from q0 (M halts just when this state is entered),

• Γ is the tape alphabet,

• B ∈ Γ is the blank symbol, and

• δ : (Q \ {qhalt }) × Γ → Q × Γ × {L, R} is the transition function. The left and right head
directions are indicated by L and R, respectively.

Given some input string w ∈ (Γ \ {B})∗ , we construct the two regular expressions R and C over
an alphabet Σ (defined below). The expression C only depends on M and not on w. For technical
convenience and without loss of generality, we will make the following four additional assumptions
about M ’s computation: M ’s head initially scans the blank cell immediately to the left of w; M ’s
initial transition is δ(q0 , B) = (q1 , B, L) for some state q1 ≠ q0 ; M never re-enters state q0 after
its first step, nor scans any tape cells to the left of where it is after the first step (it might write
a special symbol in the cell to keep itself from doing this); and at some point, M scans the blank
cell immediately to the right of the input w (which of course requires it to scan every symbol of
w). M can be modified if necessary to meet these conditions without altering its halting versus
non-halting behavior on any input.
To avoid confusion, we will call the elements of the alphabet Σ markers, reserving the word
symbol to refer to elements of Γ. The markers in Σ are of the following three disjoint types:

Unscanned tape markers: For all a ∈ Γ, the marker [a] is in Σ. Each of these markers is used
to depict a cell of the tape containing the symbol a and which is scanned neither currently
nor in the next time step. We let U := {[a] : a ∈ Γ} denote the set of all unscanned tape
markers.

Scanned tape markers: For all a ∈ Γ and all q ∈ Q, the marker [a, q] is in Σ. Each of these
depicts a cell of the tape containing a that is currently being scanned, and M ’s current state
is also included in the marker.

State transmission markers: For all a ∈ Γ and all q ∈ Q\{q0 }, the marker [a, ↓ q] is in Σ. These
markers depict tape cells that are currently unscanned but will be scanned in the next time
step (and so they always appear horizontally adjacent to scanned tape markers for nonhalting
states). M ’s state in the next time step is also included in the marker.

To summarize: At each time step of M ’s computation, the tape cell scanned by the head is recorded
in the crossword by the scanned tape marker, which includes M ’s current state. All the unscanned
cells of M ’s tape are recorded in the crossword by their corresponding unscanned tape markers
with one exception: the unscanned tape cell that will become scanned in the next time step will
be recorded by a state transmission marker, which includes M ’s state in the next time step.
Here are two typical examples. Suppose M ’s current state is q and it is scanning a b on the
tape, with a to the left and c to the right. The corresponding configuration is traditionally denoted
· · · aqbc · · · . If δ(q, b) = (r, x, R), then the part of the crossword corresponding to the transition
aqbc ↦ axrc looks like this:

    ···      ···        ···         ···       ···
    ···      [a]       [b, q]     [c, ↓r]     ···
    ···      [a]        [x]        [c, r]     ···
    ···      ···        ···         ···       ···

If instead δ(q, b) = (s, y, L), then we get this for the transition aqbc ↦ sayc:

    ···      ···        ···         ···       ···
    ···    [a, ↓s]     [b, q]       [c]       ···
    ···    [a, s]       [y]         [c]       ···
    ···      ···        ···         ···       ···

The one exception to this rule is a halting configuration, say · · · a qhalt bc · · · , which is represented
in the crossword thus:

    ···      ···        ···         ···       ···
    ···      [a]     [b, qhalt]     [c]       ···

We will guarantee that there can be no rows of the crossword below this one.
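
The marker conventions above can be made concrete by simulating a machine and printing its tableau. The following is a minimal sketch, not part of the construction itself: the function name `tableau`, the string encoding of markers, and the toy machine `DELTA` (which marks the leftmost cell with `$`, walks right across the input, and halts) are all assumptions chosen to satisfy the stated conventions on M.

```python
def tableau(delta, w, q0="q0", halt="qhalt", blank="B", max_steps=1000):
    """Simulate M on w and return the crossword rows as lists of marker strings."""
    tape = {i + 1: s for i, s in enumerate(w)}   # w occupies cells 1..n
    state, head = q0, 0                          # head starts just left of w
    configs = []
    while True:
        if len(configs) > max_steps:
            raise RuntimeError("step limit exceeded (M may not halt)")
        configs.append((state, head, dict(tape)))
        if state == halt:
            break
        state, tape[head], move = delta[(state, tape.get(head, blank))]
        head += 1 if move == "R" else -1
    lo = min(h for _, h, _ in configs)           # leftmost scanned cell
    hi = max(h for _, h, _ in configs)           # rightmost scanned cell
    rows = []
    for t, (q, h, tp) in enumerate(configs):
        nxt = configs[t + 1] if t + 1 < len(configs) else None
        row = []
        for p in range(lo, hi + 1):
            a = tp.get(p, blank)
            if p == h:
                row.append(f"[{a},{q}]")         # scanned tape marker
            elif nxt is not None and p == nxt[1]:
                row.append(f"[{a},↓{nxt[0]}]")   # state transmission marker
            else:
                row.append(f"[{a}]")             # unscanned tape marker
        rows.append(row)
    return rows

# Hypothetical toy machine obeying the assumptions on M made in the text.
DELTA = {("q0", "B"): ("q1", "B", "L"), ("q1", "B"): ("q2", "$", "R"),
         ("q2", "B"): ("q3", "B", "R"), ("q3", "1"): ("q3", "1", "R"),
         ("q3", "B"): ("qhalt", "B", "L")}

for row in tableau(DELTA, "1"):
    print(" ".join(row))
```

Only cells that are scanned at least once get a column, which matches the shape of the crossword described in the correctness argument.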

The regular expression R


R ensures that all the rows of the crossword look like they should. First we define a regular
expression giving the initial configuration of M on input w: Let w = w1 w2 · · · wn , where n ≥ 0 and
each wi is in Γ \ {B}. Define

Iw := [B, ↓ q1 ][B, q0 ][w1 ][w2 ] · · · [wn ][B]+ . (1)

This is the only component of our construction that depends on the string w. Since in its first step
M ’s head moves left and its state becomes q1 , this is the correct description of the first row. Since

M never scans any cells further to the left thereafter, we can take [B, ↓ q1 ] to start Iw . Next, we
define strings of markers indicating configurations beyond the initial one. Set

TL := {[b, ↓ r][a, q] : a, b ∈ Γ & q ∈ Q \ {q0 , qhalt } & (∃c ∈ Γ)δ(q, a) = (r, c, L)} ,
TR := {[a, q][b, ↓ r] : a, b ∈ Γ & q ∈ Q \ {q0 , qhalt } & (∃c ∈ Γ)δ(q, a) = (r, c, R)} ,
T := TL ∪ TR ∪ {[a, qhalt ] : a ∈ Γ} ,

describing portions of the tape undergoing transitions. Then we define

R := Iw ∪ U ∗ T U ∗ .

Note that R requires each row to include exactly one scanned tape marker. If the corresponding
state is nonhalting, then it is adjacent to some state transmission marker (and this is the only place
the latter marker can appear in the row). If the corresponding state is halting, then there is no
state transmission marker on the row.
Clearly, R is positive and computable in polynomial time given w and a description of M .

The regular expression C


C ensures that all the columns of the crossword look like they should. We define C := S ∩ W as
the intersection of two subexpressions: S ensures that each tape cell stays constant (“static”)—
except just after it is scanned by M ’s head—and that when a cell becomes scanned, the new state
information is faithfully copied from the previous time step; W ensures that the correct symbol is
written into a scanned cell on the next time step.
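For S we define

\[
\begin{aligned}
D &:= \bigcup_{a\in\Gamma,\ q\in Q\setminus\{q_0\}} [a]^{*}\,[a,{\downarrow}q]\,[a,q] , \\
E &:= \bigcup_{a\in\Gamma,\ q\in Q\setminus\{q_0\}} [a]^{+}\,[a,{\downarrow}q]\,[a,q] , \\
F &:= \bigcup_{a\in\Gamma} [a]^{*} , \\
S &:= \bigl(E \cup [B,q_0] \cup [B,{\downarrow}q_1]\,[B,q_1]\bigr)\,D^{*}\,F .
\end{aligned}
\]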

A string matching E ∪ [B, q0 ] ∪ [B, ↓ q1 ][B, q1 ] gives the contents of a tape cell starting at the
beginning up through the first time it is scanned. Thereafter, each string matching D represents a
time interval ending with the cell being scanned again. F is matched by the cell contents after the
last time it is scanned. Note that S is positive, and hence C is positive.
For W we define

X := {[a, q][b] : a ∈ Γ & q ∈ Q \ {qhalt } & (∃r ∈ Q)(∃d ∈ {L, R})[δ(q, a) = (r, b, d)]} ,
Y := {[a, q][b, ↓ s] : a ∈ Γ & q ∈ Q \ {qhalt } & s ∈ Q \ {q0 } & (∃r ∈ Q)(∃d ∈ {L, R})[δ(q, a) = (r, b, d)]} ,
H := {[a, qhalt ] : a ∈ Γ} ,
Z := Σ \ {[a, q] : a ∈ Γ & q ∈ Q} ,
W := Z ∗ (XZ ∗ ∪ Y )∗ H? .

Note that W matches all strings in which every occurrence of a non-halting scanned tape marker is
immediately followed by either an unscanned tape marker or a state transmission marker giving the
cell’s correct contents after the corresponding transition of M . W also allows an optional halting
scanned tape marker at the very end of the string.
Notice that C is computable from M alone and does not depend on the input string w at all.
Note that we are not asserting that C is computable in polynomial time. Our description of C
includes the intersection operator ∩, which is not part of the formal syntax of regular expressions.
As we mentioned, one can effectively compute an equivalent regular expression without the ∩
operator, but the resulting regular expression may be exponentially larger.
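
To see R, S, and W at work, the sketch below builds them as Python regexes for a small machine and checks a hand-written tableau. Everything specific here is an assumption for illustration: the toy machine `DELTA`, the encoding of markers as bracketed strings, and the helper names. The intersection C = S ∩ W is not expanded; a column is simply required to match S and W separately, which sidesteps the possible exponential blow-up mentioned above.

```python
import re

B, Q0, Q1, HALT = "B", "q0", "q1", "qhalt"
GAMMA = ["B", "1", "$"]
STATES = ["q0", "q1", "q2", "q3", "qhalt"]
# Hypothetical toy machine: mark the left end with $, walk right over w, halt.
DELTA = {("q0", "B"): ("q1", "B", "L"), ("q1", "B"): ("q2", "$", "R"),
         ("q2", "B"): ("q3", "B", "R"), ("q3", "1"): ("q3", "1", "R"),
         ("q3", "B"): ("qhalt", "B", "L")}
NONSTART = [q for q in STATES if q != Q0]

def u(a):     return re.escape(f"[{a}]")       # unscanned tape marker
def sc(a, q): return re.escape(f"[{a},{q}]")   # scanned tape marker
def tr(a, q): return re.escape(f"[{a},↓{q}]")  # state transmission marker
def alt(xs):  return "(?:" + "|".join(xs) + ")"

def build_R(w):
    Iw = tr(B, Q1) + sc(B, Q0) + "".join(u(x) for x in w) + f"(?:{u(B)})+"
    U = alt(u(a) for a in GAMMA)
    moves = [(q, a, r, d) for (q, a), (r, _, d) in DELTA.items()
             if q not in (Q0, HALT)]
    TL = [tr(b, r) + sc(a, q) for q, a, r, d in moves if d == "L" for b in GAMMA]
    TR = [sc(a, q) + tr(b, r) for q, a, r, d in moves if d == "R" for b in GAMMA]
    T = alt(TL + TR + [sc(a, HALT) for a in GAMMA])
    return f"(?:{Iw}|{U}*{T}{U}*)"

def build_S():
    pairs = [(a, q) for a in GAMMA for q in NONSTART]
    D = alt(f"(?:{u(a)})*{tr(a, q)}{sc(a, q)}" for a, q in pairs)
    E = alt(f"(?:{u(a)})+{tr(a, q)}{sc(a, q)}" for a, q in pairs)
    F = alt(f"(?:{u(a)})*" for a in GAMMA)
    return f"(?:{E}|{sc(B, Q0)}|{tr(B, Q1)}{sc(B, Q1)}){D}*{F}"

def build_W():
    X = alt(sc(a, q) + u(c) for (q, a), (_, c, _) in DELTA.items())
    Y = alt(sc(a, q) + tr(c, s)
            for (q, a), (_, c, _) in DELTA.items() for s in NONSTART)
    H = alt(sc(a, HALT) for a in GAMMA)
    Z = alt([u(a) for a in GAMMA] + [tr(a, q) for a in GAMMA for q in NONSTART])
    return f"{Z}*(?:{X}{Z}*|{Y})*{H}?"

def is_tableau(rows, w):
    """Rows must match R; columns must match both S and W (i.e., C = S ∩ W)."""
    R, S, W = build_R(w), build_S(), build_W()
    cols = ["".join(c) for c in zip(*rows)]
    return (all(re.fullmatch(R, "".join(r)) for r in rows)
            and all(re.fullmatch(S, c) and re.fullmatch(W, c) for c in cols))

# Hand-written tableau of the toy machine on input w = "1".
TABLEAU = [
    ["[B,↓q1]", "[B,q0]", "[1]", "[B]"],
    ["[B,q1]", "[B,↓q2]", "[1]", "[B]"],
    ["[$]", "[B,q2]", "[1,↓q3]", "[B]"],
    ["[$]", "[B]", "[1,q3]", "[B,↓q3]"],
    ["[$]", "[B]", "[1,↓qhalt]", "[B,q3]"],
    ["[$]", "[B]", "[1,qhalt]", "[B]"],
]
print(is_tableau(TABLEAU, "1"))
```

Because each marker string starts with `[` and ends with `]`, concatenating markers is unambiguous, so matching the joined string is equivalent to matching over the marker alphabet Σ.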

Correctness
One direction of the lemma is now fairly clear: If M halts starting with w on its tape, then an
(R, C)-crossword exists. Such a crossword is also unique: S makes sure that every column contains
at least one scanned tape marker, and so the crossword represents exactly those tape cells that are
scanned at least once by M (which, by assumption, include the entire input string w); furthermore,
any row containing a marker of the form [a, qhalt ] (for some a ∈ Γ) must be the last row—this is
enforced by W . Finally, we note that, because M makes at least one transition before it halts, the
corresponding (R, C)-crossword has at least two rows and two columns, which makes (R, C) plural.
For the other direction, suppose X is an (R, C)-crossword. Let r1 , . . . , rm ∈ Σ∗ and c1 , . . . , cn ∈
Σ∗ be the rows and columns of X, respectively, for some m, n ≥ 1. S ensures that r1 matches
(U ∪ [B, q0 ] ∪ [B, ↓ q1 ])∗ , and since R forces r1 to contain a scanned tape marker somewhere, that
marker must be [B, q0 ]. It follows that r1 does not match U ∗ T U ∗ , and so it matches Iw , providing
the right starting configuration for M (and ensuring that n ≥ 2). We also have m ≥ 2, ensured by
S because r1 contains [B, ↓ q1 ]. Thus (R, C) is plural. Subsequent rows must then conform to M ’s
computation, as was described previously.
Finally, the last row rm must contain a marker of the form [a, qhalt ] for some a ∈ Γ, indicating
that M halts. This is because R ensures that rm contains some scanned tape marker, and supposing
this marker is of the form [a, q] for some q ≠ qhalt , there must be a state transmission marker
immediately to its left or right in rm , whence S ensures that this latter marker is followed by a
scanned tape marker in its column, which means rm could not have been the last row.

Lemma 3 yields the following result:

Theorem 4. There exists an alphabet Σ and a positive regular expression C over Σ such that the
decision problem

Given a regular expression R over Σ such that (R, C) is plural, does an (R, C)-crossword
exist?

is m-equivalent to the Halting Problem (and is thus undecidable).

Proof. We apply Lemma 3 letting M be a universal Turing machine (or any Turing machine rec-
ognizing the Halting Problem). Let Σ and C be as constructed in the proof. Letting W be the
decision problem above, we get a computable function g such that, for any string w, g(w) is a
regular expression R such that (R, C) is plural, and for all w, M halts on w if and only if an
(R, C)-crossword exists. Thus g m-reduces the Halting Problem to W . Conversely, W is clearly
c.e., and thus m-reduces to the Halting Problem.

Corollary 5 (Giammarresi, Restivo [GR97]). Given regular expressions R and C, it is undecidable
(m-equivalent to the Halting Problem) whether an (R, C)-crossword exists.

Proof. Just note that W in the proof of Theorem 4 is c.e. uniformly in C.

3.1 Making the row and column expressions equal


The (R, C)-crossword existence problem remains undecidable even if we insist that R = C. We get
this from the following lemma:

Lemma 6. There exists a polynomial-time computable function b such that, for any alphabet Σ and
any regular expressions R and C over Σ such that (R, C) is plural, E := b(Σ, R, C) is a positive
regular expression (over a slightly bigger alphabet Σ′) such that an (E, E)-crossword exists if and
only if an (R, C)-crossword exists. Furthermore, there is a one-to-one map ρ mapping Σ-crosswords
of size m × n (where m, n ≥ 2) to Σ′-crosswords of size (m + 1) × (n + 1) that takes (R, C)-crosswords
to (E, E)-crosswords, and for every (E, E)-crossword Y , there exists an (R, C)-crossword X such
that ρ(X) is either Y or the matrix transpose of Y .

Proof. Let Σ, R, and C be given as in the lemma. We want to effectively find an E so that a
unique (E, E)-crossword corresponds to any given (R, C)-crossword and vice versa. A first attempt
at constructing E would be to set E := R ∪ C. This may not work, because an (R, C)-crossword
may not exist, but there is an (E, E)-crossword where each row and column might match R, but the
columns do not match C, say. (In the case of R and C in the proof of Lemma 3, any square array
with a single [B, qhalt ] in each row and column, and the rest filled with all [B]’s is an (R ∪ C, R ∪ C)-
crossword, regardless of w.) There are perhaps several ways to correct this problem, and here is a
fairly simple fix:

1. Introduce three new symbols not in Σ: ♠ (the “bottom edge marker”); ♥ (the “left edge
marker”); and ♦ (the “corner marker”).

2. Then modify R and C slightly to R′ and C′, respectively, so that any (R′ ∪ C′, R′ ∪ C′)-crossword
or its matrix transpose has its first column matching ♥∗♦, its last row matching
♦♠∗, and the rest of the array being an (R, C)-crossword as before:

    ♥
    ⋮     (R, C)-crossword
    ♥
    ♦  ♠  · · ·  ♠

Informally, the ♥ and ♠ markers prevent rows from being confused with columns, and the ♦ marker
prevents ♥ and ♠ from being confused with each other. Here are the formal definitions:

Σ′ := Σ ∪ {♠, ♥, ♦} ,
R′ := ♥R ∪ ♦♠♠♠∗ ,
C′ := C♠ ∪ ♥♥♥∗♦ ,
E := R′ ∪ C′ .

Clearly, E = b(Σ, R, C) is positive and computable in polynomial time. To see that this construction
works, first observe that an m × n (R, C)-crossword X (with m, n ≥ 2 because (R, C) is plural)
becomes an (m + 1) × (n + 1) (E, E)-crossword ρ(X) by prepending the column ♥^m then appending
the row ♦♠^n. This defines the map ρ, which is clearly one-to-one and maps (R, C)-crosswords to
(E, E)-crosswords with one more row and column. It follows that an (E, E)-crossword exists if an
(R, C)-crossword exists.
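
The map ρ and the expression E are simple enough to try out directly. The sketch below is illustrative only: the function names are hypothetical, expressions are written in Python `re` syntax, and the example pair R = C = 01 ∪ 10 is an assumption chosen because it is plural (its only crosswords are the two 2 × 2 permutation matrices).

```python
import re

HEART, SPADE, DIAMOND = "♥", "♠", "♦"

def make_E(R, C):
    """E := R' ∪ C' with R' = ♥R ∪ ♦♠♠♠* and C' = C♠ ∪ ♥♥♥*♦."""
    R_prime = f"{HEART}(?:{R})|{DIAMOND}{SPADE}{SPADE}{SPADE}*"
    C_prime = f"(?:{C}){SPADE}|{HEART}{HEART}{HEART}*{DIAMOND}"
    return f"(?:{R_prime}|{C_prime})"

def rho(X):
    """Prepend a column of ♥ to X, then append the row ♦♠^n."""
    n = len(X[0])
    return [HEART + row for row in X] + [DIAMOND + SPADE * n]

def transpose(grid):
    return ["".join(col) for col in zip(*grid)]

def is_crossword(grid, R, C):
    """True iff every row matches R and every column matches C."""
    return (all(re.fullmatch(R, r) for r in grid)
            and all(re.fullmatch(C, c) for c in transpose(grid)))

R = C = "01|10"
X = ["01", "10"]
E = make_E(R, C)
Y = rho(X)
print(Y)
print(is_crossword(X, R, C), is_crossword(Y, E, E))
```

The transpose of ρ(X) is also an (E, E)-crossword, which is why the lemma can only recover X from Y up to transposition.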
Conversely, let Y be any (E, E)-crossword—say, m × n—with rows r1 , . . . , rm and columns
c1 , . . . , cn , all matching E. We show first that m, n ≥ 3. Suppose not. We must have m, n ≥ 2,
because both R and C are positive. We may assume that m = 2; otherwise, we apply the same
argument to the transpose of Y , which is still an (E, E)-crossword. Then each column of Y has
length 2 and thus must match either ♥R or C♠. Suppose c2 starts with ♥. Then since r1 has ♥ as
its second symbol, it must match ♥♥♥∗ ♦, whence cn starts with ♦; but then |cn | ≥ 3, contradicting
our assumption that m = 2. Now suppose instead that c2 matches C♠. Then either r2 matches
♦♠♠♠∗ or r2 = a♠ for some a ∈ Σ matching C. The former case would make c1 have ♦ as its
second symbol, which is impossible. In the latter case, we must have c1 = ♥a, which matches ♥R.
But then a matches both R and C, making a 1 × 1 (R, C)-crossword, which contradicts the fact
that (R, C) is plural.
Having established that m, n ≥ 3, we next show that removing the first column and last row
from either Y or its transpose results in an (R, C)-crossword X, from which it will be clear that
ρ(X) is either Y or its transpose, respectively.
Consider r2 , which has length ≥ 3 and matches either R′ or C′.

Case 1: r2 matches R′. Then r2 must begin with ♥: otherwise, it begins with ♦, but then c1
has ♦ as its second symbol, which is impossible. Then we have r2 = ♥r for some string
r matching R, and since c1 has ♥ as its second symbol, we have c1 = ♥^{m−1}♦, whence it
follows that rm = ♦♠^{n−1}. Now consider the columns c2 , . . . , cn . These all end with ♠, and
so they must all match C♠, because they all contain symbols in Σ (from r2 ). So now we
know that all symbols in Y other than the first column and last row are in Σ, that is, for
each 1 ≤ i ≤ m − 1, all symbols in ri , except possibly the first, are in Σ. The only way this
can happen is if each ri matches ♥R. This establishes that Y minus the first column and last
row is an (R, C)-crossword (whose image under ρ is Y ).

Case 2: r2 matches C′. By transposing Y , we can assume instead that c2 matches C′, which is
conceptually simpler. The argument here is similar to Case 1. The string c2 cannot end with
♦, as that would also be the second symbol of rm , which is impossible. So we have that
c2 = c♠ for some string c matching C, and thus rm = ♦♠^{n−1} (because |rm | ≥ 3), whence it
follows that c1 = ♥^{m−1}♦. Now since r1 , . . . , rm−1 all start with ♥ and contain at least one
symbol from Σ, they all match ♥R. So again, all symbols in Y except the first column and
last row are from Σ, and since c2 , . . . , cn all end in ♠, they must all match C♠. So again we
have that deleting the first column and last row results in an (R, C)-crossword.

We have shown that removing the first column and last row from either Y (in Case 1) or its
transpose (in Case 2) results in an (R, C)-crossword X such that ρ(X) is either Y or its transpose,
respectively. In particular, if an (E, E)-crossword exists, then an (R, C)-crossword exists.

Theorem 7. Given a positive regular expression E, it is undecidable (in fact, m-equivalent to the
Halting Problem) whether an (E, E)-crossword exists.

Proof. The problem is clearly c.e. and hence m-reduces to the Halting Problem. Conversely, let b be
the function of Lemma 6, and let M , Σ, C, and g be as in the proof of Theorem 4. Then (g(w), C)
is plural by Lemma 3. For any string w, M halts on w if and only if a (g(w), C)-crossword exists.
Then, letting E := b(Σ, g(w), C) (computable from w), we get by Lemma 6 that E is positive
and that M halts on w if and only if an (E, E)-crossword exists. Thus the mapping b(Σ, g(·), C)
m-reduces the Halting Problem to the (E, E)-crossword existence problem.

3.2 A decidable crossword existence problem


In contrast with the previous results, we have the following:

Theorem 8. There is an algorithm that decides, given a list of regular expressions $\langle R_1, \ldots, R_m \rangle$
and a regular expression C over an arbitrary alphabet Σ, whether there exists an n ≥ 1 and an
m × n array all of whose columns match C and whose ith row matches Ri for all 1 ≤ i ≤ m. In
fact, this decision problem is in PSPACE.

Proof Sketch. First, we convert each $R_i$ into an equivalent ε-NFA $N_i$ (see [HMU07]). These
automata have sizes polynomial in the sizes of the regular expressions. Then we nondeterministically
guess a crossword one column at a time, starting with the first, and for each guessed column
(which we check matches C), we simulate one step of each of the $N_i$ on its corresponding symbol
(this can be done in polynomial time by keeping track of a subset of the state set of each $N_i$).
We accept if ever all the $N_i$ accept simultaneously. We can also stop after $2^N$ guesses, where N
is the total number of states of all the $N_i$ combined, since by then the tuple of state subsets
must have repeated. This nondeterministic algorithm uses polynomial space, and hence can be
converted into a deterministic polynomial-space algorithm by Savitch’s theorem.
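
The column-by-column simulation can be illustrated with a small Python sketch. It is not the polynomial-space algorithm itself: it replaces nondeterministic guessing with a breadth-first search over tuples of state subsets, which may need exponential space. The representation of each row automaton as a triple `(starts, delta, accepts)` without ε-moves, the name `crossword_exists`, and the use of Python's `re` module to test columns against C are all assumptions made for illustration.

```python
import re
from itertools import product

def crossword_exists(row_nfas, col_regex, alphabet):
    """Decide whether some n >= 1 admits an m x n grid whose i-th row is
    accepted by row_nfas[i] and whose columns all fully match col_regex.

    Each NFA is a triple (starts, delta, accepts) with no epsilon-moves;
    delta maps (state, symbol) to a set of successor states.
    """
    m = len(row_nfas)
    col_pat = re.compile(col_regex)
    # All candidate columns: strings of length m that match C.
    cols = [c for c in product(alphabet, repeat=m)
            if col_pat.fullmatch("".join(c))]
    start = tuple(frozenset(starts) for starts, _, _ in row_nfas)
    seen, frontier = {start}, [start]
    while frontier:
        next_frontier = []
        for config in frontier:
            for col in cols:
                # Advance every row automaton by one symbol of the column.
                new = tuple(
                    frozenset(t for q in states
                              for t in delta.get((q, sym), ()))
                    for states, sym, (_, delta, _)
                    in zip(config, col, row_nfas)
                )
                if not all(new):      # some row automaton has no live state
                    continue
                if all(s & set(acc)
                       for s, (_, _, acc) in zip(new, row_nfas)):
                    return True       # all rows accept simultaneously
                if new not in seen:
                    seen.add(new)
                    next_frontier.append(new)
        frontier = next_frontier
    return False
```

The search terminates because there are only finitely many tuples of state subsets, which mirrors the bound on the number of guesses in the proof sketch.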

4 Regular expressions over the binary alphabet


The alphabets used in Theorems 4 and 7 are fixed, but they are likely quite large, having to encode
all the states of a universal Turing machine M . In this section, we show how to map (in polynomial
time) regular expressions over an arbitrary alphabet to regular expressions over the binary alphabet
in a way that preserves crosswords. Thus the crossword existence problem remains undecidable
even when restricted to a binary alphabet.

Lemma 9. There is a function f such that, for any k ≥ 2 and positive regular expression R
over alphabet Σ := {0, . . . , k − 1}, f (k, R) is a positive regular expression over the alphabet {0, 1}
such that the following holds: There exists a one-to-one map ψk between Σ-crosswords and {0, 1}-
crosswords (that maps m × n crosswords to (3k(m + 1) + 1) × (3k(n + 1) + 1) crosswords) such that,
for any positive regular expressions T and U over Σ,

1. for any (T, U )-crossword X, ψk (X) is a (f (k, T ), f (k, U ))-crossword, and

2. for every (f (k, T ), f (k, U ))-crossword Y , there is a (T, U )-crossword X such that ψk (X) = Y .

Furthermore, f is computable in time polynomial in k + |R|.

Proof. Fix k and a positive regular expression R over Σ := {0, . . . , k − 1}. The regular expression
F := f(k, R) over {0, 1}, defined below, will be formed from several components. Let $\ell := 3k$,
noting that $\ell \ge 6$. Any string $w \in L(F)$ will satisfy $|w| \equiv 1 \pmod{\ell}$. For $0 \le i < \ell$ and any
string x of length $\ell$, define $\mathrm{RotL}_i(x)$ to be the cyclic shift of x by i places to the left. That is, if
$x = x_0 \cdots x_{\ell-1}$, then
\[ \mathrm{RotL}_i(x) := x_i \cdots x_{\ell-1}\, x_0 \cdots x_{i-1} . \]
Now define $s_0 := 0^{\ell-2}11$, and for $0 < i < \ell$ define $s_i := \mathrm{RotL}_i(s_0)$. We will use the $s_i$ to encode
symbols from Σ.
Let $h : \Sigma^* \to \{0, 1\}^*$ be the string homomorphism determined by
\[ h(j) := s_{3j} , \]
for all 0 ≤ j < k. We extend h to apply to regular expressions over Σ in the usual way (see [HMU07]
for example).
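
As a concrete illustration of these definitions, the following Python sketch computes the blocks $s_i$ and the homomorphism h on strings. The function names `rotl`, `s`, and `h`, and the choice to represent a word over Σ as a sequence of integers, are assumptions made for this example only.

```python
def rotl(x: str, i: int) -> str:
    """Cyclic left shift of x by i places (RotL_i)."""
    i %= len(x)
    return x[i:] + x[:i]

def s(i: int, k: int) -> str:
    """The block s_i for alphabet size k, of length ell = 3k."""
    ell = 3 * k
    return rotl("0" * (ell - 2) + "11", i)

def h(word, k: int) -> str:
    """The homomorphism h: symbol j (an int in range(k)) maps to s_{3j}."""
    return "".join(s(3 * j, k) for j in word)
```

For k = 2 (so ℓ = 6), `s(0, 2)` is `"000011"` and `h([0, 1], 2)` is `"000011011000"`, the concatenation of $s_0$ and $s_3$.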
Given a positive regular expression R over Σ, the subexpressions making up F := f(k, R) come
in four types (alignment, calibration, encoding, and duplication), defined as follows:

Alignment: Define
\[ A := 1^{\ell} (0^{\ell})^{+} . \]

Calibration: Define
\begin{align*}
C_0 &:= 0001^{\ell-3} (s_0)^{+} , \\
C_1 &:= 01^{\ell-1} (s_1)^{+} , \\
C_2 &:= 01^{\ell-1} (s_2)^{+} ,
\end{align*}
and for $3 \le i \le \ell - 1$, define
\[ C_i := 1^{\ell} (s_i)^{+} . \]
Now define
\[ C := \bigcup_{i=0}^{\ell-1} C_i . \]

Encoding: Define
\[ E^{(R)} := s_0\, h(R) , \]
that is, $s_0$ concatenated with the regular expression h(R). Note that we make the dependence
on R explicit. We use E as shorthand for $E^{(\Sigma^{+})}$ and note that $L(E^{(R)}) \subseteq L(E)$, because R
is positive.

Duplication: Define
\[ D_0 := \bigcup_{1 \le c < k} s_{3c} , \]
and for $j \in \{1, 2\}$, define
\[ D_j := \bigcup_{0 \le c < k} s_{3c+j} . \]
Define
\[ D := D_0 (D_0)^{+} \cup D_1 (D_1)^{+} \cup D_2 (D_2)^{+} . \]

Figure 2: The 15 × 15 squares $S_0, \ldots, S_4$ used to encode the individual letters 0, . . . , 4, respectively.
A white cell denotes 0, and a black cell denotes 1.

Figure 3: The encoding $\psi_5(X)$ of a sample 2 × 3 Σ-crossword X (with rows 3 0 4 and 2 1 0), where
Σ = {0, 1, 2, 3, 4}. The slightly thicker lines give the boundaries between the 15 × 15 squares.

Finally, define
\[ F := 1(A \cup C) \cup 0(D \cup E^{(R)}) . \]
This completes the description of F = f(k, R). It is evident that f is computable in the specified
time bounds. Notice that all subexpressions of F except $E^{(R)}$ depend only on k and not on R.
Next we show how to convert any Σ-crossword X into a unique {0, 1}-crossword $Y = \psi_k(X)$
such that, for any positive regular expressions T and U over Σ, X is a (T, U)-crossword if and
only if Y is an (F, G)-crossword, where F := f(k, T) and G := f(k, U). It will help first to see
an example of how this is done. Suppose Σ = {0, 1, 2, 3, 4}. Then each cell of a Σ-crossword is
encoded by a 15 × 15 square in the {0, 1}-crossword, as shown in Figure 2. Generally, for 0 ≤ c < k
we define $S_c$ to be the ℓ × ℓ square whose ith row (starting with i = 0) is $s_{(3c+i) \bmod \ell}$. These squares
are pairwise distinct, and we use $S_c$ to encode the letter c. Notice that the $S_c$ are symmetric (with
respect to matrix transpose), and so the ith column of $S_c$ is also $s_{(3c+i) \bmod \ell}$. In Figure 3, we show
the encoding $\psi_k(X)$ of a sample 2 × 3 Σ-crossword X. The top row and left column form the
alignment region, and these two strings will both match 1A. The rest of the crossword is made up
of (ℓ × ℓ)-size squares $Q_{t,u}$ for t, u ≥ 0, with $Q_{0,0}$ being the top leftmost square, $Q_{0,1}$ immediately
to its right, $Q_{1,0}$ immediately below it, etc. Squares of the form $Q_{0,u}$ and $Q_{t,0}$ form the calibration
region, and, except for $Q_{0,0}$, all these squares are equal to $S_0$. The rows and columns making up
this region all match 1C. The rest of the crossword (squares $Q_{t,u}$ for t, u ≥ 1) forms the encoding
region, each square encoding a single corresponding entry in the Σ-crossword. Rows and columns
that intersect this region all match 0(D ∪ E).
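
The stated properties of the squares $S_c$ (pairwise distinct, symmetric under transpose, two 1's per row) are easy to check mechanically for a given k. The Python sketch below builds the squares as lists of row strings; the helper names `rotl`, `square`, and `transpose` are hypothetical.

```python
def rotl(x, i):
    """Cyclic left shift of the string x by i places."""
    i %= len(x)
    return x[i:] + x[:i]

def square(c, k):
    """S_c as a list of ell = 3k row strings; row i is s_{(3c+i) mod ell}."""
    ell = 3 * k
    s0 = "0" * (ell - 2) + "11"
    return [rotl(s0, (3 * c + i) % ell) for i in range(ell)]

def transpose(grid):
    """Transpose a grid given as a list of equal-length strings."""
    return ["".join(col) for col in zip(*grid)]
```

Symmetry holds because the entry in row i and column j of $S_c$ depends only on (3c + i + j) mod ℓ.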
Now the detailed description. Let X be any Σ-crossword with m rows and n columns, where
m, n ≥ 1. For 1 ≤ t ≤ m and 1 ≤ u ≤ n, let $x_{t,u}$ be the symbol in row t and column u of X. Then
we define a {0, 1}-crossword $Y = \psi_k(X)$ as follows: Y has dimensions $((m+1)\ell + 1) \times ((n+1)\ell + 1)$,
where $\ell = 3k$ as above. It will be convenient to index the rows of Y as $-1, \ldots, (m+1)\ell - 1$
and the columns as $-1, \ldots, (n+1)\ell - 1$. With this indexing, the alignment region comprises
row −1 and column −1, and each square $Q_{t,u}$ (for 0 ≤ t ≤ m and 0 ≤ u ≤ n) is the intersection
of rows $t\ell, \ldots, (t+1)\ell - 1$ with columns $u\ell, \ldots, (u+1)\ell - 1$. We will define Y row by row, with
rows $r_{-1}, \ldots, r_{(m+1)\ell-1}$, then discuss the columns. (It will help to refer back to Figure 3.)

• Set $r_{-1} := 1^{\ell+1} 0^{n\ell}$. Then $r_{-1}$ matches 1A.

• Set
\begin{align*}
r_0 &:= 1\,0001^{\ell-3} (s_0)^n , \\
r_1 &:= 1\,01^{\ell-1} (s_1)^n , \\
r_2 &:= 1\,01^{\ell-1} (s_2)^n .
\end{align*}
Then $r_0$, $r_1$, and $r_2$ match $1C_0$, $1C_1$, and $1C_2$, respectively.

• For 3 ≤ i < ℓ, set $r_i := 1^{\ell+1} (s_i)^n$. Then $r_i$ matches $1C_i$.

• For 1 ≤ t ≤ m, let $x := x_{t,1} \cdots x_{t,n}$. For 0 ≤ i < ℓ, set
\[ r_{t\ell+i} := 0\, s_i\, s_{(3x_{t,1}+i) \bmod \ell} \cdots s_{(3x_{t,n}+i) \bmod \ell} . \]
Note that for 1 ≤ u ≤ n, block u of $r_{t\ell+i}$ equals $\mathrm{RotL}_i(h(x_{t,u}))$. Also notice that if all the
rows of X match some positive regular expression T over Σ, then all the $r_{t\ell}$ match $0E^{(T)}$.
The rest of the rows $r_{t\ell+i}$ match 0D; in particular, $r_{t\ell+i}$ matches $0D_{i \bmod 3}$.

This completes the definition of the map ψk .
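
The row-by-row definition translates directly into code. The following Python sketch builds $\psi_k(X)$ as a list of strings, with row −1 stored at list index 0. The function name `psi` and the representation of X as a list of lists of integers in range(k) are assumptions for illustration.

```python
def rotl(x, i):
    """Cyclic left shift of the string x by i places."""
    i %= len(x)
    return x[i:] + x[:i]

def psi(X, k):
    """Encode an m x n crossword X (rows of ints in range(k)) over {0, 1}.

    Returns the rows r_{-1}, r_0, ..., r_{(m+1)ell - 1} as strings.
    """
    ell = 3 * k
    n = len(X[0])
    s = [rotl("0" * (ell - 2) + "11", i) for i in range(ell)]
    rows = ["1" * (ell + 1) + "0" * (n * ell)]              # r_{-1}: alignment
    rows.append("1" + "000" + "1" * (ell - 3) + s[0] * n)   # r_0
    rows.append("1" + "0" + "1" * (ell - 1) + s[1] * n)     # r_1
    rows.append("1" + "0" + "1" * (ell - 1) + s[2] * n)     # r_2
    for i in range(3, ell):
        rows.append("1" * (ell + 1) + s[i] * n)             # r_i, 3 <= i < ell
    for x_row in X:                                         # encoding region
        for i in range(ell):
            blocks = "".join(s[(3 * x + i) % ell] for x in x_row)
            rows.append("0" + s[i] + blocks)                # r_{t*ell + i}
    return rows
```

For the sample crossword of Figure 3 (k = 5, rows 3 0 4 and 2 1 0) this yields a 46 × 61 grid, and transposing the result gives the same grid as encoding the transpose of X, which is the symmetry used for the columns.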


We have established that if the rows of X all match some positive regular expression T, then
each row of Y matches $F = 1(A \cup C) \cup 0(D \cup E^{(T)})$, and from the arrangement of the rows, we
can see by symmetry that if the columns of X all match some positive regular expression U over
Σ, then each column of Y matches $1(A \cup C) \cup 0(D \cup E^{(U)})$ in a similar manner:

• $c_{-1} = 1^{\ell+1} 0^{m\ell}$, matching 1A.

• $c_0 = 1\,0001^{\ell-3} (s_0)^m$, $c_1 = 1\,01^{\ell-1} (s_1)^m$, and $c_2 = 1\,01^{\ell-1} (s_2)^m$, matching $1C_0$, $1C_1$, and $1C_2$,
respectively.

• For 3 ≤ i < ℓ, $c_i = 1^{\ell+1} (s_i)^m$, matching $1C_i$.

• For 1 ≤ u ≤ n, letting $x := x_{1,u} \cdots x_{m,u}$, and for 0 ≤ i < ℓ, we have
\[ c_{u\ell+i} = 0\, s_i\, s_{(3x_{1,u}+i) \bmod \ell} \cdots s_{(3x_{m,u}+i) \bmod \ell} . \]
That is, for 1 ≤ t ≤ m, block t of $c_{u\ell+i}$ equals $\mathrm{RotL}_i(h(x_{t,u}))$. Since x matches U, we have
that $c_{u\ell}$ matches $0E^{(U)}$, and the rest of the $c_{u\ell+i}$ match 0D.

This establishes that, if X is a (T, U)-crossword, then $Y = \psi_k(X)$ is an (F, G)-crossword (of the
correct size), where F := f(k, T) and G := f(k, U). It is also clear that, since Y has the original
crossword X completely encoded within it, $\psi_k$ is a one-to-one map.
It remains to show that for any (F, G)-crossword Y, there is a (T, U)-crossword X such that
$\psi_k(X) = Y$, where T, U, F, and G are as above. We establish this through a series of claims. Each
claim is proved using “sudoku-like” arguments. Let Y be any (F, G)-crossword. First observe that
any string w matching A ∪ C ∪ D ∪ E has length vℓ for some v ≥ 2, and so we can chop w into
substrings of length ℓ that we call blocks (at least two), starting with block 0 through block v − 1.
This forces Y, minus its top row and left column, to be divided into (ℓ × ℓ)-size squares $Q_{t,u}$ as
described earlier, the rows and columns of each $Q_{t,u}$ being blocks in the rows and columns of Y
that intersect $Q_{t,u}$. Y has squares $Q_{t,u}$ for each 0 ≤ t ≤ m and 0 ≤ u ≤ n for some m, n ≥ 1.
As before, we index the rows and columns of Y as −1, . . . , (m + 1)ℓ − 1 and −1, . . . , (n + 1)ℓ − 1,
respectively.
We extend the block concept to strings of length vℓ + 1, e.g., the rows and columns of an
(F, G)-crossword, by ignoring the first symbol in the string, that is, block 0 starts with the second
symbol of the string.
Claim 10. Each square $Q_{t,0}$ and $Q_{0,u}$ of Y, for 1 ≤ t ≤ m and 1 ≤ u ≤ n, has exactly two 1’s in
each of its rows and each of its columns, the rest of the entries being 0.

Proof of Claim 10. Let w be any string matching A ∪ C ∪ D ∪ E. Then each block of w, other than
block 0, has at most two 1’s; in particular, it is either $0^{\ell}$ (if w matches A), or it is of the form $s_i$ for
some i (if w matches C ∪ D ∪ E). Moreover, block 0 of w has at least two 1’s. Thus for 1 ≤ u ≤ n,
square $Q_{0,u}$ has each of its rows containing at most two 1’s and each of its columns containing at
least two 1’s. The only way this can happen is if each row and column of $Q_{0,u}$ contains exactly
two 1’s. A similar argument shows that each row and column of $Q_{t,0}$ contains exactly two 1’s, for
1 ≤ t ≤ m.

Claim 11. No row other than the topmost, and no column other than the leftmost, matches 1A.

Proof of Claim 11. Consider any row except the topmost. This row is either 0r or 1r for some
string r matching A ∪ C ∪ D ∪ E, and it intersects either Q0,1 or else Qt,0 for some t ≥ 1. In the
former case, block 1 of r (i.e., the block of r intersecting Q0,1 ) has a 1, and so r cannot match A;
in the latter case, block 0 of r has a 0, and so again, r cannot match A. (Both cases follow from
Claim 10.) Thus the row in question cannot match 1A. The same argument applies to the columns
except the leftmost; none of them can match 1A.

Claim 12. The topmost row and leftmost column of Y each match 1A.

Proof of Claim 12. Observe that any string w matching C must have at least three 1’s in its block 0.
Now consider any row of Y that intersects square Q1,0 . This row is of the form 0r or 1r, for some
r matching A ∪ C ∪ D ∪ E. By Claim 11, this row does not match 1A, and so it must match
1C ∪ 0(D ∪ E). However, r cannot match C because (by Claim 10) r has only two 1’s in block 0.
Thus the row must match 0(D ∪ E)—in particular, it starts with 0. That means that the leftmost
column (column (−1)) has all 0’s in its block 1, and so it cannot match 1C ∪ 0(D ∪ E), and thus
it must match 1A. A similar, transposed argument shows that the topmost row must also match
1A.

Claim 13. Rows 0, . . . , ℓ − 1 and columns 0, . . . , ℓ − 1 of Y each match 1C, and the rows and
columns of Y starting with index ℓ each match 0(D ∪ E).

Proof of Claim 13. By the previous claim, the topmost row and leftmost column of Y each match
$1A = 1^{\ell+1} (0^{\ell})^{+}$. Thus rows 0, . . . , ℓ − 1 each start with 1, and the rows starting with index ℓ each
start with 0. By Claim 11, none of these rows match 1A, so rows 0 through ℓ − 1 all must match
1C and the rest must match 0(D ∪ E). A similar argument holds for the columns.

Notice that each row and column of $Q_{0,0}$ matches $(000 \cup 011 \cup 111)1^{\ell-3}$. For 0 ≤ i < (m + 1)ℓ,
let $r_i$ denote the row of Y with index i, and for 0 ≤ j < (n + 1)ℓ let $c_j$ denote the column of Y
with index j. Rows $r_0, \ldots, r_{\ell-1}$ and columns $c_0, \ldots, c_{\ell-1}$ all match 1C by Claim 13, and the rest
match 0(D ∪ E).
Claim 14. For all 0 ≤ i < ℓ, $r_i$ and $c_i$ both match $1C_i$.

Proof of Claim 14. First we show that $r_0$ and $c_0$ both match $1C_0$. Suppose that $r_0$ does not match
$1C_0$ (the argument for $c_0$ is similar). Then (since $r_0$ matches 1C) $r_0$ matches $1C_i$ for some i ≥ 1,
and so has a prefix matching $1(0 \cup 1)1^{\ell-1}$, which makes $c_1, \ldots, c_{\ell-1}$ all have 11 as a prefix. This in
turn implies that each of these columns must match $1C_j$ for some j ≥ 3. Now notice that block 1 of
any string x matching $C_j$ is $s_j$, and so if 3 ≤ j < ℓ, then x must have 0 as the next to last symbol
in its block 1. From these facts it follows that the next to last row of $Q_{1,0}$ (i.e., block 0 of $r_{2\ell-2}$)
matches $(0 \cup 1)0^{\ell-1}$. But this is impossible, because this block must have two 1’s by Claim 10.
Next we show that $r_1$ and $c_1$ match $1C_1$ and $r_2$ and $c_2$ match $1C_2$. By what we just showed, $r_1$
and $r_2$ both have prefix 10 (because $c_0$ matches $1C_0$), and so they each match $1(C_0 \cup C_1 \cup C_2)$. Neither
of them can match $1C_0$, however: Consider the 2 × 2 square S forming the intersection of rows 1, 2
with columns 1, 2. If either $r_1$ or $r_2$ matches $1C_0$, then S contains a 0, and hence at least one of
the columns $c_1$ or $c_2$ must also match $1C_0$, which implies that S contains all 0’s, which means that
both $r_1$ and $r_2$ match $1C_0$. But this would make $c_{2\ell-1}$ have prefix 0111, putting three 1’s in the last
column of $Q_{0,1}$ and contradicting Claim 10. (By a similar argument, neither $c_1$ nor $c_2$ can match
$1C_0$.) Thus we have $r_1$ and $r_2$ both matching $1(C_1 \cup C_2)$. Now $r_2$ cannot match $1C_1$, for if it does,
then $c_{2\ell-2}$ has prefix either 0101 or 0111, neither of which is possible because block 0 of $c_{2\ell-2}$ must
be $s_j$ for some j. Thus $r_2$ matches $1C_2$. We have one more case to eliminate, i.e., showing that $r_1$
cannot match $1C_2$. Suppose $r_1$ matches $1C_2$. Then column $c_{2\ell-2}$ has prefix 0100, and the only way
this can happen is if $c_{2\ell-2}$ has prefix $0s_{\ell-1}$. But that means that row $r_{\ell-1}$ has a 1 as the next to
last symbol of its block 1. Since $r_{\ell-1}$ matches 1C, this can only happen if $r_{\ell-1}$ matches $1(C_0 \cup C_1)$,
whence it has 10 as a prefix. This puts a 0 as the last symbol of block 0 of $c_0$, but this is impossible,
because $c_0$ matches $1C_0$ and hence has $10001^{\ell-3}$ as a prefix. Thus $r_1$ cannot match $1C_2$, and so it
matches $1C_1$. A symmetric argument holds for $c_1$ and $c_2$.
Finally, we show that $r_i$ matches $1C_i$ for 3 ≤ i < ℓ. This is by induction on i, starting with
i = 3, with the inductive hypothesis that $r_j$ matches $1C_j$ for all 0 ≤ j < i. We have then that
$c_{2\ell-i-1}$ has prefix $0^{i}1$ and $c_{2\ell-i}$ has prefix $0^{i-1}11$. Since both of these columns match 0(D ∪ E)
and hence must each start with $0s_j$ for some j’s, we can only have that $c_{2\ell-i-1}$ has prefix $0^{i}11$ and
$c_{2\ell-i}$ has prefix $0^{i-1}110$. Then block 1 of $r_i$ must be $s_i$, and it follows that $r_i$ matches $1C_i$.

Claim 15. Qt,0 = Q0,u = S0 for all 1 ≤ t ≤ m and 1 ≤ u ≤ n.

Proof of Claim 15. This follows immediately from Claim 14.

Claim 16. For each 1 ≤ t ≤ m and each 1 ≤ u ≤ n, $r_{t\ell}$ matches $0E^{(T)}$ and $c_{u\ell}$ matches $0E^{(U)}$.

Proof of Claim 16. By assumption, all rows of Y match $F = 1(A \cup C) \cup 0(D \cup E^{(T)})$, and all
columns of Y match $G = 1(A \cup C) \cup 0(D \cup E^{(U)})$. By Claim 15, each row $r_{t\ell}$ and each column $c_{u\ell}$
has prefix $0s_0$, and thus none can match 1(A ∪ C) ∪ 0D. Thus each such row must match $0E^{(T)}$,
and each such column matches $0E^{(U)}$.

Claim 17. For all t, u with 1 ≤ t ≤ m and 1 ≤ u ≤ n, there exists a unique $x_{t,u} \in \Sigma$ such that
$Q_{t,u} = S_{x_{t,u}}$.

Proof of Claim 17. For simplicity, we will assume t = u = 1; the same argument works for any t, u.
By Claim 13, rows $r_{\ell}, \ldots, r_{2\ell-1}$ and columns $c_{\ell}, \ldots, c_{2\ell-1}$ all match 0(D ∪ E). By Claim 15, the
ith row of $Q_{1,0}$ (i.e., block 0 of $r_{\ell+i}$) is $s_i$, for 0 ≤ i < ℓ. We have $r_{\ell}$ matching 0E by Claim 16.
For 0 ≤ j < ℓ, let $b_j$ be block 1 of $r_{\ell+j}$ (i.e., the jth row of $Q_{1,1}$), and let $b'_j$ be block 1 of $c_{\ell+j}$
(i.e., the jth column of $Q_{1,1}$). Row $r_{\ell}$ matching 0E makes $b_0 = s_{3x}$ for some unique 0 ≤ x < k.
For 1 ≤ i < ℓ, row $r_{\ell+i}$, having $s_i$ as its block 0, cannot match 0E. Thus $r_{\ell+i}$ matches 0D, and
in fact, it must match $0D_{i \bmod 3}$, owing to its block 0, and this makes $b_i = s_{(3v+i) \bmod \ell}$ for some
0 ≤ v < k. The same goes for the columns of $Q_{1,1}$. Furthermore, notice that D and E ensure that
the columns $b'_j$ and $b'_{(j-1) \bmod \ell}$ are distinct for any 0 ≤ j < ℓ, because j and j − 1 have different
remainders modulo 3.
We show by induction on 1 ≤ i < ℓ that $b_i = s_{(3x+i) \bmod \ell}$, and this will imply that $Q_{1,1} = S_x$,
finishing the proof of the claim. Now assume (inductive hypothesis) that $b_{i-1} = s_{(3x+i-1) \bmod \ell}$ (we
have established this for i = 1). We have $b_i = s_{(3v+i) \bmod \ell}$ for some 0 ≤ v < k, and so it suffices to show
that v = x. Suppose v ≠ x. Then there is no position where the strings $b_i$ and $b_{i-1}$ share a 1 in
common. The two 1’s in $b_{i-1}$ occur in columns $b'_{z_1}$ and $b'_{z_2}$ of $Q_{1,1}$, where $z_1 := (-3x - i) \bmod \ell$
and $z_2 := (z_1 - 1) \bmod \ell = (-3x - i - 1) \bmod \ell$, and by assumption, these 1’s are then immediately
followed by 0’s in their respective columns. Since $b'_{z_1} = s_{j_1}$ and $b'_{z_2} = s_{j_2}$ for some $0 \le j_1, j_2 < \ell$,
and they share the substring 10 in the same position in each, it must be that $j_1 = j_2$. But
this contradicts what we said above about columns being distinct. Therefore, v = x, and we are
done.

Claim 18. For all 1 ≤ t ≤ m and 1 ≤ u ≤ n, let $x_{t,u} \in \Sigma$ be the unique symbol such that $Q_{t,u} = S_{x_{t,u}}$
(cf. Claim 17). Then the m × n array X whose (t, u)th entry is $x_{t,u}$ forms a (T, U)-crossword.

Proof of Claim 18. For 1 ≤ t ≤ m, let $d_t := x_{t,1} \cdots x_{t,n}$, and for 1 ≤ u ≤ n, let $e_u := x_{1,u} \cdots x_{m,u}$.
We show that the $d_t$ all match T and the $e_u$ all match U. We have
\[ r_{t\ell} = 0\, s_0\, s_{3x_{t,1}} \cdots s_{3x_{t,n}} = 0\, s_0\, h(d_t) , \]
and because of the symmetry of the squares $Q_{t,u}$, we also have
\[ c_{u\ell} = 0\, s_0\, s_{3x_{1,u}} \cdots s_{3x_{m,u}} = 0\, s_0\, h(e_u) , \]
for all 1 ≤ t ≤ m and 1 ≤ u ≤ n. By Claim 16, $r_{t\ell}$ matches $0E^{(T)} = 0\, s_0\, h(T)$ and $c_{u\ell}$ matches
$0E^{(U)} = 0\, s_0\, h(U)$. Then because h is clearly a one-to-one map, it must be that $d_t$ matches T and
$e_u$ matches U.

Finally, if X is as defined in Claim 18, then it is clear by our definition of $\psi_k$ above that $Y = \psi_k(X)$.
This ends the proof of Lemma 9.
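
The decoding step behind Claims 17 and 18 can be made concrete: a row of the form $0\, s_0\, h(d)$ determines the word d uniquely, because the blocks $s_{3j}$ are pairwise distinct. The Python sketch below recovers d from such a row; the name `decode_row` and the use of `ValueError` for malformed rows are assumptions of this example.

```python
def rotl(x, i):
    """Cyclic left shift of the string x by i places."""
    i %= len(x)
    return x[i:] + x[:i]

def decode_row(row, k):
    """Invert r = 0 s_0 h(d): return d as a list of ints in range(k).

    Raises ValueError if row does not have that form.
    """
    ell = 3 * k
    s0 = "0" * (ell - 2) + "11"
    codes = {rotl(s0, 3 * j): j for j in range(k)}   # s_{3j} -> j
    if (len(row) < 1 + 2 * ell or (len(row) - 1) % ell
            or row[0] != "0" or row[1:1 + ell] != s0):
        raise ValueError("row does not have the form 0 s_0 h(d)")
    blocks = [row[p:p + ell] for p in range(1 + ell, len(row), ell)]
    try:
        return [codes[b] for b in blocks]
    except KeyError:
        raise ValueError("a block is not of the form s_{3j}") from None
```

Applied to row $r_{\ell}$ of the Figure 3 encoding (k = 5), this returns `[3, 0, 4]`, the first row of the sample crossword.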

The next two theorems are just corollaries of Lemma 9. They strengthen Theorems 4 and 7,
respectively.

Theorem 19. Let G := f (k, C), where C is as in Theorem 4, and k is the size of the alphabet
used in that proof. Then the following problem is m-equivalent to the Halting Problem:

Given a positive regular expression F over the alphabet {0, 1}, does an (F, G)-crossword
exist?

Proof. The problem is c.e. For the other direction, we m-reduce from the problem of Theorem 4
via the map f (k, ·). Given any positive regular expression R over an alphabet of size k, which we
can assume is {0, . . . , k − 1}, we set F := f (k, R). Then an (F, G)-crossword exists if and only if
an (R, C)-crossword exists, by Lemma 9.

Theorem 20. The following problem is m-equivalent to the Halting Problem:

Given a positive regular expression E′ over the alphabet {0, 1}, does an (E′, E′)-crossword
exist?

Proof. This works as in the proof of Theorem 19. The problem is c.e. Conversely, we m-reduce
from the problem of Theorem 7. Given a positive regular expression E, we can effectively determine
the size k of E’s alphabet. Then adjusting the alphabet to {0, . . . , k − 1}, we let E′ := f(k, E),
where f is the function of Lemma 9. Then E′ is positive, and an (E′, E′)-crossword exists if and
only if an (E, E)-crossword exists.

5 Square crosswords
An m × n Σ-crossword is square iff m = n. In this section, we explain briefly why the complexities
of all our problems are unaffected by restricting all crosswords to be square.
First, in the proof of Lemma 3, the R and C we construct are such that if an m × n (R, C)-
crossword exists, then m ≥ n. This is because each row records a configuration of the machine M ,
and each column records a tape cell that is scanned at least once, and M can only scan at most
as many different tape cells as there are configurations. Thus to allow a square (R, C)-crossword,
we only need to pad with (blank) cells that are never scanned. Letting $C' := C \cup [B]^{+}$, we get
that an (R, C)-crossword exists if and only if an (R, C′)-crossword exists, if and only if a square
(R, C′)-crossword exists.
Next, the map ρ of Lemma 6 clearly preserves squareness: every m × n (R, C)-crossword (for
m, n ≥ 2) maps to an (m + 1) × (n + 1) (E, E)-crossword and vice versa. Finally, the maps ψk of
Lemma 9 also preserve squareness. An m × n (T, U )-crossword maps under ψk to a (3k(m + 1) +
1) × (3k(n + 1) + 1) (F, G)-crossword and vice versa.

6 Complexity
It has been observed in [Tak14], and independently by us, that if separate regular expressions for
each of the rows and columns are specified for a particular size grid, then the existence problem is
NP-hard, and this is true even for a binary alphabet. The proof (by FrankW) described in [Tak14]
is via a polynomial reduction from VERTEX COVER, which, for the sake of completeness, we
reproduce in Appendix A as well as a reduction from 3-SAT that we found independently. Both
reductions map to regular expressions over binary alphabets. In the former reduction, the regular
expressions constructed for the columns, except for the first, are all the same fixed expression
0∗ 1(0 ∪ 1)∗ , independent of the input. In our latter reduction, all the columnar regular expressions
are the same fixed expression 0+ ∪1+ , independent of the input. In each proof, however, the regular
expressions for the rows are all different from each other. Both results are therefore strengthened
by Theorem 22, below, which is analogous to Theorem 4. First, a technical lemma.

Lemma 21. There exist a polynomial q, a polynomial-time computable function r, and a positive
regular expression C′ over {0, 1} such that, for any Boolean formula ϕ,

1. R′ := r(ϕ) is a positive regular expression over {0, 1},

2. (R′, C′) is plural,

3. every (R′, C′)-crossword is q × q, where q := q(|ϕ|), and

4. the number of (R′, C′)-crosswords is equal to the number of satisfying truth assignments to ϕ.

Proof. We modify slightly the proof of Lemma 3 applied to a Turing machine M such that, on any
input w of length n:

1. M ’s tape alphabet contains (at least) the nonblank symbols 0 and 1 and blank symbol B,

2. M ’s computation satisfies the technical conditions given at the start of that proof with respect
to w,

3. if w encodes some Boolean formula ϕ with variables $x_0, \ldots, x_{k-1}$ for some k ≤ n, then for
any $a \in \{0, 1\}^k$, with wBa initially on its tape, M scans wBa in its entirety and halts if and
only if a is a satisfying truth assignment for ϕ, and

4. if M halts, then it halts after exactly p(n) − 1 many steps (thus including p(n) many config-
urations), for some appropriately chosen polynomial p with integer coefficients, independent
of w, such that p(n) ≥ 2n + 3 for all n ≥ 0.

Such a machine M and polynomial p clearly exist. Under these assumptions, we can change the
definition of $I_w$ in Equation 1 to accommodate the presence of a on the tape:
\[ I_w := [B, {\downarrow} q_1][B, q_0][w_1] \cdots [w_n][B]([0] \cup [1])^{k} [B]^{p(n)-n-k-3} , \]
provided $w = w_1 \cdots w_n$ encodes a Boolean formula with k ≤ n variables. Note that $I_w$ is only
matched by strings of length p(n). The rest of the definition of R remains the same. We also
modify C just as we did in Section 5: $C'' := C \cup [B]^{+}$, where C is as in the proof of Lemma 3. Under
these modifications, both R and C″ remain positive. Now setting p := p(n), we observe that for
any w encoding a Boolean formula ϕ with k ≤ n variables,
\begin{align*}
\varphi \text{ is satisfiable} &\iff M \text{ halts on } wBa \text{ for some } a \in \{0, 1\}^k \\
&\iff \text{an } (R, C'')\text{-crossword exists},
\end{align*}
and if such is the case, then owing to the determinism and running time of M, the (R, C″)-crossword
for each such a is unique, is of size p × p, and both w and a are easily recoverable from it, which
implies that the number of (R, C″)-crosswords is equal to the number of satisfying assignments to ϕ.
Also by Lemma 3, given ϕ we can compute R, C, and $0^p$ all in polynomial time.
Finally, we apply the function f of Lemma 9 to both R and C″. Let Σ be the alphabet of R
and C″ (cf. Lemma 3). By renaming if necessary, we may assume that Σ = {0, . . . , ℓ − 1} for some
ℓ. Then we set
\begin{align*}
q &:= 3\ell(p + 1) + 1 , \\
R' &:= f(\ell, R) , \\
C' &:= f(\ell, C'') , \\
r(\varphi) &:= R' .
\end{align*}
Any (R′, C′)-crossword thus has exactly q = 3ℓ(p + 1) + 1 rows and columns. The expressions
R′ and C′ are both positive by Lemma 9, and so (R′, C′) is plural, because q ≥ 2. Finally, since
f is polynomial-time computable (with constant ℓ), so is r, and since f preserves the number of
crosswords, the number of q × q (R′, C′)-crosswords equals the number of p × p (R, C″)-crosswords,
which equals the number of assignments satisfying ϕ.

Theorem 22. For the regular expression C′ of Lemma 21, the following decision problem D is
NP-complete (with respect to polynomial reductions):

Given as input a positive regular expression R over the alphabet {0, 1} and a positive
integer q in unary, does a q × q (R, C′)-crossword exist?

Proof. D belongs to the class NP, because one can verify in deterministic polynomial time whether
or not a given q × q crossword (which has size polynomial in q) is an (R, C′)-crossword. (It is
well-known that the problem, “Given a regular expression R and string w, does w match R?” is
decidable in polynomial time, uniformly in R and w.)
For NP-hardness, let q := q(n) be the polynomial and r the function defined in the proof of
Lemma 21. Then that lemma implies that the map taking a Boolean formula ϕ to ⟨r(ϕ), $0^q$⟩ where
q := q(|ϕ|) is a polynomial reduction from SAT to D.
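
The verification step behind NP membership amounts to matching every row against R and every column against C′. A minimal Python sketch follows, assuming the expressions are written in Python `re` syntax and the grid is a list of equal-length strings; the name `is_crossword` is hypothetical. Python's `re` engine is a backtracking matcher and is used here only for brevity, so it does not by itself give the worst-case polynomial bound that automaton-based matching provides.

```python
import re

def is_crossword(grid, row_regex, col_regex):
    """Check that every row fully matches row_regex and every column
    fully matches col_regex. grid is a list of equal-length strings."""
    if not grid or not grid[0] or any(len(r) != len(grid[0]) for r in grid):
        return False
    row_pat, col_pat = re.compile(row_regex), re.compile(col_regex)
    cols = ["".join(c) for c in zip(*grid)]
    return (all(row_pat.fullmatch(r) for r in grid)
            and all(col_pat.fullmatch(c) for c in cols))
```

For example, with both expressions equal to `0*10*` (exactly one 1), the accepted square grids are exactly the permutation matrices.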

The next theorem is analogous to Theorem 7.

Theorem 23. The following decision problem S is NP-complete (with respect to polynomial reduc-
tions):

Given a positive regular expression E over the alphabet {0, 1} and positive integer s in
unary, does an s × s (E, E)-crossword exist?

Proof. S is clearly in NP. For NP-hardness, let C′, q, and r be as in Lemma 21, and let b be the
function of Lemma 6. Given any Boolean formula ϕ, we show that we can produce in polynomial
time a pair ⟨E, $0^s$⟩ such that ϕ is satisfiable if and only if an s × s (E, E)-crossword exists. We do
this by composing polynomial-time functions as follows:

1. Let R′ := r(ϕ) and q := q(|ϕ|). Then, as in the proof of Theorem 22, a q × q (R′, C′)-crossword
exists if and only if ϕ is satisfiable. Furthermore, (R′, C′) is plural by Lemma 21.

2. Let E′ := b({0, 1}, R′, C′). Then by Lemma 6, E′ is a positive regular expression over the
five-letter alphabet Σ′ := {0, 1, ♥, ♦, ♠} such that a (q + 1) × (q + 1) (E′, E′)-crossword exists
if and only if a q × q (R′, C′)-crossword exists.

3. By renaming letters, we can assume that Σ′ = {0, 1, 2, 3, 4}. Then let E := f(5, E′), where f
is the function of Lemma 9, and let s := 15(q + 2) + 1. Then by Lemma 9, E is a positive
regular expression over the alphabet {0, 1}, and an s × s (E, E)-crossword exists if and only
if a (q + 1) × (q + 1) (E′, E′)-crossword exists, if and only if a q × q (R′, C′)-crossword exists,
if and only if ϕ is satisfiable.

The map ϕ ↦ ⟨E, $0^s$⟩ is thus a polynomial reduction from SAT to S.
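
The size bookkeeping in step 3 can be checked with a few lines of Python. The sketch assumes only the dimension formula of Lemma 9 (a side of length m becomes 3k(m + 1) + 1) and the fact that ρ adds one row and one column; the function names are hypothetical.

```python
def psi_side(k, m):
    """Side length produced by psi_k from a side of length m (Lemma 9)."""
    return 3 * k * (m + 1) + 1

def theorem23_side(q):
    """The value s of Theorem 23: border a q x q crossword with rho
    (q becomes q + 1), then apply psi_5 over the five-letter alphabet."""
    return psi_side(5, q + 1)
```

With k = 5 this gives `theorem23_side(q) == 15 * (q + 2) + 1`, the value of s used in the proof; similarly, `psi_side(ell, p)` is the q = 3ℓ(p + 1) + 1 of Lemma 21.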

Corollary 24. For the regular expression C′ of Lemma 21, the following two decision problems
are NP-complete:

Given a positive regular expression R over the alphabet {0, 1} and positive integers m
and n in unary, does an m × n (R, C 0 )-crossword exist?

Given a positive regular expression E over the alphabet {0, 1} and positive integers m
and n in unary, does an m × n (E, E)-crossword exist?

6.1 Further results


Since Lemma 21 controls not just the existence but the number of crosswords, we can get more
information out of it. We list a few other results here that follow easily from Lemma 21.

• Counting the number of (R, C)-crosswords of given dimensions (given in unary) is polynomi-
ally equivalent to counting the number of satisfying assignments to a Boolean formula, and
hence is complete for the class #P [Val79].

• As with sudoku puzzles, someone who wants to solve a regex crossword puzzle (found online or
in a newspaper, say) should reasonably expect that a solution exists and is unique. Does the
promise of a unique solution make solving the puzzle any easier in the worst case? The answer
is no, at least with respect to randomized polynomial reductions. Consider the following
search problem:

Input: Regular expressions R and C, and integers m, n ≥ 1 in unary.
Promise: A unique m × n (R, C)-crossword exists.
Output: The m × n (R, C)-crossword.

Lemma 21 and its proof say that this problem is polynomially equivalent to finding the unique
satisfying assignment to a Boolean formula ϕ with the promise that ϕ is uniquely satisfiable.
This latter problem is known to be NP-hard with respect to randomized polynomial reductions
[VV86].
• Shifting perspective from the last item, a regex crossword puzzle maker may want a test to
determine, given regular expressions R and C and m, n ≥ 1 in unary, whether or not a unique
solution exists. Lemma 21 says that this is polynomially equivalent to USAT, the language
of all uniquely satisfiable Boolean formulas. USAT is known to be NP-hard (it is in the class
$D^p$, the first level of the difference hierarchy over NP).
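
For very small grids, the counting and uniqueness questions in the items above can be explored by brute force. The Python sketch below enumerates all candidate grids, so it takes exponential time and says nothing about the complexity results themselves; the function names, the default binary alphabet, and the use of Python `re` syntax for R and C are assumptions of this example.

```python
import re
from itertools import product

def count_crosswords(row_regex, col_regex, m, n, alphabet="01"):
    """Count the m x n grids whose rows all fully match row_regex and
    whose columns all fully match col_regex (exhaustive search)."""
    row_pat, col_pat = re.compile(row_regex), re.compile(col_regex)
    # Only strings of length n that match R can serve as rows.
    good_rows = ["".join(r) for r in product(alphabet, repeat=n)
                 if row_pat.fullmatch("".join(r))]
    count = 0
    for grid in product(good_rows, repeat=m):
        if all(col_pat.fullmatch("".join(col)) for col in zip(*grid)):
            count += 1
    return count

def has_unique_solution(row_regex, col_regex, m, n, alphabet="01"):
    """The puzzle maker's test: is there exactly one solution?"""
    return count_crosswords(row_regex, col_regex, m, n, alphabet) == 1
```

For instance, with R = C = `0*10*` the n × n solutions are the permutation matrices, so the count is n!.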
Finally, our techniques can be modified easily to show that if the dimensions of the crossword are
both given in binary instead of unary, then the (R, C)-crossword existence problem is complete for
NEXP (nondeterministic exponential time) under polynomial reductions. If one of the dimensions
is given in unary and the other in binary, then the problem becomes PSPACE-complete. (PSPACE-
hardness follows from the techniques of Section 6; membership in PSPACE follows by modifying
slightly the proof of Theorem 8.)

7 Open questions
We have shown that it is NP-hard to determine whether an (R, C)-crossword exists of given di-
mensions (specified in unary), even when R and C are over the binary alphabet. Our reduction
from SAT is rather complicated and indirect, however. It would be nice to know if a simple, di-
rect reduction from some NP-complete problem exists—perhaps some modification of one of the
reductions given in Appendix A.
Theorem 4 gives undecidability for a particular fixed expression C. One may ask more generally:
For which C is the corresponding problem undecidable? How hard is it to determine, given a C,
whether the corresponding problem is decidable? We conjecture that this latter question is m-
complete for Σ3 , the third Σ-level of the arithmetic hierarchy (see, e.g., [Soa87]). Similar questions
can be asked about Theorem 22. For example: For which C is the question (i) NP-hard; (ii)
decidable in polynomial time?

7.1 Two-player regex crossword games


One can imagine a variety of two-player games involving regex crosswords, and some of these may
actually be fun to play. For example:
1. A blank m×n grid is given to start, along with regular expressions R1 , . . . , Rm and C1 , . . . , Cn .
Player 1 (who plays rows) fills in the first row to match R1 , then Player 2 (who plays columns)
fills in the rest of the first column so that it matches C1 , then Player 1 fills in the rest of
row 2 so that it matches R2 , then Player 2 column 2, etc.
2. Same as above, but each player can choose an incomplete row (respectively column) to fill in
on each turn.
3. Same as in item 1 above, but both players fill in rows in order, and a move is legal iff each
column can be completed to match its corresponding Cj (this may or may not be easy to
determine).

4. Same as in the last item, but players can choose rows to fill in on each turn.

In all these games, the last player able to make a legal move wins. We conjecture that for all these
games, determining whether Player 1 has a winning strategy is PSPACE-hard, even if all the Ri
are equal and all the Cj are equal and independent of the input, or if all the Ri and Cj are equal
to each other. (It is straightforward to prove that all these problems are in PSPACE.)
One might also consider some unbounded versions of these games:

1. Positive regular expressions R and C are given, but the size of the grid is not. Player 1 first
chooses an arbitrary string r1 matching R for the first row of the grid (thus fixing the number
of columns). Player 2 then chooses an arbitrary string c1 matching C for the first column of
the grid (except the first symbol of c1 must equal that of r1 ), thus fixing the number of rows.
Players then proceed as in the games mentioned previously.

2. Same as the last item, but on their first move, each player chooses a string r (respectively c)
and says which row (respectively column) this string is to fill.

The first two moves in each of these games are unbounded, but thereafter, the grid dimensions are
fixed, and so determining the winner under optimal play is decidable, given the first two moves.
The problem of determining if Player 1 wins without knowing the first two moves is then in the class
Σ2 , the second Σ-level of the arithmetic hierarchy (i.e., it is c.e. relative to the Halting Problem).
We conjecture that it is m-complete for this class.

Acknowledgments
The author would like to thank Josh Cooper for introducing him to the subject by showing him the
three-way regex crossword in [Bla13]. He also thanks Jason O’Kane for first suggesting to him the
NP-completeness question for regex crosswords as an exercise. Finally, much of this work was done
at the Dagstuhl seminar 14391, “Algebra in Computational Complexity.” The author wishes to
thank the organizers of that seminar and especially Thomas Thierauf and the Technical University
of Ulm (Germany) for their hospitality and lively discussions on this and other topics. Thanks also
to Klaus-Jörn Lange for providing pointers to the literature on two-dimensional languages.

References
[Ber66] Robert Berger. The Undecidability of the Domino Problem. Number 66 in Memoirs of
the American Mathematical Society. American Mathematical Society, Providence, Rhode
Island, 1966. MR0216954.

[Bla13] Lucy Black. Can you do the regular expression crossword? I Programmer, February 2013.
http://www.i-programmer.info/news/144-graphics-and-games/5450-can-you-do-the-regular-expression-crossword.html.

[GR92] D. Giammarresi and A. Restivo. Recognizable picture languages. International Journal
of Pattern Recognition and Artificial Intelligence, pages 31–46, 1992.

[GR97] D. Giammarresi and A. Restivo. Two-dimensional languages. In A. Salomaa and
G. Rozenberg, editors, Handbook of Formal Languages, volume 3, chapter 96, pages
215–267. Springer-Verlag, 1997.

[HMU07] J. E. Hopcroft, R. Motwani, and J. D. Ullman. Introduction to Automata Theory,
Languages, and Computation. Pearson, 3rd edition, 2007.

[LS97] M. Latteux and D. Simplot. Recognizable picture languages and domino tiling. Theoret-
ical Computer Science, 178(1-2):275–283, 1997. Note.

[mik13] February 2013. Slashdot discussion,
http://games.slashdot.org/story/13/02/13/2346253/can-you-do-the-regular-expression-crossword.

[MIT] MIT Mystery Hunt. http://www.mit.edu/~puzzle.

[rc] http://regexcrossword.com.

[rcs] http://www.regexcrosswords.com.

[roy] Royal dinner. http://regexcrossword.com/challenges/experienced/puzzles/1.

[Sip13] M. Sipser. Introduction to the Theory of Computation. Cengage Learning, 3rd edition,
2013.

[Soa87] R. I. Soare. Recursively Enumerable Sets and Degrees. Perspectives in Mathematical
Logic. Springer-Verlag, Berlin, 1987.

[Tak14] Glen Takahashi. Are regex crosswords NP-hard?, September 2014.
CS Stack Exchange question 30143, answered by FrankW;
http://cs.stackexchange.com/questions/30143/are-regex-crosswords-np-hard.

[Val79] L. Valiant. The complexity of computing the permanent. Theoretical Computer Science,
8:189–201, 1979.

[VV86] L. Valiant and V. Vazirani. NP is as easy as detecting unique solutions. Theoretical
Computer Science, 47:85–93, 1986.

A Easy polynomial reductions from NP-complete problems


Theorem 25 (FrankW [Tak14]). The following decision problem is NP-hard: “Given lists of regular
expressions ⟨R_1, …, R_m⟩ and ⟨C_1, …, C_n⟩, all over the binary alphabet {0, 1}, does there exist an
m × n array of 0’s and 1’s whose ith row matches R_i for all 1 ≤ i ≤ m and whose jth column
matches C_j for all 1 ≤ j ≤ n?”

Proof. We describe a polynomial reduction from the NP-complete language VERTEX COVER.
Let ⟨G, k⟩ be an instance of VERTEX COVER, where G is a graph with m vertices v_1, …, v_m and
n edges e_1, …, e_n, and k is a positive integer. We define C_1 := C_2 := ··· := C_n := 0∗1(0 ∪ 1)∗,
each of which is matched exactly by the binary strings with at least one 1. We define
C_{n+1} := (0∗1?)^k 0∗, which is matched exactly by the binary strings with at most k many 1’s.
For 1 ≤ i ≤ m, let r_i be the length-n string whose jth symbol is 1 if v_i is an endpoint of edge e_j,
and 0 otherwise, then define R_i := r_i 1 ∪ 0∗.

This construction can clearly be done in polynomial time. We now show that an m × (n + 1)
crossword exists where the ith row matches R_i and the jth column matches C_j, for all i, j, if and
only if G has a vertex cover of size ≤ k.
To see this, first assume that C is a vertex cover for G of size ≤ k. Then let the ith row be r_i 1
if v_i ∈ C and 0^{n+1} otherwise. Then there is at least one 1 in each of the first n columns, because at
least one endpoint of each edge is in C. Also, there are at most k many 1’s in the (n + 1)st column,
because |C| ≤ k. Thus all the row and column regular expressions are matched. Conversely,
suppose all the row and column expressions are matched. Let C := {v_i | row i ends with 1}. Then
|C| ≤ k due to the last column, and the ith row must be r_i 1 for all i such that v_i ∈ C (a row
ending with 1 cannot match 0∗), while every other row is 0^{n+1}. Then C is a vertex cover: each of
the first n columns contains a 1, that 1 lies in some row r_i 1 with v_i ∈ C, and hence each edge has
an endpoint in C.
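
The construction in this proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from the paper: the names `vertex_cover_to_crossword` and `crossword_exists` are hypothetical, vertices are numbered from 0, edges are given as pairs of vertices, and the expressions use Python `re` syntax (`|` for ∪ and `{k}` for a k-fold power). The existence check is a brute-force search that is exponential in the grid size and serves only to confirm the reduction on tiny graphs.

```python
import re
from itertools import product

def vertex_cover_to_crossword(num_vertices, edges, k):
    """Return (rows, cols): the expressions R_1..R_m and C_1..C_{n+1}.

    Vertices are 0..num_vertices-1 and each edge is a pair of vertices
    (a representation chosen for this sketch, not taken from the paper).
    """
    n = len(edges)
    # C_1..C_n: at least one 1.  C_{n+1}: at most k ones.
    cols = ["0*1(0|1)*"] * n + ["(0*1?){%d}0*" % k]
    rows = []
    for v in range(num_vertices):
        # r_v has a 1 in position j exactly when v is an endpoint of edge j.
        r = "".join("1" if v in e else "0" for e in edges)
        rows.append(r + "1|0*")  # R_v := (r_v 1) union 0*
    return rows, cols

def crossword_exists(rows, cols):
    """Brute-force search over binary grids; exponential time."""
    width = len(cols)
    strings = ["".join(t) for t in product("01", repeat=width)]
    # For each row expression, keep only the strings of the right length it matches.
    candidates = [[s for s in strings if re.fullmatch(R, s)] for R in rows]
    for grid in product(*candidates):
        columns = ("".join(row[j] for row in grid) for j in range(width))
        if all(re.fullmatch(C, col) for C, col in zip(cols, columns)):
            return True
    return False

triangle = [(0, 1), (1, 2), (0, 2)]
print(crossword_exists(*vertex_cover_to_crossword(3, triangle, 2)))
print(crossword_exists(*vertex_cover_to_crossword(3, triangle, 1)))
```

Since a triangle has a vertex cover of size 2 but none of size 1, the first call reports that a crossword exists and the second reports that none does.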

Theorem 26. The following decision problem is NP-complete: “Given a list of regular expressions
⟨R_1, …, R_m⟩, all over the binary alphabet {0, 1}, and a positive integer n in unary, does there exist
an m × n array of 0’s and 1’s, all of whose columns match 0∗ ∪ 1∗ and whose ith row matches R_i
for all 1 ≤ i ≤ m?”

Proof. The problem is in NP because testing whether a given string w matches a given regular
expression E can be done in polynomial time, uniformly in |⟨w, E⟩|. To show NP-hardness, we
reduce from 3-SAT. Given a 3-cnf Boolean formula ϕ = c_1 ∧ ··· ∧ c_m over Boolean variables
x_1, …, x_n, we construct an instance ⟨R_1, …, R_m⟩ so that any 0–1 array satisfying the criterion is
m × n and encodes the truth value of each x_j with respect to some satisfying assignment in the
jth column. Each R_i ensures that the ith clause is satisfied by the assignment.
If the jth column matches C := 0∗ ∪ 1∗, then it is either all 1s, meaning x_j is set to TRUE,
or all 0s, meaning x_j is set to FALSE. For 1 ≤ i ≤ m, suppose c_i = ℓ_{i_1} ∨ ℓ_{i_2} ∨ ℓ_{i_3}, where
1 ≤ i_1 < i_2 < i_3 ≤ n and each literal ℓ_k is either x_k or ¬x_k. For j = 1, 2, 3, set b_j := 1 if
ℓ_{i_j} = x_{i_j} and b_j := 0 otherwise. Then finally set

R_i = (0 ∪ 1)^{i_1 − 1} b_1 (0 ∪ 1)^{n − i_1} ∪ (0 ∪ 1)^{i_2 − 1} b_2 (0 ∪ 1)^{n − i_2} ∪ (0 ∪ 1)^{i_3 − 1} b_3 (0 ∪ 1)^{n − i_3}.

This construction can clearly be done in polynomial time.


R_i is matched by any string of length n that contains, at position i_1 or i_2 or i_3, a truth value
satisfying the corresponding literal. C guarantees that the truth values are consistent across all
clauses. Thus such an array exists if and only if ϕ is satisfiable.
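
The reduction from 3-SAT can likewise be sketched in Python. The sketch rests on assumptions that are not in the paper: clauses are encoded as tuples of nonzero integers, with +j standing for x_j and -j for its negation, the function names `three_sat_to_rows` and `find_assignment` are hypothetical, and the expressions use Python `re` syntax. Because every column must match 0∗ ∪ 1∗, all rows of a solution are identical, so the exponential-time search below only looks for a single length-n string that matches every R_i.

```python
import re
from itertools import product

def three_sat_to_rows(clauses, n):
    """Build R_1..R_m from clauses given as tuples of nonzero integers.

    Literal +j stands for x_j and -j for its negation (an encoding
    assumed for this sketch).
    """
    rows = []
    for clause in clauses:
        alternatives = []
        for lit in clause:
            j = abs(lit)
            b = "1" if lit > 0 else "0"
            # (0|1)^(j-1) b (0|1)^(n-j): position j carries the bit b.
            alternatives.append("(0|1){%d}%s(0|1){%d}" % (j - 1, b, n - j))
        rows.append("|".join(alternatives))
    return rows

def find_assignment(rows, n):
    """Return a string that can fill every row of a solution, or None.

    Constant columns force all rows to be equal, so it suffices to find
    one length-n string matching every row expression.
    """
    for bits in product("01", repeat=n):
        a = "".join(bits)
        if all(re.fullmatch(R, a) for R in rows):
            return a
    return None

# (x1 or x2 or x3) and (not x1 or not x2 or not x3)
rows = three_sat_to_rows([(1, 2, 3), (-1, -2, -3)], 3)
print(find_assignment(rows, 3))
```

The jth symbol of a returned string is the truth value assigned to x_j, and a result of `None` corresponds to an unsatisfiable formula.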
