Tight Lower Bounds For Query Processing On Streaming and External Memory Data
[…] $(o(\sqrt[5]{n}),\ O(\sqrt[5]{n}))$-bounded Turing machines that are not
allowed to write intermediate results to the external memory tape
(cf. Corollary 4.9).
We show (cf. Theorem 4.5) that for some XQuery queries, filtering is
impossible for machines with $r(T) \cdot s(T) \in o\big(\frac{n}{\log n}\big)$,
where $n$ is the size of the input XML document $T$.
We show (cf. Corollary 5.5) that for some Core XPath [12] queries, filtering is
impossible for machines with $r(T) \cdot s(T) \in o(d)$, where $d$ denotes the
depth of the input XML document $T$. Furthermore, we show that the lower bound
on Core XPath is tight in that we give an algorithm that solves the Core XPath
filtering problem with a single scan of the external data (zero reversals) and
$O(d)$ buffer space.
The primary technical machinery that we use for obtaining lower bounds
is that of communication complexity (cf. [21]). Techniques from communication
complexity have been used previously to study queries on streams [4, 6, 2, 3, 5, 23,
24, 18]. The work reported on in [4] addresses the problem of determining whether
a given relational query can be evaluated scalably on a data stream or not at
all. In comparison, we ask for tight bounds on query evaluation problems, i.e.
we give algorithms for query evaluation that are in a sense worst-case optimal.
As we do, the authors of [6] study XPath evaluation; however, they focus on
instance data complexity while we study worst-case bounds. This allows us to
find strong and tight bounds for a greater variety of query evaluation problems.
Many of our results apply beyond stream processing in a narrow sense to a more
general framework of queries on data in external storage. Also, our worst-case
bounds apply for any evaluation algorithm possible, that is, our bounds are not
in terms of complexity classes closed under reductions that allow for nonlinear
expansions of the input (such as LOGSPACE) as is the case for the work on the
complexity of XPath in [12, 13, 28].
Lower bound results for a machine model with multiple external memory
tapes (or hard disks) are presented in [17]. In the present paper, we only consider
a single external memory tape, and are consequently able to show (sometimes
exponentially) stronger lower bounds.
Due to space limitations we had to defer detailed proofs of our results to the
full version of this paper [16] which extends the present paper by an appendix
that contains proofs of all results presented here.
2 Preliminaries
In this section we fix some basic notation concerning trees, streams, and query
languages. We write $\mathbb{N}$ for the set of non-negative integers. If $M$ is
a set, then $2^M$ denotes the set of all subsets of $M$. Throughout this paper we
make the following convention: Whenever the letters $r$ and $s$ denote functions
from $\mathbb{N}$ to $\mathbb{N}$, then these functions are monotone, i.e., we
have $r(x) \le r(y)$ and $s(x) \le s(y)$ for all $x, y \in \mathbb{N}$ with
$x \le y$.
Trees and Streams. We use standard notation for trees and streamed trees
(i.e. documents). In particular, we write Doc(T) to denote the XML document
associated with an XML document tree T. An example is given in Figure 1.
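Query Languages. By $\mathrm{Eval}(\cdot, \cdot)$ we denote the evaluation
function that maps each tuple $(Q, T)$, consisting of a query $Q$ and a tree $T$,
to the corresponding query result. Let $\mathcal{Q}$ be a query language and let
$\mathcal{T}_1 \subseteq \mathit{Trees}_\tau$ and
$\mathcal{T}_2 \subseteq \mathcal{T}_1$. We say that $\mathcal{T}_2$ can be
filtered from $\mathcal{T}_1$ by a $\mathcal{Q}$-query if, and only if, there is
a query $Q \in \mathcal{Q}$ such that the following is true for all
$T \in \mathcal{T}_1$:
$T \in \mathcal{T}_2 \iff \mathrm{Eval}(Q, T) \neq \emptyset$.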
We assume that the reader is familiar with first-order logic (FO) and monadic
second-order logic (MSO). An FO- or MSO-sentence (i.e., a formula without any
free variable) specifies a Boolean query, whereas a formula with exactly one free
first-order variable specifies a unary query, i.e., a query which selects a set of
nodes from the underlying input tree.
It is well-known [9, 30] that the MSO-definable Boolean queries on binary
trees are exactly the (Boolean) queries that can be defined by finite (deterministic
or nondeterministic) bottom-up tree automata. An analogous statement is true
about MSO on unranked trees and unranked tree automata [7].
Theorem 4.5 in Section 4 gives a lower bound on the worst-case complexity
of the language XQuery. As we prove a lower bound for one particular XQuery
query, we do not give a formal definition of the language but refer to [33].
Apart from FO, MSO, and XQuery, we also consider a fragment of the XPath
language, Core XPath [12, 13]. As we will prove not only lower, but also upper
bounds for Core XPath, we give a precise definition of this query language in
[16]. An example of a Core XPath query is
/descendant::*[child::A and child::B]/child::*,
which selects all children of descendants of the root node that (i.e., the
descendants) have a child node labeled A and a child node labeled B.
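To make the semantics of the example query concrete, the following is a minimal sketch of a naive in-memory evaluation in Python. It is not the linear-time algorithm of [12]; the function name `eval_example_query` and the sample document are hypothetical, and the sketch assumes the usual XPath convention that the root node is the document node, so that the top-level element itself counts as a descendant.

```python
import xml.etree.ElementTree as ET

def eval_example_query(root):
    """Naively evaluate /descendant::*[child::A and child::B]/child::*.

    `root` is the top-level element. Assumption: the XPath root node is the
    document node, so the top-level element is itself one of its descendants.
    """
    selected = []
    # root.iter() visits the top-level element and all elements below it.
    for node in root.iter():
        child_tags = {child.tag for child in node}
        # Predicate [child::A and child::B]: some child is labeled A, some B.
        if "A" in child_tags and "B" in child_tags:
            # Step /child::*: select every child of the qualifying node.
            selected.extend(node)
    return selected  # a list of nodes, not necessarily in document order

doc = ET.fromstring("<r><x><A/><B/><C/></x><y><A/></y></r>")
print([n.tag for n in eval_example_query(doc)])
```

Only the node x has both an A-child and a B-child, so the query selects exactly the three children of x.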
Core XPath is a strict fragment of XPath [12], both syntactically and
semantically. It is known that Core XPath is in LOGSPACE w.r.t. data complexity
and P-complete w.r.t. combined complexity [13]. In [12], it is shown that Core
XPath can be evaluated in time $O(|Q| \cdot |D|)$, where $|Q|$ is the size of the
query and $|D|$ is the size of the XML data. Furthermore, every Core XPath query
is equivalent to a unary MSO query on trees (cf., e.g., [11]).
Communication complexity. To prove basic properties and lower bounds
for our machine model, we use some notions and results from communication
complexity, cf., e.g., [21].
Let $A, B, C$ be sets and let $F : A \times B \to C$ be a function. In Yao's [34]
basic model of communication two players, Alice and Bob, jointly want to evaluate
$F(x, y)$, for input values $x \in A$ and $y \in B$, where Alice only knows $x$
and Bob only knows $y$. The two players can exchange messages according to some
fixed protocol $T$ that depends on $F$, but not on the particular input values
$x, y$. The exchange of messages starts with Alice sending a message to Bob and
ends as soon as one of the players has enough information on $x$ and $y$ to
compute $F(x, y)$.
$T$ is called a $k$-round protocol, for some $k \in \mathbb{N}$, if the exchange
of messages consists, for each input $(x, y) \in A \times B$, of at most $k$
rounds. The cost of $T$ on input $(x, y)$ is the number of bits communicated by
$T$ on input $(x, y)$. The cost of $T$ is the maximal cost of $T$ over all inputs
$(x, y) \in A \times B$. The communication complexity of $F$,
$\text{comm-compl}(F)$, is defined as the minimum cost of $T$, over all protocols
$T$ that compute $F$. For $k \ge 1$, the $k$-round communication complexity of
$F$, $\text{comm-compl}_k(F)$, is defined as the minimum cost of $T$, over all
$k$-round protocols $T$ that compute $F$.
Many powerful tools are known for proving lower bounds on communication
complexity, cf., e.g., [21]. In the present paper we will use the following basic
lower bound for the problem of deciding whether two sets are disjoint.
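Definition 2.1. For $n \in \mathbb{N}$ let the function
$\mathrm{Disj}_n : 2^{\{1,\dots,n\}} \times 2^{\{1,\dots,n\}} \to \{0, 1\}$
be given via
$$\mathrm{Disj}_n(X, Y) := \begin{cases} 1, & \text{if } X \cap Y = \emptyset \\ 0, & \text{otherwise.} \end{cases}$$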
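As an illustration of the communication model applied to $\mathrm{Disj}_n$, the following Python sketch implements the obvious one-round protocol in which Alice sends the characteristic vector of her set and Bob decides. It only demonstrates the trivial upper bound of $n$ bits; the function names `disj` and `one_round_protocol` are hypothetical and not taken from the paper.

```python
def disj(x, y):
    """Disj_n(X, Y): 1 if X and Y are disjoint, 0 otherwise."""
    return 1 if not (set(x) & set(y)) else 0

def one_round_protocol(n, x, y):
    """Trivial one-round protocol for Disj_n.

    Alice (who knows only x) sends the n-bit characteristic vector of x;
    Bob (who knows only y) then computes the answer on his own.
    Returns the pair (answer, number of bits communicated).
    """
    # Alice's single message: bit i-1 is 1 exactly if element i is in x.
    message = [1 if i in x else 0 for i in range(1, n + 1)]
    # Bob inspects only the message and his own set y.
    answer = 1 if all(message[i - 1] == 0 for i in y) else 0
    return answer, len(message)

print(one_round_protocol(5, {1, 3}, {2, 4}))
print(one_round_protocol(5, {1, 3}, {3, 5}))
```

The cost of this protocol is $n$ on every input, independent of the sets involved.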
[…] $\sum_{i=1}^{u} \mathrm{space}(\rho, i) \le s(n)$,
where $u$ is the number of internal tapes of $M$.
(b) A string-language $L$ […] that belong to $L$.
(c) A function $f$ : […] $\bigcup_{r \in R,\, s \in S} ST(r, s)$.
If $k \in \mathbb{N}$ is a constant, then we write $ST(k, s)$ instead of
$ST(r, s)$, where $r$ is the function with $r(x) = k$ for all $x \in \mathbb{N}$.
We freely combine these notations and use them for $NST(\cdot, \cdot)$ instead of
$ST(\cdot, \cdot)$, too.
If we think of the external memory tape of an (r, s)-bounded Turing machine
as representing the incoming stream, stored on a hard disk, then admitting the
external memory tape's head to reverse its direction might not be very realistic.
But as we mainly use our model to prove lower bounds, it does not do any harm
either, since the reversals can be used to simulate random access. Random access
can be introduced explicitly into our model as follows: A random access Turing
machine is a Turing machine M which has a special internal memory tape that
is used as random access address tape, i.e., on which only binary strings can be
written. Such a binary string is interpreted as a positive integer specifying an
external memory address, that is, the position index number of a cell on the
external tape (we think of the external tape cells being numbered by positive
integers). The machine has a special state $q_{ra}$. If $q_{ra}$ is entered, then
in one step the external memory tape head is moved to the cell that is specified
by the number on the random access address tape, and the content of the random
access address tape is deleted.
Definition 3.2. Let $q, r, s : \mathbb{N} \to \mathbb{N}$. A random access Turing
machine $M$ is $(q, r, s)$-bounded, if it is $(r, s)$-bounded (in the sense of an
ordinary Turing machine) and, in addition, every run of $M$ on an input of length
$n$ involves at most $q(n)$ random accesses.
Noting that a random access can be simulated with at most 2 changes of the
direction of the external memory tape head, one immediately obtains:
Lemma 3.3. Let $q, r, s : \mathbb{N} \to \mathbb{N}$. If a problem can be solved
by a $(q, r, s)$-bounded random access Turing machine, then it can also be solved
by an $(r + 2q, O(s))$-bounded Turing machine.
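The observation behind Lemma 3.3 can be sketched in a few lines of Python: a random access is replaced by rewinding the sequential head to the left end and scanning right to the target cell, which costs at most two changes of head direction. The class `SequentialTape` and its method names are hypothetical illustrations and not part of the formal machine model; in particular, the sketch ignores the internal-memory bookkeeping that accounts for the $O(s)$ term.

```python
class SequentialTape:
    """A read-only tape whose head moves one cell at a time."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.pos = 0
        self.direction = +1   # the head initially moves to the right
        self.reversals = 0    # number of changes of head direction so far

    def _step(self, d):
        # Moving against the current direction counts as one reversal.
        if d != self.direction:
            self.reversals += 1
            self.direction = d
        self.pos += d

    def seek(self, target):
        """Simulate a random access to cell `target` by sequential moves."""
        while self.pos > 0:          # rewind: at most one reversal
            self._step(-1)
        while self.pos < target:     # scan right: at most one more reversal
            self._step(+1)
        return self.cells[self.pos]

tape = SequentialTape("abcdefghij")
for address in (7, 3, 9):
    before = tape.reversals
    symbol = tape.seek(address)
    print(address, symbol, tape.reversals - before)
```

Each call to `seek` adds at most 2 to the reversal counter, which is where the term $2q$ in the bound $r + 2q$ comes from.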
In the subsequent parts of this paper, we will concentrate on ordinary Turing
machines (without random access). Via Lemma 3.3, all results can be transferred
from ordinary Turing machines to random access Turing machines.
4 It is convenient for technical reasons to add 1 to the number $\mathrm{rev}(\rho)$
of changes of the head direction. As defined here, $r(n)$ bounds the number of
sequential scans of the external memory tape rather than the number of changes
of head directions.
The class ST(r, s) for trees. We make an analogous definition to $ST(r, s)$ on
strings for trees. This definition is given in detail in [16].
4 Lower bounds for the ST model
A reduction lemma. The following lemma provides a convenient tool for
showing that a problem $L$ does not belong to $ST(r, s)$. The lemma's assumption
can be viewed as a reduction from the problem $\mathrm{Disj}_n(\cdot, \cdot)$ to
the problem $L$.
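Lemma 4.1. Let $\Sigma$ be an alphabet and let $\lambda : \mathbb{N} \to \mathbb{N}$
such that the following is true: For every $n_0 \in \mathbb{N}$ there is an
$n \ge n_0$ and functions $f_n, g_n : 2^{\{1,\dots,n\}}$ […]
$\mathcal{T}_{\mathrm{Rels}} := \{\, T(A, B) : A, B \subseteq \mathbb{N}^2,\ A, B \text{ finite} \,\}$
$\mathcal{T}_{\mathrm{EmptyJoin}} := \{\, T(A, B) \in \mathcal{T}_{\mathrm{Rels}} : A \bowtie B = \emptyset \,\}$
$\mathcal{T}_{\mathrm{NonEmptyJoin}} := \{\, T(A, B) \in \mathcal{T}_{\mathrm{Rels}} : A \bowtie B \neq \emptyset \,\}$.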
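Lemma 4.3. $\mathcal{T}_{\mathrm{NonEmptyJoin}}$ can be filtered from
$\mathcal{T}_{\mathrm{Rels}}$ by an XQuery query.
Lemma 4.4. Let $r, s : \mathit{Trees}_\tau \to \mathbb{N}$.
If $r(T) \cdot s(T) \in o\big(\frac{\mathrm{size}(T)}{\log(\mathrm{size}(T))}\big)$,
then $\mathcal{T}_{\mathrm{EmptyJoin}} \notin ST(r, s)$.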
From Lemma 4.4 and Lemma 4.3 we immediately obtain a lower bound on the
worst-case data complexity for filtering relative to an XQuery query:
Theorem 4.5. The tree-language $\mathcal{T}_{\mathrm{EmptyJoin}}$
(a) can be filtered from $\mathcal{T}_{\mathrm{Rels}}$ by an XQuery query,
(b) does not belong to the class $ST(r, s)$, whenever
$r, s : \mathit{Trees}_\tau \to \mathbb{N}$ with
$r(T) \cdot s(T) \in o\big(\frac{\mathrm{size}(T)}{\log(\mathrm{size}(T))}\big)$.
Remark 4.6. Let us note that the above bound is almost tight in the following
sense: The problem of deciding whether $A \bowtie B = \emptyset$ and, in general,
all FO-definable problems belong to $ST(1, n)$: in its single scan of the external
memory tape, the Turing machine simply copies the entire input onto one of its
internal memory tapes and then evaluates the FO-sentence by the straightforward
LOGSPACE algorithm for FO-model-checking (cf., e.g., [1]).
Sorting. By KeySort, we denote the problem of sorting a set $S$ of tuples
$t = (K, V)$ consisting of a key $K$ and a value $V$ by their keys. Let
$ST^{-}(r, s)$ denote the class of all problems in $ST(r, s)$ that can be solved
without writing to the external memory tape. Then,
Theorem 4.7. Let $r, s : \mathbb{N} \to \mathbb{N}$. If KeySort is in $ST^{-}$ […]
$\big(r(n^2) + 2,\ s(n^2) + O(\log n) + O(\max_{t \in A \cup B} |t|)\big)$.
Remark 4.8. Given that the size of relations $A$ and $B$ is known (which is usually
the case in practical database management systems, DBMS), the algorithm given
in the previous proof can do a merge-join without additional scans after the
sort run and without a need to buffer more than one tuple. This is guaranteed
even if both relations may contain many tuples with the same join key; in
current implementations of the merge join in DBMS, this may lead to grass-roots
swapping. The (substantial) practical drawback of the join algorithm of
the proof of Theorem 4.7, however, is that much larger relations $A'$, $B'$ need
to be sorted: indeed $|A'| = |A| \cdot |B|$.
Corollary 4.9.
(a) Let $r, s : \mathbb{N} \to \mathbb{N}$ such that
$r(n^2) \cdot \big(s(n^2) + \log n\big) \in o\big(\frac{n}{\log n}\big)$.
Then, KeySort $\notin ST^{-}(r, s)$.
(b) KeySort $\notin ST^{-}\big(o(\sqrt[5]{n}),\ O(\sqrt[5]{n})\big)$.
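Part (b) follows from part (a) by a short calculation, which we spell out here as a sketch. It reads the hypothesis of (a) as $r(n^2) \cdot (s(n^2) + \log n) \in o(n / \log n)$ and assumes $r \in o(\sqrt[5]{n})$ and $s \in O(\sqrt[5]{n})$.

```latex
\begin{align*}
r(n^2) &\in o\big((n^2)^{1/5}\big) = o\big(n^{2/5}\big), \\
s(n^2) + \log n &\in O\big(n^{2/5}\big) + O(\log n) = O\big(n^{2/5}\big), \\
r(n^2) \cdot \big(s(n^2) + \log n\big) &\in o\big(n^{4/5}\big), \\
\frac{n^{4/5}}{n / \log n} &= \frac{\log n}{n^{1/5}} \longrightarrow 0
  \quad (n \to \infty),
  \qquad \text{hence } o\big(n^{4/5}\big) \subseteq o\Big(\frac{n}{\log n}\Big).
\end{align*}
```

Thus every such pair $(r, s)$ satisfies the hypothesis of (a), and (b) follows.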
It is straightforward to see that by using MergeSort, the sorting problem can
be solved using O(log n) scans of external memory provided that three external
memory tapes are available. (In [17], this logarithmic bound is shown to be
tight, for arbitrarily many external tapes.) Corollary 4.9 gives an exponentially
stronger lower bound for the case of a single external tape.
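The $O(\log n)$ upper bound for three tapes can be illustrated by a textbook balanced two-way merge sort. The following Python sketch models the three tapes as lists that are only read and written from left to right, and counts the number of distribute-and-merge passes; each pass needs only a constant number of sequential scans. The function name `tape_merge_sort` is hypothetical, and the sketch is not taken from [17].

```python
def tape_merge_sort(items):
    """Sort with three 'tapes' a, b, c; return (sorted list, number of passes)."""
    a = list(items)   # tape A holds the data between passes
    n = len(a)
    run = 1           # current length of the sorted runs on tape A
    passes = 0
    while run < n:
        # Distribution scan: copy runs alternately from A onto B and C.
        b, c = [], []
        for i in range(0, n, 2 * run):
            b.extend(a[i:i + run])
            c.extend(a[i + run:i + 2 * run])
        # Merge scan: merge the k-th run of B with the k-th run of C onto A.
        a = []
        bi = ci = 0
        while bi < len(b) or ci < len(c):
            b_end = min(bi + run, len(b))
            c_end = min(ci + run, len(c))
            while bi < b_end and ci < c_end:
                if b[bi] <= c[ci]:
                    a.append(b[bi]); bi += 1
                else:
                    a.append(c[ci]); ci += 1
            a.extend(b[bi:b_end]); bi = b_end   # rest of B's current run
            a.extend(c[ci:c_end]); ci = c_end   # rest of C's current run
        run *= 2
        passes += 1
    return a, passes

print(tape_merge_sort([5, 2, 9, 1, 7, 3, 8, 6]))
```

Since the run length doubles in every pass, the number of passes is $\lceil \log_2 n \rceil$ for $n \ge 1$.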
A hierarchy based on the number of scans.
Theorem 4.10. For every fixed $k \ge 1$,
$$ST\big(k,\ O((\log k) + \log n)\big) \cap NST\big(1,\ O(k \log n)\big) \not\subseteq ST\Big(k-1,\ o\Big(\frac{\sqrt{n}}{k^5 (\log n)^3}\Big)\Big).$$
The proof of this theorem is based on a result due to Duris, Galil and
Schnitger [10]. Theorem 4.10 directly implies
Corollary 4.11. For every fixed $k \in \mathbb{N}$ and all classes $S$ of
functions from $\mathbb{N}$ to $\mathbb{N}$ such that
$O(\log n) \subseteq S \subseteq o\big(\frac{\sqrt{n}}{(\lg n)^3}\big)$
we have $ST(k, S) \subsetneq ST(k+1, S)$.
Remark 4.12. On the other hand, of course, the hierarchy collapses if internal
memory space is at least linear in the size of the input: For every
$r : \mathbb{N} \to \mathbb{N}$ and for every $s : \mathbb{N} \to \mathbb{N}$
with $s(n) \in \Omega(n)$, we have
$ST(r, s) \subseteq ST(1, n + s(n))$ and $ST(r, O(s(n))) = \mathrm{DSPACE}(O(s(n)))$.
5 Tight bounds for filtering and query evaluation on trees
Lower bound. We need the following notation: We fix a set $\tau$ of tag names via
$\tau := \{\text{root}, \text{left}, \text{right}, \text{blank}\}$. Let $T_1$ be
the $\tau$-tree from Figure 1. Note that $T_1$ has a unique leaf $v_1$ labeled
with the tag name left. For any arbitrary $\tau$-tree $T$ we let $T_1(T)$ be the
$\tau$-tree rooted at $T_1$'s root and obtained by identifying node $v_1$ with the
root of $T$ and giving the label left to this node. Now, for every $n \ge 2$ let
$T_n$ be the $\tau$-tree inductively defined via $T_n := T_1(T_{n-1})$. It is
straightforward to see that $T_n$ has exactly $2n$ leaves labeled blank. Let
$x_1, \dots, x_n, y_n, \dots, y_1$ denote these leaves, listed in document order
(i.e., in the order obtained by a pre-order depth-first left-to-right traversal
of $T_n$). For an illustration see Figure 2.
root
  left
    blank
  right
    left
    right
      blank

<root>
  <left>
    <blank/>
  </left>
  <right>
    <left/>
    <right>
      <blank/>
    </right>
  </right>
</root>

Fig. 1. A $\tau$-tree $T_1$ and its XML document $\mathrm{Doc}(T_1)$
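The inductive definition $T_n := T_1(T_{n-1})$ can be turned directly into a small generator for the documents $\mathrm{Doc}(T_n)$. The following Python sketch is only an illustration; the function names `children_of_root` and `doc_T` are hypothetical, and the output is serialized without whitespace.

```python
def children_of_root(n):
    """Serialized children of the root of T_n, for n >= 1."""
    # v_1, the unique leaf of T_1 labeled 'left', is the node that gets
    # identified with the root of T_{n-1}; it keeps the label 'left'.
    if n == 1:
        plugged = "<left/>"
    else:
        plugged = "<left>" + children_of_root(n - 1) + "</left>"
    return ("<left><blank/></left>"
            "<right>" + plugged + "<right><blank/></right></right>")

def doc_T(n):
    """The XML document Doc(T_n) as a string."""
    return "<root>" + children_of_root(n) + "</root>"

print(doc_T(1))
print(doc_T(3).count("<blank/>"))
```

For $n = 1$ this reproduces the document of Figure 1. Each level of the recursion contributes two blank leaves, one before and one after the embedded copy of $T_{n-1}$, which matches the listing $x_1, \dots, x_n, y_n, \dots, y_1$ in document order, and the depth of $T_n$ grows linearly in $n$.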
[…] $\to \mathbb{N}$.
If $r(T) \cdot s(T) \in o(\mathrm{depth}(T))$, then
$\mathcal{T}_{\mathrm{NonDisj}} \notin ST(r, s)$.
From Lemma 5.1 and Lemma 5.2 we directly obtain a lower bound on the
worst-case data complexity of Core XPath filtering:
Theorem 5.3. The tree-language $\mathcal{T}_{\mathrm{NonDisj}}$
(a) can be filtered from $\mathcal{T}_{\mathrm{Sets}}$ by a Core XPath query,
(b) is definable by an FO-sentence (and therefore, also definable by a Boolean
MSO query and recognizable by a tree automaton), and
(c) does not belong to the class $ST(r, s)$, whenever
$r, s : \mathit{Trees}_\tau$ […]
[…] is definable by an
MSO-sentence if, and only if, it is recognizable by an unranked tree automaton,
respectively, if, and only if, the language
$\{\mathrm{BinTree}(T) : T \in \mathcal{T}\}$ of associated
binary trees is recognizable by an ordinary (ranked) tree automaton (cf., e.g.,
[7, 9, 30]).
Theorem 5.4 (implicit in [25, 29]). Let $\mathcal{T} \subseteq \mathit{Trees}_\tau$
be a tree-language. If $\mathcal{T}$ is definable by an MSO-sentence (or,
equivalently, recognizable by a ranked or an unranked finite tree automaton),
then $\mathcal{T} \in ST\big(1,\ \mathrm{depth}(\cdot) + 1\big)$.
Recall that every Core XPath query is equivalent to a unary MSO query. Thus
a Core XPath filter can be phrased as an MSO sentence on trees. From
Theorems 5.4 and 5.3 we therefore immediately obtain a tight bound for Core
XPath filtering:
Corollary 5.5. (a) Filtering from the set of unranked trees with respect to every
fixed Core XPath query $Q$ belongs to $ST\big(1,\ O(\mathrm{depth}(\cdot))\big)$.
(b) There is a Core XPath query $Q$ such that, for all
$r, s : \mathit{Trees}_\tau \to \mathbb{N}$ with
$r(T) \cdot s(T) \in o\big(\mathrm{depth}(T)\big)$, filtering w.r.t. $Q$ does not
belong to $ST(r, s)$.
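To illustrate the flavor of the upper bound in part (a), the following Python sketch filters a streamed document with respect to the example query /descendant::*[child::A and child::B]/child::* in a single pass, keeping only a stack with one constant-size entry per currently open element. It is a hand-written special case, not the general tree-automaton construction behind Theorem 5.4; the names `ABFilter` and `filter_stream` are hypothetical.

```python
import xml.sax

class ABFilter(xml.sax.ContentHandler):
    """Single-pass filter: is there a node with an A-child and a B-child?"""

    def __init__(self):
        super().__init__()
        self.stack = []        # one [has_A_child, has_B_child] pair per open element
        self.matched = False
        self.max_stack = 0     # largest stack height seen during the scan

    def startElement(self, name, attrs):
        if self.stack:         # record the new element's label at its parent
            if name == "A":
                self.stack[-1][0] = True
            if name == "B":
                self.stack[-1][1] = True
        self.stack.append([False, False])
        self.max_stack = max(self.max_stack, len(self.stack))

    def endElement(self, name):
        has_a, has_b = self.stack.pop()
        if has_a and has_b:    # this node satisfies the predicate
            self.matched = True

def filter_stream(xml_bytes):
    """Return (query result is non-empty, maximal stack height)."""
    handler = ABFilter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.matched, handler.max_stack

print(filter_stream(b"<r><x><A/><B/></x></r>"))
print(filter_stream(b"<r><A/><x><B/></x></r>"))
```

The stack never holds more entries than there are simultaneously open elements, so the buffer space is proportional to the nesting depth of the document and independent of its size.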
Next, we provide an upper bound for the problem of computing the set
$\mathrm{Eval}(Q, T)$ of nodes in an input tree $T$ matching a unary MSO (or Core
XPath) query $Q$. We first need to clarify what this means, because writing the
subtree of each matching node onto the output tape requires a very large amount
of internal memory (or a large number of head reversals on the external memory
tape), and this gives us no appropriate characterization of the difficulty of the
problem. We study the problem of computing, for each node matched by $Q$, its
index in the tree, in the order in which they appear in the document
$\mathrm{Doc}(T)$. We distinguish between the case where these indexes are to be
written to the output tape in ascending order and the case where they are to be
output in descending (i.e., reverse) order.
Theorem 5.6 (implicit in [26, 20]). For every unary MSO or Core XPath
query $Q$, the problem of computing, for input trees $T$, the nodes in
$\mathrm{Eval}(Q, T)$
(a) in ascending order belongs to $ST(3, O(\mathrm{depth}(\cdot)))$.
(b) in reverse order belongs to $ST(2, O(\mathrm{depth}(\cdot)))$.
Note that this bound is tight: From Corollary 5.5(b) we know that, for some Core
XPath query $Q$, not even filtering (i.e., checking whether $\mathrm{Eval}(Q, T)$
is empty) is possible in $ST(r, s)$ if
$r(T) \cdot s(T) \in o\big(\mathrm{depth}(T)\big)$.
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,
1995.
2. G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl. On the streaming model
augmented with a sorting primitive. In Proc. FOCS'04, pages 540-549.
3. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the
frequency moments. Journal of Computer and System Sciences, 58:137-147, 1999.
4. A. Arasu, B. Babcock, T. Green, A. Gupta, and J. Widom. Characterizing Memory
Requirements for Queries over Continuous Data Streams. In Proc. PODS'02,
pages 221-232, 2002.
5. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues
in data stream systems. In Proc. PODS'02, pages 1-16.
6. Z. Bar-Yossef, M. Fontoura, and V. Josifovski. On the Memory Requirements of
XPath Evaluation over XML Streams. In Proc. PODS'04, pages 177-188, 2004.
7. A. Brüggemann-Klein, M. Murata, and D. Wood. Regular Tree and Regular
Hedge Languages over Non-ranked Alphabets: Version 1, April 3, 2001. Technical
Report HKUST-TCSC-2001-05, Hong Kong Univ. of Science and Technology, 2001.
8. J.-E. Chen and C.-K. Yap. Reversal Complexity. SIAM J. Comput., 20(4):622-638,
Aug. 1991.
9. J. Doner. Tree Acceptors and some of their Applications. Journal of Computer
and System Sciences, 4:406-451, 1970.
10. P. Duris, Z. Galil, and G. Schnitger. Lower bounds on communication complexity.
Information and Computation, 73:1-22, 1987. Journal version of STOC'84 paper.
11. G. Gottlob and C. Koch. Monadic Datalog and the Expressive Power of Web
Information Extraction Languages. Journal of the ACM, 51(1):74-113, 2004.
12. G. Gottlob, C. Koch, and R. Pichler. Efficient Algorithms for Processing XPath
Queries. In Proc. VLDB 2002, pages 95-106, Hong Kong, China, 2002.
13. G. Gottlob, C. Koch, and R. Pichler. The Complexity of XPath Query Evaluation.
In Proc. PODS'03, pages 179-190, San Diego, California, 2003.
14. G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing
Surveys, 25(2):73-170, June 1993.
15. T. J. Green, G. Miklau, M. Onizuka, and D. Suciu. Processing XML Streams
with Deterministic Automata. In Proc. ICDT'03, 2003.
16. M. Grohe, C. Koch, and N. Schweikardt. Tight lower bounds for query processing
on streaming and external memory data. Technical report CoRR cs.DB/0505002,
2005. Full version of ICALP'05 paper.
17. M. Grohe and N. Schweikardt. Lower bounds for sorting with few random accesses
to external memory. In Proc. PODS, 2005. To appear.
18. M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams.
In External memory algorithms, volume 50, pages 107-118. DIMACS Series In
Discrete Mathematics And Theoretical Computer Science, 1999.
19. J. E. Hopcroft and J. D. Ullman. Some results on tape-bounded Turing machines.
Journal of the ACM, 16(1):168-177, 1969.
20. C. Koch. Efficient Processing of Expressive Node-Selecting Queries on XML Data
in Secondary Storage: A Tree Automata-based Approach. In Proc. VLDB 2003,
pages 249-260, 2003.
21. E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge Univ. Press,
1997.
22. U. Meyer, P. Sanders, and J. Sibeyn, editors. Algorithms for Memory Hierarchies,
volume 2832 of Lecture Notes in Computer Science. Springer-Verlag, 2003.
23. J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical
Computer Science, 12:315-323, 1980.
24. S. Muthukrishnan. Data streams: algorithms and applications. In Proc. 14th
SODA, pages 413-413, 2003.
25. A. Neumann and H. Seidl. Locating Matches of Tree Patterns in Forests. In
Proc. 18th FSTTCS, LNCS 1530, pages 134-145, 1998.
26. F. Neven and J. van den Bussche. Expressiveness of Structured Document Query
Languages Based on Attribute Grammars. J. ACM, 49(1):56-100, Jan. 2002.
27. R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill,
2002.
28. L. Segoufin. Typing and Querying XML Documents: Some Complexity Bounds.
In Proc. PODS'03, pages 167-178, 2003.
29. L. Segoufin and V. Vianu. Validating Streaming XML Documents. In Proc.
PODS'02, 2002.
30. J. Thatcher and J. Wright. Generalized Finite Automata Theory with an
Application to a Decision Problem of Second-order Logic. Math. Syst. Theory,
2(1):57-81, 1968.
31. P. van Emde Boas. Machine Models and Simulations. In J. van Leeuwen, editor,
Handbook of Theoretical Computer Science, volume 1, chapter 1, pages 1-66.
Elsevier Science Publishers B.V., 1990.
32. J. Vitter. External memory algorithms and data structures: Dealing with massive
data. ACM Computing Surveys, 33(2):209-271, June 2001.
33. World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Formal Semantics.
W3C Working Draft (Aug. 16th 2002), 2002. http://www.w3.org/XML/Query.
34. A. Yao. Some complexity questions related to distributive computing. In Proc.
11th STOC, pages 209-213, 1979.