Tight Lower Bounds For Query Processing On Streaming and External Memory Data


Martin Grohe¹, Christoph Koch², and Nicole Schweikardt¹
¹ Institut für Informatik, Humboldt-Universität Berlin, Germany
{grohe,schweika}@informatik.hu-berlin.de
² Database Group, Universität des Saarlandes, Saarbrücken, Germany
[email protected]
Abstract. We study a clean machine model for external memory and
stream processing. We show that the number of scans of the external data
induces a strict hierarchy (as long as work space is sufficiently small, e.g.,
polylogarithmic in the size of the input). We also show that neither joins
nor sorting are feasible if the product of the number r(n) of scans of
the external memory and the size s(n) of the internal memory buffers is
sufficiently small, e.g., of size $o(\sqrt[5]{n})$. We also establish tight bounds for
the complexity of XPath evaluation and filtering.
1 Introduction
It is generally assumed that databases have to reside in external, inexpensive
storage because of their sheer size. Current technology for external storage sys-
tems (disks and tapes) presents us with a reality that performance-wise, a small
number of sequential scans of the data is strictly preferable over random data
accesses. Indeed, the combined latencies and access times of moving to a certain
position in external storage are by orders of magnitude greater than actually
reading a small amount of data once the read head has been placed on its start-
ing position.
Database engines rely on main memory buffers for assuring acceptable
performance. These are usually small compared to the size of the externally stored
data. Database technology, in particular query processing technology, has
developed around this notion of memory hierarchies with layers of greatly varying
sizes and access times. There has been a wealth of research on query processing
and optimization along these lines (cf. e.g. [27, 14, 32, 22]). It seems that the
current technologies scale up to current user expectations, but on closer
investigation it may appear that our theoretical understanding of the problems involved
and of optimal algorithms for these problems is not quite as developed.
Recently, data stream processing has become an object of study by the data
management community (e.g. [15]) but from the viewpoint of database theory,
this is, in fact, a special case of the query processing problem on data in external
storage where we are limited to a single scan of the input data.
In summary, it appears that there are a variety of data management and
query processing problems in which a comparably small but efficiently accessible
main memory buffer is available and where accessing external data is costly
and is best performed by sequential read/write scans. This calls for an
appropriate formal model that captures the essence of external memory and stream
processing. In this paper, we study such a model, which employs a Turing
machine with one external memory tape (external tape for short) and a number
of internal memory tapes (internal tapes for short). The external tape initially
holds the input; the internal tapes correspond to the main memory buffers of a
database management system and are thus usually small compared to the input.
As computational resources for inputs of size n, we study the space s(n)
available on the internal tapes and the number r(n) of scans of (or, random accesses
to) the external tape, and we write ST(r, s) to denote the class of all problems
solvable by (r, s)-bounded Turing machines, i.e., Turing machines which comply
with the resource bounds r(n) and s(n) on inputs of size n.
Formally, we model the number of scans, respectively the number of random
accesses, by the number of reversals of the Turing machine's read/write head on
the external tape. The number of reversals of the read/write head on the internal
tapes remains unbounded. The reversals done by a read/write head are a clean
and fundamental notion [8], but of course real external storage technology based
on disks does not allow the direction of rotation to be reversed. On the other hand,
we can of course simulate k forward scans by 2k reversals in our machine model,
and allowing for forward as well as backward scans makes our lower bound
results even stronger.
As we allow the external tape to be both read and written to, the external
tape can be viewed, for example, as modeling a hard disk. By closely watching
reversals of the external tape head, anything close to random I/O will result
in a very considerable number of reversals, while a full sequential scan of the
external data can be effected cheaply. We will obtain strong lower bounds in
this paper that show that even if the external tape (whose size we do not put a
bound on) may be written to and re-read, certain bounds cannot be improved
upon. For our matching upper bounds, we will usually not write to the external
tape. Whenever one of our results requires writing to the external tape, we will
explicitly indicate this.
The model is similar in spirit to the frameworks used in [18, 19], but differs
from the previously considered reversal complexity framework [8]. Reversal
complexity is based on Turing machines with a single read/write tape and takes the
overall number of reversals of the read/write head as the main computational
resource. In our notion, only the number of reversals on the external tape is
bounded, while reversals on the internal tapes are free; however, the space on the
internal tapes is considered to be a limited resource (see footnote 3).
Footnote 3: The justification for this assumption is simply that accessing data on disks is
currently about five to six orders of magnitude slower than accessing main memory.
For that reason, processor cycles and main memory access times are often neglected
when estimating query cost in relational query optimizers, where cost measures are
often exclusively based on the amount of expected page I/O as well as disk latency
and access times. Moreover, by taking buffer space rather than running time as a
parameter, we obtain more robust complexity classes that rely less on details of the
machine model (see also [31]).
Apart from formalizing the ST(r, s) model, we study its properties and locate
a number of data management problems in the hierarchy of ST(·, ·) classes.
Our technical contributions are as follows:
- We prove a reduction lemma (Lemma 4.1) which allows easy lower bound
  proofs for certain problems.
- We prove a hierarchy (Corollary 4.11 and Theorem 4.10), stating for each
  fixed number k that k + 1 scans of the external memory tape are strictly more
  powerful than k scans of the external memory tape.
- We consider machines where the product of the number of scans of the external
  memory tape, r(n), and internal memory tape size, s(n), is of size
  $o(n / \log n)$, where n is the input size, and show that joins cannot be
  computed by (r, s)-bounded Turing machines (cf. Lemma 4.4).
- We show that the sorting problem cannot be solved with
  $(o(\sqrt[5]{n}), O(\sqrt[5]{n}))$-bounded Turing machines that are not allowed
  to write intermediate results to the external memory tape (cf. Corollary 4.9).
- We show (cf. Theorem 4.5) that for some XQuery queries, filtering is
  impossible for machines with $r(T) \cdot s(T) \in o(n / \log n)$, where n is the
  size of the input XML document T.
- We show (cf. Corollary 5.5) that for some Core XPath [12] queries, filtering is
  impossible for machines with $r(T) \cdot s(T) \in o(d)$, where d denotes the depth
  of the input XML document T. Furthermore, we show that the lower bound on
  Core XPath is tight in that we give an algorithm that solves the Core XPath
  filtering problem with a single scan of the external data (zero reversals) and
  O(d) buffer space.
The primary technical machinery that we use for obtaining lower bounds
is that of communication complexity (cf. [21]). Techniques from communication
complexity have been used previously to study queries on streams [4, 6, 2, 3, 5, 23,
24, 18]. The work reported on in [4] addresses the problem of determining whether
a given relational query can be evaluated scalably on a data stream or not at
all. In comparison, we ask for tight bounds on query evaluation problems, i.e.
we give algorithms for query evaluation that are in a sense worst-case optimal.
As we do, the authors of [6] study XPath evaluation; however, they focus on
instance data complexity while we study worst-case bounds. This allows us to
find strong and tight bounds for a greater variety of query evaluation problems.
Many of our results apply beyond stream processing in a narrow sense to a more
general framework of queries on data in external storage. Also, our worst-case
bounds apply for any evaluation algorithm possible, that is, our bounds are not
in terms of complexity classes closed under reductions that allow for nonlinear
expansions of the input (such as LOGSPACE) as is the case for the work on the
complexity of XPath in [12, 13, 28].
Lower bound results for a machine model with multiple external memory
tapes (or hard disks) are presented in [17]. In the present paper, we only consider
a single external memory tape, and are consequently able to show (sometimes
exponentially) stronger lower bounds.
Due to space limitations we had to defer detailed proofs of our results to the
full version of this paper [16] which extends the present paper by an appendix
that contains proofs of all results presented here.
2 Preliminaries
In this section we fix some basic notation concerning trees, streams, and query
languages. We write $\mathbb{N}$ for the set of non-negative integers. If M is a set, then
$2^M$ denotes the set of all subsets of M. Throughout this paper we make the
following convention: Whenever the letters r and s denote functions from $\mathbb{N}$ to
$\mathbb{N}$, then these functions are monotone, i.e., we have $r(x) \le r(y)$ and $s(x) \le s(y)$
for all $x, y \in \mathbb{N}$ with $x \le y$.
Trees and Streams. We use standard notation for trees and streamed trees
(i.e. documents). In particular, we write Doc(T) to denote the XML document
associated with an XML document tree T. An example is given in Figure 1.
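Query Languages. By Eval(·, ·) we denote the evaluation function that maps
each tuple (Q, T), consisting of a query Q and a tree T, to the corresponding query
result. Let $\mathcal{Q}$ be a query language and let $\mathcal{T}_1 \subseteq \mathrm{Trees}_\tau$ and
$\mathcal{T}_2 \subseteq \mathcal{T}_1$. We say that $\mathcal{T}_2$ can be filtered from $\mathcal{T}_1$ by a
$\mathcal{Q}$-query if, and only if, there is a query $Q \in \mathcal{Q}$
such that the following is true for all $T \in \mathcal{T}_1$:
$T \in \mathcal{T}_2 \iff \mathrm{Eval}(Q, T) \neq \emptyset$.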
We assume that the reader is familiar with first-order logic (FO) and monadic
second-order logic (MSO). An FO- or MSO-sentence (i.e., a formula without any
free variable) specifies a Boolean query, whereas a formula with exactly one free
first-order variable specifies a unary query, i.e., a query which selects a set of
nodes from the underlying input tree.
It is well-known [9, 30] that the MSO-definable Boolean queries on binary
trees are exactly the (Boolean) queries that can be defined by finite (deterministic
or nondeterministic) bottom-up tree automata. An analogous statement is true
about MSO on unranked trees and unranked tree automata [7].
Theorem 4.5 in Section 4 gives a lower bound on the worst-case complexity
of the language XQuery. As we prove a lower bound for one particular XQuery
query, we do not give a formal definition of the language but refer to [33].
Apart from FO, MSO, and XQuery, we also consider a fragment of the XPath
language, Core XPath [12, 13]. As we will prove not only lower, but also upper
bounds for Core XPath, we give a precise definition of this query language in
[16]. An example of a Core XPath query is
/descendant::*[child::A and child::B]/child::*,
which selects all children of descendants of the root node that (i.e., the descen-
dants) have a child node labeled A and a child node labeled B.
Core XPath is a strict fragment of XPath [12], both syntactically and seman-
tically. It is known that Core XPath is in LOGSPACE w.r.t. data complexity
and P-complete w.r.t. combined complexity [13]. In [12], it is shown that Core
XPath can be evaluated in time $O(|Q| \cdot |D|)$, where $|Q|$ is the size of the query
and $|D|$ is the size of the XML data. Furthermore, every Core XPath query is
equivalent to a unary MSO query on trees (cf., e.g., [11]).
Communication complexity. To prove basic properties and lower bounds
for our machine model, we use some notions and results from communication
complexity, cf., e.g., [21].
Let A, B, C be sets and let $F : A \times B \to C$ be a function. In Yao's [34] basic
model of communication two players, Alice and Bob, jointly want to evaluate
F(x, y), for input values $x \in A$ and $y \in B$, where Alice only knows x and Bob
only knows y. The two players can exchange messages according to some fixed
protocol $\mathcal{P}$ that depends on F, but not on the particular input values x, y. The
exchange of messages starts with Alice sending a message to Bob and ends as
soon as one of the players has enough information on x and y to compute F(x, y).
$\mathcal{P}$ is called a k-round protocol, for some $k \in \mathbb{N}$, if the exchange of messages
consists, for each input $(x, y) \in A \times B$, of at most k rounds. The cost of $\mathcal{P}$ on
input (x, y) is the number of bits communicated by $\mathcal{P}$ on input (x, y). The cost
of $\mathcal{P}$ is the maximal cost of $\mathcal{P}$ over all inputs $(x, y) \in A \times B$. The communication
complexity of F, comm-compl(F), is defined as the minimum cost of $\mathcal{P}$, over all
protocols $\mathcal{P}$ that compute F. For $k \ge 1$, the k-round communication complexity
of F, $\text{comm-compl}_k(F)$, is defined as the minimum cost of $\mathcal{P}$, over all k-round
protocols $\mathcal{P}$ that compute F.
Many powerful tools are known for proving lower bounds on communication
complexity, cf., e.g., [21]. In the present paper we will use the following basic
lower bound for the problem of deciding whether two sets are disjoint.
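Definition 2.1. For $n \in \mathbb{N}$ let the function
$\mathrm{Disj}_n : 2^{\{1,\ldots,n\}} \times 2^{\{1,\ldots,n\}} \to \{0, 1\}$ be given via
$$\mathrm{Disj}_n(X, Y) := \begin{cases} 1, & \text{if } X \cap Y = \emptyset \\ 0, & \text{otherwise.} \end{cases}$$
Theorem 2.2 (cf., e.g., [21]). For every $n \in \mathbb{N}$, $\text{comm-compl}(\mathrm{Disj}_n) \ge n$.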
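To make the communication setting concrete, here is a minimal Python sketch of $\mathrm{Disj}_n$ together with the obvious one-round protocol in which Alice sends the characteristic vector of her set. The function names `disj` and `trivial_protocol` are illustrative and not taken from the paper; the sketch only shows that n bits always suffice, which is the upper bound matching Theorem 2.2.

```python
def disj(x: set[int], y: set[int]) -> int:
    """Disj_n(X, Y): 1 if X and Y are disjoint, 0 otherwise."""
    return 1 if x.isdisjoint(y) else 0


def trivial_protocol(n: int, x: set[int], y: set[int]) -> tuple[int, int]:
    """One-round protocol: Alice sends the n-bit characteristic vector of X.

    Returns (value computed by Bob, number of bits communicated).
    """
    # Alice's message: bit i is 1 exactly if i is in X.
    message = [1 if i in x else 0 for i in range(1, n + 1)]
    # Bob combines the message with his own set Y.
    intersects = any(bit == 1 and i in y for i, bit in enumerate(message, start=1))
    return (0 if intersects else 1), len(message)
```

Calling `trivial_protocol(4, {1, 2}, {2, 3})` communicates 4 bits and lets Bob conclude that the sets intersect.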
3 Machine Model
We consider Turing machines with (1) an input tape, which is a read/write tape
and will henceforth be called external memory tape or external tape, for
short, (2) an arbitrary number u of work tapes, which will henceforth be called
internal memory tapes or internal tapes, for short, and, if needed, (3) an
additional write-only output tape.
Let M be such a Turing machine and let ρ be a run of M. By rev(ρ) we denote
the number of times the external memory tape's head changes its direction in
the run ρ. For $i \in \{1, \ldots, u\}$ we let space(ρ, i) be the number of cells of internal
memory tape i that are used by ρ.
The class ST(r, s) for strings.
Definition 3.1 (ST(r, s) for strings). Let $r : \mathbb{N} \to \mathbb{N}$ and $s : \mathbb{N} \to \mathbb{N}$.
(a) A Turing machine M is (r, s)-bounded, if every run ρ of M on an input of
length n satisfies the following conditions:
(1) ρ is finite, (2) $1 + \mathrm{rev}(\rho) \le r(n)$ (see footnote 4), and
(3) $\sum_{i=1}^{u} \mathrm{space}(\rho, i) \le s(n)$,
where u is the number of internal tapes of M.
(b) A string-language $L \subseteq \Sigma^*$ belongs to the class ST(r, s) (resp., NST(r, s)), if
there is a deterministic (respectively, nondeterministic) (r, s)-bounded Turing
machine which accepts exactly those $w \in \Sigma^*$ that belong to L.
(c) A function $f : \Sigma^* \to \Sigma^*$ belongs to the class ST(r, s), if there is a
deterministic (r, s)-bounded Turing machine which produces, for each input string
$w \in \Sigma^*$, the string f(w) on its write-only output tape.
For classes R and S of functions, we let $ST(R, S) := \bigcup_{r \in R,\, s \in S} ST(r, s)$.
If $k \in \mathbb{N}$ is a constant, then we write ST(k, s) instead of ST(r, s), where r is the
function with r(x) = k for all $x \in \mathbb{N}$. We freely combine these notations and use
them for NST(·, ·) instead of ST(·, ·), too.
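As a small illustration of the resource accounting in Definition 3.1, the following Python sketch counts head reversals along a recorded trace of external-tape head positions and checks conditions (2) and (3). The trace representation and the names `count_reversals` and `within_bounds` are assumptions made for this example; the paper works with Turing machine runs, not with position lists.

```python
def count_reversals(positions: list[int]) -> int:
    """Number of direction changes of the head along a trace of positions."""
    reversals = 0
    direction = 0  # 0: head has not moved yet, +1: moving right, -1: moving left
    for prev, cur in zip(positions, positions[1:]):
        if cur == prev:
            continue  # the head stays put, so no direction is involved
        step = 1 if cur > prev else -1
        if direction != 0 and step != direction:
            reversals += 1
        direction = step
    return reversals


def within_bounds(positions: list[int], space_used: list[int],
                  r_n: int, s_n: int) -> bool:
    """Check 1 + rev <= r(n) and total internal space <= s(n) for one run."""
    return 1 + count_reversals(positions) <= r_n and sum(space_used) <= s_n
```

A purely forward trace such as `[0, 1, 2, 3]` has zero reversals and therefore counts as a single scan.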
If we think of the external memory tape of an (r, s)-bounded Turing machine
as representing the incoming stream, stored on a hard disk, then admitting the
external memory tape's head to reverse its direction might not be very realistic.
But as we mainly use our model to prove lower bounds, it does not do any harm
either, since the reversals can be used to simulate random access. Random access
can be introduced explicitly into our model as follows: A random access Turing
machine is a Turing machine M which has a special internal memory tape that
is used as random access address tape, i.e., on which only binary strings can be
written. Such a binary string is interpreted as a positive integer specifying an
external memory address, that is, the position index number of a cell on the
external tape (we think of the external tape cells being numbered by positive
integers). The machine has a special state $q_{ra}$. If $q_{ra}$ is entered, then in one
step the external memory tape head is moved to the cell that is specied by
the number on the random access address tape, and the content of the random
access address tape is deleted.
Definition 3.2. Let $q, r, s : \mathbb{N} \to \mathbb{N}$. A random access Turing machine M is
(q, r, s)-bounded, if it is (r, s)-bounded (in the sense of an ordinary Turing
machine) and, in addition, every run of M on an input of length n involves at
most q(n) random accesses.
Noting that a random access can be simulated with at most 2 changes of the
direction of the external memory tape head, one immediately obtains:
Lemma 3.3. Let $q, r, s : \mathbb{N} \to \mathbb{N}$. If a problem can be solved by a (q, r, s)-bounded
random access Turing machine, then it can also be solved by an (r + 2q, O(s))-bounded
Turing machine.
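The counting argument behind Lemma 3.3 can be checked on small examples. The sketch below is an illustration under assumptions of its own: head operations are given as a list containing "R", "L", or a hypothetical `("J", address)` jump, and each jump is replaced by a walk of unit moves in one direction. It then compares the number of reversals with and without the simulated jumps.

```python
def expand_moves(ops: list, start: int = 1) -> list[int]:
    """Expand ops into unit head moves (+1 right, -1 left); jumps become walks."""
    pos, moves = start, []
    for op in ops:
        if op == "R":
            moves.append(1)
            pos += 1
        elif op == "L":
            moves.append(-1)
            pos -= 1
        else:  # ("J", target): walk cell by cell to the target address
            _, target = op
            step = 1 if target > pos else -1
            moves.extend([step] * abs(target - pos))
            pos = target
    return moves


def reversals(moves: list[int]) -> int:
    """Number of direction changes in a sequence of unit moves."""
    return sum(1 for a, b in zip(moves, moves[1:]) if a != b)


def simulation_overhead_ok(ops: list) -> bool:
    """Check: reversals after simulating jumps <= ordinary reversals + 2 * jumps."""
    ordinary = [1 if op == "R" else -1 for op in ops if op in ("R", "L")]
    jumps = sum(1 for op in ops if op not in ("R", "L"))
    return reversals(expand_moves(ops)) <= reversals(ordinary) + 2 * jumps
```

Each walk is a block of moves in a single direction, and inserting such a block into a move sequence adds at most two direction changes, which is where the term 2q in the lemma comes from.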
In the subsequent parts of this paper, we will concentrate on ordinary Turing
machines (without random access). Via Lemma 3.3, all results can be transferred
from ordinary Turing machines to random access Turing machines.
Footnote 4: It is convenient for technical reasons to add 1 to the number rev(ρ) of changes of
the head direction. As defined here, r(n) bounds the number of sequential scans of
the external memory tape rather than the number of changes of head directions.
The class ST(r, s) for trees. We make an analogous definition to ST(r, s) on
strings for trees. This definition is given in detail in [16].
4 Lower bounds for the ST model
A reduction lemma. The following lemma provides a convenient tool for
showing that a problem L does not belong to ST(r, s). The lemma's assumption
can be viewed as a reduction from the problem $\mathrm{Disj}_n(\cdot, \cdot)$ to the problem L.
Lemma 4.1. Let Σ be an alphabet and let $\lambda : \mathbb{N} \to \mathbb{N}$ such that the following is
true: For every $n_0 \in \mathbb{N}$ there is an $n \ge n_0$ and functions
$f_n, g_n : 2^{\{1,\ldots,n\}} \to \Sigma^*$
such that for all $X, Y \subseteq \{1, \ldots, n\}$ the string $f_n(X) g_n(Y)$ has length $\le \lambda(n)$.
Then we have for all $r, s : \mathbb{N} \to \mathbb{N}$ with $r(\lambda(n)) \cdot s(\lambda(n)) \in o(n)$, that there
is no (r, s)-bounded deterministic Turing machine which accepts a string of the
form $f_n(X) g_n(Y)$ if, and only if, $X \cap Y = \emptyset$.
Disjointness. Every n-bit string $x = x_1 \cdots x_n \in \{0, 1\}^n$ specifies a set
$S(x) := \{i : x_i = 1\} \subseteq \{1, \ldots, n\}$. Let $L_{Disj}$ consist of those strings x#y where x and y
specify disjoint subsets of $\{1, \ldots, n\}$, for some $n \ge 1$. That is,
$$L_{Disj} := \{\, x\#y : \text{ex. } n \ge 1 \text{ with } x, y \in \{0, 1\}^n \text{ and } S(x) \cap S(y) = \emptyset \,\}.$$
From Lemma 4.1 one easily obtains
Proposition 4.2. Let $r : \mathbb{N} \to \mathbb{N}$ and $s : \mathbb{N} \to \mathbb{N}$. If $r(n) \cdot s(n) \in o(n)$, then
$L_{Disj} \notin ST(r, s)$.
The bound given by Proposition 4.2 is tight, as it can be easily seen that
$L_{Disj} \in ST(r, s)$ for all $r, s : \mathbb{N} \to \mathbb{N}$ with $r(n) \cdot s(n) \in \Omega(n)$.
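One way to see the matching upper bound is a chunked strategy: each sequential pass over x#y buffers at most s bits of the half it meets first and compares them with the corresponding bits of the other half, alternating forward and backward passes. The Python sketch below simulates this under simplifying assumptions that are not from the paper: the format check is done up front rather than folded into the first scan, and the O(log n) position counters a real machine would need are ignored. The names `chunk_intersects` and `in_l_disj` are illustrative.

```python
def chunk_intersects(w: str, n: int, start: int, s: int, forward: bool) -> bool:
    """One sequential pass over w = x#y that inspects positions start..start+s-1."""
    buffer = {}  # at most s entries: position -> bit of the half seen first
    order = range(len(w)) if forward else range(len(w) - 1, -1, -1)
    for p in order:
        if w[p] == "#":
            continue
        i = p if p < n else p - n - 1  # index of this bit within x or within y
        if not (start <= i < start + s):
            continue
        if i in buffer:  # second encounter: this is the bit of the other half
            if buffer[i] == "1" and w[p] == "1":
                return True
        else:
            buffer[i] = w[p]
    return False


def in_l_disj(w: str, s: int) -> tuple[bool, int]:
    """Decide membership in L_Disj with buffer size s; returns (accept, passes)."""
    x, sep, y = w.partition("#")
    n = len(x)
    if sep != "#" or n == 0 or len(y) != n or set(x + y) - {"0", "1"}:
        return False, 1  # malformed input; assumed detectable during the first scan
    passes = 0
    for start in range(0, n, s):
        forward = passes % 2 == 0  # alternate directions, so no rewinding is needed
        passes += 1
        if chunk_intersects(w, n, start, s, forward):
            return False, passes
    return True, passes
```

For a disjoint pair the sketch uses ⌈n/s⌉ passes, so the product of passes and buffer size is at least n, in line with the condition $r(n) \cdot s(n) \in \Omega(n)$.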
Joins. Let τ be the set of tag names {rels, rel1, rel2, tuple, no1, no2, 0, 1}.
We represent a pair (A, B) of finite relations $A, B \subseteq \mathbb{N}^2$ as a τ-tree T(A, B)
whose associated XML document Doc(T(A, B)) is a string of the following
form: For each number $i \in \mathbb{N}$ let $\mathrm{Bin}(i) = b^{(i)}_{\ell_i} \cdots b^{(i)}_{0}$ be the binary representation
of i. For each tuple $(i, j) \in \{1, \ldots, n\}^2$ let
Doc(i, j) := <tuple> <no1> <$b^{(i)}_{\ell_i}$/> ⋯ <$b^{(i)}_{0}$/> </no1> <no2> <$b^{(j)}_{\ell_j}$/> ⋯ <$b^{(j)}_{0}$/> </no2> </tuple>.
For each finite relation $A \subseteq \mathbb{N}^2$ let $t_1, \ldots, t_{|A|}$ be the lexicographically ordered
list of all tuples in A. We let Doc(A) := Doc($t_1$) ⋯ Doc($t_{|A|}$). Finally, we let
Doc(T(A, B)) := <rels> <rel1> Doc(A) </rel1> <rel2> Doc(B) </rel2> </rels>.
It is straightforward to see that the string Doc(T(A, B)) has length
$O((|A| + |B|) \cdot \log n)$, if $A, B \subseteq \{1, \ldots, n\}^2$.
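The encoding can be spelled out as a short Python sketch. The helper names (`doc_number`, `doc_tuple`, `doc_relation`, `doc_pair`) are made up for this illustration, whitespace between tags is omitted, and Python's `sorted` on pairs is assumed to realize the lexicographic order of tuples. Note that tag names 0 and 1 follow the paper's convention and are not legal element names in standard XML.

```python
def doc_number(tag: str, i: int) -> str:
    """Encode i as one empty element per bit of Bin(i), wrapped in <tag>...</tag>."""
    bits = "".join(f"<{b}/>" for b in bin(i)[2:])  # most significant bit first
    return f"<{tag}>{bits}</{tag}>"


def doc_tuple(t: tuple[int, int]) -> str:
    """Doc(i, j) for a single tuple (i, j)."""
    i, j = t
    return "<tuple>" + doc_number("no1", i) + doc_number("no2", j) + "</tuple>"


def doc_relation(rel: set[tuple[int, int]]) -> str:
    """Doc(A): the tuples of A in lexicographic order."""
    return "".join(doc_tuple(t) for t in sorted(rel))


def doc_pair(a: set[tuple[int, int]], b: set[tuple[int, int]]) -> str:
    """Doc(T(A, B)) for a pair of finite relations."""
    return ("<rels><rel1>" + doc_relation(a) + "</rel1>"
            + "<rel2>" + doc_relation(b) + "</rel2></rels>")
```

Every number contributes O(log n) bit elements, which is where the length bound $O((|A| + |B|) \cdot \log n)$ comes from.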
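We write $A \bowtie_1 B$ to denote the join of A and B on their first component,
i.e., $A \bowtie_1 B := \{(x, y) : \exists z\; A(z, x) \wedge B(z, y)\}$. We let
$\mathcal{T}_{Rels} := \{\, T(A, B) : A, B \subseteq \mathbb{N}^2,\ A, B \text{ finite} \,\}$
$\mathcal{T}_{EmptyJoin} := \{\, T(A, B) \in \mathcal{T}_{Rels} : A \bowtie_1 B = \emptyset \,\}$
$\mathcal{T}_{NonEmptyJoin} := \{\, T(A, B) \in \mathcal{T}_{Rels} : A \bowtie_1 B \neq \emptyset \,\}$.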
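For readers who prefer code to set notation, the join on the first component and its emptiness test can be written as follows. This is a plain in-memory sketch with illustrative names (`join_first`, `is_empty_join`); it says nothing about how a resource-bounded machine would compute the join.

```python
def join_first(a: set[tuple[int, int]], b: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """All (x, y) such that some z has (z, x) in A and (z, y) in B."""
    return {(x, y) for (z, x) in a for (z2, y) in b if z == z2}


def is_empty_join(a: set[tuple[int, int]], b: set[tuple[int, int]]) -> bool:
    """The join is empty exactly if A and B share no first component."""
    return {z for z, _ in a}.isdisjoint({z for z, _ in b})
```

The second function makes the link to set disjointness visible: the join is empty if and only if the sets of first components of A and B are disjoint.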
Lemma 4.3. $\mathcal{T}_{NonEmptyJoin}$ can be filtered from $\mathcal{T}_{Rels}$ by an XQuery query.
Lemma 4.4. Let $r, s : \mathrm{Trees}_\tau \to \mathbb{N}$.
If $r(T) \cdot s(T) \in o\!\left(\frac{\mathrm{size}(T)}{\log(\mathrm{size}(T))}\right)$, then $\mathcal{T}_{EmptyJoin} \notin ST(r, s)$.
From Lemma 4.4 and Lemma 4.3 we immediately obtain a lower bound on the
worst-case data complexity for filtering relative to an XQuery query:
Theorem 4.5. The tree-language $\mathcal{T}_{EmptyJoin}$
(a) can be filtered from $\mathcal{T}_{Rels}$ by an XQuery query,
(b) does not belong to the class ST(r, s), whenever $r, s : \mathrm{Trees}_\tau \to \mathbb{N}$ with
$r(T) \cdot s(T) \in o\!\left(\frac{\mathrm{size}(T)}{\log(\mathrm{size}(T))}\right)$.
Remark 4.6. Let us note that the above bound is almost tight in the following
sense: The problem of deciding whether $A \bowtie_1 B = \emptyset$ and, in general, all
FO-definable problems belong to ST(1, n): in its single scan of the external
memory tape, the Turing machine simply copies the entire input on one of its
internal memory tapes and then evaluates the FO-sentence by the straightforward
LOGSPACE algorithm for FO-model-checking (cf. e.g. [1]).
Sorting. By KeySort, we denote the problem of sorting a set S of tuples
t = (K, V) consisting of a key K and a value V by their keys. Let $ST^-(r, s)$
denote the class of all problems in ST(r, s) that can be solved without writing
to the external memory tape. Then,
Theorem 4.7. Let $r, s : \mathbb{N} \to \mathbb{N}$. If KeySort is in $ST^-(r, s)$, then computing
the natural join $A \bowtie B$ of two finite relations A, B is in
$$ST^-\!\left(r(n^2) + 2,\; s(n^2) + O(\log n) + O\!\left(\max_{t \in A \cup B} |t|\right)\right).$$
Remark 4.8. Given that the size of relations A and B is known (which is usually
the case in practical database management systems, DBMS), the algorithm given
in the previous proof can do a merge-join without additional scans after the
sort run and without a need to buffer more than one tuple. This is guaranteed
even if both relations may contain many tuples with the same join key; in
current implementations of the merge join in DBMS, this may lead to grass-roots
swapping. The (substantial) practical drawback of the join algorithm of
the proof of Theorem 4.7, however, is that much larger relations A′, B′ need to
be sorted: indeed $|A'| = |A| \cdot |B|$.
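Corollary 4.9.
(a) Let $r, s : \mathbb{N} \to \mathbb{N}$ such that $r(n^2) \cdot \left(s(n^2) + \log n\right) \in o\!\left(\frac{n}{\log n}\right)$.
Then, KeySort $\notin ST^-(r, s)$.
(b) KeySort $\notin ST^-\!\left(o(\sqrt[5]{n}),\, O(\sqrt[5]{n})\right)$.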
It is straightforward to see that by using MergeSort, the sorting problem can
be solved using O(log n) scans of external memory provided that three external
memory tapes are available. (In [17], this logarithmic bound is shown to be
tight, for arbitrarily many external tapes.) Corollary 4.9 gives an exponentially
stronger lower bound for the case of a single external tape.
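The logarithmic number of passes is easy to observe with a bottom-up merge sort that doubles the run length in each pass. The Python sketch below keeps everything in memory and only counts merge passes; the name `merge_sort_passes` is illustrative, and the assumption that each pass corresponds to a constant number of sequential scans on a multi-tape machine is ours, made to connect the sketch to the statement above.

```python
import heapq


def merge_sort_passes(items: list[int]) -> tuple[list[int], int]:
    """Bottom-up merge sort; returns (sorted list, number of merge passes)."""
    runs = [[x] for x in items]  # initial sorted runs of length 1
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            # Merge two neighbouring sorted runs (or carry over a single one).
            merged.append(list(heapq.merge(*runs[i:i + 2])))
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes
```

Since the number of runs is roughly halved in every pass, a list of n items needs ⌈log₂ n⌉ passes.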
A hierarchy based on the number of scans.
Theorem 4.10. For every fixed $k \ge 1$,
$$ST\big(k, O((\log k) + \log n)\big) \cap NST\big(1, O(k \cdot \log n)\big) \;\not\subseteq\; ST\!\left(k - 1,\; o\!\left(\frac{\sqrt{n}}{k^5 (\log n)^3}\right)\right).$$
The proof of this theorem is based on a result due to Duris, Galil and Schnitger
[10]. Theorem 4.10 directly implies
Corollary 4.11. For every fixed $k \in \mathbb{N}$ and all classes S of functions from $\mathbb{N}$ to
$\mathbb{N}$ such that $O(\log n) \subseteq S \subseteq o\!\left(\frac{\sqrt{n}}{(\lg n)^3}\right)$ we have $ST(k, S) \subsetneq ST(k+1, S)$.
Remark 4.12. On the other hand, of course, the hierarchy collapses if internal
memory space is at least linear in the size of the input: For every $r : \mathbb{N} \to \mathbb{N}$ and
for every $s : \mathbb{N} \to \mathbb{N}$ with $s(n) \in \Omega(n)$, we have
$ST(r, s) \subseteq ST(1, n + s(n))$ and $ST(r, O(s(n))) = DSPACE(O(s(n)))$.
5 Tight bounds for filtering and query evaluation on trees
Lower bound. We need the following notation: We fix a set τ of tag names via
τ := {root, left, right, blank}. Let $T_1$ be the τ-tree from Figure 1. Note that
$T_1$ has a unique leaf $v_1$ labeled with the tag name left. For any arbitrary τ-tree
T we let $T_1(T)$ be the τ-tree rooted at $T_1$'s root and obtained by identifying
node $v_1$ with the root of T and giving the label left to this node. Now, for
every $n \ge 2$ let $T_n$ be the τ-tree inductively defined via $T_n := T_1(T_{n-1})$. It
is straightforward to see that $T_n$ has exactly 2n leaves labeled blank. Let
$x_1, \ldots, x_n, y_n, \ldots, y_1$ denote these leaves, listed in document order (i.e., in the
order obtained by a pre-order depth-first left-to-right traversal of $T_n$). For an
illustration see Figure 2.
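[Figure 1: drawing of the tree $T_1$ (a root with children left and right; left has a
child blank; right has children left and right, the latter with a child blank),
shown next to its XML document:]
<root>
  <left>
    <blank/>
  </left>
  <right>
    <left/>
    <right>
      <blank/>
    </right>
  </right>
</root>
Fig. 1. A τ-tree $T_1$ and its XML document Doc($T_1$) with tag names
τ := {root, left, right, blank}.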
[Figure 2: drawing of the tree $T_2$ with its blank leaves marked $x_1$, $x_2$, $y_2$, $y_1$
in document order.]
Fig. 2. Tree $T_2$ and nodes $x_1, x_2, y_1, y_2$.
We let $\tau_{01} := \tau \cup \{0, 1\}$. For all sets $X, Y \subseteq \{1, \ldots, n\}$ let $T_n(X, Y)$ be the
$\tau_{01}$-tree obtained from $T_n$ by replacing, for each $i \in \{1, \ldots, n\}$, (*) the label blank
of leaf $x_i$ by the label 1 if $i \in X$, and by the label 0 otherwise and (*) the label
blank of leaf $y_i$ by the label 1 if $i \in Y$, and by the label 0 otherwise.
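The inductive construction can be made concrete with a short Python sketch that builds $T_n(X, Y)$ as nested `(label, children)` pairs and serializes it. The representation and the names `t_n`, `doc`, `bit_leaves` and `depth` are assumptions of this example; `depth` counts edges on a longest root-to-leaf path, which may differ by one from the convention used in the full version of the paper.

```python
def t_n(n: int, X: set[int], Y: set[int], level: int = 1, label: str = "root"):
    """Build T_n(X, Y); the copy of T_1 at nesting level i carries leaves x_i and y_i."""
    x_leaf = ("1" if level in X else "0", [])
    y_leaf = ("1" if level in Y else "0", [])
    # The node v_1 is either the innermost `left` leaf or the root of the next copy.
    inner = ("left", []) if level == n else t_n(n, X, Y, level + 1, "left")
    return (label, [("left", [x_leaf]), ("right", [inner, ("right", [y_leaf])])])


def doc(tree) -> str:
    """Serialize a tree to its document string (without whitespace)."""
    label, children = tree
    if not children:
        return f"<{label}/>"
    return f"<{label}>" + "".join(doc(c) for c in children) + f"</{label}>"


def bit_leaves(tree) -> list[str]:
    """Labels of the 0/1 leaves in document order: x_1..x_n, y_n..y_1."""
    label, children = tree
    if not children:
        return [label] if label in ("0", "1") else []
    return [b for c in children for b in bit_leaves(c)]


def depth(tree) -> int:
    """Number of edges on a longest root-to-leaf path."""
    _, children = tree
    return 0 if not children else 1 + max(depth(c) for c in children)
```

In this sketch the depth of $T_n(X, Y)$ grows linearly with n (it is 2n + 1 edges), so the two sets are spread along a path whose length is proportional to the depth of the tree.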
We let
$\mathcal{T}_{Sets} := \{\, T_n(X, Y) : n \ge 1,\ X, Y \subseteq \{1, \ldots, n\} \,\}$,
$\mathcal{T}_{Disj} := \{\, T_n(X, Y) \in \mathcal{T}_{Sets} : X \cap Y = \emptyset \,\}$,
$\mathcal{T}_{NonDisj} := \{\, T_n(X, Y) \in \mathcal{T}_{Sets} : X \cap Y \neq \emptyset \,\}$.
Lemma 5.1. (a) There is a Core XPath query Q such that the following is true
for all τ-trees $T \in \mathcal{T}_{Sets}$: $\mathrm{Eval}(Q, T) \neq \emptyset \iff T \in \mathcal{T}_{NonDisj}$.
(b) There is an FO-sentence φ such that the following is true for all τ-trees T:
$T \models \varphi \iff T \in \mathcal{T}_{NonDisj}$.
Lemma 5.2. Let $r, s : \mathrm{Trees}_\tau \to \mathbb{N}$.
If $r(T) \cdot s(T) \in o(\mathrm{depth}(T))$, then $\mathcal{T}_{NonDisj} \notin ST(r, s)$.
From Lemma 5.1 and Lemma 5.2 we directly obtain a lower bound on the
worst-case data complexity of Core XPath filtering:
Theorem 5.3. The tree-language $\mathcal{T}_{NonDisj}$
(a) can be filtered from $\mathcal{T}_{Sets}$ by a Core XPath query,
(b) is definable by an FO-sentence (and therefore, also definable by a Boolean
MSO query and recognizable by a tree automaton), and
(c) does not belong to the class ST(r, s), whenever
$r, s : \mathrm{Trees}_\tau \to \mathbb{N}$ with $r(T) \cdot s(T) \in o(\mathrm{depth}(T))$.


In the following subsection we match this lower bound with the corresponding
upper bound.
Upper bounds. Recall that a tree-language $\mathcal{T} \subseteq \mathrm{Trees}_\tau$ is definable by an
MSO-sentence if, and only if, it is recognizable by an unranked tree automaton,
respectively, if, and only if, the language $\{\mathrm{BinTree}(T) : T \in \mathcal{T}\}$ of associated
binary trees is recognizable by an ordinary (ranked) tree automaton (cf., e.g.,
[7, 9, 30]).
Theorem 5.4 (implicit in [25, 29]). Let $\mathcal{T} \subseteq \mathrm{Trees}_\tau$ be a tree-language. If $\mathcal{T}$
is definable by an MSO-sentence (or, equivalently, recognizable by a ranked or
an unranked finite tree automaton), then $\mathcal{T} \in ST\big(1, \mathrm{depth}(\cdot) + 1\big)$.
Recall that every Core XPath query is equivalent to a unary MSO query. Thus
a Core XPath filter can be phrased as an MSO sentence on trees. From
Theorems 5.4 and 5.3 we therefore immediately obtain a tight bound for Core
XPath filtering:
Corollary 5.5. (a) Filtering from the set of unranked trees with respect to every
fixed Core XPath query Q belongs to $ST\big(1, O(\mathrm{depth}(\cdot))\big)$.
(b) There is a Core XPath query Q such that, for all $r, s : \mathrm{Trees}_\tau \to \mathbb{N}$ with
$r(T) \cdot s(T) \in o\big(\mathrm{depth}(T)\big)$, filtering w.r.t. Q does not belong to ST(r, s).
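To give a feeling for why a single scan with buffer space proportional to the depth can suffice, here is a hand-written one-pass filter for one specific query, /descendant::*[child::A and child::B]/child::*, whose result is non-empty exactly if some element has both a child labeled A and a child labeled B. This is a sketch for this single query under simplifying assumptions (well-formed input, tags without attributes or text), not the automaton-based construction behind Theorem 5.4; the names `TOKEN` and `filter_a_and_b` are made up for the example.

```python
import re

# Matches <name>, </name> and <name/>; attributes and text are not supported.
TOKEN = re.compile(r"<(/?)([^<>/\s]+)(/?)>")


def filter_a_and_b(document: str) -> tuple[bool, int]:
    """Single left-to-right pass; returns (query result non-empty, max stack height)."""
    stack = []  # one [has_A_child, has_B_child] pair per currently open element
    found, max_height = False, 0
    for match in TOKEN.finditer(document):
        closing, name, self_closing = match.group(1), match.group(2), match.group(3)
        if closing:
            has_a, has_b = stack.pop()
            found = found or (has_a and has_b)
        else:
            if stack:  # record this element as a child of the enclosing element
                if name == "A":
                    stack[-1][0] = True
                if name == "B":
                    stack[-1][1] = True
            if not self_closing:
                stack.append([False, False])
                max_height = max(max_height, len(stack))
    return found, max_height
```

The stack never holds more entries than the nesting depth of the document, and each entry has constant size, so the working memory of this sketch is O(depth).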
Next, we provide an upper bound for the problem of computing the set
Eval(Q, T) of nodes in an input tree T matching a unary MSO (or Core XPath)
query Q. We first need to clarify what this means, because writing the subtree
of each matching node onto the output tape requires a very large amount of
internal memory (or a large number of head reversals on the external memory
tape), and this gives us no appropriate characterization of the difficulty of the
problem. We study the problem of computing, for each node matched by Q, its
index in the tree, in the order in which they appear in the document Doc(T). We
distinguish between the case where these indexes are to be written to the output
tape in ascending order and the case where they are to be output in descending
(i.e., reverse) order.
Theorem 5.6 (implicit in [26, 20]). For every unary MSO or Core XPath
query Q, the problem of computing, for input trees T, the nodes in Eval(Q, T)
(a) in ascending order belongs to $ST(3, O(\mathrm{depth}(\cdot)))$.
(b) in reverse order belongs to $ST(2, O(\mathrm{depth}(\cdot)))$.
Note that this bound is tight: From Corollary 5.5(b) we know that, for some Core
XPath query Q, not even filtering (i.e., checking whether Eval(Q, T) is empty)
is possible in ST(r, s) if $r(T) \cdot s(T) \in o\big(\mathrm{depth}(T)\big)$.
References
1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,
1995.
2. G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl. On the streaming model
augmented with a sorting primitive. In Proc. FOCS'04, pages 540–549.
3. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the
frequency moments. Journal of Computer and System Sciences, 58:137–147, 1999.
4. A. Arasu, B. Babcock, T. Green, A. Gupta, and J. Widom. Characterizing Memory
Requirements for Queries over Continuous Data Streams. In Proc. PODS'02,
pages 221–232, 2002.
5. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues
in data stream systems. In Proc. PODS'02, pages 1–16.
6. Z. Bar-Yossef, M. Fontoura, and V. Josifovski. On the Memory Requirements of
XPath Evaluation over XML Streams. In Proc. PODS'04, pages 177–188, 2004.
7. A. Brüggemann-Klein, M. Murata, and D. Wood. Regular Tree and Regular
Hedge Languages over Non-ranked Alphabets: Version 1, April 3, 2001. Technical
Report HKUST-TCSC-2001-05, Hong Kong Univ. of Science and Technology, 2001.
8. J.-E. Chen and C.-K. Yap. Reversal Complexity. SIAM J. Comput., 20(4):622–638,
Aug. 1991.
9. J. Doner. Tree Acceptors and some of their Applications. Journal of Computer
and System Sciences, 4:406–451, 1970.
10. P. Duris, Z. Galil, and G. Schnitger. Lower bounds on communication complexity.
Information and Computation, 73:1–22, 1987. Journal version of STOC'84 paper.
11. G. Gottlob and C. Koch. Monadic Datalog and the Expressive Power of Web
Information Extraction Languages. Journal of the ACM, 51(1):74–113, 2004.
12. G. Gottlob, C. Koch, and R. Pichler. Efficient Algorithms for Processing XPath
Queries. In Proc. VLDB 2002, pages 95–106, Hong Kong, China, 2002.
13. G. Gottlob, C. Koch, and R. Pichler. The Complexity of XPath Query Evaluation.
In Proc. PODS'03, pages 179–190, San Diego, California, 2003.
14. G. Graefe. Query Evaluation Techniques for Large Databases. ACM Computing
Surveys, 25(2):73–170, June 1993.
15. T. J. Green, G. Miklau, M. Onizuka, and D. Suciu. Processing XML Streams
with Deterministic Automata. In Proc. ICDT'03, 2003.
16. M. Grohe, C. Koch, and N. Schweikardt. Tight lower bounds for query processing
on streaming and external memory data. Technical report CoRR cs.DB/0505002,
2005. Full version of ICALP'05 paper.
17. M. Grohe and N. Schweikardt. Lower bounds for sorting with few random accesses
to external memory. In Proc. PODS, 2005. To appear.
18. M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams.
In External memory algorithms, volume 50, pages 107–118. DIMACS Series In
Discrete Mathematics And Theoretical Computer Science, 1999.
19. J. E. Hopcroft and J. D. Ullman. Some results on tape-bounded Turing machines.
Journal of the ACM, 16(1):168–177, 1969.
20. C. Koch. Efficient Processing of Expressive Node-Selecting Queries on XML Data
in Secondary Storage: A Tree Automata-based Approach. In Proc. VLDB 2003,
pages 249–260, 2003.
21. E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge Univ. Press,
1997.
22. U. Meyer, P. Sanders, and J. Sibeyn, editors. Algorithms for Memory Hierarchies,
volume 2832 of Lecture Notes in Computer Science. Springer-Verlag, 2003.
23. J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical
Computer Science, 12:315–323, 1980.
24. S. Muthukrishnan. Data streams: algorithms and applications. In Proc. 14th
SODA, pages 413–413, 2003.
25. A. Neumann and H. Seidl. Locating Matches of Tree Patterns in Forests. In
Proc. 18th FSTTCS, LNCS 1530, pages 134–145, 1998.
26. F. Neven and J. van den Bussche. Expressiveness of Structured Document Query
Languages Based on Attribute Grammars. J. ACM, 49(1):56–100, Jan. 2002.
27. R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill,
2002.
28. L. Segoufin. Typing and Querying XML Documents: Some Complexity Bounds.
In Proc. PODS'03, pages 167–178, 2003.
29. L. Segoufin and V. Vianu. Validating Streaming XML Documents. In Proc.
PODS'02, 2002.
30. J. Thatcher and J. Wright. Generalized Finite Automata Theory with an Application
to a Decision Problem of Second-order Logic. Math. Syst. Theory, 2(1):57–81,
1968.
31. P. van Emde Boas. Machine Models and Simulations. In J. van Leeuwen, editor,
Handbook of Theoretical Computer Science, volume 1, chapter 1, pages 1–66.
Elsevier Science Publishers B.V., 1990.
32. J. Vitter. External memory algorithms and data structures: Dealing with massive
data. ACM Computing Surveys, 33(2):209–271, June 2001.
33. World Wide Web Consortium. XQuery 1.0 and XPath 2.0 Formal Semantics.
W3C Working Draft (Aug. 16th 2002), 2002. http://www.w3.org/XML/Query.
34. A. Yao. Some complexity questions related to distributive computing. In Proc.
11th STOC, pages 209–213, 1979.
