XQuery: An XML query language
by D. Chamberlin
The World Wide Web Consortium has convened a working group to design a query language for Extensible Markup Language (XML) data sources. This new query language, called XQuery, is still evolving and has been described in a series of drafts published by the working group. XQuery is a functional language comprised of several kinds of expressions that can be nested and composed with full generality. It is based on the type system of XML Schema and is designed to be compatible with other XML-related standards. This paper explains the need for an XML query language, provides a tutorial overview of XQuery, and includes several examples of its use.

©Copyright 2002 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.

IBM Systems Journal, Vol. 41, No. 4, 2002. 0018-8670/02/$5.00 © 2002 IBM

Increasingly, Extensible Markup Language (XML) 1 is considered the format of choice for the exchange of information among various applications on the Internet. The popularity of XML is due in large part to its flexibility for representing many kinds of information. The use of tags makes XML data self-describing, and the extensible nature of XML makes it possible to define new kinds of documents for specialized purposes. As the importance of XML has increased, a series of standards has grown up around it, many of which were defined by the World Wide Web Consortium (W3C). 2 For example, XML Schema 3 provides a notation for defining new types of elements and documents; XML Path Language (XPath) 4 provides a notation for selecting elements within an XML document; and Extensible Stylesheet Language Transformations (XSLT) 5 provides a notation for transforming XML documents from one representation to another.

XML makes it possible for applications to exchange data in a standard format that is independent of storage. For example, one application may use a native XML storage format, whereas another may store data in a relational database. Since XML is emerging as a standard for data exchange, it is natural that queries among applications should be expressed as queries against data in XML format. This use gives rise to a requirement for a query language designed expressly for XML data sources. In October 1999, W3C convened the XML Query Working Group 6 for the purpose of designing such a query language, to be called XQuery.

XML data are different from relational data in several important respects that influence the design of a query language. Relational data tend to have a regular structure, which allows the descriptive meta-data for these data to be stored in a separate catalog. XML data, in contrast, are often quite heterogeneous, and distribute their meta-data throughout the document. XML documents often contain many levels of nested elements, whereas relational data are “flat.” XML documents have an intrinsic order, whereas relational data are unordered except where an ordering can be derived from data values. Relational data are usually “dense” (nearly every column has a value), and
relational systems often represent missing information by a special null value. XML data, in contrast, are often “sparse” and can represent missing information simply by the absence of an element. For these and other reasons, existing relational query languages are not directly suitable for querying XML data.

The design of XQuery is still in progress. The XML Query Working Group has published working drafts of several documents that describe the current state of the design. Of these, perhaps the most important is XQuery 1.0: An XML Query Language, 7 which contains a syntax and informal description of the language. The working group has also published a list of requirements, 8 a description of the data model that underlies the language, 9 a formal semantic description, 10 a list of functions and operators, 11 and a collection of use cases that illustrate applications of the language. 12 Each of these documents is updated from time to time as the design of XQuery evolves. This paper is based on the most recent XQuery design at the time of its publication, but since this design is still changing, the documents referenced in this paragraph should be consulted for the latest developments.

The design of XQuery has been subject to a number of influences. Perhaps the most important of these is compatibility with existing W3C standards, including Schema, XSLT, XPath, and XML itself. XPath, in particular, is so important and so closely related that XQuery is defined as a superset of XPath. The overall design of XQuery is based on a language proposal called Quilt. 13 Quilt, in turn, was influenced by the functional approach of Object Query Language (OQL), 14 by the keyword-based syntax of Structured Query Language (SQL), 15 and by previous XML query language proposals including XQL, 16 XML-QL, 17 and Lorel. 18

It is an objective of the XML Query Working Group to define two syntaxes for XQuery: one that is expressed in XML, and one that is optimized for human writing and understanding. This paper describes only the human-oriented version of XQuery.

The initial design of XQuery is focused only on information retrieval and does not provide facilities for updating existing XML documents. The XML Query Working Group may consider the addition of an update facility after completing the design of the first version of XQuery.

This paper describes the data model on which XQuery is based, and then presents an overview of the XQuery language in the form of a series of examples. This paper is not intended to provide a rigorous or exhaustive definition of the language. The reader is referred to Reference 7 for an XQuery syntax and a more complete language description.

Data model

Formally, the input and output of XQuery are defined in terms of a data model, described in Reference 9. The query data model provides an abstract representation of one or more XML documents or document fragments. The data model is based on the notion of a sequence. A sequence is an ordered collection of zero or more items. An item may be a node or an atomic value. An atomic value is an instance of one of the built-in data types defined by XML Schema, such as strings, integers, decimals, and dates. A node conforms to one of seven node kinds, which include element, attribute, text, document, comment, processing instruction, and namespace nodes. A node may have other nodes as children, thus forming one or more node hierarchies. Some kinds of nodes, such as element and attribute nodes, have names or typed values, or both. A typed value is a sequence of zero or more atomic values. Nodes have identity (that is, two nodes may be distinguishable even though their names and values are the same), but atomic values do not have identity. Among all the nodes in a hierarchy there is a total ordering called document order, in which each node appears before its children. Document order corresponds to the order in which the nodes would appear if the node hierarchy were represented in XML format. Document order between nodes in different hierarchies is implementation-defined but must be consistent; that is, all the nodes in one hierarchy must be ordered either before or after all the nodes in another hierarchy.

Sequences may be heterogeneous; that is, they may contain mixtures of various types of nodes and atomic
values. However, a sequence never appears as an item in another sequence. All operations that create sequences are defined to “flatten” their operands so that the result of the operation is a single-level sequence. There is no distinction between an item and a sequence of length one; in other words, a node or atomic value is considered to be identical to a sequence of length one containing that node or atomic value.

Sequences of length zero are valid and are sometimes used to represent missing or unknown information, in much the same way that null values are used in relational systems.

In addition to sequences, the query data model defines a special value called the error value, which is the result of evaluating an expression that contains an error. An error value may not be combined in a sequence with any other value.

Input XML documents can be transformed into the query data model by a process called schema validation, which parses the document, validates it against a particular schema, and represents it as a hierarchy of nodes and atomic values, labeled with type information derived from the schema. If an input document does not have a schema, it is validated against a permissive default schema that assigns generic types: nodes are labeled anyType and atomic values are labeled anySimpleType. The process of schema validation is described in more detail in Reference 3.

The result of a query may be transformed from the query data model into an XML representation by a process called serialization. The details of serialization are beyond the scope of this paper. It is worth noting that the result of a query is not always a well-formed XML document. For example, a query might return an atomic value such as the number 47, or a sequence of elements with no common parent.

[Figure 1: Data model representation of items.xml]

Example data

To illustrate the query data model and provide a basis for later examples, we consider a small XML database that contains data from an on-line auction,
based loosely on Use Case R in Reference 12. The database consists of two XML documents named items.xml and bids.xml.

The items.xml document contains a root element named items, which in turn contains an item element for each item currently for sale at the auction. Each item element has a status attribute and subelements named itemno, seller, description, reserve-price, and end-date. The reserve-price element names a minimum selling price set by the owner, and the end-date element indicates the ending date of the auction.

The bids.xml document contains a root element named bids, which in turn contains a bid element for each bid that has been placed for an item. Each bid element has subelements named itemno, bidder, bid-amount, and bid-date.

Figures 1 and 2 show the data model representations of the items.xml and bids.xml documents, respectively (including only a representative item and a representative bid). In the figures, the circles labeled D, E, A, and T represent document, element, attribute, and text nodes, respectively.

[Figure 2: Data model representation of bids.xml]

Expressions

We now describe expressions in XQuery.

Basics. Like XML and XPath, XQuery is a case-sensitive language, and all its keywords are made up of lowercase characters. Detailed rules for lexing and parsing XQuery are described in Reference 7. Characters enclosed between “{--” and “--}” are considered to be comments and are ignored during query processing (except, of course, inside a quoted string, where they are considered to be part of the string).

XQuery is a functional language, which means that it is made up of expressions that return values and do not have side effects. XQuery has several kinds of expressions, most of which are composed from lower-level expressions, combined by operators or keywords. XQuery expressions are fully composable, that is, where an expression is expected, any kind of expression may be used. As noted earlier, the value
Path expressions may be written in either unabbreviated syntax or abbreviated syntax. The unabbreviated syntax for an axis step consists of an axis and a selection criterion, separated by two colons. Q1 illustrates a four-step path expression using unabbreviated syntax. The first step invokes the built-in document function, which returns the document node for the document named items.xml. The second step is an axis step that finds all children of the document node (“*” selects all nodes on the given axis, which in this case is only a single element node named items). The third step follows the child axis again to find all the child elements at the next level that are named item and that in turn have a child named seller with the value “Smith.” The result of the third step is a sequence of item element nodes. Each of these item nodes is used in turn as the context node for the fourth step, which follows the child axis again to find the description elements that are children of the given item. The final result of the path expression is the result of the fourth step: a sequence of description element nodes, in document order.

(Q1) List the descriptions of all items offered for sale by Smith.

(Q2) List all description elements found in the document items.xml.

    document("items.xml")//description

Within a path expression, a single dot (“.”) refers to the context node, and two consecutive dots (“..”) refer to the parent of the context node. These notations are abbreviated invocations of the self and parent axes, respectively. Names found in path expressions are usually interpreted as names of element nodes; however, a name prefixed by the “@” character is interpreted as the name of an attribute node. This is an abbreviation for a step that traverses the attribute axis. These abbreviations are illustrated by Q3, which begins at the node that is bound to the variable $description, traverses the parent axis to the parent item node, and then traverses the attribute axis to find an attribute named status. The result of Q3 is a single attribute node.

(Q3) Find the status attribute of the item that is the parent of a given description.

    $description/../@status
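As a rough illustration of the navigation that Q1, Q2, and Q3 describe, the following sketch performs comparable lookups in Python with the standard library's xml.etree.ElementTree module. It is not XQuery: ElementTree supports only a small subset of XPath, its elements do not carry parent pointers, and the sample document is hypothetical data invented for this sketch, shaped like the items.xml document described earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element and attribute names come from the paper.
ITEMS_XML = """
<items>
  <item status="open">
    <itemno>1001</itemno>
    <seller>Smith</seller>
    <description>Red bicycle</description>
    <reserve-price>40</reserve-price>
    <end-date>2002-03-15</end-date>
  </item>
  <item status="closed">
    <itemno>1002</itemno>
    <seller>Jones</seller>
    <description>Oak desk</description>
    <end-date>2002-02-01</end-date>
  </item>
</items>
"""

root = ET.fromstring(ITEMS_XML)  # the <items> root element

# Q1 analogue: child steps, with a predicate on the value of the seller child.
q1 = [d.text for d in root.findall("./item[seller='Smith']/description")]

# Q2 analogue: "//" reaches description elements at any depth.
q2 = [d.text for d in root.findall(".//description")]

# Q3 analogue: step from each description to its parent item, then read @status.
# The ".." step is resolved by ElementTree relative to the element searched from.
q3 = [item.get("status") for item in root.findall(".//description/..")]

print(q1)
print(q2)
print(q3)
```

With this sample data, q1 holds the one description sold by Smith, q2 holds both descriptions in document order, and q3 holds the status values of the two parent items.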
Several different kinds of operators and functions are often used inside predicates. In the following six paragraphs, some of the commonest and most useful of these operators and functions are described.

Value comparison operators: eq, ne, lt, le, gt, ge. These operators can compare two scalar values, but they raise an error if either operand is a sequence of length greater than one. If either operand is a node, the value comparison operator extracts its value before performing the comparison. For example, item[reserve-price gt 1000] selects an item node if it has exactly one reserve-price child node whose value is greater than 1000.

Negation: not is a function rather than an operator. It serves to invert a Boolean value, turning true into false and false into true. The following step uses the not function with an existence test to find item nodes that have no reserve-price child element: item[not(reserve-price)].

In all of the above examples, element and attribute names have been simple identifiers. However, the XML Namespace recommendation 19 allows elements and attributes to have two-part names in which the first part is a namespace prefix, followed by a colon. A name qualified by a namespace prefix is called a QName. Each namespace prefix must be bound to a URI (uniform resource identifier) that uniquely
The simplest kind of element constructor looks exactly like the XML syntax for the element to be created. For example, the following expression constructs an element named highbid containing one attribute named status and two child elements named itemno and bid-amount:

    <highbid status="pending">
      <itemno>4871</itemno>
      <bid-amount>250.00</bid-amount>
    </highbid>

In the example above, the values of the elements and attributes are constants. However, in many cases it is necessary to create an element or an attribute

The element node produced by an element constructor is a new node with its own node identity. If the newly constructed element has child nodes and attributes that are derived from existing nodes, as in the above example, the new child nodes and attributes are copies of the nodes from which they were derived, with new node identities.

In the above examples of element constructors, even though the content of the element may be computed, the name of the constructed element is a known constant. However, it is sometimes necessary to construct an element whose name as well as its content is computed. For this purpose, XQuery provides a special kind of constructor called a computed element con-
The simplest form of iteration in XQuery consists of a for clause that names a variable and provides a sequence of values over which the variable is to iterate, followed by a return clause that contains the expression to be evaluated for each variable binding. The following example illustrates this simple form of iteration:

    for $n in (2, 3) return $n + 1

As we have already seen, the function of the for clause and let clause is to bind variables. Each of these clauses contains one or more variables and an expression associated with each variable. The expressions evaluate to sequences and may contain references to variables bound in previous clauses. The difference between a for clause and a let clause is that a for clause iterates each variable over the associated sequence, binding the variable in turn to each
This pair of clauses is not a full FLWR expression because it does not have a return clause. The for clause and let clause simply produce a sequence of binding tuples. The clauses in the above example produce the following sequence of three binding pairs:

    $i = 1, $j = 1
    $i = 2, $j = (1, 2)
    $i = 3, $j = (1, 2, 3)

In general, the number of binding tuples produced by a series of for clauses and let clauses is equal to the product of the cardinalities of the iteration expressions in the for clauses. A let clause without any for clause, of course, produces only a single binding tuple.

The binding tuples produced by the for clauses and let clauses in a FLWR expression are filtered by the optional where clause. The where clause contains an expression that is evaluated for each binding tuple. If the value of the where expression is the Boolean value true or a sequence containing at least one node (an “existence test”), the binding tuple is retained; otherwise the binding tuple is discarded.

The return clause of the FLWR expression is then executed once for each binding tuple retained by the where clause, in order. The results of these executions are concatenated into a sequence that serves as the result of the FLWR expression.

The for clause and let clause produce a binding pair for each item in items.xml. In each binding pair, $i is bound to the item and $b is bound to a sequence containing all the bids for that item. The where clause retains only those binding tuples in which $b contains more than ten bids. The return clause then generates an output element for each of these bindings, containing the item number, description, and bid count.

By default, the order of the output sequence of a FLWR expression preserves the order of the iteration sequences. The prefix operator unordered can be used before any expression to indicate that the order of the result is not significant. This gives the implementation greater flexibility to optimize the execution of the expression (for example, by iterating in a different order).

Any sequence can be reordered by a sortby clause that contains one or more ordering expressions. For each item in the original sequence, the ordering expressions are evaluated using the given item as the context item. The items in the original expression are then reordered into ascending or descending order based on the values of their ordering expressions. Of course, each ordering expression must return a single result, and these results must be comparable by the gt operator. For the purpose of a sortby clause, an empty sequence can be treated either as greater than any other value or as less than any other value, under user control.
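The way a FLWR expression moves from binding tuples through the where clause to a concatenated result can be modeled in a few lines of Python. This is only a sketch of the semantics described above, not an XQuery implementation. It assumes a for clause that iterates $i over (1, 2, 3) and a let clause that binds $j to the integers from 1 to $i, which is one pair of clauses that would yield the three binding tuples listed earlier; the where condition and the returned expression are hypothetical.

```python
from itertools import product

# Assumed clauses: for $i in (1, 2, 3)  let $j := (1 to $i)
# The for clause iterates over its sequence; the let clause binds $j to a
# whole sequence without iterating, so there is one tuple per value of $i.
binding_tuples = [{"i": i, "j": tuple(range(1, i + 1))} for i in (1, 2, 3)]

# where (hypothetical): retain only the tuples in which $j has at least two items.
retained = [t for t in binding_tuples if len(t["j"]) >= 2]

# return $j: evaluated once per retained tuple, in order, and the results are
# concatenated into a single flat sequence (sequences never nest).
result = [value for t in retained for value in t["j"]]

# Two for clauses over sequences of length 2 and 3 give 2 * 3 binding tuples.
tuple_count = len(list(product((1, 2), ("a", "b", "c"))))

print(binding_tuples)
print(result)
print(tuple_count)
```

Python writes the one-item sequence as the tuple (1,), whereas the XQuery data model treats a single item and a sequence of length one as identical.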
(Q7) Generate a report containing the status of the bids for various items. Label each bid with
a status “OK,” “too small,” or “too late.” Enclose the report in an element called bid-status-report.
<bid-status-report>
  {
    for $i in document("items.xml")/*/item
    return
      <item>
        {
          $i/itemno,
          for $b in document("bids.xml")/*/bid[itemno = $i/itemno]
          return
            <bid>
              {
                $b/bidder,
                $b/bid-amount,
                <status>
                  {
                    if ($b/bid-date > $i/end-date) then "too late"
                    else if ($b/bid-amount < $i/reserve-price)
                    then "too small"
                    else "OK"
                  }
                </status>
              }
            </bid>
        }
      </item>
  }
</bid-status-report>
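The nested iteration and conditional logic of Q7 can be traced with a small Python sketch. The records below are hypothetical stand-ins for the item and bid elements, and the field names are invented for this illustration. The sketch also assumes that a comparison against a missing reserve price is simply false, so that such a bid is labeled “OK”; that is one reading of how a comparison with an empty sequence behaves, not a statement of the language definition.

```python
from datetime import date

# Hypothetical records standing in for the item and bid elements.
items = [
    {"itemno": 1001, "end_date": date(2002, 3, 15), "reserve_price": 40.0},
    {"itemno": 1002, "end_date": date(2002, 2, 1), "reserve_price": None},
]
bids = [
    {"itemno": 1001, "bidder": "U01", "amount": 35.0, "bid_date": date(2002, 3, 1)},
    {"itemno": 1001, "bidder": "U02", "amount": 55.0, "bid_date": date(2002, 3, 20)},
    {"itemno": 1001, "bidder": "U03", "amount": 45.0, "bid_date": date(2002, 3, 10)},
    {"itemno": 1002, "bidder": "U01", "amount": 10.0, "bid_date": date(2002, 1, 20)},
]


def bid_status(bid, item):
    """Mirror the if / else if / else chain in Q7."""
    if bid["bid_date"] > item["end_date"]:
        return "too late"
    reserve = item["reserve_price"]
    # Assumption: with no reserve price, the "too small" test is false.
    if reserve is not None and bid["amount"] < reserve:
        return "too small"
    return "OK"


# Outer loop over items, inner loop over the bids whose itemno matches.
report = [
    {
        "itemno": item["itemno"],
        "bids": [
            (b["bidder"], b["amount"], bid_status(b, item))
            for b in bids
            if b["itemno"] == item["itemno"]
        ],
    }
    for item in items
]

for entry in report:
    print(entry)
```

As in Q7, the date test is applied first, so a late bid is reported as “too late” even when its amount is also below the reserve price.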
Like a FLWR expression, a quantified expression allows a variable to iterate over the items in a sequence, being bound in turn to each item in the sequence. For each variable binding, a test expression is evaluated. A quantified expression that begins with some returns the value true if the test expression is true for some variable binding, as in the following example:

    some $n in (5, 7, 9, 11) satisfies $n > 10

A quantified expression that begins with every, in contrast, returns the value true if the test expression is true for every variable binding. For example, the following quantified expression returns the value false because the test expression is true for some but not all bindings:

    every $n in (5, 7, 9, 11) satisfies $n > 10

The use of a quantified expression in a query is illustrated by Q8.

(Q8) Find the items in items.xml for which all the bids received were more than twice the reserve price. Return copies of all these item elements, enclosed in a new element called underpriced-items.

    <underpriced-items>
      {
        for $i in document("items.xml")/*/item
        where every $b in document("bids.xml")/*/bid[itemno = $i/itemno]
              satisfies $b/bid-amount > 2 * $i/reserve-price
        return $i
      }
    </underpriced-items>

Functions

We have already seen several examples of functions, including the document function and aggregating functions such as avg. XQuery provides a library of predefined functions, listed in Reference 11, and also
    typeswitch($customer/billing-address) as $a
      case element of type USAddress
        return $a/state
      case element of type CanadaAddress
        return $a/province
      case element of type JapanAddress
        return $a/prefecture
      default return "unknown"

Type names are also used in three similar-looking XQuery expressions called cast, treat, and assert. Each of these expressions consists of a keyword, a reference to a type, and an expression enclosed in parentheses.

A cast expression is used to convert the result of an expression into one of the built-in types of XML Schema. A predefined set of casts is supported. For example, the result of the expression $x div 5 could be cast to the xs:double type by the expression cast as xs:double($x div 5). A cast may return an error value if it is unsuccessful. For example, cast as xs:integer($mystring) will be successful if $mystring is a string representation of an integer,

    $customer/billing-address/zipcode

A type-checking XQuery compiler might consider the above example to be a type error, since the static type of $customer/billing-address is Address, and the Address type does not in general have a zipcode subelement. However, in the following reformulation of the example, the static type of the expression is changed to USAddress, which has a zipcode subelement, and the type error is removed:

    (treat as USAddress
      ($customer/billing-address))/zipcode

Like treat, assert is used to provide a query processor with information that may be useful for type-checking. An assert expression serves as an assertion to the query processor that its operand expression has a particular static type. If the processor is checking a query for static type-safety, it may raise an error if it cannot verify that the given expression conforms to the asserted type. An assert expression is more strict than a treat expression, because it pertains to the static type of the expression, and