XML With C#
Abstract
This paper proposes a language extension that adds native support for XML to the type
system of C#. In our approach XML documents or document fragments become first-class
citizens that benefit from the full range of features available in a modern programming
language like C#. XML elements can be constructed, loaded, passed, transformed,
updated, and written in a type-safe manner.
To our knowledge, no other approach to XML provides for type safety and syntactic integration of this sort. Existing approaches are either completely untyped (some form of
string processing) or rely on schema-to-class translation patterns that have many limitations. Our approach is a true semantic integration via the type system.
One of the consequences of type integration is that many runtime checks may be moved
to compile time. This has many benefits for performance and program correctness.
The type system extension is based on XML Schemas and incorporates operational aspects
from XPath, XSLT and XQuery. The implementation uses the existing capabilities of the
System.XML namespace in Microsoft's Common Language Infrastructure to
provide its functionality.
In this document, we use the term C#-xml to mean the combination of C# and the proposed extensions. We present these extensions through a series of worked examples.
Although we restrict our attention to XML and C# in this paper, our approach can be applied to integrate any standard object-oriented programming language (C#, Visual Basic,
Java, etc.) with any data-structuring language (XML, SQL, etc.).
Introduction
XML has become the lingua franca for data exchange over the Internet [Box, Skonnard,
Lam]. As an open industry standard, XML lets developers describe the data exchanged
between different devices, applications or businesses. XML is used in database
applications as the language to describe the structure of the database and its views as well as for
data access. XML is also used for dynamic layout of web pages. But what is so special
about XML? XML schemas can describe typed content. Typing helps guarantee the
interoperability of applications.
Currently, there is a plethora of special purpose XML processing languages. [XSLT] is
for transforming well formed but untyped documents. [XDuce] and [XMLambda] are
typed alternatives, but they are mainly an experiment in language design; their use is restricted to a subset of [XML 1.0]. [XPath] and [XQuery] are languages for query processing. XPath is untyped; XQuery however is typed. Yet all of these languages have the
same problems:
Furthermore, as soon as you want to compute something beyond the special purpose for
which they were developed, you reach their limits. Many of them have ended up
reimplementing features that are already in programming languages, but do so only partially
and in an ad hoc manner. Table 1 compares the different languages and their feature sets.
So why not use our favorite object-oriented programming language to support XML
processing? The story so far is disappointing. One of the main reasons is that the type
systems of XML and class-based object-oriented languages don't match very well. In
[C#], you have subtyping based on named relationships, while the type system of XML
schemas is based on structural equivalence, named relationships and range restrictions.
As a remedy for this problem two approaches have been taken to integrate XML into
programming languages (see also Table 1):
A typical example of the former is the support for XML in the Common Language
Infrastructure [CLI], where XML processing is supported in the form of libraries. This is
extremely flexible; whenever new functionality is needed, a library is added. However,
processing XML is not type safe. Furthermore, it is not efficient because one always deals
with untyped data, which requires later validation to be on the safe side. The alternative,
also supported in the CLI distribution, is to use a schema compiler that can either map
schemas to classes or classes to schemas. As long as only very primitive forms of schemas
or classes are used, the mapping works; however, in most cases the results are
disappointing. [Box] for example noted that many XML constructs can't be mapped easily
[Table 1: the cell layout was lost in extraction. Column headings: Language/Feature,
Purpose, Paradigm/Syntax, Typesystem, Technology, Restrictions. Recoverable row labels
include XML Schema, XPath and C#-xml.]
Table 1: Different Languages and Approaches How to Use and Integrate XML
We solve these problems by integrating the type system of XML as a first-class citizen in
the programming language. We use C# in this paper, but a similar approach would work
for other languages. We call this C#-xml. We add XML Schemas as types, and XML
document fragments as literals. The proposed type system guarantees that
We also integrate ideas from XSLT and XPath for a limited form of pattern matching;
and from XQuery for a limited form of set based operations. For interoperability we provide mappings between the CLI and schema types. The resulting language is an extension
of C#. A pre-compiler translates C#-xml into C#. We assume that the reader is familiar
with basic C# concepts and its syntax and has a rough understanding of using XML.
XML schemas are nicely explained in [XML Schema Part0].
The paper is organized as follows. Section 2 discusses the mapping of C# types to XML
schemas and vice versa. Section 3 shows how we build dynamic documents. Section 4
discusses projection and selective update. Section 5 presents C#-xml's support of
iteration, here used for query processing and stream processing. Section 6 discusses C#-xml's
provisions for dealing with well-formed but not well-typed documents. Section 7
concludes. The Appendices are not yet written. (However, by the time this document is
read they should be on the Web.) Appendix A gives an example application. Appendix B
defines the grammar for the extension; Appendix C describes part of the type system.
Appendix D gives the signatures for the new CLI functionality.
The main challenge of the integration of XML into C# is to engineer a bridge between
both type systems. In the following we assume that the reader knows the type system of
C# and knows a little bit about XML and the former [XML 1.0] document definitions.
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema">
  <xsd:element name = "bib" type = "Bib"/>
  <xsd:element name = "book" type = "Book"/>
  <xsd:complexType name = "Bib">
    <xsd:sequence>
      <xsd:element ref = "book"
                   minOccurs = "0" maxOccurs = "unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name = "Book">
    <xsd:sequence>
      <xsd:element name = "title" type = "xsd:string"/>
      <xsd:element name = "author" type = "xsd:string"
                   maxOccurs = "unbounded"/>
    </xsd:sequence>
    <xsd:attribute name = "isbn" use = "required"
                   type = "xsd:string"/>
    <xsd:attribute name = "year" use = "optional"
                   type = "xsd:string"/>
    <xsd:attribute name = "price" use = "optional"
                   type = "xsd:float"/>
  </xsd:complexType>
</xsd:schema>
The schema declares two elements, bib and book, having type Bib and Book respectively.
The bib element declaration describes an XML document tree with root <bib>
.. </bib> whose children consist of a list of book elements. Likewise a book can
be a root element. Each book has a mandatory isbn, optional attributes for year and
price, followed by a title, followed by a non-empty list of authors. Note that the
content of each element and attribute is typed. An example document that conforms to
this schema is the following:
<?xml version = "1.0" encoding = "UTF-8"?>
<bib xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
     xsi:noNamespaceSchemaLocation = "file:///C:/XML/XML/Bib.xsd">
<book year = "2000" isbn = "0-201-70914-7">
<title>Essential XML</title>
<author>Box</author>
<author>Skonnard</author>
<author>Lam</author>
</book>
<book isbn = "0-201-17888-5" price ="22.50" year = "1999" >
<title>Component Software</title>
<author>Szyperski</author>
</book>
</bib>
The value of this document is a node-labeled tree (in XML parlance: an abstract information set [Infoset]). The root of the instance document is a document node; it defines the
character encoding and version information. In our example it contains the bib element
as its only child. The node for the bib element contains the namespace node, a node for
the attribute schema location and two element subtrees. The book nodes contain attribute
nodes, and title and author subtrees.
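The conformance claim can already be checked with the library route that this paper contrasts itself against. The following is a minimal sketch, not part of C#-xml: it validates a document against the Bib schema at run time using .NET's System.Xml.Schema and System.Xml.Linq. The names BibValidation and Validate are illustrative, and the two global element declarations in the embedded schema are an assumption inferred from the prose.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;
using System.Xml.Schema;

public static class BibValidation
{
    // Bib schema as text. Assumption: the global bib/book element
    // declarations are reconstructed from the description in the paper.
    public const string BibXsd = @"<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
  <xsd:element name='bib' type='Bib'/>
  <xsd:element name='book' type='Book'/>
  <xsd:complexType name='Bib'>
    <xsd:sequence>
      <xsd:element ref='book' minOccurs='0' maxOccurs='unbounded'/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name='Book'>
    <xsd:sequence>
      <xsd:element name='title' type='xsd:string'/>
      <xsd:element name='author' type='xsd:string' maxOccurs='unbounded'/>
    </xsd:sequence>
    <xsd:attribute name='isbn' use='required' type='xsd:string'/>
    <xsd:attribute name='year' use='optional' type='xsd:string'/>
    <xsd:attribute name='price' use='optional' type='xsd:float'/>
  </xsd:complexType>
</xsd:schema>";

    // Returns the validation messages; an empty list means the
    // document conforms to the schema.
    public static List<string> Validate(string xml)
    {
        var schemas = new XmlSchemaSet();
        using (var reader = XmlReader.Create(new StringReader(BibXsd)))
        {
            schemas.Add("", reader); // "" = schema without a target namespace
        }

        var errors = new List<string>();
        XDocument.Parse(xml).Validate(schemas, (sender, e) => errors.Add(e.Message));
        return errors;
    }
}
```

Here a missing isbn attribute is only reported when Validate runs; the proposal in this paper aims to reject such a literal at compile time instead.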
Our aim is to integrate these XML schema types and conforming documents into C#. To
do so let us analyze schemas more carefully.
Simple Types
A simple datatype, as defined in [XML Schema Part 2], is either primitive (e.g., xs:string,
xs:boolean, xs:float, xs:double, xs:ID, xs:IDREF) or it is derived from another simple
type by specifying a set of facets (e.g., xs:language, xs:NMTOKEN, xs:long, etc., or user
defined). Facets are constraints like minimum or maximum values of numbers or regular
expressions for strings. A type hierarchy is induced between simple types by subset ordering on the value spaces of the corresponding types. For instance, the value space of
the XML type string is the set of finite-length sequences of characters. The value space of
normalizedString is the set of strings that do not contain the carriage return, line feed or
tab characters. Thus in XML every value of the type normalized string is a string but not
vice versa; normalized string is a (proper) subtype of string. Note that the value spaces of
simple types may overlap; a simple value may be an instance of more than one schema
simple type. (We defer the discussion of lists and unions of simple datatypes until we
discuss the complex types.)
All of C#'s built-in types like bool, int or string have a corresponding XML simple
type. For other simple types, like normalizedString, C#-xml provides corresponding
structures in the namespace System.XML.Types. Constructors for these structures
take C# built-in types and check whether their range restriction is met. Here is the C#
struct for XML's normalizedString.
namespace System.XML.Types {
  public struct NormalizedString {
    private string Value;
    public NormalizedString(string s) {
      // Reject carriage return, line feed and tab.
      foreach (char c in s)
        if (c == '\r' || c == '\n' || c == '\t')
          throw new RangeRestrictionException();
      Value = s;  // strings are immutable, so no copy is needed
    }
    public static implicit operator string(NormalizedString s) {
      return s.Value;
    }
  }
}
The constructor takes a string and, if the passed string doesn't contain any carriage
returns, line feeds or tabs, assigns it to the field Value. Runtime checking is necessary,
since checking whether the constraints are fulfilled is in general not possible at compile
time. To access the embedded value, C#-xml uses implicit conversion operators.
The proposed encoding of simple types in C#-xml does not maintain the subtype order on
XML's simple types; for instance in C#-xml a normalized string is not a subtype of
string. However, many of the properties provided by subtypes (for instance, ease of use,
flexibility, and type safety) are retained because of the implicit casting that allows
normalized strings to be passed to functions that expect a string.
Simple types, for which C#-xml provides corresponding C# structures, are very limited.
They have only one field of the base type from which this type derives. Values of simple
types never contain object references.
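The effect of the implicit conversion can be tried out in standard C#. The sketch below is an assumption-laden stand-in for the proposed System.XML.Types structure: it uses ArgumentException in place of the paper's RangeRestrictionException, and the helper class NormalizedStringDemo is purely illustrative.

```csharp
using System;

public readonly struct NormalizedString
{
    private readonly string value;

    public NormalizedString(string s)
    {
        if (s == null) throw new ArgumentNullException(nameof(s));
        // Range restriction: no carriage return, line feed or tab.
        if (s.IndexOfAny(new[] { '\r', '\n', '\t' }) >= 0)
            throw new ArgumentException("not a normalizedString", nameof(s));
        value = s;
    }

    // Widening direction: always safe, hence implicit.
    // default(NormalizedString) holds null, which is mapped to "".
    public static implicit operator string(NormalizedString s) => s.value ?? "";

    // Narrowing direction: may throw, hence explicit.
    public static explicit operator NormalizedString(string s) => new NormalizedString(s);
}

public static class NormalizedStringDemo
{
    // A function that knows nothing about NormalizedString.
    public static int Length(string s) => s.Length;

    // The struct is accepted where a string is expected.
    public static int Demo() => Length(new NormalizedString("Essential XML"));
}
```

Making the narrowing conversion explicit is a design choice of this sketch: the range check can fail, so the caller has to ask for it.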
Complex Types
Complex types contain elements and/or attributes (see [XML Schema Part 1]). Complex
types build their own type hierarchy. A complex type is either a restriction of a complex
base type, an extension of a simple or complex type definition or a restriction of the super
type of all complex types called AnyComplexType (see also Section 6). Restriction and
extension are based (as in class-based languages) on named relationships. The content
model for complex types, however, is based on regular tree grammars. Regular tree
grammars support structural subtyping.
The content model for C#-xml is adapted from [XML Schema: Formal Description]. To
simplify type checking it unifies various constructs that are distinct in XML. For instance
mandatory and optional attributes (see for example the attributes isbn and price in
our bib schema) are handled as if they were defined as a sequence with the appropriate
number of occurrences. However, it distinguishes everything that is relevant for checking
structural subtyping. In this respect C#-xml follows the design decisions taken by
[XQuery Formal Semantics].
The following simplified grammar makes this more precise:
p in primeType ::= qName | `*:*`           name(set)s
t in typeDefn  ::= elem p { g }            global type
                 | attr p { g }
                 | group p { g }
g in groupDefn ::= p                       element, group or attribute reference
                 | t                       local type
                 | all{g1;...;gn}          interleaved product
                 | {g1;...;gn}             heterogeneous sequence, g1 followed by gn
                 | choice{g1;...;gn}       choice, g1 or gn
                 | mixed g                 mixed content of g
                 | g[m-n]                  homogeneous sequence of type g
A prime type is either a qualified name, denoting a type with a given URI and local
name, or a wildcard. The wildcard `*:*` denotes any name in any namespace. C#-xml
allows one to define elements, attributes and (named) groups. We have the following group
constructors: The empty sequence (written {}) matches only the empty document; it is an
identity for sequence and all. The empty choice (written choice {}) matches no document;
it is a unit for choice. An interleaved product all{g1;...;gn} matches all documents
which contain values in groups g1 up to gn in arbitrary order. All groups in XML Schema
are a specialization of all in this type system. In XSD they can consist only of global or
local element declarations. The homogeneous sequence type g[m-n] matches a minimum
of m values and a maximum of n values. The length of the sequence is unbounded if
n = *. In the sequel we will call homogeneous sequences simply sequences.
C#-xml users normally don't have to write type definitions using this grammar. C#-xml is
able to directly import a schema; this is shown in Section 3. However, if a query results
in a return type which is not predefined by a schema, then they can denote the type using
the above grammar. For example, the C#-xml types for the given Bib schema are as follows:
group Bib {
  elem book { Book } [0-*];
}
group Book {
  attr isbn { string };
  attr year { int } [0-1];
  attr price { float } [0-1];
  elem title { string };
  elem author { string } [1-*];
}
(This declaration uses the fact that the type {g1;...;gn} can be written as g1;...;gn as
long as it is unambiguous.)
XML schemas are much richer than what is provided in C#-xml. We deliberately decided
not to include the whole type language of XML in C#-xml. To use more advanced features one has to import the corresponding schema; however the result of every query and
of every literal is expressible using the given syntax.
Structural Subtypes
The idea of the subtype relationship is that t is a subtype of t' if t describes a subset of the
possible values described by t'. For instance t is a subtype of choice{t; t'}. We use the
symbol <: to denote the subtype relationship of content types. We write t1 <: t2 if t1 is a
subtype of t2, for instance t <: choice{t; t'}. The subtype relationship is a partial order,
i.e. it is reflexive and transitive. Let t, t1, t2 be types denoting elements. Here are some of
the inequalities that hold (AnyType is the supertype of any XML type, see below and Section 6):
choice{} <: t, t <: AnyType, t1 <: choice{ t1;t2}, t2 <: choice{ t1;t2}.
Sequences are covariant; in addition we have the following relationship on bounds:
If t <: t' and m' <= m and n <= n' then
t[m-n] <: t'[m'-n'].
Finally let's relate sequences and all groups: If t1 <: t1' and t2 <: t2' then the
following relationships also hold:
{t1; t2} <: {t1'; t2'}, {t1; t2} <: all{t1; t2}, all{t1; t2} <: all{t1'; t2'}.
Elements can also stand in subtype relationship using wildcards. For an elaborate exposition consult [XML Schema: Formal Description].
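The bounds rule for homogeneous sequences is easy to get backwards, so a small executable check may help. This is a sketch in plain C#, not part of the proposed compiler; the type Occurs and the method IsWithin are hypothetical names, and a null upper bound stands for `*`. It covers only the bounds part of the rule; the item types are assumed to be already related by <:.

```csharp
using System;

// Occurrence bounds [Min-Max] of a homogeneous sequence type.
public readonly struct Occurs
{
    public int Min { get; }
    public int? Max { get; }   // null encodes '*' (unbounded)

    public Occurs(int min, int? max = null)
    {
        if (min < 0 || (max.HasValue && max.Value < min))
            throw new ArgumentOutOfRangeException(nameof(min));
        Min = min;
        Max = max;
    }

    // True if every length allowed by this range is also allowed by 'wider',
    // i.e. wider.Min <= Min and Max <= wider.Max.
    public bool IsWithin(Occurs wider)
    {
        bool lowerOk = wider.Min <= Min;
        bool upperOk = !wider.Max.HasValue
                       || (Max.HasValue && Max.Value <= wider.Max.Value);
        return lowerOk && upperOk;
    }
}
```

For example, `new Occurs(1, 1).IsWithin(new Occurs(0))` holds, which matches the intuition that a singleton t[1-1] may be used wherever t[0-*] is expected, while the converse does not hold.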
C#-xml supports structural subtyping of XML types and their values based on these
relationships. For instance one can pass a value of type t1 or a value of type t2 to a function that
accepts values of choice{t1; t2}. However, note that t1 and t2 must be simple or complex
XML types. XML subtyping rules cannot be applied to ordinary C# structures or
classes.
The root of any XML type is AnyType. AnyType inherits from Object. AnyType provides basic functionality for reading and writing XML documents; see Section 5.
AnyType also captures untyped but well formed documents. Their introduction is deferred until Section 6.
Datamodel
The value space of a complex type is a set of ordered node-labeled trees, see [XQuery 1.0
and XPath 2.0 Data Model] for more details. Node values also include a concept of node
identity. Node identity simplifies the representation of XML reference values, e.g.,
IDREF, and URI values. As mandated by [XQuery 1.0 and XPath 2.0 Data Model] two
nodes have the same identity if and only if they were created by the same application of a
node constructor. Note that having identities does not mean that our nodes are reference
types. Instead they are mathematical values (and not an aggregation of individual memory cells); they never contain null.
In C#-xml the equality of any simple schema type is reduced to equality on the underlying
base type. Any complex type provides two relations: the Equals method defines node
equality (in C#-xml, it is also provided by the == operator); the ValueEquals method is
defined structurally (see below for ordering aspects). In addition our type system supports
implicit boxing and unboxing of XML values to objects. Finally, we have implicit
conversion operations from homogeneous sequences to arrays and vice versa. But note that
in contrast to arrays, homogeneous sequences might be lazy data structures: they might
only be populated as elements are accessed.
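The distinction between node equality and structural equality also exists in today's .NET libraries, which gives a concrete way to see the difference. The sketch below uses LINQ to XML as an analogy only: XElement is a reference type, unlike the value-like nodes proposed here, and the class and method names are illustrative.

```csharp
using System.Xml.Linq;

public static class NodeEquality
{
    // Node identity: true only for the very same constructed node.
    // Roughly corresponds to Equals / == in C#-xml.
    public static bool SameNode(XElement x, XElement y) =>
        object.ReferenceEquals(x, y);

    // Structural equality: same name, attributes and content.
    // Roughly corresponds to ValueEquals in C#-xml.
    public static bool SameValue(XElement x, XElement y) =>
        XNode.DeepEquals(x, y);
}
```

Two separately constructed `<author>Box</author>` elements are structurally equal but are different nodes, since they stem from two applications of a node constructor.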
A document order is defined on all nodes in a document. It corresponds to the order in
which the XML document (fragment) is written after expansion of entities. Thus, first the
tag is written; then, namespace nodes followed by the attribute nodes, followed by the
children and finally, the end tag. The relative order of attributes is implementation
dependent. Nodes from different documents are also totally ordered; the order doesn't
change during a program run, however the chosen order is implementation dependent.
In C#-xml, documents are essentially written as they are in XML. To type check an XML
expression one has to provide the schema type. Schemas can be imported. This is usually
done via an import statement. Here is our first example:
using System;
using Bib.xsd;
public class Sample {
  static Book book =
    <book isbn = "0-201-17888-5">
      <title>Component Software</title>
      <author>Szyperski</author>
    </book>;
  public static void Main() {
    Console.WriteLine(book);
  }
}
The compiler checks that the program is type correct using the type inference rules from
the Formal Schema definition. When executed, the program first builds up an internal
representation of the [XQuery 1.0 and XPath 2.0 Data Model]. Next, it prints the following string on the console: <book isbn = "0-201-17888-5"> <title>
Component Software </title> <author> Szyperski</author>
</book>. How to select isbn, title and author is explained in Section 4.
Preprocessing
The construction of XML is parameterized by three new C# preprocessor flags; they affect the construction of the data model. If the IgnoreComments flag is true, comment
nodes are not preserved in the data model. If the IgnoreProcessingInstructions flag is true,
processing-instruction nodes are not preserved in the data model. If the IgnoreWhitespace
flag is true, insignificant white space is not preserved. For a definition of the notion of
insignificant whitespace see [XQuery 1.0 and XPath 2.0 Data Model].
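The three flags have direct counterparts in the existing library: XmlReaderSettings exposes IgnoreComments, IgnoreProcessingInstructions and IgnoreWhitespace as run-time properties. The sketch below shows their effect when loading a document with LINQ to XML; it only illustrates the semantics and is not how the proposed preprocessor flags would be implemented. The class and method names are illustrative.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class DataModelLoader
{
    // Builds an in-memory tree, optionally dropping comments,
    // processing instructions and insignificant white space.
    public static XDocument Load(string xml, bool ignoreComments,
                                 bool ignoreProcessingInstructions,
                                 bool ignoreWhitespace)
    {
        var settings = new XmlReaderSettings
        {
            IgnoreComments = ignoreComments,
            IgnoreProcessingInstructions = ignoreProcessingInstructions,
            IgnoreWhitespace = ignoreWhitespace
        };
        using (var reader = XmlReader.Create(new StringReader(xml), settings))
        {
            return XDocument.Load(reader);
        }
    }
}
```

Loading `<a><!--c--><?pi x?><b/></a>` with the first two flags set leaves a single child node under the root; with both cleared, the comment and the processing instruction are kept as nodes as well.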
Dynamic Literals
C#-xml supports the construction of dynamic documents. It uses the XQuery convention
whereby an arbitrary C# expression can be embedded inside an element by escaping it
with curly braces. The expression must yield values of the required type. The following
example uses a parameterized method to compute the same book as the previous example.
static Book book = CreateBook("0-201-17888-5",
    "Component Software", "Szyperski");
static Book CreateBook(String isbn, String title, String author) {
  return
    <book isbn={isbn}>
      <title>{title}</title>
      <author>{author}</author>
    </book>;
}
C#-xml extends the XQuery convention by allowing C# blocks inside quotes, too. A
block within a quote must yield values with the type demanded by the context. Conceptually, yield statements generate a sequence of values. When the block exits, the concatenated result is the result of the quote.
Let's look at an example. This time we pass an array of authors to the CreateBook
method. CreateBook then has to generate an author element for each of the passed authors.
private static Book CreateBook(String isbn,
    String title, String[] authors) {
  return
    <book isbn={isbn}>
      <title>{title}</title>
      {for (int i = 0; i < authors.Length; i++) {
         yield <author>{authors[i]}</author>;
       }}
    </book>;
}
Executing the loop constructs a sequence of author elements. When the loop terminates
the sequence is returned as the result of the quote. Yield statements are basic ingredients
of iterators, see Section 5 for an elaboration of this topic.
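For comparison, the same document can be built with the functional construction style of .NET's LINQ to XML library. This is a sketch for readers who want something runnable today; it is untyped (a misspelled element name is not caught by the compiler), which is exactly the gap the proposed literals are meant to close. The class name BookBuilder is illustrative.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class BookBuilder
{
    // Library counterpart of the CreateBook literal: one author
    // element is generated per entry of 'authors'.
    public static XElement CreateBook(string isbn, string title, string[] authors) =>
        new XElement("book",
            new XAttribute("isbn", isbn),
            new XElement("title", title),
            // A sequence passed as content is expanded into child nodes,
            // playing the role of the embedded yield loop.
            authors.Select(a => new XElement("author", a)));
}
```

Calling `BookBuilder.CreateBook("0-201-17888-5", "Component Software", new[] { "Szyperski" })` yields the book element used in the first example of this section.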
C#-xml also allows the quoting of element and attribute names. To escape quotes you
have to use two open curly braces in a row. Example: Suppose that author elements are
in fact email addresses and that you want to abbreviate emails with the familiar brace
notation, where for instance {a,b}@c means a@c and b@c. You use escapes to suppress
evaluation of the braced group. Here is a declaration of a book with the email of
its authors.
static Book book =
  <book year = "2000" isbn = "0-201-70914-7">
    <title>Essential XML</title>
    <author> {{Box,Skonnard,Lam}}@developmentor.com </author>
  </book>;
Schema names that don't follow C#'s constraints for identifiers must be written in
backquotes (here aw:book and ph:Book).
Projection
In C#-xml projection is expressed using get expressions. Get takes an XPath as an
argument. The XPath expression begins with an expression that identifies a specific document
or sequence of documents. Next follows a series of "steps". Each step represents
movement through a document: / selects children of the current node maintaining their order; //
selects the current node and all its sub-nodes in document order. Either a match or a
function application has to follow / or //. One can match on element, attribute or URI
names, or node types.
For instance, let us assume that a variable bib of type Bib (see section 2) is defined
within the scope of the following declarations. To find all author elements within bib,
we write the C#-xml expression
get bib/book/author
But what is the type of this expression and how does projection work? The expression
bib is of type Bib; next, bib/book selects all book elements of bib. According to its
schema type this can result in zero or more books, i.e. the type of bib/book is Book[0-*].
Finally the expression bib/book/author selects all authors of each book; therefore it
has zero or more author elements. Therefore a correctly typed query is
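Author[0-*] q1 = get bib/book/author;
The same authors can also be selected with a wildcard step or with a descendant step:
Author[0-*] q2 = get bib/*/author;
Author[0-*] q3 = get bib//author;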
The difference between the queries q1, q2 and q3 is the order of traversal. The XPath
expression of q1 first iterates over all books in bib and then over its authors. The XPath
expression of q2 iterates over all children of the bib database which happen to be books
and then over its authors. Finally q3 selects bib itself, all children, grandchildren,
great-grandchildren and so on that have an author child.
Instead of a match XPath also allows the use of built-in functions, the most prominent
being data(), which selects the simple typed content of an element or attribute. This is
the way to find all author elements as strings or to find all prices as floats.
string[0-*] q4 = get bib//author/data();
float[0-*] q5 = get bib/book/@price/data();
A predicate can follow a match to eliminate nodes that fail to satisfy a given condition.
Predicates are written within square brackets. For instance to find all books which appeared in 2000 one would write:
Book[0-*] q6 = get bib/book[./@year/data() == 2000];
C#-xml's projection also supports an optional sortedby clause. For instance, the
following C#-xml statement sorts the resulting sequence of books by title:
Book[0-*] q7 = get bib/book sortedby title/data();
The sortedby clause can take several keys (in which case it sorts them lexicographically) and the modifiers ascending and descending.
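The projections above map quite directly onto axis methods of .NET's LINQ to XML library, which may help readers experiment with the queries. The following sketch mirrors q1, q4, q5, q6 and q7 on an untyped XElement tree; unlike the typed get expressions, all results are plain IEnumerable sequences and name mismatches only show up as empty results at run time. The class BibQueries and its method names are illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class BibQueries
{
    // The example document from Section 2 (schema location attributes omitted).
    public static XElement SampleBib() => XElement.Parse(
        @"<bib>
            <book year='2000' isbn='0-201-70914-7'>
              <title>Essential XML</title>
              <author>Box</author><author>Skonnard</author><author>Lam</author>
            </book>
            <book isbn='0-201-17888-5' price='22.50' year='1999'>
              <title>Component Software</title>
              <author>Szyperski</author>
            </book>
          </bib>");

    // q1: get bib/book/author
    public static IEnumerable<XElement> Authors(XElement bib) =>
        bib.Elements("book").Elements("author");

    // q4: get bib//author/data()
    public static IEnumerable<string> AuthorNames(XElement bib) =>
        bib.DescendantsAndSelf("author").Select(a => a.Value);

    // q5: get bib/book/@price/data()
    public static IEnumerable<float> Prices(XElement bib) =>
        bib.Elements("book").Attributes("price").Select(p => (float)p);

    // q6: get bib/book[./@year/data() == 2000]
    public static IEnumerable<XElement> BooksFromYear(XElement bib, int year) =>
        bib.Elements("book").Where(b => (int?)b.Attribute("year") == year);

    // q7: get bib/book sortedby title/data()
    public static IEnumerable<XElement> BooksByTitle(XElement bib) =>
        bib.Elements("book")
           .OrderBy(b => (string)b.Element("title"), StringComparer.Ordinal);
}
```

The casts such as `(int?)b.Attribute("year")` stand in for data(); they convert at run time and return null for a missing attribute, whereas in C#-xml the optional type t[0-1] would make that case visible to the type checker.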
C#-xml also supports aggregation. But there is no magic involved; sequences are
ordinary generic data structures predefined in the System.XML.Types namespace. Sequences
provide the well-known aggregation functions from SQL, like count, min, max.
But they also provide additional functions like every and some, which stand for
universal and existential quantification, respectively. Here are two examples: the first one
tests whether the bib database contains a book with a particular isbn number. The
second query tests whether all books of the bib database were published in the previous
century.
bool q8 = Seq.some(get bib/book[./@isbn/data() == "0-201-70914-7"]);
bool q9 = Seq.every(get bib/book[./@year/data() < 2000]);
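Existential and universal quantification over a sequence correspond to the standard LINQ operators Any and All. The sketch below expresses them with a predicate over each book, which is the usual LINQ formulation; whether Seq.some and Seq.every take a filtered sequence or a predicate is not something this sketch tries to settle. Class and method names are illustrative.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class BibQuantifiers
{
    // Existential: is there a book with the given isbn?
    public static bool HasIsbn(XElement bib, string isbn) =>
        bib.Elements("book").Any(b => (string)b.Attribute("isbn") == isbn);

    // Universal: were all books published before 'year'?
    // A book without a year attribute makes the result false,
    // because a comparison with a null int? is false.
    public static bool AllPublishedBefore(XElement bib, int year) =>
        bib.Elements("book").All(b => (int?)b.Attribute("year") < year);
}
```

On an empty bib, AllPublishedBefore is vacuously true and HasIsbn is false, as expected of universal and existential quantifiers.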
Selective Update
XML is often used to label the information content of diverse data sources including
relational databases and object repositories. For these applications selective updates are mandatory.
However, XML values are mathematical values. Only variables containing XML values
can be updated. For updates of XML variables C#-xml uses the set statement. A set
statement takes an XPath as its left-hand side, which describes the node(s) to be updated,
and an expression as its right-hand side. We first consider updating single nodes and then
look at updating many nodes in parallel.
Suppose that we have a book variable of type Book (as defined in the previous section).
Let's start with updating attributes.
set book/@price = 25.0;
The XPath book/@price selects the node, whose content is updated, i.e. this statement
changes the price attribute of the book. If the attribute price exists, it is overridden; otherwise a new attribute node is generated.
Updates on elements work similarly. For instance here is a statement to change the title of
a book from lower case to uppercase:
set book/title =
  <title>{
    (get book/title/data()).ToUpper()
  }</title>;
Note that the string method ToUpper is applied to the result of a projection. In C#-xml
this is correct if the sequence of type t is known to be a singleton sequence, i.e. has
type t[1-1], which in our example is the case.
The right-hand side of the statement can refer to the selected value using the implicit
parameter value. The previous statement can be simplified to
set book/title = <title>{ (get value/data()).ToUpper() }</title>;
But what happens if the value of the referred node of type t is not guaranteed to exist, for
example because value denotes an optional attribute? In this case value will return an
optional type (i.e. t[0-1]) and the user has to handle it appropriately (for an example, see
Subsection Parameter Passing below).
If the XPath expression results in many nodes, the modification is done simultaneously.
To change the last name of all authors from lowercase to uppercase you would write:
set bib/book/author=
<author>{ (get value/data()).ToUpper() }</author> ;
C#-xml allows inserting elements before or after existing elements, by adding the modifier before or after to the set statement. Here is a statement to insert a new author
to the book:
set before bib/book/author[1] = <author> Abiteboul </author>;
Note that before and after can also work on multiple nodes.
To delete an attribute or element, use the delete statement.
delete book/@price;
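Readers who want to try these update operations today can approximate them with the mutating methods of .NET's LINQ to XML library. The sketch below is only an analogy: XElement is a mutable reference type, so the methods change the tree in place, whereas set and delete in C#-xml update a variable holding a mathematical value. The class BookUpdates and its method names are illustrative.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class BookUpdates
{
    // set book/@price = price;  (creates the attribute if it is missing)
    public static void SetPrice(XElement book, float price) =>
        book.SetAttributeValue("price", price);

    // set book/title = <title>{ (get value/data()).ToUpper() }</title>;
    public static void UppercaseTitle(XElement book)
    {
        XElement title = book.Element("title");
        if (title != null) title.Value = title.Value.ToUpperInvariant();
    }

    // set before book/author[1] = <author>name</author>;
    public static void InsertBeforeFirstAuthor(XElement book, string name)
    {
        var newAuthor = new XElement("author", name);
        XElement first = book.Elements("author").FirstOrDefault();
        if (first != null) first.AddBeforeSelf(newAuthor);
        else book.Add(newAuthor);   // no author yet: append instead
    }

    // delete book/@price;
    public static void DeletePrice(XElement book) =>
        book.Attribute("price")?.Remove();
}
```

Each helper has to handle the "node might not exist" case by hand with a null check; in the proposal this obligation is expressed through the optional type t[0-1].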
Parameter Passing
By default XML parameters are passed by value. If they should be updated inside a
function they must be passed by reference. The example shows how to increment a book price
by 10% if it exists and, if it doesn't exist, how to generate a new attribute with a default
value.
static void inc(ref Book b, int dflt) {
  set b/@price = value.Length==1 ? get value/data()*1.1 : dflt;
}
The context conditions for reference parameters carry over from C# to C#-xml, for instance books that are passed by reference must denote (parts of) updateable variables. In
Section 5 we will see that not all XML variables are in fact updateable.
If an XML document represents a data store, then one is often interested in aggregating
information or comparing information from different stores, for instance to do a search
for the best price of a particular product. If the document contains mainly markup, then it
must often be converted from one representation into another, for example to visualize
the received XML in a browser capable of only displaying HTML. In both cases programs iterate over the provided documents. Thus although data processing and stream
processing are different they have something in common, namely iteration. However the
optimizations for the iterations are different. Iteration over data stores should allow query
optimizations; iteration for transformations should be done lazily, i.e. piece by piece.
Database Iteration
The support for iteration comes almost for free: selections return sequences, and
sequences implement the IEnumerable and IEnumerator interfaces; thus we are all
set to reuse C#'s foreach loop to express iteration. Here is an example of collecting all
titles of the bib database as strings.
String[0-*] ts = new Seq();
foreach (Book b in get bib/book){
ts += get b/title/data();
}
The foreach statement iterates over all book elements in bib, and binds the variable b to
each such element. For each element bound to b, the body of the foreach loop selects the
data of the title and appends it to the resulting sequence ts.
Iterators can be nested. In SQL this corresponds to computing inner joins. For ease of
reading, writing and optimization C#-xml also supports iterated bindings within one
foreach statement. Here is an example to select from two databases of type Bib all books
that have overlapping authors but different titles
Book[0-*] res = new Seq();
foreach (Book b1 in get bib1/book,
         Book b2 in get bib2/book,
         String a1 in get b1/author/data(),
         String a2 in get b2/author/data()
         [a1 == a2 && b1/title/data() != b2/title/data()]) {
  res += b1;
  res += b2;
}
In C#-xml the first generator b1 varies more slowly than b2, which varies more slowly
than a1, and so on; the last generator a2 varies fastest. For instance, let bib1 and bib2
denote the initial Bib document given in Section 2; furthermore let B1 denote the
Essential XML book, and let K11, K12, K13 be its authors; likewise let B2 denote the book
Component Software and K2 its author. Then the above iteration (before the filter is
applied) will generate the following sequence of bindings, where an ellipsis (...) denotes
that the binding of the particular variable isn't changed.
nth Binding   b1    b2    a1    a2
1             B1    B1    K11   K11
2             ...   ...   ...   K12
3             ...   ...   ...   K13
4             ...   ...   K12   K11
5             ...   ...   ...   K12
6             ...   ...   ...   K13
7             ...   ...   K13   K11
8             ...   ...   ...   K12
9             ...   ...   ...   K13
10            ...   B2    K11   K2
11            ...   ...   K12   K2
etc           etc   etc   etc   etc
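The order of bindings can be reproduced with a LINQ query expression, in which several from clauses play the role of the iterated generators. The following is a sketch using .NET's LINQ to XML on untyped elements, not the proposed foreach syntax; JoinDemo, Bindings and Overlapping are illustrative names.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class JoinDemo
{
    // Enumerates all bindings (title of b1, title of b2, a1, a2) in
    // generator order: the last from clause varies fastest.
    public static IEnumerable<(string B1, string B2, string A1, string A2)>
        Bindings(XElement bib1, XElement bib2) =>
        from b1 in bib1.Elements("book")
        from b2 in bib2.Elements("book")
        from a1 in b1.Elements("author")
        from a2 in b2.Elements("author")
        select ((string)b1.Element("title"), (string)b2.Element("title"),
                (string)a1, (string)a2);

    // Books from the two databases that share an author but differ in
    // title; like the loop in the text, it emits b1 and then b2 per match.
    public static IEnumerable<XElement> Overlapping(XElement bib1, XElement bib2) =>
        (from b1 in bib1.Elements("book")
         from b2 in bib2.Elements("book")
         from a1 in b1.Elements("author")
         from a2 in b2.Elements("author")
         where (string)a1 == (string)a2
            && (string)b1.Element("title") != (string)b2.Element("title")
         select new[] { b1, b2 })
        .SelectMany(pair => pair);
}
```

Applied to the example document joined with itself, Bindings produces 16 tuples, and the tenth one pairs Essential XML with Component Software, binding a1 to Box and a2 to Szyperski.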
Changing the order of the generators obviously changes the result. But order
preservation disallows query optimizations. However, for most data processing applications the
order is irrelevant. To abstract from the order C#-xml adds a modifier unordered to the
iterator, i.e. we write
foreach unordered (..){ ..}
Bindings can now occur in any order; this can be exploited by the query processor.
Stream Processing
Normally one uses [XSLT] scripts to transform XML. Here we propose to use C#-xml. It
is at least as efficient as XSLT; in addition it is type safe.
To demonstrate C#-xml's capabilities, let us build a small application that takes a bib
input stream and transforms it into an output stream by stripping the year and price
attributes. Our main program looks as follows.
public static void Main(string[] args) {
  readonly Bib bibIn = Bib.OpenRead(args[0]);
  writeonly Bib bibOut = Bib.OpenWrite(args[1]);
  bibOut = <bib> {StripYearAndPrice(get bibIn/book)} </bib>;
  bibIn.Close();
  bibOut.Close();
}
Bib is a schema root element; schema root elements provide methods to load and unload
a document of the corresponding type. Thus we open the input stream and require that nobody
writes to it, which is a specific requirement for stream processing. If bibIn
represented a datastore, we probably would not declare it readonly. Likewise, we open
the output stream and make sure that nobody reads from it. The whole program then
amounts to reading from the input variable and writing to the output variable.
Now let's look at StripYearAndPrice. A naive approach would probably result in
the following code.
static Book[0-*] StripYearAndPrice (Book[0-*] inBooks) {
  // Builds the complete result sequence before returning it.
  Book[0-*] outBooks = new Seq();
  foreach (Book b in inBooks)
    outBooks += <book {b/@isbn}> {b/title}{b/author} </book>;
  return outBooks;
}
There is nothing wrong with this code except that it doesn't perform well: it computes
the whole sequence before StripYearAndPrice returns it. We would instead like
to delay the computation of the sequence.
In the imperative world, computations are delayed by using iterators. C#-xml uses CLU-style
iterators as suggested by Proebsting [Iterators for C#]. An iterator is a procedure-like
mechanism that yields a sequence of values. For instance, the static iterator prod produces
one or two values.
static iterator char[1-2] prod(bool b) {
  yield 'a';
  if (b) yield 'b';
}
When prod is called within a foreach loop with the parameter value true, it first returns
'a', next it returns 'b', and then it fails. If it is called with the value false, it first returns
'a' and then fails. Iterators are a very convenient way to write enumerators: all the state
that needs to be maintained between the calls is handled by the iterator.
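Python generators behave much like the CLU-style iterators described here, so prod can be sketched as follows. This is an analogy in another language rather than C#-xml; "failing" corresponds to the generator being exhausted.

```python
def prod(flag):
    """Yield 'a', and additionally 'b' when flag is true."""
    yield 'a'
    if flag:
        yield 'b'

def collect(flag):
    """Drive the generator with a for loop, as a foreach statement would."""
    values = []
    for ch in prod(flag):
        values.append(ch)
    return values
```

The position inside prod between two yields is kept by the generator itself, so the caller needs no explicit enumerator state.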
Using iterators we can now write the following simplified version of StripYearAndPrice:
This code is simple and efficient. In fact, reading and writing of sequences (for example
to compute the result of path expressions or to read data from a file or datastore) is internally always done using iterators. This also yields the intended effect for our running example: consumption and production of books are interleaved in the best possible way, i.e.
one book is processed after another.
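The difference between eager and delayed computation can be made visible with a small Python sketch. It is an illustration under assumptions, not the C#-xml code: books are plain dictionaries, and read_books, write_books, strip_eager and strip_lazy are hypothetical helpers that record each read and write in a log.

```python
def stripped(book):
    """Keep isbn, title and author; drop everything else (e.g. year, price)."""
    return {key: book[key] for key in ("isbn", "title", "author")}

def read_books(records, log):
    """Simulated input stream: logs each book as it is consumed."""
    for rec in records:
        log.append(f"read {rec['isbn']}")
        yield rec

def write_books(books, log):
    """Simulated output stream: logs each book as it is produced."""
    out = []
    for book in books:
        log.append(f"write {book['isbn']}")
        out.append(book)
    return out

def strip_eager(books):
    """Naive version: builds the whole result before returning it."""
    return [stripped(b) for b in books]

def strip_lazy(books):
    """Iterator version: yields one stripped book at a time."""
    for b in books:
        yield stripped(b)
```

With strip_eager all reads happen before the first write; with strip_lazy reads and writes alternate, one book at a time.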
So far we have dealt only with schema-validated documents. However, C#-xml is also capable of handling non-typeable documents as long as they are well-formed. Well-formed
documents obey the following schema.
group AnyTree        {choice{ AnySimpleType; AnyElement; AnyAttribute} }
group AnyAttribute   {attr `*:*` { AnySimpleType } }
group AnyElement     {elem `*:*` { AnyComplexType } }
group AnyComplexType {AnyAttribute[0-*];
                      choice{ choice{ AnyElement; string} [0-*];
                              simpleType} }
group AnyType        {AnyTree[0-*]}
AnySimpleType stands for the most general simple type; all simple types like int or
normalizedString are subtypes of it. The type AnyTree stands for any simple
type, attribute or element. AnyAttribute stands for the most general attribute, which
must have a name and a simple type. AnyElement must have a name and a complex
type. The latter can consist of attributes followed by either a simple type or mixed content. Finally, AnyType is the most general XML type.
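To make the shape of this generic schema concrete, the following Python sketch models AnyAttribute and AnyElement as data classes and converts any well-formed document into them. It is a loose approximation under assumptions: simple-typed values are represented as plain strings, the function from_etree is hypothetical, and parsing relies on Python's standard xml.etree.ElementTree module rather than on anything in C#-xml.

```python
from dataclasses import dataclass, field
from typing import List, Union
import xml.etree.ElementTree as ET

@dataclass
class AnyAttribute:
    name: str
    value: str  # stands in for AnySimpleType

@dataclass
class AnyElement:
    name: str
    attributes: List[AnyAttribute] = field(default_factory=list)
    # Mixed content: child elements interleaved with text.
    content: List[Union["AnyElement", str]] = field(default_factory=list)

def from_etree(node: ET.Element) -> AnyElement:
    """Convert a parsed element into the generic AnyElement structure."""
    elem = AnyElement(node.tag, [AnyAttribute(k, v) for k, v in node.attrib.items()])
    if node.text and node.text.strip():
        elem.content.append(node.text.strip())
    for child in node:
        elem.content.append(from_etree(child))
        if child.tail and child.tail.strip():
            elem.content.append(child.tail.strip())
    return elem
```

Any document that parses at all fits this structure, which mirrors the role of AnyType as the most general XML type.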
Type Tests, Type Cast and Type Switch. Type test and type cast are used to recover more
precise type information than is statically given. A type test checks whether the given
value complies with the required schema type. A type cast validates the given expression and casts it appropriately; if validation fails, an exception is thrown. To make
this more precise, let's assume that we load a document. If it complies with our bib
schema, we take one action; otherwise we take a different one. This pattern can be programmed as follows:
public static void Main(string[] args) {
  AnyElement a = AnyElement.Load(args[0]);
  if (a is Bib) {
    Bib b = (Bib) a;
    do_this
  } else {
    do_that
  }
}
A type switch combines type test and type cast. We can write the same statement as
shown in the previous example in a more compact way as follows:
public static void Main(string[] args) {
  typeswitch (AnyElement.Load(args[0])) {
    case Bib b:
      do_this
      break;
    default:
      do_that
      break;
  }
}
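In a language without schema types, the same pattern has to be emulated with explicit runtime validation. The Python sketch below illustrates the idea under stated assumptions: is_bib checks a deliberately simplified bib structure (a bib root whose book children each have a title and at least one author), and the names is_bib, as_bib and handle are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_bib(elem):
    """Type test: structural check against a simplified, assumed bib schema."""
    if elem.tag != "bib":
        return False
    for book in elem:
        if book.tag != "book":
            return False
        if book.find("title") is None or book.find("author") is None:
            return False
    return True

def as_bib(elem):
    """Type cast: validate, then return the element; raise if validation fails."""
    if not is_bib(elem):
        raise TypeError(f"<{elem.tag}> does not conform to the bib schema")
    return elem

def handle(xml_text):
    """Type switch: test and cast combined into one dispatch."""
    doc = ET.fromstring(xml_text)
    if is_bib(doc):
        bib = as_bib(doc)
        return f"bib with {len(bib)} book(s)"
    return "not a bib document"
```

Here every check happens at run time; the point of the typed approach in this paper is that, after the type test, the compiler knows the static type of b.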
Summary
C#-xml supports the XML Schema type system to a large degree. C#-xml checks the
structural constraints on types. It uses runtime checks to check range constraints on values. We haven't dealt with all XML Schema features, such as key or unique attributes
or substitution groups, but we are confident that we can deal with them when we extend
this work. Furthermore, C#-xml currently doesn't support any meaningful form of Reflection for XML types.
Except for references, C#-xml already supersedes the functionality of XQuery. (References will be added to C#-xml.) It combines full support for queries, i.e., declarative
processing, with imperative processing.
C#-xml provides high-performance and type-safe stream processing. We currently haven't integrated any of the functionality of XSLT (except XPath's functionality). This is
something that we would like to do in the future.
Summarizing, we have shown that it is possible to have XML as a first-class citizen in
modern class-based languages. Only a bridge between both type worlds is needed. Building the bridge is mainly an engineering task. But once it is available, it offers the best of
both worlds!
Acknowledgements
We thank Don Box for valuable insights that guided the direction of this work. We thank
Mike Barnett for a careful review of this paper. We thank MSR's database team, headed
by Phil Bernstein, for many helpful comments.
Bibliography
[Box, Skonnard, Lam] Don Box, Aaron Skonnard, John Lam. Essential XML, Addison
Wesley, 2000.
[Box] Don Box, House of Web Services, MSDN Magazine, Nov. 2001.
[XML] Tim Bray, Jean Paoli and C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, World Wide Web Consortium, 1998. Available at
http://www.w3.org/TR/REC-xml.
[XML Schema: Formal Description] Allen Brown, Matthew Fuchs, Jonathan Robie, Philip
Wadler, XML Schema: Formal Description, World Wide Web Consortium, September 2001.
Working draft. Available at http://www.w3.org/TR/2001/WD-xmlschema-formal-20010925/.
[XML Schema Part 1] Henry S. Thompson, David Beech, Murray Maloney and Noah
Mendelsohn, XML Schema Part 1: Structures, World Wide Web Consortium, 2001.
Available at http://www.w3.org/TR/xmlschema-1/.
[XML Schema Part 0] David C. Fallside, XML Schema Part 0: Primer, World Wide Web
Consortium, 2001. Available at http://www.w3.org/TR/xmlschema-0/.
[XML Schema Part 2] Paul V. Biron and Ashok Malhotra, XML Schema Part 2:
Datatypes, World Wide Web Consortium, 2001. Available at
http://www.w3.org/TR/xmlschema-2/.
[XPath 1.0] James Clark and Steve DeRose, XML Path Language (XPath) Version 1.0,
World Wide Web Consortium, 1999. Available at http://www.w3.org/TR/xpath.
[Xduce] Haruo Hosoya, Jerome Vouillon and Benjamin C. Pierce, "Regular Expression Types for XML", Proceedings of the International Conference on Functional Programming (ICFP), 2000.
[XMLambda] Erik Meijer and Mark Shields. Draft. XMLambda: A functional language
for constructing and manipulating XML documents, 1999. Available at
http://www.cse.ogi.edu/~mbs/pub/xmlambda/.
[XSLT] James Clark, XSL Transformations (XSLT), Version 1.0, World Wide Web
Consortium, 1999. Available at http://www.w3.org/TR/1999/REC-xslt-19991116.
[Mitchell] John C. Mitchell, Foundations for Programming Languages, MIT Press, 1996,
Cambridge, Massachusetts.
[CLI] Common Language Infrastructure. Submitted ECMA Standard. Available at
http://www.msdn.microsoft.com/net/ecma/.
[Iterators for C#] Todd Proebsting. Iterators for C#. Internal memo 2000, Microsoft confidential.
[XQuery] Henry S. Thompson, David Beech, Murray Maloney and Noah Mendelsohn,
XML Schema Part 1: Structures, World Wide Web Consortium, 2001. Available at
http://www.w3.org/TR/xmlschema-1/.
[XQuery Formal Semantics] Peter Fankhauser, Mary Fernández, Ashok Malhotra, Michael Rys, Jérôme Siméon, Philip Wadler. Working draft. Available at
http://www.w3.org/TR/2001/WD-query-semantics-20010607.
[XQuery 1.0 and XPath 2.0 Data Model] World Wide Web Consortium, XQuery 1.0 and
XPath 2.0 Data Model, Working Draft, June 2001. See http://www.w3.org/TR/query-datamodel/.
[C#] Anders Hejlsberg, Scott Wiltamuth, C# Language Specification.
Submitted ECMA Standard. Available at http://www.msdn.microsoft.com/net/ecma/.
[DOM] Arnaud Le Hors, Philippe Le Hégaret, Lauren Wood, Gavin Nicol, Jonathan Robie, Mike Champion, Document Object Model (DOM) Level 2 Core Specification, Version 1.0.
W3C Recommendation, 13 November 2000. Available at
http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113.
[Infoset] John Cowan, Richard Tobin, The XML Information Set, W3C Recommendation,
24 October 2001. Available at http://www.w3.org/TR/2001/REC-xml-infoset-20011024.