XML Info Set
1. Introduction
This specification defines an abstract data set called the XML Information
Set (Infoset). Its purpose is to provide a consistent set of definitions for use in
other specifications that need to refer to the information in a well-formed XML
document [XML].
The XML Information Set does not require or favor a specific interface or class
of interfaces. This specification presents the information set as a modified tree
for the sake of clarity and simplicity, but there is no requirement that the XML
Information Set be made available through a tree structure; other types of
interfaces, including (but not limited to) event-based and query-based
interfaces, are also capable of providing information conforming to the XML
Information Set.
The terms "information set" and "information item" are similar in meaning to
the generic terms "tree" and "node", as they are used in computing. However,
the former terms are used in this specification to reduce possible confusion
with other specific data models. Information items do not map one-to-one with
the nodes of the DOM or the "tree" and "nodes" of the XPath data model.
In this specification, the words "must", "should", and "may" assume the
meanings specified in [RFC2119], except that the words do not appear in
uppercase.
XML Versions
Namespaces
Entities
An information set describes its XML document with entity references already
expanded, that is, represented by the information items corresponding to their
replacement text. However, there are various circumstances in which a
processor may not perform this expansion. An entity may not be declared, or
may not be retrievable. A non-validating processor may choose not to read all
declarations, and even if it does, may not expand all external entities. In these
cases an unexpanded entity reference information item is used to represent
the entity reference.
End-of-Line Handling
The values of all properties in the Infoset take account of the end-of-line
normalization described in [XML], 2.11 "End-of-Line Handling".
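As a small illustration of this normalization (a sketch only, using Python's standard xml.etree.ElementTree parser as one example of an XML processor; the element name and text are made up for the example), both a CR LF pair and a lone CR in the input are seen by the application as a single LF:

```python
import xml.etree.ElementTree as ET

# Raw bytes containing a CR LF pair and a lone CR inside character data.
raw = b"<note>line one\r\nline two\rline three</note>"

# The parser applies end-of-line normalization before the application
# sees the text, so only LF characters remain.
text = ET.fromstring(raw).text
print(repr(text))
```

Property values built from the parser's output therefore never need a second normalization pass.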
Base URIs
The value of these properties does not reflect any URI escaping that may be
required for retrieval of the resource, but it may include escaped characters if
these were specified in the document, or returned by a server in the case of
redirection.
In some cases (such as a document read from a string or a pipe) the rules
in [XML Base] may result in a base URI being application dependent. In these
cases this specification does not define the value of the [base URI] or
[declaration base URI] property.
When resolving relative URIs the [base URI] property should be used in
preference to the values of xml:base attributes; they may be inconsistent in
the case of Synthetic Infosets.
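A minimal sketch of resolving a relative URI against a [base URI] value, assuming the property value is already available to the application as a string. The function name and the example URIs are hypothetical; urljoin is the standard library's general-purpose URI resolver, not something defined by this specification:

```python
from urllib.parse import urljoin

def resolve(base_uri: str, reference: str) -> str:
    """Resolve a relative reference against an item's [base URI] value."""
    return urljoin(base_uri, reference)

# Hypothetical [base URI] of an element and a relative reference found in it.
resolved = resolve("http://example.org/docs/book.xml", "images/cover.png")
print(resolved)
```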
Some properties may sometimes have the value unknown or no value, and it
is said that a property value is unknown or that a property has no value
respectively. These values are distinct from each other and from all other
values. In particular they are distinct from the empty string, the empty set, and
the empty list, each of which simply has no members. This specification does
not use the term null since in some communities it has particular connotations
which may not match those intended here.
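One possible way for an implementation to keep these two values distinct from each other and from empty values is a pair of sentinel objects. This is only a sketch; the names Special, UNKNOWN, NO_VALUE and describe are illustrative assumptions and are not defined by this specification:

```python
from enum import Enum

class Special(Enum):
    """Sentinels for the two special property values (names are illustrative)."""
    UNKNOWN = "unknown"
    NO_VALUE = "no value"

UNKNOWN = Special.UNKNOWN
NO_VALUE = Special.NO_VALUE

def describe(value):
    """Report which kind of property value this is."""
    if value is UNKNOWN:
        return "unknown"
    if value is NO_VALUE:
        return "no value"
    # Empty strings, sets and lists are ordinary values with no members.
    return f"value: {value!r}"

print(describe(NO_VALUE))
print(describe(""))
```

Using dedicated sentinels rather than the host language's null keeps the distinction between unknown and no value explicit.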
Synthetic Infosets
This specification describes the information set resulting from parsing an XML
document. Information sets may be constructed by other means, for example
by use of an API such as the DOM or by transforming an existing information
set.
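As an illustration of constructing an information set without parsing, the following sketch builds a small document through Python's xml.dom.minidom, used here as one example of a DOM implementation. The element and attribute names are chosen for the example only:

```python
from xml.dom.minidom import Document

# Build the document through the DOM API instead of parsing text.
doc = Document()
message = doc.createElement("message")
message.setAttribute("date", "19990421")
message.appendChild(doc.createTextNode("Phone home!"))
doc.appendChild(message)

# Serializing shows the document that this synthetic structure describes.
serialized = message.toxml()
print(serialized)
```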
2. Information Items
An information set can contain up to eleven different types of information item,
as explained in the following sections. Every information item has properties.
For ease of reference, each property is given a name, indicated [thus]. Links
to a definition and/or syntax in the XML 1.0 Recommendation [XML] are given
for each information item.
There is exactly one document information item in the information set, and
all other information items are accessible from the properties of the document
information item, either directly or indirectly through the properties of other
information items.
XML Syntax: [41] Attribute (Section 3.1, Start-Tags, End-Tags, and Empty-
Element Tags)
Attributes declared in the DTD with no default value and not specified in the
element's start tag are not represented by attribute information items.
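The following sketch shows this behavior with Python's xml.etree.ElementTree, assuming its underlying parser reads the internal DTD subset and applies attribute defaults declared there. The element and attribute names are hypothetical. The attribute with a declared default is reported, while the #IMPLIED attribute that was not specified is absent:

```python
import xml.etree.ElementTree as ET

# "status" has a default value; "owner" is declared with no default.
doc = """<!DOCTYPE item [
  <!ATTLIST item status CDATA "draft"
                 owner  CDATA #IMPLIED>
]>
<item/>"""

attrs = dict(ET.fromstring(doc).attrib)
print(attrs)
```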
There is a character information item for each data character that appears
in the document, whether literally, as a character reference, or within a
CDATA section.
1. [character code] The ISO 10646 character code (in the range 0 to
#x10FFFF, though not every value in this range is a legal XML
character code) of the character.
2. [element content whitespace] A boolean indicating whether the
character is white space appearing within element content (see [XML],
2.10 "White Space Handling"). Note that validating XML processors
are required to provide this information. If there is no declaration for the
containing element, or there are multiple declarations, this property has
no value for white space characters. If no declaration has been read,
but the [all declarations processed] property of the document
information item is false (so there may be an unread declaration), then
the value of this property is unknown for white space characters. It is
always false for characters that are not white space.
3. [parent] The element information item which contains this information
item in its [children] property.
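A rough sketch of how an application might model these three properties, assuming a document with no DTD (so white space characters get no value for [element content whitespace]). The CharacterItem class, the NO_VALUE placeholder and the character_items helper are hypothetical names, and only the text before an element's first child is considered:

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

NO_VALUE = "no value"  # illustrative placeholder, not defined by the specification

@dataclass
class CharacterItem:
    character_code: int                 # [character code]
    element_content_whitespace: object  # False, or NO_VALUE for white space here
    parent: ET.Element                  # [parent]

def character_items(element):
    """One item per data character in the element's leading text."""
    items = []
    for ch in element.text or "":
        is_space = ch in " \t\r\n"
        # No declaration exists for the element, so white space has no value;
        # the property is always false for other characters.
        items.append(CharacterItem(ord(ch), NO_VALUE if is_space else False, element))
    return items

# A literal character, a character reference, a CDATA section, and a space.
root = ET.fromstring("<a>A&#66;<![CDATA[C]]> </a>")
items = character_items(root)
codes = [item.character_code for item in items]
print(codes)
```

The literal character, the character reference and the CDATA content all yield ordinary character items, which matches the definition above.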
There is a comment information item for each XML comment in the original
document, except for those appearing in the DTD (which are not represented).
XML Syntax: [28] doctypedecl (section 2.8, Prolog and Document Type
Declaration)
If the XML document has a document type declaration, then the information
set contains a single document type declaration information item. Note
that entities and notations are provided as properties of the document
information item, not the document type declaration information item.
There is a notation information item for each notation declared in the DTD.
Each element in the document has a namespace information item for each
namespace that is in scope for that element.
1. [prefix] The prefix whose binding this item describes. Syntactically, this
is the part of the attribute name following the xmlns: prefix. If the
attribute name is simply xmlns, so that the declaration is of the default
namespace, this property has no value.
2. [namespace name] The namespace name to which the prefix is bound.
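The two properties can be illustrated by collecting the namespace declarations that a parser reports. This is a sketch using the "start-ns" events of Python's xml.etree.ElementTree.iterparse; the declared_namespaces helper is a hypothetical name, and None stands in for "no value" here. It lists declarations only, so implicitly bound prefixes such as xml are not shown:

```python
import io
import xml.etree.ElementTree as ET

doc = b"""<msg:message doc:date="19990421"
    xmlns:doc="http://doc.example.org/namespaces/doc"
    xmlns:msg="http://message.example.org/"
>Phone home!</msg:message>"""

def declared_namespaces(data):
    """Map each declared prefix to its namespace name."""
    found = {}
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(data), events=("start-ns",)):
        # A default namespace declaration has no prefix; record it as None.
        found[prefix or None] = uri
    return found

bindings = declared_namespaces(doc)
print(bindings)
```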
3. Conformance
Since the purpose of the Information Set is to provide a set of definitions,
conformance is a property of specifications that use those definitions, rather
than of implementations.
Appendix A. References
Normative References
ISO/IEC 10646
ISO (International Organization for Standardization). ISO/IEC 10646-
1:2000. Information technology — Universal Multiple-Octet Coded
Character Set (UCS) — Part 1: Architecture and Basic Multilingual
Plane and ISO/IEC 10646-2:2001. Information technology — Universal
Multiple-Octet Coded Character Set (UCS) — Part 2: Supplementary
Planes, as, from time to time, amended, replaced by a new edition or
expanded by the addition of new parts. [Geneva]: International
Organization for Standardization. (See http://www.iso.ch for the latest
version.)
Namespaces
Namespaces in XML, W3C, eds. Tim Bray, Dave Hollander, Andrew
Layman. 14 January 1999. Available at
http://www.w3.org/TR/REC-xml-names.
Namespaces 1.1
Namespaces in XML 1.1, W3C, eds. Tim Bray, Dave Hollander, Andrew
Layman, Richard Tobin. 4 February 2004. Available
at http://www.w3.org/TR/xml-names11.
RFC2119
Key words for use in RFCs to Indicate Requirement Levels, ed. S.
Bradner. March 1997. Available at http://www.ietf.org/rfc/rfc2119.txt.
XML
Extensible Markup Language (XML) 1.0 (Third Edition), W3C, eds. Tim
Bray, Jean Paoli, C.M. Sperberg-McQueen, Eve Maler, François
Yergeau. 4 February 2004. Available at http://www.w3.org/TR/REC-xml.
XML 1.1
Extensible Markup Language (XML) 1.1, W3C, eds. Tim Bray, Jean
Paoli, C.M. Sperberg-McQueen, Eve Maler, John Cowan, François
Yergeau. 4 February 2004. Available at http://www.w3.org/TR/xml11.
XML Base
XML Base, W3C, ed. Jonathan Marsh. February 2000. Available
at http://www.w3.org/TR/xmlbase.
Informative References
DOM
Document Object Model (DOM) Level 1 Specification, W3C, eds. Vidur
Apparao, Steve Byrne, Mike Champion, et al. 1 October 1998. Available
at http://www.w3.org/TR/REC-DOM-Level-1.
XPointer-Liaison
XPointer-Information Set Liaison Statement, W3C, ed. Steven J.
DeRose. 24 February 1999. Available at
http://www.w3.org/TR/NOTE-xptr-infoset-liaison.
Relative Namespace URI References
Results of W3C XML Plenary Ballot on relative URI References in
namespace declarations, 3-17 July 2000, W3C, eds. Dave Hollander, C.
M. Sperberg-McQueen. 6 September 2000. Available
at http://www.w3.org/2000/09/xppa.
RDF Schema for the XML Information Set
RDF Schema for the XML Information Set, W3C, ed. Richard Tobin. 6
April 2001. Available at http://www.w3.org/TR/xml-infoset-rdfs.
The reporting requirements include errors, which are outside the scope of this
specification, and document information. All of the XML requirements for
document information reporting have been integrated into the XML
Information Set; numbers in parentheses refer to sections of the XML
Recommendation:
1. An XML processor must always provide all characters in a document
that are not part of markup to the application (2.10).
2. A validating XML processor must inform the application which of the
character data in a document is white space appearing within element
content (2.10).
3. An XML processor must normalize line-ends to LF before passing them
to the application (2.11).
4. An XML processor must normalize the value of attributes according to
the rules in clause 3.3.3 before passing them to the application.
5. An XML processor must pass the names and external identifiers
(system identifiers, public identifiers or both) of declared notations to the
application (4.7).
6. When the name of an unparsed entity appears as the explicit or default
value of an ENTITY or ENTITIES attribute, an XML processor must
provide the names, system identifiers, and (if present) public identifiers
of both the entity and its notation to the application (4.6, 4.7).
7. An XML processor must pass processing instructions to the application
(2.6).
8. An XML processor (necessarily a non-validating one) that does not
include the replacement text of an external parsed entity in place of an
entity reference must notify the application that it recognized but did not
read the entity (4.4.3).
9. A validating XML processor must include the replacement text of an
entity in place of an entity reference (5.2).
10. An XML processor must supply the default value of attributes
declared in the DTD for a given element type but not appearing in the
element's start tag (3.3.2).
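Requirement 4 can be illustrated with a short sketch, using Python's xml.etree.ElementTree as one example of an XML processor; the element and attribute names are made up. Literal line breaks and tabs in an attribute value reach the application as spaces, while a character reference to a line feed is passed through as that character:

```python
import xml.etree.ElementTree as ET

# "title" contains a literal newline and tab; "note" uses a character reference.
elem = ET.fromstring('<p title="first\nsecond\tthird" note="a&#10;b"/>')

title = elem.get("title")
note = elem.get("note")
print(repr(title))
print(repr(note))
```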
<msg:message doc:date="19990421"
xmlns:doc="http://doc.example.org/namespaces/doc"
xmlns:msg="http://message.example.org/"
>Phone home!</msg:message>
The information set for this XML document contains the following information
items: