Unit 5-4
The XML Recommendation states that an XML document has both logical and physical
structure. Physically, it is composed of storage units called entities, each of which may refer to
other entities, similar to the way that #include works in the C language. Logically, an XML
document consists of declarations, elements, comments, character references, and processing
instructions, collectively known as the markup.
NOTE
Although throughout this book we refer to an "XML document," it is crucial to understand that
XML may not exist as a physical file on disk. XML is sometimes used to convey messages
between applications, such as from a Web server to a client. The XML content may be
generated on the fly, for example by a Java application that accesses a database. It may be
formed by combining pieces of several files, possibly mixed with output from a program.
However, in all cases, the basic structure and syntax of XML are invariant.
1. An XML declaration (which is technically optional)
2. A document type declaration that refers to a DTD (which is optional, but required if you
want validation)
Collectively, the XML declaration and the document type declaration are called the XML prolog.
XML Declaration
The XML declaration is a piece of markup (which may span multiple lines of a file) that identifies
this as an XML document. The declaration also indicates whether the document can be
validated by referring to an external Document Type Definition (DTD). DTDs are the subject of
chapter 4; for now, just think of a DTD as a set of rules that describes the structure of an XML
document.
XML is case-sensitive (more about this in the next subsection), so it's important that you use
lowercase for xml and version. The quotes around the value of the version attribute are
required, as are the ? characters. At the time of this writing, "1.0" is the only acceptable value
for the version attribute, but this is certain to change when a subsequent version of the XML
specification appears.
NOTE
Do not include a space before the string xml or between the question mark and the angle
brackets. The strings <?xml and ?> must appear exactly as indicated. The space before the ?> is
optional. No blank lines or space may precede the XML declaration; adding white space here
can produce strange error messages.
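The placement rules in this note are easy to confirm with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample documents are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents; only the text before the root element differs.
good = '<?xml version="1.0" ?><Employees/>'
leading_space = ' <?xml version="1.0" ?><Employees/>'
leading_blank_line = '\n<?xml version="1.0" ?><Employees/>'
uppercase_name = '<?XML version="1.0" ?><Employees/>'

for label, document in [
    ("good", good),
    ("leading space", leading_space),
    ("leading blank line", leading_blank_line),
    ("uppercase XML", uppercase_name),
]:
    print(label, is_well_formed(document))
```

Only the first document should be accepted. The exact error text differs from parser to parser, which is why the sketch reports only success or failure.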
In most cases, this XML declaration is present. If so, it must be the very first line of the
document and must not have leading white space. This declaration is technically optional; cases
where it may be omitted include when combining XML storage units to create a larger,
composite document.
The formal definition of an XML declaration, according to the XML 1.0 specification, is
as follows:
XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
This Extended Backus-Naur Form (EBNF) notation, characteristic of many W3C specifications,
means that an XML declaration consists of the literal sequence '<?xml', followed by the
required version information, followed by optional encoding and standalone declarations,
followed by an optional amount of white space, and terminating with the literal sequence '?>'.
In this notation, a question mark not contained in quotes means that the term that precedes it
is optional.
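One way to get a feel for this grammar is to transcribe it into a regular expression. The sketch below does so in Python, under the assumption that version numbers have the form 1.x and that encoding names are made of ASCII letters, digits, periods, underscores, and hyphens. The names XML_DECL, quoted, and is_xml_declaration are hypothetical, and a real parser, not this pattern, remains the authority on what is legal.

```python
import re

S = r"[ \t\r\n]+"               # white space: one or more characters
EQ = r"[ \t\r\n]*=[ \t\r\n]*"   # equals sign with optional white space

def quoted(inner, group):
    """Pattern for a value wrapped in matching single or double quotes."""
    return "(?P<%s>[\"'])(?:%s)(?P=%s)" % (group, inner, group)

XML_DECL = re.compile(
    r"<\?xml"
    + S + "version" + EQ + quoted(r"1\.[0-9]+", "q1")                     # required
    + "(?:" + S + "encoding" + EQ
    + quoted(r"[A-Za-z][A-Za-z0-9._-]*", "q2") + ")?"                     # optional
    + "(?:" + S + "standalone" + EQ + quoted("yes|no", "q3") + ")?"       # optional
    + r"[ \t\r\n]*\?>"                                                    # optional space
)

def is_xml_declaration(text):
    """Return True if text, taken as a whole, fits the pattern above."""
    return XML_DECL.fullmatch(text) is not None

print(is_xml_declaration('<?xml version="1.0" standalone="no" ?>'))
print(is_xml_declaration('<?xml standalone="no" version="1.0" ?>'))
```

Because the optional parts appear in the pattern in a fixed sequence, a declaration that lists standalone before version does not match.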
The following declaration means that there is an external DTD on which this document
depends. See the next subsection for the DTD that this negative standalone value implies.
<?xml version="1.0" standalone="no" ?>
On the other hand, if your XML document has no associated DTD, the correct XML declaration
is:
<?xml version="1.0" standalone="yes" ?>
The XML 1.0 Recommendation states: "If there are external markup declarations but there is no
standalone document declaration, the value 'no' is assumed."
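To see how a parser reports these values, the following sketch uses the XmlDeclHandler callback of Python's standard xml.parsers.expat module. The function name read_xml_declaration and the sample documents are assumptions for illustration. Expat reports standalone as 1 for "yes", 0 for "no", and -1 when the clause is omitted, so applying the default described above is left to the application.

```python
import xml.parsers.expat

def read_xml_declaration(document):
    """Return (version, encoding, standalone) as reported by expat, or None."""
    found = {}

    def on_declaration(version, encoding, standalone):
        found["declaration"] = (version, encoding, standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)   # True marks the end of the input
    return found.get("declaration")

# Hypothetical documents, passed as bytes just as they would arrive from a file.
print(read_xml_declaration(b'<?xml version="1.0" standalone="no" ?><Employees/>'))
print(read_xml_declaration(b'<?xml version="1.0" ?><Employees/>'))
```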
The optional encoding part of the declaration tells the XML processor (parser) how to interpret
the bytes based on a particular character set. The default encoding is UTF-8, which is one of
seven character-encoding schemes defined by the Unicode standard, the character set that Java
also uses. In UTF-8, one byte represents each ASCII character, and two to four bytes represent
every other character. UTF-8 is therefore an efficient form of Unicode for ASCII-based
documents. In fact, UTF-8 is a superset of ASCII.
For Asian languages, however, an encoding of UTF-16 is often more appropriate because most of
their characters take two bytes in UTF-16 but three in UTF-8. It is also possible to specify an ISO
character encoding, such as
in the following example, which refers to ASCII plus Greek characters. Note, however, that
some XML processors may not handle ISO character sets correctly since the
specification requires only that they handle UTF-8 and UTF-16.
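The size trade-off between the two required encodings can be checked directly. This is a small sketch in Python; the function name encoded_lengths and the four sample characters are assumptions chosen for illustration.

```python
def encoded_lengths(character):
    """Return the number of bytes the character needs in UTF-8 and in UTF-16."""
    # "utf-16-be" is used so that no byte order mark is added to the count.
    return len(character.encode("utf-8")), len(character.encode("utf-16-be"))

samples = {
    "Latin letter A": "A",
    "Greek alpha": "\u03b1",
    "CJK ideograph": "\u65e5",
    "emoji": "\U0001F600",
}

for label, character in samples.items():
    utf8, utf16 = encoded_lengths(character)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

For ASCII text UTF-8 is half the size of UTF-16, while for the CJK ideograph the relationship is reversed. Characters outside the Basic Multilingual Plane, such as the emoji, take four bytes in both.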
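Yes, it is. Single and double quotes can be used
interchangeably, provided they are of matching kind around any particular attribute value.
Note that although attribute order does not matter on ordinary elements, within the XML
declaration version must come first, followed by encoding and then standalone, as the
formal definition above requires.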
(Although there is no good reason in this example to use double quotes for version and single
quotes for the other, you may need to do so if the attribute value already contains the kind of
quotes you prefer.) Finally, the lack of a blank space between 'no' and ?> is not a problem.
Neither of the following XML declarations is valid.
The first is invalid because these particular attribute names must be lowercase, as must "xml".
The problem with the second declaration is that the value of the standalone attribute must be
literally "yes" or "no", not "No". (Do I dare call this a "no No"?)
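The document type declaration has the general form
<!DOCTYPE RootElement (SYSTEM | PUBLIC) ExternalDeclarations? [InternalDeclarations]? >
where <!DOCTYPE is a literal string and RootElement is whatever you name the outermost element
of your hierarchy, followed by either the literal keyword SYSTEM or PUBLIC. The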
optional ExternalDeclarations portion is typically the relative path or URL to the DTD that
describes your document type. (It is really only optional if the entire DTD appears as
an InternalDeclaration, which is neither likely nor desirable.) If there are InternalDeclarations,
they must be enclosed in square brackets. In general, you'll encounter far more cases
with ExternalDeclarations than InternalDeclarations, so let's ignore the latter for now. They
constitute the internal subset, which is described in chapter 4.
Let's start with a simple but common case. In this example, we are indicating that the DTD and
the XML document reside in the same directory (i.e., the ExternalDeclarations are contained in
the file employees.dtd) and that the root element is Employees:
<!DOCTYPE Employees SYSTEM "employees.dtd">
Similarly,
<!DOCTYPE PriceList SYSTEM "prices.dtd">
indicates a root element PriceList and a DTD in the local file prices.dtd.
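A program can read the same information back out of a document. The sketch below uses the StartDoctypeDeclHandler callback of Python's standard xml.parsers.expat module; the function name read_doctype and the one-line sample documents are assumptions for illustration. Expat does not fetch the DTD here. It only reports what the declaration says.

```python
import xml.parsers.expat

def read_doctype(document):
    """Return (root name, system identifier, public identifier), or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id, public_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)   # True marks the end of the input
    return found[0] if found else None

# Hypothetical documents built from the two declarations discussed above.
print(read_doctype(b'<!DOCTYPE Employees SYSTEM "employees.dtd"><Employees/>'))
print(read_doctype(b'<!DOCTYPE PriceList SYSTEM "prices.dtd"><PriceList/>'))
```

With a SYSTEM identifier the public identifier comes back as None.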
In the next example, we use normal directory path syntax to indicate a different location for the
DTD.
As is often the case, we might want to specify a URL for the DTD since the XML file may not
even be on the same host as the DTD. This case also applies when you are using an XML
document for message passing or data transmission across servers and still want the validation
by referencing a common DTD.
<!DOCTYPE Employees SYSTEM
"http://somewhere.com/dtds/employees.dtd">
Next, we have the case of the PUBLIC identifier. This is used in formal environments to declare
that a given DTD is available to the public for shared use. Recall that XML's true power as a
syntax relates to developing languages that permit exchange of structured data between
applications and across company boundaries. The syntax is a little different:
<!DOCTYPE RootElement PUBLIC "PublicID" "URI">
The new aspect here is the notion of a PublicID, which is a slightly involved formatted string
that identifies the source of the DTD whose path follows as the URI. This is sometimes known as
the Formal Public Identifier (FPI).
For example, I was part of a team that developed (Astronomical) Instrument Markup Language
(AIML, IML) for NASA Goddard Space Flight Center. We wanted our DTD to be available to
other astronomers. Our document type declaration (with a root element named Instrument)
was:
<!DOCTYPE Instrument PUBLIC "-//NASA//Instrument Markup Language 0.2//EN"
"http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">
Let's decompose the PublicID. The leading hyphen indicates that NASA is not a standards body.
If it were, a plus sign would replace the hyphen, except if the standards body were ISO, in which
case the string "ISO" would appear. Next we have the name of the organization responsible for
the DTD (NASA, in this case), surrounded with double slashes, then a short free-text description
of the DTD ("Instrument Markup Language 0.2"), double slashes, and a two-character language
identifier ("EN" for English, in this case).
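The decomposition just described can be expressed as a few lines of Python. This is a sketch only: the names FormalPublicIdentifier and parse_fpi are hypothetical, and the code handles only the four-part form that begins with a hyphen or plus sign, not identifiers owned by ISO.

```python
from typing import NamedTuple

class FormalPublicIdentifier(NamedTuple):
    registered: bool   # True for a leading "+", False for a leading "-"
    owner: str         # organization responsible for the DTD
    description: str   # short free-text description
    language: str      # two-character language identifier

def parse_fpi(public_id):
    """Split a four-part public identifier on its double slashes."""
    parts = public_id.split("//")
    if len(parts) != 4 or parts[0] not in ("+", "-"):
        raise ValueError("unsupported public identifier: %r" % public_id)
    return FormalPublicIdentifier(parts[0] == "+", parts[1], parts[2], parts[3])

print(parse_fpi("-//NASA//Instrument Markup Language 0.2//EN"))
```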
Since the XML prolog is the combination of the XML declaration and the document type
declaration, for our NASA example the complete prolog is:
<?xml version="1.0" standalone="no" ?>
<!DOCTYPE Instrument PUBLIC "-//NASA//Instrument Markup Language 0.2//EN"
"http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">
As another example, let's consider a common case involving DTDs from the W3C, such as those
for XHTML 1.0.
DTD-related keywords such as DOCTYPE, PUBLIC, and SYSTEM must be uppercase. XML-related
attribute names such as version, encoding, and standalone must be lowercase.
Document Body
The document body, or instance, is the bulk of the information content of the document.
Whereas across multiple instances of a document of a given type (as identified by
the DOCTYPE) the XML prolog will remain constant, the document body changes with each
document instance (in general). This is because the prolog defines (either directly or indirectly)
the overall structure while the body contains the real instance-specific data. Comparing this to
data structures in computer languages, the DTD referenced in the prolog is analogous to
a struct in the C language or a class definition in Java, and the document body is analogous to a
runtime instance of the struct or class.
Because the document type declaration specifies the root element, this must be the first
element the parser encounters. If any other element but the one identified by
the DOCTYPE line appears first, the document is immediately invalid.
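A non-validating parser will not necessarily complain about such a mismatch, so an application may want to check for it. The following is a minimal sketch using Python's standard xml.parsers.expat module; the function name root_matches_doctype and the sample documents are assumptions for illustration.

```python
import xml.parsers.expat

def root_matches_doctype(document):
    """Return True if the first element has the name given in the DOCTYPE."""
    names = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        names["doctype"] = name

    def on_start(name, attributes):
        names.setdefault("root", name)   # only the first start tag is recorded

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(document, True)         # True marks the end of the input
    return "doctype" in names and names["doctype"] == names.get("root")

# Hypothetical documents: the second one starts with the wrong element.
print(root_matches_doctype(b'<!DOCTYPE Employees SYSTEM "employees.dtd"><Employees/>'))
print(root_matches_doctype(b'<!DOCTYPE Employees SYSTEM "employees.dtd"><PriceList/>'))
```

Both documents are well-formed, so expat parses each without error. Only the comparison reveals that the second cannot be valid.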
Listing 3-1 shows a very simple XHTML 1.0 document. The DOCTYPE is "html" (not "xhtml"), so
the document body begins with <html ....> and ends with </html>.
Listing 3-1 Simple XHTML 1.0 Document with XML Prolog and Document Body
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>XHTML 1.0</title>
</head>
<body>
<h1>Simple XHTML 1.0 Example</h1>
<p>See the <a href=
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">DTD</a>.</p>
</body>
</html>
<Address>
<Street>123 Milky Way</Street>
<City>Columbia</City>
<State>MD</State>
<Zip>20777</Zip>
</Address>
The character data comprises the four strings "123 Milky Way", "Columbia", "MD", and
"20777"; the markup comprises the start and end tags for the five
elements Address, Street, City, State, and Zip. Note that this is similar, but not identical, to what
we previously called content. For example, although each chunk of character data is the
content of a particular element, the content of the Address element is all of the child
elements. We can think of all the character data belonging to both the element that directly
contains it and indirectly to Address. (In fact, in some XML applications such as XSLT, if we ask
for the text content of Address, we'll get the concatenation of all the individual strings.)
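The distinction between the text an element contains directly and the text it contains indirectly can be sketched with Python's standard xml.etree.ElementTree module. The variable names below are assumptions for illustration; the document is the Address example from the text.

```python
import xml.etree.ElementTree as ET

ADDRESS = """<Address>
<Street>123 Milky Way</Street>
<City>Columbia</City>
<State>MD</State>
<Zip>20777</Zip>
</Address>"""

address = ET.fromstring(ADDRESS)

# Character data held directly by each child element.
own_strings = {child.tag: child.text for child in address}

# All character data below Address, concatenated in document order.
# The newlines between the child elements are character data too.
all_text = "".join(address.itertext())

print(own_strings)
print(repr(all_text))
```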
The markup itself can be divided into a number of categories, as per section 2.4 of the XML 1.0
specification.
start tags and end tags (e.g., <Address> and </Address> )
empty-element tags (e.g., <Divider/> )
entity references (e.g., &footer; or %otherDTD; )
character references (e.g., &#60; or &#62; )
comments (e.g., <!-- whatever --> )
CDATA section delimiters (e.g., <![CDATA[ insert code here ]]> )
document type declarations (e.g., <!DOCTYPE ....> )
processing instructions (e.g., <?myJavaApp numEmployees="25" location="Columbia" .... ?> )
XML declarations (e.g., <?xml version=.... ?> )
text declarations (e.g., <?xml encoding=.... ?> )
any white space at the top level (before or after the root element)
We will discuss each of these markup aspects in either this chapter or the next. Note that for all
types of markup, there are some delimiters, most but not all of which involve angle brackets.
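Several of these categories can be observed as separate events while a document is parsed. The sketch below registers a handful of callbacks from Python's standard xml.parsers.expat module; the function name markup_events and the sample document are assumptions for illustration, and only some of the categories in the list are covered.

```python
import xml.parsers.expat

def markup_events(document):
    """Return a list of (kind, detail) pairs for the markup expat reports."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.append(("start tag", name))
    parser.EndElementHandler = lambda name: events.append(("end tag", name))
    parser.CommentHandler = lambda data: events.append(("comment", data))
    parser.ProcessingInstructionHandler = (
        lambda target, data: events.append(("processing instruction", target))
    )
    parser.StartCdataSectionHandler = lambda: events.append(("CDATA section", ""))
    parser.Parse(document, True)   # True marks the end of the input
    return events

# Hypothetical document that mixes several kinds of markup.
sample = (
    b'<Address><!-- whatever --><?myJavaApp location="Columbia"?>'
    b'<Divider/><![CDATA[a < b]]></Address>'
)

for kind, detail in markup_events(sample):
    print(kind, repr(detail))
```

Note that the empty-element tag is reported as a start tag followed immediately by an end tag.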
The specification states that all text that is not markup constitutes the character data of the
document. In other words, if you stripped all markup from the document, the remaining
content would be the character data. Consider this example:
The Kenster Silly Little Cowgirl Hi, there. How is your gardening going?
In general this is essentially the text between the start and end tags, which we previously called
the content of the element, but there is a subtlety related to parsing. Depending on parser
details, the newlines after </From> and </To> might be replaced by single spaces, as shown.
Alternatively, the newlines might be preserved.
Parsing is the process of splitting up a stream of information into its constituent pieces (often
called tokens). In the context of XML, parsing refers to scanning an XML document (which need
not be a physical file—it can be a data stream) in order to split it into its various markup and
character data, and more specifically, into elements and their attributes. XML parsing reveals
the structure of the information since the nesting of elements implies a hierarchy. It is possible
for an XML document to fail to parse completely if it does not follow the well-formedness
rules described in the XML 1.0 Recommendation. A successfully parsed XML document may be
either well-formed (at a minimum) or valid, as discussed in detail later in this chapter and the
next.
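As a rough illustration of parsing revealing structure, the following sketch walks a parsed document and records each element with its nesting depth. It uses Python's standard xml.etree.ElementTree module; the function name outline and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def outline(document):
    """Return (depth, tag) pairs in document order, or None if parsing fails."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return None   # the document is not well-formed

    result = []

    def walk(element, depth):
        result.append((depth, element.tag))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return result

well_formed = "<Address><Street>123 Milky Way</Street><City>Columbia</City></Address>"
missing_end_tag = "<Address><Street>123 Milky Way</Address>"

print(outline(well_formed))
print(outline(missing_end_tag))
```

The second document omits the end tag for Street, so the parser gives up and no structure is available at all.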
There is a subtlety about processing character data. During the parsing process, if there is
markup that contains entity references, the markup will be converted into character data. A
typical example from XHTML would be:
After the parser substitutes for the entities, the resultant character data is:
Once the parser has substituted for entity references and special characters, what remains is
parsed character data, which is referred to as #PCDATA in DTDs and always
refers to textual content of elements. Character data that is not parsed is called CDATA in DTDs;
this relates exclusively to attribute values.
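The substitution step can be observed with any parser. The sketch below uses Python's standard xml.etree.ElementTree module and sticks to the predefined entities and a numeric character reference, because named XHTML entities would require the XHTML DTD. The function name parsed_text and the sample strings are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def parsed_text(document):
    """Return the character data of the root element after parsing."""
    return ET.fromstring(document).text

# Entity and character references are replaced by the characters they stand for.
with_references = "<p>AT&amp;T &lt;b&gt; &#169; 2001</p>"

# Inside a CDATA section the same characters can be written literally.
with_cdata = "<p><![CDATA[AT&T <b>]]></p>"

print(parsed_text(with_references))
print(parsed_text(with_cdata))
```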