
Unit 5-4

An XML document has both logical and physical structures, consisting of entities and markup elements like declarations, elements, comments, and processing instructions. The document is structured into three parts: an XML declaration, a document type declaration, and the document body, with the prolog combining the first two. The document body contains the actual data, while the markup defines the structure, and parsing involves breaking down the document into its components.

XML Document Structure

The XML Recommendation states that an XML document has both logical and physical
structure. Physically, it is composed of storage units called entities, each of which may refer to
other entities, similar to the way that #include works in the C language. Logically, an XML
document consists of declarations, elements, comments, character references, and processing
instructions, collectively known as the markup.
NOTE

Although throughout this book we refer to an "XML document," it is crucial to understand that
XML may not exist as a physical file on disk. XML is sometimes used to convey messages
between applications, such as from a Web server to a client. The XML content may be
generated on the fly, for example by a Java application that accesses a database. It may be
formed by combining pieces of several files, possibly mixed with output from a program.
However, in all cases, the basic structure and syntax of XML is invariant.

An XML document consists of three parts, in the order given:

1. An XML declaration (which is technically optional, but recommended in most normal
cases)

2. A document type declaration that refers to a DTD (which is optional, but required if you
want validation)

3. A body or document instance (which is required)

Collectively, the XML declaration and the document type declaration are called the XML prolog.

XML Declaration
The XML declaration is a piece of markup (which may span multiple lines of a file) that identifies
this as an XML document. The declaration also indicates whether the document can be
validated by referring to an external Document Type Definition (DTD). DTDs are the subject of
chapter 4; for now, just think of a DTD as a set of rules that describes the structure of an XML
document.

The minimal XML declaration is:


<?xml version="1.0" ?>

XML is case-sensitive (more about this in the next subsection), so it's important that you use
lowercase for xml and version. The quotes around the value of the version attribute are
required, as are the ? characters. At the time of this writing, "1.0" is the only acceptable value
for the version attribute, but this is certain to change when a subsequent version of the XML
specification appears.
NOTE

Do not include a space before the string xml or between the question mark and the angle
brackets. The strings <?xml and ?> must appear exactly as indicated. The space before the ?> is
optional. No blank lines or space may precede the XML declaration; adding white space here
can produce strange error messages.
In most cases, this XML declaration is present. If so, it must be the very first line of the
document and must not have leading white space. This declaration is technically optional; cases
where it may be omitted include when combining XML storage units to create a larger,
composite document.
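
As a quick check of the "very first line" rule, the sketch below feeds three small strings to Python's standard xml.etree.ElementTree parser. The choice of parser, the helper name parses_ok, and the Greeting element are assumptions made for illustration only; other parsers word their errors differently but should reject the same input.

```python
import xml.etree.ElementTree as ET

def parses_ok(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# The declaration is the very first thing in the document: accepted.
good = '<?xml version="1.0" ?>\n<Greeting>Hello</Greeting>'

# A blank line precedes the declaration: rejected.
bad = '\n<?xml version="1.0" ?>\n<Greeting>Hello</Greeting>'

# No declaration at all: accepted, because the declaration is optional.
bare = '<Greeting>Hello</Greeting>'

for label, text in (("good", good), ("bad", bad), ("bare", bare)):
    print(label, parses_ok(text))
```

The only difference between good and bad is the leading newline, which is exactly the kind of stray white space the note above warns about.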

Actually, the formal definition of an XML declaration, according to the XML 1.0 specification, is
as follows:

XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

This Extended Backus-Naur Form (EBNF) notation, characteristic of many W3C specifications,
means that an XML declaration consists of the literal sequence '<?xml', followed by the
required version information, followed by optional encoding and standalone declarations,
followed by an optional amount of white space, and terminating with the literal sequence '?>'.
In this notation, a question mark not contained in quotes means that the term that precedes it
is optional.
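
To make the EBNF production more concrete, here is a rough transcription of it into a Python regular expression. This is only a sketch: the names S, EQ, quoted, and matches_xml_decl are ours, the version pattern assumes the "1." followed by digits form, and the encoding-name pattern is a simplification of what the specification allows.

```python
import re

S = r"[ \t\r\n]+"                 # S: one or more white space characters
EQ = r"[ \t\r\n]*=[ \t\r\n]*"     # Eq ::= S? '=' S?

def quoted(value):
    """Allow a value pattern inside matching single or double quotes."""
    return "(?:\"" + value + "\"|'" + value + "')"

VERSION_INFO = S + "version" + EQ + quoted(r"1\.[0-9]+")
ENCODING_DECL = S + "encoding" + EQ + quoted(r"[A-Za-z][A-Za-z0-9._-]*")
SD_DECL = S + "standalone" + EQ + quoted(r"(?:yes|no)")

# XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
XML_DECL = re.compile(
    r"<\?xml" + VERSION_INFO
    + "(?:" + ENCODING_DECL + ")?"
    + "(?:" + SD_DECL + ")?"
    + r"[ \t\r\n]*\?>"
)

def matches_xml_decl(text):
    """Return True if the whole string fits the XMLDecl production."""
    return XML_DECL.fullmatch(text) is not None

print(matches_xml_decl('<?xml version="1.0" ?>'))
print(matches_xml_decl('<?xml version="1.0" encoding="UTF-8" standalone="no"?>'))
print(matches_xml_decl('<?xml version="1.0" standalone="no" encoding="UTF-8"?>'))
```

Because the production lists VersionInfo, EncodingDecl, and SDDecl in a fixed sequence, the third string does not match: encoding appears after standalone.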
The following declaration means that there is an external DTD on which this document
depends. See the next subsection for the DTD that this negative standalone value implies.
<?xml version="1.0" standalone="no" ?>

On the other hand, if your XML document has no associated DTD, the correct XML declaration
is:

<?xml version="1.0" standalone="yes" ?>

The XML 1.0 Recommendation states: "If there are external markup declarations but there is no
standalone document declaration, the value 'no' is assumed."

The optional encoding part of the declaration tells the XML processor (parser) how to interpret
the bytes based on a particular character set. The default encoding is UTF-8, which is one of
seven character-encoding schemes defined by the Unicode standard. In UTF-8, one byte is used
to represent each ASCII character and two to four bytes are used for every other character.
UTF-8 is an efficient form of Unicode for ASCII-based documents. In fact, UTF-8 is a superset
of ASCII.

<?xml version="1.0" encoding="UTF-8" ?>

For many Asian languages, however, an encoding of UTF-16 is often more compact because most
of their characters take two bytes in UTF-16 but three bytes in UTF-8. It is also possible to
specify an ISO character encoding, such as in the following example, which refers to ASCII plus
Greek characters. Note, however, that some XML processors may not handle ISO character sets
correctly since the specification requires only that they handle UTF-8 and UTF-16.

<?xml version="1.0" encoding="ISO-8859-7" ?>
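
To see how the choice of encoding affects the bytes a parser must interpret, the following sketch counts the bytes that a few sample characters occupy in each encoding mentioned above. The sample characters and the helper name byte_length are our own choices for illustration.

```python
samples = {
    "Latin letter A": "A",
    "Greek alpha": "\u03b1",
    "CJK ideograph": "\u4e2d",
}

def byte_length(ch, encoding):
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))

for label, ch in samples.items():
    # "utf-16-le" is used so that no byte order mark is counted.
    print(label, byte_length(ch, "utf-8"), byte_length(ch, "utf-16-le"))

# ISO-8859-7 stores the Greek letter in a single byte,
# but it has no way to represent the CJK ideograph at all.
print(byte_length("\u03b1", "iso-8859-7"))
```

The ASCII letter is smallest in UTF-8, the Greek letter takes two bytes in both Unicode encodings, and the ideograph is one byte shorter in UTF-16 than in UTF-8.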


Both the standalone and encoding information may be supplied, with encoding first:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>

Is the next example valid?

<?xml version="1.0" encoding='UTF-8' standalone='no'?>

Yes, it is. Single and double quotes can be used
interchangeably, provided they are of matching kind around any particular attribute value.
(Although there is no good reason in this example to use double quotes for version and single
quotes for the other, you may need to do so if the attribute value already contains the kind of
quotes you prefer.) Finally, the lack of a blank space between 'no' and ?> is not a problem.
Note, however, that the order of these attributes does matter: as the formal definition above
shows, version must come first, followed by encoding (if present) and then standalone (if
present).
Neither of the following XML declarations is valid.

<?XML VERSION="1.0" STANDALONE="no"?>

<?xml version="1.0" standalone="No"?>

The first is invalid because these particular attribute names must be lowercase, as must "xml".
The problem with the second declaration is that the value of the standalone attribute must be
literally "yes" or "no", not "No". (Do I dare call this a "no No"?)
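
If you would rather let a parser settle such questions, the sketch below hands each declaration to the expat parser that ships with Python (xml.parsers.expat). The helper name accepts and the tiny Doc root element appended to each declaration are assumptions for illustration; expat is only one parser, though a conforming processor should agree on these cases.

```python
import xml.parsers.expat

def accepts(declaration):
    """Return True if expat parses the declaration followed by a tiny root element."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(declaration + "<Doc/>", True)
        return True
    except xml.parsers.expat.ExpatError:
        return False

candidates = [
    "<?xml version=\"1.0\" encoding='UTF-8' standalone='no'?>",  # mixed quotes
    '<?XML VERSION="1.0" STANDALONE="no"?>',                     # wrong case in names
    '<?xml version="1.0" standalone="No"?>',                     # wrong case in value
]

for declaration in candidates:
    print(accepts(declaration), declaration)
```

Only the first candidate is accepted; the other two are rejected before the parser ever reaches the root element.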

Document Type Declaration


The document type declaration follows the XML declaration. The purpose of this declaration is
to announce the root element (sometimes called the document element) and to provide the
location of the DTD. The general syntax is:

<!DOCTYPE RootElement (SYSTEM | PUBLIC) ExternalDeclarations? [InternalDeclarations]? >

where <!DOCTYPE is a literal string, RootElement is whatever you name the outermost element
of your hierarchy, followed by either the literal keyword SYSTEM or PUBLIC. The
optional ExternalDeclarations portion is typically the relative path or URL to the DTD that
describes your document type. (It is really only optional if the entire DTD appears as
an InternalDeclaration, which is neither likely nor desirable.) If there are InternalDeclarations,
they must be enclosed in square brackets. In general, you'll encounter far more cases
with ExternalDeclarations than InternalDeclarations, so let's ignore the latter for now. They
constitute the internal subset, which is described in chapter 4.
Let's start with a simple but common case. In this example, we are indicating that the DTD and
the XML document reside in the same directory (i.e., the ExternalDeclarations are contained in
the file employees.dtd) and that the root element is Employees:

<!DOCTYPE Employees SYSTEM "employees.dtd">

Similarly,

<!DOCTYPE PriceList SYSTEM "prices.dtd">

indicates a root element PriceList and the DTD is in the local file: prices.dtd.
In the next example, we use normal directory path syntax to indicate a different location for the
DTD.

<!DOCTYPE Employees SYSTEM "../dtds/employees.dtd">
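
A parser reports the pieces of the document type declaration separately. As a minimal sketch, assuming Python's bundled expat parser and a hypothetical helper named read_doctype, we can pull out the root element name and the DTD location. Note that expat does not fetch the DTD here, so the file does not need to exist.

```python
import xml.parsers.expat

def read_doctype(document):
    """Return (root_name, system_id, public_id) from the document type declaration."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called once when the parser reaches <!DOCTYPE ...>.
        found["doctype"] = (name, system_id, public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found.get("doctype")

doc = (
    '<?xml version="1.0" standalone="no"?>\n'
    '<!DOCTYPE Employees SYSTEM "../dtds/employees.dtd">\n'
    '<Employees/>'
)
print(read_doctype(doc))
```

For a SYSTEM declaration the public identifier comes back as None; a document with no DOCTYPE at all makes the function return None.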

As is often the case, we might want to specify a URL for the DTD since the XML file may not
even be on the same host as the DTD. This case also applies when you are using an XML
document for message passing or data transmission across servers and still want the validation
by referencing a common DTD.

<!DOCTYPE Employees SYSTEM
"http://somewhere.com/dtds/employees.dtd">

Next, we have the case of the PUBLIC identifier. This is used in formal environments to declare
that a given DTD is available to the public for shared use. Recall that XML's true power as a
syntax relates to developing languages that permit exchange of structured data between
applications and across company boundaries. The syntax is a little different:

<!DOCTYPE RootElement PUBLIC PublicID URI>

The new aspect here is the notion of a PublicID, which is a slightly involved formatted string
that identifies the source of the DTD whose path follows as the URI. This is sometimes known as
the Formal Public Identifier (FPI).
For example, I was part of a team that developed (Astronomical) Instrument Markup Language
(AIML, IML) for NASA Goddard Space Flight Center. We wanted our DTD to be available to
other astronomers. Our document type declaration (with a root element named Instrument)
was:

<!DOCTYPE Instrument PUBLIC
"-//NASA//Instrument Markup Language 0.2//EN"
"http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">

In this case the PublicID is:

"-//NASA//Instrument Markup Language 0.2//EN"

The URI that locates the DTD is:

http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd

Let's decompose the PublicID. The leading hyphen indicates that NASA is not a standards body.
If it were, a plus sign would replace the hyphen, except if the standards body were ISO, in which
case the string "ISO" would appear. Next we have the name of the organization responsible for
the DTD (NASA, in this case), surrounded with double slashes, then a short free-text description
of the DTD ("Instrument Markup Language 0.2"), double slashes, and a two-character language
identifier ("EN" for English, in this case).
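
The decomposition just described can be sketched in a few lines of Python. The names PublicId and split_public_id are hypothetical, and the function assumes the four-field layout of the NASA example; identifiers that begin with the string "ISO" are laid out differently and are deliberately rejected here.

```python
from typing import NamedTuple

class PublicId(NamedTuple):
    registration: str   # "-" when the owner is not a standards body, "+" when it is
    owner: str          # organization responsible for the DTD
    description: str    # short free-text description of the DTD
    language: str       # two-character language identifier

def split_public_id(fpi):
    """Split a four-field public identifier on its double slashes."""
    parts = fpi.split("//")
    if len(parts) != 4:
        raise ValueError("expected four fields separated by //: %r" % fpi)
    return PublicId(*parts)

nasa = split_public_id("-//NASA//Instrument Markup Language 0.2//EN")
print(nasa.registration, nasa.owner, nasa.description, nasa.language, sep=" | ")
```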
Since the XML prolog is the combination of the XML declaration and the document type
declaration, for our NASA example the complete prolog is:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE Instrument PUBLIC
"-//NASA//Instrument Markup Language 0.2//EN"
"http://pioneer.gsfc.nasa.gov/public/iml/iml.dtd">

As another example, let's consider a common case involving DTDs from the W3C, such as those
for XHTML 1.0.

<?xml version="1.0" encoding="utf-8"?>


The XHTML Basic 1.0 PublicID is similar but not identical to the XHTML 1.0 case and of course
the DTD is different since it's a different language.
If you noticed that the NASA example uses uppercase for the encoding value UTF-8 and the
W3C examples use lowercase, you may have been bothered because that is inconsistent with
what we learned about the case-sensitive value for the standalone attribute. The only
explanation I can offer is that although element and attribute names are always case-sensitive,
attribute values may or may not be. (For encoding names in particular, the XML 1.0
Recommendation says that processors should match them in a case-insensitive way.) A
reasonable guess is that if the possible attribute values are easily enumerated (i.e., "yes" or
"no", or other relatively short lists of choices), then case probably matters.
NOTE

DTD-related keywords such as DOCTYPE, PUBLIC, and SYSTEM must be uppercase. XML-related
attribute names such as version, encoding, and standalone must be lowercase.

Document Body
The document body, or instance, is the bulk of the information content of the document.
Whereas across multiple instances of a document of a given type (as identified by
the DOCTYPE) the XML prolog will remain constant, the document body changes with each
document instance (in general). This is because the prolog defines (either directly or indirectly)
the overall structure while the body contains the real instance-specific data. Comparing this to
data structures in computer languages, the DTD referenced in the prolog is analogous to
a struct in the C language or a class definition in Java, and the document body is analogous to a
runtime instance of the struct or class.
Because the document type declaration specifies the root element, this must be the first
element the parser encounters. If any other element but the one identified by
the DOCTYPE line appears first, the document is immediately invalid.
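
The parsers in Python's standard library do not validate, so they will not flag this mismatch on their own. As a sketch of checking just this one rule by hand, the hypothetical helper below uses xml.dom.minidom to compare the name announced in the DOCTYPE with the actual root element. It is not a substitute for a validating parser.

```python
from xml.dom import minidom

def root_matches_doctype(document):
    """Compare the DOCTYPE name with the root element name.

    Returns None when the document has no document type declaration.
    """
    dom = minidom.parseString(document)
    if dom.doctype is None:
        return None
    return dom.doctype.name == dom.documentElement.tagName

matching = '<!DOCTYPE Employees SYSTEM "employees.dtd"><Employees/>'
mismatched = '<!DOCTYPE Employees SYSTEM "employees.dtd"><PriceList/>'

print(root_matches_doctype(matching))
print(root_matches_doctype(mismatched))
```

The second document is still well-formed, which is why a non-validating parser reads it without complaint.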
Listing 3-1 shows a very simple XHTML 1.0 document. The DOCTYPE is "html" (not "xhtml"), so
the document body begins with <html ....> and ends with </html>.
Listing 3-1 Simple XHTML 1.0 Document with XML Prolog and Document Body

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>XHTML 1.0</title>
  </head>
  <body>
    <h1>Simple XHTML 1.0 Example</h1>
    <p>See the <a href=
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">DTD</a>.</p>
  </body>
</html>

Markup, Character Data, and Parsing


An XML document contains text characters that fall into two categories: either they are part of
the document markup or part of the data content, usually called character data, which simply
means all text that is not part of the markup. In other words, XML text consists of intermingled
character data and markup. Let's revisit an earlier fragment.

<Address>
<Street>123 Milky Way</Street>
<City>Columbia</City>
<State>MD</State>
<Zip>20777</Zip>
</Address>

The character data comprises the four strings "123 Milky Way", "Columbia", "MD", and
"20777"; the markup comprises the start and end tags for the five
elements Address, Street, City, State, and Zip. Note that this is similar, but not identical, to what
we previously called content. For example, although each chunk of character data is the
content of a particular element, the content of the Address element is all of the child
elements. We can think of all the character data belonging to both the element that directly
contains it and indirectly to Address. (In fact, in some XML applications such as XSLT, if we ask
for the text content of Address, we'll get the concatenation of all the individual strings.)
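
The same idea can be seen outside XSLT. In this sketch we use Python's xml.etree.ElementTree (our choice, not something the text prescribes) to read the character data of one child and then the concatenated text of everything under Address. The white space between the child elements is part of the concatenation, so we collapse it before printing.

```python
import xml.etree.ElementTree as ET

address = ET.fromstring(
    "<Address>\n"
    "<Street>123 Milky Way</Street>\n"
    "<City>Columbia</City>\n"
    "<State>MD</State>\n"
    "<Zip>20777</Zip>\n"
    "</Address>"
)

# Character data directly contained in one child element.
street_text = address.find("Street").text

# All character data under Address, in document order, with runs of
# white space (including the newlines between children) collapsed.
all_text = " ".join("".join(address.itertext()).split())

print(street_text)
print(all_text)
```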
The markup itself can be divided into a number of categories, as per section 2.4 of the XML 1.0
specification.
• start tags and end tags (e.g., <Address> and </Address> )
• empty-element tags (e.g., <Divider/> )
• entity references (e.g., &footer; or %otherDTD; )
• character references (e.g., &#60; or &#x3E; )
• comments (e.g., <!-- whatever --> )
• CDATA section delimiters (e.g., <![CDATA[ insert code here ]]> )
• document type declarations (e.g., <!DOCTYPE ....> )
• processing instructions (e.g., <?myJavaApp numEmployees="25" location="Columbia" .... ?> )
• XML declarations (e.g., <?xml version=.... ?> )
• text declarations (e.g., <?xml encoding=.... ?> )
• any white space at the top level (before or after the root element)

We will discuss each of these markup aspects in either this chapter or the next. Note that for all
types of markup, there are some delimiters, most but not all of which involve angle brackets.

The specification states that all text that is not markup constitutes the character data of the
document. In other words, if you stripped all markup from the document, the remaining
content would be the character data. Consider this example:

<?xml version="1.0" standalone="no" ?>


<!DOCTYPE Message SYSTEM "message.dtd">
<Message mime-type="text/plain">
<!-- This is a trivial example. -->
<From>The Kenster</From>
<To>Silly Little Cowgirl</To>
<Body>
Hi, there. How is your gardening going?
</Body>
</Message>

The character data when the markup is removed would be:

The Kenster Silly Little Cowgirl Hi, there. How is your gardening going?

In general this is essentially the text between the start and end tags, which we previously called
the content of the element, but there is a subtlety related to parsing. Depending on parser
details, the newlines after </From> and </To> might be replaced by single spaces, as shown.
Alternatively, the newlines might be preserved.
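
One way to watch markup and character data being separated is to let a parser report each piece as it finds it. The sketch below assumes Python's bundled expat parser; the function name split_markup_and_data is hypothetical, and collapsing the white space into single spaces is our own choice, corresponding to the first of the two behaviors just described.

```python
import xml.parsers.expat

def split_markup_and_data(document):
    """Return (list of markup pieces, white-space-normalized character data)."""
    markup, data = [], []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: markup.append("<%s>" % name)
    parser.EndElementHandler = lambda name: markup.append("</%s>" % name)
    parser.CommentHandler = lambda text: markup.append("<!--%s-->" % text)
    parser.CharacterDataHandler = data.append   # may arrive in several chunks
    parser.Parse(document, True)
    return markup, " ".join("".join(data).split())

message = (
    '<?xml version="1.0" standalone="no" ?>\n'
    '<!DOCTYPE Message SYSTEM "message.dtd">\n'
    '<Message mime-type="text/plain">\n'
    '<!-- This is a trivial example. -->\n'
    '<From>The Kenster</From>\n'
    '<To>Silly Little Cowgirl</To>\n'
    '<Body>\n'
    'Hi, there. How is your gardening going?\n'
    '</Body>\n'
    '</Message>'
)

markup, text = split_markup_and_data(message)
print(markup)
print(text)
```

The comment lands in the markup list rather than in the character data, and the attribute on Message is delivered with the start tag.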
Parsing is the process of splitting up a stream of information into its constituent pieces (often
called tokens). In the context of XML, parsing refers to scanning an XML document (which need
not be a physical file—it can be a data stream) in order to split it into its various markup and
character data, and more specifically, into elements and their attributes. XML parsing reveals
the structure of the information since the nesting of elements implies a hierarchy. It is possible
for an XML document to fail to parse completely if it does not follow the well-formedness
rules described in the XML 1.0 Recommendation. A successfully parsed XML document may be
either well-formed (at a minimum) or valid, as discussed in detail later in this chapter and the
next.
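
To illustrate a document that fails to parse, the sketch below gives Python's xml.etree.ElementTree three small inputs. The helper name well_formed is hypothetical. Keep in mind that this parser checks well-formedness only; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

def well_formed(text):
    """Return (True, "") if the text parses, else (False, parser's message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""

nested = "<Address><City>Columbia</City></Address>"
overlapping = "<Address><City>Columbia</Address></City>"   # tags cross
two_roots = "<City>Columbia</City><State>MD</State>"       # no single root

for sample in (nested, overlapping, two_roots):
    ok, message = well_formed(sample)
    print(ok, message)
```

Only the first input reveals a hierarchy the parser can build; the other two break the nesting rules, so parsing stops with an error.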

There is a subtlety about processing character data. During the parsing process, if there is
markup that contains entity references, the markup will be converted into character data. A
typical example from XHTML would be:

<p>&quot;AT&amp;T is a winning company,&quot; he said.</p>

After the parser substitutes for the entities, the resultant character data is:

"AT&T is a winning company," he said.

After parsing and substituting for special characters, the character data that remains after the
substitution is parsed character data, which is referred to as #PCDATA in DTDs and always
refers to textual content of elements. Character data that is not parsed is called CDATA in DTDs;
this relates exclusively to attribute values.
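
The substitution in the AT&T example is easy to observe with a parser. This sketch again assumes Python's xml.etree.ElementTree; the second input, which uses a numeric character reference, is our own addition for comparison.

```python
import xml.etree.ElementTree as ET

# Entity references in the source are replaced during parsing.
p = ET.fromstring("<p>&quot;AT&amp;T is a winning company,&quot; he said.</p>")
print(p.text)

# A numeric character reference is replaced in the same way.
q = ET.fromstring("<p>1 &#60; 2</p>")
print(q.text)
```

By the time the application sees the text of each element, the references are gone and only the parsed character data remains.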
