Extensible Markup Language: What Is XML?
What is XML?
What's a Document?
So XML is Just Like HTML?
So XML Is Just Like SGML?
Why XML?
Validity
Conclusion
Introduction
Extensible Markup Language, abbreviated XML, describes a class of data
objects called XML Documents and partially describes the behavior of computer
programs which process them. XML is an application profile or restricted form
of SGML, the Standard Generalized Markup Language. By construction, XML
documents are conforming SGML documents.
What is XML?
The Extensible Markup Language (XML) is a subset of SGML that is
completely described in this document. Its goal is to enable generic SGML to be served,
received, and processed on the Web in the way that is now possible with HTML. XML
has been designed for ease of implementation and for interoperability with both SGML
and HTML.
What's a Document?
This document has been reviewed by W3C Members and other interested
parties and has been endorsed by the Director as a W3C Recommendation. It is a stable
document and may be used as reference material or cited as a normative reference from
another document. W3C's role in making the Recommendation is to draw attention to the
specification and to promote its widespread deployment. This enhances the functionality
and interoperability of the Web.
The number of applications currently being developed that are based on,
or make use of, XML documents is truly amazing (particularly when you consider that
XML is not yet a year old)! For our purposes, the word "document" refers not only to
traditional documents, like this one, but also to the myriad of other XML "data formats".
These include vector graphics, e-commerce transactions, mathematical equations, object
meta-data, server APIs, and a thousand other kinds of structured information.
So XML is Just Like HTML?
XML specifies neither semantics nor a tag set. In fact XML is really a
meta-language for describing markup languages. In other words, XML provides a facility
to define tags and the structural relationships between them. Since there's no predefined
tag set, there can't be any preconceived semantics. All of the semantics of an XML
document will be defined either by the applications that process it or by style sheets.
So XML Is Just Like SGML?
No. Well, yes, sort of. XML is defined as an application profile of SGML.
SGML is the Standard Generalized Markup Language defined by ISO 8879. SGML has
been the standard, vendor-independent way to maintain repositories of structured
documentation for more than a decade, but it is not well suited to serving documents over
the web (for a number of technical reasons beyond the scope of this article). Defining
XML as an application profile of SGML means that any fully conformant SGML system
will be able to read XML documents. However, using and understanding XML
documents does not require a system that is capable of understanding the full generality
of SGML. XML is, roughly speaking, a restricted form of SGML.
For technical purists, it's important to note that there may also be subtle
differences between documents as understood by XML systems and those same
documents as understood by SGML systems. In particular, treatment of white space
immediately adjacent to tags may be different.
Why XML?
In order to appreciate XML, it is important to understand why it was
created. XML was created so that richly structured documents could be used over the
web. The only viable alternatives, HTML and SGML, are not practical for this purpose.
This is not to say that XML can be expected to completely replace SGML.
While XML is being designed to deliver structured content over the web, some of the
very features it omits to make this practical make SGML a more satisfactory solution for
the creation and long-term storage of complex documents. In many organizations,
filtering SGML to XML will be the standard procedure for web delivery.
Extensible Markup Language (XML):
Defines the syntax of XML. The XML specification is the primary focus
of this article.
XML Pointer Language (XPointer) and XML Linking Language (XLink):
One topic that may be new is the use of EBNF to describe the syntax of
XML.
<?xml version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>,
Gracie.</burns>
<allen><quote>Goodnight,
Gracie.</quote></allen>
<applause/>
</oldjoke>
A few things may stand out to you:
• The document begins with a processing instruction: <?xml ...?>. This is the XML
declaration. While it is not required, its presence explicitly identifies the
document as an XML document and indicates the version of XML to which it was
authored.
• There's no document type declaration. Unlike SGML, XML does not require a
document type declaration. However, a document type declaration can be
supplied, and some documents will require one in order to be understood
unambiguously.
• Empty elements (<applause/> in this example) have a modified syntax. While
most elements in a document are wrappers around some content, empty elements
are simply markers where something occurs (a horizontal rule for HTML's <hr>
tag, for example, or a cross reference for DocBook's <xref> tag). The trailing /> in
the modified syntax indicates to a program processing the XML document that the
element is empty and no matching end-tag should be sought. Since XML
documents do not require a document type declaration, without this clue it could
be impossible for an XML parser to determine which tags were intentionally
empty and which had been left empty by mistake.
• XML has softened the distinction between elements which are declared as EMPTY
and elements which merely have no content. In XML, it is legal to use the empty-
element tag syntax in either case. It's also legal to use a start-tag/end-tag pair for
empty elements: <applause></applause>. If interoperability is of any concern, it's
best to reserve empty-element tag syntax for elements which are declared as
EMPTY and to only use the empty-element tag form for those elements.
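The points above can be checked with a few lines of Python. This is a minimal sketch, assuming the standard library's xml.etree.ElementTree module (a non-validating parser); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The oldjoke document, including the optional XML declaration.
OLDJOKE = """<?xml version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>,
Gracie.</burns>
<allen><quote>Goodnight,
Gracie.</quote></allen>
<applause/>
</oldjoke>"""

root = ET.fromstring(OLDJOKE)
child_names = [child.tag for child in root]

# Both spellings of an empty element parse to the same structure:
# same name, no text, no children.
short_form = ET.fromstring("<applause/>")
long_form = ET.fromstring("<applause></applause>")
same_empty_element = (
    short_form.tag == long_form.tag
    and short_form.text is None
    and long_form.text is None
    and len(short_form) == len(long_form) == 0
)
```

No document type declaration is needed for the parse to succeed, which matches the second point in the list.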
XML documents are composed of markup and content. There are six kinds
of markup that can occur in an XML document: elements, entity references, comments,
processing instructions, marked sections, and document type declarations. The following
sections introduce each of these markup concepts.
Elements
Elements are the most common form of markup. Delimited by angle
brackets, most elements identify the nature of the content they surround. Some elements
may be empty, as seen above, in which case they have no content. If an element is not
empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.
Attributes
Attributes are name-value pairs that occur inside start-tags after the element name. For
example,
<div class="preface">
is a div element with the attribute class having the value preface. In XML, all attribute
values must be quoted.
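As a quick illustration, the following Python sketch reads the attribute and shows that an unquoted value is rejected. It assumes the standard library's xml.etree.ElementTree module; the helper name parses is made up for this example.

```python
import xml.etree.ElementTree as ET

# Read the class attribute from the start-tag.
div = ET.fromstring('<div class="preface"></div>')
class_value = div.get("class")

def parses(text):
    """Return True if text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML tolerates unquoted values; XML does not.
unquoted_ok = parses("<div class=preface></div>")
single_quoted_ok = parses("<div class='preface'></div>")
```

Either single or double quotes are acceptable, but some form of quoting is mandatory.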
Entity References
In order to introduce markup into a document, some characters have been
reserved to identify the start of markup. The left angle bracket, < , for instance, identifies
the beginning of an element start- or end-tag. In order to insert these characters into your
document as content, there must be an alternative way to represent them. In XML,
entities are used to represent these special characters. Entities are also used to refer to
often repeated or varying text and to include the content of external files.
Every entity must have a unique name. Defining your own entity names is
discussed in the section on entity declarations. In order to use an entity, you simply
reference it by name. Entity references begin with the ampersand and end with a
semicolon.
For example, the lt entity inserts a literal < into a document. So the string
<element> can be represented in an XML document as &lt;element>.
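A small Python sketch shows the round trip between the literal characters and their entity references. It assumes the standard library's xml.etree.ElementTree parser and the escape helper from xml.sax.saxutils; the sample text is invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Parsing turns the predefined entity references back into characters.
para = ET.fromstring("<para>&lt;element&gt; &amp; friends</para>")
decoded = para.text

# Going the other way, escape() replaces &, < and > with references
# so the text can be placed safely in element content.
encoded = escape("<element> & friends")
```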
Comments
Comments begin with <!-- and end with -->. Comments can contain any
data except the literal string --. You can place comments between markup anywhere in
your document.
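The sketch below, which assumes Python's standard xml.etree.ElementTree module and its underlying Expat parser, illustrates both rules: a comment is not part of the document's character data, and a comment containing -- is rejected. The sample documents are invented for this example.

```python
import xml.etree.ElementTree as ET

# By default this parser drops comments, so only the text remains.
doc = ET.fromstring("<doc><!-- a remark for human readers -->text</doc>")

def double_hyphen_rejected():
    """Return True if a comment containing '--' fails to parse."""
    try:
        ET.fromstring("<doc><!-- not -- allowed --></doc>")
    except ET.ParseError:
        return True
    return False
```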
Processing Instructions
Processing instructions (PIs) are an escape hatch to provide information to
an application. Like comments, they are not textually part of the XML document, but the
XML processor is required to pass them to an application.
Processing instructions have the form: <?name pidata?>. The name, called
the PI target, identifies the PI to the application. Applications should process only the
targets they recognize and ignore all other PIs. Any data that follows the PI target is
optional; it is for the application that recognizes the target. The names used in PIs may be
declared as notations in order to formally identify them.
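One way an application might apply the "process what you recognize, ignore the rest" rule is sketched below in Python. It assumes the standard library's xml.dom.minidom module; the PI targets render and other and the set KNOWN_TARGETS are hypothetical.

```python
from xml.dom import Node, minidom

dom = minidom.parseString('<doc><?render mode="draft"?><?other x?>text</doc>')

# Hypothetical set of PI targets this application understands.
KNOWN_TARGETS = {"render"}

# Keep (target, data) for recognized PIs; silently skip all others.
recognized = [
    (node.target, node.data)
    for node in dom.documentElement.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
    and node.target in KNOWN_TARGETS
]
```

The parser hands over both PIs; it is the application that chooses to act only on render.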
CDATA Sections
In a document, a CDATA section instructs the parser to ignore most
markup characters.
<![CDATA[
*p = &q;
b = (i <= 3);
]]>
Between the start of the section, <![CDATA[ and the end of the section,
]]>, all character data is passed directly to the application, without interpretation.
Elements, entity references, comments, and processing instructions are all unrecognized
and the characters that comprise them are passed literally to the application.
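The effect is easy to observe. In this Python sketch, which assumes the standard library's xml.etree.ElementTree module, the same characters are accepted inside a CDATA section and rejected outside one; the code element name is invented for the example.

```python
import xml.etree.ElementTree as ET

# Inside a CDATA section, & and < are ordinary characters.
snippet = ET.fromstring("<code><![CDATA[*p = &q; b = (i <= 3);]]></code>")
literal_text = snippet.text

def parses(text):
    """Return True if text parses as XML, False on a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Without the CDATA section, the < in <= is taken as the start of markup.
unprotected_ok = parses("<code>b = (i <= 3);</code>")
```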
One of the greatest strengths of XML is that it allows you to create your
own tag names. But for any given application, it is probably not meaningful for tags to
occur in a completely arbitrary order. Consider the old joke example introduced earlier.
Would this be meaningful?
<gracie><quote><oldjoke>Goodnight,
<applause/>Gracie</oldjoke></quote>
<burns><gracie>Say <quote>goodnight</quote>,
</gracie>Gracie.</burns></gracie>
It's so far outside the bounds of what we normally expect that it's
nonsensical. It just doesn't mean anything.
<!ELEMENT oldjoke (burns+, allen, applause?)>
This declaration identifies the element named oldjoke. Its content model
follows the element name. The content model defines what an element may contain. In
this case, an oldjoke must contain burns and allen and may contain applause. The
commas between element names indicate that they must occur in succession. The plus
after burns indicates that it must occur at least once and may be repeated.
The question mark after applause indicates that it is optional (it may be absent, or it may
occur exactly once). A name with no punctuation, such as allen, must occur exactly once.
Declarations for burns, allen, applause and all other elements used in any
content model must also be present for an XML processor to check the validity of a
document.
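Python's standard library does not include a validating parser, so the following is only a rough sketch of the idea behind content-model checking. It treats the content model described above (one or more burns, then allen, then an optional applause) as a regular expression over the names of the child elements. The names OLDJOKE_MODEL and matches_oldjoke_model are invented for this example, and a real validating processor does considerably more.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model (burns+, allen, applause?)
# into a regular expression over space-terminated child names.
OLDJOKE_MODEL = re.compile(r"(burns )+allen (applause )?")

def matches_oldjoke_model(xml_text):
    """Check the order and count of the children of an oldjoke element."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return root.tag == "oldjoke" and OLDJOKE_MODEL.fullmatch(names) is not None
```

For instance, matches_oldjoke_model("<oldjoke><allen/><burns/></oldjoke>") is false because the children are out of order, even though the document is syntactically correct XML.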
In a mixed content declaration such as <!ELEMENT burns (#PCDATA | quote)*>, the vertical
bar indicates an or relationship and the asterisk indicates that the content is optional
(may occur zero or more times); therefore, by this definition, burns may contain zero or
more characters and quote tags, mixed in any order. All mixed
content models must have this form: #PCDATA must come first, all of the elements must
be separated by vertical bars, and the entire group must be optional.
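To see what mixed content looks like to a program, here is a short Python sketch using the standard library's xml.etree.ElementTree module. That library stores the character data before the first child in text and the data following a child in that child's tail; this is a detail of the library, not of XML itself.

```python
import xml.etree.ElementTree as ET

burns = ET.fromstring("<burns>Say <quote>goodnight</quote>, Gracie.</burns>")
quote = burns[0]

# Characters and quote elements are interleaved in document order.
pieces = (burns.text, quote.text, quote.tail)

# itertext() walks all character data, ignoring the element boundaries.
full_text = "".join(burns.itertext())
```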
Two other content models are possible: EMPTY indicates that the element
has no content (and consequently may be written with the empty-element tag syntax), and ANY indicates that any content is
allowed. The ANY content model is sometimes useful during document conversion, but
should be avoided at almost any cost in a production environment because it disables all
content checking in that element.
<!ATTLIST oldjoke
  name   ID                  #REQUIRED
  label  CDATA               #IMPLIED
  status (funny | notfunny)  'funny'>
In this example, the oldjoke element has three attributes: name, which is an ID and is
required; label, which is a string (character data) and is not required; and status, which
must be either funny or notfunny and defaults to funny if no value is specified.
Each attribute in a declaration has three parts: a name, a type, and a default value.
You are free to select any name you wish, subject to some slight restrictions, but names
cannot be repeated on the same element.
CDATA
CDATA attributes are strings; any text is allowed. Don't confuse CDATA
attributes with CDATA sections; they are unrelated.
ID
The value of an ID attribute must be a name. All of the ID values used in a
document must be different. IDs uniquely identify individual elements in a
document. Elements can have only a single ID attribute.
IDREF or IDREFS
An IDREF attribute's value must be the value of a single ID attribute on some
element in the document. The value of an IDREFS attribute may contain multiple
IDREF values separated by white space.
ENTITY or ENTITIES
An ENTITY attribute's value must be the name of a single entity. The value of an
ENTITIES attribute may contain multiple entity names separated by white space.
NMTOKEN or NMTOKENS
Name token attributes are a restricted form of string attribute. In general, an
NMTOKEN attribute must consist of a single word, but there are no additional
constraints on the word; it doesn't have to match another attribute or declaration.
The value of an NMTOKENS attribute may contain multiple NMTOKEN values
separated by white space.
A list of names
You can specify that the value of an attribute must be taken from a specific list of
names. This is frequently called an enumerated type because each of the possible
values is explicitly enumerated in the declaration. Alternatively, you can specify
that the names must match a notation name.
#REQUIRED
The attribute must have an explicitly specified value on every occurrence of the
element in the document.
#IMPLIED
The attribute value is not required, and no default value is provided. If a value is
not specified, the XML processor must proceed without one.
"value"
An attribute can be given any legal value as a default. The attribute value is not
required on each element in the document, and if it is not present, it will appear to
be the specified default.
#FIXED "value"
An attribute declaration may specify that an attribute has a fixed value. In this
case, the attribute is not required, but if it occurs, it must have the specified value.
If it is not present, it will appear to be the specified default. One use for fixed
attributes is to associate semantics with an element. A complete discussion is
beyond the scope of this article, but you can find several examples of fixed
attributes in the XLink specification.
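The difference between #IMPLIED and a declared default can be observed from a program. The Python sketch below assumes the standard library's xml.etree.ElementTree module, whose underlying Expat parser reads attribute declarations in the internal subset and supplies declared defaults. That is a property of this particular implementation; it does not check #REQUIRED or the enumerated values, since it is not a validating parser. The attribute list is the three-attribute oldjoke example described earlier.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE oldjoke [
<!ATTLIST oldjoke
  name   ID                  #REQUIRED
  label  CDATA               #IMPLIED
  status (funny | notfunny)  'funny'>
]>
<oldjoke name="j1"/>"""

joke = ET.fromstring(DOC)

name = joke.get("name")      # specified in the document
label = joke.get("label")    # #IMPLIED and absent: no value at all
status = joke.get("status")  # absent, so the declared default appears
```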
Entity Declarations
Entity declarations allow you to associate a name with some other
fragment of content. That construct can be a chunk of regular text, a chunk of the
document type declaration, or a reference to an external file containing either text or
binary data.
Internal Entities
Internal entities associate a name with a string of literal text. If ATI is declared as
an internal entity with the replacement text Arbor Text, Inc., then using &ATI; anywhere
in the document will insert that text at that location. Internal entities allow you to
define shortcuts for frequently typed text or text that is expected to change, such as
the revision status of a document.
Internal entities can include references to other internal entities, but it is an error
for them to be recursive.
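Here is a brief Python sketch of internal entity expansion, assuming the standard library's xml.etree.ElementTree module. The ATI entity follows the description above; the second entity, notice, is invented to show one internal entity referring to another.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE doc [
<!ENTITY ATI "Arbor Text, Inc.">
<!ENTITY notice "Prepared by &ATI;">
]>
<doc>&notice;</doc>"""

# The parser replaces &notice; and, inside it, &ATI; with their text.
expanded_text = ET.fromstring(DOC).text
```

Changing the declaration of ATI in one place changes every point in the document where it is referenced.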
External Entities
External entities associate a name with the content of another file, which allows an
XML document to refer to the contents of that file. External
entities contain either text or binary data. If they contain text, the content of the external
file is inserted at the point of reference and parsed as part of the referring document.
Binary data is not parsed and may only be referenced in an attribute. Binary data is used
to reference figures and other non-XML content in the document.
The entity ATIlogo is also an external entity, but its content is binary. The
ATIlogo entity can only be used as the value of an ENTITY (or ENTITIES) attribute (on
a graphic element, perhaps). The XML processor will pass this information along to an
application, but it does not attempt to process the content of /standard/logo.gif.
Parameter Entities
Parameter entities can only occur in the document type declaration. A
parameter entity declaration is identified by placing % (percent-space) in front of its
name in the declaration. The percent sign is also used in references to parameter entities,
instead of the ampersand. Parameter entity references are immediately expanded in the
document type declaration and their replacement text is part of the declaration, whereas
normal entity references are not expanded. Parameter entities are not recognized in the
body of a document.
Looking back at the element declarations in Example 2, you'll notice that two of them
have the same content model:
<!ELEMENT burns (#PCDATA | quote)*>
<!ELEMENT allen (#PCDATA | quote)*>
At the moment, these two elements are the same only because they happen
to have the same literal definition. In order to make more explicit the fact that these two
elements are semantically the same, use a parameter entity to define their content model.
The advantage of using a parameter entity is two-fold. First, it allows you to give a
descriptive name to the content, and second it allows you to change the content model in
only a single place, if you wish to update the element declarations, assuring that they
always stay the same:
<!ENTITY % personcontent "#PCDATA | quote">
<!ELEMENT burns (%personcontent;)*>
<!ELEMENT allen (%personcontent;)*>
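To make the expansion concrete, the following Python sketch imitates it with plain text substitution. This is a deliberate simplification and not how an XML processor is implemented: it ignores nested references, the padding spaces a real processor adds around replacement text, and the rule that such references inside declarations belong in the external subset. The function name expand_parameter_entities and the sample DTD text, which shows burns and allen sharing the personcontent entity, are assumptions made for illustration.

```python
import re

DTD = """<!ENTITY % personcontent "#PCDATA | quote">
<!ELEMENT burns (%personcontent;)*>
<!ELEMENT allen (%personcontent;)*>"""

def expand_parameter_entities(dtd_text):
    """Textually replace %name; references with their declared text."""
    # Collect declarations of the form <!ENTITY % name "replacement">.
    declared = dict(
        re.findall(r'<!ENTITY\s+%\s+([\w.-]+)\s+"([^"]*)"\s*>', dtd_text)
    )
    # A reference has no space after the percent sign and ends with ';'.
    return re.sub(r"%([\w.-]+);", lambda m: declared[m.group(1)], dtd_text)

expanded = expand_parameter_entities(DTD)
```

After expansion both element declarations carry the same literal content model, so editing the entity declaration updates them together.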
Notation Declarations
Notation declarations identify specific types of external binary data. This
information is passed to the processing application, which may use it however it
wishes. A typical notation declaration is:
Authoring Environments
Most authoring environments need to read and process document type
declarations in order to understand and enforce the content models of the
document.
Default Attribute Values
If an XML document relies on default attribute values, at least part of the
declaration must be processed in order to obtain the correct default values.
The document type declaration identifies the root element of the document
and may contain additional declarations. All XML documents must have a single root
element that contains all of the content of the document. Additional declarations may
come from an external DTD, called the external subset, or be included directly in the
document, the internal subset, or both:
<!ATTLIST ulink
]>
<chapter>...</chapter>
This example references an external DTD, dbook.dtd, and includes
element and attribute declarations for the ulink element in the internal subset. In this case,
ulink is being given the semantics of a simple link from the XLink specification.
<oldjoke>
Probably not.
But how can you tell? You can only determine if white space is significant
if you know the content model of the elements in question. In a nutshell, white space is
significant in mixed content and is insignificant in element content.
The rule for XML processors is that they must pass all characters that are
not markup through to the application. If the processor is a validating processor, it must
also inform the application about which white space characters are significant.
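The pass-through rule is visible with any non-validating parser. In this Python sketch, which assumes the standard library's xml.etree.ElementTree module, the line breaks between the tags reach the application even though they are probably insignificant; deciding whether to discard them is left to the application.

```python
import xml.etree.ElementTree as ET

joke = ET.fromstring(
    "<oldjoke>\n<burns>Say goodnight, Gracie.</burns>\n</oldjoke>"
)

# White space between the tags is delivered like any other characters.
before_burns = joke.text
after_burns = joke[0].tail
```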
The special attribute xml:space may be used to indicate explicitly that white
space is significant. On any element which includes the attribute specification
xml:space='preserve', all white space within that element (and within subelements that do
not explicitly reset xml:space ) is significant.
The only legal values for xml:space are preserve and default. The value default
indicates that the default processing is desired. In a DTD, the xml:space attribute must be
declared as an enumerated type with only those two values.
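An application can work out the setting in force for each element by carrying the value down the tree, as in the Python sketch below. It assumes the standard library's xml.etree.ElementTree module, which reports xml:space under the reserved XML namespace name; the function effective_space and the poem document are invented for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree's name for the xml:space attribute.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(element, inherited="default"):
    """Yield (tag, setting) pairs, inheriting xml:space from ancestors."""
    setting = element.get(XML_SPACE, inherited)
    yield element.tag, setting
    for child in element:
        yield from effective_space(child, setting)

doc = ET.fromstring(
    '<poem xml:space="preserve"><line>a</line>'
    '<note xml:space="default"><p>b</p></note></poem>'
)
settings = dict(effective_space(doc))
```

The line element inherits preserve from poem, while note resets the value to default for itself and for p.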
One last note about white space: in parsed text, XML processors are
required to normalize all end-of-line markers to a single line feed character (&#xA;). This
is rarely of interest to document authors, but it does eliminate a number of cross-platform
portability issues.
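The normalization can be seen directly in this Python sketch, which assumes the standard library's xml.etree.ElementTree module and its underlying Expat parser. The input mixes three common line-ending conventions.

```python
import xml.etree.ElementTree as ET

# CR LF, a lone CR, and a lone LF all appear in the source text.
text = ET.fromstring("<p>one\r\ntwo\rthree\nfour</p>").text
lines = text.split("\n")
```

Every line break reaches the application as a single line feed, regardless of the platform on which the file was written.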
Language Identification
Many document processing applications can benefit from information
about the natural language in which a document is written. XML defines the attribute
xml:lang to identify the language. Since the purpose of this attribute is to standardize
information across applications, the XML specification also describes how languages are
to be identified.
Validity
Given the preceding discussion of type declarations, it follows that some
documents are valid and some are not. There are two categories of XML documents:
well-formed and valid.
Well-formed Documents
A document can only be well-formed if it obeys the syntax of XML. A
document that includes sequences of markup characters that cannot be parsed or are
invalid cannot be well-formed.
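Checking well-formedness amounts to asking whether a parser can read the document at all. A minimal Python sketch, assuming the standard library's xml.etree.ElementTree module, follows; the function name is_well_formed is invented here, and the check says nothing about validity.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text can be parsed as an XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

properly_nested = is_well_formed("<a><b/></a>")
overlapping_tags = is_well_formed("<a><b></a></b>")
missing_end_tag = is_well_formed("<a>")
two_root_elements = is_well_formed("<a/><b/>")
```

Only the first of these four inputs is well-formed; the others overlap tags, omit an end-tag, or have more than one root element.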
Valid Documents
A well-formed document is valid only if it contains a proper document
type declaration and if the document obeys the constraints of that declaration (element
sequence and nesting is valid, required attributes are provided, attribute values are of the
correct type, etc.). The XML specification identifies all of the criteria in detail.
Since XML does not have a fixed set of elements, the name of the linking
element cannot be used to locate links. Instead, XML processors identify links by
recognizing the xml:link attribute. Other attributes can be used to provide additional
information to the XML processor. An attribute renaming facility exists to work around
name collisions in existing applications.
Two of the attributes, show and actuate, allow you to exert some control over
the linking behavior. The show attribute determines whether the document linked-to is
embedded in the current document, replaces the current document, or is displayed in a
new window when the link is traversed. The actuate attribute determines how the link is
traversed, either automatically or when selected by the user.
Some applications will require much finer control over linking behaviors.
For those applications, standard places are provided where the additional semantics may
be expressed.
Simple Links
A Simple Link strongly resembles an HTML <A> link:
A Simple Link identifies a link between two resources, one of which is the content of the
linking element itself. This is an in-line link.
The locator identifies the other resource. The locator may be a URL, a query, or an
Extended Pointer.
Extended Links
Extended Links allow you to express relationships between more than two resources:
</elink>
Extended Links can be in-line, so that the content of the linking element
(other than the locator elements), participates in the link as a resource, but that is not
necessarily the case. The example above is an out-of-line link because it does not use its
content as a resource.
Extended Pointers
Cross references with the XML ID/IDREF mechanism (which is similar to
the #fragment mechanism in HTML) require that the document being linked-to has
defined anchors where links are desired (and technically requires that both the ID and the
IDREF occur in the same document). This may not always be the case and sometimes it
is not possible to modify the document to which you wish to link.
XML XPointers borrow concepts from HyTime and the Text Encoding
Initiative (TEI). XPointers offer a syntax that allows you to locate a resource by
traversing the element tree of the document containing the resource.
For example,
child(2,oldjoke).(3,.)
locates the third child (whatever it may be) of the second oldjoke in the document.
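The same kind of tree traversal can be approximated in Python. This sketch assumes the standard library's xml.etree.ElementTree module and makes two simplifying assumptions that the pointer syntax itself does not state: the oldjoke elements are direct children of the root, and only element children are counted. The jokes wrapper element is invented for the example, and this is not an XPointer implementation.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<jokes>"
    "<oldjoke><burns>b1</burns><allen>a1</allen><applause/></oldjoke>"
    "<oldjoke><burns>b2</burns><allen>a2</allen><applause/></oldjoke>"
    "</jokes>"
)

# Second oldjoke child of the root (Python indexes from zero).
second_joke = doc.findall("oldjoke")[1]

# Third element child of that oldjoke, whatever its name.
third_child = second_joke[2]
```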
span(child(2,oldjoke),child(3,oldjoke))
In addition to selecting by elements, XPointers allow for selection by ID, attribute value,
and string matching. In this article, the XPointer
span(root()child(3,sect1)string(1,"Here",0),
root()child(3,sect1)string(1,"Here",4))
selects the first occurrence of the word "Here" in the What Do XML Documents Look
Like? section of this article. The link can be established by an extended link without
modifying the target document.
Note that an XPointer range can span a structurally invalid section of the document. The
XLink specification does not specify how applications should deal with such ranges.
Following the annotated text example above, assuming that the actual text
is read only, the XML processor must load at least the text and the document that
contains the extended link.
<ml:cn type="rational">3<ml:sep/>4</ml:cn>.</bk:para>
Again, since XML documents have no fixed tag set, this approach will not
work. The presentation of an XML document is dependent on a style sheet.
The standard style sheet language for XML documents is the Extensible
Style Language (XSL). At the time of this writing, the XSL effort is well underway, but
many questions remain unanswered. The XSL working group produced its first Working
Draft on 18 Aug 1998.
Other style sheet languages, like Cascading Style Sheets, are likely to be
supported as well.
Conclusion
In this article, most of the major features of the XML language have been
discussed, and some of the concepts behind XLink, Namespaces, and XSL have been
described. Although some things have been left out in the interest of the big picture (such
as character encoding issues), hopefully you now have enough background to pick up and
read the XML specifications without difficulty.