XML Ref Essential
XML Ref Essential
Essential XML
1
Markup Languages
Markup languages are all about describing the form of the document—that
is, the way the content of the document should be interpreted.The markup
language that most people are familiar with today, of course, is HTML,
01 0201 CH01 10/12/00 9:47 AM Page 2
which you use to create standard Web pages. Here’s a sample HTML
page:
<HTML>
<HEAD>
<TITLE>Hello From HTML</TITLE>
</HEAD>
<BODY>
<CENTER>
<H1>
Hello From HTML
</H1>
</CENTER>
Welcome to the wild and woolly world of HTML.
</BODY>
</HTML>
You can see the results of this HTML in Figure 1.1 in Netscape Navigator.
Note that the HTML markup in this page—that is, tags such as <HEAD>,
<CENTER>, <H1>, and so on—is there to give directions to the browser.That’s
what markup does; it specifies directions on the way the content is to be
interpreted.
There is a real relationship between HTML and XML; both are based on
Standard Generalized Markup Language (SGML). As its name implies, SGML
is a very general markup language, with enormous capabilities. Because of
the large number of things you can do with SGML, however, it can be very
difficult to learn, and it hasn’t caught on in general use. XML is actually an
easier-to-use subset of SGML (and technically speaking, HTML is called an
application of SGML).You can read more about the relationship between
SGML and XML at www.w3.org/TR/NOTE-sgml-xml.
01 0201 CH01 10/12/00 9:47 AM Page 3
Markup Languages 3
The markup language that most people are familiar with these days is
HTML, but it’s easy to see how that language doesn’t provide enough
power for anything beyond creating standard Web pages.
HTML 1.0 consisted of only a dozen or so tags, but the most recent
version, HTML 4.01, consists of almost 100—if you include the other tags
added by the major browsers, that number is closer to 120. But as handling
01 0201 CH01 10/12/00 9:47 AM Page 4
data on the Web and other nets intensifies, it’s clear that 120 tags isn’t
enough—in fact, you can never have enough.
For example, what if your hobby was building model ships, and you
wanted to exchange specifications with others on the topic? HTML doesn’t
include tags such as <BEAMWIDTH>, <MIZZENHEIGHT>, <DRAFT>, <SHIPCLASS>, and
others that you might want.What if you worked for a major bank and
wanted to exchange financial data with other institutions—would you
prefer tags such as <B>, <UL>, and <FONT> or tags such as <FISCALYEAR>,
<ACCOUNTNUMBER>, and <TRANSFERACCOUNT>? (In fact, such markup languages,
including Extensible Business Reporting Language, exist now and are built
on XML.)
Likewise, what if you were a Web browser manufacturer who wanted to
create your own markup language to let people configure your browser,
adding scrollbars, toolbars, and other elements? You might create your own
markup language. In fact, Netscape has done just that with the XML-based
User Interface Language, which we’ll see in this chapter.
The upshot is that there are as many reasons to create markup languages
as there are ways of handling data—and, of course, both are unlimited
numbers.That’s where XML comes in: It’s a meta-markup specification
that lets you create your own markup languages.
We’ll see the parts of an XML document in detail in the next chapter, but in
overview, here’s how this one works: I start with the XML processing instruc-
tion <?xml version=”1.0” encoding=”UTF-8”?> (all XML processing instructions
01 0201 CH01 10/12/00 9:47 AM Page 5
start with <? and end with ?>), which indicates that I’m using XML version
1.0 (the only version currently defined) and the UTF-8 character encoding,
an 8-bit condensed version of Unicode (more on this later in the chapter).
Also, when I add new sections of code, they’ll be highlighted with shading to
point out the actual lines I’m discussing.
<?xml version=”1.0” encoding=”UTF-8”?>
<DOCUMENT>
<GREETING>
Hello From XML
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
Next, I create a new tag named <DOCUMENT>. As we’ll see in the next chapter,
you can use any name for a tag, not just DOCUMENT, as long as the name
starts with a letter or an underscore (_), and as long as the following charac-
ters consist of letters, digits, underscores, dots (.), or hyphens (-), but no
spaces. In XML, tags always start with < and end with >.
XML documents are made up of XML elements. Much like in HTML,
you create XML elements with an opening tag, such as <DOCUMENT>, followed
by the element content (if any), such as text or other elements, and ending
with the matching closing tag that starts with </, such as </DOCUMENT>. (There
are additional rules we’ll see in the next chapter if the element doesn’t
contain any content.) It’s necessary to enclose the entire document, except
for processing instructions, in one element, called the root element—that’s the
<DOCUMENT> element here:
Now I’ll add to this XML document a new element that I made up,
<GREETING>, which encloses text content (in this case, Hello From XML),
like this:
01 0201 CH01 10/12/00 9:47 AM Page 6
Next, I can add a new element, <MESSAGE>, which also encloses text content,
like this:
<?xml version=”1.0” encoding=”UTF-8”?>
<DOCUMENT>
<GREETING>
Hello From XML
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
Now the <DOCUMENT> root element contains two elements: <GREETING> and
<MESSAGE>. Each of the <GREETING> and <MESSAGE> elements also hold text
themselves. In this way, I’ve created a new XML document.
Note the similarity of this document to the HTML page we saw earlier.
Note also, however, that in the HTML document, all the tags were prede-
fined, and a Web browser knew how to handle them. Here, we’ve just created
the tags <DOCUMENT>, <GREETING> and <MESSAGE> from thin air. How can we use
an XML document like this one? What would a browser make of these new
tags?
You can see our complete XML document in Figure 1.2, but it’s nothing
like the image you see in Figure 1.1; there’s no particular formatting at all.
So, now that we’ve created our own markup elements, how do you tell a
browser how to display them?
Many people who are new to XML find the claim that you can use XML
to create new markup languages very frustrating—after all, what then? It
turns out that it’s up to you to assign meaning to the new elements you
create, and you can do that in two main ways. First, you can use a style sheet
to indicate to a browser how you want the content of the elements you’ve
created to be formatted.The second way is to use a programming language,
such as Java or JavaScript, to handle the XML document in programming
code.We’ll see both ways throughout this book, and I’ll take a quick look at
them in this chapter as well. I’ll start by adding a style sheet to the XML
document we’ve already created.
There are two main ways of specifying styles when you format XML
documents: with cascading style sheets (CSS) and with the Extensible Style
Sheets Language (XSL).We’ll see both in this book; here, I’ll apply a CSS
style sheet by using the XML processing instruction <?xml-stylesheet
type=”text/css” href=”greeting.css”?>, which tells the browser that the
type of the style sheet I’ll be using is CSS and that its name is greeting.css:
<?xml version=”1.0” encoding=”UTF-8”?>
<?xml-stylesheet type=”text/css” href=”greeting.css”?>
<DOCUMENT>
<GREETING>
Hello From XML
continues .
01 0201 CH01 10/12/00 9:47 AM Page 8
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
Here’s what the contents of the file greeting.css itself looks like. In this case,
I’m customizing the <GREETING> element to display its content in red, cen-
tered in the browser, and in 36-point font.The <MESSAGE> element has been
customized to display its text in black 18-point font.The display: block part
indicates that I want the content of these elements to be displayed in a
block, which translates here to being displayed on individual lines:
GREETING {display: block; font-size: 36pt; color: #FF0000; text-align: center}
MESSAGE (display: block; font-size: 18pt; color: #000000}
You can see the results in two browsers that support XML in Figures 1.3 and
1.4. Figure 1.3 shows greeting.xml in Netscape 6 (available only in a preview
version as of this writing), and Figure 1.4 shows the same document in
Internet Explorer. As you can see, we’ve formatted the document as we like
it—in fact, this result already represents an advance over HTML because we
can format exactly how we want to display the text instead of having to rely
on predefined elements such as <H1>.
That gives us a taste of XML. Now we’ve seen how to create a first XML
document and how to use a style sheet to display it in a browser.We’ve seen
what it looks like, so what’s so great about XML? Take a look at the
overview, coming up next.
In XML, data and markup are stored as text that you can configure. If you
like, you can use XML editors to create XML documents, but if something
goes wrong, you can examine or modify the document directly because its
all just text.The data also is not encoded in some way that has been patented
or copyrighted, so it’s more accessible.
You might think that binary formats would be more efficient because
they can store data more compactly, but that’s not the way things have
worked out. For example, Microsoft Corporation is notorious for turning
out huge applications that store even simple data in huge files (the not-so-
affectionate name for this is “bloatware”). If you store only the letters “abc”
in a Microsoft Word 97 document, you may be surprised to find that the
document is something like 20,000 bytes long. A similar XML file might be
30 or 40 bytes. Even large amounts of data are not necessarily stored effi-
ciently; for instance, Microsoft Excel routinely creates even larger files that
are five times as long as the corresponding text. As we’ll see, XML provides
a very efficient way of storing most data.
In addition, when you standardize markup languages, many different
people can use them. I’ll take a look at that next.
Self-Describing Data
The data in XML documents is self-describing.Take a look at this
document:
<?xml version=”1.0” encoding=”UTF-8”?>
<DOCUMENT>
<GREETING>
Hello From XML
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
Based solely on the names we’ve given to each XML element here, you can
figure out what’s going on:This document has a greeting and a message to
impart. Even if you came back to this document years later, you could figure
out what’s going on.This means that XML documents are, to a large extent,
self-documenting. (We’ll also see in the next chapter that you can add
explicit comments to XML files.)
HTML, but in XML, you actually can build in the semantic rules that specify
the structure of the document so that the document can be checked to make
sure it’s set up correctly.
Take a look at this XML document:
<?xml version=”1.0”?>
<SCHOOL>
<CLASS type=”seminar”>
<CLASS_TITLE>XML In The Real World</CLASS_TITLE>
<CLASS_NUMBER>6.031</CLASS_NUMBER>
<SUBJECT>XML</SUBJECT>
<START_DATE>6/1/2002</START_DATE>
<STUDENTS>
<STUDENT status=”attending”>
<FIRST_NAME>Edward</FIRST_NAME>
<LAST_NAME>Samson</LAST_NAME>
</STUDENT>
<STUDENT status=”withdrawn”>
<FIRST_NAME>Ernestine</FIRST_NAME>
<LAST_NAME>Johnson</LAST_NAME>
</STUDENT>
</STUDENTS>
</CLASS>
</SCHOOL>
Here I’ve set up an XML seminar and added two students to it. As we’ll see
in Chapter 2, “Creating Well-Formed XML Documents,” and Chapter 3,
“Valid XML Documents: Creating Document Type Definitions,” with XML
you can specify, for example, that each <STUDENT> element needs to enclose a
<FIRST_NAME> and a <LAST_NAME> element, that the <START_DATE> element can’t
go in the <STUDENTS> element, and more.
In fact, this emphasis on the correctness of documents is strong in XML.
In HTML, a Web author could (and frequently did) write sloppy HTML,
knowing that the Web browser would take care of any syntax problems
(some Web authors even exploited this intentionally to create special effects
in some browsers). In fact, some people estimate that 50% or more of the
code in modern browsers is there to take care of sloppy HTML in Web
pages. For that kind of reason, the story is different in XML. In XML,
browsers are supposed to check your document; if there’s a problem, they are
not supposed to proceed any further.They should let you know about the
problem, but that’s as far as they’re supposed to go.
01 0201 CH01 10/12/00 9:47 AM Page 13
So, how does an XML browser check your document? XML browsers can
make two main checks: a check to see whether your document is well-formed,
and a check to see whether it’s valid.We’ll see what these terms mean in
more detail in the next chapter, and I’ll go over them briefly here.
We’ll see more about DTDs in Chapter 3, but this DTD indicates that you
can have <GREETING> and <MESSAGE> elements inside a <DOCUMENT> element, that
the <DOCUMENT> element is the root element, and that the <GREETING> and
<MESSAGE> elements can hold text.
Most XML browsers will check XML documents for well-formedness,
but only a few will check for validity. I’ll talk more about where to find
XML validators in the later section “XML Validators.”
We’ve gotten an overview of XML documents now, including how to
display them using style sheets, and what constitutes a well-formed and valid
document. However, this is only part of the story. Many XML documents are
not designed to be displayed in browsers at all, for example, or even if they
are, they’re are not designed to be used with modern style sheets (such as
browsers that convert XML into industry-specific graphics such as molecular
structure, physics equations, or even musical scales).The more powerful use
of XML involves parsing an XML document to break it down into its
component parts and then handling the resulting data yourself. Next I’ll
take a look at a few ways of parsing XML data that are available to us.
Now say that you want to extract the greeting Hello From XML from this
XML document. One way of doing that is by using XML data islands in
Internet Explorer and then using a scripting language, such as JavaScript, to
extract and display the text content of the <GREETING> element. Here’s how
that looks in a Web page:
<HTML>
<HEAD>
<TITLE>
Finding Element Values in an XML Document
</TITLE>
<SCRIPT LANGUAGE=”JavaScript”>
function getData()
{
xmldoc= document.all(“firstXML”).XMLDocument;
nodeDoc = xmldoc.documentElement;
nodeGreeting = nodeDoc.firstChild;
outputMessage = “Greeting: “ +
nodeGreeting.firstChild.nodeValue;
message.innerHTML=outputMessage;
}
</SCRIPT>
</HEAD>
<BODY>
<CENTER>
<H1>
Finding Element Values in an XML Document
</H1>
continues .
01 0201 CH01 10/12/00 9:47 AM Page 16
<DIV ID=”message”></DIV>
<P>
<INPUT TYPE=”BUTTON” VALUE=”Get The Greeting”
ONCLICK=”getData()”>
</CENTER>
</BODY>
</HTML>
This Web page displays a button with the caption “Get The Greeting,” as you
see in Figure 1.5.When you click the button, the JavaScript code in this page
reads in greeting.xml, extracts the text from the <GREETING> element, and
displays that text, as you also can see in Figure 1.5. In this way, you can
see how you can create applications that handle XML documents in a
customized way—even customized XML browsers.
We’ll see more about using JavaScript to work with XML later in this book,
and I’ll also cover how JavaScript itself works first. If you haven’t used
JavaScript before, you won’t have a problem doing so now.
Although JavaScript is useful for lightweight XML uses, most XML
programming is done in Java today, and we’ll take advantage of that in this
book. Here’s an example Java program, readXML.java, using XML4J,
probably the most widely used XML parser, which is free from IBM’s
AlphaWorks.This program also reads greeting.xml and extracts the text
content of the <GREETING> element:
01 0201 CH01 10/12/00 9:47 AM Page 17
import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
When you compile and run this program (you’ll see how to do so in
Chapter 11), the output looks like this. (I’ll use “%” to represent the
command-line prompt in this book; if you’re using UNIX, this prompt
01 0201 CH01 10/12/00 9:47 AM Page 18
may look familiar, or your prompt may look something like /home/steve:.
If you’re using Windows, you get a command-line prompt by opening an
MS-DOS window, and your prompt may look something like C:\XML>.)
%java readXML
Hello From XML
(Note that this program returns all the text in the <GREETING> element,
including the leading spaces.) We’ll see how to use Java to parse XML docu-
ments later in this book, mostly using the IBM AlphaWorks XML4J parser,
which adheres closely to the W3C Document Object Model. I’ll also cover
the Java you need to know before getting into the programming, so if you
haven’t programmed in Java before, that’s no problem.
We now have a good overview of how XML works. It’s time to take a
look at how it’s already working in the real world, starting with an overview
of the XML resources available to you.
XML Resources
Many XML resources are available to you online. Because it’s very important
that you know about them to get a solid background in XML, I list them
here.
The XML specification is defined by W3C, and that’s where you should
start looking for XML resources. Here’s a good starter list (we’ll see all these
topics in this book):
n www.w3c.org/xml.W3C’s main XML site, the starting point for all things
XML.
n www.w3.org/XML/1999/XML-in-10-points. “XML in 10 Points” (actually only
seven), an XML overview.
n www.w3.org/TR/REC-xml.The official W3C recommendation for XML 1.0,
the current (and only) version. It won’t be easy to read, however—that’s
what this book is all about, translating that kind of document to English.
n www.w3.org/TR/xml-stylesheet/. All about using style sheets and XML.
XML Resources 19
Many non-W3C XML resources are out there, too—a casual search for
“XML” on the Web turns up a mere 561,870 pages. Here’s a list to get you
started:
n www.xml.com; XML.com. A site filled with XML resources, discussions, and
notifications of public events.
n www.xml-zone.com. Excellent XML overviews and listings of events.
You’ll also find quite a few XML tutorials online (searching for “XML
Tutorial” brings up more than 500 matches). Here are a few to start with:
n www2.software.ibm.com/developer/education.nsf/xml-onlinecourse-
bytitle.IBM’s free tutorials.
n www.ucc.ie/xml/. A comprehensive frequently asked questions (FAQ) list
about XML, kept up by some of the contributors to the W3C’s XML
Working Group.This is considered by many to be the definitive FAQ on
XML.
n msdn.microsoft.com/xml/tutorial/default.asp. Microsoft’s XML tutorial.
In addition, you might find some newsgroups on Usenet useful (note that
your news server may not carry all these groups):
n comp.text.xml. A good, general-purpose, free-floating XML forum.
XML Editors
To create the XML documents we’ll use in this book, all you need is a text
editor of some kind, such as vi, emacs, pico,Windows Notepad, or Windows
WordPad. By default, XML documents are supposed to be written in
Unicode, although in practice you can write them in ASCII, and nearly all
of them are written that way so far. Just make sure that when you write an
XML document, you save it in your editor’s plain-text format.
XML Editors 21
You can see XML Spy at work in Figure 1.6, XML Writer in Figure 1.7,
XML Notepad in Figure 1.8, and eNotepad in Figure 1.9.
XML Browsers
Creating a true XML browser is not easy. Besides supporting XML, the
browser would have to support a style language such as CSS or XSL. It also
should support a scripting language, such as JavaScript.These are heavy
01 0201 CH01 10/12/00 9:47 AM Page 23
XML Browsers 23
requirements for most third-party vendors, so the true XML browsers out
there are few. In fact, no complete, general XML browsers currently exist.
None of the browsers listed here validate XML documents—they just check
for well-formedness—but a few come close.
Internet Explorer 5
Whether you love or hate Microsoft, the fact remains that Internet Explorer
is the most powerful XML browser available now; you currently can get it at
www.microsoft.com/windows/ie/default.htm.
Internet Explorer can display XML documents directly, as you saw
in Figure 1.2, and also can handle them in scripting languages (JScript,
Microsoft’s version of JavaScript, and Microsoft’s VBScript are supported).
There is good support for style sheets and other features such as the <XML>
element, which lets you create XML data islands into which you can load
XML documents, and ways of binding XML to ActiveX Data Object (ADO)
database recordsets.
Internet Explorer 5.5, in preview at this writing, also supports additional
XML features, such as the XPath specification.There’s no question that
Microsoft’s XML commitment is strong—XML has been integrated even
into the Office 2000 suite of applications—but Microsoft sometimes swerves
significantly from the W3C standards (that is, when Microsoft isn’t writing
those standards itself).
Netscape Navigator 6
Netscape has just released the preview version of Netscape Navigator 6
(available at www.netscape.com/download/previewrelease.html), which has
significant XML support.You can see Netscape Navigator 6 at work back
in Figure 1.3.This preview version is based on Netscape’s open source
Mozilla browser, which you can pick up at www.mozilla.org. Unfortunately,
both Mozilla and the preview version of Netscape Navigator 6 have a
reputation for crashing machines frequently.
As with Internet Explorer, support for style sheets is good in Netscape
Navigator.The preview version of Netscape Navigator 6 also supports the
XML-based User Interface Language (XUL), which lets you configure the
controls in the browser. In fact, the preview version’s user interface is based
on XUL. More XML features will come in Netscape 6, but right now
documentation is virtually nonexistent.
01 0201 CH01 10/12/00 9:47 AM Page 24
Jumbo
One of the more famous true XML browsers that exist is Jumbo, an XML
browser designed to work with XML and the CML.You can pick up Jumbo
for free at www.xml-cml.org/jumbo.html.This browser not only can display
XML (although not with style sheets), but it also can use CML to draw
molecules, as you see in Figure 1.10.
Relatively few real XML browsers exist, but there are a large number of
XML parsers.You can use these parsers to read XML documents and break
them up into their component parts.
XML Parsers
XML parsers are software packages that you use as part of an application
such as Oracle 8i (which includes good XML support) or as part of your
own programs. For example, later in this book, I’ll use the IBM AlphaWorks
XML for Java (XML4J) parser; it is written in Java and connects well to your
own Java code. Here’s a list of some of the parsers out there:
n SAX: The Simple API for XML. Written by David Megginson et al.
(www. megginson.com/SAX/index.html), SAX is a well-known parser that
uses event-based parsing. I’ll use SAX in this book.
01 0201 CH01 10/12/00 9:47 AM Page 25
XML Parsers 25
n expat. This famous XML parser was written in the C programming lan-
guage by James Clark (www.jclark.com/xml/expat.html).This parser is used
in Netscape Navigator 6 and in the Perl language’s XML::Parser module.
n expat as a Perl Module. XML::Parser is maintained by Clark Cooper
(ftp://ftp.perl.org/pub/CPAN/modules/by-module/XML/).
n TclExpat. This is expat written for use in the Tcl programming language
by Steve Ball. Superceded by TclXML. (www.zveno.com/zm.cgi/in-tclxml).
n LT XML. This is an XML developers’ toolkit from the Language
Technology Group at the University of Edinburgh (www.ltg.ed.ac.uk/
software/xml/).
Parsers will break up your document into its component pieces and make
them accessible to other parts of a program; some parsers also check for well-
formedness and fewer check for document validity. However, if you just want
to check whether your XML is both well-formed and valid, all you need is
an XML validator.
XML Validators
How do you know whether your XML document is well-formed and valid?
One way is to check it with an XML validator, and you have plenty to
choose from.Validators are packages that will check your XML and give you
feedback. For example, if you have the XML for Java parser from IBM’s
AlphaWorks installed, you can use the DOMWriter example as a complete
XML validator. Let’s say you wanted to check this document, greeting.xml:
<?xml version=”1.0” encoding=”UTF-8”?>
<DOCUMENT>
<GREETING>
Hello From XML
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
To do this, you’d set things up for the XML4J package (we’ll see how to do
so later in the book) and run the DOMWriter sample on it, like this:
%java dom.DOMWriter greeting.xml
greeting.xml:
[Error] greeting.xml:2:11: Element type “DOCUMENT” must be declared
[Error] greeting.xml:3:15: Element type “GREETING” must be declared
[Error] greeting.xml:6:14: Element type “MESSAGE” must be declared.
<?xml version=”1.0” encoding=”UTF-8”?>
<DOCUMENT>
<GREETING>
Hello From XML
01 0201 CH01 10/12/00 9:47 AM Page 27
XML Validators 27
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
If all goes well, DOMWriter simply displays the document you’ve asked it to
validate, but if there are errors, it will display them. Here, DOMWriter is
indicating that because we haven’t included a DTD in greeting.xml, it can’t
check for the validity of the document.
That’s fine if you have the XML for Java package installed, but more
accessible XML validators are available to you as well. Here’s a list of some of
the XML validators on the Web:
n W3C XML Validator, validator.w3.org/. This is the official W3C
HTML validator. Although it’s officially for HTML, it also includes some
XML support.Your XML document must be online to be checked with
this validator.
n Tidy, www.w3.org/People/Raggett/tidy/. Tidy is a beloved utility for clean-
ing up and repairing Web pages, and it includes limited support for XML.
Your XML document must be online to be checked with this validator.
n www.xml.com/xml/pub/tools/ruwf/check.html. This is XML.com’s XML
validator based on the Lark processor.Your XML document must be
online to be checked with this validator.
n www.ltg.ed.ac.uk/~richard/xml-check.html. This is the Language
Technology Group at the University of Edinburgh’s validator, based
on the RXP parser.Your XML document must be online to be checked
with this validator.
n www.stg.brown.edu/service/xmlvalid/. This is an excellent XML validator
from the Scholarly Technology Group at Brown University.This is the
only online XML validator I know of that allows you to check XML
documents that are not online.You can use the Web page’s file upload
control to specify the name of the file on your hard disk that you want to
have uploaded and checked.
To see a validator at work, take a look at Figure 1.11.There, I’m asking the
XML validator from the Scholarly Technology Group to validate this XML
document, c:\xml\greeting.xml. I’ve intentionally exchanged the order of
the <MESSAGE> and </GREETING> tags:
01 0201 CH01 10/12/00 9:47 AM Page 28
You can see the results in Figure 1.12. As you can see, the validator is
indicating that there is a problem with these two tags.
01 0201 CH01 10/12/00 9:47 AM Page 29
XML validators give you a powerful way of checking your XML documents.
That’s useful because XML is much stricter than HTML about making sure
that a document is correct. (Recall that XML browsers are not supposed to
make attempts to fix XML documents if they find a problem; they’re just
supposed to stop loading the document.)
We’ve gotten a good overview of XML already in this chapter. In a few
pages, I’ll start taking a look at a number of XML languages that are already
developed. But there are a few more useful topics to cover first, especially if
you have programmed in HTML and want to know the differences between
XML and HTML.
The story is different in XML because you create your own elements in
XML.Thus, if you want a browser to display them, you have to tell it how.
This is both good and bad: It’s good because it enables you to use the
powerful CSS and XSL specifications to customize the appearance of your
XML elements far beyond what’s possible with standard HTML. It’s bad
because it can demand a lot of additional work. (One way of getting around
the necessity of designing your own style sheets is to use an established XML
language that has its own style sheets.)
All this is to say that XML defines the structure and semantics of the
document, not its format; if you want to display XML directly, you can either
use the default presentation in Internet Explorer, or use a style sheet to set
up the presentation yourself.
You have two main ways to specify a style sheet for an XML document:
with CSS and with XSL, both of which I’ll dig into in this book. CSS is
popular with those creating HTML documents and is widely supported.
Using CSS, you can specify the formatting of individual elements, create
style classes, set up fonts, use colors, and even specify placement of elements
in the page.
XSL, on the other hand, is ultimately a better choice to work with XML
documents because it’s more powerful (in fact, XSL style sheets themselves
are well-formed XML documents). XSL documents are made up of rules
that are applied to XML documents.When a pattern that you’ve specified in
the XSL document is recognized in the XML document, the rules transform
the matched XML into something entirely new.You can even transform
XML into HTML in this way.
Although CSS can set only the format and placement of elements, XSL
can reorder elements in a document, change them entirely, display some but
hide others, select styles based not just on elements but also on element
attributes (XML elements can have attributes just as HTML elements can,
and I’ll introduce them in the next chapter), select styles based on element
location, and much more.There are two ways to approach XSL: with XSL
transformations and with XSL formatting objects.We’ll take a look at both
in this book.
Here are some good online resources for style sheets that provide a good
reference:
n www.w3.org/Style/CSS/.The W3C outline and overview of CSS program-
ming.
n www.w3.org/TR/REC-CSS1. The W3C CSS1 specification.
01 0201 CH01 10/12/00 9:47 AM Page 31
URLs are well understood and well supported on the Internet today. On
the other hand (as you’d expect, given the addition of XLinks and XPointers
to XML), the idea of URIs is more general than with simple URLs.
URIs let you represent a way of finding resources on the Internet, and
they center more on the resource than the actual location.The idea is that,
in theory, URIs can locate the nearest mirror site for a resource or even
track a document that has been moved from one location to another.
In practice, the concept of URIs is still being developed, and most
software still handles only URLs.
ASCII documents because multiple bytes are needed for many non-ASCII
symbols, and ASCII documents converted to Unicode are two times as long.)
Here’s how to specify the UTF-8 character encoding in an XML document:
<?xml version=”1.0” encoding=”UTF-8”?>
<DOCUMENT>
<GREETING>
Hello From XML
</GREETING>
<MESSAGE>
Welcome to the wild and woolly world of XML.
</MESSAGE>
</DOCUMENT>
The default for XML processors today is to assume that your document is in
UTF-8, so if you omit the encoding specification, UTF-8 is assumed. If
you’re writing XML documents in ASCII, you’ll have no trouble.
Actually, not even Unicode has enough space for all symbols in common
use. A new specification, the Universal Character System (UCS, also called
ISO 10646) uses 4 bytes per symbol, which gives it a range of two billion
symbols, far more than needed.You can specify that you want to use pure
Unicode encoding in your XML documents by using the UCS-2 encoding
(also called ISO-10646-UCS-2), which is compressed 2-byte UCS.You also
can use UTF-16, which is a special encoding that represents UCS symbols
using 2 bytes so that the result corresponds to UCS-2. Straight UCS
encoding is referred to as UCS-4 (also called ISO-10646-UCS-4).
I’ll stick to ASCII for most XML documents in this book because support
for Unicode and UCS is not yet widespread. For example, I know of no true
Unicode editors. On the other hand, you can write documents in a local
character set and use a translation utility to convert them to Unicode, or
you can insert the actual Unicode codes directly into your documents. For
example, the Unicode for π is 03C0 in hexadecimal, so you can insert
π into your document with the character entity (more on entities in the next
chapter) π.
More character sets are available than those mentioned here; for a longer
list, take a look at the list posted by the Internet Assigned Numbers
Authority (IANA) at www.isi.edu/in-notes/iana/assignments/character-sets.
01 0201 CH01 10/12/00 9:47 AM Page 34
XML Applications
We’ve seen a lot of theory in this chapter, so I’m going to spend the rest of
this chapter taking a look at how XML is used in the real world.The world
of XML is huge these days; in fact, XML is now used internally even in
Netscape and Microsoft products, as well as installations of programming
languages such as Perl.You can find a good list of organizations that produce
their own XML-based languages at www.xml.org/xmlorg_catalog.htm.
It’s useful and encouraging to see how XML is being used today in these
XML-based languages. It’s a new piece of terminology: As you know, XML
is a meta-markup language, so it’s actually used to create languages.The lan-
guages so created are applications of XML, so they’re called XML applications.
Note that the term XML application refers to an application of XML to a
specific domain, such as MathML, the mathematics markup language; it does
not refer to a program that uses XML (a fact that causes a lot of confusion
among people who know nothing about XML).
Thousands of XML applications exist today, and we’ll see some of them
here.You can see the advantage to various groups (such as physicists or
chemists) for defining their own markup languages, allowing them to use
the symbols and graphics of their discipline in customized browsers. I’ll start
with discussing CML.
XML Applications 35
I’ve already mentioned Jumbo, a famous CML browser that you can
download for free from www.xml-cml.org/jumbo.html. Jumbo not only works
for handling CML, but you also can use it to display the structure of an
XML document in general. However, there’s no question that the novelty
of Jumbo is that it can use CML to create graphical representations of
molecules.
We’ve already seen an example in Jumbo in Figure 1.10, in which Jumbo
is displaying the molecule thiophenol. Here is the file, thiophenol.xml, that
it’s reading to display that molecule (this document is an example that comes
with the Jumbo browser):
<?jumbo:namespace ns=”http://www.xml-cml.org” prefix=”C”
java=”jumbo.cmlxml.*Node” ?>
<C:molecule id=”thiophenol”>
<C:atomArray builtin=”elsym”>
C C C C C C C S C C O O
</C:atomArray>
<C:atomArray builtin=”x2” type=”float”>
0 0.866 0.866 0 -0.866 -0.866
0.0 0.0 1.732 -1.732 1.732 -1.732
</C:atomArray>
<C:atomArray builtin=”y2” type=”float”>
1 0.5 -0.5 -1.0 -0.5 0.5
-2.0 2.0 1.0 1.0 2.0 2.0
</C:atomArray>
<C:bondArray builtin=”atid1”>
1 2 3 4 5 6 1 4 2 9 6 10
</C:bondArray>
<C:bondArray builtin=”atid2”>
2 3 4 5 6 1 8 7 9 11 10 12
</C:bondArray>
<C:bondArray builtin=”order” type=”integer”>
4 4 4 4 4 4 1 1 1 2 1 2
</C:bondArray>
</C:molecule>
01 0201 CH01 10/12/00 9:47 AM Page 36
XML Applications 37
</m:mrow>
<m:mo>=</m:mo>
<m:mn>0</m:mn>
</m:mrow>
</math>
You can see the results of this document in the Amaya browser shown in
Figure 1.13.
Here’s the way CDF works: you add a link to a .cdf file to a Web page,
and then you give the link text, something like “Subscribe to this channel!”
If the user is using Internet Explorer and clicks the hyperlink to navigate to
the CDF file, Internet Explorer adds the site to the user’s Favorites folder
and subscribes to the channel, checking back periodically for updates.
Here’s an example:This document, w3c.cdf, lets the user subscribe to the
W3C XML and XSL pages:
<?xml version=”1.0”?>
<CHANNEL HREF=”http://www.w3.org/”>
<TITLE>World Wide Web Consortium</TITLE>
<ABSTRACT>
Leading the Web to its Full Potential
</ABSTRACT>
<ITEM HREF=”http://www.w3.org/XML/”>
<TITLE>Extensible Markup Language (XML)
</TITLE>
<ABSTRACT>
The Extensible Markup Language (XML) is the universal
format for structured documents and data on the Web.
</ABSTRACT>
</ITEM>
<ITEM HREF=”http://www.w3.org/Style/XSL/”>
<TITLE>Extensible Stylesheet Language (XSL)
</TITLE>
<ABSTRACT>
Extensible Stylesheet Language (XSL) is a
language for expressing style sheets.
</ABSTRACT>
</ITEM>
</CHANNEL>
You can see the results in Figure 1.14.When the user opens this .cdf file, a
W3C channel is added to the Internet Explorer’s Favorites folder. (You can
see the current channels in Internet Explorer by clicking the Favorites but-
ton in the standard toolbar and then double-clicking the Channels folder in
the Favorites frame that opens at left.)
01 0201 CH01 10/12/00 9:47 AM Page 39
XML Applications 39
<?xml version=”1.0”?>
<!DOCTYPE smil PUBLIC “-//W3C//DTD SMIL 1.0//EN”
“http://www.w3.org/TR/REC-smil/SMIL10.dtd”>
<smil>
<body>
<seq id=”mozart”>
<audio src=”mozart1.wav”/>
<video src=”amadeus1.mov”/>
<text src=”mozart1.htm”/>
<audio src=”mozart2.wav”/>
<video src=”amadeus2.mov”/>
<text src=”mozart2.htm”/>
</seq>
</body>
</smil>
XML Applications 41
<BODY>
<DIV CLASS=”time” t:REPEAT=”5” t:DUR=”10” t:TIMELINE=”par”>
<DIV CLASS=”time” t:BEGIN=”0” t:DUR=”10”>Hello</DIV>
<DIV CLASS=”time” t:BEGIN=”2” t:DUR=”10”>there</DIV>
<DIV CLASS=”time” t:BEGIN=”4” t:DUR=”10”>from</DIV>
<DIV CLASS=”time” t:BEGIN=”6” t:DUR=”10”>HTML+TIME.</DIV>
</DIV>
</BODY>
</HTML>
You can see the results of this HTML+TIME document in Figure 1.15.
XML Applications 43
<?xml version=”1.0”?>
<!DOCTYPE html PUBLIC “-//W3C//DTD XHTML 1.0 Transitional//EN”
“http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd”>
<html xmlns=”http://www.w3.org/1999/xhtml” xml:lang=”en” lang=”en”>
<head>
<title>
Web page number one!
</title>
</head>
<body>
<h1>
Welcome to XHTML!
</h1>
<center>
This is simple text that appears in this page.
<p>
Here’s a new paragraph!
</p>
</center>
</body>
</html>
You can see the results of this XHTML in Figure 1.16.Writing XHTML is a
lot like HTML, except that you must adhere to XML syntax (such as making
sure that every element has a closing tag).
XML Applications 45
<HEAD>
<TITLE>
Using Vector Markup Language
</TITLE>
<STYLE>
v\:* {behavior: url(#default#VML);}
</STYLE>
</HEAD>
continues .
01 0201 CH01 10/12/00 9:47 AM Page 46
<BODY>
<CENTER>
<H1>
Using Vector Markup Language
</H1>
</CENTER>
<P>
<v:oval STYLE=’width:100pt; height:75pt’
fillcolor=”yellow”> </v:oval>
<P>
<v:rect STYLE=’width:100pt; height:75pt’ fillcolor=”blue”
strokecolor=”red” STROKEWEIGHT=”2pt”/>
<P>
<v:polyline
POINTS=”20pt,55pt,100pt,-10pt,180pt,65pt,260pt,25pt”
strokecolor=”red” STROKEWEIGHT=”2pt”/>
</BODY>
</HTML>
XML Applications 47
Here’s a sample XUL document, scroll.xul, that adds scrollbars around the
browser’s document display area. (For formatting reasons, I have to break up
the expression
“http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul” to two
lines in this book; make sure you rejoin that text into a quoted text string on
one line before giving this a try).
<?xml version=”1.0”?>
<window align=”horizontal”
xmlns=
“http://www.mozilla.org/keymaster/gatekeeper/
there.is.only.xul”>
<scrollbar align=”vertical”/>
<box align=”vertical” flex=”100%”>
<scrollbar align=”horizontal”/>
<spring flex=”100%” style=”background-color: white”/>
<scrollbar align=”horizontal”/>
</box>
<scrollbar align=”vertical”/>
</window>
You can see the results of this XUL document in Figure 1.18.
01 0201 CH01 10/12/00 9:47 AM Page 48
XML Applications 49
<item period=”1997-06-01/1998-07-31”>0.12</item>
<item period=”1997-09-01/1997-11-30”>0.16</item>
<item period=”1997-12-01/1998-02-28”>0.17</item>
<item period=”1998-03-01/1998-05-31”>-0.12</item>
<item period=”1998-06-01/1998-05-31”>0.33</item>
</group>
<group type=”gpsi:detail.quarterly” period=”1999-05-31”>
<item period=”1998-06-01/1998-08-31”>0.15</item>
<item period=”1998-09-01/1998-11-30”>0.20</item>
<item period=”1998-12-01/1999-02-28”>0.23</item>
<item period=”1999-03-01/1999-05-31”>0.28</item>
<item period=”1998-06-01/1999-05-31”>0.86</item>
</group>
<group type=”gpsi:detail.quarterly” period=”1998-05-31”>
<item period=”1997-06-01/1998-07-31”>0.11</item>
<item period=”1997-09-01/1997-11-30”>0.15</item>
<item period=”1997-12-01/1998-02-28”>0.17</item>
<item period=”1998-03-01/1998-05-31”>-0.12</item>
<item period=”1998-06-01/1998-05-31”>0.32</item>
</group>
<RDF:RDF xmlns:RDF=”http://www.w3.org/1999/02/22-rdf-syntax-ns#”
xmlns:DC=”http://purl.org/DC/”>
<RDF:Description about=”http://www.starpowder.com/xml”>
<DC:Format>HTML</DC:Format>
<DC:Language>en</DC:Language>
<DC:Date>2002-02-02</DC:date>
<DC:Type>tutorial</DC:Type>
<DC:Title>Welcome to XML!</DC:Title>
</RDF:Description>
</RDF:RDF>
Note that many more XML applications exist than can be covered in one
chapter—and plenty of them work behind the scenes. For example,
Microsoft Office 2000 can handle HTML as well as other types of docu-
ments, but because HTML doesn’t allow it to store everything it needs in a
document, Office 2000 also includes some XML behind the scenes (in fact,
Office 2000’s vector graphics are done using VML). Even relatively early ver-
sions of Netscape Navigator allowed you to look for sites much like the cur-
rent one you’re viewing; to do that, it connected to a CGI program that uses
XML internally. As you can see, XML is all around, everywhere you look on
the Internet.
That’s it for our overview chapter.We’ve received a solid foundation in
XML here.The next step is to start getting all the actual ground rules for
creating XML documents under our belts. I’ll cover that in Chapter 2.