What Is XML
• XML was designed to transport and store data, with focus on what data is.
• HTML was designed to display data, with focus on how data looks.
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The note above is quite self-descriptive. It has sender and receiver information, and it also has a heading and a
message body.
But still, this XML document does not DO anything. It is just pure information wrapped in tags. Someone
must write a piece of software to send, receive or display it.
However, XML-aware applications can handle the XML tags specially. The functional meaning of the tags
depends on the nature of the application.
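As a minimal sketch of such a piece of software, the Python snippet below reads the note and displays it as plain text. It uses the standard library module xml.etree.ElementTree. The names NOTE_XML and display_note are hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# The note from the example above, held in a string for simplicity.
NOTE_XML = """<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def display_note(xml_text):
    """Return a plain-text rendering of a <note> document."""
    note = ET.fromstring(xml_text)
    # findtext returns the text of the first child element with that tag.
    lines = [
        note.findtext("heading"),
        "To: " + note.findtext("to"),
        "From: " + note.findtext("from"),
        note.findtext("body"),
    ]
    return "\n".join(lines)


print(display_note(NOTE_XML))
```

The XML itself stays passive. All behaviour lives in the program that reads it.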
The tags used in HTML (and the structure of HTML) are predefined. HTML documents can only use tags
defined in the HTML standard (like <p>, <h1>, etc.).
XML allows authors to define their own tags and their own document structure.
It is important to understand that XML is not a replacement for HTML. In most web applications, XML is used
to transport data, while HTML is used to format and display the data.
XML is Everywhere
We have been participating in XML development since its creation. It has been amazing to see how quickly
the XML standard has developed, and how quickly a large number of software vendors have adopted the
standard.
XML is now as important for the Web as HTML was to the foundation of the Web.
XML is everywhere. It is the most common tool for data transmission between all sorts of applications, and
is becoming more and more popular in the area of storing and describing information.
XML is used in many aspects of web development, often to simplify data storage and
sharing.
With XML, data can be stored in separate XML files. This way you can concentrate on using HTML for layout
and display, and be sure that changes in the underlying data will not require any changes to the HTML.
With a few lines of JavaScript, you can read an external XML file and update the data content of your HTML.
You will learn more about this in a later chapter of this tutorial.
XML data is stored in plain text format. This provides a software- and hardware-independent way of storing
data.
This makes it much easier to create data that different applications can share.
One of the most time-consuming challenges for developers is to exchange data between incompatible
systems over the Internet.
Exchanging data as XML greatly reduces this complexity, since the data can be read by different
incompatible applications.
XML data is stored in text format. This makes it easier to expand or upgrade to new operating systems, new
applications, or new browsers, without losing data.
Different applications can access your data, not only in HTML pages, but also from XML data sources.
With XML, your data can be made available to all kinds of "reading machines" (handheld computers, voice
machines, news feeds, etc.), and more accessible to blind people or people with other disabilities.
XML is Used to Create New Internet Languages
A lot of new Internet languages are created with XML.
The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to
use.
In HTML, some elements may be written without a closing tag:
<p>This is a paragraph
<p>This is another paragraph
In XML, it is illegal to omit the closing tag. All elements must have a closing tag:
<p>This is a paragraph</p>
<p>This is another paragraph</p>
Note: The XML declaration, such as <?xml version="1.0"?>, does not have a closing tag. This is not an
error: the declaration is not an element of the XML document itself.
XML tags are case sensitive. With XML, the tag <Letter> is different from the tag <letter>.
Opening and closing tags must be written with the same case:
<Message>This is incorrect</message>
<message>This is correct</message>
Note: "Opening and closing tags" are often referred to as "Start and end tags". Use whatever you prefer. It
is exactly the same thing.
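XML elements must be properly nested. "Properly nested" simply means that if an <i> element is opened inside a <b>
element, it must also be closed inside the <b> element.
XML documents must also contain exactly one root element that is the parent of all other elements: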
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
In XML the attribute value must always be quoted. Study the two XML documents below. The first one is
incorrect, the second is correct:
<note date=12/11/2007>
<to>Tove</to>
<from>Jani</from>
</note>
<note date="12/11/2007">
<to>Tove</to>
<from>Jani</from>
</note>
The error in the first document is that the date attribute in the note element is not quoted.
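The syntax rules in this section can be checked mechanically. The following sketch feeds a few samples to the parser in Python's standard xml.etree.ElementTree module and records whether each one is accepted. The helper is_well_formed and the SAMPLES table are hypothetical names, and the <root> wrapper in one sample is added only so that the sample has a single root element.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


SAMPLES = {
    "unquoted attribute": "<note date=12/11/2007><to>Tove</to><from>Jani</from></note>",
    "quoted attribute": '<note date="12/11/2007"><to>Tove</to><from>Jani</from></note>',
    "mismatched case": "<Message>This is incorrect</message>",
    "missing closing tag": "<root><p>This is a paragraph<p>This is another paragraph</root>",
    "improper nesting": "<b><i>This text is bold and italic</b></i>",
}

# Map each label to the parser's verdict.
results = {label: is_well_formed(text) for label, text in SAMPLES.items()}

for label, ok in results.items():
    print(label, "->", "well formed" if ok else "rejected")
```

Only the sample with the quoted attribute is accepted. The parser rejects the others before any application code sees the data.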
Entity References
Some characters have a special meaning in XML.
If you place a character like "<" inside an XML element, it will generate an error because the parser
interprets it as the start of a new element.
To avoid this error, replace the "<" character with the entity reference &lt; wherever it appears in the text.
Note: Only the characters "<" and "&" are strictly illegal in XML. The greater than character is legal, but it is
a good habit to replace it.
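Programs that generate XML usually do this replacement automatically. The sketch below uses escape from Python's standard xml.sax.saxutils module, which replaces "&", "<" and ">" with entity references. The sample text and the <message> element name are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical text that contains a character with special meaning in XML.
raw_text = "if salary < 1000 then"

# escape() replaces &, < and > with entity references.
escaped = escape(raw_text)
document = "<message>" + escaped + "</message>"

# The parser turns the entity reference back into the original character.
parsed = ET.fromstring(document)

print(escaped)
print(parsed.text)
```

The escaped form is what is stored in the file. Applications reading the document through a parser get the original text back.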
Comments in XML
The syntax for writing comments in XML is similar to that of HTML: <!-- This is a comment -->
XML Elements
An element can contain other elements, simple text or a mixture of both. Elements can also have attributes.
<bookstore>
<book category="CHILDREN">
<title>Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
In the example above, <bookstore> and <book> have element contents, because they contain other
elements. <author> has text content because it contains text.
In the example above only <book> has an attribute (category="CHILDREN").
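To show how a program sees element content, text content and attributes, here is a short sketch that walks the bookstore document with Python's standard xml.etree.ElementTree module. The function name list_books is hypothetical.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
<book category="CHILDREN">
<title>Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>"""


def list_books(xml_text):
    """Return (category, title, price) for every <book> in the bookstore."""
    bookstore = ET.fromstring(xml_text)
    books = []
    for book in bookstore.findall("book"):
        category = book.get("category")        # attribute of <book>
        title = book.findtext("title")         # text content of <title>
        price = float(book.findtext("price"))  # text content, converted
        books.append((category, title, price))
    return books


for entry in list_books(BOOKSTORE_XML):
    print(entry)
```

Attributes are read with get, while child elements are reached with find, findall or findtext.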
Names should be short and simple, like this: <book_title> not like this: <the_title_of_the_book>.
Avoid "-" characters. If you name something "first-name," some software may think you want to subtract
name from first.
Avoid "." characters. If you name something "first.name," some software may think that "name" is a
property of the object "first."
Avoid ":" characters. Colons are reserved to be used for something called namespaces (more later).
XML documents often have a corresponding database. A good practice is to use the naming rules of your
database for the elements in the XML documents.
Non-English letters like éòá are perfectly legal in XML, but watch out for problems if your software vendor
doesn't support them.
<note>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>
Let's imagine that we created an application that extracted the <to>, <from>, and <body> elements from
the XML document to produce this output:
MESSAGE
To: Tove
From: Jani
Don't forget me this weekend!
Imagine that the author of the XML document added some extra information to it:
<note>
<date>2008-01-10</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
The application should not break or crash. It should still be able to find the <to>, <from>, and <body>
elements in the XML document and produce the same output.
One of the beauties of XML is that it can often be extended without breaking applications.
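The point can be demonstrated with a small sketch in Python, using the standard xml.etree.ElementTree module. The function extract_message is a hypothetical stand-in for the application described above. It looks elements up by name, so extra elements are simply ignored.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<note>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>"""

EXTENDED = """<note>
<date>2008-01-10</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def extract_message(xml_text):
    """Pick out only the elements the application knows about."""
    note = ET.fromstring(xml_text)
    return {tag: note.findtext(tag) for tag in ("to", "from", "body")}


# Both versions of the document yield the same result.
print(extract_message(ORIGINAL) == extract_message(EXTENDED))
```

An application that instead relied on element positions (for example "the first child is <to>") would break when <date> was added, which is why lookup by name is the safer design.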
XML Attributes
XML elements can have attributes in the start tag, just like HTML.
From HTML you will remember this: <img src="computer.gif">. The "src" attribute provides additional
information about the <img> element.
<img src="computer.gif">
<a href="demo.asp">
Attributes often provide information that is not a part of the data. In the example below, the file type is
irrelevant to the data, but important to the software that wants to manipulate the element:
<file type="gif">computer.gif</file>
XML Attributes Must be Quoted
Attribute values must always be enclosed in quotes, but either single or double quotes can be used. For a
person's sex, the person tag can be written like this:
<person sex="female">
or like this:
<person sex='female'>
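If the attribute value itself contains double quotes, you can enclose it in single quotes, or use the &quot;
entity reference for the inner quotes.
Data such as a person's sex can be stored either in an attribute or in a child element. Compare these two examples: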
<person sex="female">
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>
<person>
<sex>female</sex>
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>
In the first example sex is an attribute. In the last, sex is an element. Both examples provide the same
information.
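From a program's point of view the two forms are read through different calls. The sketch below, which uses Python's standard xml.etree.ElementTree module, reads the value from either form. The helper read_sex is a hypothetical name.

```python
import xml.etree.ElementTree as ET

AS_ATTRIBUTE = """<person sex="female">
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>"""

AS_ELEMENT = """<person>
<sex>female</sex>
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>"""


def read_sex(xml_text):
    """Return the person's sex, whether stored as attribute or element."""
    person = ET.fromstring(xml_text)
    value = person.get("sex")            # attribute form
    if value is None:
        value = person.findtext("sex")   # element form
    return value


print(read_sex(AS_ATTRIBUTE), read_sex(AS_ELEMENT))
```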
There are no rules about when to use attributes and when to use elements. Attributes are handy in HTML. In
XML my advice is to avoid them. Use elements instead.
My Favorite Way
The following three XML documents contain exactly the same information:
<note date="10/01/2008">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note>
<date>10/01/2008</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note>
<date>
<day>10</day>
<month>01</month>
<year>2008</year>
</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Attributes are difficult to read and maintain. Use elements for data. Use attributes for information that is not
relevant to the data.
<messages>
<note id="501">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note id="502">
<to>Jani</to>
<from>Tove</from>
<heading>Re: Reminder</heading>
<body>I will not</body>
</note>
</messages>
The ID above is just an identifier, to identify the different notes. It is not a part of the note itself.
What I'm trying to say here is that metadata (data about data) should be stored as attributes, and that data
itself should be stored as elements.
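An identifier stored as an attribute is convenient for lookups. As a sketch, the following Python code builds an index of the notes keyed by their id attribute, using the standard xml.etree.ElementTree module. The function name index_notes is hypothetical.

```python
import xml.etree.ElementTree as ET

MESSAGES_XML = """<messages>
<note id="501">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note id="502">
<to>Jani</to>
<from>Tove</from>
<heading>Re: Reminder</heading>
<body>I will not</body>
</note>
</messages>"""


def index_notes(xml_text):
    """Map each note's id attribute to its <note> element."""
    root = ET.fromstring(xml_text)
    return {note.get("id"): note for note in root.findall("note")}


notes = index_notes(MESSAGES_XML)
print(notes["502"].findtext("heading"))
```

The id is used only to locate a note. The data of the note itself is still read from its child elements.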
XML Validation
A DOCTYPE declaration in an XML document either refers to an external DTD file or, as in the example
below, contains the DTD declarations inline.
XML DTD
The purpose of a DTD is to define the structure of an XML document. It defines the structure with a list of
legal elements:
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
XML Schema
W3C supports an XML-based alternative to DTD, called XML Schema:
<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
An introduction to XML
Originally, the intention with HTML was that the elements should be used to
mark up information according to its meaning, without regard to how this
would actually be rendered in a browser. In other words: title, main header,
emphasized text and the contact information of the author should be placed
inside the elements TITLE, H1, EM (or possibly STRONG) and ADDRESS. Using
FONT, I and similar elements to get a nice layout makes it a lot more
difficult to present the information to the best possible effect regardless of the
user's environment. Processing the information automatically also becomes
difficult (or even impossible). (See reference 1.)
The reason why the browser should decide for itself how to display title and
headers etc. is that it knows a lot more about the user's preferences and
environment and so can make decisions based on that. The author, not
knowing his reader, cannot do this as well, of course. This is especially useful
for people who are blind, run non-graphical browsers or who have weak
eyesight, and therefore need larger font sizes. This means that an author who
doesn't follow the rules will cause problems for those of the readers who read
in a non-standard environment.
The result is that a lot of pages on the web now contain tagging that's written
for a specific version of a specific browser (with default preferences) and a
specific screen resolution. These pages are often more or less unreadable to
those who use something else. Thus, HTML has gradually been turned into a
presentational language for Netscape and MSIE by the vendors and their
users.
This, however, is not the only problem. If you want to mark up your
information really precisely according to its meaning you'll want lots of
elements that just aren't present in HTML. If you are, say, a chemist, you'll
probably want special elements for chemical formulas, for measurement data
and so on. If you are an airplane manufacturer you'll want to be able to talk
about engines, parts and models. Catering to the needs of all trades and
people will obviously mean having an enormous amount of elements, which is
quite simply a Bad Thing for both developers and users.
Another problem is that HTML has very little internal structure, which means
that you can easily write valid HTML that does not make sense at all when
you consider the semantics of the elements. This is because (among other
things) the contents of BODY have been defined so that you can place the
elements allowed therein in any order you please. This means that you don't
need an H1 with the H2s inside it and H3s inside the H2s. (Think of H1 as a
book title, H2 as part title and H3 as chapter title.) HTML should ideally be
written this way, but the HTML standard does not require it. (See references 1
and 3.)
People have been aware of these problems for quite some time, and in the
summer of '96 the W3C (which defines the web standards) started work on a
new standard to deal with these problems. The W3C has set up a working
group that is now creating this new standard called XML, for eXtensible
Markup Language. The working group (from now on called XWG, for XML
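working group) has split its work into three phases:
Phase 1
Define XML itself, the standard for creating markup languages.
Phase 2
Define a standard for linking between documents.
Phase 3
Develop a common standard for specifying the layout of documents
encoded in these languages.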
Phase 1 is now completed, since the XML 1.0 specification is now finished.
Phase 2 is still under way, although there is a working draft. Phase 3 has not
yet reached that stage, as only a proposal exists so far.
XML
Please note that the descriptions given below are simplified and only meant to
give an impression of XML. They leave out a lot of the standards and are (for
reasons of readability) a little inaccurate. If you want more detailed and
accurate information you should go on to read the appendices below. Also
note that these standards are not finalized yet, so that they may change
before they're officially accepted. As a first introduction, however, this
document should be useful.
XML itself
There already exists a standard for defining markup languages like HTML,
which is called SGML. HTML is actually defined in SGML. SGML could have
been used as this new standard, and browsers could have been extended
with SGML parsers. However, SGML is quite complex to implement and
contains a lot of features that are very rarely used. Its support for different
character sets is also a bit weak, which is something that can cause problems
on the web where people use many different kinds of computers and
languages. It's also difficult to interpret an SGML document without having the
definition of the markup language (the DTD) available. Because of this, the
XWG decided to develop a simplified version of SGML, which they called
XML. (As they like to say, XML is more like SGML light, than HTML++.)
The main point of XML is that you, by defining your own markup language,
can encode the information of your documents much more precisely than is
possible with HTML. This means that programs processing these documents
can "understand" them much better and therefore process the information in
ways that are impossible with HTML (or ordinary text processor documents).
Imagine that you marked up recipes (for, say, soups and sea food dishes etc)
according to a DTD tailored for recipes where you entered the amounts of
each ingredient and alternatives for some ingredients. You could then easily
make a program that, given a list of the contents of your fridge, would go
through the entire list of recipes and make a list of the dishes you could make
with them. Given nutritional information about the ingredients (x calories per
ounce of this, y calories per ounce of that, etc.) the program could sort the
suggestions by the amount of calories in each dish. Or by how long they'd
take to prepare, or the price (given price information for the ingredients), or...
The possibilities are almost endless, because the information is encoded in a
way that the computer can "understand".
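A minimal sketch of the fridge program described above, written in Python with the standard xml.etree.ElementTree module, could look as follows. The recipe markup (recipes, recipe, ingredient, and the name attribute) is invented for this illustration and is not taken from any real recipe DTD. Amounts, alternatives and nutritional data are left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe markup, assumed only for this sketch.
RECIPES_XML = """<recipes>
  <recipe name="Tomato soup">
    <ingredient>tomato</ingredient>
    <ingredient>onion</ingredient>
  </recipe>
  <recipe name="Fish stew">
    <ingredient>cod</ingredient>
    <ingredient>potato</ingredient>
    <ingredient>onion</ingredient>
  </recipe>
</recipes>"""


def dishes_from_fridge(xml_text, fridge):
    """Return the names of recipes whose ingredients are all in the fridge."""
    root = ET.fromstring(xml_text)
    available = set(fridge)
    dishes = []
    for recipe in root.findall("recipe"):
        needed = {item.text for item in recipe.findall("ingredient")}
        # The dish is possible only if nothing needed is missing.
        if needed <= available:
            dishes.append(recipe.get("name"))
    return dishes


print(dishes_from_fridge(RECIPES_XML, ["tomato", "onion", "potato"]))
```

Because each ingredient is marked up as such, the program never has to guess which words in the recipe text are ingredients.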
Defining your own markup language with XML is actually surprisingly simple.
If you wanted to make a markup language for FAQs you might want it to be
used like this: (note that this example is really too simple to be very useful)
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE FAQ SYSTEM "FAQ.DTD">
<FAQ>
<INFO>
<SUBJECT> XML </SUBJECT>
<AUTHOR> Lars Marius Garshol</AUTHOR>
<EMAIL> [email protected] </EMAIL>
<VERSION> 1.0 </VERSION>
<DATE> 20.jun.97 </DATE>
</INFO>
<PART NO="1">
<Q NO="1">
<QTEXT>What is XML?</QTEXT>
<A>SGML light.</A>
</Q>
<Q NO="2">
<QTEXT>What can I use it for?</QTEXT>
<A>Anything.</A>
</Q>
</PART>
</FAQ>
In XML, the markup language shown above (let's call it FAQML) would have a DTD
like this:
<!ELEMENT FAQ (INFO, PART+)>
<!ELEMENT> is used to define elements like this: <!ELEMENT NAME CONTENTS>. NAME
gives the name of the element, and CONTENTS describes which elements are
allowed, and in what order, inside the element being defined. A,B means that you
must have an A first, followed by a B. ? after an element means that it can be
skipped, + means that it must be included one or more times and * means that
it can be skipped or included one or more times. #PCDATA means ordinary text
without markup (more or less).
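The operators in a content model behave much like those of regular expressions. As a rough sketch of that correspondence, the Python code below translates a flat content model such as (INFO, PART+) into a regular expression and tests an element's children against it. This is not how a validating parser is implemented, and the function names model_to_regex and children_match are hypothetical. Nested groups, "|" choices and #PCDATA are deliberately left out.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate a flat content model like "(INFO, PART+)" into a regex.

    Only comma-separated names with an optional ?, + or * suffix are
    handled; anything more complex is outside this sketch.
    """
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "?+*" else ""
        name = item.rstrip("?+*")
        # Each child is represented as its name followed by one space.
        parts.append("(?:" + re.escape(name) + " )" + suffix)
    return re.compile("".join(parts))


def children_match(element, model):
    """Check the element's direct children against the content model."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(names) is not None


good = ET.fromstring("<FAQ><INFO/><PART/><PART/></FAQ>")
bad = ET.fromstring("<FAQ><INFO/></FAQ>")  # PART+ requires at least one PART

print(children_match(good, "(INFO, PART+)"))
print(children_match(bad, "(INFO, PART+)"))
```

The comma becomes plain concatenation in the regular expression, and ?, + and * keep their usual meaning.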
<!ATTLIST> defines the attributes of an element. In the DTD given above it's
used to give PART and Q an attribute called NO, which contains ordinary text
and which can be skipped. As you can see, PART has two attributes, and the
last one is called TITLE, contains text and can be skipped.
Linking in XML
XML links can be between two or more resources, and resources can be
either files (and not necessarily XML or HTML files) or elements in files. Links
can be specified with the ACTUATE-attribute to be followed either (if the value
is USER) when the user explicitly requests this (for instance by clicking) or
(value AUTO) automatically (i.e., when the system reads the linking element).
What happens when you follow the link is specified with SHOW, which can
take the following values:
EMBED
This means that the resource the link points to is to be inserted into the
document the link comes from. This will happen either during the
displaying of the document or during processing of the document. This
can be useful for including text from other files (with ACTUATE=AUTO)
or to include a picture in a page. It can also be used to insert footnotes
into the text and ACTUATE will then specify if the user has to click on
the footnotes to include them or whether all footnotes will be inserted
automatically.
REPLACE
This means that the resource the link points to is to replace the linking
element. If you have two different versions of a paragraph you can link
them in such a way that one can see the other version in the same
context by following the link.
NEW
In this case, following the link will not affect the resource the link came
from. Instead, the linked resource will be processed/displayed in a new
context. Ordinary HTML links are of type NEW, as the new page is
displayed in place of the previous one.
XML is even more advanced than this. Links can be between more than one
resource, they can be specified outside the actual documents themselves and
the linked-to element inside a resource can be specified in very powerful
ways. The element can be identified with an ID-attribute, position in the
element structure and one can even specify that the link goes to things like
"fourth LI inside the first UL inside BODY".
In FAQML this could have been used both for specifying links to relevant
information outside the FAQ as well as specifying internal relationships
between different answers. It could also have been used for footnotes etc.
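To give a feeling for addressing an element by its position in the element structure, such as the "fourth LI inside the first UL inside BODY" mentioned above, here is a sketch in Python. It does not use the XML linking syntax itself. It uses the limited path syntax supported by the standard xml.etree.ElementTree module, and the sample document is made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A small, made-up document with two lists.
DOC = """<HTML><BODY>
<UL><LI>one</LI><LI>two</LI><LI>three</LI><LI>four</LI></UL>
<UL><LI>other</LI></UL>
</BODY></HTML>"""

root = ET.fromstring(DOC)

# Positions in the path are 1-based: first UL inside BODY, fourth LI inside it.
target = root.find("BODY/UL[1]/LI[4]")

print(target.text)
```

The address is relative to the root element HTML, so the path starts at BODY.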
There is actually already an SGML standard for style sheets as well. It's called
DSSSL, and it isn't very simple, either. The XWG has therefore decided to make
a simplified version of DSSSL as well and call it XSL. So far, not much has
been done about this. One proposal (see references) has been submitted,
but it hasn't been accepted yet, and it's uncertain if it will be. So, I'm going to
describe DSSSL instead of XSL, at least until the future shape of XSL
becomes clearer.
Below I try to show how we could make a stylesheet for FAQML, but without
explaining very much of what really happens. I've split the DSSSL file into
several parts in order to be able to comment it as it's written, but it is meant to
be a single file.
DSSSL consists of several different parts, and the most basic one is the
expression language which is quite simply a subset of Scheme. This means
that DSSSL-stylesheets are really one large Scheme expression that is
calculated by the DSSSL engine, with a file as the result of the calculation.
Another important part (which is built on the expression language) is the style
language, which I've used almost exclusively in this example. A third part is
the query language, which can be used to find any element you want in your
document. I've used it in this example to find the number of a FAQ question
from inside the QTEXT element. This was necessary because NO is an
attribute of the surrounding Q element, and not QTEXT itself.
All formatting in DSSSL is done with so-called flow objects. In the code below
you'll see a lot of (element X (make Y ...)) expressions, which indicate that when
element X shows up the DSSSL engine is to create a flow object of type Y.
Then style rules for Y and then the contents of Y are specified. There's much
more to DSSSL than this, but the rest is considered to be outside the scope of
this document.
<!doctype style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">
;---Constants
The first line tells the SGML parser that this document follows the DTD for
DSSSL. (Yes, DSSSL is an SGML application.) The next two lines are
comments (after ; the rest of the line is ignored). Then I define two constants
that I use below in the styles themselves. This is done to make it easy to
change the font size of the entire document without having to adjust sizes for
all kinds of headers etc. Instead, I just change the value of *font-size*.
;---Element styles
(element FAQ
(make simple-page-sequence
font-family-name: *font*
font-size: *font-size*
input-whitespace-treatment: 'collapse
line-spacing: (* *font-size* 1.2)
(process-children)))
This part creates a flow object for the FAQ element, i.e. the whole document.
The flow object is "simple-page-sequence", which I assume is meant for small
articles. I then specify what font to use, font size, that whitespace is to be
considered insignificant (like in HTML) and then I give the line height. The line
height is set to be 1.2 times the font size.
(element INFO
(make paragraph
quadding: 'center
space-after: (* *font-size* 1.5)
(process-children)))
This indicates that the element INFO (from start-tag to end-tag) is to be laid
out as a paragraph that is centered and has a blank space as high as 1.5 lines
after it. After creating the paragraph flow object the DSSSL engine is to go on
to process the child elements of INFO.
(element SUBJECT
(make paragraph
font-size: (* *font-size* 2)
line-spacing: (* *font-size* 2)
space-after: (* *font-size* 2)
(process-children)))
The subject element gets its own paragraph and is displayed in double font
size. AUTHOR and EMAIL are simpler versions of this, so I skip them. (You
can find them in the complete DSSSL file linked to below.)
(element VERSION
(make paragraph
(make sequence
(literal "Version: "))
(process-children)))
The VERSION element is given its own paragraph, which contains sequence
flow objects. I insert one containing the text "Version: " before the actual
contents of VERSION are processed. This means that the text "Version: " will
be inserted in front of the actual version number. DATE is similar, so I skip
that.
(element PART
(make paragraph
font-size: (* *font-size* 1.5)
line-spacing: (* *font-size* 2)
(make sequence
(literal (attribute-string "NO" (current-node)))
(literal ". ")
(literal (attribute-string "TITLE" (current-node)))
)
(process-children)))
I wanted PART to have a large font size and contain both number and title.
We've already seen how to do this with sequence, but the problem of getting
hold of the number and title is new. They are only given as attributes, and thus
will be ignored by (process-children). The function attribute-string gives us
what we want. (attribute-string "NO" (current-node)) returns the value of the
attribute NO in the current element. The rest of this style sheet is so simple
that I'll just skip it without comments.
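For readers who do not know Scheme, the idea behind these rules can be imitated in Python. The sketch below is only an analogy and not a DSSSL implementation: it produces plain text instead of flow objects and ignores all style properties. The rule table RULES, the function names and the TITLE value "Basics" are hypothetical. It uses the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def process_children(element):
    """Render the element's own text, then each child element in turn.

    Text that follows a child element (its "tail") is ignored here.
    """
    pieces = [(element.text or "").strip()]
    for child in element:
        pieces.append(render(child))
    return "".join(pieces)


def render_version(element):
    # Mirrors (literal "Version: ") followed by (process-children).
    return "Version: " + process_children(element) + "\n"


def render_part(element):
    # Attributes are not children, so they are read explicitly,
    # much as attribute-string does in the style sheet.
    heading = element.get("NO", "") + ". " + element.get("TITLE", "")
    return heading + "\n" + process_children(element)


# One rule per element name, like the (element X ...) forms.
RULES = {"VERSION": render_version, "PART": render_part}


def render(element):
    """Apply the rule for this element, or just process its children."""
    rule = RULES.get(element.tag, process_children)
    return rule(element)


doc = ET.fromstring(
    '<FAQ><VERSION> 1.0 </VERSION><PART NO="1" TITLE="Basics"/></FAQ>'
)
print(render(doc))
```

As in the style sheet, elements without a rule of their own simply pass control on to their children.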
In case anyone's interested, they can find the entire DSSSL file here, together
with the results in RTF and PostScript formats. The RTF file is produced by
Jade (see reference 12) and the PostScript file is produced from this. Note
that the RTF and PostScript files are from the Norwegian version. This should
make no difference, though.
At this point it doesn't seem like XSL will be based on Scheme, since
Microsoft and Netscape have already implemented JavaScript in their
browsers. So XSL will probably be defined as an XML DTD that uses
JavaScript for programming. That's a pity, since DSSSL has such a nice
syntax and Scheme is such a great programming language, but Netscape
Navigator and MSIE are of course large enough as it is.
The first thing I hope XML can put right is the problem of making web pages
with decent layout that are still accessible to anyone, regardless of browser.
Considering that XSL will be a complete standard to be supported one should,
after a while, expect a stable standard to write against. XSL also lets you
check whether optional features are present or not and if not you can supply
alternative code to take care of those cases.
A FAQ-maintainer will also be rid of the problems with maintaining the FAQ in
HTML, .txt and PDF versions (or whatever). Instead s/he can make one (or
more) DSSSL stylesheets to be run each time the original has been updated
to create new versions of the distribution files. (Just like I produced .RTF
and .PS files for my FAQ above.)
Considering that neither Microsoft nor Netscape has been able to implement
CSS (or even HTML) properly one can wonder what will happen when they try
to implement XML and XSL. My hope is that they'll decide they have to make
a real effort and do it properly and that if they don't somebody who does will
take over the market. They've now promised to support XML, so there's room
for hope, but no more... (See references 5 and 6.)
This can be used in a nearly infinite number of ways, but examples of what
the developers have in mind are footnotes that are invisible until you click the
footnote number in the text, that you can start from the table of contents in a
document and descend through the levels by clicking (like in Windows
Explorer). You can also make things like tables that can be sorted by any
column by clicking on it. The possibilities are nearly unlimited, and this is only
the tip of the iceberg.
This can be made significantly more advanced. One could imagine that
VRML (a language for coding 3D worlds) was redefined in XML and VRML
viewers were written as Java applets using DOM. (If you think this is science
fiction take a look at reference 8.) This would mean that VRML could be used
together with HTML with no need for extra software on the client side. (Well,
there would be the applets, but they both install and remove themselves.)
The applications described here are currently not feasible, but I hope that in
time they may be.
Exploiting this for global search engines like Excite and Altavista is going to be
a lot more difficult because of the number of different DTDs. With an
overview of the most important ones and a little artificial intelligence in the
search engines this could perhaps be handled, but for now this is pure
science fiction.
Jon Bosak writes about using this sort of technique with intelligent agents,
which are personal robots that search the web (and possibly other services as
well) for information for you based on your preferences. This might be easier,
as you could list the DTDs and your preferences privately, but it's still science
fiction.