
What Is XML

XML is a markup language that is used to carry data, not display it. It was designed to be self-descriptive and allow users to define their own tags. XML is just plain text that follows certain syntax rules, such as requiring closing tags. It is commonly used to transport and store data between incompatible systems or applications.

Uploaded by

Shailendra Singh
Copyright
© Attribution Non-Commercial (BY-NC)

What is XML?

• XML stands for eXtensible Markup Language
• XML is a markup language much like HTML
• XML was designed to carry data, not to display data
• XML tags are not predefined. You must define your own tags
• XML is designed to be self-descriptive
• XML is a W3C Recommendation

The Difference Between XML and HTML


XML is not a replacement for HTML.

XML and HTML were designed with different goals:

• XML was designed to transport and store data, with focus on what data is.
• HTML was designed to display data, with focus on how data looks.

HTML is about displaying information, while XML is about carrying information.

XML Does not DO Anything


Maybe it is a little hard to understand, but XML does not DO anything. XML was created to structure, store,
and transport information.

The following example is a note to Tove from Jani, stored as XML:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

The note above is quite self descriptive. It has sender and receiver information, it also has a heading and a
message body.

But still, this XML document does not DO anything. It is just pure information wrapped in tags. Someone
must write a piece of software to send, receive or display it.
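
As a minimal sketch of such a piece of software, the Python snippet below reads the note with the standard library's xml.etree.ElementTree module and collects each field into a dictionary. The names NOTE_XML and fields are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The note from the example above, kept as a plain string.
NOTE_XML = """<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

# Parse the text into an element tree; the result is the <note> root element.
note = ET.fromstring(NOTE_XML)

# Map each child tag name to its text content.
fields = {child.tag: child.text for child in note}

for name, value in fields.items():
    print(f"{name}: {value}")
```

The XML itself still does nothing; the program decides what the tags mean and what to do with them.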

XML is Just Plain Text


XML is nothing special. It is just plain text. Software that can handle plain text can also handle XML.

However, XML-aware applications can handle the XML tags specially. The functional meaning of the tags
depends on the nature of the application.

With XML You Invent Your Own Tags


The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are
"invented" by the author of the XML document.

That is because the XML language has no predefined tags.

The tags used in HTML (and the structure of HTML) are predefined. HTML documents can only use tags
defined in the HTML standard (like <p>, <h1>, etc.).

XML allows the author to define his own tags and his own document structure.

XML is Not a Replacement for HTML


XML is a complement to HTML.

It is important to understand that XML is not a replacement for HTML. In most web applications, XML is used
to transport data, while HTML is used to format and display the data.

My best description of XML is this:

XML is a software- and hardware-independent tool for carrying information.

XML is a W3C Recommendation


XML became a W3C Recommendation on February 10, 1998.

To read more about the XML activities at W3C, please read our W3C Tutorial.

XML is Everywhere
We have been participating in XML development since its creation. It has been amazing to see how quickly
the XML standard has developed, and how quickly a large number of software vendors have adopted the
standard.

XML is now as important for the Web as HTML was to the foundation of the Web.

XML is everywhere. It is the most common tool for data transmissions between all sorts of applications, and
is becoming more and more popular in the area of storing and describing information.

How Can XML be Used?



XML is used in many aspects of web development, often to simplify data storage and
sharing.

XML Separates Data from HTML


If you need to display dynamic data in your HTML document, it will take a lot of work to edit the HTML each
time the data changes.

With XML, data can be stored in separate XML files. This way you can concentrate on using HTML for layout
and display, and be sure that changes in the underlying data will not require any changes to the HTML.

With a few lines of JavaScript, you can read an external XML file and update the data content of your HTML.

You will learn more about this in a later chapter of this tutorial.

XML Simplifies Data Sharing


In the real world, computer systems and databases contain data in incompatible formats.

XML data is stored in plain text format. This provides a software- and hardware-independent way of storing
data.

This makes it much easier to create data that different applications can share.

XML Simplifies Data Transport


With XML, data can easily be exchanged between incompatible systems.

One of the most time-consuming challenges for developers is to exchange data between incompatible
systems over the Internet.

Exchanging data as XML greatly reduces this complexity, since the data can be read by different
incompatible applications.

XML Simplifies Platform Changes


Upgrading to new systems (hardware or software platforms) is always very time consuming. Large amounts
of data must be converted and incompatible data is often lost.

XML data is stored in text format. This makes it easier to expand or upgrade to new operating systems, new
applications, or new browsers, without losing data.

XML Makes Your Data More Available


Since XML is independent of hardware, software and application, XML can make your data more available
and useful.

Different applications can access your data, not only in HTML pages, but also from XML data sources.

With XML, your data can be available to all kinds of "reading machines" (Handheld computers, voice
machines, news feeds, etc), and make it more available for blind people, or people with other disabilities.

XML is Used to Create New Internet Languages
A lot of new Internet languages are created with XML.

Here are some examples:

• XHTML, a reformulation of HTML as an XML language
• WSDL for describing available web services
• WML as a markup language for handheld devices (used with WAP)
• RSS languages for news feeds
• RDF and OWL for describing resources and ontology
• SMIL for describing multimedia for the web

XML Syntax Rules



The syntax rules of XML are very simple and logical. The rules are easy to learn, and easy to
use.

All XML Elements Must Have a Closing Tag


In HTML, you will often see elements that don't have a closing tag:

<p>This is a paragraph
<p>This is another paragraph

In XML, it is illegal to omit the closing tag. All elements must have a closing tag:

<p>This is a paragraph</p>
<p>This is another paragraph</p>

Note: You might have noticed from the previous example that the XML declaration did not have a closing
tag. This is not an error. The declaration is not an XML element, so it has no closing tag.

XML Tags are Case Sensitive


XML elements are defined using XML tags.

XML tags are case sensitive. With XML, the tag <Letter> is different from the tag <letter>.

Opening and closing tags must be written with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>

Note: "Opening and closing tags" are often referred to as "Start and end tags". Use whatever you prefer. It
is exactly the same thing.

XML Elements Must be Properly Nested


In HTML, you might see improperly nested elements:

<b><i>This text is bold and italic</b></i>

In XML, all elements must be properly nested within each other:

<b><i>This text is bold and italic</i></b>

In the example above, "Properly nested" simply means that since the <i> element is opened inside the <b>
element, it must be closed inside the <b> element.

XML Documents Must Have a Root Element


XML documents must contain one element that is the parent of all other elements. This element is called
the root element.

<root>
<child>
<subchild>.....</subchild>
</child>
</root>

XML Attribute Values Must be Quoted


XML elements can have attributes in name/value pairs just like in HTML.

In XML the attribute value must always be quoted. Study the two XML documents below. The first one is
incorrect, the second is correct:

<note date=12/11/2007>
<to>Tove</to>
<from>Jani</from>
</note>

<note date="12/11/2007">
<to>Tove</to>
<from>Jani</from>
</note>

The error in the first document is that the date attribute in the note element is not quoted.

Entity References
Some characters have a special meaning in XML.

If you place a character like "<" inside an XML element, it will generate an error because the parser
interprets it as the start of a new element.

This will generate an XML error:

<message>if salary < 1000 then</message>

To avoid this error, replace the "<" character with an entity reference:

<message>if salary &lt; 1000 then</message>

There are 5 predefined entity references in XML:

&lt;      <    less than
&gt;      >    greater than
&amp;     &    ampersand
&apos;    '    apostrophe
&quot;    "    quotation mark

Note: Only the characters "<" and "&" are strictly illegal in XML. The greater than character is legal, but it is
a good habit to replace it.
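
The replacement does not have to be done by hand. As a small sketch, Python's standard xml.sax.saxutils.escape function substitutes the entity references for "&", "<" and ">", and the parser turns them back into the original characters. The helper name parses and the variable names are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "if salary < 1000 then"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw_text)
document = f"<message>{escaped}</message>"

# The parser resolves the entity reference back to the original character.
parsed_text = ET.fromstring(document).text


def parses(xml_text):
    """Return True if the text can be parsed as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(escaped)
print(parsed_text)
print(parses("<message>if salary < 1000 then</message>"))
```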

Comments in XML
The syntax for writing comments in XML is similar to that of HTML.

<!-- This is a comment -->


White-space is Preserved in XML
HTML collapses multiple white-space characters into a single white-space:

HTML:   Hello           my name is Tove

Output: Hello my name is Tove.

With XML, the white-space in a document is not collapsed.

XML Stores New Line as LF


In Windows applications, a new line is normally stored as a pair of characters: carriage return (CR) and line
feed (LF). The character pair bears some resemblance to the typewriter actions of setting a new line. In Unix
applications, a new line is normally stored as a single LF character. Current macOS applications also use LF,
while classic Mac OS used a single CR. An XML parser normalizes all of these line endings to a single LF
before passing the text to the application.
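
Both behaviours can be observed with a parser. The sketch below uses Python's standard xml.etree.ElementTree module; the element names greeting and text are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Runs of spaces inside an element are handed to the application unchanged.
spaced = ET.fromstring("<greeting>Hello     my name is Tove</greeting>")

# A Windows-style CR LF pair in the source text arrives as a single LF.
windows_style = ET.fromstring("<text>line one\r\nline two</text>")

print(repr(spaced.text))
print(repr(windows_style.text))
```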

XML Elements

An XML document contains XML Elements.

What is an XML Element?


An XML element is everything from (including) the element's start tag to (including) the element's end tag.

An element can contain other elements, simple text or a mixture of both. Elements can also have attributes.

<bookstore>
<book category="CHILDREN">
<title>Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>

In the example above, <bookstore> and <book> have element contents, because they contain other
elements. <author> has text content because it contains text.
In the example above only <book> has an attribute (category="CHILDREN").
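
To make the distinction between elements, text content and attributes concrete, here is a short Python sketch that walks the bookstore document with xml.etree.ElementTree. The names BOOKSTORE_XML and catalog are made up for this illustration.

```python
import xml.etree.ElementTree as ET

BOOKSTORE_XML = """<bookstore>
<book category="CHILDREN">
<title>Harry Potter</title>
<author>J. K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title>Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>"""

bookstore = ET.fromstring(BOOKSTORE_XML)

# get() reads an attribute; findtext() reads the text content of a child element.
catalog = [
    (book.get("category"), book.findtext("title"), float(book.findtext("price")))
    for book in bookstore.findall("book")
]

for category, title, price in catalog:
    print(f"{category}: {title} ({price})")
```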

XML Naming Rules


XML elements must follow these naming rules:

• Names can contain letters, digits, hyphens, underscores, and periods
• Names must start with a letter or an underscore, not with a digit, hyphen, or period
• Names starting with the letters xml (or XML, or Xml, etc) are reserved
• Names cannot contain spaces

Apart from the reserved xml prefix, any name can be used; no words are reserved.

Best Naming Practices


Make names descriptive. Names with an underscore separator are nice: <first_name>, <last_name>.

Names should be short and simple, like this: <book_title> not like this: <the_title_of_the_book>.

Avoid "-" characters. If you name something "first-name," some software may think you want to subtract
name from first.

Avoid "." characters. If you name something "first.name," some software may think that "name" is a
property of the object "first."

Avoid ":" characters. Colons are reserved to be used for something called namespaces (more later).

XML documents often have a corresponding database. A good practice is to use the naming rules of your
database for the elements in the XML documents.

Non-English letters like éòá are perfectly legal in XML, but watch out for problems if your software vendor
doesn't support them.

XML Elements are Extensible


XML elements can be extended to carry more information.

Look at the following XML example:

<note>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>

Let's imagine that we created an application that extracted the <to>, <from>, and <body> elements from
the XML document to produce this output:

MESSAGE

To: Tove
From: Jani

Don't forget me this weekend!

Imagine that the author of the XML document added some extra information to it:

<note>
<date>2008-01-10</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

Should the application break or crash?

No. The application should still be able to find the <to>, <from>, and <body> elements in the XML
document and produce the same output.

One of the beauties of XML is that it can often be extended without breaking applications.
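
A minimal Python sketch of such an application shows why the extra elements do no harm: it asks for the <to>, <from>, and <body> elements by name and never looks at anything else. The function name format_message and the constants ORIGINAL and EXTENDED are made up for this illustration.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<note>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>"""

EXTENDED = """<note>
<date>2008-01-10</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""


def format_message(xml_text):
    """Build the MESSAGE output from the three elements the application knows."""
    note = ET.fromstring(xml_text)
    return (
        "MESSAGE\n\n"
        f"To: {note.findtext('to')}\n"
        f"From: {note.findtext('from')}\n\n"
        f"{note.findtext('body')}"
    )


# The unknown <date> and <heading> elements are simply ignored.
print(format_message(EXTENDED))
```

An application that instead relied on the position of each child (first child is the recipient, and so on) would break when <date> was added, so looking elements up by name is the safer habit.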

XML Attributes

XML elements can have attributes in the start tag, just like HTML.

Attributes provide additional information about elements.

XML Attributes
From HTML you will remember this: <img src="computer.gif">. The "src" attribute provides additional
information about the <img> element.

In HTML (and in XML) attributes provide additional information about elements:

<img src="computer.gif">
<a href="demo.asp">

Attributes often provide information that is not a part of the data. In the example below, the file type is
irrelevant to the data, but important to the software that wants to manipulate the element:

<file type="gif">computer.gif</file>

XML Attributes Must be Quoted
Attribute values must always be enclosed in quotes, but either single or double quotes can be used. For a
person's sex, the person tag can be written like this:

<person sex="female">

or like this:

<person sex='female'>

If the attribute value itself contains double quotes you can use single quotes, like in this example:

<gangster name='George "Shotgun" Ziegler'>

or you can use character entities:

<gangster name="George &quot;Shotgun&quot; Ziegler">
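
Both spellings give the application the same attribute value. The Python sketch below checks this with xml.etree.ElementTree; the tags are written as empty elements (ending in "/>") so that each snippet is a complete document, which is an adjustment made for this illustration.

```python
import xml.etree.ElementTree as ET

# Single quotes around the value, double quotes inside it.
single_quoted = ET.fromstring("""<gangster name='George "Shotgun" Ziegler'/>""")

# Double quotes around the value, &quot; entity references inside it.
entity_quoted = ET.fromstring(
    '<gangster name="George &quot;Shotgun&quot; Ziegler"/>'
)

print(single_quoted.get("name"))
print(entity_quoted.get("name"))
```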

XML Elements vs. Attributes


Take a look at these examples:

<person sex="female">
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>

<person>
<sex>female</sex>
<firstname>Anna</firstname>
<lastname>Smith</lastname>
</person>

In the first example sex is an attribute. In the second, sex is an element. Both examples provide the same
information.

There are no rules about when to use attributes and when to use elements. Attributes are handy in HTML. In
XML my advice is to avoid them. Use elements instead.

My Favorite Way
The following three XML documents contain exactly the same information:

A date attribute is used in the first example:

<note date="10/01/2008">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

A date element is used in the second example:

<note>
<date>10/01/2008</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

An expanded date element is used in the third (THIS IS MY FAVORITE):

<note>
<date>
<day>10</day>
<month>01</month>
<year>2008</year>
</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

Avoid XML Attributes?


Some of the problems with using attributes are:

• attributes cannot contain multiple values (elements can)


• attributes cannot contain tree structures (elements can)
• attributes are not easily expandable (for future changes)

Attributes are difficult to read and maintain. Use elements for data. Use attributes for information that is not
relevant to the data.

Don't end up like this:

<note day="10" month="01" year="2008"
to="Tove" from="Jani" heading="Reminder"
body="Don't forget me this weekend!">
</note>

XML Attributes for Metadata


Sometimes ID references are assigned to elements. These IDs can be used to identify XML elements in
much the same way as the ID attribute in HTML. This example demonstrates this:

<messages>
<note id="501">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note id="502">
<to>Jani</to>
<from>Tove</from>
<heading>Re: Reminder</heading>
<body>I will not</body>
</note>
</messages>

The ID above is just an identifier, to identify the different notes. It is not a part of the note itself.

What I'm trying to say here is that metadata (data about data) should be stored as attributes, and that data
itself should be stored as elements.
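
As a sketch of how such an identifier is used, the Python snippet below looks up a note by its id attribute with the limited XPath support in xml.etree.ElementTree. The helper name find_note and the constant MESSAGES_XML are made up for this illustration.

```python
import xml.etree.ElementTree as ET

MESSAGES_XML = """<messages>
<note id="501">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
<note id="502">
<to>Jani</to>
<from>Tove</from>
<heading>Re: Reminder</heading>
<body>I will not</body>
</note>
</messages>"""

messages = ET.fromstring(MESSAGES_XML)


def find_note(root, note_id):
    """Return the <note> child whose id attribute matches, or None."""
    return root.find(f"note[@id='{note_id}']")


reply = find_note(messages, "502")
print(reply.findtext("heading"))
```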

XML Validation

XML with correct syntax is "Well Formed" XML.

XML validated against a DTD is "Valid" XML.

Well Formed XML Documents


A "Well Formed" XML document has correct XML syntax.

The syntax rules were described in the previous chapters:

• XML documents must have a root element


• XML elements must have a closing tag
• XML tags are case sensitive
• XML elements must be properly nested
• XML attribute values must be quoted

<?xml version="1.0" encoding="ISO-8859-1"?>


<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
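
The five rules can be tried out against a real parser. The following Python sketch feeds one broken sample per rule to xml.etree.ElementTree, which rejects any document that is not well formed. The function name is_well_formed and the sample labels are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "missing closing tag": "<p>This is a paragraph",
    "mismatched case": "<Message>text</message>",
    "improper nesting": "<b><i>text</b></i>",
    "unquoted attribute": "<note date=12/11/2007><to>Tove</to></note>",
    "two root elements": "<note/><note/>",
    "correct": '<note date="12/11/2007"><to>Tove</to></note>',
}

results = {name: is_well_formed(text) for name, text in samples.items()}

for name, ok in results.items():
    print(f"{name}: {'well formed' if ok else 'NOT well formed'}")
```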

Valid XML Documents


A "Valid" XML document is a "Well Formed" XML document, which also conforms to the rules of a Document
Type Definition (DTD):

<?xml version="1.0" encoding="ISO-8859-1"?>


<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

The DOCTYPE declaration in the example above is a reference to an external DTD file. The content of the
file is shown in the section below.

XML DTD
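The purpose of a DTD is to define the structure of an XML document. It defines the structure with a list of
legal elements. The declarations are shown here wrapped in a DOCTYPE, the form used when a DTD is embedded in
the XML document itself; an external file such as Note.dtd contains only the <!ELEMENT> lines:

<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>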
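
Python's standard xml.etree.ElementTree parser checks only well-formedness and does not validate against a DTD. The sketch below is therefore a hand-written stand-in, not real DTD validation: it checks the same two things the note DTD demands, namely that <note> contains exactly to, from, heading, body in that order, and that each of them holds text only. The names EXPECTED_CHILDREN and follows_note_model are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Mirrors the content model (to,from,heading,body) from the DTD above.
EXPECTED_CHILDREN = ["to", "from", "heading", "body"]


def follows_note_model(xml_text):
    """Hand-rolled check of the note structure; a stand-in for DTD validation."""
    root = ET.fromstring(xml_text)
    if root.tag != "note":
        return False
    if [child.tag for child in root] != EXPECTED_CHILDREN:
        return False
    # (#PCDATA) means text only, so no child may contain further elements.
    return all(len(child) == 0 for child in root)


valid_note = (
    "<note><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading><body>Don't forget me!</body></note>"
)
missing_heading = "<note><to>Tove</to><from>Jani</from><body>Hi</body></note>"

print(follows_note_model(valid_note))
print(follows_note_model(missing_heading))
```

Note that the second document is still well formed; it is only invalid with respect to this particular structure.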

If you want to study DTD, you will find our DTD tutorial on our homepage.

XML Schema
W3C supports an XML-based alternative to DTD, called XML Schema:

<xs:element name="note">

<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>

</xs:element>

An introduction to XML

By: Lars Marius Garshol


About this document


This document is something my thesis advisor asked me to write in
connection with my thesis, to clarify what XML is and what I saw in it. It
occurred to me that this might be useful for a lot of people so I put it out on
the web. After the Norwegian version received quite a few hits I asked on
comp.text.sgml whether there was any interest in an English version. There
was, so I translated it.

What's wrong with HTML?


(If you don't know the difference between a tag and an element in
HTML/SGML you should read the glossary at the end of this document.)

Originally, the intention with HTML was that the elements should be used to
mark up information according to its meaning, without regard to how this
would actually be rendered in a browser. In other words: title, main header,
emphasized text and the contact information of the author should be placed
inside the elements TITLE, H1, EM (or possibly STRONG) and ADDRESS. To
use FONT or I and similar elements to get a nice layout makes it a lot more
difficult to present the information to the best possible effect regardless of the
user's environment. Processing the information automatically also becomes
difficult (or even impossible). (See reference 1.)

The reason why the browser should decide for itself how to display title and
headers etc is that it knows a lot more about the user's preferences and
environment and so can make decisions based on that. The author, not
knowing his reader, cannot do this as well, of course. This is especially useful
for people who are blind, run non-graphical browsers or who have weak
eyesight, and therefore need larger font sizes. This means that an author who
doesn't follow the rules will cause problems for those of the readers who read
in a non-standard environment.

Unfortunately, browser vendors have either not understood this or decided to
ignore it, as they have ignored standards that tried to place information about
layout outside the HTML documents themselves, like CSS. (See reference 2.)
Instead, they've introduced their own elements and attributes whose only
purpose is to specify the layout, like FONT, CENTER, BGCOLOR etc. They've
also made HTML editors (like Netscape Gold) which produce HTML where the
markup is presentational rather than semantic. (For instance, Netscape Gold
uses UL to produce indentation, and not just for lists.)

The result is that a lot of pages on the web now contain tagging that's written
for a specific version of a specific browser (with default preferences) and a
specific screen resolution. These pages are often more or less unreadable to
those who use something else. Thus, HTML has gradually been turned into a
presentational language for Netscape and MSIE by the vendors and their
users.

This, however, is not the only problem. If you want to mark up your
information really precisely according to its meaning you'll want lots of
elements that just aren't present in HTML. If you are, say, a chemist, you'll
probably want special elements for chemical formulas, for measurement data
and so on. If you are an airplane manufacturer you'll want to be able to talk
about engines, parts and models. Catering to the needs of all trades and
people will obviously mean having an enormous number of elements, which is
quite simply a Bad Thing for both developers and users.

Another problem is that HTML has very little internal structure, which means
that you can easily write valid HTML that does not make sense at all when
you consider the semantics of the elements. This is because (among other
things) the contents of BODY have been defined so that you can place the
elements allowed therein in any order you please. This means that you don't
need an H1 with the H2s inside it and H3s inside the H2s. (Think of H1 as a
book title, H2 as part title and H3 as chapter title.) HTML should ideally be
written this way, but the HTML standard does not require it. (See references 1
and 3.)

People have been aware of these problems for quite some time, and in the
summer of '96 the W3C (which defines the web standards) started work on a
new standard to deal with these problems. The W3C has set up a working
group that is now creating this new standard called XML, for eXtensible
Markup Language. The working group (from now on called XWG, for XML
working group) has split its work into three phases:

Phase 1

Define a standard for the creation of markup languages.

Phase 2

Develop a common standard for linking in these markup languages.

Phase 3

Develop a common standard for specifying the layout of documents
encoded in these languages.

Phase 1 is now completed, since the XML 1.0 specification is finished.
Phase 2 is still under way, although there is a working draft. Phase 3 has not
yet reached that stage; so far only a proposal exists.

XML
Please note that the descriptions given below are simplified and only meant to
give an impression of XML. They leave out a lot of the standards and are (for
reasons of readability) a little inaccurate. If you want more detailed and
accurate information you should go on to read the appendices below. Also
note that these standards are not finalized yet, so that they may change
before they're officially accepted. As a first introduction, however, this
document should be useful.

XML itself

There already exists a standard for defining markup languages like HTML,
which is called SGML. HTML is actually defined in SGML. SGML could have
been used as this new standard, and browsers could have been extended
with SGML parsers. However, SGML is quite complex to implement and
contains a lot of features that are very rarely used. Its support for different
character sets is also a bit weak, which is something that can cause problems
on the web where people use many different kinds of computers and
languages. It's also difficult to interpret an SGML document without having the
definition of the markup language (the DTD) available. Because of this, the
XWG decided to develop a simplified version of SGML, which they called
XML. (As they like to say, XML is more like SGML light, than HTML++.)

The main point of XML is that you, by defining your own markup language,
can encode the information of your documents much more precisely than is
possible with HTML. This means that programs processing these documents
can "understand" them much better and therefore process the information in
ways that are impossible with HTML (or ordinary text processor documents).
Imagine that you marked up recipes (for, say, soups and seafood dishes etc)
according to a DTD tailored for recipes where you entered the amounts of
each ingredient and alternatives for some ingredients. You could then easily
make a program that, given a list of the contents of your fridge, would go
through the entire list of recipes and make a list of the dishes you could make
with them. Given nutritional information about the ingredients (x calories per
ounce of this, y calories per ounce of that etc) the program could sort the
suggestions by the amount of calories in each dish. Or by how long they'd
take to prepare, or the price (given price information for the ingredients), or...
The possibilities are almost endless, because the information is encoded in a
way that the computer can "understand".

Defining your own markup language with XML is actually surprisingly simple.
If you wanted to make a markup language for FAQs you might want it to be
used like this (note that this example is really too simple to be very useful):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE FAQ SYSTEM "FAQ.DTD">
<FAQ>
<INFO>
<SUBJECT> XML </SUBJECT>
<AUTHOR> Lars Marius Garshol</AUTHOR>
<EMAIL> [email protected] </EMAIL>
<VERSION> 1.0 </VERSION>
<DATE> 20.jun.97 </DATE>
</INFO>

<PART NO="1">
<Q NO="1">
<QTEXT>What is XML?</QTEXT>
<A>SGML light.</A>
</Q>

<Q NO="2">
<QTEXT>What can I use it for?</QTEXT>
<A>Anything.</A>
</Q>

</PART>
</FAQ>

In XML, the markup language shown above (let's call it FAQML) has a DTD
like this:

<!ELEMENT FAQ (INFO, PART+)>

<!ELEMENT INFO (SUBJECT, AUTHOR, EMAIL?, VERSION?, DATE?)>
<!ELEMENT SUBJECT (#PCDATA)>
<!ELEMENT AUTHOR (#PCDATA)>
<!ELEMENT EMAIL (#PCDATA)>
<!ELEMENT VERSION (#PCDATA)>
<!ELEMENT DATE (#PCDATA)>

<!ELEMENT PART (Q+)>
<!ELEMENT Q (QTEXT, A)>
<!ELEMENT QTEXT (#PCDATA)>
<!ELEMENT A (#PCDATA)>

<!ATTLIST PART NO CDATA #IMPLIED
               TITLE CDATA #IMPLIED>
<!ATTLIST Q NO CDATA #IMPLIED>

<!ELEMENT> is used to define elements like this: <!ELEMENT NAME CONTENTS>. NAME
gives the name of the element, and CONTENTS describes which elements are
allowed inside the element we've defined, and in what order. A,B means that you
must have an A first, followed by a B. ? after an element means that it can be
skipped, + means that it must be included one or more times and * means that
it can be included zero or more times. #PCDATA means ordinary text
without markup (more or less).
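
The ?, + and * markers behave exactly like the same operators in regular expressions, which suggests a small experiment. The Python sketch below translates a flat content model such as "SUBJECT, AUTHOR, EMAIL?" into a regular expression and matches it against the names of an element's children. It is a simplified illustration that handles only comma-separated sequences (no nested groups, no alternatives, no #PCDATA), and the function names are made up for it; a real validating parser does far more.

```python
import re
import xml.etree.ElementTree as ET


def content_model_to_regex(model):
    """Translate a flat DTD sequence like 'SUBJECT, AUTHOR, EMAIL?' to a regex."""
    parts = []
    for item in model.split(","):
        item = item.strip()
        # Split off a trailing occurrence marker, if there is one.
        if item[-1] in "?+*":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        # Each child name is followed by one space in the string we match.
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def children_match(element, model):
    """Check the element's child names, in order, against the content model."""
    child_names = "".join(f"{child.tag} " for child in element)
    return content_model_to_regex(model).fullmatch(child_names) is not None


INFO_MODEL = "SUBJECT, AUTHOR, EMAIL?, VERSION?, DATE?"

info_ok = ET.fromstring(
    "<INFO><SUBJECT>XML</SUBJECT><AUTHOR>LMG</AUTHOR><DATE>20.jun.97</DATE></INFO>"
)
info_bad = ET.fromstring("<INFO><SUBJECT>XML</SUBJECT><DATE>20.jun.97</DATE></INFO>")

print(children_match(info_ok, INFO_MODEL))   # AUTHOR present, optional parts skipped
print(children_match(info_bad, INFO_MODEL))  # required AUTHOR is missing
```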

An important difference between XML and SGML is that elements in XML
which do not have any contents (like IMG and BR of HTML) are written like
this in XML: <IMG SRC="pamela.gif"/>. Note the slash before the final >. This
means that a program can read the document without knowing the DTD
(which is where it says that IMG does not have any contents) and still know
that IMG does not have an end tag and that what comes after IMG is not
inside the element.
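
A quick way to see this is to parse the empty-element form without any DTD at hand. In the Python sketch below (standard xml.etree.ElementTree, which never reads a DTD here), the short form and the explicit start tag plus end tag form produce the same element. The helper name describe is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Empty-element syntax: the slash tells the parser there is no content.
short_form = ET.fromstring('<IMG SRC="pamela.gif"/>')

# The same element written with an explicit end tag.
long_form = ET.fromstring('<IMG SRC="pamela.gif"></IMG>')


def describe(element):
    """Summarize an element as (tag, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text, len(element))


print(describe(short_form))
print(describe(long_form))
```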

<!ATTLIST> defines the attributes of an element. In the DTD given above it's
used to give PART and Q an attribute called NO, which contains ordinary text
and which can be skipped. As you can see, PART has two attributes, and the
last one is called TITLE, contains text and can be skipped.

Linking in XML

HyTime is a standard for adding linking attributes and elements to SGML
DTDs. It is far more advanced than what's possible with HTML and contains a
lot of stuff not useful on the web. The XWG is therefore currently making a
similar standard for XML which borrows a lot from HyTime (and similar
standards) and simplifies it.

To make it possible to use this linking standard in any DTD (regardless of
which elements the DTD has) there aren't defined any particular elements for
linking. Instead, linking elements use special attributes that identify them as
linking elements. All elements that have an attribute called XML-LINK will be
considered linking. The value of XML-LINK specifies what kind of link the
element specifies.

XML links can be between two or more resources, and resources can be
either files (and not necessarily XML or HTML files) or elements in files. Links
can be specified with the ACTUATE-attribute to be followed either (if the value
is USER) when the user explicitly requests this (for instance by clicking) or
(value AUTO) automatically (i.e., when the system reads the linking element).
What happens when you follow the link is specified with SHOW, which can
take the following values:

EMBED

This means that the resource the link points to is to be inserted into the
document the link comes from. This will happen either during the
displaying of the document or during processing of the document. This
can be useful for including text from other files (with ACTUATE=AUTO)
or to include a picture in a page. It can also be used to insert footnotes
into the text and ACTUATE will then specify if the user has to click on
the footnotes to include them or whether all footnotes will be inserted
automatically.

REPLACE

This means that the resource the link points to is to replace the linking
element. If you have two different versions of a paragraph you can link
them in such a way that one can see the other version in the same
context by following the link.

NEW

In this case, following the link will not affect the resource the link came
from. Instead, the linked resource will be processed/displayed in a new
context. Ordinary HTML links are of type NEW, as the new page is
displayed in place of the previous one.

XML is even more advanced than this. Links can be between more than one
resource, they can be specified outside the actual documents themselves and
the linked-to element inside a resource can be specified in very powerful
ways. The element can be identified with an ID-attribute, position in the
element structure and one can even specify that the link goes to things like
"fourth LI inside the first UL inside BODY".

In FAQML this could have been used both for specifying links to relevant
information outside the FAQ as well as specifying internal relationships
between different answers. It could also have been used for footnotes etc.
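
To give a feel for addressing an element by its position in the element structure, such as "fourth LI inside the first UL inside BODY", here is a Python sketch. It uses the limited XPath subset of xml.etree.ElementTree, which is not the syntax of the XML linking draft described here; it only illustrates the same idea. The sample page and the name PAGE are made up for this illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<HTML><BODY>
<UL><LI>one</LI><LI>two</LI><LI>three</LI><LI>four</LI></UL>
<UL><LI>other list</LI></UL>
</BODY></HTML>"""

root = ET.fromstring(PAGE)

# Positions are 1-based: first UL inside BODY, then its fourth LI.
target = root.find("BODY/UL[1]/LI[4]")

print(target.text)
```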

XML and layout

There is already an SGML standard for this as well. It is called DSSSL, and
it isn't very simple either. The XWG has therefore decided to make a
simplified version of DSSSL as well and call it XSL. So far, not much has
been done about this. One proposal (see references) has been submitted,
but it hasn't been accepted yet, and it's uncertain if it will be. So I'm going
to describe DSSSL instead of XSL, at least until the future shape of XSL
becomes clearer.

DSSSL is actually a full programming language, based on Scheme (a LISP
dialect), and is very powerful. It can be used both as a stylesheet specifying
fonts and positioning for the different elements and as a transformational
language that can be used to transform documents from one DTD to another.

The most common use of DSSSL is to convert SGML documents to other
formats better suited for presentation, like PDF (also known as Acrobat),
PostScript, LaTeX, HTML or RTF. What the XWG is planning is to use XSL to
specify how XML documents are to be displayed on screen.

Below I try to show how we could make a stylesheet for FAQML, but without
explaining very much of what really happens. I've split the DSSSL file into
several parts in order to be able to comment it as it's written, but it is meant to
be a single file.

DSSSL consists of several different parts, and the most basic one is the
expression language which is quite simply a subset of Scheme. This means
that DSSSL-stylesheets are really one large Scheme expression that is
calculated by the DSSSL engine, with a file as the result of the calculation.
Another important part (which is built on the expression language) is the style
language, which I've used almost exclusively in this example. A third part is
the query language, which can be used to find any element you want in your
document. I've used it in this example to find the number of a FAQ question
from inside the QTEXT element. This was necessary because NO is an
attribute of the surrounding Q element, and not QTEXT itself.
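The lookup described above (starting at QTEXT and fetching NO from the surrounding Q element) is not tied to DSSSL. As a rough illustration, here is a sketch in Python with the standard library's ElementTree rather than the DSSSL query language. The FAQML fragment and the helper name `question_number` are hypothetical; only Q, its NO attribute and QTEXT are taken from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical FAQML fragment; only Q, NO and QTEXT come from the text.
FAQ = """
<FAQ>
  <Q NO="1"><QTEXT>What is XML?</QTEXT></Q>
  <Q NO="2"><QTEXT>What is a DTD?</QTEXT></Q>
</FAQ>
"""

root = ET.fromstring(FAQ)

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def question_number(qtext):
    """Return the NO attribute of the Q element that surrounds a QTEXT element."""
    return parent_of[qtext].get("NO")


numbered = [(question_number(q), q.text) for q in root.iter("QTEXT")]
print(numbered)
```

As in the stylesheet, the number has to be fetched from the parent element because QTEXT itself carries no NO attribute.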

All formatting in DSSSL is done with so-called flow objects. In the code below
you'll see a lot of (element X (make Y ...)) expressions, which indicate that
when element X shows up the DSSSL engine is to create a flow object of type Y.
Then style rules for Y and then the contents of Y are specified. There's much
more to DSSSL than this, but the rest is considered to be outside the scope of
this document.

<!doctype style-sheet PUBLIC "-//James Clark//DTD DSSSL Style Sheet//EN">

;--- DSSSL stylesheet for FAQML

;---Constants

(define *font-size* 12pt)
(define *font* "Times New Roman")

The first line tells the SGML parser that this document follows the DTD for
DSSSL. (Yes, DSSSL is an SGML application.) The next two lines are
comments (after ; the rest of the line is ignored). Then I define two constants
that I use below in the styles themselves. This is done to make it easy to
change the font size of the entire document without having to adjust sizes for
all kinds of headers etc. Instead, I just change the value of *font-size*.

;---Element styles

(element FAQ
  (make simple-page-sequence
    font-family-name: *font*
    font-size: *font-size*
    input-whitespace-treatment: 'collapse
    line-spacing: (* *font-size* 1.2)
    (process-children)))

This part creates a flow object for the FAQ element, i.e. the whole document.
The flow object is "simple-page-sequence", which I assume is meant for small
articles. I then specify what font to use, font size, that whitespace is to be
considered insignificant (like in HTML) and then I give the line height. The line
height is set to be 1.2 times the font size.

(element INFO
  (make paragraph
    quadding: 'center
    space-after: (* *font-size* 1.5)
    (process-children)))

This indicates that the element INFO (from start-tag to end-tag) is to be laid
out as a paragraph that is centered and has a blank space as high as 1.5 lines
after it. After creating the paragraph flow object the DSSSL engine is to go on
to process the child elements of INFO.

(element SUBJECT
  (make paragraph
    font-size: (* *font-size* 2)
    line-spacing: (* *font-size* 2)
    space-after: (* *font-size* 2)
    (process-children)))

The subject element gets its own paragraph and is displayed in double font
size. AUTHOR and EMAIL are simpler versions of this, so I skip them. (You
can find them in the complete DSSSL file linked to below.)

(element VERSION
  (make paragraph
    (make sequence
      (literal "Version: "))
    (process-children)))

The VERSION element is given its own paragraph, which contains sequence
flow objects. I insert one containing the text "Version: " before the actual
contents of VERSION are processed. This means that the text "Version: " will
be inserted in front of the actual version number. DATE is similar, so I skip
that.

(element PART
  (make paragraph
    font-size: (* *font-size* 1.5)
    line-spacing: (* *font-size* 2)
    (make sequence
      (literal (attribute-string "NO" (current-node)))
      (literal ". ")
      (literal (attribute-string "TITLE" (current-node))))
    (process-children)))

I wanted PART to have a large font size and contain both number and title.
We've already seen how to do this with sequence, but the problem of getting
hold of the number and title is new. They are only given as attributes, and thus
will be ignored by (process-children). The function attribute-string gives us
what we want. (attribute-string "NO" (current-node)) returns the value of the
attribute NO in the current element. The rest of this style sheet is so simple
that I'll just skip it without comments.
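For readers who don't know Scheme, the heading that the PART rule builds from the NO and TITLE attributes can be mimicked in a few lines of Python. This is only an analogy using the standard library's ElementTree, not part of the stylesheet. The function name `part_heading` and the attribute values are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def part_heading(part):
    """Build the "NO. TITLE" heading text from a PART element's attributes."""
    # part.get("NO") plays the role of (attribute-string "NO" (current-node)).
    return f'{part.get("NO")}. {part.get("TITLE")}'


# Hypothetical PART element; the attribute values are invented.
part = ET.fromstring('<PART NO="2" TITLE="Using XML"><Q NO="1"/></PART>')
print(part_heading(part))
```

Just as in the DSSSL rule, the attribute values have to be read explicitly, since walking the child elements never visits them.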

In case anyone's interested, they can find the entire DSSSL file here, together
with the results in RTF and PostScript formats. The RTF file is produced by
Jade (see reference 12) and the PostScript file is produced from this. Note
that the RTF and PostScript files are from the Norwegian version. This should
make no difference, though.

The difference between XSL and DSSSL

At this point it doesn't seem like XSL will be based on Scheme, since
Microsoft and Netscape already have implemented JavaScript in their
browsers. So XSL will probably be defined as an XML DTD that uses
JavaScript for programming. That's a pity, since DSSSL has such a nice
syntax and Scheme is such a great programming language, but Netscape
Navigator and MSIE are of course large enough as it is.

What will XML be used for?


Please note that what follows is only my personal view of the future of the
web and should as such be taken with a pinch of salt. (Reference 4 is an
excellent article by XWG Chairman Jon Bosak on XML and the future of the
web.)

The layout problem

The first thing I hope XML can put right is the problem of making web pages
with decent layout that are still accessible to anyone, regardless of browser.
Since XSL will be a complete standard that browsers are expected to support,
one should after a while have a stable standard to write against. XSL also
lets you check whether optional features are present or not, and if they are
not you can supply alternative code to take care of those cases.

A FAQ-maintainer will also be rid of the problems with maintaining the FAQ in
HTML, .txt and PDF versions (or whatever). Instead s/he can make one (or
more) DSSSL stylesheets to be run each time the original has been updated
to create new versions of the distribution files. (Just like I produced .RTF
and .PS files for my FAQ above.)

Considering that neither Microsoft nor Netscape has been able to implement
CSS (or even HTML) properly, one can wonder what will happen when they try
to implement XML and XSL. My hope is that they'll decide they have to make
a real effort and do it properly, and that if they don't, somebody who does will
take over the market. They've now promised to support XML, so there's room
for hope, but no more... (See references 5 and 6.)

More versatile ways of displaying data

An API to be supported by all XML and HTML processors (that is, browsers
and other tools) is under development under the name Document Object
Model (or DOM). This happens a little on the side of the XWG's work, but is
still well under way. (See reference 7.) This API will make it possible to make
Java applets (or JavaScript snippets) that can be used to change the display
of XML-encoded information in web browsers. (The members of XWG like to
call this "giving Java something to work with.")

This can be used in a nearly infinite number of ways. Examples of what the
developers have in mind are footnotes that stay invisible until you click the
footnote number in the text, and documents where you start from the table of
contents and descend through the levels by clicking (like in Windows
Explorer). You can also make things like tables that can be sorted by any
column by clicking on it. The possibilities are nearly unlimited, and this is only
the tip of the iceberg.
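To give an idea of what sorting a table through a DOM-style API could look like, here is a small sketch in Python using xml.dom.minidom from the standard library, with no browser or click handling involved. The TABLE, ROW and CELL element names, the data and the helper functions are all invented for the illustration.

```python
from xml.dom import minidom

# Hypothetical table markup, written without whitespace between elements
# so that every child node is an element or a text node we care about.
TABLE = (
    "<TABLE>"
    "<ROW><CELL>pear</CELL><CELL>3</CELL></ROW>"
    "<ROW><CELL>apple</CELL><CELL>7</CELL></ROW>"
    "<ROW><CELL>fig</CELL><CELL>5</CELL></ROW>"
    "</TABLE>"
)


def cell_text(row, column):
    """Return the text of the given column (0-based) in a ROW element."""
    return row.getElementsByTagName("CELL")[column].firstChild.data


def sort_rows(table, column):
    """Reorder the ROW children of table by the text in one column."""
    rows = list(table.getElementsByTagName("ROW"))
    for row in sorted(rows, key=lambda r: cell_text(r, column)):
        # appendChild moves a node that is already in the tree to the end.
        table.appendChild(row)


table = minidom.parseString(TABLE).documentElement
sort_rows(table, 0)
print([cell_text(r, 0) for r in table.getElementsByTagName("ROW")])
```

Note that the sketch compares cell contents as strings; sorting a numeric column properly would need a conversion in the key function.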

This can be made significantly more advanced. One could imagine that
VRML (a language for coding 3D worlds) were redefined in XML and VRML
viewers were written as Java applets using DOM. (If you think this is science
fiction take a look at reference 8.) This would mean that VRML could be used
together with HTML with no need for extra software on the client side. (Well,
there would be the applets, but they both install and remove themselves.)

Jon Bosak describes an even more advanced possibility in reference 4. The
major vendors of electronic components (so-called chips) have joined forces
to make a DTD that can be used to describe components. Together with the
right Java applets this could be used to download any descriptions of chips
and then model how these work together.

Searching and agents

The applications described here are currently not feasible, but I hope that in
time they may be.

Because the information in XML documents is so precisely described by the
markup, one can search them in much better ways than the primitive text
searches available today from search engines like Excite and Altavista. There
are already SGML query languages that are similar to SQL in power, and this
field is still under research. (See references 9 and 10.)

With standardized DTDs for different applications one could retrieve
information much more accurately than today. One could envision things like a
central search engine for chip vendors where you could do very precise
searches for components by specification, almost as if they were in an
ordinary relational database. Similar services would be possible for all
documents with a common DTD.
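A search by specification of this kind might look roughly like the following Python sketch, which uses the standard library's ElementTree. The CATALOG, CHIP, PINS and VOLTAGE names and all the values are invented; they do not come from any real component DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical component catalogue; the element names and values are invented.
CATALOG = """
<CATALOG>
  <CHIP NAME="A100"><PINS>14</PINS><VOLTAGE>5.0</VOLTAGE></CHIP>
  <CHIP NAME="B200"><PINS>16</PINS><VOLTAGE>3.3</VOLTAGE></CHIP>
  <CHIP NAME="C300"><PINS>16</PINS><VOLTAGE>5.0</VOLTAGE></CHIP>
</CATALOG>
"""


def find_chips(catalog, pins=None, max_voltage=None):
    """Return the names of all chips that match the given specification."""
    matches = []
    for chip in catalog.iter("CHIP"):
        if pins is not None and int(chip.findtext("PINS")) != pins:
            continue
        if max_voltage is not None and float(chip.findtext("VOLTAGE")) > max_voltage:
            continue
        matches.append(chip.get("NAME"))
    return matches


catalog = ET.fromstring(CATALOG)
print(find_chips(catalog, pins=16))
print(find_chips(catalog, pins=16, max_voltage=4.0))
```

Because the markup says which number is the pin count and which is the voltage, the query can be as precise as one against a database table. A plain text search for "16" could not tell the two apart.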

Exploiting this for global search engines like Excite and Altavista is going to be
a lot more difficult because of the number of different DTDs. With an
overview of the most important ones and a little artificial intelligence in the
search engines this could perhaps be handled, but for now this is pure
science fiction.

Jon Bosak writes about using this sort of technique with intelligent agents,
which are personal robots that search the web (and possibly other services as
well) for information for you based on your preferences. This might be easier,
as you could list the DTDs and your preferences privately, but it's still science
fiction.

Exchanging information between different systems

Because a DTD gives a standard format for information related to a specific
subject it can be used to simplify the exchange of information between
different sources. Many kinds of applications have or will have standard
DTDs. I've already mentioned chip manufacturers, and many other industries
already have standard DTDs and more will follow. This means that systems
can use these common DTDs to exchange information with each other,
regardless of their internal format. The main applications of this will probably
be the exchange of data between companies in the same industry or
researchers within an academic field, although many other applications for
ordinary users are imaginable.
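The idea of two systems with different internal formats meeting in a shared XML format can be sketched as follows, in Python with the standard library's ElementTree. The COMPONENT format, the field names and both "systems" are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def export_component(record):
    """System A: turn its internal dict into the shared (hypothetical) XML format."""
    elem = ET.Element("COMPONENT", {"ID": record["id"]})
    ET.SubElement(elem, "NAME").text = record["name"]
    ET.SubElement(elem, "PINS").text = str(record["pins"])
    return ET.tostring(elem, encoding="unicode")


def import_component(xml_text):
    """System B: read the shared format into its own tuple layout."""
    elem = ET.fromstring(xml_text)
    return (elem.get("ID"), elem.findtext("NAME"), int(elem.findtext("PINS")))


message = export_component({"id": "c1", "name": "Timer", "pins": 8})
print(message)
print(import_component(message))
```

Neither side needs to know how the other stores its data; both only need to agree on the format in the middle, which is what a common DTD would pin down.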

There is already an XML DTD for chemists, called CML. (See
reference 11.) CML will be very useful for exchanging research
results and other data between chemists and companies working with
chemistry in any way. It can also be used with Java applets in education. The
list of possibilities just goes on and on.
