E-Science E-Business E-Government and Their Technologies: Core XML
Unicode
Unicode (http://www.unicode.org) is an international
standard character set that covers alphabets of all the
world’s common written languages.
• Eventually it should cover all languages, living and dead.
• Unicode helps make the Web truly “worldwide”?!
Unlike, say, ASCII, which allows for 128 characters,
Unicode has space for over 1,000,000, of which around
96,000 are currently allocated.
Unicode itself assigns a unique sequence number (code
point) to any character, regardless of its alphabet.
• Three Unicode encoding forms map these code points to
sequences of fixed size units—UTF-8, UTF-16, UTF-32.
Unicode Code Points
A Unicode code point is a numeric value between 0 and
10FFFF₁₆, commonly denoted in one of the formats:
U+XXXX
U+XXXXX
U+10XXXX
where X is a hexadecimal digit.
There are a total of 1,114,112 (= 17 · 16⁴) code points, but most of
the world’s common characters are encoded in the first 65,536
points, the Basic Multilingual Plane (BMP).
• 2048 code points in BMP are disallowed because their values have a
special role in UTF-16 encoding.
For each assigned character code, the Unicode standard defines
a name, and “semantic” properties like case, directionality, ...
Planes
The space of 17 · 2¹⁶ Unicode code points is
conventionally divided into 17 planes of 2¹⁶ points each.
(Table of currently used planes and figure showing the layout of
planes not reproduced.)
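Since each plane holds 2¹⁶ points, the plane number of a code point is just its value shifted right by 16 bits. A minimal sketch of that arithmetic follows; the class and method names (PlaneDemo, planeOf, notation) are illustrative and not part of any standard API.

```java
class PlaneDemo {
    /** Plane number (0 to 16) of a code point: the bits above the low 16. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    /** Formats a code point in the conventional U+XXXX notation. */
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration only
        int[] samples = { 0x41, 0x3A3, 0xFFFF, 0x10000, 0x1D11E, 0x10FFFF };
        for (int cp : samples) {
            System.out.println(notation(cp) + " is in plane " + planeOf(cp));
        }
    }
}
```

Everything up to U+FFFF lands in plane 0 (the BMP); U+10FFFF, the largest code point, lands in plane 16.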
Blocks
Planes are subdivided into blocks.
Blocks have variable size. Each block contains the
characters of one alphabet or a group of related
alphabets.
The following slides are a random sampling of the
blocks in BMP.
• I have put 128 code points on each slide, but this is just what
would fit… no general significance to pages of size 128.
For all blocks in the current Unicode standard see:
http://www.unicode.org/charts/
“Basic Latin” (a.k.a. ASCII)
“Latin 1” (supplement)
“Greek and Coptic”
“Arabic” (1 of 2)
“Devanagari”
“Hangul Jamo” (1 of 2)
“CJK Unified Ideographs” (1 of 164)
(Figure: Unicode allocation, layout of the Basic Multilingual Plane.)
(Figure: Unicode allocation, layout of Plane 1.)
Encoding Forms
In electronic documents or computer programs each
Unicode code point is normally represented as a
sequence of one or more units, each unit having a
convenient, fixed number of bits.
The Unicode standard defines 3 encoding forms.
The most straightforward is UTF-32, in which the units
have size 32 bits.
This unit is easily large enough to hold the integer value
of a single code point, so UTF-32 encoding is “obvious”.
But for nearly all documents, UTF-32 wastes at least
half the available storage space.
• Also, most programming languages work with 8 bit or 16 bit
character units.
UTF-16
The UTF-16 encoding form breaks Unicode characters
into 16 bit units.
• Java, for example, uses UTF-16 for chars and Strings.
One 16 bit unit is not large enough to represent all
possible Unicode code points.
Code points higher than 2¹⁶−1 are split over two
consecutive units.
• These are called surrogate pairs. The leading unit is a high-
surrogate unit; trailing is a low-surrogate unit.
• There are 1024 code points reserved in the BMP for high
surrogates, and 1024 more reserved for low surrogates.
This allows for 1024 · 1024 surrogate pairs representing code points
higher than 2¹⁶−1, while ensuring a legal BMP code point can always
be represented in a single unit, and such a unit can never be confused
with a surrogate unit.
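The split into high and low surrogates can be sketched in a few lines of Java. The method below follows the standard UTF-16 arithmetic (subtract 10000₁₆, then place 10 bits in each unit); the class name SurrogateDemo and the sample character are only illustrative, and in practice one would call Character.toChars() rather than write this by hand.

```java
import java.util.Arrays;

class SurrogateDemo {
    /** Splits a code point above U+FFFF into its two UTF-16 units. */
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 significant bits
        char high = (char) (0xD800 + (offset >>> 10));   // top 10 bits
        char low  = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int sample = 0x1D11E; // a Plane 1 character, chosen only as an example
        char[] pair = toSurrogatePair(sample);
        System.out.printf("U+%X -> %04X %04X%n", sample, (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion
        System.out.println(Arrays.equals(pair, Character.toChars(sample)));
    }
}
```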
UTF-16 Bit Distribution
UTF-8
The UTF-8 encoding form breaks Unicode characters
into 8 bit units (i.e., individual bytes).
UTF-8 is a variable-width encoding with the following
properties:
• Any Unicode code point maps to 1, 2, 3, or 4 bytes.
• Byte sub-sequences for individual characters can always be
recognized by local search in the encoded string.
• The Basic Latin block code points (U+0000..U+007F) map
to one byte, identical to their ASCII value.
• All code points in the BMP map to at most 3 bytes.
• For European texts UTF-8 will normally use 8 or 16 bits per
character (vs 16 bits for UTF-16).
• For East Asian texts UTF-8 will normally use 24 bits per
character (vs 16 bits for UTF-16).
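The byte counts listed above can be checked with a short program. This is a sketch only: utf8Length() encodes the standard range boundaries for 1, 2, 3 and 4 byte sequences, and main() compares them with what the JDK's own UTF-8 encoder produces. The class name and sample characters are illustrative.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    /** Number of bytes UTF-8 needs for one code point (surrogates excluded). */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (codePoint <= 0x7F)   return 1;  // Basic Latin: same byte as ASCII
        if (codePoint <= 0x7FF)  return 2;
        if (codePoint <= 0xFFFF) return 3;  // rest of the BMP
        return 4;                           // planes 1 to 16
    }

    public static void main(String[] args) {
        // Latin, accented Latin, Greek, CJK, and a Plane 1 character
        int[] samples = { 'A', 0xE9, 0x3A3, 0x4E2D, 0x1D11E };
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int viaJdk = s.getBytes(StandardCharsets.UTF_8).length;
            int utf16Bytes = 2 * s.length(); // Java strings are UTF-16
            System.out.printf("U+%04X: UTF-8 %d bytes (JDK: %d), UTF-16 %d bytes%n",
                    cp, utf8Length(cp), viaJdk, utf16Bytes);
        }
    }
}
```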
UTF-8 Bit Distribution
Encoding Schemes
The 3 encoding forms don’t quite complete the encoding
schemes of Unicode, because they don’t address the
endianness with which the UTF-32, UTF-16 numeric
unit values are rendered to bytes (byte-serialization).
To allow applications to distinguish the endianness of a
given document instance, Unicode allows a Byte Order
Mark (BOM) as the first character of a document.
• BOM is a code point (U+FEFF) for which the byte-reversed
unit value doesn’t correspond to a legal code point, so serves
to determine the actual byte order.
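As an illustration, an application might inspect the first few bytes of a document for a serialized U+FEFF. The sketch below is an assumption about how such a check could look, not a complete encoding detector: the class name BomDemo and the returned labels are illustrative, and the byte sequence FF FE 00 00 is treated as UTF-32 even though it could in principle be a little-endian UTF-16 BOM followed by U+0000.

```java
class BomDemo {
    /** Guesses the encoding scheme from a leading byte order mark, or returns null. */
    static String detect(byte[] b) {
        // UTF-32 checks come first: FF FE 00 00 also starts with the UTF-16LE BOM
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xEF, 0xBB, 0xBF))       return "UTF-8";
        if (startsWith(b, 0xFE, 0xFF))             return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE))             return "UTF-16LE";
        return null; // no BOM: the caller must rely on other information
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // BOM followed by "A", both as big-endian 16 bit units
        byte[] sample = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        System.out.println(detect(sample));
    }
}
```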
The Seven Unicode Encoding Schemes
Unicode Summary
Unicode is a large and important standard that is a
foundation for XML, HTML, etc.
Although you are unlikely to manipulate the encodings
yourself, you should be aware of the pros and cons of
UTF-16, UTF-8.
• UTF-8 is backwards compatible with ASCII—Basic Latin
texts can be read by legacy applications.
• UTF-16 is more compact for East Asian texts. It is the
internal representation used by Java, C#, ECMAScript, …
Core XML I:
The XML Specification
Introduction
In this section we will describe core XML, as
defined by the XML specification document from
W3C.
XML is a format for documents—originally
documents for the Web—but its scope is wider than
that.
XML is a subset of SGML—Standard Generalized
Markup Language. Some features of XML exist
simply for compatibility with SGML.
XML can also be viewed as a kind of generalization
of HTML—presumably familiar from the Web.
XML Parsers and Applications
For purposes of this section an application is any
program that reads data from an XML document.
Applications normally do not (and probably should
not) read the text of XML documents directly.
The XML specification assumes that this text is initially
processed by a piece of software called an XML
processor. We will also refer to this as an XML parser.
The parser exhaustively checks that the text is in a legal
XML form, then extracts the essential data from the
document, and hands that data to the application.
Reading XML Data
XML Source:
<?xml version="1.0"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN"
  "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg width="500" height="500">
  <g transform='rotate(45)'>
    <circle cx='150' cy='50' r='25'/>
    <text x='125' y='100'>A Circle</text>
  </g>
</svg>
The XML Parser reads the XML source and hands the parsed XML data
to the Application.
Parsed XML Data:
svg      width = 500, height = 500
  g      transform = rotate(45)
    circle   cx = 150, cy = 50, r = 25
    text     x = 125, y = 100
      A Circle
Well-formed Documents
An XML document follows a strict syntax. For
example:
• An XML document contains regions of text called elements,
delimited by matching start-tags and end-tags. Elements must
be correctly nested.
• Start-tags may include attribute specifications, where attribute
values are strings delimited by matching quote marks.
A document that obeys the full set of these rules is
called well-formed.
Every legal XML document must be well-formed,
otherwise it cannot be parsed.
Examples
Well-formed:
<html>
<body style="font-style: italic">
This is a well-formed document.
</body>
</html>
Not well-formed:
<html>
<body style=font-style: italic>
This is not a well-formed document.
</html>
</body>
• The style attribute value is not in quote marks, and the html
and body tags don’t nest correctly.
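A parser makes the same judgment mechanically. The following sketch feeds both examples to the JAXP parser bundled with the JDK and reports whether each is accepted; the class name WellFormedDemo and the helper isWellFormed() are illustrative names, not part of JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedDemo {
    /** Returns true if the text parses as XML, false if the parser rejects it. */
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing its own message
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness error is fatal to the parse
        }
    }

    public static void main(String[] args) throws Exception {
        String good = "<html><body style=\"font-style: italic\">"
                + "This is a well-formed document.</body></html>";
        String bad = "<html><body style=font-style: italic>"
                + "This is not a well-formed document.</html></body>";
        System.out.println("first example:  " + isWellFormed(good));
        System.out.println("second example: " + isWellFormed(bad));
    }
}
```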
Install Xerces
The Xerces parser is a product of the Apache XML
project, http://xml.apache.org.
Follow the “Xerces Java 2” project link and go to the
download area, then to the master distribution
directory or a mirror directory.
Download Xerces-J-bin.2.6.2.zip, and extract it to a
suitable place, e.g. C:\
• When extracting, remember to select “Use folder names”!!
• This should create a folder called, e.g., C:\xerces-2_6_2\.
Put Xerces on your Class Path
Using the menu at
Control Panel→System→Advanced→Environment Variables
add the jar files xercesSamples.jar, xercesImpl.jar,
and xml-apis.jar, to your class path.
• E.g. append
…;C:\xerces-2_6_2\xercesSamples.jar;C:\xerces-2_6_2\xercesImpl.jar;C:\xerces-2_6_2\xml-apis.jar
to the value of your CLASSPATH variable.
Example Using Xerces
Copy the two HTML examples given above to files
called, say, wellformed.html and illformed.html. Then,
in a new Command Prompt window, try running the
commands:
> java dom.Writer wellformed.html
…
> java dom.Writer illformed.html
…
• The first command should just echo the document. The
second should print a syntax error message.
• dom.Writer is one of the sample applications in the Xerces
release. It simply uses the Xerces parser to convert the
source file to a tree data structure (DOM), then converts the
tree back to nicely formatted XML, which it prints.
“Rolling Your Own” Parser?
People approaching XML sometimes decide they can
write their own “lightweight” parser that handles just
the bit of XML their application needs.
• In general this is a bad idea!
• We will see later that even basic XML is a moderately
complex specification; unless you are going to invest a lot of
effort it is unlikely you can parse the full specification more
efficiently than existing parsers.
• If you subset the specification you may be compromising the
most crucial advantage that XML brings to your application
—interoperability.
• Later in these lectures we will see how to use the Xerces
parser from your own Java programs, to read XML input.
Valid Documents
An XML document may optionally include a Document
Type Definition (DTD).
• This declares the names of all elements and attributes
appearing in the document, and how they may nest.
• The DTD also declares and defines entities that may be
referenced from within document content.
A well-formed XML document that includes a DTD—
and accurately follows the declarations in that DTD—is
called valid.
Invalid Documents
It is quite possible to parse invalid (but well-formed)
documents, by using a non-validating parser.
Many applications accept XML files without DTDs,
which are therefore technically invalid.
Applications may exploit “validation” mechanisms
other than DTDs. An important one is XML Schema
which we will discuss later.
• A document validated against an XML Schema usually does
not have a DTD, so technically is invalid as far as the base
XML specification is concerned.
• But of course it is valid relative to the XML Schema
specification!
Validation Side Effects
The use of a validating parser certainly affects what
documents are treated as legal.
In some cases “switching on” validation may also alter
the exact data passed from the parser to application.
These effects will be considered when we discuss DTDs.
Physical Entities
An XML document is represented by one or more
“storage units” (typically files), called “entities”.
We can enumerate five kinds:
• Document entities—root XML documents.
• Parsed external entities, which contain fragmentary XML
content.
• External DTD subsets, which contain some or all of the DTD
declarations needed by a document.
• External parameter entities, which also contain fragmentary
DTD content.
• Unparsed external entities, which are usually complete
“binary” files in some native format (not XML).
Physical Structure
The structure of a non-trivial XML document is
illustrated in the following figure.
Every XML document must have exactly one document
entity.
It may also involve zero or more external entities:
• The document entity may reference any number of external
general entities. These can be parsed external entities or
unparsed external entities. A parsed external entity may in
turn reference other external general entities.
• The document may have at most one external DTD subset.
• A DTD subset in the document entity, or an external DTD
subset, may reference any number of external parameter
entities (which may in turn reference other external
parameter entities).
A Complex XML Document
(Figure: diagram with boxes labeled Document Entity, DTD, External
DTD Subset, External Parameter Entity (twice), Parsed External
Entity, and Unparsed External Entity.)
Syntactic Features
The following two tables summarize the “top-level”
syntax of all the constructs in XML. Full details will be
given in later slides, as needed.
• The first columns give an abbreviated example of the syntax,
the second columns (“what?”) describe the construct, and the
third columns (“where?”) specify the places in an XML
document where the construct may appear.
• In a “where?” column, Document means at the top-level of the
document entity, and Content means in the kind of content
allowed in an element—also called Parsed Character Data.
• A Literal is character data in quotes—exactly what can
appear in a literal depends strongly on its context.
• XML Names will be discussed shortly.
Syntax I: Logical Structures
Example Syntax            What?                    Where?
<Name …>Content</Name>    Element                  Document, Content
Name = Literal            Attribute specification  Element start tag
<?xml …>                  XML declaration /        Document /
                          Text declaration         External entity
<?Name …>                 Processing instruction   Document, DTD, Content
<!-- … -->                Comment                  Document, DTD, Content
<!DOCTYPE …>              DTD                      Document
<!ELEMENT …>              Element declaration      DTD
<!ATTLIST …>              Attributes declaration   DTD
<!ENTITY …>               Entity declaration       DTD
<!NOTATION …>             Notation declaration     DTD
Syntax II: References, Sections
Example Syntax What? Where?
&#Code-point; Character reference Content, Literal
Document Types
In the syntax for the document entity, we saw that the
document type declaration was an optional feature.
This declaration, if present, contains the document type
definition, or DTD.
A validating parser will read the DTD, which should
contain (among other things) declarations of all the
elements and attributes appearing in the body of the
document.
The DTD is required if the parser is validating, but
optional for a non-validating parser.
• Even a non-validating parser may read the DTD if it is
present, to look for entity declarations. These will be
discussed later.
Document Type Declaration
The document type declaration, if present, appears
before the root element of the document.
The most general form of this declaration contains
three things:
1. The type (i.e. name) of the following root element
2. An identifier for an External DTD Subset
3. An Internal DTD Subset.
Items 2 and 3 are optional.
The DTD itself is either given in an external file
(“external subset”), or “in line” in the document
(“internal subset”), or divided between the two.
General Syntax
General syntax is one of:
<!DOCTYPE Name [ Declarations ] >
<!DOCTYPE Name External-ID >
<!DOCTYPE Name External-ID [ Declarations ] >
where Name is the type of the root element, External-ID
is an identifier for an external entity—a separate file
containing the external DTD subset—and Declarations
is the internal DTD subset.
• Syntactically, the form:
<!DOCTYPE Name >
is allowed, but can never yield a valid document. (Why?)
External Entity Identifiers
External entity identifiers can occur in a couple of
places: they will be discussed in the next section when
we discuss entity declarations.
Meanwhile the simplest form is “SYSTEM Literal”,
where Literal contains a URI—a file name or URL or
(in principle) a URN.
So a valid document with an external DTD might be:
<!DOCTYPE my-root SYSTEM "my-type.dtd">
<my-root>
…
</my-root>
Collecting Things Together
So far we have described an ideal subset of XML in
which a document contains a DTD, elements, attributes,
and character data, all laid out linearly.
In reality fragments of content and DTD may be
defined in places other than where they ultimately
appear (perhaps in other files), and portions of text
may need special processing before they are made
available to the application.
To complete the discussion we must cover:
• Character references and entity references.
• Internal and external entity declarations, and notations.
• CDATA sections.
• Conditional sections.
Character References
Character references can be viewed as an “escape”
mechanism that allows us to include specially-treated
or hard-to-type characters in the XML document.
They include the Unicode code point for a character,
taking either of the forms “&#dd…;” or “&#xXX…;”,
where ds are decimal digits and Xs are hexadecimal
digits.
For example:
• &#38; or &#x26; represents “&” (ampersand).
• &#60; or &#x3C; represents “<” (left angle bracket).
• &#931; or &#x3A3; represents “Σ” (large Greek sigma).
One application is for including the literal characters
“<” or “&” in parsed character data.
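To see character references being resolved, one can parse a tiny document and look at the text the parser hands to the application. This is a sketch using the JAXP parser shipped with the JDK; the class name CharRefDemo, the helper rootText() and the sample element p are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefDemo {
    /** Parses a small document and returns the text content of its root element. */
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Decimal and hexadecimal references to ampersand, left angle bracket, sigma
        String xml = "<p>&#38; &#x26; &#60; &#x3C; &#931; &#x3A3;</p>";
        System.out.println(rootText(xml));
    }
}
```

The application never sees the reference syntax, only the characters it stands for.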
CDATA Sections
CDATA sections provide a way of including a section of
character data in an XML document.
• The section can include “<” and “&” characters without
escaping—in a CDATA section these characters have no
special significance (so “markup” syntax is ignored).
The syntax is
<![CDATA[ Text ]]>
where Text is any text, except that it must not contain
the string “]]>”.
• One application is for including scripting in XML—e.g.
JavaScript uses < and & for operators.
• You cannot include any characters generally forbidden in
XML: can’t put raw “binary” data in a CDATA section!
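A short sketch of the scripting use case: the fragment of script inside the CDATA section below contains “<” and “&” unescaped, yet the document parses and the application receives the text unchanged. The element name script, the script text, and the class name CdataDemo are all illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    /** Parses a small document and returns the text content of its root element. */
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Without the CDATA section, "<" and "&&" here would be markup errors
        String xml = "<script><![CDATA[if (a < b && b < c) swap();]]></script>";
        System.out.println(rootText(xml));
    }
}
```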
Entity References
A character reference includes a single Unicode
character in the document. An entity reference includes
the content of some “entity”, which may be the contents
of an external file.
The simple syntax is:
&Name;
where Name is an XML name.
The name of the entity, Name, must have been declared
in the document DTD. There are just five exceptions to
this rule.
Predefined Entities
As a convenience the entities amp, lt, gt, apos, and quot
are considered predefined.
• You may declare them in a DTD, but it isn’t necessary.
They must contain the single-character values:
• &amp; expands to “&” (ampersand).
• &lt; expands to “<” (left angle bracket, or less than).
• &gt; expands to “>” (right angle bracket, or greater than).
• &apos; expands to “'” (single quote, or apostrophe).
• &quot; expands to “"” (double quote).
They provide a more convenient way of including reserved
characters.
Note these are entity references, not character
references (affects processing in some contexts).
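When a program generates XML, the predefined entities are the usual way to write out arbitrary text safely. The helper below is a minimal sketch of such an escaping routine; XmlEscapeDemo and escape() are hypothetical names, and real applications would normally let an XML serializer do this work.

```java
class XmlEscapeDemo {
    /** Replaces the five reserved characters by references to the predefined entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("if (a < b && b > c) say(\"hi\")"));
    }
}
```

Escaping “&” along with the other characters matters: otherwise text that already looks like a reference would be expanded by the parser reading it back.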
A Hofstadteresque XHTML Example
<html>
<body>
The source of this document is:
<pre>
&lt;html&gt;
&lt;body&gt;
The source of this document is:
&lt;pre&gt;
&amp;lt;html&amp;gt;
…
&amp;lt;html&amp;gt;
&lt;/pre&gt;
&lt;/body&gt;
&lt;/html&gt;
</pre>
</body>
</html>
Declaring Entities
Entities are defined in the DTD by an ENTITY
declaration with the syntax:
<!ENTITY Name Definition >
Here Name is of course the name of the declared
entity.
This general form can declare:
1. Internal entities
2. Parsed external entities
3. Unparsed external entities
Only 1 and 2 can be included by an entity reference.
• Later we will see another form of the ENTITY declaration
that defines parameter entities.
Internal Entities
The simplest kind of entity is an internal entity. In this
case the Definition is just a literal string, and the entity
behaves like a kind of macro.
The string can contain character data and markup.
E.g.:
<!ENTITY me "Bryan" >
<!ENTITY lt "&#38;#60;" >
<!ENTITY Sigma "&#x3A3;" >
<!ENTITY flag "Stars &amp; Stripes" >
<!ENTITY icon '<image xlink:href="icon.jpg" />' >
These declarations allow the shorthand forms “&me;”,
“&lt;”, “&Sigma;”, “&flag;”, “&icon;” in the body of the
document.
Replacement Text for Internal Entities
Character references (and also parameter entity
references, see later) appearing in the entity definition
are expanded when the declaration is processed. So the
replacement text for lt is “&#60;” and for Sigma is “Σ”.
References to other entities appearing in an entity
definition are not expanded at the time the declaration
is processed (see next slide). The replacement text for
flag is “Stars &amp; Stripes”.
Expansion of General Entities
When the parser encounters an entity reference in the
content of a document, the reference is expanded to the
replacement text for the entity.
The parser then resumes processing, starting at the
beginning of the inserted text. If the replacement text
contains further entity references, these are replaced in
turn, as they are encountered.
An entity reference must not expand to fragmentary
markup. For example, the replacement text may
include complete elements, but not isolated start tags,
or isolated “<” characters.
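The step-by-step expansion can be observed with a small experiment. In the sketch below the entity sig refers to two other entities, one of which in turn refers to the predefined entity amp; the parser expands them as it meets them. The entity names, the note element, and the class name EntityExpansionDemo are illustrative, and the code assumes the default JDK parser settings, which permit an internal DTD subset.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityExpansionDemo {
    // Hypothetical sample: sig refers to me and flag, flag refers to amp
    static final String SAMPLE =
          "<!DOCTYPE note [\n"
        + "  <!ENTITY me \"Bryan\">\n"
        + "  <!ENTITY flag \"Stars &amp; Stripes\">\n"
        + "  <!ENTITY sig \"&me; under the &flag;\">\n"
        + "]>\n"
        + "<note>&sig;</note>";

    /** Parses a document and returns the text content of its root element. */
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // &sig; -> "&me; under the &flag;" -> each nested reference expanded in turn
        System.out.println(rootText(SAMPLE));
    }
}
```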
External Entities
For an external entity the Definition appearing in the
ENTITY declaration is an external ID—same syntax as
in DOCTYPE declarations.
There are two forms (for parsed external entities):
<!ENTITY Name SYSTEM URI-Literal >
<!ENTITY Name PUBLIC Public-ID-Literal
URI-Literal >
The first form is fairly self-explanatory: URI-Literal is
typically an absolute or relative URL.
• If relative, it is relative to the location of the document entity
or external entity containing the ENTITY declaration.
References to External Entities
External entities may be referenced anywhere markup
can appear in document content, except in the literal
value of an attribute.
• Internal entities can be referenced in attribute specifications.
Public Identifiers
Where an external entity is likely to be widely used one
can give a PUBLIC identifier.
• This acts as a logical identifier, something like a URN.
• XML standard itself doesn’t specify a syntax for public
identifiers, but XML-based standards usually follow the
SGML format for Formal Public Identifiers.
For example the DOCTYPE declaration for an
XHTML 1.1 document should be:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
• When a document type declaration or an entity declaration
gives a public identifier, this must be followed by a URI,
which acts as a fall back for locating the entity.
Format of Parsed External Entities
A parsed external entity may optionally start with a text
declaration. The rest of the file should be Content, such
as may appear in an XML element.
• The replacement text for an external entity is the unprocessed
content (minus text declaration).
An external DTD subset may optionally start with a text
declaration. The rest of the file should be Declarations,
such as may appear in an internal DTD subset.
• We will see that external DTD subsets allow a couple of
processing options beyond those allowed in internal subsets.
Text Declaration
It is recommended to start a parsed external entity with
a text declaration.
The text declaration has syntax:
<?xml …optional version declaration… encoding=Literal ?>
i.e. syntax is identical to an XML declaration, except
that now it is the encoding declaration that is
mandatory (and no standalone declaration is allowed).
• It is allowed for the entity character encoding to be different
from the document that references it.
XHTML Example
A document entity:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" [
<!ENTITY myfooter SYSTEM "myfooter.xml" >
]>
<html>
<body> This is one page in a complicated Web site.
&myfooter; </body>
</html>
An external entity “myfooter.xml”:
<hr/>Maintained by
<a xlink:href="mailto:[email protected]">
Bryan Carpenter</a>.
Combining DTD Subsets
The example above is illustrative—sadly no known
Web browser will recognize this kind of DTD.
In principle it shows how one can use both an external
subset and an internal subset in the same DTD.
• We will see shortly that one can go further with this using
parameter entities.
• Although the internal subset appears after the reference to
the external subset, the internal subset is processed first. If
there are conflicting declarations of entities, the internal
subset takes precedence.
• In general if an entity or attribute is declared more than once,
declarations after the first are ignored.
Unparsed Entities
Unparsed entities are declared in a similar way to
parsed entities, but used in a completely different way.
The XML parser does not include the contents of the
entity in the XML data, it just forwards a reference to
the application, e.g. as a URI.
An unparsed entity must be annotated with information
about what type of content it holds (or, equivalently,
what kind of application can process it).
• This notation is also passed to the application.
We give the syntax on the following slides. There is
quite a lot of syntax with modest payoff.
Notation Declarations
Before a notation can be used to label an unparsed entity,
it must be declared. Each declaration must give a system or
public identifier, but what the identifier means is quite
free, e.g. the following are legal:
<!NOTATION jpeg SYSTEM "image/jpeg" >
<!NOTATION perl SYSTEM "/usr/bin/perl" >
<!NOTATION Name PUBLIC Public-ID >
• According to the specification, public and system identifiers
may “locate a helper application capable of processing data”.
• Some authors interpret that they may be, say, MIME types.
• A declared notation may also be the target of a processing
instruction, or specified in the value of an attribute of type
NOTATION.
Declaring and Using Unparsed Entities
The declaration is similar to that of a parsed external entity,
except it ends with “NDATA Name” where Name is a notation.
E.g.:
<!ENTITY notafake SYSTEM "notafake.jpg" NDATA jpeg >
The only way to reference a declared unparsed entity is
in an attribute declared of type ENTITY or ENTITIES,
e.g. if we have:
<!ATTLIST figure source ENTITY #REQUIRED >
then the name notafake can finally be used as follows:
<figure source="notafake"/>
Parameter Entities
General entity references are expanded when they
appear in the document content. They are not useful
for abstracting sections within a DTD.
• The only time general entities are expanded while processing
a DTD is if they appear in the default value of an attribute.
There is a separate “macro-expansion” mechanism that
is designed for use within DTDs. This uses parameter
entities.
Declaration and Reference
Declaration of parameter entities follows a syntax
similar to parsed general entities, except the name is
preceded by a “%” sign, e.g.:
<!ENTITY % my-dec '<!ENTITY me "Bryan" >' >
This declares the internal parameter entity my-dec. Its
value is the declaration of the general entity me.
To actually insert the declaration of me, later in the
The Document Object Model
The Document Object Model or DOM is a set of W3C
specifications that define standard ways to access parts
of an XML or HTML document from within a program.
This addresses, for example, how a Java program
extracts the data from an XML document, after the
document has been processed by an XML parser.
But the origins of the DOM lie in JavaScript
programming for interactive and dynamic Web pages.
Dynamic HTML Example
<html>
<head>
<script language="javascript">
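function appendText() {
// Find the first p element, then the text node nested inside it
var paraNode = document.getElementsByTagName("p") [0] ;
var textNode = paraNode.firstChild ;
// Append to the text held by that node
textNode.nodeValue += " Ouch!" ;
}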
</script>
</head>
<body>
<p>Hello.</p>
<form>
<input type="button" value="Push Me" onclick="appendText()"/>
</form>
</body>
</html>
The DOM from JavaScript
The DOM represents an XML (or HTML) document as
a series of nodes.
In the example, the JavaScript method appendText() is
called when the user clicks on the button labeled “Push
Me”.
• This method identifies the DOM node representing the <p>
element, extracts the nested text node, and modifies the data
(the text) associated with that text node.
• The function getElementsByTagName() is a method defined
in the DOM, and firstChild and nodeValue are node
properties in the DOM.
A Document is a Tree
Here is how an example fragment of HTML can be
thought of as a tree. (Figure not reproduced.)
DOM Nodes
In the DOM, one builds the document tree out of a set
of Node objects.
Each Node object has a set of capabilities (properties
and methods) and implements specific interfaces.
(Figure: the Node interface, with its properties and methods.)
Remarks
DOM interfaces are traditionally defined in the OMG’s
Interface Definition Language.
• IDL interfaces can be mechanically converted to interface
definitions for languages like Java, ECMAScript, C++, …
By design, the Node interface is sufficiently generic
that any DOM tree can be manipulated using the
methods and properties of this interface alone.
• Use of the more specialized derived interfaces (Element,
Attr, etc) is at the programmer's option. Those interfaces
provide more type-safety, and various type-specific
convenience methods and properties.
nodeName, nodeValue Properties
Each node type has different rules for values of some of the
properties—most importantly nodeName and nodeValue.
The attributes property is only relevant for an element node.
(Table of nodeName and nodeValue rules for each node type not reproduced.)
Node Children
The following node types may have children, as
indicated, in the DOM tree:
• Document: DocumentType, Element (maximum of one),
ProcessingInstruction, Comment.
• DocumentFragment: Element, Text, CDATASection,
EntityReference, ProcessingInstruction, Comment.
• Element: Element, Text, CDATASection, EntityReference,
ProcessingInstruction, Comment.
• Attr: Text, EntityReference.
• EntityReference: Element, Text, CDATASection,
EntityReference, ProcessingInstruction, Comment.
• Entity: Element, Text, CDATASection, EntityReference,
ProcessingInstruction, Comment.
The Document Interface
The Document interface has some indispensable factory methods
for creating other node types:
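A few of these factory methods can be seen at work in the sketch below, which builds a small tree in memory rather than parsing one. createElement(), createTextNode() and createComment() are standard methods of org.w3c.dom.Document; the element name greeting, the attribute, and the class name FactoryMethodsDemo are illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FactoryMethodsDemo {
    /** Builds a one-element document holding a text node and a comment. */
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        // Nodes are created by the Document, then attached to the tree
        Element root = doc.createElement("greeting");
        root.setAttribute("lang", "en");
        root.appendChild(doc.createTextNode("Hello."));
        root.appendChild(doc.createComment(" built in memory "));
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        NodeList children = build().getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            System.out.println(children.item(i).getNodeName()
                    + " : " + children.item(i).getNodeValue());
        }
    }
}
```

Creating a node does not place it in the tree; that happens only when appendChild() is called.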
Levels of the DOM
W3C recognizes 4 stages of evolution of the DOM,
called levels:
• Level 0: Legacy HTML “DOM” features from Navigator 3.0
and IE 3.0. No W3C specification.
• Level 1: Specification completed 1998. Basic tree structure
for HTML and XML:
Level 2 DOM
Specification completed 2000. Notably added support for Events.
Also XML Namespaces, DOM representation of Cascading Style
Sheets, and new facilities for manipulating the tree.
Level 3 DOM
Specification still in progress. It will add, for example, support
for XPath search operations on a DOM tree, and formally define
the mapping between DOM and XML Infoset. Load/Save
standardizes parsing/serialization APIs.
Using Xerces
Earlier we used the dom.Writer sample application
from the Xerces project to check XML documents.
In the following slides we illustrate how to use the
Xerces parser from your own Java program.
• We will use the conventional Java JAXP API to control
parsing.
• In future the Load/Save features of Level 3 DOM are likely to
be the preferred API (Xerces already provides an
experimental implementation).
• To control certain aspects of parsing, it may be necessary to
use the org.apache.xerces.parsers.DOMParser
implementation class directly.
Checking Well-Formedness
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.File;
public class MyChecker {
public static void main(String [] args) throws Exception {
File source = new File(args [0]) ;
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true) ;
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(source) ;
}
}
Remarks
In JAXP one first obtains a parser factory, then uses the
factory to create a parser instance (called a document
builder in JAXP).
• Behaviors of the parser (e.g. whether namespace-aware,
validating/non-validating, …) are controlled by setting
properties of the factory before creating the instance.
Actual parsing is done by the parse() method, which
returns a DOM instance (Document node).
• Example above does nothing if document is well-formed;
prints an exception if it is not.
Validating
public static void main(String [] args) throws Exception {
File source = new File(args [0]) ;
DocumentBuilderFactory factory =
DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true) ;
factory.setValidating(true) ;
DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new MyErrorHandler()) ;
Document document = builder.parse(source) ;
}
Simple Error Recovery
import org.xml.sax.*;

class MyErrorHandler implements ErrorHandler {
    public void warning(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }
    // Recoverable errors, including validity errors: report and carry on.
    public void error(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }
    // Fatal errors (e.g. the document is not well-formed): report and stop.
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
        System.exit(1);
    }
}
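To try validation without any external files, here is a minimal sketch that validates against a DTD given in the document's internal subset and counts the validity errors reported to the error handler. The class name ValidationDemo, the method countValidityErrors and the tiny note/to vocabulary are all hypothetical, invented for this illustration. The exact number and wording of the messages depend on the parser, so the sketch only distinguishes "none" from "some".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Hypothetical DTD: a note contains exactly one "to" element holding text.
    static final String DTD =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";

    // Parses with validation switched on and returns the number of
    // (recoverable) errors reported, which include validity errors.
    static int countValidityErrors(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { errors.add(e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countValidityErrors(DTD + "<note><to>Ann</to></note>"));
        System.out.println(countValidityErrors(DTD + "<note><from>Bob</from></note>"));
    }
}
```

Collecting the messages rather than printing them, as MyErrorHandler does, is a design choice that lets the caller decide what to do with an invalid document.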
Processing the DOM
The two previous examples ended by creating a
Document object, then did nothing with it.
The following example summarizes several important
Java DOM classes and methods in one recursive
method that extracts all useful information in the
elements, attributes, and text nodes of a document.
• Assumes a declaration like
import org.w3c.dom.* ;
is in effect.
• After parsing the document, this method could be invoked
from main() by:
process(document.getDocumentElement()) ;
Recursive Processing of DOM
static void process(Node node) {
    // Print the node name (element or attribute name, "#text" for text nodes).
    System.out.println(node.getNodeName());
    switch (node.getNodeType()) {
        case Node.ELEMENT_NODE:
            // Visit the element's attributes, then its children, recursively.
            NamedNodeMap attributes = node.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++)
                process(attributes.item(i));
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++)
                process(children.item(i));
            break;
        case Node.TEXT_NODE:
        case Node.ATTRIBUTE_NODE:
            // Text and attribute nodes carry their content in the node value.
            System.out.println(node.getNodeValue());
            break;
    }
}
Interactive XML
We have been discussing use of the DOM in Java as a convenient
intermediate representation for handling XML data.
This is different from the original application of the DOM,
illustrated in our introductory dynamic HTML example.
In that application, JavaScript event handlers manipulate the
DOM embedded in a Web browser, dynamically altering the text
displayed by the browser.
This technique can also be applied to XML documents that have
a visual representation, for example to Scalable Vector Graphics
documents.
The following screen captures are from www.svgarena.org.
XML Chess
XML Solitaire
Namespaces in XML
Motivation
XML DTDs allow us to define a set of customized
element, attribute, and entity names to support a
particular application.
But in a document of any complexity one may want to
mix and match markup from different application
domains.
In particular in a Web page authored using XHTML
tags one might want to embed pictures represented
with SVG elements; in a technical Web page one might
want to embed mathematical formulae represented
using MathML elements.
Namespaces
Typically different XML vocabularies are designed by
different committees, and it is impractical to avoid
name clashes.
In general we can’t simply merge the DTDs, and it is unclear
that this would be desirable anyway, since modularity would be lost.
The Namespaces in XML specification addresses this issue
by allowing each application to define names in a
particular namespace. A single XML document can
then include markup from more than one namespace.
Relation to the XML Spec
The namespaces specification was released a year after
the XML 1.0 specification.
It is formally 100% compatible with the XML
specification—it “merely” puts some restrictions on the
document form.
Significant practical issues arise. DTDs—set in stone in
the XML/SGML standards—are not an ideal match for
namespaces: DTDs can be used with namespaces, but it
takes some hacks.
• The likely intention was to move away from DTD validation
towards XML Schema validation, which is fully namespace
compatible. So far this transition is incomplete?
Qualified Names
The simple general idea is that names occurring in
document instances may be prefixed in a way that
identifies their namespace.
Here is an (imaginary!) example including SVG tags in
an XHTML document:
<body>
<h1>What are circles really like?</h1>
This is what a circle really looks like:
<object>
<svg:svg>
<svg:circle cx='150' cy='50' r='25'/>
</svg:svg>
</object>
</body>
Name Syntax
In the namespaces specification use of the colon, “:”, in
XML names is reserved.
Namespace-qualified names have at most one colon; if
there is a colon it separates the prefix from the local
part:
Prefix:Local-part
In the example above we had qualified SVG names
svg:svg
svg:circle
Here the prefixes are svg and the local parts are svg
and circle respectively. HTML names had no prefix,
only local parts.
Namespace Names
Each namespace (e.g. the XHTML namespace or the
SVG namespace) itself has a “name”.
• This is neither the prefix appearing on qualified names, nor
in general a legal XML name!
Instead a namespace name is an IRI.
• An IRI is an Internationalized Resource Identifier: it is a
generalization of a URI allowing non-ASCII characters.
E.g. the name of the XHTML 1.0 namespace is:
http://www.w3.org/1999/xhtml
and the name of the SVG 1.1 namespace is:
http://www.w3.org/2000/svg
Use of URIs
The use of URIs (or IRIs) for namespace names is quite
confusing.
Though namespace names follow the syntax of resource
identifiers, the namespace is not a resource; there need
not be any resource at the location—all that is relevant
is the sequence of characters in the URI. This string of
characters identifies the namespace.
One day this odd situation may be resolved to allow
applications to automatically find schema at namespace
URIs—see for example www.rddl.org.
For now namespace names and schema locations are
two independent IRIs.
Defining Prefixes
In principle an instance document can choose any
convenient prefix to use within that document.
• Though an existing DTD may restrict the choice.
A prefix is declared in an attribute named
xmlns:Prefix. The scope is the element the declaration
appears on, and any nested elements. The value specified
for the attribute must be the namespace name. E.g.:
<object xmlns:svg='http://www.w3.org/2000/svg' >
<svg:svg>
<svg:circle cx='150' cy='50' r='25'/>
</svg:svg>
</object>
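A namespace-aware DOM parser exposes the three parts of a qualified name separately. The sketch below parses the fragment above twice, once with the prefix svg and once with an arbitrary prefix g bound to the same namespace name, and prints prefix, local part and namespace name of the circle element. The class name QualifiedNameDemo and the method describeCircle are hypothetical; the DOM calls themselves (getElementsByTagNameNS, getPrefix, getLocalName, getNamespaceURI) are standard DOM Level 2 methods.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class QualifiedNameDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Returns "prefix | local part | namespace name" for the first
    // circle element in the SVG namespace.
    static String describeCircle(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);   // without this, names stay unsplit
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Lookup is by namespace name and local part; the prefix plays no role.
        Node circle = doc.getElementsByTagNameNS(SVG_NS, "circle").item(0);
        return circle.getPrefix() + " | " + circle.getLocalName()
                + " | " + circle.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describeCircle("<object xmlns:svg='" + SVG_NS + "'>"
                + "<svg:svg><svg:circle cx='150' cy='50' r='25'/></svg:svg></object>"));
        // prints: svg | circle | http://www.w3.org/2000/svg
        System.out.println(describeCircle("<object xmlns:g='" + SVG_NS + "'>"
                + "<g:svg><g:circle cx='150' cy='50' r='25'/></g:svg></object>"));
        // prints: g | circle | http://www.w3.org/2000/svg
    }
}
```

Both documents yield the same local part and namespace name, which is the sense in which the choice of prefix is arbitrary.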
Default Namespaces
If a “vocabulary” has been defined in a namespace, a
document instance must acknowledge this, even if
prefixes aren’t needed.
Use the attribute xmlns to declare a default namespace.
E.g. all XHTML documents should start:
<html xmlns="http://www.w3.org/1999/xhtml" >
<head>
…
</html>
Scope of Declarations
So far as the namespaces specification is concerned:
• Declarations of prefixes and default namespaces can appear
on any element in a document. The scope of a declaration is
limited to the element it appears on and the elements nested in it.
• Different namespaces can be the default in different elements.
• The same prefix could be used to represent different
namespaces in different parts of the document, or vice versa
(these are bad ideas!)
DTDs may limit these possibilities.
Default namespace declarations don’t affect attribute
names. Interpretation of an un-prefixed attribute name
is determined by the element it is attached to.
• Does this mean the attribute is in the same namespace as the
element? No! An un-prefixed attribute name is in no namespace.
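The point about attributes can be checked with a namespace-aware DOM parser. In this minimal sketch (the class name AttributeNamespaceDemo, the method namespaces and the class attribute value are hypothetical), the body element inherits the default XHTML namespace while its un-prefixed class attribute reports no namespace at all, which the DOM represents as null.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeNamespaceDemo {
    // Returns "<namespace of body> / <namespace of its class attribute>".
    static String namespaces() throws Exception {
        String xml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
                   + "<body class='intro'/></html>";
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // No whitespace in the source, so body is the first child of html.
        Element body = (Element) doc.getDocumentElement().getFirstChild();
        Attr cls = body.getAttributeNode("class");
        // The default namespace applies to the element, not to the attribute.
        return body.getNamespaceURI() + " / " + cls.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(namespaces());
        // prints: http://www.w3.org/1999/xhtml / null
    }
}
```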
Namespaces and DTDs
DTDs are not “namespace aware”. So far as DTDs are
concerned a qualified name with a colon is just an
atomic XML name.
A partial workaround is to make all names introduced
in DTD declarations parameter entity references, then
factor out the prefix as a single parameter entity that
can be set in a document’s internal DTD subset.
This is ugly, and it doesn’t solve all problems.
• E.g. DTDs still won’t recognize an equivalence between
names in the same namespace, if they have the same local
name but different prefixes: so far as DTDs are concerned
they are different XML names.
Namespaces and XML Schema
Later lectures discuss XML Schema—one of the
alternative validation frameworks to DTDs—in detail.
You will probably find the issues associated with
namespaces become much more concrete once XML
Schema (which are fully namespace-aware) are
understood.
Elementary XPath
XML Path Language
The XML Path Language, or XPath, is a language for
addressing nodes of an XML document.
• Understand “nodes” as in the DOM (although technically there
are minor differences).
XPath is an important part of XML Schema, XSLT, and
XPointer, and is used as a query language in some XML
databases.
In simple cases an XPath expression looks like a UNIX
path name, with nested directory names replaced by
nested element names, so for example:
• “/” corresponds to the document node.
• Expressions may be absolute (relative to /) or relative to some
context node.
Simple Examples
The XPath:
/planets
represents a document element called planets.
The XPath:
/planets/planet/density
represents the set of all elements named density that are
directly nested in any element named planet that is
directly nested in a document element planets.
The XPath:
/planets/planet/*
represents all elements directly nested in an element
planet directly nested in document element planets.
Example
The document is a simplified version of an example given in the
book Inside XML, by Holzner.
The highlighted node is the evaluation of
/planets
i.e. the planets document element:
<?xml version="1.0" encoding="UTF-8"?>
<planets>
  <planet>
    <name>Mercury</name>
    <mass>0.0553</mass>
    <day units="days">58.65</day>
    <radius units="miles">1516</radius>
    <density>0.983</density>
  </planet>
  <planet>
    <name>Venus</name>
    <mass>0.815</mass>
    <day units="days">116.75</day>
    <radius units="miles">3716</radius>
    <density>0.943</density>
  </planet>
  <planet>
    <name>Earth</name>
    <mass>1</mass>
    <day units="days">1</day>
    <radius units="miles">3960</radius>
    <density>1</density>
  </planet>
</planets>
Example
In the same planets document, the highlighted nodes are the
evaluation of
/planets/planet/density
i.e. the three density elements (0.983, 0.943 and 1).
Example
In the same planets document, the highlighted nodes are the
evaluation of
/planets/planet/*
i.e. all fifteen child elements of the three planet elements
(the name, mass, day, radius and density of each).
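These simple paths can be tried from Java using the standard javax.xml.xpath API. The following is a minimal sketch, not part of the original slides: the class name XPathBasics and the method selectNames are hypothetical, and the planets document is inlined as a string so the example is self-contained. It returns the names of the selected nodes in document order.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathBasics {
    static final String PLANETS = "<planets>"
        + "<planet><name>Mercury</name><mass>0.0553</mass><day units='days'>58.65</day>"
        + "<radius units='miles'>1516</radius><density>0.983</density></planet>"
        + "<planet><name>Venus</name><mass>0.815</mass><day units='days'>116.75</day>"
        + "<radius units='miles'>3716</radius><density>0.943</density></planet>"
        + "<planet><name>Earth</name><mass>1</mass><day units='days'>1</day>"
        + "<radius units='miles'>3960</radius><density>1</density></planet>"
        + "</planets>";

    // Evaluates an XPath expression against the planets document and
    // returns the names of the selected nodes, separated by spaces.
    static String selectNames(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PLANETS)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        StringBuilder names = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) names.append(' ');
            names.append(nodes.item(i).getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(selectNames("/planets"));                // one node
        System.out.println(selectNames("/planets/planet/density")); // three nodes
        System.out.println(selectNames("/planets/planet/*"));       // fifteen nodes
    }
}
```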
Types of XPath Expressions
In full generality XPath expressions can evaluate to
various kinds of thing (including numbers, Booleans,
and strings) but one is usually interested in expressions
that evaluate to node-sets.
A node-set is a collection of nodes in an XML document.
In general the nodes that can appear in a node-set
include:
• element nodes, attribute nodes,
• text nodes (plain text children of some element), root nodes,
namespace nodes, processing instruction nodes and comment
nodes.
We will only discuss the first two cases.
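The different result types can be seen from Java, where the caller states which kind of value is wanted. This is a sketch under assumptions: the class name XPathResultTypes and its three helper methods are hypothetical, the document is a cut-down planets document, and the expressions use the XPath function count(), which these slides do not otherwise cover.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathResultTypes {
    // Cut-down planets document: names only.
    static final String PLANETS = "<planets>"
        + "<planet><name>Mercury</name></planet>"
        + "<planet><name>Venus</name></planet>"
        + "<planet><name>Earth</name></planet>"
        + "</planets>";

    private static Document doc() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PLANETS)));
    }

    private static XPath xpath() {
        return XPathFactory.newInstance().newXPath();
    }

    static double number(String expr) throws Exception {
        return (Double) xpath().evaluate(expr, doc(), XPathConstants.NUMBER);
    }

    // Converting a node-set to a string takes the text of its first node.
    static String string(String expr) throws Exception {
        return (String) xpath().evaluate(expr, doc(), XPathConstants.STRING);
    }

    static boolean bool(String expr) throws Exception {
        return (Boolean) xpath().evaluate(expr, doc(), XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(number("count(/planets/planet)"));     // prints: 3.0
        System.out.println(string("/planets/planet/name"));       // prints: Mercury
        System.out.println(bool("count(/planets/planet) = 3"));   // prints: true
    }
}
```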
Location Paths and Location Steps
The most important kind of XPath expression is
the location path.
In general a location path consists of a series of
location steps separated by the slash “/”.
• Compare to a UNIX path, where the individual “step”
is the name of an immediately nested subdirectory or
file, or a wildcard (*), or a move up into the parent
directory (..), etc.
Steps and Context Nodes
While a UNIX path may be relative to some current
directory, an XPath expression is generally evaluated
relative to some context node.
Starting from this context node of the path, the location
path takes a series of steps in various possible
“directions”, e.g.
• into the set of child elements of the current element,
• into the set of attributes of the current element,
• into the set of siblings of the current element, etc.
This “direction” is called the axis of the step.
Each individual step has its own context node,
determined by preceding steps in the path.
Syntax for Location Steps
The commonest example of a location step—analogous
to a UNIX directory name—is an XML element name.
• This should be the name of an element that is an immediate
child of the context node.
Recalling this example:
/planets/planet
it has two steps, planets and planet.
• Actually this is an example of what is called abbreviated syntax.
In the full, unabbreviated, syntax, that location path would
be:
/child::planets/child::planet
child being the name of the axis. In this brief exposition we
only cover abbreviated syntax.
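As a quick check that the two syntaxes mean the same thing, the sketch below counts the nodes selected by an abbreviated path and by its unabbreviated equivalent. The class name AxisSyntaxDemo, the method count and the two-planet document are hypothetical. The second pair of paths relies on attribute:: being the full form of @, which comes from the XPath 1.0 specification rather than from these slides.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AxisSyntaxDemo {
    static final String XML = "<planets>"
        + "<planet><day units='days'>58.65</day></planet>"
        + "<planet><day units='days'>116.75</day></planet>"
        + "</planets>";

    // Returns the number of nodes the expression selects in the document.
    static int count(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Abbreviated and unabbreviated forms of the same location path.
        System.out.println(count("/planets/planet"));
        System.out.println(count("/child::planets/child::planet"));
        // Same comparison for a path ending in an attribute step.
        System.out.println(count("/planets/planet/day/@units"));
        System.out.println(count("/child::planets/child::planet/child::day/attribute::units"));
    }
}
```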
Abbreviated Syntax for Steps
The location step Name selects some element nodes: the
children of the context node called Name.
• “*” selects all element children of the context node.
The location step “.” represents the context node.
The location step “..” represents the parent node of the
context node.
The location step @Name represents an attribute node
—the attribute of the context node named Name.
• @* selects all attributes of the context node.
A blank location step effectively selects the context node
or any descendant element, e.g.
• //Name selects all Name elements in the document.
• .//Name selects all Name descendants of the context node.
Example
In the planets document shown earlier, the highlighted nodes are
the evaluation of
/planets/planet/radius/@units
i.e. the units attributes of the three radius elements (each
with value "miles").
Example
In the planets document shown earlier, the highlighted nodes are
the evaluation of
//mass
i.e. the three mass elements (0.0553, 0.815 and 1).
Example
If the context node is the second <planet/> element (Venus) of
the planets document shown earlier, the highlighted nodes are the
evaluation of
*/@units
i.e. the units attributes of that planet's day and radius
elements (with values "days" and "miles").
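Relative paths and the abbreviated steps can be explored from Java by passing an element, rather than the whole document, as the context node. In this sketch the class name ContextNodeDemo and the method fromSecondPlanet are hypothetical, and the second planet element is located with a plain DOM call so that no XPath syntax beyond these slides is needed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ContextNodeDemo {
    static final String PLANETS = "<planets>"
        + "<planet><name>Mercury</name><mass>0.0553</mass><day units='days'>58.65</day>"
        + "<radius units='miles'>1516</radius><density>0.983</density></planet>"
        + "<planet><name>Venus</name><mass>0.815</mass><day units='days'>116.75</day>"
        + "<radius units='miles'>3716</radius><density>0.943</density></planet>"
        + "<planet><name>Earth</name><mass>1</mass><day units='days'>1</day>"
        + "<radius units='miles'>3960</radius><density>1</density></planet>"
        + "</planets>";

    // Evaluates the expression with the second planet element (Venus) as
    // context node; returns "name=value" for each selected node.
    static String fromSecondPlanet(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PLANETS)));
        Node venus = doc.getElementsByTagName("planet").item(1);
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, venus, XPathConstants.NODESET);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (i > 0) out.append(' ');
            out.append(n.getNodeName()).append('=').append(n.getTextContent());
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fromSecondPlanet("name"));           // child step
        System.out.println(fromSecondPlanet("*/@units"));       // attributes of children
        System.out.println(fromSecondPlanet(".//mass"));        // descendants of context
        System.out.println(fromSecondPlanet("../planet/name")); // up to parent, down again
        System.out.println(fromSecondPlanet("//mass"));         // absolute: whole document
    }
}
```

Note the contrast between the last and third calls: a path starting with // ignores the context node, whereas .//mass is confined to its descendants.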
Unions
The operator “|” forms the union of two node sets.
For example:
//radius | //density
represents the set of all radius elements and all density
elements in the document.
Example
In the planets document shown earlier, the highlighted nodes are
the evaluation of
//radius | //density
i.e. the three radius elements and the three density elements.
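The set behaviour of the union operator can be checked by counting nodes. A minimal sketch, with a hypothetical class name UnionDemo and method count, and a cut-down planets document holding only radius and density elements:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class UnionDemo {
    static final String PLANETS = "<planets>"
        + "<planet><radius units='miles'>1516</radius><density>0.983</density></planet>"
        + "<planet><radius units='miles'>3716</radius><density>0.943</density></planet>"
        + "<planet><radius units='miles'>3960</radius><density>1</density></planet>"
        + "</planets>";

    // Returns the number of nodes in the node-set the expression evaluates to.
    static int count(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(PLANETS)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("//radius"));              // prints: 3
        System.out.println(count("//radius | //density"));  // prints: 6
        // A union is a set: a node selected by both operands appears once.
        System.out.println(count("//radius | //radius"));   // prints: 3
    }
}
```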
XPath and Namespaces
If the elements or attributes named in an XPath
expression belong to a namespace, then in general those
names must have a namespace prefix.
• You cannot use a default namespace in an XPath expression:
an un-prefixed name is taken to be in no namespace.
When XPath expressions are embedded in XML
documents (which is normal), the prefixes for the
qualified names are scoped in the usual way—by
namespace declarations (i.e. xmlns:Prefix attribute
specifications) in the XML.
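When an XPath expression is evaluated from a Java program instead of being embedded in XML, there are no xmlns attributes to scope the prefixes, so the program has to supply the prefix bindings itself. The sketch below does this through the standard javax.xml.namespace.NamespaceContext interface. The class name XPathNamespaceDemo, the method count and the prefix s are hypothetical choices for this illustration. The document uses SVG as its default namespace, so the un-prefixed path finds nothing while the prefixed one finds the circle.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathNamespaceDemo {
    static final String SVG_NS = "http://www.w3.org/2000/svg";
    static final String XML =
        "<svg xmlns='" + SVG_NS + "'><circle cx='150' cy='50' r='25'/></svg>";

    // Counts the nodes selected, with the prefix "s" bound to the SVG namespace.
    static int count(String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "s".equals(prefix) ? SVG_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String namespaceURI) {
                return SVG_NS.equals(namespaceURI) ? "s" : null;
            }
            public Iterator<String> getPrefixes(String namespaceURI) {
                String prefix = getPrefix(namespaceURI);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count("//circle"));    // prints: 0 (no-namespace name)
        System.out.println(count("//s:circle"));  // prints: 1
    }
}
```

The prefix in the expression need not match any prefix used in the document; only the namespace name it is bound to matters.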
Summary
In its full generality, XPath is a fairly complex (but
powerful) language for addressing subsets of the nodes
of an XML document.
• It incorporates extensive computational and filtering
functionalities that we have not described here.
The subset we covered in this lecture is nevertheless
useful in itself.
• And happens to include all the XPath needed to understand
XML Schema.