XML Tutorial
David G. Durand
Director of Electronic Publishing Services
Ingenta Inc.
Adjunct Associate Professor
Brown University
Thanks
Steven J. DeRose, Eve Maler
What Is Markup?
•
Documents and “semi-structured” data share features
–
Hierarchical structure
–
String content
–
Variations in structure
•
Their applications also share needs
–
Need for a lingua franca, independent of APIs
–
Ability to cope with international characters
–
“Fit” with WWW and HTTP.
XML is more general
•
Tags label arbitrary information units
–
More suited to multiple purposes
–
“Looking right” is needed but not enough
•
Supports custom information structures
–
If you have “price” or “procedure”, you can make a tag for it, and validate its usage
–
Can support many different information models
•
E.g., molecular models, vector graphics, etc.
•
More “teeth” to enforce consistent syntax
–
Works hard to avoid semi-interoperable docs
Better rendering than HTML
•
Fully internationalized
–
Also better for visually-impaired users
•
Supports multiple renderings
–
Customize to the user, time, situation, device
–
Separates formatting from structure
–
And processing other than rendering
•
Large documents don’t break it
–
Easy to trade off server/client work
–
Artificial “next tiny bit” links no longer necessary
–
No searches that fail because big doc was split
•
XHTML is XML-conforming flavor of HTML
–
Clean existing HTML is already close...
XML treats documents like databases
XML brings benefits of DBs to documents
Schema to model information directly
Formal validation, locking, versioning, rollback...
But
Not all traditional database concepts map cleanly, because documents are fundamentally different in some ways
What is structure?
•
To Relational Database theorists, structure is:
–
Tables with fixed sets of non-repeating named fields, that have little internal
structure
–
E-R diagrams with fixed number of nodes
•
Structured documents are different:
–
The order of SECs, Ps, etc. matters (a lot)
–
Many hierarchical layers (which text crosses)
–
Text/graphic data mixes with aggregate objects
–
Optional or repeatable sub-parts abound
–
Interaction with natural language phenomena
•
These are very different requirements
When structure is essential
•
Large scale data
•
Data with individual parts you care about
–
(like price-tag, tool-list, citation, author,...)
•
Need for good navigation tools
•
Mission-critical information
•
Information that must last
•
Multi-author publishing process
•
Multiple delivery media
What’s the difference?
Without structure
Data conversion is far more expensive
Multi-platform and/or multi-media delivery requires re-authoring and hand-work
Paper production is inconsistent
Late format changes are far more risky
Retrieval is prone to many false hits
“Pay me now, or pay me later”
XML design principles
•
Straightforwardly usable over the Internet
•
Support for a wide variety of applications
•
Compatible with SGML
•
Make writing XML programs easy
•
Avoid optional features
•
Human-readable (if not terse) markup
•
Formal and concise design
•
Design produced quickly
Opportunities with XML
•
Scalability and openness of Web solutions
•
“Rich clients” for complex information
–
Dynamic user views
•
XML as interprocess communication protocol for “data” (as opposed to “text”)
•
eCommerce integration
•
New methods of creation
–
Schema combination/composition
–
Free-form, schema-less data development
Web usage
XML works with familiar Web paradigms
Locations are expressed as URIs
High interoperability because of few options
Easily implementable and usable
Robust against network failures
Avoids serving schemas every time with documents (but can do better validation anyway, when needed)
Some additional XML details
Well-formedness
Error handling
Case sensitivity
HTML compatibility
Well-formedness
•
Document has a single root element, and
•
Elements nest properly
–
Try <B>foo<I>bar</B>baz</I> in your browser!
•
Entities are whole subtrees (not </P><P>)
•
No tag omission (close what you open)
•
Attributes must be quoted
•
< and & must always be escaped in some way
•
A document can be well-formed (and parsable) whether or not it fits a given schema
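The well-formedness rules above can be checked mechanically, with no schema involved. A minimal sketch using Python's standard xml.etree.ElementTree parser; the helper name is_well_formed and the sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples: one per rule on the slide.
samples = {
    "properly nested": "<B>foo<I>bar</I>baz</B>",
    "overlapping tags": "<B>foo<I>bar</B>baz</I>",
    "unclosed element": "<P>close what you open",
    "unquoted attribute": "<P TYPE=SECRET>text</P>",
    "unescaped ampersand": "<P>AT&T</P>",
    "two root elements": "<P>one</P><P>two</P>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first sample parses; every other one breaks exactly one of the rules listed.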
Partial and missing DTDs
•
DTDs (schemas) are needed for validation
•
DTD processing adds a burden
•
Because of Well-formedness,
–
DTDs are not needed just to parse
–
Even subtrees can be parsed in isolation
•
One exception: Default attributes
•
Very handy for development/experimentation
Error handling
•
“Draconian error handling”
–
Major errors cause processor to stop passing data in the “normal way”
•
Fatal errors:
–
Ill-formed document
–
Certain entity references in incorrect places
–
Misplaced character-encoding declarations
•
This helps save huge $ on error-recovery
–
Hopefully, the $ will go to better features instead
–
Netscape and Microsoft both wanted this (détente?)
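Draconian error handling can be observed with an incremental parser: content before the fatal error is delivered, and nothing after it. A minimal sketch using xml.etree.ElementTree.XMLPullParser from the Python standard library; the function name and the input chunks are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def elements_before_error(chunks):
    """Feed chunks to a pull parser; return (closed element names, first error)."""
    parser = ET.XMLPullParser(events=("end",))
    seen = []
    error = None
    for chunk in chunks:
        parser.feed(chunk)
        try:
            for _event, element in parser.read_events():
                seen.append(element.tag)
        except ET.ParseError as exc:
            # Fatal error: stop here, do not try to recover or guess.
            error = exc
            break
    return seen, error


# The second chunk closes <bad> with the wrong end-tag, so the document is ill-formed.
chunks = ["<doc><ok>1</ok>", "<bad></oops>", "<late>2</late></doc>"]
seen, error = elements_before_error(chunks)
print(seen, error)
```

The element named late is never reported, even though it is well-formed on its own.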
Case sensitivity
•
HTML is
–
Case-insensitive for tag names: <P> = <p>
–
Case-sensitive for entity names: &lt; ≠ &LT;
•
XML is case-sensitive for both!
–
Unicode standard advises against case-folding
–
Folding is not well-defined for all languages
•
Turkish has two lower-case i’s, only one upper
•
In languages with no accented caps, can’t reverse
•
Error-prone for programmers
•
XHTML uses lower case
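Both halves of this slide can be tried directly. A minimal sketch in Python: the first part shows an XML parser treating P and p as different names, and the second uses Python's built-in Unicode case mapping to show why folding cannot be reversed for the Turkish dotless i. The sample strings and variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if the text is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# <P> and <p> are two different element types in XML.
doc = ET.fromstring("<doc><P>one</P><p>two</p></doc>")
tags = [child.tag for child in doc]

# A start-tag and end-tag that differ only in case do not match.
mixed_case_ok = parses("<p>text</P>")

# Turkish dotless i (U+0131) and dotted i share one upper-case form.
dotless, dotted = "\u0131", "i"
same_upper = dotless.upper() == dotted.upper()
round_trip_lost = dotless.upper().lower() != dotless

print(tags, mixed_case_ok, same_upper, round_trip_lost)
```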
Summary
•
XML has:
–
Representational power and extensibility
•
Custom tags, order constraints, etc.
–
Validation and consistency (several ways)
–
Much of HTML’s simplicity for users/implementors
•
XML trashes:
–
SGML’s syntax/feature complexity
–
SGML’s high startup costs
–
HTML’s inflexibility
–
ASCII legacy
XML System Architectures
First, an HTML system
[Diagram labels: HTML document, Web server, Internet, Web client (parser, formatter, interface)]
How do you get the data?
Documents, stylesheets, and other data can all be expressed in XML. But their information is accessed directly.
Any application can plug in via an API called “Document Object Model”
[Diagram labels: XML data, XSLT stylesheet, http, HTML + CSS, Browser/Interface]
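As an illustration of an application plugging in through the Document Object Model, here is a minimal sketch using xml.dom.minidom, the small DOM implementation in the Python standard library. The catalog document, its element names, and the status attribute are assumptions made for illustration.

```python
from xml.dom import minidom

# Hypothetical catalog data; the element and attribute names are illustrative.
source = """<catalog>
  <item id="h185"><title>Tools</title><price>12.50</price></item>
  <item id="h186"><title>Computer</title><price>899.00</price></item>
</catalog>"""

dom = minidom.parseString(source)

# Read: walk the tree through DOM calls instead of scanning the text.
titles = [
    item.getElementsByTagName("title")[0].firstChild.data
    for item in dom.getElementsByTagName("item")
]

# Write: the same API lets an application change the document in memory.
first_item = dom.getElementsByTagName("item")[0]
first_item.setAttribute("status", "checked")

print(titles)
print(first_item.toxml())
```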
Header stuff
The XML Processing Instruction
<?xml version="1.0" standalone="yes"?>
The DOCTYPE
<!DOCTYPE catalog SYSTEM
"http://www.xyz.com/DTDs/catalog.dtd">
Main document stuff
–
Elements: <title>...</title>
–
Attributes: <xref tgt="#h185">
–
Text or other content: Tools, computer
–
Entity references: &lt; … &reg;
–
Comments <!-- Prepared by... -->
Anatomy of an element
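Element type: p (named in the start-tag and again in the end-tag)
Attribute: type="rule" (attribute name: type; attribute value: rule)
(Character) entity reference: the reference that follows “Use a hyphen:”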
<p type="rule">Use a hyphen: ­.</p>
Element
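The labelled parts can be read back from a parsed element. A minimal sketch using Python's standard xml.etree.ElementTree; the sample element is an assumption that keeps the type="rule" attribute from the slide but uses the predefined lt and gt entity references in its content.

```python
import xml.etree.ElementTree as ET

# Assumed sample: same shape as the slide's element, different content.
element = ET.fromstring('<p type="rule">Use &lt;p&gt; for paragraphs.</p>')

element_type = element.tag        # element type name
attributes = dict(element.attrib)  # attribute name -> attribute value
content = element.text             # entity references arrive already expanded

print(element_type, attributes, content)
```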
Audiences XML aims to help
Parser writers
The Mythical CS Grad Student
Application writer
The Desperate Perl Hacker
Document creators
Newbies of all stripes
The World Wide Web itself
Schema languages
•
3 Leading contenders (all can win):
•
XML Schema
–
Backed by the W3C
–
Very powerful
–
Very large + Complex theory
•
RELAX NG
–
Backed by ISO
–
Based on tree automata
–
Very small
•
Schematron
–
Independent effort
–
Validation tool, not complete language
The DTD (schema)
•
A DTD is a simple schema, based on SGML
•
They consist of declarations for the parts:
–
<!ELEMENT CHAP (TI, SEC*, SUM)>
–
<!ATTLIST P ID ID #IMPLIED>
–
<!ELEMENT P (#PCDATA)>
•
Can reference from DOCTYPE, or include:
•
<!DOCTYPE book SYSTEM "book.dtd" [
<!ELEMENT P (#PCDATA)>…
]>
•
Other schema languages are available
–
They use XML syntax (why not?)
Attributes
•
Specify properties/characteristics of elements
–
That generally apply to the elements as wholes
•
Values are atomic strings
–
Though applications may impose more structure
•
Represented by assignments within start-tags:
–
<P TYPE="SECRET" ID="FOO">
•
Schemas control what attributes can occur on any given type of element
•
One special type: ID, unique per document
•
Attributes are not ordered
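Two of these points are easy to check in code: attribute order carries no meaning, and ID values must be unique within a document. A minimal sketch using Python's standard library; without a DTD the parser cannot know which attribute has type ID, so the duplicate_ids helper and its id_attribute parameter are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The same start-tag with its attributes written in two different orders.
a = ET.fromstring('<P TYPE="SECRET" ID="FOO"/>')
b = ET.fromstring('<P ID="FOO" TYPE="SECRET"/>')
same_attributes = a.attrib == b.attrib


def duplicate_ids(root, id_attribute="ID"):
    """Return the sorted values of id_attribute that occur more than once."""
    counts = Counter(
        element.get(id_attribute)
        for element in root.iter()
        if element.get(id_attribute) is not None
    )
    return sorted(value for value, count in counts.items() if count > 1)


doc = ET.fromstring('<DOC><P ID="FOO"/><P ID="BAR"/><P ID="FOO"/></DOC>')
print(same_attributes, duplicate_ids(doc))
```

A validating parser would perform the uniqueness check itself, using the ID type declared in the schema.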
General Entities
•
A lexical mechanism for inclusion
–
But, constrained to including subtrees
–
This preserves fragment parsability
–
This allows lazy evaluation of structure nodes
•
Also used for referring to graphic or other non-directly-XML data objects
•
References occur in the document instance:
–
<PROCEDURE TYPE="REPAIR">
&warn37;&warn12;...</PROCEDURE>
•
Declarations associate the name with a URI or a “public identifier”
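Inclusion through general entities can be tried with internal entities, which need no external files. A minimal sketch using Python's standard xml.etree.ElementTree, whose underlying parser expands entities declared in the internal DTD subset; the entity names follow the slide, while the WARNING element and the warning texts are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Each entity holds a whole subtree, as the slide requires.
source = """<!DOCTYPE PROCEDURE [
  <!ENTITY warn37 "<WARNING>Unplug the unit first.</WARNING>">
  <!ENTITY warn12 "<WARNING>Wear eye protection.</WARNING>">
]>
<PROCEDURE TYPE="REPAIR">&warn37;&warn12;</PROCEDURE>"""

procedure = ET.fromstring(source)

# After parsing, the references have been replaced by the subtrees they name.
warnings = [child.text for child in procedure]
print(warnings)
```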
Marked sections
Two purposes:
Escaping a lot of markup
Conditional inclusion
In XML:
Escaping only in the document instance:
•
<![CDATA[ <P>Hello</P> ]]>
Conditional content only in schemas:
•
<![IGNORE[ ... ]]>
•
<![INCLUDE[ ... ]]>
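The escaping effect of a CDATA section is visible when the same characters are parsed with and without it. A minimal sketch using Python's standard xml.etree.ElementTree; the wrapping doc element is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "<P>" is plain character data, not a tag.
escaped = ET.fromstring("<doc><![CDATA[ <P>Hello</P> ]]></doc>")

# Outside CDATA, the same characters are markup and create a child element.
parsed = ET.fromstring("<doc> <P>Hello</P> </doc>")

print(repr(escaped.text), len(escaped))
print(parsed[0].tag, parsed[0].text)
```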
Processing instructions
•
Form/example:
–
<?target-name target-specific-stuff ?>
–
<?xmleditor insertionpoint?>
•
Used to insert instructions to processors
–
Not commonly needed
–
No way to escape “?>” inside
–
May declare targets in DTD as Notations
•
One special one: to identify XML documents
–
<?xml version="1.0"?>
The “XML Declaration” PI
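A processing instruction reaches the application as a target plus an uninterpreted data string. A minimal sketch using xml.dom.minidom from the Python standard library; the xmleditor target comes from the slide, while the surrounding document is an assumption made for illustration.

```python
from xml.dom import minidom

source = (
    '<?xml version="1.0"?>'
    "<doc>Some text<?xmleditor insertionpoint?> and more.</doc>"
)
dom = minidom.parseString(source)

# Collect (target, data) for every processing instruction inside the root element.
instructions = [
    (node.target, node.data)
    for node in dom.documentElement.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)

# The XML declaration is handled by the parser and is not exposed as a PI node.
top_level = [node.nodeName for node in dom.childNodes]
print(top_level)
```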
<!-- DTD for Friendly Letter -->
<!-- FPI: -//sjd//DTD Friendly letter//EN -->
<!ELEMENT LETTER (DATE, GREET, BODY, SIG)>
<!ELEMENT DATE (#PCDATA)>
<!ELEMENT GREET (#PCDATA)>
<!ELEMENT BODY (P)*>
<!ELEMENT SIG (#PCDATA)>
<!ELEMENT P (#PCDATA | EMPH | FIG)*>
<!ELEMENT EMPH (#PCDATA)>
<!ATTLIST EMPH TYPE NMTOKEN "WOW">
<!ELEMENT FIG EMPTY>
<!ATTLIST FIG HREF CDATA #REQUIRED>
Another Example
<!ENTITY % inline "emph | strong">
<!ELEMENT doc (chap*)>
<!ELEMENT chap (title, section*)>
<!ELEMENT title (#PCDATA | %inline;)*>
<!ELEMENT section (p+)>
<!ELEMENT p (#PCDATA | %inline;)*>
<!ATTLIST p ID ID #IMPLIED>
<!ELEMENT emph (#PCDATA)>
<!ELEMENT strong (#PCDATA)>
A corresponding document
<?xml version="1.0"?>
<!DOCTYPE LETTER PUBLIC
"-//sjd//DTD Friendly letter//EN"
[]>
<LETTER><DATE>October 3, 1998</DATE>
<GREET>Sammy</GREET>
<BODY>
<P>How <EMPH>are</EMPH> you doing?</P>
<P>This is my dog:
<FIG HREF="http://www.me.com/dog.gif"/></P>
</BODY>
<SIG>Todd</SIG>
</LETTER>
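The letter is a compact example of mixed content, where text and child elements interleave inside P. A minimal sketch using Python's standard xml.etree.ElementTree shows how that text is split around the EMPH child; the DOCTYPE is left out as an assumption, because the sketch has no DTD file to point at and a non-validating parse does not need one.

```python
import xml.etree.ElementTree as ET

letter = ET.fromstring("""<LETTER><DATE>October 3, 1998</DATE>
<GREET>Sammy</GREET>
<BODY>
<P>How <EMPH>are</EMPH> you doing?</P>
<P>This is my dog:
<FIG HREF="http://www.me.com/dog.gif"/></P>
</BODY>
<SIG>Todd</SIG>
</LETTER>""")

first_p = letter.find("BODY/P")
emph = first_p.find("EMPH")

# Mixed content: text before the child, inside it, and after it.
pieces = (first_p.text, emph.text, emph.tail)

# itertext() stitches the character data back together in document order.
plain = "".join(first_p.itertext())

figure_url = letter.find("BODY/P/FIG").get("HREF")
print(pieces, plain, figure_url)
```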
Content Models
Joining
Sequence a,b,c
Alternation a | b | c
Grouping (a)
Repetition
0 or more a*
1 or more a+
Optional a?
Data
#PCDATA
Element names
Model groups
Mixed content (#PCDATA | x | …)*
ANY
EMPTY
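Because element-only content models read like regular expressions over child element names, a rough checker can be built by translating one into a Python regular expression. This is a sketch under stated assumptions, not how a validating parser works: it ignores #PCDATA, ANY and EMPTY, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Translate an element-only content model such as (TI, SEC*, SUM)."""
    # Each element name becomes a group that matches "name;".
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda match: "(?:" + re.escape(match.group(0)) + ";)",
        model,
    )
    # Sequence is plain concatenation in a regex, so commas and spaces go away.
    # Parentheses, |, *, + and ? already mean the same thing in both notations.
    return re.sub(r"[\s,]", "", pattern)


def children_match(element, model):
    """Check the names of an element's children against a content model."""
    child_names = "".join(child.tag + ";" for child in element)
    return re.fullmatch(model_to_regex(model), child_names) is not None


good = ET.fromstring("<CHAP><TI/><SEC/><SEC/><SUM/></CHAP>")
bad = ET.fromstring("<CHAP><SEC/><TI/><SUM/></CHAP>")
print(children_match(good, "(TI, SEC*, SUM)"))
print(children_match(bad, "(TI, SEC*, SUM)"))
```

The regex engine backtracks freely, so this sketch accepts models that the ambiguity restriction would reject.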
Not quite regular expressions
Ambiguity restriction
Content models must be deterministic: while matching any model group, a parser must never have to choose between alternatives by looking ahead
This restriction is preserved in W3C XML Schema, relaxed in RELAX NG
Handy terminology decoder ring
Element: a text feature distinguished by markup
Tag: a string in angle brackets, <a> or </a>. Two tags delimit an element
Content: anything in an element (its children in the parse tree), that is, the tags and characters between an element’s tags
Attribute: a (name, value) pair associated with an element
Element Type Name: a string like “p” or “img” that identifies the type of an element
Decoder ring…
•
Content Model: description of restrictions on the content of an element
•
Model Group: content model subexpression in parentheses
•
Repetition indicator: *, +, ?
•
Prolog: All of the stuff before the document instance starts.
Ambiguity
Data types
Default values / omissibility
<!ATTLIST p
type (summary | body) "body"
id ID #IMPLIED
prefix CDATA "">
<!ATTLIST syntax
•
<!ATTLIST element-name
att-name type defaults
att-name type defaults
…>
•
<!ATTLIST element-group
att-name type defaults
att-name type defaults
…>
Attribute Data Types
CDATA
NMTOKEN / NMTOKENS
Enumeration Type (a | b)
ENTITY / ENTITIES
ID / IDREF / IDREFS
NOTATION
Attribute defaults
#REQUIRED
#IMPLIED
#FIXED "value"
Literal default value
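Literal defaults are the one place where declarations change what a non-validating parser reports. A minimal sketch using Python's standard xml.etree.ElementTree, whose underlying parser applies attribute defaults declared in the internal DTD subset; the ATTLIST matches the earlier slide, while the doc wrapper and the paragraph contents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE doc [
  <!ATTLIST p
    type   (summary | body) "body"
    id     ID               #IMPLIED
    prefix CDATA            "">
]>
<doc><p>first</p><p type="summary" id="p2">second</p></doc>"""

doc = ET.fromstring(source)
first, second = doc.findall("p")

# The first <p> has no type in the instance; the declared default fills it in.
print(first.get("type"), first.get("id"))

# An explicit value wins over the default; #IMPLIED attributes stay absent unless given.
print(second.get("type"), second.get("id"))
```

The same document parsed without its DOCTYPE would report no type attribute on the first paragraph.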
Parameter Entities
• Declaring
<!ENTITY % pent "value">
<!ENTITY % include-file SYSTEM
"http://www.w3.org//">
Using
%include-file;
<![ %option; [ <!… optional
declaration …> ]]>
General Entities
Simple
<!ENTITY ent "value">
External
<!ENTITY include-file SYSTEM
"http://www.w3.org//">
Notations
•
declaring
•
<!NOTATION blob SYSTEM "application/binary">
•
Using (to declare entity datatypes)
•
<!ENTITY something SYSTEM "http://blob.org/blobel"
NDATA blob>
•
Using an NDATA entity
•
<!ATTLIST img ref ENTITY #REQUIRED>
•
… in instance …
•
<img ref="something"/>
•
Or one can just use URIs and MIME types in software… less
validation, more simplicity
Processing instructions
Escape hatch
Way to add declarations to XML in
some cases
Way to “pickle” application state in a
document.
Namespaces
Cross-references
Structural divisions (headings, blurbs,
ambiguities)
Tradeoff between freedom and
processing
Normalization of data items
What external data and catalogs may
exist
Restrictions on data items
Content model
Data values (are there controlled or
semi-controlled vocabularies?)
Are there “authority files” for large
open sets (like lists of authors)
How variable is the content, and how
realistic the idea to normalize it.
Presentation issues