XML
XML is not…
• A replacement for HTML
• A presentation format
• A programming language
• A database
But then – what is it?
XML by Example
<article>
<author>Gerhard Weikum</author>
<title>The Web in 10 Years</title>
</article>
XML by Example
<t108>
<x87>Gerhard Weikum</x87>
<g10>The Web in 10 Years</g10>
</t108>
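The two versions above differ only in their tag names. A minimal sketch in Python (standard-library `xml.etree.ElementTree`; the helper name `shape` is made up for this illustration) suggests what that means for a program: both documents have the same tree shape and text, and only the human-chosen names differ.

```python
import xml.etree.ElementTree as ET

READABLE = """<article>
  <author>Gerhard Weikum</author>
  <title>The Web in 10 Years</title>
</article>"""

OBFUSCATED = """<t108>
  <x87>Gerhard Weikum</x87>
  <g10>The Web in 10 Years</g10>
</t108>"""


def shape(element):
    """Return nesting structure and text of an element, ignoring tag names."""
    text = (element.text or "").strip()
    return (text, [shape(child) for child in element])


readable_root = ET.fromstring(READABLE)
obfuscated_root = ET.fromstring(OBFUSCATED)

# Same structure and same text; only the labels differ.
same_shape = shape(readable_root) == shape(obfuscated_root)
tags_readable = [child.tag for child in readable_root]
tags_obfuscated = [child.tag for child in obfuscated_root]
print(same_shape, tags_readable, tags_obfuscated)
```

The meaning a human reads into `author` or `title` comes from the names alone; a parser treats both documents alike.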
Possible Advantages of Using XML
• Truly Portable Data
• Easily readable by human users
• Very expressive (semantics near data)
• Very flexible and customizable (no finite tag set)
• Easy to use from programs
• Easy to convert into other representations
(XML transformation languages)
• Many additional standards and tools
• Widely used and supported
App. Scenario: Content Mgt.
[Figure: clients and a database with XML documents]
App. Scenario : XML for Metadata
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://www-dbs/Sch03.pdf">
<dc:title>A Framework for…</dc:title>
<dc:creator>Ralf Schenkel</dc:creator>
<dc:description>While there are...</dc:description>
<dc:publisher>Saarland University</dc:publisher>
<dc:subject>XML Indexing</dc:subject>
<dc:rights>Copyright ...</dc:rights>
<dc:type>Electronic Document</dc:type>
<dc:format>text/pdf</dc:format>
<dc:language>en</dc:language>
</rdf:Description>
</rdf:RDF>
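A minimal sketch of how a program might read such metadata, using Python's standard `xml.etree.ElementTree`. The namespace URIs are the standard RDF and Dublin Core ones, which are an assumption here because the slide does not spell them out; the record is shortened to a few fields from the example.

```python
import xml.etree.ElementTree as ET

RDF_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www-dbs/Sch03.pdf">
    <dc:creator>Ralf Schenkel</dc:creator>
    <dc:publisher>Saarland University</dc:publisher>
    <dc:subject>XML Indexing</dc:subject>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>"""

# Assumed namespace URIs (standard RDF and Dublin Core element set).
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}
DC_PREFIX = "{" + NS["dc"] + "}"

root = ET.fromstring(RDF_DOC)
description = root.find("rdf:Description", NS)

# Attributes in a namespace are addressed as {uri}localname.
about = description.get("{" + NS["rdf"] + "}about")

# Collect every Dublin Core child as field name -> text.
metadata = {
    child.tag[len(DC_PREFIX):]: child.text
    for child in description
    if child.tag.startswith(DC_PREFIX)
}
print(about, metadata)
```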
App. Scenario : Document Markup
<article>
<section id="1" title="Intro">
This article is about <index>XML</index>.
</section>
<section id="2" title="Main Results">
<name>Weikum</name> <cite idref="Weik01"/> shows
the following theorem (see Section <ref idref="1"/>)
<theorem id="theo:1" source="Weik01">
For any XML document x, ...
</theorem>
</section>
<literature>
<cite id="Weik01"><author>Weikum</author></cite>
</literature>
</article>
App. Scenario : Document Markup
• Document Markup adds structural and semantic
information to documents, e.g.
– Sections, Subsections, Theorems, …
– Cross References
– Literature Citations
– Index Entries
– Named Entities
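The kinds of markup listed above can be consumed by a program. The following is a rough sketch, assuming the `id`/`idref` attribute names of the article example and Python's standard `xml.etree.ElementTree`; it resolves each cross reference to the element it points at and collects the index entries.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <section id="1" title="Intro">
    This article is about <index>XML</index>.
  </section>
  <section id="2" title="Main Results">
    <name>Weikum</name> <cite idref="Weik01"/> shows
    the following theorem (see Section <ref idref="1"/>)
    <theorem id="theo:1" source="Weik01">
      For any XML document x, ...
    </theorem>
  </section>
  <literature>
    <cite id="Weik01"><author>Weikum</author></cite>
  </literature>
</article>"""

root = ET.fromstring(DOC)

# Map every id value to the element that carries it.
targets = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# For every referencing element, look up the tag of its target.
resolved = {}
for el in root.iter():
    idref = el.get("idref")
    if idref is not None:
        target = targets.get(idref)
        resolved[(el.tag, idref)] = target.tag if target is not None else None

index_terms = [el.text for el in root.iter("index")]
print(resolved, index_terms)
```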
XML
Part 2 – Basic XML Concepts
XML Documents
What's in an XML document?
• Elements
• Attributes
• plus some other details
A Simple XML Document
<article>
<author>Gerhard Weikum</author>
<title>The Web in Ten Years</title>
<text>
<abstract>In order to evolve...</abstract>
<section number="1" title="Introduction">
The <index>Web</index> provides the universal...
</section>
</text>
</article>
• Tags such as article, author, and title are freely definable.
• Start tag: <article>
• End tag: </article>
• Element: everything from the start tag through the end tag
• Content of the element: subelements and/or text
• Attributes with name and value, e.g. number="1" and title="Introduction" in the section start tag
Elements in XML Documents
• (Freely definable) tags: article, title, author
– with start tag: <article> etc.
– and end tag: </article> etc.
• Elements: <article> ... </article>
• Elements have a name (article) and a content (...)
• Elements may be nested.
• Elements may be empty: <this_is_empty/>
• Element content consists of strings (possibly with special characters)
and/or nested elements (mixed content if both).
• Each XML document has exactly one root element and forms a
tree.
• Elements with a common parent are ordered.
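A minimal sketch of how the points above look from a program, using Python's standard `xml.etree.ElementTree` and the section element from the example. The helper name `content_items` is made up for this illustration; it lists mixed content in document order.

```python
import xml.etree.ElementTree as ET

SECTION = (
    '<section number="1" title="Introduction">'
    "The <index>Web</index> provides the universal..."
    "</section>"
)


def content_items(element):
    """List the content of an element in document order."""
    items = []
    if element.text:
        items.append(("text", element.text))
    for child in element:
        items.append(("element", child.tag))
        # Text that follows a child element is stored as that child's tail.
        if child.tail:
            items.append(("text", child.tail))
    return items


section = ET.fromstring(SECTION)
items = content_items(section)

# An empty element has neither text nor children.
empty = ET.fromstring("<this_is_empty/>")
print(items, len(empty), empty.text)
```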
Elements vs. Attributes
Elements may have attributes (in the start tag) that have a name and
a value, e.g. <section number="1">.
What is the difference between elements and attributes?
• Only one attribute with a given name per element
• Attributes have no structure, simply strings (while elements can
have subelements)
As a rule of thumb:
• Content into elements
• Metadata into attributes
Example:
<person born="1912-06-23" died="1954-06-07">
Alan Turing</person> proved that…
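As a small sketch of the difference in practice (Python's standard `xml.etree.ElementTree`, applied to the person element from the example): the content is reached through the element, while the metadata arrives as a flat mapping from attribute name to string.

```python
import xml.etree.ElementTree as ET

person = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07">Alan Turing</person>'
)

name = person.text            # element content
born = person.get("born")     # attribute value, always a plain string
attributes = person.attrib    # one value per attribute name
child_count = len(person)     # this element has no subelements

print(name, born, sorted(attributes), child_count)
```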
XML Documents as Ordered Trees
[Figure: the example document as an ordered tree with root article]
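Since the drawing is not reproduced here, a short sketch can stand in for it. Assuming the simple article document from the earlier slides and Python's standard `xml.etree.ElementTree`, the hypothetical helper `outline` prints the element tree with one level of indentation per depth, children in document order.

```python
import xml.etree.ElementTree as ET

DOC = """<article>
  <author>Gerhard Weikum</author>
  <title>The Web in Ten Years</title>
  <text>
    <abstract>In order to evolve...</abstract>
    <section number="1" title="Introduction">
      The <index>Web</index> provides the universal...
    </section>
  </text>
</article>"""


def outline(element, depth=0):
    """Return the element names of a subtree, indented by depth."""
    lines = ["  " * depth + element.tag]
    for child in element:  # children come back in document order
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(DOC))
print("\n".join(tree_lines))
```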
Well-Formed XML Documents
A well-formed document must adhere to, among others, the
following rules:
• Every start tag has a matching end tag.
• Elements may nest, but must not overlap.
• There must be exactly one root element.
• Attribute values must be quoted.
• An element may not have two attributes with the same
name.
• Comments and processing instructions may not appear
inside tags.
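These rules can be tried out with any XML parser. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (a non-validating parser that rejects documents that are not well-formed); the sample strings are made up to violate one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "ok": '<a><b x="1"/></a>',
    "missing end tag": "<a><b></a>",
    "overlapping elements": "<a><b></a></b>",
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<a x=1/>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "comment inside tag": "<a <!-- c --> />",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```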
Namespace
<dbs:book xmlns:dbs="http://www-dbs/dbs">
Namespace Example
<dbs:book xmlns:dbs="http://www-dbs/dbs">
<dbs:description> ... </dbs:description>
<dbs:text>
<dbs:formula>
<mathml:math
xmlns:mathml="http://www.w3.org/1998/Math/MathML">
...
</mathml:math>
</dbs:formula>
</dbs:text>
</dbs:book>
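A sketch of how a namespace-aware parser sees this example, assuming Python's standard `xml.etree.ElementTree`. The parser replaces each prefix with its namespace URI, so the query below can use its own prefixes (`d` and `m`, chosen arbitrarily here) as long as they map to the same URIs.

```python
import xml.etree.ElementTree as ET

DOC = """<dbs:book xmlns:dbs="http://www-dbs/dbs">
  <dbs:description> ... </dbs:description>
  <dbs:text>
    <dbs:formula>
      <mathml:math xmlns:mathml="http://www.w3.org/1998/Math/MathML">
        ...
      </mathml:math>
    </dbs:formula>
  </dbs:text>
</dbs:book>"""

root = ET.fromstring(DOC)

# Prefixes are local shorthand; only the URIs have to match the document.
NS = {"d": "http://www-dbs/dbs", "m": "http://www.w3.org/1998/Math/MathML"}
math = root.find("d:text/d:formula/m:math", NS)

# ElementTree reports names in the expanded form {uri}localname.
print(root.tag)
print(math.tag)
```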
Default Namespace
• Default namespace may be set for an element and its
content (but not its attributes):
<book xmlns="http://www-dbs/dbs">
<description>...</description>
</book>
• Can be overridden in the elements by specifying the
namespace there (using prefix or default namespace)
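A small sketch of both points, assuming Python's standard `xml.etree.ElementTree`. The `isbn` attribute and the nested `math` element are hypothetical additions for illustration: the default namespace reaches the child element but not the attribute, and a nested element can override it.

```python
import xml.etree.ElementTree as ET

# isbn and math are made-up additions to the slide's example.
DOC = """<book xmlns="http://www-dbs/dbs" isbn="hypothetical-value">
  <description>...</description>
  <math xmlns="http://www.w3.org/1998/Math/MathML"/>
</book>"""

root = ET.fromstring(DOC)

child_tags = [child.tag for child in root]
# Unprefixed attributes stay outside the default namespace.
attribute_names = list(root.attrib)

print(root.tag)
print(child_tags)
print(attribute_names)
```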
XML
Part 3 – Defining XML Data Formats
3.1 Document Type Definitions
Sometimes XML is too flexible:
• For exchanging data, the format (i.e., elements,
attributes and their semantics) must be fixed
Document Type Definitions (DTDs) establish the
vocabulary for one XML application (in some sense
comparable to schemas in databases).
A document is valid with respect to a DTD if it conforms
to the rules specified in that DTD.
Most XML parsers can be configured to validate.
DTD Example: Elements
<!ELEMENT article (title,author+,text)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT text (abstract,section*,literature?)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT section (#PCDATA|index)*>
<!ELEMENT literature (#PCDATA)>
<!ELEMENT index (#PCDATA)>
• Content of the title element is parsed character data.
• Content of the text element may contain zero or more section
elements in this position.
• Content of the article element is a title element,
followed by one or more author elements,
followed by a text element.
Element Declarations in DTDs
One element declaration for each element type:
<!ELEMENT element_name content_specification>
where content_specification can be
• (#PCDATA) parsed character data
• (child) one child element
• (c1,…,cn) a sequence of child elements c1…cn
• (c1|…|cn) one of the elements c1…cn
For each component c, possible counts can be specified:
– c exactly one such element
– c+ one or more
– c* zero or more
– c? zero or one
Plus arbitrary combinations using parentheses:
<!ELEMENT f ((a|b)*,c+,(d|e))*>
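Content specifications built from sequence, choice, and the counts `+`, `*`, `?` read like regular expressions over child element names. The following is an illustrative sketch only, not how a real validating parser works: the hypothetical helper `children_match` turns a content specification into a Python regular expression and tests a list of child names against it. It does not handle `#PCDATA`, `EMPTY`, or `ANY`.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content specification into a compiled regex."""
    model = re.sub(r"\s+", "", model)
    # Each element name becomes a token "name " so names cannot run together.
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        model,
    )
    # A comma means sequence, which is plain concatenation in a regex.
    return re.compile(pattern.replace(",", ""))


def children_match(model, children):
    """Check whether a list of child element names fits the specification."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


F_MODEL = "((a|b)*,c+,(d|e))*"
print(children_match(F_MODEL, ["a", "b", "c", "d"]))
print(children_match(F_MODEL, ["a", "b"]))
print(children_match("(title,author+,text)", ["title", "author", "text"]))
```

In the first call the children a, b, c, d fit one round of the outer group; in the second the required c is missing.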
More on Element Declarations
• Elements with mixed content:
<!ELEMENT text (#PCDATA|index|cite|glossary)*>
• Elements with empty content:
<!ELEMENT image EMPTY>
• Elements with arbitrary content (not suitable for
production-level DTDs):
<!ELEMENT thesis ANY>
Attribute Declarations in DTDs
Attributes are declared per element:
<!ATTLIST section number CDATA #REQUIRED
title CDATA #REQUIRED>
declares two required attributes for element section.
Each attribute declaration names, in order:
• element name (section)
• attribute name (number, title)
• attribute type (CDATA)
• attribute default (#REQUIRED)
Attribute Examples
<!ATTLIST publication type (journal|inproceedings) #REQUIRED
pubid ID #REQUIRED>
<!ATTLIST cite cid IDREF #REQUIRED>
<!ATTLIST citation ref IDREF #IMPLIED
cid ID #REQUIRED>
<publications>
<publication type="journal" pubid="Weikum01">
<author>Gerhard Weikum</author>
<text>In the Web of 2010, XML <cite cid="c12"/>...</text>
<citation cid="c12" ref="XML98"/>
<citation cid="c15">...</citation>
</publication>
<publication type="inproceedings" pubid="XML98">
<text>XML, the Extensible Markup Language, ...</text>
</publication>
</publications>
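Python's standard library does not validate against DTDs, so the following is only a hand-written sketch of the two checks a validating parser would make for ID and IDREF attributes: ID values are unique within the document, and every IDREF value matches some ID. The tables `ID_ATTRS` and `IDREF_ATTRS` and the helper `check_links` are made up to mirror the ATTLIST declarations above; the citation IDs are written with a leading letter (`c12`, `c15`) because an ID value has to be an XML name.

```python
import xml.etree.ElementTree as ET

DOC = """<publications>
  <publication type="journal" pubid="Weikum01">
    <author>Gerhard Weikum</author>
    <text>In the Web of 2010, XML <cite cid="c12"/>...</text>
    <citation cid="c12" ref="XML98"/>
    <citation cid="c15">...</citation>
  </publication>
  <publication type="inproceedings" pubid="XML98">
    <text>XML, the Extensible Markup Language, ...</text>
  </publication>
</publications>"""

# (element name, attribute name) pairs, mirroring the ATTLIST declarations.
ID_ATTRS = {("publication", "pubid"), ("citation", "cid")}
IDREF_ATTRS = {("cite", "cid"), ("citation", "ref")}


def check_links(root):
    """Return all ID values plus a list of uniqueness and reference problems."""
    ids = set()
    problems = []
    for element in root.iter():
        for name, value in element.attrib.items():
            if (element.tag, name) in ID_ATTRS:
                if value in ids:
                    problems.append(f"duplicate ID {value!r}")
                ids.add(value)
    for element in root.iter():
        for name, value in element.attrib.items():
            if (element.tag, name) in IDREF_ATTRS and value not in ids:
                problems.append(f"dangling IDREF {value!r} on <{element.tag}>")
    return ids, problems


ids, problems = check_links(ET.fromstring(DOC))

# The same document with a reference that points nowhere.
broken = DOC.replace('ref="XML98"', 'ref="XML99"')
_, broken_problems = check_links(ET.fromstring(broken))
print(sorted(ids), problems, broken_problems)
```

Note that the check accepts any ID as the target of any IDREF: a `cite` could point at a publication just as well as at a citation.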
Linking DTD and XML Docs
• Document Type Declaration in the XML document:
<!DOCTYPE article SYSTEM "http://www-dbs/article.dtd">
Linking DTD and XML Docs
• Internal DTD:
<?xml version="1.0"?>
<!DOCTYPE article [
<!ELEMENT article (title,author+,text)>
...
<!ELEMENT index (#PCDATA)>
]>
<article>
...
</article>
• Both ways can be mixed; declarations in the internal DTD override external
entity information:
<!DOCTYPE article SYSTEM "article.dtd" [
<!ENTITY % pub_content "(title+,author*,text)">
]>
Flaws of DTDs
• No support for basic data types like integers, doubles,
dates, times, …
• No structured, self-definable data types
• No type derivation
• ID/IDREF links are quite loose (the type of the target element cannot be specified)
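To illustrate the first flaw: a DTD can declare the `born` and `died` attributes of the earlier person example only as CDATA, so any string passes validation and type checks are left to the application. A minimal sketch of such an application-level check in Python (the helper `check_person` is hypothetical, and ISO dates are assumed as the intended format):

```python
import datetime
import xml.etree.ElementTree as ET


def check_person(element):
    """Return type errors that a DTD with CDATA attributes would not catch."""
    errors = []
    for name in ("born", "died"):
        value = element.get(name)
        try:
            datetime.date.fromisoformat(value)
        except (TypeError, ValueError):
            errors.append(f"{name}={value!r} is not a date")
    return errors


good = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07">Alan Turing</person>'
)
bad = ET.fromstring(
    '<person born="June 1912" died="1954-06-07">Alan Turing</person>'
)

good_errors = check_person(good)
bad_errors = check_person(bad)
print(good_errors, bad_errors)
```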