Module 5.1
EXTENSIBLE MARKUP
LANGUAGE
What is XML
Application X
Repository Database
Overview
XML and Structured Data
ISOM
<html>
<head><title>Example</title></head>
<body>
<h1>This is an example of a page.</h1>
<h2>Some information goes here.</h2>
</body>
</html>
Example of an XML Document
<?xml version="1.0"?>
<address>
<name>Alice Lee</name>
<email>[email protected]</email>
<phone>212-346-1234</phone>
<birthday>1995-03-22</birthday>
</address>
[Diagram labels: End Tag, Element, Content of the Element (Subelements and/or Text)]
A Simple XML Document
<article>
<author>Gerhard Weikum</author>
<title>The Web in Ten Years</title>
<text>
<abstract>In order to evolve...</abstract>
<section number="1" title="Introduction">
The <index>Web</index> provides the universal...
</section>
</text>
</article>
Attributes with
name and value
Elements in XML Documents
• (Freely definable) tags: article, title, author
– with start tag: <article> etc.
– and end tag: </article> etc.
• Elements: <article> ... </article>
• Elements have a name (article) and a content (...)
• Elements may be nested.
• Elements may be empty: <this_is_empty/>
• Element content is typically parsed character data (PCDATA),
i.e., strings with special characters, and/or nested elements (mixed
content if both).
• Each XML document has exactly one root element and forms a
tree.
• Elements with a common parent are ordered.
• XML elements must follow these naming rules:
• Names can contain letters, digits, hyphens, underscores and periods
• Names must start with a letter or an underscore, not with a digit or another punctuation character
• Names must not start with the letters xml (or XML or Xml ..)
• Names cannot contain spaces
• XML tags are case sensitive
Elements vs. Attributes
Elements may have attributes (in the start tag) that have a name and a value, e.g. <section number="1">.
What is the difference between elements and attributes?
• Only one attribute with a given name per element (but an
arbitrary number of subelements)
• Attributes have no structure, simply strings (while elements can
have subelements)
As a rule of thumb:
• Content into elements
• Metadata into attributes
Example:
<person born="1912-06-23" died="1954-06-07">
Alan Turing</person> proved that…
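The rule of thumb above can be tried out with a short Python sketch. It uses the standard-library ElementTree module; the wrapping <text> element is an assumption added only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# The <text> wrapper is hypothetical; it only gives the fragment one root element.
doc = ET.fromstring(
    '<text><person born="1912-06-23" died="1954-06-07">'
    'Alan Turing</person> proved that...</text>'
)

person = doc.find("person")
born = person.get("born")      # metadata: read from an attribute
died = person.attrib["died"]   # attributes behave like a dictionary of strings
name = person.text             # content: the text inside the element
rest = person.tail             # text that follows the end tag

print(name, born, died)
```

Attributes come back as plain strings, while element content could itself contain further subelements.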
XML Documents as Ordered Trees
[Figure: the article document drawn as an ordered tree rooted at the article element]
<?xml version="1.0"?>
<PARENT>
<CHILD1>This is element 1</CHILD1>
<CHILD2/>
<CHILD3></CHILD3>
</PARENT>
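A minimal sketch of the ordered-tree view, assuming Python's standard-library ElementTree as the parser: iterating over the root yields its children in document order, and both forms of empty element carry no text.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<PARENT>
  <CHILD1>This is element 1</CHILD1>
  <CHILD2/>
  <CHILD3></CHILD3>
</PARENT>"""

root = ET.fromstring(xml_text)

# Children are visited in the order in which they appear in the document.
child_names = [child.tag for child in root]
child_texts = [child.text for child in root]

print(root.tag, child_names, child_texts)
```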
Example-HTML
• Print - Sanjay Madria
Web Warehouse Tutorial, ADBIS’99
HTML
<H2> Sanjay Madria </H2>
<I> Web Warehouse Tutorial, ADBIS’99</I>
Very difficult to understand, structure is
hidden, describes only appearance
XML
<Ref>
<Speaker> <Firstname> Sanjay</Firstname>
<Lastname> Madria</Lastname>
</Speaker>
<Title> Web Warehouse Tutorial</Title>
<Conference> ADBIS’99</Conference>
<empty/>
</Ref>
Example 2
• Book Title: My First XML
• Chapter 1: Introduction to XML
• What is HTML
• What is XML
• Chapter 2: XML Syntax
• Elements must have a closing tag
• Elements must be correctly nested
<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>
Well-Formed XML Documents
A well-formed document must adhere to, among others, the
following rules:
• Every start tag has a matching end tag.
• Elements may nest, but must not overlap.
• There must be exactly one root element.
• Attribute values must be quoted.
• An element may not have two attributes with the same
name.
• Comments and processing instructions may not appear
inside tags.
• No unescaped < or & signs may occur inside character
data.
Only well-formed documents can be processed by XML parsers.
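A minimal sketch of how a parser enforces these rules, assuming Python's standard-library ElementTree: a document that breaks a rule raises a ParseError instead of producing a tree. The helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Hypothetical samples, one per rule from the list above.
samples = {
    "ok": "<a><b>text</b></a>",
    "overlap": "<a><b></a></b>",            # elements overlap
    "two_roots": "<a/><b/>",                # more than one root element
    "unquoted": "<a x=1/>",                 # attribute value not quoted
    "duplicate_attr": '<a x="1" x="2"/>',   # same attribute name twice
    "bare_ampersand": "<a>Tom & Jerry</a>", # unescaped & in character data
}

results = {name: is_well_formed(text) for name, text in samples.items()}
print(results)
```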
Document Type Definitions
Sometimes XML is too flexible:
• Most programs can only process a subset of all possible
XML applications
• For exchanging data, the format (i.e., elements,
attributes and their semantics) must be fixed
⇒ Document Type Definitions (DTD) for establishing the
vocabulary for one XML application (in some sense
comparable to schemas in databases)
A document is valid with respect to a DTD if it conforms
to the rules specified in that DTD.
Most XML parsers can be configured to validate.
DTD Example: Elements
<!ELEMENT article (title,author+,text)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT text (abstract,section*,literature?)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT section (#PCDATA|index)*>
<!ELEMENT literature (#PCDATA)>
<!ELEMENT index (#PCDATA)>
Content of the title element is parsed character data.
Content of the text element may contain zero or more section elements in this position.
Content of the article element is a title element, followed by one or more author elements, followed by a text element.
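A content model reads like a regular expression over the names of the child elements. The sketch below only illustrates that idea in Python for (title,author+,text); it is not a DTD validator, and the names ARTICLE_MODEL and matches_article_model are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# (title,author+,text) written as a regular expression over child names,
# with each name followed by a single space as a separator.
ARTICLE_MODEL = re.compile(r"(title )(author )+(text )")

def matches_article_model(xml_text):
    """Check only the order and number of the root's direct children."""
    root = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in root)
    return ARTICLE_MODEL.fullmatch(children) is not None

ok = matches_article_model(
    "<article><title/><author/><author/><text/></article>")
missing_author = matches_article_model(
    "<article><title/><text/></article>")
wrong_order = matches_article_model(
    "<article><author/><title/><text/></article>")

print(ok, missing_author, wrong_order)
```

A validating parser performs this kind of check for every element type declared in the DTD, not just for the root.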
Element Declarations in DTDs
One element declaration for each element type:
<!ELEMENT element_name content_specification>
where content_specification can be
• (#PCDATA) parsed character data
• (child) one child element
• (c1,…,cn) a sequence of child elements c1…cn
• (c1|…|cn) one of the elements c1…cn
DTD (cont’d)
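Occurrence Indicator:
Indicator   Occurrence
(none)      exactly once
?           zero or one time
*           zero or more times
+           one or more times
Attribute Declarations in DTDs
Attributes are declared per element:
<!ATTLIST section number CDATA #REQUIRED
                  title  CDATA #REQUIRED>
Each declaration gives the element name (section), the attribute name (number, title), the attribute type (CDATA) and the attribute default (#REQUIRED).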
<publications>
<publication type="journal" pubid="Weikum01">
<author>Gerhard Weikum</author>
<text>In the Web of 2010, XML <cite cid="12"/>...</text>
<citation cid="12" ref="XML98"/>
<citation cid="15">...</citation>
</publication>
<publication type="inproceedings" pubid="XML98">
<text>XML, the extended Markup Language, ...</text>
</publication>
</publications>
Attribute Examples
<!ATTLIST publication type (journal|inproceedings) #REQUIRED
                      pubid ID #REQUIRED>
<!ATTLIST cite cid IDREF #REQUIRED>
<!ATTLIST citation ref IDREF #IMPLIED
                   cid ID #REQUIRED>
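• PCDATA
• PCDATA means parsed character data; an XML parser normally parses all the text in a document.
• When an XML element is parsed, the text between its tags is parsed too, so markup and entity references inside it are recognized.
• CDATA
• Text inside a CDATA section is not parsed as markup; the parser passes it through as literal character data.
• A CDATA section starts with "<![CDATA[" and ends with "]]>".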
Linking DTD and XML Docs
• Document Type Declaration in the XML document:
<!DOCTYPE article SYSTEM "http://www-dbs/article.dtd">
<!DOCTYPE db [
<!ELEMENT db (person*)>
<!ELEMENT person (name,age,email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>
DTD Example
<!DOCTYPE BOOKLIST [
<!ELEMENT BOOKLIST (BOOK)*>
<!ELEMENT BOOK (AUTHOR)>
<!ELEMENT AUTHOR (FIRSTNAME,LASTNAME)>
<!ELEMENT FIRSTNAME (#PCDATA)>
<!ELEMENT LASTNAME (#PCDATA)>
<!ATTLIST BOOK GENRE (Science|Fiction) #REQUIRED>
<!ATTLIST BOOK FORMAT (Paperback|Hardcover) "Paperback">
]>
<BOOKLIST>
<BOOK GENRE="Science" FORMAT="Hardcover">
<AUTHOR>
<FIRSTNAME>RICHRD</FIRSTNAME>
<LASTNAME>KARTER</LASTNAME>
</AUTHOR>
</BOOK>
</BOOKLIST>
Flaws of DTDs
• No support for basic data types like integers, doubles,
dates, times, …
• No structured, self-definable data types
• No type derivation
• id/idref links are quite loose (target is not specified)
⇒ XML Schema
Limitations of DTD
• Element order is imposed
• No notion of atomic types: for example, "age" should be an integer, but in a DTD it can only be PCDATA
• No constraints on values
• The type of an IDREF target cannot be constrained: state-of should refer to a state element and cities-in to city elements, but a DTD cannot express this
• No scoping of names: the same name tag may have to serve for both a class name and a student name
The XML Schemas: Overview
<Students>
<Student id="p1">
<Name>Allan</Name>
<Age>62</Age>
<Email>[email protected]</Email>
</Student>
</Students>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:complexType name="StudentType">
<xs:sequence>
<xs:element name="Name" type="xs:string"/>
<xs:element name="Age" type="xs:integer"/>
<xs:element name="Email" type="xs:string"/>
</xs:sequence>
<xs:attribute name="id" type="xs:string"/>
</xs:complexType>
<xs:element name="Student" type="StudentType"/>
</xs:schema>
XSL
• Extensible Stylesheet Language (XSL)
– family of transformation languages
• To format and/or transform XML documents
• XSL Family consists of three languages
– XSL Transformations (XSLT)
– XSL Formatting Objects (XSL-FO)
– XML Path Language (XPath)
• All languages are W3C recommendations
XSLT Overview
What is XSLT?
– XSL is the Extensible Stylesheet Language.
– It has two parts: the transformation language and the
formatting language.
– XSLT provides a syntax for defining rules that
transform an XML document to another document.
• For example, to an HTML document.
– An XSLT “style sheet” consists primarily of a set of
template rules that are used to transform nodes
matching some patterns.
XSLT Overview
The xml-stylesheet processing instruction in the XML instance references an XSL
style sheet.
In general, children of the stylesheet element in a stylesheet are
templates.
A template specifies a pattern; the template is applied to nodes in the
XML source document that match this pattern.
– Note: the pattern "/" matches the root node of the document; we will see
this later
In the transformed document, the body of the template element
replaces the matched node in the source document.
In addition to text, the body may contain further XSL terms, e.g.:
– xsl:value-of extracts data from selected sub-nodes.
Languages
• XSL Transformations (XSLT)
– XML language for transforming XML documents
• XSL Formatting Objects (XSL-FO)
– XML language for specifying visual formatting
• XML Path Language (XPath)
– A non-XML language used by XSLT for addressing the
parts of an XML document. Also available for use
in non-XSLT contexts
Main Idea of XSLT
[Diagram: XSLT Processor]
The browser is used as an XSLT processor (client-side).
XSLT
• XSLT is a transformation language.
• With XSLT you can make transformations:
– XML -> XML in general:
• XML -> XHTML
• XML -> SVG
• ...
– XML -> HTML
– XML -> TEXT
• With XSLT you can transform an XML document into
any other text format.
Main Idea of XSLT: XML -> XHTML
[Diagram: XSLT Processor]
XPath
Selecting Nodes (W3schools)
Expression   Description
nodename     Selects all child elements of the current node that are named "nodename"
/            Selects from the root node
//           Selects matching nodes anywhere below the current node, no matter where they are
.            Selects the current node
..           Selects the parent of the current node
@            Selects attributes
Examples (W3schools)
Expression        Description
/bookstore        Selects the root element bookstore
bookstore/book    Selects all book elements that are children of bookstore
//book            Selects all book elements no matter where they are in the document
bookstore//book   Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang           Selects all attributes that are named lang
More Examples
/bookstore/book[1]
/bookstore/book[last()]
/bookstore/book[last()-1]
/bookstore/book[position()<3]
//title[@lang]
//title[@lang='eng']
/bookstore/book[price>35.00]
/bookstore/book[price>35.00]/title
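Several of these expressions can be tried with Python's standard-library ElementTree, which supports only a subset of XPath. In this sketch the bookstore data is made up, paths are written relative to the root element, and the price comparison is done in plain Python because ElementTree has no numeric predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; not taken from the slides.
BOOKSTORE = """<bookstore>
  <book><title lang="eng">Book A</title><price>29.99</price></book>
  <book><title lang="eng">Book B</title><price>39.95</price></book>
  <book><title>Book C</title><price>45.00</price></book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE)  # root is the bookstore element

first = root.find("book[1]/title").text              # /bookstore/book[1]
last = root.find("book[last()]/title").text          # /bookstore/book[last()]
second_last = root.find("book[last()-1]/title").text # /bookstore/book[last()-1]
with_lang = [t.text for t in root.findall(".//title[@lang]")]        # //title[@lang]
english = [t.text for t in root.findall(".//title[@lang='eng']")]    # //title[@lang='eng']

# /bookstore/book[price>35.00]/title, filtered in Python instead of XPath.
expensive = [b.findtext("title") for b in root.findall("book")
             if float(b.findtext("price")) > 35.00]

print(first, last, second_last, with_lang, english, expensive)
```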
Example of an XSLT-transformation:
XML-file
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl"
href="transform.xslt"?>
<book>
<title>Programming with Java</title>
</book>
Example of an XSLT-transformation:
XSLT-file
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
Title: <xsl:value-of select="/book/title"/>
</xsl:template>
</xsl:stylesheet>
Result: a simple text file.
Title: Programming with Java
The xsl:template element
When you match or select nodes, a template tells the XSLT
processor how to transform the node for output
So all our templates will have the form:
<xsl:template match="pattern">
template body
</xsl:template>
The pattern is an XPath expression describing the nodes to
which the template can be applied.
The processor scans the input document for nodes
matching this pattern, and replaces them with the text
included in the template body.
In a nutshell, this explains the whole operation of XSLT.
for-each
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="utf-8"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" />
<xsl:template match="/">
<html><head><title>Example</title></head>
<body>
<xsl:for-each select="/books/book">
<p><xsl:value-of select="title"/></p>
</xsl:for-each> <!-- repeated for every book -->
</body>
</html>
</xsl:template>
</xsl:stylesheet>
sort
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="utf-8"
doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN" />
<xsl:template match="/">
<html><head><title>Example</title></head>
<body>
<!-- repeated for every book, in ascending order of price -->
<xsl:for-each select="/books/book">
<xsl:sort select="price" data-type="number" order="ascending"/>
<p><xsl:value-of select="title"/></p>
</xsl:for-each>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
attribute
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html><head><title>Example</title></head>
<body>
<img>
<xsl:attribute name="src">
<xsl:value-of select="books/book/url"/>
</xsl:attribute>
</img>
</body>
</html>
</xsl:template>
</xsl:stylesheet>
An Example XML Document
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="demo1.xsl"?>
<purchaseOrder orderDate="1999-10-20">
<shipTo country="US">
<name>Matthias Hauswirth</name>
<street>4500 Brookfield Dr.</street>
<city>Boulder</city>
<state>CO</state>
<zip>80303</zip>
</shipTo>
<comment>Brian pays</comment>
<items>
<item partNum="123-AB">
<productName>Porsche</productName>
<quantity>1</quantity>
<price>129400.00</price>
<comment>Need a new one</comment>
</item>
<item>
<productName>Ferrari</productName>
<quantity>2</quantity>
<price>189000.25</price>
<shipDate>1999-05-21</shipDate>
</item>
</items>
</purchaseOrder>
Simple XSL Document
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<HTML>
<BODY>
<table border="2" bgcolor="yellow">
<xsl:for-each select="purchaseOrder">
<tr>
<td><xsl:value-of select="shipTo/name"/></td>
<td><xsl:value-of select="comment"/></td>
<xsl:for-each select="items/item">
<td><xsl:value-of select="price"/></td>
</xsl:for-each>
</tr>
</xsl:for-each>
</table>
</BODY>
</HTML>
</xsl:template>
</xsl:stylesheet>
Simple XSL Results
XSL:CHOOSE
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<HTML>
<BODY>
<table border="2" bgcolor="yellow">
<xsl:for-each select="purchaseOrder">
<tr>
<td><xsl:value-of select="shipTo/name"/></td>
<td><xsl:value-of select="comment"/></td>
<xsl:for-each select="items/item">
<xsl:choose>
<xsl:when test="price &gt; 100">
<td bgcolor="red"><xsl:value-of select="price"/></td>
</xsl:when>
<xsl:otherwise>
<td><xsl:value-of select="price"/></td>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</tr>
</xsl:for-each>
</table>
</BODY>
</HTML>
</xsl:template>
</xsl:stylesheet>
XSL:CHOOSE Results
XML Parser
• An XML parser is a software library or package that provides
interfaces for client applications to work with an XML document.
It reads the XML and gives programs a way to use its content.
• An XML parser checks that the document is well-formed; a validating
parser also checks it against a DTD or schema.
• Types of XML Parsers:
– DOM
– SAX
XML DOM Parser
• A DOM document is an object that contains all the information of an
XML document, organized as a tree. The DOM parser implements the DOM API,
which is simple to use.
• Features of DOM Parser
– A DOM parser builds an in-memory DOM document object; client applications
get information about the original XML document by invoking methods on this object.
– The DOM parser has a tree-based structure.
• Advantages
– 1) It supports both read and write operations, and the API is simple to use.
– 2) It is preferred when random access to widely separated parts of a
document is required.
• Disadvantages
– 1) It is memory inefficient: the whole XML document needs to be loaded
into memory.
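A minimal DOM sketch, assuming Python's standard-library xml.dom.minidom: the whole document is loaded into a tree that can be read and modified in any order. The input is a shortened version of the address example, and the replacement phone number is made up.

```python
from xml.dom import minidom

# Shortened address document; parsed completely into memory.
dom = minidom.parseString(
    "<address><name>Alice Lee</name><phone>212-346-1234</phone></address>"
)

# Read: navigate to any node by invoking methods on the document object.
name = dom.getElementsByTagName("name")[0].firstChild.data

# Write: change an existing text node (hypothetical new number) ...
dom.getElementsByTagName("phone")[0].firstChild.data = "212-346-0000"

# ... and append a new element to the root.
birthday = dom.createElement("birthday")
birthday.appendChild(dom.createTextNode("1995-03-22"))
dom.documentElement.appendChild(birthday)

serialized = dom.documentElement.toxml()
print(name)
print(serialized)
```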
XML SAX (Simple API for XML) Parser
• A SAX parser implements the SAX API. This API is event based
and less intuitive.
• Features of SAX Parser
– It does not create any internal structure.
– Clients do not call methods to fetch data; they override the callback
methods of the API and place their own code inside them.
– It is an event-based parser; it works like an event handler in Java.
• Advantages
– 1) It is simple and memory efficient.
– 2) It is very fast and works for huge documents.
• Disadvantages
– 1) It is event based, so its API is less intuitive.
– 2) Clients never see the whole document at once because the data arrives
in pieces.
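A minimal SAX sketch, assuming Python's standard-library xml.sax: the client overrides callback methods and collects what it needs while the parser streams through the document. The handler class TitleHandler and the input document are made up for illustration.

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so it is buffered.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

handler = TitleHandler()
xml.sax.parseString(
    b"<books><book><title>Programming with Java</title></book>"
    b"<book><title>My First XML</title></book></books>",
    handler,
)
print(handler.titles)
```

No tree is built: once an event has been handled, the parser keeps nothing of that part of the document.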
DOM v/s SAX XML Parser
• A DOM parser loads the whole XML document into memory,
while a SAX parser only holds a small part of the XML file in
memory at a time.
• Once a document is loaded, DOM gives fast repeated and random access
because the whole document is in memory; for a single pass over a
document, SAX usually has less overhead.
• A SAX parser in Java is better suited to large XML files
than a DOM parser because it does not require much memory.
• A DOM parser works on the Document Object Model, while
SAX is an event-based XML parser.
Applications of XML
• Database applications
• Document mark-up (with HTML)
• Mathematical Markup Language (MathML)
• Messaging between different business platforms
• Channel Definition Format (CDF)
• Metacontent definition
• Platform for Internet Content Selection (PICS)
• Resource Description Framework (RDF)
• Scalable Vector Graphics (SVG)
• Synchronized Multimedia Integration Language (SMIL)