
XML Basics PDF

XML is a hierarchical, human-readable format for encoding documents and structured data. It allows data to be blended with schema information to provide structure. XML is part of a broader ecosystem that includes tools for data access, querying, and distributed programs. XML encodes data as elements, which can contain nested elements and attributes. This semi-structured format makes XML flexible but also necessitates additional schema languages to constrain the structure.

Uploaded by

avbc

XML Basics

Web Technologies and SOA


Gaining Access to Diverse Data

We have focused on data integration in the relational model
– Simplest model to understand

    A    B
    a1   b1
    a2   b2

Real-world data is often not in relational form
– e.g., Excel spreadsheets, Web tables, Java objects, RDF, …
– One approach: convert using custom wrappers
– But suppose tools would adopt a standard export (and import) mechanism?

… This is the role of XML, the eXtensible Markup Language

What Is XML?

Hierarchical, human-readable format
– A “sibling” to HTML, always parsable
– “Lingua franca” of data: encodes documents and structured data
– Blends data and schema (structure)

Core of a broader ecosystem
– Data – XML (also RDF, Ch. 12)
– Schema – DTD and XML Schema
– Programmatic access – DOM and SAX
– Query – XPath, XSLT, XQuery
– Distributed programs – Web services

[Figure: the XML ecosystem. Labels: Database, Document, Web Service; XML; DTD/Schema; SAX/DOM; XPath; XQuery; procedural language (Java, JavaScript, C++, …); HTTP; REST / SOAP + WSDL.]


XML Anatomy

<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <journal>Digital System Research Center Report</journal>
    <volume>SRC1997-018</volume>
    <year>1997</year>
    <ee>db/labs/dec/SRC1997-018.html</ee>
    <ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
  </article>
</dblp>

Labels on the slide: processing instruction (<?xml … ?>), open-tag (<dblp>), element (<mastersthesis> … </mastersthesis>), attribute (mdate="2002-01-03"), close-tag (</volume>).

XML Data Components
XML includes two kinds of data items:
Elements <article mdate="2002-01-03" …>
<editor>Paul R. McJones</editor> …
</article>
• Hierarchical structure with open tag-close tag pairs
• May include nested elements
• May include attributes within the element’s open-tag
• Multiple elements may have same name
• Order matters
Attributes mdate="2002-01-03"
• Named values – not hierarchical
• Only one attribute with a given name per element
• Order does NOT matter
Well-Formed XML: Always Parsable

Any legal XML document is always parsable by an XML parser, without knowledge of tag meaning
– The start – preamble – tells XML about the char. encoding
    <?xml version="1.0" encoding="utf-8"?>
– There’s a single root element
– All open-tags have matching close-tags (unlike many HTML documents!), or a special:
    <tag/> shortcut for empty tags (equivalent to <tag></tag>)
– Attributes only appear once in an element
– XML is case-sensitive

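The well-formedness rules above can be checked with any conforming parser, without knowing what the tags mean. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the sample strings are illustrative assumptions, not part of the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical inputs, one per rule listed on the slide.
samples = {
    "matching tags": "<a><b/></a>",
    "missing close-tag": "<a><b></a>",
    "two roots": "<a/><b/>",
    "duplicate attribute": '<a x="1" x="2"/>',
    "case mismatch": "<Tag></tag>",
}

results = {name: is_well_formed(doc) for name, doc in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'rejected'}")
```

Only the first sample is accepted; each of the others violates one of the listed rules and raises a parse error.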
Outline

XML data model


– Node types
– Encoding relations and semi-structured data
– Namespaces
• XML schema languages
• XML querying
• XML query processing
• XML schema mapping
XML as a Data Model

XML “information set” includes 7 types of nodes:


– Document (root)
– Element
– Attribute
– Processing instruction
– Text (content)
– Namespace
– Comment
XML data model includes this, plus typing info, plus
order info and a few other things
XML Anatomy
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
<mastersthesis mdate="2002-01-03" key="ms/Brown92">
<author>Kurt P. Brown</author>
<title>PRPL: A Database Workload Specification Language</title>
<year>1992</year>
<school>Univ. of Wisconsin-Madison</school>
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>
</dblp>

XML Data Model Visualized (and simplified!)

[Figure: the sample document drawn as a tree. Legend: root, p-i, element, attribute, text.
  Root
    ?xml (p-i)
    dblp (element)
      mastersthesis
        attributes: mdate (2002…), key (ms/Brown92)
        elements: author (Kurt P.…), title (PRPL…), year (1992), school (Univ.…)
      article
        attributes: mdate (2002…), key (tr/dec/…)
        elements: editor (Paul R.…), title (The…), journal (Digital…), volume (SRC…), year (1997), ee (db/labs/dec…), ee (http://www.…)]

XML Easily Encodes Relations

Student-course-grade
    sid   cid      exp-grade
    1     570103   B
    23    550103   A

<student-course-grade>
  <tuple><sid>1</sid><cid>570103</cid><exp-grade>B</exp-grade></tuple>
  <tuple><sid>23</sid><cid>550103</cid><exp-grade>A</exp-grade></tuple>
</student-course-grade>

OR

<student-course-grade>
  <tuple sid="1" cid="570103" exp-grade="B"/>
  <tuple sid="23" cid="550103" exp-grade="A"/>
</student-course-grade>

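As a rough illustration of the two encodings above, the sketch below builds both forms from the same rows with Python's standard `xml.etree.ElementTree` and decodes either one back into tuples. The function names (`as_elements`, `as_attributes`, `decode`) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

COLUMNS = ("sid", "cid", "exp-grade")
ROWS = [("1", "570103", "B"), ("23", "550103", "A")]


def as_elements(rows):
    """Encode each column value as a child element of <tuple>."""
    root = ET.Element("student-course-grade")
    for row in rows:
        tup = ET.SubElement(root, "tuple")
        for name, value in zip(COLUMNS, row):
            ET.SubElement(tup, name).text = value
    return root


def as_attributes(rows):
    """Encode each column value as an attribute of <tuple>."""
    root = ET.Element("student-course-grade")
    for row in rows:
        ET.SubElement(root, "tuple", dict(zip(COLUMNS, row)))
    return root


def decode(root):
    """Recover the relation from either encoding."""
    rows = []
    for tup in root.findall("tuple"):
        if tup.attrib:
            rows.append(tuple(tup.get(c) for c in COLUMNS))
        else:
            rows.append(tuple(tup.findtext(c) for c in COLUMNS))
    return rows


element_xml = ET.tostring(as_elements(ROWS), encoding="unicode")
attribute_xml = ET.tostring(as_attributes(ROWS), encoding="unicode")
print(element_xml)
print(attribute_xml)
```

Both serializations carry the same relation; the choice between them is the element-vs-attribute design decision discussed later in the deck.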
XML is “Semi-Structured”

<parents>
  <parent name="Jean">
    <son>John</son>
    <daughter>Joan</daughter>
    <daughter>Jill</daughter>
  </parent>
  <parent name="Feng">
    <daughter>Ella</daughter>
  </parent>
</parents>

Combining XML from Multiple Sources with the Same Tags: Namespaces

• Namespaces allow us to specify a context for different tags
• Two parts:
  – Binding of namespace to URI
  – Qualified names

<root xmlns="http://www.first.com/aspace" xmlns:otherns="…">
  <myns:tag xmlns:myns="http://www.fictitious.com/mypath">
    <thistag>is in the default namespace (www.first.com/aspace)</thistag>
    <myns:thistag>is in myns</myns:thistag>
    <otherns:thistag>is a different tag in otherns</otherns:thistag>
  </myns:tag>
</root>

Annotations on the slide: xmlns="…" sets the default namespace for non-qualified names; xmlns:otherns="…" defines the “otherns” qualifier.

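To see how a parser resolves the three same-named tags, here is a small sketch with Python's `xml.etree.ElementTree`, which expands every qualified name to `{URI}localname`. The slide elides the URI bound to `otherns`, so `http://www.example.org/otherns` below is a placeholder assumption, as are the short prefixes `d`, `m`, `o` used for lookup.

```python
import xml.etree.ElementTree as ET

# The otherns URI is a hypothetical stand-in for the value elided on the slide.
DOC = """
<root xmlns="http://www.first.com/aspace"
      xmlns:otherns="http://www.example.org/otherns">
  <myns:tag xmlns:myns="http://www.fictitious.com/mypath">
    <thistag>is in the default namespace</thistag>
    <myns:thistag>is in myns</myns:thistag>
    <otherns:thistag>is a different tag in otherns</otherns:thistag>
  </myns:tag>
</root>
"""

root = ET.fromstring(DOC)

# Expanded names: the prefix disappears, the bound URI stays.
expanded = [el.tag for el in root.iter()]
for tag in expanded:
    print(tag)

# Prefixes used for lookup are local to the query, not to the document.
NS = {
    "d": "http://www.first.com/aspace",
    "m": "http://www.fictitious.com/mypath",
    "o": "http://www.example.org/otherns",
}
texts = {p: root.find(f"m:tag/{p}:thistag", NS).text for p in ("d", "m", "o")}
print(texts)
```

The three `thistag` elements share a local name but have distinct expanded names, which is what lets data from multiple sources be combined without tag clashes.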
Outline

XML data model


XML schema languages
– DTDs
– XML Schema (XSD)
• XML querying
• XML query processing
• XML schema mapping



XML Isn’t Enough on Its Own

It’s too unconstrained for many cases!


– How will we know when we’re getting garbage?
– How will we know what to query for?
– How will we understand what we received?

We also need:
– An idea of (at least part of) the structure
– Some knowledge of how to interpret the tags…

Structural Constraints:
Document Type Definitions (DTDs)

The DTD is an EBNF grammar defining XML structure


– The XML document specifies an associated DTD, plus the
root element of the document
– DTD specifies children of the root (and so on)

DTD also defines special attribute types:


– IDs – special attributes that are analogous to keys for
elements
– IDREFs – references to IDs
– IDREFS – a list of IDREFs, space-delimited (!)
– All other attributes are essentially treated as strings
An Example DTD and How to Reference It from XML

Example DTD:
<!ELEMENT dblp ((mastersthesis | article)*)>
<!ELEMENT mastersthesis (author, title, year, school, committeemember*)>
<!ATTLIST mastersthesis mdate   CDATA #REQUIRED
                        key     ID    #REQUIRED
                        advisor CDATA #IMPLIED>
<!ELEMENT author (#PCDATA)>

Example use of DTD in XML file:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE dblp SYSTEM "my.dtd">
<dblp>…

The Limitations of DTDs

DTDs capture grammatical structure, but have some drawbacks:
– Don’t capture database datatypes’ domains
– IDs aren’t a good implementation of keys
  • Why not?
– No way of defining OO-like inheritance
– “Almost XML” syntax – inconvenient to build tools for them

XML Schema (XSD)

Aims to address the shortcomings of DTDs


– XML syntax
– Can define keys using XPaths (we’ll discuss later)
– Type subclassing that also includes restrictions on ranges
• “By extension” (adds new data) and “by restriction” (adds
constraints)
– … And, of course, domains and built-in datatypes

(Note there are other XML schema formats like RELAX NG)

Basics of XML Schema

Need to use the XML Schema namespace (generally named xsd)


• simpleTypes are a way of restricting domains on scalars
– Can define a simpleType based on integer, with values within a particular range
• complexTypes are a way of defining element/attribute structures
– Basically equivalent to !ELEMENT, but more powerful
– Specify sequence, choice between child elements
– Specify minOccurs and maxOccurs (default 1)
• Must associate an element/attribute with a simpleType, or an element
with a complexType

Simple XML Schema Example

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="mastersthesis" type="ThesisType"/>
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="year" type="xsd:integer"/>
      <xsd:element name="school" type="xsd:string"/>
      <xsd:element name="committeemember" type="CommitteeType" minOccurs="0"/>
    </xsd:sequence>
    <xsd:attribute name="mdate" type="xsd:date"/>
    <xsd:attribute name="key" type="xsd:string"/>
    <xsd:attribute name="advisor" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>

Annotations on the slide: the xmlns:xsd declaration associates the “xsd” namespace with XML Schema; mastersthesis is the root element, with its type (ThesisType) specified below it. Within a complexType, attribute declarations follow the content model (xsd:sequence).

Designing an XML Schema/DTD

Not as formalized as relational data design


– Typically based on an existing underlying design, e.g., relational DBMS or
spreadsheet

We generally orient the XML tree around the “central” objects

Big decision: element vs. attribute


– Element if it has its own properties, or if you might have more than one of them
– Attribute if it is a single property – though element is OK here too!

Outline

XML data model


XML schema languages
XML querying
– DOM and SAX
– XPath
– XQuery
• XML query processing
• XML schema mapping
XML to Your Program: Document Object Model (DOM) and Simple API for XML (SAX)

• A huge benefit of XML – standard parsers and standard (cross-language) APIs for processing it

• DOM: an object-oriented representation of the XML parse tree (roughly like the Data Model graph)
  – DOM objects have methods like “getFirstChild()”, “getNextSibling()”
  – Common way of traversing the tree
  – Can also modify the DOM tree – alter the XML – via insertBefore(), etc.

• Sometimes we don’t want all of the data: SAX
  – Parser interface that calls a function each time it parses a processing-instruction, element, etc.
  – Your code can determine what to do, e.g., build a data structure, or discard a particular portion of the data

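The contrast between the two APIs can be sketched with Python's standard library, which ships both a DOM implementation (`xml.dom.minidom`) and a SAX interface (`xml.sax`). In Python's DOM binding the Java-style accessors named above appear as the attributes `firstChild` and `nextSibling`. The document string and the `YearCollector` handler are illustrative assumptions.

```python
import xml.dom.minidom
import xml.sax

# A trimmed, whitespace-free version of the sample document (an assumption
# made so that firstChild is an element rather than a whitespace text node).
DOC = (
    "<dblp>"
    '<mastersthesis key="ms/Brown92"><author>Kurt P. Brown</author>'
    "<year>1992</year></mastersthesis>"
    '<article key="tr/dec/SRC1997-018"><editor>Paul R. McJones</editor>'
    "<year>1997</year></article>"
    "</dblp>"
)

# DOM: the whole parse tree is built in memory, then traversed.
dom = xml.dom.minidom.parseString(DOC)
first = dom.documentElement.firstChild
second = first.nextSibling
dom_names = [first.tagName, second.tagName]


# SAX: the parser calls back per event; we keep only the <year> text.
class YearCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.years = []
        self._in_year = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "year":
            self._in_year = True
            self._buffer = []

    def characters(self, content):
        if self._in_year:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "year":
            self.years.append("".join(self._buffer))
            self._in_year = False


handler = YearCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
print(dom_names, handler.years)
```

The SAX handler never holds more than the current `<year>` text, which is why the slides later use SAX as the basis for streaming evaluation.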
Querying XML

Alternate approach to processing the data: a query language


– Define some sort of a template describing traversals from the root of the directed
graph

– Potential benefits in parallelism, views, schema mappings, and so on

– In XML, the basis of this template is called an XPath


• Can also declare some constraints on the values you want
• The XPath returns a node set of matches

Querying XML

• The doc() function is used to open the "books.xml" file:
  – doc("books.xml")
• Select all the title elements in the "books.xml" file:
  – doc("books.xml")/bookstore/book/title
• Select all the book elements under the bookstore element that have a price element with a value that is less than 30:
  – doc("books.xml")/bookstore/book[price<30]



XPaths

Path Expression    Result
bookstore          Selects all nodes with the name "bookstore"
/bookstore         Selects the root element bookstore. Note: If the path starts with a slash ( / ) it always represents an absolute path to an element!
bookstore/book     Selects all book elements that are children of bookstore
//book             Selects all book elements no matter where they are in the document
bookstore//book    Selects all book elements that are descendants of the bookstore element, no matter where they are under the bookstore element
//@lang            Selects all attributes that are named lang



XPaths

Path Expression                  Result
/bookstore/book[1]               Selects the first book element that is the child of the bookstore element
/bookstore/book[last()]          Selects the last book element that is the child of the bookstore element
/bookstore/book[last()-1]        Selects the last but one book element that is the child of the bookstore element
/bookstore/book[position()<3]    Selects the first two book elements that are children of the bookstore element

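Several of the path expressions in the tables above can be tried with Python's `xml.etree.ElementTree`, which implements a subset of XPath evaluated relative to a context element. This is a sketch under assumptions: the `books.xml` content is invented for the example, and where ElementTree lacks a feature (comparison predicates such as `price<30`, `position()<3`, selecting attribute nodes) the code falls back to plain Python and says so in a comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for books.xml.
BOOKS = """
<bookstore>
  <book><title lang="en">A</title><price>29.99</price></book>
  <book><title lang="en">B</title><price>39.95</price></book>
  <book><title lang="fr">C</title><price>19.50</price></book>
</bookstore>
"""

bookstore = ET.fromstring(BOOKS)  # context node: the <bookstore> element


def titles(books):
    return [b.findtext("title") for b in books]


all_books = titles(bookstore.findall("book"))           # bookstore/book
anywhere = titles(bookstore.findall(".//book"))         # //book
first = titles(bookstore.findall("book[1]"))            # book[1]
last = titles(bookstore.findall("book[last()]"))        # book[last()]
last_but_one = titles(bookstore.findall("book[last()-1]"))

# Not in ElementTree's XPath subset: emulate book[position()<3] by slicing.
first_two = titles(bookstore.findall("book")[:2])

# Not in ElementTree's XPath subset: emulate book[price<30] with a filter.
cheap = titles(
    b for b in bookstore.findall("book") if float(b.findtext("price")) < 30
)

# //@lang selects attribute nodes; here we select the elements that carry
# the attribute and read its value instead.
langs = [t.get("lang") for t in bookstore.findall(".//title[@lang]")]

print(first, last, last_but_one, first_two, cheap, langs)
```

A full XPath engine would evaluate all of these directly; the fallbacks only mark where this particular library stops.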


Recall Our Sample XML
<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
<mastersthesis mdate="2002-01-03" key="ms/Brown92">
<author>Kurt P. Brown</author>
<title>PRPL: A Database Workload Specification
Language</title>
<year>1992</year>
<school>Univ. of Wisconsin-Madison</school>
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>
</dblp>

Recall Our XML Tree

[Figure: the tree from “XML Data Model Visualized”, repeated. Root has children ?xml (p-i) and dblp; dblp has children mastersthesis and article, each with attributes mdate and key and with child elements holding text.]

Some Example XPath Queries

• /dblp/mastersthesis/title
• /dblp/*/editor
• //title
• //title/text()

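The four example queries can be approximated against the sample document with Python's `xml.etree.ElementTree`. This is only a sketch: ElementTree evaluates paths relative to a context element (here `<dblp>`), so the leading `/dblp` step is implicit, and `text()` is emulated by reading each element's `.text`. The document below is a shortened copy of the sample.

```python
import xml.etree.ElementTree as ET

DBLP = """<dblp>
  <mastersthesis mdate="2002-01-03" key="ms/Brown92">
    <author>Kurt P. Brown</author>
    <title>PRPL: A Database Workload Specification Language</title>
    <year>1992</year>
    <school>Univ. of Wisconsin-Madison</school>
  </mastersthesis>
  <article mdate="2002-01-03" key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
</dblp>"""

dblp = ET.fromstring(DBLP)  # context node: the <dblp> element

thesis_titles = dblp.findall("mastersthesis/title")  # /dblp/mastersthesis/title
editors = dblp.findall("*/editor")                   # /dblp/*/editor
all_titles = dblp.findall(".//title")                # //title
title_text = [t.text for t in all_titles]            # //title/text()

print([t.text for t in thesis_titles])
print([e.text for e in editors])
print(title_text)
```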
Context Nodes and Relative Paths

XPath has a notion of a context node: it’s analogous to a current directory
– “.” represents this context node
– “..” represents the parent node
– We can express relative paths: subpath/sub-subpath/../.. gets us back to the context node

By default, the document root is the context node

Predicates – Selection Operations

A predicate allows us to filter the node set based on selection-like conditions over sub-XPaths:

/dblp/article[title = "Paper1"]

which is equivalent to:

/dblp/article[./title/text() = "Paper1"]

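A predicate is a selection over the node set produced by the step it is attached to. The sketch below shows the same filter twice in Python: once through ElementTree's limited predicate support (`[tag='text']`), and once as an explicit filter, which is what the predicate means. The two-article document and its `key` values are invented for the example.

```python
import xml.etree.ElementTree as ET

dblp = ET.fromstring(
    "<dblp>"
    "<article key='a1'><title>Paper1</title></article>"
    "<article key='a2'><title>Paper2</title></article>"
    "</dblp>"
)

# /dblp/article[title = "Paper1"], relative to the <dblp> context element.
by_predicate = dblp.findall("article[title='Paper1']")

# The same selection written as an explicit filter over the node set.
by_filter = [a for a in dblp.findall("article") if a.findtext("title") == "Paper1"]

print([a.get("key") for a in by_predicate])
```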
Axes: More Complex Traversals

Thus far, we’ve seen XPath expressions that go down the tree (and up one step)
– But we might want to go up, left, right, etc. via axes:
  • self::path-step
  • child::path-step                 parent::path-step
  • descendant::path-step            ancestor::path-step
  • descendant-or-self::path-step    ancestor-or-self::path-step
  • preceding-sibling::path-step     following-sibling::path-step
  • preceding::path-step             following::path-step
– The previous XPaths we saw were in “abbreviated form”; /dblp/mastersthesis/title and //title expand to:
    /child::dblp/child::mastersthesis/child::title
    /descendant-or-self::node()/child::title

Querying Order

• We saw in the previous slide that we could query for preceding or following siblings or nodes
• We can also query a node’s position according to some index:
  – fn:last()       index of the last node matching the last step (positions are 1-based, so the first node is [1])
  – fn:position()   relative count of the current node

    child::article[fn:position() = fn:last()]

XPath Is Used within Many Standards

• XML Schema uses simple XPaths in defining keys and uniqueness constraints
• XQuery
• XSLT
• XLink and XPointer – hyperlinks for XML

XPath Is Used to Express XML Schema Keys & Foreign Keys

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="ThesisType">
    <xsd:sequence>
      <xsd:element name="author" type="xsd:string"/> …
      <xsd:element name="school" type="xsd:string"/> …
    </xsd:sequence>
    <xsd:attribute name="key" type="xsd:string"/>
  </xsd:complexType>
  <xsd:element name="dblp">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="mastersthesis" type="ThesisType">
          <xsd:keyref name="schoolRef" refer="schoolId">
            <xsd:selector xpath="./school"/> <xsd:field xpath="./text()"/>
          </xsd:keyref>
        </xsd:element>
        <xsd:element name="university" type="SchoolType">…</xsd:element>
      </xsd:sequence>
    </xsd:complexType>
    <xsd:key name="schoolId">
      <xsd:selector xpath="university"/><xsd:field xpath="@key"/>
    </xsd:key>
  </xsd:element>
</xsd:schema>

Annotations on the slide: the keyref is a foreign key that refers to the key by its ID (refer="schoolId"); in the key, the selector picks the item with the key, and the field is its key.

Beyond XPath: XQuery

A strongly-typed, Turing-complete XML manipulation language
– Attempts to do static typechecking against XML Schema
– Based on an object model derived from Schema

Unlike SQL, fully compositional, highly orthogonal:
– Inputs & outputs collections (sequences or bags) of XML nodes
– Anywhere a particular type of object may be used, may use the results of a query of the same type
– Designed mostly by DB and functional language people

Can be used to define queries, views, and (using a subset) schema mappings

XQuery’s Basic Form

• Has an analogous form to SQL’s SELECT..FROM..WHERE..GROUP BY..ORDER BY
• The model: bind nodes (or node sets) to variables; operate over each legal combination of bindings; produce a set of nodes
• “FLWOR” statement [note case sensitivity!]:
    for {iterators that bind variables}
    let {collections}
    where {conditions}
    order by {order-paths}
    return {output constructor}
• Mixes XML + XQuery syntax; use {} as “escapes”

Recall Our XML Tree

[Figure: the tree from “XML Data Model Visualized”, repeated. Root has children ?xml (p-i) and dblp; dblp has children mastersthesis and article, each with attributes mdate and key and with child elements holding text.]

“Iterations” in XQuery

A series of (possibly nested) FOR statements assigning the results of XPaths to variables

for $root in doc("http://my.org/my.xml")
for $sub in $root/rootElement,
    $sub2 in $sub/subElement, …

• Something like a template that pattern-matches, produces a “binding tuple”
• For each of these, we evaluate the WHERE and possibly output the RETURN template
• document() or doc() function specifies an input file as a URI
  – Early versions used “document”; modern versions use “doc”

Two XQuery Examples

<root-tag> {
  for $p in doc("dblp.xml")/dblp/article,
      $yr in $p/year
  where $yr = "1997"
  return <paper> { $p/title } </paper>
} </root-tag>

for $i in doc("dblp.xml")/dblp/article[author/text() = "John Smith"]
return <smith-paper>
         <title>{ $i/title/text() }</title>
         <key>{ $i/@key }</key>
         { $i/crossref }
       </smith-paper>

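To make the binding-tuple model concrete, the first query can be hand-translated into nested loops. The sketch below does this in Python with `xml.etree.ElementTree`; it illustrates the semantics only and is not how an XQuery engine is implemented. The second article and the function name `papers_from` are assumptions added for the example.

```python
import copy
import xml.etree.ElementTree as ET

DBLP = """<dblp>
  <article key="tr/dec/SRC1997-018">
    <editor>Paul R. McJones</editor>
    <title>The 1995 SQL Reunion</title>
    <year>1997</year>
  </article>
  <article key="hypothetical/1998">
    <title>Some Other Paper</title>
    <year>1998</year>
  </article>
</dblp>"""


def papers_from(dblp, wanted_year):
    out = ET.Element("root-tag")
    for p in dblp.findall("article"):          # for $p in .../dblp/article,
        for yr in p.findall("year"):           #     $yr in $p/year
            if yr.text != wanted_year:         # where $yr = "1997"
                continue
            paper = ET.SubElement(out, "paper")  # return <paper>{ $p/title }</paper>
            for title in p.findall("title"):
                clone = copy.deepcopy(title)
                clone.tail = None              # drop source indentation
                paper.append(clone)
    return out


result = ET.tostring(papers_from(ET.fromstring(DBLP), "1997"), encoding="unicode")
print(result)
```

Each pass through the inner loop corresponds to one binding tuple ($p, $yr); the `where` clause discards tuples, and the `return` clause emits one `<paper>` per surviving tuple.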
Restructuring Data in XQuery

Nesting XML trees is perhaps the most common operation
In XQuery, it’s easy – put a subquery in the return clause where you want things to repeat!

for $u in doc("dblp.xml")/dblp/university
where $u/country = "USA"
return <ms-theses-99>
  { $u/name } {
    for $mt in doc("dblp.xml")/dblp/mastersthesis
    where $mt/year/text() = "1999" and $mt/school = $u/name
    return $mt/title }
</ms-theses-99>

Collections & Aggregation in XQuery

In XQuery, many operations return collections
– XPaths, sub-XQueries, functions over these, …
– The let clause assigns the results to a variable
Aggregation simply applies a function over a collection, where the function returns a value (very elegant!)

let $allpapers := doc("dblp.xml")/dblp/article
return <article-authors>
  <count> { fn:count(fn:distinct-values($allpapers/author)) } </count>
  { for $paper in doc("dblp.xml")/dblp/article
    let $pauth := $paper/author
    return <paper> {$paper/title}
             <count> { fn:count($pauth) } </count>
           </paper>
  } </article-authors>

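The two aggregates in this query, a distinct-author count over all papers and a per-paper author count, are both just functions applied to collections. A rough Python equivalent, using invented data with `author` children, might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical articles; the sample document on the slides has no <author>
# children under <article>, so these are invented for the example.
DBLP = """<dblp>
  <article><title>P1</title><author>Ann</author><author>Bob</author></article>
  <article><title>P2</title><author>Bob</author></article>
</dblp>"""

dblp = ET.fromstring(DBLP)
all_papers = dblp.findall("article")           # let $allpapers := .../dblp/article

# fn:count(fn:distinct-values($allpapers/author))
distinct_authors = {a.text for p in all_papers for a in p.findall("author")}
total_distinct = len(distinct_authors)

# for $paper ... let $pauth := $paper/author ... fn:count($pauth)
per_paper = [(p.findtext("title"), len(p.findall("author"))) for p in all_papers]

print(total_distinct, per_paper)
```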
Collections, Ctd.

Unlike in SQL, we can compose aggregations and create new collections from old:

<result> {
  let $avgItemsSold := fn:avg(
    for $order in doc("my.xml")/orders/order
    let $totalSold := fn:sum($order/item/quantity)
    return $totalSold)
  return $avgItemsSold
} </result>

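The composition here is an average taken over a collection of sums. A small Python sketch of the same shape, over an invented `orders` document, is shown below; the inner comprehension plays the role of the nested `for … let … return`.

```python
from statistics import mean
import xml.etree.ElementTree as ET

# Hypothetical stand-in for my.xml.
ORDERS = """<orders>
  <order><item><quantity>2</quantity></item><item><quantity>3</quantity></item></order>
  <order><item><quantity>7</quantity></item></order>
</orders>"""

orders = ET.fromstring(ORDERS)

# Inner query: one fn:sum per order yields a new collection of totals.
totals = [
    sum(int(q.text) for q in order.findall("item/quantity"))
    for order in orders.findall("order")
]

# Outer aggregate: fn:avg over that collection.
avg_items_sold = mean(totals)
print(totals, avg_items_sold)
```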
Distinct-ness

In XQuery, DISTINCT-ness happens as a function over a collection
– But since we have nodes, we can do duplicate removal according to value or node
– Can do fn:distinct-values(collection) to remove duplicate values, or fn:distinct-nodes(collection) to remove duplicate nodes (fn:distinct-nodes comes from early drafts; the final XQuery 1.0 function library keeps only fn:distinct-values)

for $years in fn:distinct-values(doc("dblp.xml")//year/text())
return $years

Sorting in XQuery

• In XQuery, what we order is the sequence of “result tuples” output by the return clause:

for $x in doc("dblp.xml")/proceedings
order by $x/title/text()
return $x

Querying & Defining Metadata

Can get a node’s name by querying name():

for $x in doc("dblp.xml")/dblp/*
return name($x)

Can construct elements and attributes using computed names:

for $x in doc("dblp.xml")/dblp/*,
    $year in $x/year,
    $title in $x/title/text()
return
  element { name($x) } {
    attribute { fn:concat("year-", $year) } { $title }
  }

Views in XQuery

• A view is a named query
• We use the name of the view to invoke the query (treating it as if it were the relation it returns)

XQuery:
declare function V() as element(content)* {
  for $r in doc("R")/root/tree,
      $a in $r/a, $b in $r/b, $c in $r/c
  where $a = "123"
  return <content>{$a, $b, $c}</content>
}

Using the view:
for $v in V()/content,
    $r in doc("r")/root/tree
where $v/b = $r/b
return $v

Outline

XML data model


XML schema languages
XML querying
XML query processing
• XML schema mapping



Streaming Query Evaluation

• In data integration scenarios, the query processor must fetch remote data, parse the XML, and process
• Ideally: we can pipeline processing of the data as it is “streaming” to the system

“Streaming XPath evaluation”
… which is also a building block to pipelined XQuery evaluation…



Main Observations

• XML is sent (serialized) in a form that corresponds to a left-to-right depth-first traversal of the parse tree
• The “core” part of XPath (child, descendant axes) essentially corresponds to regular expressions over edge labels



The First Enabler: SAX (Simple API for XML)

• If we are to match XPaths in streaming fashion, we need a stream of XML nodes
• SAX provides a series of event notifications
  – Events include open-tag, close-tag, character data
  – Events will be fired in depth-first, left-to-right traversal order of the XML tree

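The Second Key: Finite Automata

• Convert each XPath to an equivalent regular expression
• Build a finite automaton (NFA or DFA) for the regexp

[Figure: automata for two XPaths. /dblp/article: a chain of states with transitions labeled dblp, then article. //year: an automaton whose labeled transition is year.]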

Matching an XPath

• Assume a “cursor” on active state in the automaton
• On matching open-tag: push, advance active state
• On close-tag: pop active state

Automaton for /dblp/article: state 1, then dblp to state 2, then article to state 3. A stack holds the saved states.

<?xml version="1.0" encoding="ISO-8859-1" ?>
<dblp>
<mastersthesis>
</mastersthesis>
<article mdate="2002-01-03" key="tr/dec/SRC1997-018">
<editor>Paul R. McJones</editor>
<title>The 1995 SQL Reunion</title>
<journal>Digital System Research Center Report</journal>
<volume>SRC1997-018</volume>
<year>1997</year>
<ee>db/labs/dec/SRC1997-018.html</ee>
<ee>http://www.mcjones.org/System_R/SQL_Reunion_95/</ee>
</article>

The slide animates four events in turn:
1. event: start-element “dblp”: the cursor advances from state 1 to state 2
2. event: start-element “mastersthesis”: no transition matches, so the cursor enters a dead state
3. event: end-element “mastersthesis”: pop; state 2 is active again
4. event: start-element “article”: the cursor advances to state 3: match!

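The push/advance/pop rules above can be sketched as a SAX handler in Python. This is a minimal illustration for child-only paths such as /dblp/article, not a full streaming XPath engine: states are numbered from 0 (the number of steps matched so far) rather than from 1 as on the slide, and the class name `PathMatcher`, the `DEAD` marker, and the nested `<article key="nested"/>` in the test document are assumptions added for the example.

```python
import xml.sax


class PathMatcher(xml.sax.ContentHandler):
    """Drive a linear automaton for a child-only XPath from SAX events."""

    DEAD = -1  # no transition possible below this point

    def __init__(self, steps):
        super().__init__()
        self.steps = steps   # e.g. ["dblp", "article"]
        self.state = 0       # number of steps matched so far
        self.stack = []      # saved states, restored on close-tags
        self.matches = []

    def startElement(self, name, attrs):
        self.stack.append(self.state)              # push
        can_advance = (
            self.state != self.DEAD
            and self.state < len(self.steps)
            and self.steps[self.state] == name
        )
        if can_advance:
            self.state += 1                        # advance
            if self.state == len(self.steps):      # accepting state
                self.matches.append(attrs.get("key"))
        else:
            self.state = self.DEAD

    def endElement(self, name):
        self.state = self.stack.pop()              # pop


DOC = b"""<dblp>
  <mastersthesis><article key="nested"/></mastersthesis>
  <article key="tr/dec/SRC1997-018"><year>1997</year></article>
</dblp>"""

matcher = PathMatcher(["dblp", "article"])
xml.sax.parseString(DOC, matcher)
print(matcher.matches)
```

The article nested under `<mastersthesis>` is not reported, because the cursor is in the dead state while that subtree streams past; only the article that is a child of `<dblp>` reaches the accepting state.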
Different Options

• Many different “streaming XPath” algorithms
  – What kind of automaton to use
    • DFA, NFA, lazy DFA, PDA, proprietary format
  – Expressiveness of the path language
    • Full regular path expressions, XPath, …
    • Axes
  – Which operations can be pushed into the operator
    • XPath predicates, joins, position predicates, etc.

From XPaths to XQueries

• An XQuery takes multiple XPaths in the FOR/LET clauses, and iterates over the elements of each XPath (binding the variable to each)
    FOR $rootElement in doc("dblp.xml")/dblp,
        $rootChild in $rootElement/article[author="Bob"],
        $textContent in $rootChild/text()
  – We can think of an XQuery as doing tree matching, which returns tuples ($i, $j) for each tree matching $i and $j in a document
• Streaming XML path evaluator that supports a hierarchy of matches over an XML document


XQuery Path Evaluation

FOR $rootElement in doc("dblp.xml")/dblp,
    $rootChild in $rootElement/article[author="Bob"],
    $textContent in $rootChild/text()

• Multiple, dependent state machines outputting binding tuples ($rootElement, $rootChild, $textContent)

[Figure: three chained state machines. A dblp transition binds $rootElement; an article transition binds $rootChild, where the pushed-down selection predicate author = "Bob" is evaluated; a text() transition binds the set $textContent. Annotation: only activate $rootChild + $textContent on a match to $rootElement.]

Beyond the Initial FOR Paths

• The streaming XML evaluator operator returns tuples of bindings to nodes, with columns $rootElement, $rootChild, $textContent
• We can now use standard relational operators to join, sort, group, etc.
• Also in some cases we may want to do further XPath evaluation against one of the XML trees bound to a variable


Creating XML

• To return XML, we need to be able to take streams of binding tuples and:
  – Add tags around certain columns
  – Group tuples together and nest them under tags
• Thus XQuery evaluators have new operators for performing these operations


An Example XQuery Plan

[Figure: a query plan over dblp.xml, read bottom-up. Labels recoverable from the figure:
 – Streaming XPath evaluation over dblp.xml: binds $rootElement (dblp), $rootChild (article, with author = “Bob”) and the set $textContent (text()), producing tuples such as (<dblp>…</dblp>, <article>…</article>, [“Paul R. McJones”, “The 1995…”, …])
 – Relational-style query operators: projection (Π); an XPath matcher over $textContent binding $txt, giving (“Paul R. McJones”), (“The 1995…”); XML tagging, giving (<text>Paul R. McJones</text>), (<text>The 1995…</text>); XML grouping, giving (<text>Paul R. McJones</text><text>The 1995…</text>); an outerjoin (⊐⋈) with the (<article>…</article>, [“Paul R. McJones”, “The 1995…”, …]) tuples
 – XPath evaluation against a binding: an XPath matcher over $rootChild binding $editor (editor) and $title (title)
 – Projection (Π), giving (<editor>Paul R. McJones</editor>, <title>The 1995…</title>, <text>Paul R. McJones</text><text>The 1995…</text>)
 – XML output / XML tagging operator, giving (<BobResult><editor>Paul R. McJones</editor><title>The 1995…</title><text>Paul R. McJones</text><text>The 1995…</text></BobResult>)]

Optimizing XQueries

• An entire field in and of itself

• A major challenge versus relational query


optimization: estimating the “fan-out” of path
evaluation

• A second major challenge: full XQuery supports


arbitrary recursion and is Turing-complete



Outline

XML data model


XML schema languages
XML querying
XML query processing
XML schema mapping



Schema Mappings for XML

• In Chapter 3 we saw how schema mappings were


described for relational data
– As a set of constraints between source and target databases

• In the XML realm, we want a similar constraint


language, but must address:
– Nesting – XML is hierarchical
– Identity – how do we merge multiple partial results into a
single XML tree?



One Approach: Piazza XML Mappings

Derived from a subset of XQuery extended with node identity
– The latter is used to merge results with the same node ID
Directional mapping language based on annotations to XML templates

<output>
  {: $var IN document("doc")/path WHERE condition :}
  <tag>$var</tag>
</output>

Annotations on the slide: <output> is an output element in the template (~ XQuery RETURN); the {: … :} annotation creates the element for each match to this set of XPaths & conditions; <tag>$var</tag> populates it with the value of a binding.

– Translates between parts of data instances
– Supports special annotations and object fusion

Mapping Example between Two XML Schemas

Target: Publications by book
<pubs>
  <book>*
    <title>
    <author>*
      <name>

Source: Publications by author
<authors>
  <author>*
    <full-name>
    <publication>*
      <title>
      <pub-type>

Has an entity-relationship model representation like:
publication (title, pub-type), related by writtenBy to author (name)

Example Piazza-XML Mapping

<pubs>
  <book>
    {: $a IN document("…")/authors/author,
       $an IN $a/full-name,
       $t IN $a/publication/title,
       $typ IN $a/publication/pub-type
       WHERE $typ = "book" :}
    <title>{$t}</title>
    <author><name>{$an}</name></author>
  </book>
</pubs>

Annotations on the slide: output one book per match to author; insert title and author name subelements.

Example Piazza-XML Mapping

<pubs>
  <book piazza:id={$t}>
    {: $a IN document("…")/authors/author,
       $an IN $a/full-name,
       $t IN $a/publication/title,
       $typ IN $a/publication/pub-type
       WHERE $typ = "book" :}
    <title piazza:id={$t}>{$t}</title>
    <author><name>{$an}</name></author>
  </book>
</pubs>

Annotations on the slide: merge elements if they are for the same value of $t; output one book per match to author; insert title and author name subelements.

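The effect of the piazza:id annotation, merging output elements that share the same title, can be imitated in ordinary code with a dictionary keyed on the title. The Python sketch below is only an analogy for what the mapping produces, not an implementation of Piazza. It also assumes that title and pub-type are read from the same publication element, which the mapping text does not state explicitly; the source data and the function name `map_to_pubs` are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical source instance in the "publications by author" schema.
SOURCE = """<authors>
  <author><full-name>Ann</full-name>
    <publication><title>T1</title><pub-type>book</pub-type></publication>
    <publication><title>T2</title><pub-type>article</pub-type></publication>
  </author>
  <author><full-name>Bob</full-name>
    <publication><title>T1</title><pub-type>book</pub-type></publication>
  </author>
</authors>"""


def map_to_pubs(authors):
    pubs = ET.Element("pubs")
    books_by_title = {}  # plays the role of piazza:id={$t}
    for a in authors.findall("author"):              # $a
        an = a.findtext("full-name")                 # $an
        for pub in a.findall("publication"):
            if pub.findtext("pub-type") != "book":   # WHERE $typ = "book"
                continue
            t = pub.findtext("title")                # $t
            book = books_by_title.get(t)
            if book is None:                         # first match for this title
                book = ET.SubElement(pubs, "book")
                ET.SubElement(book, "title").text = t
                books_by_title[t] = book
            author = ET.SubElement(book, "author")   # one <author> per match
            ET.SubElement(author, "name").text = an
    return pubs


target_xml = ET.tostring(map_to_pubs(ET.fromstring(SOURCE)), encoding="unicode")
print(target_xml)
```

Without the dictionary, each match would create its own `<book>`, which corresponds to the first version of the mapping; with it, both authors of T1 end up under a single `<book>`.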
A More Formal Model: Nested TGDs

The underpinnings of the Piazza-XML mapping language can be captured using nested tuple-generating dependencies (nested TGDs)
– Recall relational TGDs from Chapter 3

$$\forall \bar{X}, \bar{Y}, \bar{S}\; \big(\phi(\bar{X}, \bar{Y}) \wedge \Phi(\bar{S}) \Rightarrow \exists \bar{Z}, \bar{T}\; (\psi(\bar{X}, \bar{Z}) \wedge \Psi(\bar{T}))\big)$$

  where $\phi$: formulas over source; $\Phi$: formulas over set-valued source variables; $\psi$: formulas over target; $\Psi$: formulas over set-valued target variables, with grouping keys

– As before, we’ll typically omit the $\forall$ quantifiers…



Example Piazza-XML Mapping as a Nested TGD

<pubs>
  <book piazza:id={$t}>
    {: $a IN document("…")/authors/author,
       $an IN $a/full-name,
       $t IN $a/publication/title,
       $typ IN $a/publication/pub-type
       WHERE $typ = "book" :}
    <title piazza:id={$t}>{$t}</title>
    <author><name>{$an}</name></author>
  </book>
</pubs>

$$\mathit{authors}(\mathit{author}) \wedge \mathit{author}(f, \mathit{publication}) \wedge \mathit{publication}(t, \mathit{book}) \Rightarrow \exists p\; \big(\mathit{pubs}(\mathit{book}) \wedge \mathit{book}_{t}(t, \mathit{author}', \mathit{publisher}) \wedge \mathit{author}'_{t,f}(f) \wedge \mathit{publisher}_{t}(p)\big)$$

The subscripts ($t$; $t,f$) are the grouping keys in the target.

Query Reformulation for XML

• Two main versions:
  – Global-as-view-style:
    • Query is posed over the target of a nested TGD, or a Piazza-XML mapping
    • Can answer the query through standard XQuery view unfolding
  – Bidirectional mappings, more like GLAV mappings in the relational world:
    • An advanced topic – see the bibliographic notes


XML Wrap-up

• XML forms an important part of the data integration picture – it’s a “bridge” enabling rapid connection to external sources
• It introduces new complexities in:
  – Query processing – need streaming XPath / XQuery evaluation
  – Mapping languages – must support identity and nesting
  – Query reformulation
